
[MRG] Refactor the metric() method #152


Merged — 26 commits from feat/metric_func into scikit-learn-contrib:master on Jan 29, 2019

Conversation

@wdevazelhes (Member) commented Jan 9, 2019:

Fixes #147

TODO:

  • Add some tests
  • Add references to the right parts of documentation (like Mahalanobis Distances) in the docstrings (if possible)
  • Emphasize a bit more the difference and links between this and score_pairs in the docstring
  • Be careful that it should work on 1D arrays
  • Be careful that it should not return a float if given 2D arrays
  • Remove useless np.atleast_2d calls (those in transformer_from_metric and those just before returning the transformer_)

@wdevazelhes (Member Author):

Maybe this PR would be the occasion to change score_pairs into some more expressive name? What do you think? compute_distances, maybe?

n_features = X.shape[1]
a, b = (rng.randn(n_features), rng.randn(n_features))
euc_dist = euclidean(model.transform(a[None]), model.transform(b[None]))
assert (euc_dist - metric(a, b)) / euc_dist < 1e-15
@wdevazelhes (Member Author):

Here I put 1e-15 because it fails at 1e-16 (I guess transform plus euclidean distance gives a slightly different result from my implementation: transform the difference, then take the sqrt of the sum of squares). But I think it's still OK, right?
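For context, here is a minimal standalone sketch of the two computation paths being compared (L stands in for a learned transformer_; the tolerance is deliberately a bit looser than the 1e-15 above):

import numpy as np
from scipy.spatial.distance import euclidean

rng = np.random.RandomState(42)
L = rng.randn(5, 5)                # stand-in for a learned transformer_
a, b = rng.randn(5), rng.randn(5)

# Path 1: transform both points, then take the Euclidean distance.
d1 = euclidean(L.dot(a), L.dot(b))

# Path 2: transform the difference, then take sqrt of the sum of squares.
diff = L.dot(a - b)
d2 = np.sqrt(diff.dot(diff))

# The two orderings of the same arithmetic differ by a few ulps, hence a
# small relative tolerance rather than exact equality.
np.testing.assert_allclose(d1, d2, rtol=1e-12)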

Contributor:

Yeah, this is fine.

Member:

Or you could use assert_almost_equal with a custom tolerance. 1e-15 may be a bit brittle?

@wdevazelhes (Member Author):

Yes, I agree it's better to use a built-in function. I just saw that assert_almost_equal checks an absolute error whereas numpy.testing.assert_allclose checks a relative error; I guess a relative one is even better in this case? As for the 1e-15: I agree, it's just that I was thinking that if we change our implementation and the precision worsens for one reason or another, we could notice it with this test (and it could help solve bugs elsewhere, for instance).
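To illustrate the absolute/relative distinction (a standalone sketch, not part of the PR):

import numpy as np

# assert_almost_equal checks an *absolute* error: these values differ by a
# factor of 2, yet the check passes because |1e-20 - 2e-20| is tiny.
np.testing.assert_almost_equal(1e-20, 2e-20)

# assert_allclose checks a *relative* error by default, so the same pair
# fails: the relative error is 50%, far above the default rtol=1e-7.
try:
    np.testing.assert_allclose(1e-20, 2e-20)
except AssertionError:
    print("the relative check correctly flags the 2x discrepancy")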

Member:

Yes, relative error is probably better.

The distance between u and v according to the new metric.
"""
u = _validate_vector(u)
v = _validate_vector(v)
@wdevazelhes (Member Author):

Here I use scipy's _validate_vector function (used in functions like scipy.spatial.distance.euclidean) to more easily mimic the behaviour of scipy's distances (regarding 1D arrays, etc.). But I see the underscore at the beginning, which would mean that this function is private. Does that mean I am not supposed to use it?

Contributor:

Yeah, it means it's subject to change and we shouldn't depend on it.

@wdevazelhes (Member Author) commented Jan 15, 2019:

Alright, I'll replace it with something else.

William de Vazelhes added 2 commits on January 10, 2019 (including a merge with conflicts in metric_learn/base_metric.py)
@wdevazelhes changed the title from [WIP] Refactor the metric() method to [MRG] Refactor the metric() method on Jan 10, 2019
@wdevazelhes (Member Author):

I guess if you agree with my comments, this PR is ready to merge.

@terrytangyuan (Member) left a review comment:

It would be better to avoid using a private function, since it may get removed without any prior notice. Perhaps look into _validate_vector's implementation to see if you can implement it using scipy's exposed APIs or utility functions.
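A replacement built only on public numpy API might look like this (a sketch that mirrors what scipy's private helper does; the name validate_vector is ours):

import numpy as np

def validate_vector(u, dtype=None):
    # Coerce to a contiguous array, drop size-1 dimensions, and make sure
    # the result is 1D (0-d inputs become shape (1,)).
    u = np.asarray(u, dtype=dtype, order='c').squeeze()
    u = np.atleast_1d(u)
    if u.ndim > 1:
        raise ValueError("Input vector should be 1-D.")
    return u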

assert metric(a, b) >= 0 # positivity
assert metric(a, b) == metric(b, a) # symmetry
# one side of identity of indiscernibles: x == y => d(x, y) == 0. The other
# side is not always true for Mahalanobis distances.
Contributor:

I'm not exactly sure what this comment means. Can you elaborate?

Member:

Mahalanobis distances are only a "pseudo" metric because they do not satisfy the "identity of indiscernibles": metric(x, y) can be 0 even if x != y.

@wdevazelhes (Member Author):

Yes: x == y => d(x, y) == 0, but d(x, y) == 0 does not imply x == y.
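A concrete standalone illustration (not from the PR): with a rank-deficient Mahalanobis matrix, two distinct points can be at distance 0.

import numpy as np

L = np.array([[1., 0.]])   # projects onto the first coordinate only
M = L.T.dot(L)             # rank-1 Mahalanobis matrix

def d(x, y):
    diff = x - y
    return np.sqrt(diff.dot(M).dot(diff))

x = np.array([3., 0.])
y = np.array([3., 5.])
print(d(x, y))  # 0.0 even though x != y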

@@ -177,8 +214,57 @@ def transform(self, X):
accept_sparse=True)
return X_checked.dot(self.transformer_.T)

def metric(self):
Contributor:

Maybe we should keep this for now but mark it as deprecated, and point to the new get_mahalanobis_matrix() method.

@wdevazelhes (Member Author):

Yes, I agree, I forgot about that
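A deprecation shim along the suggested lines might look like this (a sketch; the exact message and mechanism in the merged code may differ):

import warnings

def metric(self):
    # Deprecated alias kept for backward compatibility.
    warnings.warn("`metric` is deprecated and will be removed in a future "
                  "release; use `get_mahalanobis_matrix` instead.",
                  DeprecationWarning, stacklevel=2)
    return self.get_mahalanobis_matrix()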

@bellet (Member) commented Jan 11, 2019:

> Maybe this PR would be the occasion to change score_pairs into some more expressive name? What do you think? compute_distances, maybe?

I think #131 is probably a better place for that discussion.

@bellet (Member) left a review comment:

A few nitpicks, otherwise LGTM

@@ -85,6 +94,24 @@ def _prepare_inputs(self, X, y=None, type_of_inputs='classic',
tuple_size=getattr(self, '_tuple_size', None),
**kwargs)

@abstractmethod
def get_metric(self):
"""Returns a function that returns the learned metric between two points.
@bellet (Member):

Returns a function that takes as input two 1D arrays and outputs the learned metric score on these two points?

@wdevazelhes (Member Author):

Agreed, it's clearer indeed.

def get_metric(self):
"""Returns a function that returns the learned metric between two points.
This function will be independent from the metric learner that learned it
(it will not be modified if the initial metric learner is modified).
@bellet (Member):

maybe add that the returned function can be directly plugged into the metric argument of sklearn's estimators

@wdevazelhes (Member Author):

Agreed, will do
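For example (a usage sketch; any sklearn estimator accepting a callable metric works the same way):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from metric_learn import LMNN

X, y = load_iris(return_X_y=True)
lmnn = LMNN().fit(X, y)

# The function returned by get_metric() is self-contained, so it can be
# plugged directly into the `metric` argument.
knn = KNeighborsClassifier(metric=lmnn.get_metric())
knn.fit(X, y)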


See Also
--------
score_pairs : a method that returns the metric between several pairs of
@bellet (Member):

the metric score

@wdevazelhes (Member Author):

Agreed, will do

See Also
--------
get_metric : a method that returns a function to compute the metric between
two points. The difference is that it works on two 1D arrays and cannot
@bellet (Member):

The difference with score_pairs

@wdevazelhes (Member Author):

Agreed, will do

See Also
--------
score_pairs : a method that returns the metric between several pairs of
points. But this is a method of the metric learner and therefore can
@bellet (Member):

-But +Unlike get_metric

@wdevazelhes (Member Author):

Agreed, will do


See Also
--------
score_pairs : a method that returns the metric between several pairs of
@bellet (Member):

again

@wdevazelhes (Member Author):

yes, will do

Parameters
----------
u : array-like, shape=(n_features,)
The first point involved in the distances computation.
@bellet (Member):

distance

@wdevazelhes (Member Author):

Thanks, will do

u : array-like, shape=(n_features,)
The first point involved in the distances computation.
v : array-like, shape=(n_features,)
The second point involved in the distances computation.
@bellet (Member):

distance

@wdevazelhes (Member Author):

will do


metric = model.get_metric()

n_features = X.shape[1]
a, b, c = (rng.randn(n_features) for _ in range(3))
@bellet (Member):

perhaps it would be more convincing to test that these are true on a set of random triplets (a, b, c), instead of a single one

@wdevazelhes (Member Author):

Yes I agree, will do
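The amended check might look roughly like this (a sketch reusing the names from the test above; the loop count and exact set of axioms are assumptions):

n_features = X.shape[1]
for _ in range(100):
    a, b, c = (rng.randn(n_features) for _ in range(3))
    assert metric(a, b) >= 0                            # positivity
    assert metric(a, b) == metric(b, a)                 # symmetry
    assert metric(a, a) == 0                            # x == y => d(x, y) == 0
    assert metric(a, c) <= metric(a, b) + metric(b, c)  # triangle inequality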

@wdevazelhes (Member Author):

> It would be better to avoid using a private function, since it may get removed without any prior notice. Perhaps look into _validate_vector's implementation to see if you can implement it using scipy's exposed APIs or utility functions.

Yes, indeed, I'll look at this

@wdevazelhes (Member Author):

I addressed all comments. I just need to remove the metric_plotting that I included by mistake and we should be good to merge.

@wdevazelhes (Member Author):

It should be good to merge now, as soon as tests pass

@wdevazelhes (Member Author):

> What do you mean use it in the tests? We should definitely test our actual method
I agree, I just meant we could replace this test:

@pytest.mark.parametrize('estimator, build_dataset', metric_learners,
                         ids=ids_metric_learners)
def test_get_metric_equivalent_to_transform_and_euclidean(estimator,
                                                          build_dataset):
  """Tests that the get_metric method of mahalanobis metric learners is the
  euclidean distance in the transformed space
  """
  rng = np.random.RandomState(42)
  input_data, labels, _, X = build_dataset()
  model = clone(estimator)
  set_random_state(model)
  model.fit(input_data, labels)
  metric = model.get_metric()
  n_features = X.shape[1]
  a, b = (rng.randn(n_features), rng.randn(n_features))
  euc_dist = euclidean(model.transform(a[None]), model.transform(b[None]))
  assert_allclose(metric(a, b), euc_dist, rtol=1e-15)

By something like this:

@pytest.mark.parametrize('estimator, build_dataset', metric_learners,
                         ids=ids_metric_learners)
def test_get_metric_equivalent_to_explicit_mahalanobis(estimator,
                                                       build_dataset):
  """Tests that using the get_metric method of mahalanobis metric learners is
  equivalent to explicitly calling scipy's mahalanobis metric
  """
  rng = np.random.RandomState(42)
  input_data, labels, _, X = build_dataset()
  model = clone(estimator)
  set_random_state(model)
  model.fit(input_data, labels)
  metric = model.get_metric()
  n_features = X.shape[1]
  a, b = (rng.randn(n_features), rng.randn(n_features))
  expected_dist = mahalanobis(a, b, VI=model.get_mahalanobis_matrix())
  assert_allclose(metric(a, b), expected_dist, rtol=1e-15)

The test would even show more explicitly what the get_metric function does. But it's not very important.

@wdevazelhes (Member Author):

> I agree that having an option for squaring is useful - in many cases one does not care about the square root and it makes distance computations faster. I think you should add it now
I agree, let's do it while we're at it :)

@wdevazelhes (Member Author):

I just added the squared option, and addressed all the comments, so I think as soon as it is green we should be good to go
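The squared option can be sketched like this (a pattern sketch, not necessarily the exact merged code):

import numpy as np

def get_metric(self):
    transformer = self.transformer_.copy()  # snapshot: independent of later refits

    def metric_fun(u, v, squared=False):
        # Skipping the final sqrt is cheaper, and is enough whenever only
        # relative comparisons between distances are needed.
        u, v = np.atleast_1d(u), np.atleast_1d(v)
        transformed_diff = (u - v).dot(transformer.T)
        dist = np.dot(transformed_diff, transformed_diff)
        return dist if squared else np.sqrt(dist)

    return metric_fun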

@wdevazelhes (Member Author):

I just added a commit that should fix the previous error, which does the following:

  • ensure that the transformer_ fitted is always 2D:
    • in the result returned from transformer_from_metric
    • in the code of metric learners, for metric learners that don't call transformer_from_metric
  • for metric learners that cannot work on 1 feature, ensure it when checking the input
  • add a test to check this behaviour

@bellet (Member) commented Jan 23, 2019:

Where was a non-2D transformer generated?

@wdevazelhes (Member Author):

Now that I look back at it, I realize it was only this line that fixed the test test_get_metric_works_does_not_raise (failing in commit 5e29295):

-      dist = transformed_diff.dot(transformed_diff.T)
+      dist = np.dot(transformed_diff, transformed_diff.T)

(The problem was that transformed_diff can be a plain float (if the transformer_ is [[scalar]]), and a float does not have a dot method, whereas np.dot(float1, float2) works.)
At first sight I thought transformed_diff should never be a float, and that ensuring transformer_ was 2D (I thought it could happen to be a scalar) would solve the problem, so I forced it to be 2D and tested; but I was wrong: that didn't solve the problem, only the np.dot(a, b) change did. And the 2D-transformer forcing was in fact useless, because I checked and the returned transformer_s were already all 2D (they pass test_transformer_is_2D even without the np.atleast_2d calls).

So I guess these np.atleast_2d calls are useless and I can remove them, but maybe we can keep the test, which is always useful? It also made me realize that if algorithms don't work on 1 feature, we should raise an appropriate error message.
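A standalone illustration of the failure mode:

import numpy as np

# transformed_diff collapses to a plain Python float when transformer_ is
# effectively a 1x1 matrix.
transformed_diff = 0.5

print(np.dot(transformed_diff, transformed_diff))  # works: 0.25

try:
    transformed_diff.dot(transformed_diff)
except AttributeError as e:
    print(e)  # 'float' object has no attribute 'dot'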

@wdevazelhes changed the title from [WIP] Refactor the metric() method to [MRG] Refactor the metric() method on Jan 24, 2019
@bellet (Member) commented Jan 25, 2019:

I am not sure I follow. I think a transformer should always be 2D. Even assuming the data is 1D, I would want the shape to be (1, 1) for consistency

@bellet (Member) commented Jan 25, 2019:

I also do not see why algorithms wouldn't work on data with a single feature? (it is useless, sure, but should probably work)

@wdevazelhes (Member Author) commented Jan 28, 2019:

> I am not sure I follow. I think a transformer should always be 2D. Even assuming the data is 1D, I would want the shape to be (1, 1) for consistency

Yes, that's the case: even without forcing it with np.atleast_2d as I did in my previous commit, they do pass the test checking that, with scalar inputs, the transformer_ should be of shape (1, 1) (I checked by going back to an old commit and copy-pasting this test), except those that fail because they cannot work on scalar arrays (see my next comment).

@wdevazelhes (Member Author):

> I also do not see why algorithms wouldn't work on data with a single feature? (it is useless, sure, but should probably work)

I agree. SDML and RCA returned an error because they called pinvh and _inv_sqrtm, respectively, on a non-2D array. I'll fix them by calling np.atleast_2d on the arguments of these functions; that's better than forbidding that the input be scalar...

@bellet (Member) commented Jan 28, 2019:

> I agree. SDML and RCA returned an error because they called pinvh and _inv_sqrtm, respectively, on a non-2D array. I'll fix them by calling np.atleast_2d on the arguments of these functions; that's better than forbidding that the input be scalar...

Why was the input scalar then? I think we have many tests to make sure the input data has the right shape. So even with only 1 feature, we should always be working with 2D arrays. Am I missing something?

@wdevazelhes (Member Author):

> Why was the input scalar then? I think we have many tests to make sure the input data has the right shape. So even with only 1 feature, we should always be working with 2D arrays. Am I missing something?

The input data has the right shape, but some intermediary terms can have various shapes because of numpy's operations.

For SDML, for instance, we call pinvh(np.cov(X, rowvar=False)). If X = np.array([[1], [2], [3]]), then np.cov(X, rowvar=False) equals array(1.0), which has shape (), so we should instead do np.atleast_2d(np.cov(X, rowvar=False)) (which equals array([[ 1.]])).
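A standalone illustration:

import numpy as np
from scipy.linalg import pinvh

X = np.array([[1.], [2.], [3.]])   # a single feature

cov = np.cov(X, rowvar=False)
print(cov.shape)                   # () -- a 0-d array, which pinvh rejects

cov_2d = np.atleast_2d(cov)
print(cov_2d.shape)                # (1, 1)
print(pinvh(cov_2d))               # [[1.]]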

@bellet (Member) commented Jan 28, 2019:

OK looks good

@wdevazelhes changed the title from [MRG] Refactor the metric() method to [WIP] Refactor the metric() method on Jan 28, 2019
@wdevazelhes changed the title from [WIP] Refactor the metric() method to [MRG] Refactor the metric() method on Jan 29, 2019
@wdevazelhes (Member Author):

I just removed the useless np.atleast_2d calls (see comments #152 (comment) and #152 (comment)), so I think we should be good to go.

@bellet (Member) commented Jan 29, 2019:

No doc to update at this point? (I guess not, since the part on common methods is currently commented out.)

@wdevazelhes (Member Author):

Ah yes, I forgot about the docs. I updated the commented-out part so as not to forget to add this when I update the docs in other PRs.

@bellet merged commit d3620bb into scikit-learn-contrib:master on Jan 29, 2019
@wdevazelhes deleted the feat/metric_func branch on January 29, 2019 14:44