
[MRG] Refactor the metric() method #152


Merged — 26 commits from feat/metric_func into scikit-learn-contrib:master on Jan 29, 2019

Conversation

@wdevazelhes (Member) commented Jan 9, 2019:

Fixes #147

TODO:

  • Add some tests
  • Add references to the right parts of documentation (like Mahalanobis Distances) in the docstrings (if possible)
  • Emphasize a bit more the difference and links between this and score_pairs in the docstring
  • Be careful that it should work on 1D arrays
  • Be careful that it should not return a float if given 2D arrays
  • Remove useless np.atleast_2d calls (those in transformer_from_metric and those just before returning the transformer_)

@wdevazelhes (Member Author):

Maybe this PR would be the occasion to change score_pairs into some more expressive name? What do you think? compute_distances, maybe?

n_features = X.shape[1]
a, b = (rng.randn(n_features), rng.randn(n_features))
euc_dist = euclidean(model.transform(a[None]), model.transform(b[None]))
assert (euc_dist - metric(a, b)) / euc_dist < 1e-15
@wdevazelhes (Member Author):

Here I put 1e-15 because it fails at 1e-16 (I guess transform plus euclidean distance gives a slightly different result from my implementation: transform the difference, then take the sqrt of the sum of squares). But I think it's still OK, right?
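For context, here is a minimal standalone sketch of the two computation paths being compared (L stands in for a learned transformer_; the tolerance is deliberately a bit looser than the 1e-15 above):

import numpy as np
from scipy.spatial.distance import euclidean

rng = np.random.RandomState(42)
L = rng.randn(5, 5)                # stand-in for a learned transformer_
a, b = rng.randn(5), rng.randn(5)

# Path 1: transform both points, then take the Euclidean distance.
d1 = euclidean(L.dot(a), L.dot(b))

# Path 2: transform the difference, then take sqrt of the sum of squares.
diff = L.dot(a - b)
d2 = np.sqrt(diff.dot(diff))

# The two orderings of the same arithmetic differ by a few ulps, hence a
# small relative tolerance rather than exact equality.
np.testing.assert_allclose(d1, d2, rtol=1e-12)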

Contributor:

Yeah, this is fine.

Member:

Or you could use assert_almost_equal with a custom tolerance. 1e-15 may be a bit brittle?

@wdevazelhes (Member Author):

Yes, I agree it's better to use a built-in function. I just saw that assert_almost_equal checks an absolute error whereas numpy.testing.assert_allclose checks a relative error; I guess a relative one is even better in this case? As for the 1e-15: I agree, it's just that I was thinking that if we change our implementation and the precision worsens for one reason or another, we could notice it with this test (and it could help solve bugs elsewhere, for instance).
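To illustrate the absolute/relative distinction (a standalone sketch, not part of the PR):

import numpy as np

# assert_almost_equal checks an *absolute* error: these values differ by a
# factor of 2, yet the check passes because |1e-20 - 2e-20| is tiny.
np.testing.assert_almost_equal(1e-20, 2e-20)

# assert_allclose checks a *relative* error by default, so the same pair
# fails: the relative error is 50%, far above the default rtol=1e-7.
try:
    np.testing.assert_allclose(1e-20, 2e-20)
except AssertionError:
    print("the relative check correctly flags the 2x discrepancy")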

Member:

Yes, relative error is probably better.

The distance between u and v according to the new metric.
"""
u = _validate_vector(u)
v = _validate_vector(v)
@wdevazelhes (Member Author):

Here I use scipy's _validate_vector function (used in functions like scipy.spatial.distance.euclidean) to more easily mimic the behaviour of scipy's distances (regarding 1D arrays, etc.). But I see the underscore at the beginning, which would mean that this function is private. Does that mean I am not supposed to use it?

Contributor:

Yeah, it means it's subject to change and we shouldn't depend on it.

@wdevazelhes (Member Author) commented Jan 15, 2019:

Alright, I'll replace it with something else.

William de Vazelhes added 2 commits on January 10, 2019 (including a merge with conflicts in metric_learn/base_metric.py)
@wdevazelhes changed the title from [WIP] Refactor the metric() method to [MRG] Refactor the metric() method on Jan 10, 2019
@wdevazelhes (Member Author):

I guess if you agree with my comments, this PR is ready to merge.

@terrytangyuan (Member) left a review comment:

It would be better to avoid using a private function, since it may get removed without any prior notice. Perhaps look into _validate_vector's implementation to see if you can implement it using scipy's exposed APIs or utility functions.
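A replacement built only on public numpy API might look like this (a sketch that mirrors what scipy's private helper does; the name validate_vector is ours):

import numpy as np

def validate_vector(u, dtype=None):
    # Coerce to a contiguous array, drop size-1 dimensions, and make sure
    # the result is 1D (0-d inputs become shape (1,)).
    u = np.asarray(u, dtype=dtype, order='c').squeeze()
    u = np.atleast_1d(u)
    if u.ndim > 1:
        raise ValueError("Input vector should be 1-D.")
    return u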

assert metric(a, b) >= 0 # positivity
assert metric(a, b) == metric(b, a) # symmetry
# one side of identity of indiscernibles: x == y => d(x, y) == 0. The other
# side is not always true for Mahalanobis distances.
Contributor:

I'm not exactly sure what this comment means. Can you elaborate?

Member:

Mahalanobis distances are only a "pseudo" metric because they do not satisfy the "identity of indiscernibles": metric(x, y) can be 0 even if x != y.

@wdevazelhes (Member Author):

Yes: x == y => d(x, y) == 0, but d(x, y) == 0 does not imply x == y.
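A concrete standalone illustration (not from the PR): with a rank-deficient Mahalanobis matrix, two distinct points can be at distance 0.

import numpy as np

L = np.array([[1., 0.]])   # projects onto the first coordinate only
M = L.T.dot(L)             # rank-1 Mahalanobis matrix

def d(x, y):
    diff = x - y
    return np.sqrt(diff.dot(M).dot(diff))

x = np.array([3., 0.])
y = np.array([3., 5.])
print(d(x, y))  # 0.0 even though x != y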

@@ -177,8 +214,57 @@ def transform(self, X):
accept_sparse=True)
return X_checked.dot(self.transformer_.T)

def metric(self):
Contributor:

Maybe we should keep this for now but mark it as deprecated, and point to the new get_mahalanobis_matrix() method.

@wdevazelhes (Member Author):

Yes, I agree, I forgot about that
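A deprecation shim along the suggested lines might look like this (a sketch; the exact message and mechanism in the merged code may differ):

import warnings

def metric(self):
    # Deprecated alias kept for backward compatibility.
    warnings.warn("`metric` is deprecated and will be removed in a future "
                  "release; use `get_mahalanobis_matrix` instead.",
                  DeprecationWarning, stacklevel=2)
    return self.get_mahalanobis_matrix()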

@bellet (Member) commented Jan 11, 2019:

> Maybe this PR would be the occasion to change score_pairs into some more expressive name? What do you think? compute_distances, maybe?

I think #131 is probably a better place for that discussion.

@bellet (Member) left a review comment:

A few nitpicks, otherwise LGTM

@@ -85,6 +94,24 @@ def _prepare_inputs(self, X, y=None, type_of_inputs='classic',
tuple_size=getattr(self, '_tuple_size', None),
**kwargs)

@abstractmethod
def get_metric(self):
"""Returns a function that returns the learned metric between two points.
@bellet (Member):

Returns a function that takes as input two 1D arrays and outputs the learned metric score on these two points?

@wdevazelhes (Member Author):

Agreed, it's clearer indeed.

def get_metric(self):
"""Returns a function that returns the learned metric between two points.
This function will be independent from the metric learner that learned it
(it will not be modified if the initial metric learner is modified).
@bellet (Member):

maybe add that the returned function can be directly plugged into the metric argument of sklearn's estimators

@wdevazelhes (Member Author):

Agreed, will do
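For example (a usage sketch; any sklearn estimator accepting a callable metric works the same way):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from metric_learn import LMNN

X, y = load_iris(return_X_y=True)
lmnn = LMNN().fit(X, y)

# The function returned by get_metric() is self-contained, so it can be
# plugged directly into the `metric` argument.
knn = KNeighborsClassifier(metric=lmnn.get_metric())
knn.fit(X, y)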


See Also
--------
score_pairs : a method that returns the metric between several pairs of
@bellet (Member):

the metric score

@wdevazelhes (Member Author):

Agreed, will do

See Also
--------
get_metric : a method that returns a function to compute the metric between
two points. The difference is that it works on two 1D arrays and cannot
@bellet (Member):

The difference with score_pairs

@wdevazelhes (Member Author):

Agreed, will do

See Also
--------
score_pairs : a method that returns the metric between several pairs of
points. But this is a method of the metric learner and therefore can
@bellet (Member):

-But +Unlike get_metric

@wdevazelhes (Member Author):

Agreed, will do


See Also
--------
score_pairs : a method that returns the metric between several pairs of
@bellet (Member):

again

@wdevazelhes (Member Author):

yes, will do

Parameters
----------
u : array-like, shape=(n_features,)
The first point involved in the distances computation.
@bellet (Member):

distance

@wdevazelhes (Member Author):

Thanks, will do

u : array-like, shape=(n_features,)
The first point involved in the distances computation.
v : array-like, shape=(n_features,)
The second point involved in the distances computation.
@bellet (Member):

distance

@wdevazelhes (Member Author):

will do


metric = model.get_metric()

n_features = X.shape[1]
a, b, c = (rng.randn(n_features) for _ in range(3))
@bellet (Member):

perhaps it would be more convincing to test that these are true on a set of random triplets (a, b, c), instead of a single one

@wdevazelhes (Member Author):

Yes I agree, will do
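The amended check might look roughly like this (a sketch reusing the names from the test above; the loop count and exact set of axioms are assumptions):

n_features = X.shape[1]
for _ in range(100):
    a, b, c = (rng.randn(n_features) for _ in range(3))
    assert metric(a, b) >= 0                            # positivity
    assert metric(a, b) == metric(b, a)                 # symmetry
    assert metric(a, a) == 0                            # x == y => d(x, y) == 0
    assert metric(a, c) <= metric(a, b) + metric(b, c)  # triangle inequality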

@wdevazelhes (Member Author):

> It would be better to avoid using a private function, since it may get removed without any prior notice. Perhaps look into _validate_vector's implementation to see if you can implement it using scipy's exposed APIs or utility functions.

Yes, indeed, I'll look at this

@wdevazelhes (Member Author):

I addressed all comments. I just need to remove the metric_plotting that I included by mistake and we should be good to merge.

@wdevazelhes (Member Author):

It should be good to merge now, as soon as tests pass

@wdevazelhes (Member Author):

> What do you mean use it in the tests? We should definitely test our actual method
I agree, I just meant we could replace this test:

@pytest.mark.parametrize('estimator, build_dataset', metric_learners,
                         ids=ids_metric_learners)
def test_get_metric_equivalent_to_transform_and_euclidean(estimator,
                                                          build_dataset):
  """Tests that the get_metric method of mahalanobis metric learners is the
  euclidean distance in the transformed space
  """
  rng = np.random.RandomState(42)
  input_data, labels, _, X = build_dataset()
  model = clone(estimator)
  set_random_state(model)
  model.fit(input_data, labels)
  metric = model.get_metric()
  n_features = X.shape[1]
  a, b = (rng.randn(n_features), rng.randn(n_features))
  euc_dist = euclidean(model.transform(a[None]), model.transform(b[None]))
  assert_allclose(metric(a, b), euc_dist, rtol=1e-15)

By something like this:

@pytest.mark.parametrize('estimator, build_dataset', metric_learners,
                         ids=ids_metric_learners)
def test_get_metric_equivalent_to_explicit_mahalanobis(estimator,
                                                       build_dataset):
  """Tests that using the get_metric method of mahalanobis metric learners is
  equivalent to explicitly calling scipy's mahalanobis metric
  """
  rng = np.random.RandomState(42)
  input_data, labels, _, X = build_dataset()
  model = clone(estimator)
  set_random_state(model)
  model.fit(input_data, labels)
  metric = model.get_metric()
  n_features = X.shape[1]
  a, b = (rng.randn(n_features), rng.randn(n_features))
  expected_dist = mahalanobis(a, b, VI=model.get_mahalanobis_matrix())
  assert_allclose(metric(a, b), expected_dist, rtol=1e-15)

The test would even show more explicitly what the get_metric function does. But it's not very important.

@wdevazelhes (Member Author):

> I agree that having an option for squaring is useful - in many cases one does not care about the square root and it makes distance computations faster. I think you should add it now
I agree, let's do it while we're at it :)

@wdevazelhes (Member Author):

I just added the squared option, and addressed all the comments, so I think as soon as it is green we should be good to go
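The squared option can be sketched like this (a pattern sketch, not necessarily the exact merged code):

import numpy as np

def get_metric(self):
    transformer = self.transformer_.copy()  # snapshot: independent of later refits

    def metric_fun(u, v, squared=False):
        # Skipping the final sqrt is cheaper, and is enough whenever only
        # relative comparisons between distances are needed.
        u, v = np.atleast_1d(u), np.atleast_1d(v)
        transformed_diff = (u - v).dot(transformer.T)
        dist = np.dot(transformed_diff, transformed_diff)
        return dist if squared else np.sqrt(dist)

    return metric_fun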

@wdevazelhes (Member Author):

I just added a commit that should fix the previous error, which does the following:

  • ensure that the transformer_ fitted is always 2D:
    • in the result returned from transformer_from_metric
    • in the code of metric learners, for metric learners that don't call transformer_from_metric
  • for metric learners that cannot work on 1 feature, ensure it when checking the input
  • add a test to check this behaviour

@bellet (Member) commented Jan 23, 2019:

Where was a non-2D transformer generated?

@wdevazelhes (Member Author):

Now that I look back at it, I realize it was only this line that fixed the test test_get_metric_works_does_not_raise (failing in commit 5e29295):

-      dist = transformed_diff.dot(transformed_diff.T)
+      dist = np.dot(transformed_diff, transformed_diff.T)

(The problem was that transformed_diff can be a plain float (if the transformer_ is [[scalar]]), and a float does not have a dot method, whereas np.dot(float1, float2) works.)
At first sight I thought transformed_diff should never be a float, and that ensuring transformer_ was 2D (I thought it could happen to be a scalar) would solve the problem, so I forced it to be 2D and tested; but I was wrong: that didn't solve the problem, only the np.dot(a, b) change did. And the 2D-transformer forcing was in fact useless, because I checked and the returned transformer_s were already all 2D (they pass test_transformer_is_2D even without the np.atleast_2d calls).

So I guess these np.atleast_2d calls are useless and I can remove them, but maybe we can keep the test, which is always useful? It also made me realize that if algorithms don't work on 1 feature, we should raise an appropriate error message.
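A standalone illustration of the failure mode:

import numpy as np

# transformed_diff collapses to a plain Python float when transformer_ is
# effectively a 1x1 matrix.
transformed_diff = 0.5

print(np.dot(transformed_diff, transformed_diff))  # works: 0.25

try:
    transformed_diff.dot(transformed_diff)
except AttributeError as e:
    print(e)  # 'float' object has no attribute 'dot'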

@wdevazelhes changed the title from [WIP] Refactor the metric() method to [MRG] Refactor the metric() method on Jan 24, 2019
@bellet (Member) commented Jan 25, 2019:

I am not sure I follow. I think a transformer should always be 2D. Even assuming the data is 1D, I would want the shape to be (1, 1) for consistency

@bellet (Member) commented Jan 25, 2019:

I also do not see why algorithms wouldn't work on data with a single feature? (it is useless, sure, but should probably work)

@wdevazelhes (Member Author) commented Jan 28, 2019:

> I am not sure I follow. I think a transformer should always be 2D. Even assuming the data is 1D, I would want the shape to be (1, 1) for consistency

Yes, that's the case: even without forcing it with np.atleast_2d as I did in my previous commit, they do pass the test checking that, with scalar inputs, the transformer_ should be of shape (1, 1) (I checked by going back to an old commit and copy-pasting this test), except those that fail because they cannot work on scalar arrays (see my next comment).

@wdevazelhes (Member Author):

> I also do not see why algorithms wouldn't work on data with a single feature? (it is useless, sure, but should probably work)

I agree. SDML and RCA returned an error because they called pinvh and _inv_sqrtm, respectively, on a non-2D array. I'll fix them by calling np.atleast_2d on the arguments of these functions; that's better than forbidding that the input be scalar...

@bellet (Member) commented Jan 28, 2019:

> I agree. SDML and RCA returned an error because they called pinvh and _inv_sqrtm, respectively, on a non-2D array. I'll fix them by calling np.atleast_2d on the arguments of these functions; that's better than forbidding that the input be scalar...

Why was the input scalar then? I think we have many tests to make sure the input data has the right shape. So even with only 1 feature, we should always be working with 2D arrays. Am I missing something?

@wdevazelhes (Member Author):

> Why was the input scalar then? I think we have many tests to make sure the input data has the right shape. So even with only 1 feature, we should always be working with 2D arrays. Am I missing something?

The input data has the right shape, but some intermediary terms can have various shapes because of numpy's operations.

For SDML, for instance, we call pinvh(np.cov(X, rowvar=False)). If X = np.array([[1], [2], [3]]), then np.cov(X, rowvar=False) equals array(1.0), which has shape (), so we should instead do np.atleast_2d(np.cov(X, rowvar=False)) (which equals array([[ 1.]])).
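A standalone illustration:

import numpy as np
from scipy.linalg import pinvh

X = np.array([[1.], [2.], [3.]])   # a single feature

cov = np.cov(X, rowvar=False)
print(cov.shape)                   # () -- a 0-d array, which pinvh rejects

cov_2d = np.atleast_2d(cov)
print(cov_2d.shape)                # (1, 1)
print(pinvh(cov_2d))               # [[1.]]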

@bellet (Member) commented Jan 28, 2019:

OK looks good

@wdevazelhes changed the title from [MRG] Refactor the metric() method to [WIP] Refactor the metric() method on Jan 28, 2019
@wdevazelhes changed the title from [WIP] Refactor the metric() method to [MRG] Refactor the metric() method on Jan 29, 2019
@wdevazelhes (Member Author):

I just removed the useless np.atleast_2d calls (see comments #152 (comment) and #152 (comment)), so I think we should be good to go.

@bellet (Member) commented Jan 29, 2019:

No doc to update at this point? (I guess not, since the part on common methods is currently commented out.)

@wdevazelhes (Member Author):

Ah yes, I forgot about the docs. I updated the commented-out part so as not to forget to add this when I update the docs in other PRs.

@bellet merged commit d3620bb into scikit-learn-contrib:master on Jan 29, 2019
@wdevazelhes deleted the feat/metric_func branch on January 29, 2019 14:44