[MRG] Refactor the metric() method #152
Conversation
Maybe this PR would be the occasion for changing [...]
test/test_mahalanobis_mixin.py (Outdated)

    n_features = X.shape[1]
    a, b = (rng.randn(n_features), rng.randn(n_features))
    euc_dist = euclidean(model.transform(a[None]), model.transform(b[None]))
    assert (euc_dist - metric(a, b)) / euc_dist < 1e-15
Here I put 1e-15 because it fails for 1e-16 (I guess the transform plus euclidean distance gives a slightly different result from my implementation (transform of the difference plus sqrt of the sum of squares)). But I think it's still OK, right?
Yeah, this is fine.
Or you could use assert_almost_equal with a custom tolerance. 1e-15 may be a bit brittle?
Yes, I agree it's better to use a built-in function. I just saw that assert_almost_equal uses an absolute error whereas numpy.testing.assert_allclose uses a relative error; I guess a relative one is even better in this case? As for the 1e-15: I agree, it's just that I was thinking that if we change our implementation and the precision worsens for one reason or another, we could notice it with this test (and it could help solve bugs elsewhere, for instance).
Yes, a relative error is probably better.
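To illustrate the distinction discussed above, here is a minimal sketch (the numeric values are made up for illustration): assert_allclose has a default atol of 0, so the comparison below is purely relative, while assert_almost_equal bounds the absolute difference to a number of decimals.

```python
import numpy as np
from numpy.testing import assert_allclose

# assert_allclose checks |actual - desired| <= atol + rtol * |desired|;
# the default atol is 0, so with only rtol the check is purely relative.
euc_dist = 1.2345678901234567
metric_val = np.nextafter(euc_dist, 2.0)   # one ulp away from euc_dist
assert_allclose(metric_val, euc_dist, rtol=1e-15)

# assert_almost_equal, by contrast, rounds the *absolute* difference to a
# number of decimals, so its strictness depends on the scale of the values.
np.testing.assert_almost_equal(metric_val, euc_dist, decimal=10)
```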
metric_learn/base_metric.py (Outdated)

    The distance between u and v according to the new metric.
    """
    u = _validate_vector(u)
    v = _validate_vector(v)
Here I use scipy's _validate_vector function (used in functions like scipy.spatial.distance.euclidean) to more easily mimic the behaviour of scipy's distances (regarding 1D arrays, etc.). But I see the leading underscore, which would mean this function is private. Does that mean I'm not supposed to use it?
Yeah, it means it's subject to change and we shouldn't depend on it.
Alright, I'll replace it with something else.
# Conflicts:
#   metric_learn/base_metric.py
I guess if you agree with my comments, this PR is ready to merge.
It would be better to avoid using a private function, since it may get removed without any prior notice. Perhaps look into _validate_vector's implementation to see if you can reimplement it using scipy's exposed APIs or utility functions.
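As a sketch of what such a replacement could look like, the logic of scipy's private helper can be reproduced with public numpy APIs alone (the name validate_vector below is my own, not a library function):

```python
import numpy as np

def validate_vector(u, dtype=None):
    # Convert to an array, squeeze away singleton dimensions, and make sure
    # the result is 1-D -- roughly what scipy's private helper does.
    u = np.asarray(u, dtype=dtype).squeeze()
    u = np.atleast_1d(u)   # a scalar input becomes a length-1 vector
    if u.ndim > 1:
        raise ValueError("Input vector should be 1-D.")
    return u
```

For example, `validate_vector([[1., 2., 3.]])` returns a plain 1-D array of shape `(3,)`, while a genuinely 2-D input raises an error.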
test/test_mahalanobis_mixin.py (Outdated)

    assert metric(a, b) >= 0  # positivity
    assert metric(a, b) == metric(b, a)  # symmetry
    # one side of the identity of indiscernibles: x == y => d(x, y) == 0. The
    # other side is not always true for Mahalanobis distances.
I'm not exactly sure what this comment means. Can you elaborate?
Mahalanobis distances are only a "pseudo" metric because they do not satisfy the "identity of indiscernibles": metric(x, y) can be 0 even if x != y.
Yes: x == y => d(x, y) == 0, but d(x, y) == 0 !=> x == y.
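A toy illustration of this pseudo-metric point (L here is a made-up rank-deficient transformation, not one produced by the library): the projection drops the second coordinate, so two distinct points that agree on the first coordinate are at distance 0.

```python
import numpy as np

L = np.array([[1., 0.]])   # projects 2-D points onto their first coordinate

def dist(x, y):
    # Mahalanobis-style distance: euclidean norm of the transformed difference
    diff = L.dot(x - y)
    return np.sqrt(diff.dot(diff))

x = np.array([1., 2.])
y = np.array([1., 5.])     # x != y, but they agree on the first coordinate
print(dist(x, y))          # 0.0: d(x, y) == 0 does not imply x == y
```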
metric_learn/base_metric.py (Outdated)

    @@ -177,8 +214,57 @@ def transform(self, X):
        accept_sparse=True)
      return X_checked.dot(self.transformer_.T)

      def metric(self):
Maybe we should keep this for now but mark it as deprecated, and point to the new get_mahalanobis_matrix() method.
Yes, I agree, I forgot about that
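A minimal sketch of such a deprecation shim (assuming the class stores the learned linear transformation in transformer_; the exact wording and mechanism in the library may differ):

```python
import warnings
import numpy as np

class MahalanobisMixin(object):
    def get_mahalanobis_matrix(self):
        """Return M = L.T L, the learned Mahalanobis matrix."""
        return self.transformer_.T.dot(self.transformer_)

    def metric(self):
        # Deprecated alias kept for backward compatibility.
        warnings.warn("metric() is deprecated, use get_mahalanobis_matrix() "
                      "instead.", DeprecationWarning)
        return self.get_mahalanobis_matrix()
```

Calling metric() then still returns the matrix but emits a DeprecationWarning pointing users to the new name.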
I think #131 is probably a better place for that.
A few nitpicks, otherwise LGTM
metric_learn/base_metric.py (Outdated)

    @@ -85,6 +94,24 @@ def _prepare_inputs(self, X, y=None, type_of_inputs='classic',
        tuple_size=getattr(self, '_tuple_size', None),
        **kwargs)

      @abstractmethod
      def get_metric(self):
        """Returns a function that returns the learned metric between two points.
Returns a function that takes as input two 1D arrays and outputs the learned metric score on these two points?
Agreed, it's clearer indeed.
metric_learn/base_metric.py (Outdated)

    def get_metric(self):
      """Returns a function that returns the learned metric between two points.
      This function will be independent from the metric learner that learned it
      (it will not be modified if the initial metric learner is modified).
Maybe add that the returned function can be directly plugged into the metric argument of sklearn's estimators.
Agreed, will do
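For instance, any such callable on two 1-D arrays can be passed as the metric parameter of a scikit-learn estimator. Here learned_metric is a hand-built stand-in for what get_metric() would return (L is an assumed learned transformation, not fitted by the library):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

L = np.array([[1., 0.5],
              [0., 1.]])          # assumed learned linear transformation

def learned_metric(a, b):
    # euclidean distance in the transformed space
    diff = L.dot(np.asarray(a) - np.asarray(b))
    return np.sqrt(diff.dot(diff))

X = np.array([[0., 0.], [0., 1.], [3., 3.], [3., 4.]])
y = np.array([0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=1, metric=learned_metric)
knn.fit(X, y)
print(knn.predict([[0.2, 0.1], [2.9, 3.5]]))   # [0 1]
```

A callable metric makes scikit-learn fall back to a slower pairwise computation, but it works with any estimator accepting a metric argument.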
metric_learn/base_metric.py (Outdated)

    See Also
    --------
    score_pairs : a method that returns the metric between several pairs of
the metric score
Agreed, will do
metric_learn/base_metric.py (Outdated)

    See Also
    --------
    get_metric : a method that returns a function to compute the metric between
      two points. The difference is that it works on two 1D arrays and cannot
The difference with score_pairs
Agreed, will do
metric_learn/base_metric.py (Outdated)

    See Also
    --------
    score_pairs : a method that returns the metric between several pairs of
      points. But this is a method of the metric learner and therefore can
-But +Unlike get_metric
Agreed, will do
metric_learn/base_metric.py (Outdated)

    See Also
    --------
    score_pairs : a method that returns the metric between several pairs of
again
yes, will do
metric_learn/base_metric.py (Outdated)

    Parameters
    ----------
    u : array-like, shape=(n_features,)
      The first point involved in the distances computation.
distance
Thanks, will do
metric_learn/base_metric.py (Outdated)

    u : array-like, shape=(n_features,)
      The first point involved in the distances computation.
    v : array-like, shape=(n_features,)
      The second point involved in the distances computation.
distance
will do
test/test_mahalanobis_mixin.py (Outdated)

    metric = model.get_metric()

    n_features = X.shape[1]
    a, b, c = (rng.randn(n_features) for _ in range(3))
perhaps it would be more convincing to test that these are true on a set of random triplets (a, b, c), instead of a single one
Yes I agree, will do
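A sketch of what that could look like, with a made-up linear map standing in for a fitted learner:

```python
import numpy as np

rng = np.random.RandomState(42)
n_features = 5
L = rng.randn(3, n_features)          # stand-in for a learned transformation

def metric(a, b):
    diff = L.dot(a - b)
    return np.sqrt(diff.dot(diff))

# Check the pseudo-metric properties on many random triplets, not just one.
for _ in range(100):
    a, b, c = (rng.randn(n_features) for _ in range(3))
    assert metric(a, b) >= 0                                    # positivity
    np.testing.assert_allclose(metric(a, b), metric(b, a))      # symmetry
    assert metric(a, c) <= metric(a, b) + metric(b, c) + 1e-10  # triangle ineq.
    assert metric(a, a) == 0.                                   # d(x, x) == 0
```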
Yes, indeed, I'll look at this.
I addressed all the comments. I just need to remove the metric_plotting file that I included by mistake, and we should be good to merge.
It should be good to merge now, as soon as the tests pass.
I agree, I just meant we could replace this test:

    @pytest.mark.parametrize('estimator, build_dataset', metric_learners,
                             ids=ids_metric_learners)
    def test_get_metric_equivalent_to_transform_and_euclidean(estimator,
                                                              build_dataset):
      """Tests that the get_metric method of mahalanobis metric learners is the
      euclidean distance in the transformed space
      """
      rng = np.random.RandomState(42)
      input_data, labels, _, X = build_dataset()
      model = clone(estimator)
      set_random_state(model)
      model.fit(input_data, labels)
      metric = model.get_metric()

      n_features = X.shape[1]
      a, b = (rng.randn(n_features), rng.randn(n_features))
      euc_dist = euclidean(model.transform(a[None]), model.transform(b[None]))
      assert_allclose(metric(a, b), euc_dist, rtol=1e-15)

by something like this:

    @pytest.mark.parametrize('estimator, build_dataset', metric_learners,
                             ids=ids_metric_learners)
    def test_get_metric_equivalent_to_explicit_mahalanobis(estimator,
                                                           build_dataset):
      """Tests that using the get_metric method of mahalanobis metric learners
      is equivalent to explicitly calling scipy's mahalanobis metric
      """
      rng = np.random.RandomState(42)
      input_data, labels, _, X = build_dataset()
      model = clone(estimator)
      set_random_state(model)
      model.fit(input_data, labels)
      metric = model.get_metric()

      n_features = X.shape[1]
      a, b = (rng.randn(n_features), rng.randn(n_features))
      expected_dist = mahalanobis(a[None], b[None],
                                  VI=model.get_mahalanobis_matrix())
      assert_allclose(metric(a, b), expected_dist, rtol=1e-15)

The test would even show more explicitly what the get_metric function does.
I agree, let's do it while we're at it :)
I just added the squared option and addressed all the comments, so I think as soon as it is green we should be good to go.
- ensure that the fitted transformer_ is always 2D:
  - in the result returned from transformer_from_metric
  - in the code of metric learners that don't call transformer_from_metric
  - for metric learners that cannot work on 1 feature, ensure it when checking the input
- add a test to check this behaviour
I just added a commit that should fix the previous error; it does what the commit message above describes.
Where was a non-2D transformer generated?
Now that I look back at it, I realize it was only this line that had fixed the test:

    - dist = np.dot(transformed_diff, transformed_diff.T)
    + dist = transformed_diff.dot(transformed_diff.T)

(The problem was that [...]) So I guess in fact this [...]
…sions and replace it by str
I am not sure I follow. I think a transformer should always be 2D. Even assuming the data is 1D, I would want the shape to be [...]
I also do not see why algorithms wouldn't work on data with a single feature (it is useless, sure, but it should probably work).
Yes, that's the case: even without forcing with [...]
I agree.
Why was the input scalar then? I think we have many tests to make sure the input data has the right shape, so even with only 1 feature we should always be working with 2D arrays. Am I missing something?
The input data has the right shape, but some intermediary terms can have various shapes because of numpy's operations. For SDML, for instance, we called [...]
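As an illustration of how np.atleast_2d can guarantee a 2-D transformer even for a 1-feature (scalar) metric, here is a simplified sketch; the library's actual transformer_from_metric handles more cases, and the Cholesky factorization here is just one way to obtain L with M = L.T L:

```python
import numpy as np

def transformer_from_metric(M):
    # Simplified sketch: derive L such that M = L.T L, and force the result
    # to be 2-D so a scalar or 1-feature metric still yields a (1, 1) array.
    L = np.linalg.cholesky(np.atleast_2d(M)).T
    return np.atleast_2d(L)

print(transformer_from_metric(np.array(4.)).shape)   # (1, 1)
print(transformer_from_metric(np.eye(3)).shape)      # (3, 3)
```

With the atleast_2d guard, downstream code like X.dot(transformer_.T) behaves uniformly regardless of the number of features.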
OK, looks good.
I just removed the useless [...]
No doc to update at this point? (I guess not, since the section on common methods is currently commented out.)
Ah yes, I forgot about the docs. I updated the commented-out part so as not to forget to add it when I update the docs in other PRs.
Fixes #147

TODO:
- Mahalanobis Distances in the docstrings (if possible)
- score_pairs in the docstring
- np.atleast2d (those in transformer_from_metric and those just before returning the transformer_)