Skip to content

LMNN: fix mistake and improve performances #78

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Nov 27, 2017
Merged

LMNN: fix mistake and improve performances #78

merged 4 commits into from
Nov 27, 2017

Conversation

toto6
Copy link
Contributor

@toto6 toto6 commented Nov 24, 2017

Fix mistake in LMNN
Issue in function _find_impostors:

  • the squared euclidean distance is used to compute the margins in variable "margin_radii"
  • the euclidean distance is used (through the function sklearn.metrics.pairwise.pairwise_distances) to compute distances between samples of different labels in variable "dist"
  • the issue is that the impostors are found by testing "dist < margin_radii" which is wrong because "dist" represent distances, and "margin_radii" represent squared distances.

I propose to solve this problem by computing always the squared distances.

Faster LMNN
The use of the function sklearn.metrics.pairwise_distances gives horrible performances.
Replace it by a faster function.

Below is a little script to show the computation time of LMNN

import time
from sklearn.datasets import load_breast_cancer
from metric_learn import lmnn
dataset = load_breast_cancer()
s = time.time()
l1 = lmnn.LMNN(k=3)
l1.fit(dataset["data"], dataset["target"])
e = time.time()
print(e - s)

Before this commit, this script executes in 32 seconds
After this commit, this script executes in 1 second

Issue in function _find_impostors:
 - the squared euclidean distance is used to compute the margins in variable "margin_radii"
 - the euclidean distance is used (through the function sklearn.metrics.pairwise.pairwise_distances) to compute distances between samples of different labels in variable "dist"
 - the issue is that the impostors are found by testing "dist < margin_radii" which is wrong because "dist" represent distances, and "margin_radii" represent squared distances.

I propose to solve this problem by computing always the squared distances.
The use of the function sklearn.metrics.pairwise_distances gives horrible performances.
Replace it by a faster function.

Below is a little script to show the computation time of LMNN

import time
from sklearn.datasets import load_breast_cancer
from metric_learn import lmnn
dataset = load_breast_cancer()
s = time.time()
l1 = lmnn.LMNN(k=3)
l1.fit(dataset["data"], dataset["target"])
e = time.time()
print(e - s)

Before this commit, this script executes in 32 seconds
After this commit, this script executes in 1 second
@perimosocordiae
Copy link
Contributor

Thanks for the contribution! I agree with your analysis of the bug, though I was surprised to hear that sklearn's pairwise_distances was so slow.

So I looked a little closer and found that the issue is actually metric='seuclidean'. Regular L2 distance uses essentially the same algorithm as your pairwiseEuclidean:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/metrics/pairwise.py#L246

But if you use the seuclidean metric, it ends up calling Scipy's cdist function instead, which then has to make a copy of both input arrays to make them C-contiguous:
https://github.com/scipy/scipy/blob/v1.0.0/scipy/spatial/distance.py#L2361

So I think the easiest solution is, instead of rolling our own distance function, importing euclidean_distances from sklearn and calling it with squared=True anywhere that we want squared L2 distances. I did some quick testing and found that this also prevents the very slow running time.

@@ -185,7 +185,7 @@ def _select_targets(self):
target_neighbors = np.empty((self.X_.shape[0], self.k), dtype=int)
for label in self.labels_:
inds, = np.nonzero(self.label_inds_ == label)
dd = pairwise_distances(self.X_[inds])
dd = euclidean_distances(self.X_[inds], self.X_[inds], squared=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can leave out the second argument, similar to how pairwise_distances works.

@perimosocordiae perimosocordiae merged commit 4b889d4 into scikit-learn-contrib:master Nov 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants