
Implementation of MMC #61


Merged: 7 commits into scikit-learn-contrib:master on May 26, 2017

Conversation

@Callidior (Contributor) commented May 23, 2017

This PR implements Probabilistic Global Distance Metric Learning (PGDM) mentioned in #13, sometimes also referred to as MMC, originally proposed in the following paper:

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell.
"Distance metric learning with application to clustering with side-information."
Advances in Neural Information Processing Systems (NIPS) 15, pp. 505-512, 2002.

The implementation is mainly a translation of the Matlab reference code into Python, optimized slightly to take advantage of numpy vectorization. This makes it 6 times faster than the reference code on the iris dataset.

Both variants of PGDM proposed in the original paper are supported: learning a full or a diagonal Mahalanobis matrix.

The implementation has been validated against the results obtained from the reference code on the iris dataset. Corresponding unit tests are provided.

The class separation score of less than 0.15 on the iris dataset is one of the best scores currently reported in the unit tests for this dataset.


Note that PGDM learns only a positive semi-definite metric M, which precludes obtaining a transformer L with L.T * L = M via the Cholesky decomposition. Instead, PGDM overrides transformer() and uses an eigendecomposition M = V * W * V.T to obtain a transformer L = W^(1/2) * V.T.
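
For illustration, a minimal numpy sketch of this kind of transformer computation (not the exact code in this PR; the function name is made up):

import numpy as np

def transformer_from_psd_metric(M):
    # eigendecomposition M = V * diag(w) * V.T
    w, V = np.linalg.eigh(M)
    # clip tiny negative eigenvalues caused by numerical noise
    w = np.maximum(w, 0)
    # L = diag(sqrt(w)) * V.T, so that L.T.dot(L) recovers M
    return V.T * np.sqrt(w)[:, None]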


Important note: PGDM learns a distance function (x-y)^T * M * (x-y), as opposed to (x-y)^T * inv(M) * (x-y) as currently stated in the documentation of metric_learn. Thus, this PR depends on solving issue #57, e.g., by merging PR #60. If you decide to stick to the current definition of a "metric", PGDM.metric() would need to be changed in order to return the inverse of the learned matrix.
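
To make the distinction concrete, a small sketch (mahalanobis_sq is a hypothetical helper, not part of this PR):

import numpy as np

def mahalanobis_sq(x, y, M):
    # squared distance as learned by PGDM: (x-y)^T * M * (x-y)
    d = x - y
    return d.dot(M).dot(d)

# The definition currently in the documentation would instead use
# np.linalg.inv(M) in place of M above.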

@Callidior (Author):

I'm not sure whether PGDM is the correct name for this algorithm. The survey of Yang introduces this method by Xing et al. as "Supervised Global Distance Metric Learning" and then extends this approach in a subsection to "Probabilistic Global Distance Metric Learning". But the paper and the reference implementation referred to in #13 are not probabilistic at all.

Maybe we should change the name to just "Global Distance Metric Learning", which could be abbreviated as GDML, or to MMC, as suggested by Weinberger et al.

What do you think?

@perimosocordiae (Contributor):

I agree that the wish list was wrongly conflating this approach with the probabilistic version from the survey, and I'm happy to rename this version to something better.

Despite the referenced usage, I think MMC is a rather poor name, as the original paper uses clustering as an example application, not the entire basis for the work. Do any other citations propose nicer names for this algorithm? In the absence of any better ideas, GDML can work, even though it's pretty uninformative.

@Callidior (Author) commented May 24, 2017

I agree with you that both MMC and GDML are not the most informative names. Unfortunately, I haven't spotted any other term for this algorithm in the existing literature yet and Xing et al. completely avoid giving their algorithm a name (besides just "metric learning").

Although this algorithm is certainly applicable beyond clustering, one could also argue that learning a metric particularly suited to clustering is an essential part of the method's objective: it minimizes the total intra-class distance while keeping the total inter-class distance above a given margin, so the classes end up tightly clustered in the space induced by such a metric. However, the same could of course be said of many other metric learning algorithms.
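
For reference, a rough numpy sketch of the two quantities involved (the helper and its signature are hypothetical; (a, b) index similar pairs and (c, d) dissimilar pairs):

import numpy as np

def objective_terms(X, a, b, c, d, M):
    diff_s = X[a] - X[b]  # differences of similar pairs
    diff_d = X[c] - X[d]  # differences of dissimilar pairs
    # total squared intra-class distance (to be minimized)
    intra = (diff_s.dot(M) * diff_s).sum()
    # total inter-class distance (to be kept above a margin)
    inter = np.sqrt((diff_d.dot(M) * diff_d).sum(axis=1)).sum()
    return intra, inter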

I leave this decision to you. As soon as you notify me about the final name, I will rename everything and update this PR.

@perimosocordiae left a review:

Very nice addition! I especially appreciate the thorough comments and docstrings.

  b.append(j)
else:
  c.append(i)
  d.append(j)
Contributor:

This can be made a bit nicer with vectorization:

# (i, j) is True iff samples i and j share a label
mask = self.iris_labels[None] == self.iris_labels[:,None]
a, b = np.nonzero(np.triu(mask, k=1))   # similar pairs (upper triangle)
c, d = np.nonzero(np.triu(~mask, k=1))  # dissimilar pairs (upper triangle)

d.append(j)

# Full metric
pgdm = PGDM(convergence_threshold = 0.01)
Contributor:

Style nitpick: don't leave spaces around the = in keyword arguments. (here and throughout)
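
That is, the call quoted above would become:

pgdm = PGDM(convergence_threshold=0.01)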

from .constraints import Constraints


# hack around lack of axis kwarg in older numpy versions
Contributor:

This code is used in ITML as well, so let's pull this hack out into a new file, metric_learn/_util.py.
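
The shared helper could look roughly like this (a sketch based on the existing hack; the exact name and contents of metric_learn/_util.py are up to the PR):

# metric_learn/_util.py
import numpy as np

# hack around lack of axis kwarg in older numpy versions
try:
  np.linalg.norm([[4]], axis=1)
except TypeError:
  def vector_norm(X):
    return np.apply_along_axis(np.linalg.norm, 1, X)
else:
  def vector_norm(X):
    return np.linalg.norm(X, axis=1)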


# check to make sure that no two constrained vectors are identical
a,b,c,d = constraints
ident = _vector_norm(X[a] - X[b]) > 1e-9
Contributor:

In this usage, ident is actually a mask where True implies non-identical. Maybe non_ident instead?

Contributor Author:

I totally agree. I've copied this part from ITML. We should also change it there.

ident = _vector_norm(X[a] - X[b]) > 1e-9
a, b = a[ident], b[ident]
ident = _vector_norm(X[c] - X[d]) > 1e-9
c, d = c[ident], d[ident]
Contributor:

We should also ensure that our constraint vectors aren't empty after this sanity check.

Adapted from Matlab code at http://www.cs.cmu.edu/%7Eepxing/papers/Old_papers/code_Metric_online.tar.gz
"""

from __future__ import print_function, absolute_import
Contributor:

Importing division is a good idea as well.
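
i.e., the import line above would become:

from __future__ import print_function, absolute_import, division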

t = w.dot(A.ravel() / 100.0)

w1 = w / np.linalg.norm(w) # make `w` a unit vector
t1 = t / np.linalg.norm(w) # distance from origin to `w^T*x=t` plane
Contributor:

We should compute the norm of w once and reuse it.
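
For example, a sketch of the suggested change, keeping the names from the snippet above:

norm_w = np.linalg.norm(w)
w1 = w / norm_w  # make `w` a unit vector
t1 = t / norm_w  # distance from origin to `w^T*x=t` plane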

Contributor:

Also, style nitpick: use two spaces between the end of the code and the comment character:

foo()  # comment

obj_previous = self._fD(X, c, d, A_old) # g(A_old)
obj = self._fD(X, c, d, A) # g(A)

if ((obj > obj_previous) or (cycle == 0)) and (satisfy):
Contributor:

Don't need so many parens here:

if satisfy and (obj > obj_previous or cycle == 0):

if ((obj > obj_previous) or (cycle == 0)) and (satisfy):

# If projection of 1 and 2 is successful, and such projection
# imprives objective function, slightly increase learning rate
Contributor:

imprives -> improves

dim = X.shape[1]
diff = X[c] - X[d]
M = np.einsum('ij,ik->ijk', diff, diff) # outer products of all rows in `diff`
dist = np.sqrt(np.sum(M * A[None,:,:], axis = (1,2)))
Contributor:

This line might also benefit from conversion to an np.einsum call.
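
For instance, something along the lines of (this matches the version adopted later in this PR):

dist = np.sqrt(np.einsum('ijk,jk', M, A))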

@perimosocordiae (Contributor):

Okay, I can live with MMC.

@Callidior (Author):

Thanks for your valuable comments!

I hope to find time to implement the requested changes by next week.

@perimosocordiae mentioned this pull request on May 24, 2017
@Callidior changed the title from "Implementation of PGDM" to "Implementation of MMC" on May 25, 2017
@Callidior (Author) commented May 25, 2017

@perimosocordiae I've addressed your change requests and renamed the algorithm to MMC.

@perimosocordiae left a review:

Looking good, just a few more small comments.

if len(a) == 0:
  raise RuntimeError('No similarity constraints given for MMC.')
if len(c) == 0:
  raise RuntimeError('No dissimilarity constraints given for MMC.')
Contributor:

I think a ValueError makes more sense here. We should also specify that no non-trivial constraints were provided, otherwise the user might be confused.
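
Roughly along these lines (a sketch of the suggestion, not necessarily the final wording):

if len(a) == 0:
  raise ValueError('No non-trivial similarity constraints given for MMC.')
if len(c) == 0:
  raise ValueError('No non-trivial dissimilarity constraints given for MMC.')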

sum_deri = np.sum(M / (2 * (dist[:,None,None] + 1e-6)), axis = 0)
M = np.einsum('ij,ik->ijk', diff, diff) # outer products of all rows in `diff`
dist = np.sqrt(np.einsum('ijk,jk', M, A)) # equivalent to: np.sqrt(np.sum(M * A[None,:,:], axis=(1,2)))
sum_deri = np.einsum('ijk,i->jk', M, 0.5 / (dist + 1e-6)) # equivalent to: np.sum(M / (2 * (dist[:,None,None] + 1e-6)), axis=0)
Contributor:

Lines shouldn't go longer than 80 chars, so I'd put the comments on the previous line.

sum_deri2 = np.einsum(
    'ijk,i',
    np.einsum('ij,ik->ijk', diff_sq, diff_sq),
    -0.25 / np.maximum(1e-6, dist**3)
Contributor:

Nested calls to np.einsum are pretty hard to read. I think the previous version was fine, but you could also pull out the inner einsum to a local variable if this version is significantly faster.

It's also possible to do einsum with >2 arrays, but that can have unintended consequences (see numpy/numpy#5366). As of numpy 1.12 the situation is a bit better (see numpy/numpy#5488), but I don't think we need to go down that route.

Contributor Author:

I don't think so either. However, this version is about 4 times faster, since it does not require broadcasting dist from 1-D to 3-D, so I will just pull the inner einsum out into a local variable.

Contributor Author:

I have found an even faster way (about 10x) that computes this in a single einsum call, without the intermediate result:

sum_deri2 = np.einsum(
    'ij,ik->jk',
    diff_sq,
    diff_sq / (-4 * np.maximum(1e-6, dist**3))[:,None]
)

@Callidior (Author):

@perimosocordiae Thanks, fixed :)

@perimosocordiae merged commit 932de85 into scikit-learn-contrib:master on May 26, 2017
@perimosocordiae (Contributor):

Merged. Thanks for the PR, @Callidior!

@Callidior (Author):

Thanks for your helpful comments!

Please be aware, though, that the metric learned by this algorithm defines a distance (x-y)^T * M * (x-y) and not (x-y)^T * M^(-1) * (x-y) as stated in the readme. Some other existing metric learner implementations do the same, and hopefully we can get rid of that inconsistency soon :)

@Callidior deleted the pgdm branch on May 30, 2017