
Implementation of MMC #61


Merged: 7 commits into scikit-learn-contrib:master on May 26, 2017

Conversation

@Callidior (Contributor) commented May 23, 2017

This PR implements Probabilistic Global Distance Metric Learning (PGDM) mentioned in #13, sometimes also referred to as MMC, originally proposed in the following paper:

Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, and Stuart Russell.
"Distance metric learning with application to clustering with side-information."
Advances in Neural Information Processing Systems (NIPS) 15, pp. 505-512, 2002.

The implementation is mainly a translation of the Matlab reference code into Python, optimized slightly to take advantage of numpy vectorization. This makes it 6 times faster than the reference code on the iris dataset.

Both variants of PGDM proposed in the original paper are supported: learning a full or a diagonal Mahalanobis matrix.

The implementation has been validated against the results obtained from the reference code on the iris dataset. Corresponding unit tests are provided.

The class separation score of less than 0.15 on the iris dataset is one of the best scores currently reported in the unit tests for this dataset.


Note that PGDM learns only a positive semi-definite metric M, which precludes obtaining a transformer L with L.T * L = M via the Cholesky decomposition. Instead, PGDM overrides transformer() and uses an eigendecomposition M = V * W * V.T to obtain a transformer L = W^(1/2) * V.T.
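
For illustration, a minimal numpy sketch of this kind of transformer computation (not the exact code in this PR; the function name is made up):

import numpy as np

def transformer_from_psd_metric(M):
    # eigendecomposition M = V * diag(w) * V.T
    w, V = np.linalg.eigh(M)
    # clip tiny negative eigenvalues caused by numerical noise
    w = np.maximum(w, 0)
    # L = diag(sqrt(w)) * V.T, so that L.T.dot(L) recovers M
    return V.T * np.sqrt(w)[:, None]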


Important note: PGDM learns a distance function (x-y)^T * M * (x-y), as opposed to (x-y)^T * inv(M) * (x-y) as currently stated in the documentation of metric_learn. Thus, this PR depends on solving issue #57, e.g., by merging PR #60. If you decide to stick to the current definition of a "metric", PGDM.metric() would need to be changed in order to return the inverse of the learned matrix.
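
To make the distinction concrete, a small sketch (mahalanobis_sq is a hypothetical helper, not part of this PR):

import numpy as np

def mahalanobis_sq(x, y, M):
    # squared distance as learned by PGDM: (x-y)^T * M * (x-y)
    d = x - y
    return d.dot(M).dot(d)

# The definition currently in the documentation would instead use
# np.linalg.inv(M) in place of M above.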

@Callidior (Author):

I'm not sure whether PGDM is the correct name for this algorithm. The survey of Yang introduces this method by Xing et al. as "Supervised Global Distance Metric Learning" and then extends this approach in a subsection to "Probabilistic Global Distance Metric Learning". But the paper and the reference implementation referred to in #13 are not probabilistic at all.

Maybe we should change the name to just "Global Distance Metric Learning", which could be abbreviated as GDML, or to MMC, as suggested by Weinberger et al.

What do you think?

@perimosocordiae (Contributor):

I agree that the wish list was wrongly conflating this approach with the probabilistic version from the survey, and I'm happy to rename this version to something better.

Despite the referenced usage, I think MMC is a rather poor name, as the original paper uses clustering as an example application, not the entire basis for the work. Do any other citations propose nicer names for this algorithm? In the absence of any better ideas, GDML can work, even though it's pretty uninformative.

@Callidior (Author) commented May 24, 2017

I agree with you that both MMC and GDML are not the most informative names. Unfortunately, I haven't spotted any other term for this algorithm in the existing literature yet and Xing et al. completely avoid giving their algorithm a name (besides just "metric learning").

Although this algorithm is certainly applicable beyond clustering, one could also argue that learning a metric particularly suited to clustering is an essential part of the method's objective: it minimizes the total intra-class distance while keeping the total inter-class distance above a given margin, so the classes end up tightly clustered in the space induced by such a metric. However, the same could of course be said of many other metric learning algorithms.
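
For reference, a rough numpy sketch of the two quantities involved (the helper and its signature are hypothetical; (a, b) index similar pairs and (c, d) dissimilar pairs):

import numpy as np

def objective_terms(X, a, b, c, d, M):
    diff_s = X[a] - X[b]  # differences of similar pairs
    diff_d = X[c] - X[d]  # differences of dissimilar pairs
    # total squared intra-class distance (to be minimized)
    intra = (diff_s.dot(M) * diff_s).sum()
    # total inter-class distance (to be kept above a margin)
    inter = np.sqrt((diff_d.dot(M) * diff_d).sum(axis=1)).sum()
    return intra, inter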

I leave this decision to you. As soon as you notify me about the final name, I will rename everything and update this PR.

@perimosocordiae left a review:

Very nice addition! I especially appreciate the thorough comments and docstrings.

  b.append(j)
else:
  c.append(i)
  d.append(j)
Contributor:

This can be made a bit nicer with vectorization:

# (i, j) is True iff samples i and j share a label
mask = self.iris_labels[None] == self.iris_labels[:,None]
a, b = np.nonzero(np.triu(mask, k=1))   # similar pairs (upper triangle)
c, d = np.nonzero(np.triu(~mask, k=1))  # dissimilar pairs (upper triangle)

d.append(j)

# Full metric
pgdm = PGDM(convergence_threshold = 0.01)
Contributor:

Style nitpick: don't leave spaces around the = in keyword arguments. (here and throughout)
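
That is, the call quoted above would become:

pgdm = PGDM(convergence_threshold=0.01)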

from .constraints import Constraints


# hack around lack of axis kwarg in older numpy versions
Contributor:

This code is used in ITML as well, so let's pull this hack out into a new file, metric_learn/_util.py.
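
The shared helper could look roughly like this (a sketch based on the existing hack; the exact name and contents of metric_learn/_util.py are up to the PR):

# metric_learn/_util.py
import numpy as np

# hack around lack of axis kwarg in older numpy versions
try:
  np.linalg.norm([[4]], axis=1)
except TypeError:
  def vector_norm(X):
    return np.apply_along_axis(np.linalg.norm, 1, X)
else:
  def vector_norm(X):
    return np.linalg.norm(X, axis=1)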


# check to make sure that no two constrained vectors are identical
a,b,c,d = constraints
ident = _vector_norm(X[a] - X[b]) > 1e-9
Contributor:

In this usage, ident is actually a mask where True implies non-identical. Maybe non_ident instead?

Contributor Author:

I totally agree. I've copied this part from ITML. We should also change it there.

ident = _vector_norm(X[a] - X[b]) > 1e-9
a, b = a[ident], b[ident]
ident = _vector_norm(X[c] - X[d]) > 1e-9
c, d = c[ident], d[ident]
Contributor:

We should also ensure that our constraint vectors aren't empty after this sanity check.

Adapted from Matlab code at http://www.cs.cmu.edu/%7Eepxing/papers/Old_papers/code_Metric_online.tar.gz
"""

from __future__ import print_function, absolute_import
Contributor:

Importing division is a good idea as well.
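
i.e., the import line above would become:

from __future__ import print_function, absolute_import, division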

t = w.dot(A.ravel() / 100.0)

w1 = w / np.linalg.norm(w) # make `w` a unit vector
t1 = t / np.linalg.norm(w) # distance from origin to `w^T*x=t` plane
Contributor:

We should compute the norm of w once and reuse it.
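
For example, a sketch of the suggested change, keeping the names from the snippet above:

norm_w = np.linalg.norm(w)
w1 = w / norm_w  # make `w` a unit vector
t1 = t / norm_w  # distance from origin to `w^T*x=t` plane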

Contributor:

Also, style nitpick: use two spaces between the end of the code and the comment character:

foo()  # comment

obj_previous = self._fD(X, c, d, A_old) # g(A_old)
obj = self._fD(X, c, d, A) # g(A)

if ((obj > obj_previous) or (cycle == 0)) and (satisfy):
Contributor:

Don't need so many parens here:

if satisfy and (obj > obj_previous or cycle == 0):

if ((obj > obj_previous) or (cycle == 0)) and (satisfy):

# If projection of 1 and 2 is successful, and such projection
# imprives objective function, slightly increase learning rate
Contributor:

imprives -> improves

dim = X.shape[1]
diff = X[c] - X[d]
M = np.einsum('ij,ik->ijk', diff, diff) # outer products of all rows in `diff`
dist = np.sqrt(np.sum(M * A[None,:,:], axis = (1,2)))
Contributor:

This line might also benefit from conversion to an np.einsum call.
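
For instance, something along the lines of (this matches the version adopted later in this PR):

dist = np.sqrt(np.einsum('ijk,jk', M, A))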

@perimosocordiae (Contributor):

Okay, I can live with MMC.

@Callidior (Author):

Thanks for your valuable comments!

I hope to find time to implement the requested changes by next week.

@perimosocordiae mentioned this pull request on May 24, 2017
@Callidior changed the title from "Implementation of PGDM" to "Implementation of MMC" on May 25, 2017
@Callidior (Author) commented May 25, 2017

@perimosocordiae I've addressed your change requests and renamed the algorithm to MMC.

@perimosocordiae left a review:

Looking good, just a few more small comments.

if len(a) == 0:
  raise RuntimeError('No similarity constraints given for MMC.')
if len(c) == 0:
  raise RuntimeError('No dissimilarity constraints given for MMC.')
Contributor:

I think a ValueError makes more sense here. We should also specify that no non-trivial constraints were provided, otherwise the user might be confused.
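
Roughly along these lines (a sketch of the suggestion, not necessarily the final wording):

if len(a) == 0:
  raise ValueError('No non-trivial similarity constraints given for MMC.')
if len(c) == 0:
  raise ValueError('No non-trivial dissimilarity constraints given for MMC.')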

sum_deri = np.sum(M / (2 * (dist[:,None,None] + 1e-6)), axis = 0)
M = np.einsum('ij,ik->ijk', diff, diff) # outer products of all rows in `diff`
dist = np.sqrt(np.einsum('ijk,jk', M, A)) # equivalent to: np.sqrt(np.sum(M * A[None,:,:], axis=(1,2)))
sum_deri = np.einsum('ijk,i->jk', M, 0.5 / (dist + 1e-6)) # equivalent to: np.sum(M / (2 * (dist[:,None,None] + 1e-6)), axis=0)
Contributor:

Lines shouldn't go longer than 80 chars, so I'd put the comments on the previous line.

sum_deri2 = np.einsum(
    'ijk,i',
    np.einsum('ij,ik->ijk', diff_sq, diff_sq),
    -0.25 / np.maximum(1e-6, dist**3)
Contributor:

Nested calls to np.einsum are pretty hard to read. I think the previous version was fine, but you could also pull out the inner einsum to a local variable if this version is significantly faster.

It's also possible to do einsum with >2 arrays, but that can have unintended consequences (see numpy/numpy#5366). As of numpy 1.12 the situation is a bit better (see numpy/numpy#5488), but I don't think we need to go down that route.

Contributor Author:

I don't think so either. However, this version is about 4 times faster, since it does not require broadcasting dist from 1-D to 3-D, so I will just pull the inner einsum out into a local variable.

Contributor Author:

I have found an even faster way (about 10x) that computes this in a single einsum call, without the intermediate result:

sum_deri2 = np.einsum(
    'ij,ik->jk',
    diff_sq,
    diff_sq / (-4 * np.maximum(1e-6, dist**3))[:,None]
)

@Callidior (Author):

@perimosocordiae Thanks, fixed :)

@perimosocordiae merged commit 932de85 into scikit-learn-contrib:master on May 26, 2017
@perimosocordiae (Contributor):

Merged. Thanks for the PR, @Callidior!

@Callidior (Author):

Thanks for your helpful comments!

Please be aware, though, that the metric learned by this algorithm defines a distance (x-y)^T * M * (x-y) and not (x-y)^T * M^(-1) * (x-y) as stated in the readme. Some other existing metric learner implementations do the same, and hopefully we can get rid of that inconsistency soon :)

@Callidior deleted the pgdm branch on May 30, 2017