[WIP] New API proposal #85


Closed · wants to merge 35 commits into from

Changes shown from 15 commits

Commits (35)
3e5fbc3  Add new class structure (Feb 26, 2018)
0cbf1ae  Put back TransformerMixin in BaseEstimator to inherit Transformer beh… (Feb 26, 2018)
300dada  add ConstrainedDataset object (Feb 27, 2018)
8615634  simplify constraints to always keep a view on X (Feb 28, 2018)
a478baa  add check for input formats (Mar 2, 2018)
3744bec  add basic testing to ConstrainedDataset (Mar 2, 2018)
214d991  correct asterisk bug (Mar 2, 2018)
4f4ce8b  begin work to dissociate classes (Mar 5, 2018)
ac00b8b  update MMC with constrained_dataset (Mar 5, 2018)
33561ab  Fixes according to review https://github.com/metric-learn/metric-lear… (Mar 6, 2018)
7f40c56  make mixins rather than classes hierarchy for inheriting special methods (Mar 6, 2018)
402f397  Merge branch 'new_api' into feat/class_dissociation (Mar 6, 2018)
47a9372  Make changes according to review https://github.com/metric-learn/metr… (Mar 13, 2018)
41dc123  Finalize class dissociation into mixins (Mar 6, 2018)
5f63f24  Merge branch 'feat/class_dissociation' into new_api (Mar 19, 2018)
fb0d118  separate valid and invalid input testing (Mar 20, 2018)
df8a340  correct too long line syntax (Mar 20, 2018)
e3e7e0c  clarify definition of variables in tests (Mar 20, 2018)
5a9c2e5  simplify unwrap pairs and make it more robust to y dimension (Mar 20, 2018)
cf94740  fix bug due to bad refactoring of c_shape (Mar 20, 2018)
52f4516  simplify wrap pairs (Mar 20, 2018)
079bb13  make QuadrupletsMixin inherit from WeaklySupervisedMixin (Mar 21, 2018)
da7c8e7  add NotImplementedError for abstract mixins (Mar 21, 2018)
8192d11  put TransformerMixin inline (Mar 21, 2018)
2d0f1ca  put random state at top of file (Mar 21, 2018)
6c59a1a  add transform, predict, decision_function, and scoring for weakly sup… (Mar 6, 2018)
b70163a  Add tests (Mar 19, 2018)
a12eb9a  Add documentation (Mar 23, 2018)
b1f6c23  fix typo or/of (Mar 30, 2018)
b0ec33b  Add tests for sparse matrices, dataframes and lists (Apr 12, 2018)
64f5762  Fix Transformer interface (cf. review https://github.com/metric-learn… (Apr 12, 2018)
2cf78dd  Do not separate classes if not needed (cf. https://github.com/metric-… (Apr 12, 2018)
11a8ff1  Fix ascii invisible character (Apr 12, 2018)
a768cbf  Fix test attribute error and numerical problems with new dataset (Apr 12, 2018)
335d8f4  Fix unittest hierarchy of classes (Apr 12, 2018)
81 changes: 79 additions & 2 deletions metric_learn/base_metric.py
@@ -1,8 +1,43 @@
- from numpy.linalg import inv, cholesky
- from sklearn.base import BaseEstimator, TransformerMixin
from numpy.linalg import cholesky
from sklearn.base import BaseEstimator
from sklearn.utils.validation import check_array


class TransformerMixin(object):

[Contributor] Does any class other than BaseMetricLearner use this mixin? If not, I'd just inline it into BaseMetricLearner.

[Member Author] Done

"""Mixin class for all transformers in metric-learn. Same as the one in
scikit-learn, but the documentation is changed: this Transformer is
allowed to take as y a non array-like input"""

  def fit_transform(self, X, y=None, **fit_params):
    """Fit to data, then transform it.

    Fits transformer to X and y with optional parameters fit_params
    and returns a transformed version of X.

    Parameters
    ----------
    X : numpy array of shape [n_samples, n_features]
        Training set.

    y : numpy array of shape [n_samples] or 4-tuple of arrays

[Member] With the new API y should only be constraint labels or None, shouldn't it?

[Member Author] Thanks, indeed I forgot to update the 4-tuple part.
Originally I wanted to allow fit_transform to be used in a classical way (with labels as y) but also in the weakly supervised way (with constraint labels), and it could be None in both cases. So something like:

    y : None or numpy array of shape [n_samples] or [n_constraints]
        Target values, or constraints labels

However, now I am not sure it is a good idea: maybe we should make this fit_transform specific to weakly supervised learning (with y being constraint labels or None, as you say) and inline it into WeaklySupervisedMixin rather than BaseEstimator, and make SupervisedMixin inherit from scikit-learn's TransformerMixin to get the classical fit_transform for supervised estimators, like this:

    class WeaklySupervisedMixin(object):
      def fit_transform(self, X_constrained, y=None, **fit_params):
        """
        [...]
        y : None, or numpy array of shape [n_constraints]
            Constraints labels.
        """

    class SupervisedMixin(TransformerMixin):
        Target values, or constraints (a, b, c, d) indices into X, with
        (a, b) specifying similar and (c, d) dissimilar pairs.

    Returns
    -------
    X_new : numpy array of shape [n_samples, n_features_new]
        Transformed array.

    """
    # non-optimized default implementation; override when a better
    # method is possible for a given clustering algorithm
    if y is None:
      # fit method of arity 1 (unsupervised transformation)
      return self.fit(X, **fit_params).transform(X)
    else:
      # fit method of arity 2 (supervised transformation)
      return self.fit(X, y, **fit_params).transform(X)
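
For illustration, a minimal usage sketch of this fit_transform, using Covariance (the unsupervised learner refactored further down in this diff; the metric_learn import path is an assumption):

    import numpy as np
    from metric_learn import Covariance  # assumed import path

    X = np.random.randn(10, 3)
    # y is None, so this takes the arity-1 branch: fit(X).transform(X)
    X_new = Covariance().fit_transform(X)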

class BaseMetricLearner(BaseEstimator, TransformerMixin):
  def __init__(self):
    raise NotImplementedError('BaseMetricLearner should not be instantiated')
@@ -49,3 +84,45 @@ def transform(self, X=None):
    X = check_array(X, accept_sparse=True)
    L = self.transformer()
    return X.dot(L.T)


class SupervisedMixin(object):

  def fit(self, X, y):
    raise NotImplementedError


class UnsupervisedMixin(object):

  def fit(self, X, y=None):
    raise NotImplementedError


class WeaklySupervisedMixin(object):

[Contributor] Add NotImplementedError for the __init__ methods of this and the above abstract mixins.

[Member Author] Done


  def fit(self, X, constraints, **kwargs):
    return self._fit(X, constraints, **kwargs)


class PairsMixin(WeaklySupervisedMixin):

  def __init__(self):
    raise NotImplementedError('PairsMixin should not be instantiated')
  # TODO: introduce specific scoring functions etc


class TripletsMixin(WeaklySupervisedMixin):

  def __init__(self):
    raise NotImplementedError('TripletsMixin should not be '
                              'instantiated')
  # TODO: introduce specific scoring functions etc


class QuadrupletsMixin(UnsupervisedMixin):

[Contributor] Quadruplets imply weak supervision.

[Member Author] I agree. Done


  def __init__(self):
    raise NotImplementedError('QuadrupletsMixin should not be '
                              'instantiated')
  # TODO: introduce specific scoring functions etc
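
To make the intended composition pattern concrete, here is a minimal sketch of a hypothetical concrete learner (not part of this diff; the _ITML/ITML split further down follows the same shape):

    import numpy as np

    class _MyPairsLearner(BaseMetricLearner):
      """Algorithm implementation: the actual fitting logic lives in _fit."""
      def __init__(self, some_param=1.0):
        self.some_param = some_param

      def _fit(self, X_constrained, y):
        # placeholder "learning": identity metric of the right dimension
        self.M_ = np.eye(X_constrained.X.shape[1])
        return self

    class MyPairsLearner(_MyPairsLearner, PairsMixin):
      # fit(X_constrained, y) is inherited from WeaklySupervisedMixin
      # and simply delegates to _fit
      pass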

99 changes: 87 additions & 12 deletions metric_learn/constraints.py
@@ -6,8 +6,9 @@
import warnings
from six.moves import xrange
from scipy.sparse import coo_matrix
from sklearn.utils import check_array

- __all__ = ['Constraints']
__all__ = ['Constraints', 'ConstrainedDataset']


class Constraints(object):
@@ -18,17 +19,6 @@ def __init__(self, partial_labels):
    self.known_label_idx, = np.where(partial_labels >= 0)
    self.known_labels = partial_labels[self.known_label_idx]

-  def adjacency_matrix(self, num_constraints, random_state=np.random):
-    a, b, c, d = self.positive_negative_pairs(num_constraints,
-                                              random_state=random_state)
-    row = np.concatenate((a, c))
-    col = np.concatenate((b, d))
-    data = np.ones_like(row, dtype=int)
-    data[len(a):] = -1
-    adj = coo_matrix((data, (row, col)), shape=(self.num_points,)*2)
-    # symmetrize
-    return adj + adj.T

  def positive_negative_pairs(self, num_constraints, same_length=False,
                              random_state=np.random):
    a, b = self._pairs(num_constraints, same_label=True,
@@ -100,3 +90,88 @@ def random_subset(all_labels, num_preserved=np.inf, random_state=np.random):
    partial_labels = np.array(all_labels, copy=True)
    partial_labels[idx] = -1
    return Constraints(partial_labels)


class ConstrainedDataset(object):

  def __init__(self, X, c):
    # we convert the data to a suitable format
    self.X = check_array(X, accept_sparse=True, dtype=None, warn_on_dtype=True)
    self.c = check_array(c, dtype=['int'] + np.sctypes['int']
                         + np.sctypes['uint'],
                         # we add 'int' at the beginning to tell it is the
                         # default format we want in case of conversion
                         ensure_2d=False, ensure_min_samples=False,
                         ensure_min_features=False, warn_on_dtype=True)
    self._check_index(self.X.shape[0], self.c)
    self.shape = (len(c) if hasattr(c, '__len__') else 0, self.X.shape[1])

  def __getitem__(self, item):
    return ConstrainedDataset(self.X, self.c[item])

  def __len__(self):
    # __len__ must return an int: report the number of constraints
    return self.shape[0]

  def __str__(self):
    return self.toarray().__str__()

  def __repr__(self):
    return self.toarray().__repr__()

  def toarray(self):
    return self.X[self.c]

  @staticmethod
  def _check_index(length, indices):

[Member] Maybe this could also check for potential duplicates? Could simply show a warning when this is the case. (One could also remove them, but this might create problems later when constraint labels are used.)

[Member Author @wdevazelhes, Mar 5, 2018] I agree, I will implement it in a next commit.
    max_index = np.max(indices)
    min_index = np.min(indices)
    pb_index = None
    if max_index >= length:
      pb_index = max_index
    elif min_index < -length:  # out-of-range negative index
      pb_index = min_index
    if pb_index is not None:
      raise IndexError("ConstrainedDataset cannot be created: the length of "
                       "the dataset is {}, so index {} is out of range."
                       .format(length, pb_index))
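
Following up on the review comment above, the duplicate check could look something like this (a sketch only; the helper name and warning text are assumptions, and np.unique with an axis argument needs numpy >= 1.13):

    import warnings
    import numpy as np

    def _check_duplicates(indices):
      # warn rather than drop, so constraint labels stay aligned with c
      unique_rows = np.unique(indices, axis=0)
      if unique_rows.shape[0] < np.asarray(indices).shape[0]:
        warnings.warn("ConstrainedDataset contains duplicate constraints.")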

  @staticmethod
  def pairs_from_labels(y):
    # TODO: to be implemented
    raise NotImplementedError

  @staticmethod
  def triplets_from_labels(y):
    # TODO: to be implemented
    raise NotImplementedError
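
For reference, a small usage sketch of ConstrainedDataset as defined above (made-up values):

    import numpy as np

    X = np.arange(12.).reshape(4, 3)       # 4 points in R^3
    c = np.array([[0, 1], [2, 3]])         # two pairs of indices into X
    cd = ConstrainedDataset(X, c)

    cd.shape       # (2, 3): n_constraints x n_features
    cd.toarray()   # X[c], an array of shape (2, 2, 3)
    cd[0:1]        # a new ConstrainedDataset keeping only the first pair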


def unwrap_pairs(X_constrained, y):
  a = X_constrained.c[(y == 0)[:, 0]][:, 0]
  b = X_constrained.c[(y == 0)[:, 0]][:, 1]

[Contributor] This seems redundant. How about:

    y_zero = (y == 0).ravel()
    a, b = X_constrained[y_zero].T
    c, d = X_constrained[~y_zero].T

[Member Author @wdevazelhes, Mar 19, 2018] Yes, will do.
  c = X_constrained.c[(y == 1)[:, 0]][:, 0]
  d = X_constrained.c[(y == 1)[:, 0]][:, 1]
  X = X_constrained.X
  return X, [a, b, c, d]

def wrap_pairs(X, constraints):
  a = np.array(constraints[0])
  b = np.array(constraints[1])
  c = np.array(constraints[2])
  d = np.array(constraints[3])
  constraints = np.vstack([np.hstack([a[:, None], b[:, None]]),
                           np.hstack([c[:, None], d[:, None]])])

[Contributor]

    a, b, c, d = constraints
    constraints = np.vstack((np.column_stack((a, b)), np.column_stack((c, d))))
    # or if we have numpy 1.13+
    constraints = np.block([[a, b], [c, d]])

[Member Author] Thanks, will do.
  y = np.vstack([np.zeros((len(a), 1)), np.ones((len(c), 1))])
  X_constrained = ConstrainedDataset(X, constraints)
  return X_constrained, y
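
A quick round-trip sketch of wrap_pairs and unwrap_pairs (illustrative values):

    import numpy as np

    X = np.random.randn(5, 2)
    a, b = [0, 1], [1, 2]                  # similar pairs
    c, d = [3, 4], [2, 0]                  # dissimilar pairs

    X_constrained, y = wrap_pairs(X, [a, b, c, d])
    # X_constrained.c == [[0, 1], [1, 2], [3, 4], [2, 0]]
    # y == [[0], [0], [1], [1]]  (0 = similar, 1 = dissimilar)

    X2, (a2, b2, c2, d2) = unwrap_pairs(X_constrained, y)
    # recovers the four index arrays above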

def unwrap_to_graph(X_constrained, y):

  X, [a, b, c, d] = unwrap_pairs(X_constrained, y)
  row = np.concatenate((a, c))
  col = np.concatenate((b, d))
  data = np.ones_like(row, dtype=int)
  data[len(a):] = -1
  adj = coo_matrix((data, (row, col)),
                   shape=(X_constrained.X.shape[0],) * 2)
  return X_constrained.X, adj + adj.T
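
unwrap_to_graph rebuilds the signed adjacency matrix that the removed Constraints.adjacency_matrix used to produce: +1 entries for similar pairs, -1 for dissimilar ones, symmetrized. A tiny worked example (illustrative):

    import numpy as np

    X = np.eye(3)                          # 3 points
    X_constrained, y = wrap_pairs(X, [[0], [1], [0], [2]])
    _, adj = unwrap_to_graph(X_constrained, y)
    adj.toarray()
    # array([[ 0,  1, -1],
    #        [ 1,  0,  0],
    #        [-1,  0,  0]])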
11 changes: 8 additions & 3 deletions metric_learn/covariance.py
@@ -12,17 +12,17 @@
import numpy as np
from sklearn.utils.validation import check_array

- from .base_metric import BaseMetricLearner
from .base_metric import BaseMetricLearner, UnsupervisedMixin


- class Covariance(BaseMetricLearner):
class _Covariance(BaseMetricLearner):
  def __init__(self):
    pass

  def metric(self):
    return self.M_

-  def fit(self, X, y=None):
  def _fit(self, X, y=None):
    """
    X : data matrix, (n x d)
    y : unused
@@ -34,3 +34,8 @@ def fit(self, X, y=None):
    else:
      self.M_ = np.linalg.inv(self.M_)
    return self

class Covariance(_Covariance, UnsupervisedMixin):

[Contributor] If nothing else will use _Covariance, I'd prefer to eliminate it and use the UnsupervisedMixin directly.

[Contributor] In general, I don't think it's worth separating out the algorithm implementation unless/until we have >1 class that needs it.

  def fit(self, X, y=None):
    return self._fit(X, y)
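
Usage is unchanged for callers; a minimal sketch (import path assumed):

    import numpy as np
    from metric_learn import Covariance

    X = np.random.randn(100, 4)
    cov = Covariance().fit(X)    # fit() delegates to _Covariance._fit()
    M = cov.metric()             # inverse covariance (Mahalanobis) matrix
    X_new = cov.transform(X)     # X mapped by the learned linear transformation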
29 changes: 17 additions & 12 deletions metric_learn/itml.py
@@ -19,12 +19,12 @@
from sklearn.metrics import pairwise_distances
from sklearn.utils.validation import check_array, check_X_y

- from .base_metric import BaseMetricLearner
- from .constraints import Constraints
from .base_metric import BaseMetricLearner, PairsMixin, SupervisedMixin
from .constraints import Constraints, unwrap_pairs, wrap_pairs
from ._util import vector_norm


- class ITML(BaseMetricLearner):
class _ITML(BaseMetricLearner):
  """Information Theoretic Metric Learning (ITML)"""
  def __init__(self, gamma=1., max_iter=1000, convergence_threshold=1e-3,
               A0=None, verbose=False):
@@ -73,19 +73,19 @@ def _process_inputs(self, X, constraints, bounds):
    self.A_ = check_array(self.A0)
    return a, b, c, d

-  def fit(self, X, constraints, bounds=None):
  def _fit(self, X_constrained, y, bounds=None):
    """Learn the ITML model.

    Parameters
    ----------
-    X : (n x d) data matrix
-      each row corresponds to a single instance
-    constraints : 4-tuple of arrays
-      (a,b,c,d) indices into X, with (a,b) specifying positive and (c,d)
-      negative pairs
    X_constrained : ConstrainedDataset
      with constraints being an array of shape [n_constraints, 2]
    y : array-like, shape (n_constraints x 1)
      labels of the constraints
    bounds : list (pos,neg) pairs, optional
      bounds on similarity, s.t. d(X[a],X[b]) < pos and d(X[c],X[d]) > neg
    """
    X, constraints = unwrap_pairs(X_constrained, y)
    a, b, c, d = self._process_inputs(X, constraints, bounds)
    gamma = self.gamma
    num_pos = len(a)
@@ -140,7 +140,7 @@ def metric(self):
    return self.A_


- class ITML_Supervised(ITML):
class ITML_Supervised(_ITML, SupervisedMixin):
  """Information Theoretic Metric Learning (ITML)"""
  def __init__(self, gamma=1., max_iter=1000, convergence_threshold=1e-3,
               num_labeled=np.inf, num_constraints=None, bounds=None, A0=None,
@@ -164,7 +164,7 @@ def __init__(self, gamma=1., max_iter=1000, convergence_threshold=1e-3,
    verbose : bool, optional
      if True, prints information while learning
    """
-    ITML.__init__(self, gamma=gamma, max_iter=max_iter,
    _ITML.__init__(self, gamma=gamma, max_iter=max_iter,
                   convergence_threshold=convergence_threshold,
                   A0=A0, verbose=verbose)
    self.num_labeled = num_labeled
@@ -195,4 +195,9 @@ def fit(self, X, y, random_state=np.random):
                                  random_state=random_state)
    pos_neg = c.positive_negative_pairs(num_constraints,
                                        random_state=random_state)
-    return ITML.fit(self, X, pos_neg, bounds=self.bounds)
    X_constrained, y = wrap_pairs(X, pos_neg)
    return _ITML._fit(self, X_constrained, y, bounds=self.bounds)

class ITML(_ITML, PairsMixin):

  pass
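
With this split, the weakly-supervised entry point takes a ConstrainedDataset plus constraint labels, while ITML_Supervised keeps the plain (X, y) interface. A sketch of the new call signature (random values, only to show the shapes; convergence on such data is not guaranteed):

    import numpy as np
    from metric_learn.constraints import ConstrainedDataset

    X = np.random.randn(20, 3)
    pairs = np.random.randint(0, 20, (10, 2))     # indices into X
    y_pairs = np.random.randint(0, 2, (10, 1))    # 0 = similar, 1 = dissimilar

    itml = ITML()
    itml.fit(ConstrainedDataset(X, pairs), y_pairs)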
22 changes: 19 additions & 3 deletions metric_learn/lfda.py
@@ -18,10 +18,10 @@
from sklearn.metrics import pairwise_distances
from sklearn.utils.validation import check_X_y

- from .base_metric import BaseMetricLearner
from .base_metric import BaseMetricLearner, SupervisedMixin


- class LFDA(BaseMetricLearner):
class _LFDA(BaseMetricLearner):
  '''
  Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction
  Sugiyama, ICML 2006
@@ -77,7 +77,7 @@ def _process_inputs(self, X, y):

    return self.X_, y, num_classes, n, d, dim, k

-  def fit(self, X, y):
  def _fit(self, X, y):
    '''Fit the LFDA model.

    Parameters
@@ -146,3 +146,19 @@ def _eigh(a, b, dim):
  except np.linalg.LinAlgError:
    pass
  return scipy.linalg.eig(a, b)


class LFDA(_LFDA, SupervisedMixin):

  def fit(self, X, y):
    '''Fit the LFDA model.

    Parameters
    ----------
    X : (n, d) array-like
        Input data.

    y : (n,) array-like
        Class labels, one per point of data.
    '''
    return self._fit(X, y)
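
The public LFDA keeps the classic supervised interface; a minimal sketch:

    import numpy as np
    from sklearn.datasets import load_iris

    X, y = load_iris(return_X_y=True)
    lfda = LFDA()                  # see _LFDA.__init__ for the parameters
    X_new = lfda.fit(X, y).transform(X)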