[WIP] New API proposal #85

Status: Closed. This pull request proposed to merge 35 commits; the changes shown below are from 7 of those commits.

Commits (35):
3e5fbc3  Add new class structure (Feb 26, 2018)
0cbf1ae  Put back TransformerMixin in BaseEstimator to inherit Transformer beh… (Feb 26, 2018)
300dada  add ConstrainedDataset object (Feb 27, 2018)
8615634  simplify constraints to always keep a view on X (Feb 28, 2018)
a478baa  add check for input formats (Mar 2, 2018)
3744bec  add basic testing to ConstrainedDataset (Mar 2, 2018)
214d991  correct asterisk bug (Mar 2, 2018)
4f4ce8b  begin work to dissociate classes (Mar 5, 2018)
ac00b8b  update MMC with constrained_dataset (Mar 5, 2018)
33561ab  Fixes according to review https://github.com/metric-learn/metric-lear… (Mar 6, 2018)
7f40c56  make mixins rather than classes hierarchy for inheriting special methods (Mar 6, 2018)
402f397  Merge branch 'new_api' into feat/class_dissociation (Mar 6, 2018)
47a9372  Make changes according to review https://github.com/metric-learn/metr… (Mar 13, 2018)
41dc123  Finalize class dissociation into mixins (Mar 6, 2018)
5f63f24  Merge branch 'feat/class_dissociation' into new_api (Mar 19, 2018)
fb0d118  separate valid and invalid input testing (Mar 20, 2018)
df8a340  correct too long line syntax (Mar 20, 2018)
e3e7e0c  clarify definition of variables in tests (Mar 20, 2018)
5a9c2e5  simplify unwrap pairs and make it more robust to y dimension (Mar 20, 2018)
cf94740  fix bug due to bad refactoring of c_shape (Mar 20, 2018)
52f4516  simplify wrap pairs (Mar 20, 2018)
079bb13  make QuadrupletsMixin inherit from WeaklySupervisedMixin (Mar 21, 2018)
da7c8e7  add NotImplementedError for abstract mixins (Mar 21, 2018)
8192d11  put TransformerMixin inline (Mar 21, 2018)
2d0f1ca  put random state at top of file (Mar 21, 2018)
6c59a1a  add transform, predict, decision_function, and scoring for weakly sup… (Mar 6, 2018)
b70163a  Add tests (Mar 19, 2018)
a12eb9a  Add documentation (Mar 23, 2018)
b1f6c23  fix typo or/of (Mar 30, 2018)
b0ec33b  Add tests for sparse matrices, dataframes and lists (Apr 12, 2018)
64f5762  Fix Transformer interface (cf. review https://github.com/metric-learn… (Apr 12, 2018)
2cf78dd  Do not separate classes if not needed (cf. https://github.com/metric-… (Apr 12, 2018)
11a8ff1  Fix ascii invisible character (Apr 12, 2018)
a768cbf  Fix test attribute error and numerical problems with new dataset (Apr 12, 2018)
335d8f4  Fix unittest hierarchy of classes (Apr 12, 2018)
28 changes: 28 additions & 0 deletions metric_learn/base_metric.py
@@ -49,3 +49,31 @@ def transform(self, X=None):
    X = check_array(X, accept_sparse=True)
    L = self.transformer()
    return X.dot(L.T)


class SupervisedMetricLearner(BaseMetricLearner):

  def fit(self, X, y):
    raise NotImplementedError


class WeaklySupervisedMetricLearner(BaseMetricLearner):

  def fit(self, X, constraints):
    raise NotImplementedError


class PairsMetricLearner(WeaklySupervisedMetricLearner):

  def __init__(self):
    raise NotImplementedError('PairsMetricLearner should not be instantiated')
    # TODO: introduce specific scoring functions etc


class TripletsMetricLearner(WeaklySupervisedMetricLearner):

  def __init__(self):
    raise NotImplementedError('TripletsMetricLearner should not be '
                              'instantiated')
    # TODO: introduce specific scoring functions etc
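
For orientation, here is a minimal sketch of how a concrete algorithm could plug into this hierarchy (ExampleLearner and its identity transformer are hypothetical, purely illustrative):

import numpy as np
from metric_learn.base_metric import SupervisedMetricLearner

class ExampleLearner(SupervisedMetricLearner):
  # hypothetical learner: "learns" the identity metric
  def fit(self, X, y):
    self.L_ = np.eye(X.shape[1])
    return self

  def transformer(self):
    # consumed by BaseMetricLearner.transform, which computes X.dot(L.T)
    return self.L_

ExampleLearner().fit(X, y).transform(X) would then return X unchanged; a real learner would instead estimate L from the supervision.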

57 changes: 56 additions & 1 deletion metric_learn/constraints.py
@@ -6,8 +6,9 @@
import warnings
from six.moves import xrange
from scipy.sparse import coo_matrix
from sklearn.utils import check_array

-__all__ = ['Constraints']
+__all__ = ['Constraints', 'ConstrainedDataset']


class Constraints(object):
@@ -100,3 +101,57 @@ def random_subset(all_labels, num_preserved=np.inf, random_state=np.random)
    partial_labels = np.array(all_labels, copy=True)
    partial_labels[idx] = -1
    return Constraints(partial_labels)


class ConstrainedDataset(object):

  def __init__(self, X, c):
    # we convert the data to a suitable format
    self.X = check_array(X, accept_sparse=True, dtype=None, warn_on_dtype=True)
    self.c = check_array(c, dtype=['int'] + np.sctypes['int']
                         + np.sctypes['uint'],
                         # we add 'int' at the beginning to tell it is the
                         # default format we want in case of conversion
                         ensure_2d=False, ensure_min_samples=False,
                         ensure_min_features=False, warn_on_dtype=True)
    self._check_index(self.X.shape[0], self.c)
    self.shape = (len(c) if hasattr(c, '__len__') else 0, self.X.shape[1])

  def __getitem__(self, item):
    return ConstrainedDataset(self.X, self.c[item])

  def __len__(self):
    return self.shape[0]  # number of constraints

  def __str__(self):
    return self.toarray().__str__()

  def __repr__(self):
    return self.toarray().__repr__()

  def toarray(self):
    return self.X[self.c]

  @staticmethod
  def _check_index(length, indices):
Review comment (Member): maybe this could also check for potential duplicates? could simply show a warning when this is the case. (one could also remove them but this might create problems later when constraint labels are used)

Reply (@wdevazelhes, Member Author, Mar 5, 2018): I agree, I will implement it in a next commit

    max_index = np.max(indices)
    min_index = np.min(indices)
    pb_index = None
    if max_index >= length:
      pb_index = max_index
    elif min_index < -length:  # allow negative indexing down to -length
      pb_index = min_index
    if pb_index is not None:
      raise IndexError("ConstrainedDataset cannot be created: the length of "
                       "the dataset is {}, so index {} is out of range."
                       .format(length, pb_index))
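
Following the review suggestion above, a duplicate check that only warns could look like this (hypothetical sketch, not part of the PR; warnings is already imported at the top of constraints.py):

  @staticmethod
  def _check_duplicates(indices):
    # hypothetical helper: warn when the same constraint appears twice
    # (np.unique with axis=0 requires numpy >= 1.13)
    unique_indices = np.unique(indices, axis=0)
    if len(unique_indices) < len(indices):
      warnings.warn('ConstrainedDataset contains duplicate constraints.')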

  @staticmethod
  def pairs_from_labels(y):
    # TODO: to be implemented
    raise NotImplementedError

  @staticmethod
  def triplets_from_labels(y):
    # TODO: to be implemented
    raise NotImplementedError
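
To make the semantics above concrete, a small usage sketch (arbitrary data, not taken from the PR):

import numpy as np
from metric_learn.constraints import ConstrainedDataset

X = np.random.randn(10, 3)               # 10 points with 3 features
c = np.array([[0, 1], [2, 3], [4, 5]])   # 3 pair constraints, as row indices into X
cd = ConstrainedDataset(X, c)

cd.shape      # (3, 3): number of constraints, number of features
cd.toarray()  # X[c], an array of shape (3, 2, 3)
cd[:2]        # a new ConstrainedDataset over the first 2 constraints, sharing X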
4 changes: 2 additions & 2 deletions metric_learn/covariance.py
@@ -12,10 +12,10 @@
import numpy as np
from sklearn.utils.validation import check_array

-from .base_metric import BaseMetricLearner
+from .base_metric import SupervisedMetricLearner


-class Covariance(BaseMetricLearner):
+class Covariance(SupervisedMetricLearner):
Review comment (Contributor): I understand why this was chosen, but this particular base class made me stop and consider a moment. We may eventually want a base class for unsupervised methods as well.

Review comment (Member): I agree, this simple covariance method should not be a SupervisedMetricLearner (as it is completely unsupervised). Whether we will really need an unsupervised class in the long run is unclear, but maybe the best for now is to create an UnsupervisedMetricLearner class which takes only X in fit.

Reply (Member Author): Thanks for the review! I agree, I did not notice, but indeed Covariance is unsupervised, so I will change this in a following PR.

  def __init__(self):
    pass
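
A possible shape for the class suggested in the thread above (hypothetical, not in this PR):

class UnsupervisedMetricLearner(BaseMetricLearner):

  def fit(self, X):
    # unsupervised learners see only the data: no labels, no constraints
    raise NotImplementedError

Covariance could then inherit from UnsupervisedMetricLearner and keep a fit that takes only X.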
6 changes: 3 additions & 3 deletions metric_learn/itml.py
@@ -19,12 +19,12 @@
from sklearn.metrics import pairwise_distances
from sklearn.utils.validation import check_array, check_X_y

-from .base_metric import BaseMetricLearner
+from .base_metric import PairsMetricLearner, SupervisedMetricLearner
from .constraints import Constraints
from ._util import vector_norm


-class ITML(BaseMetricLearner):
+class ITML(PairsMetricLearner):
  """Information Theoretic Metric Learning (ITML)"""
  def __init__(self, gamma=1., max_iter=1000, convergence_threshold=1e-3,
               A0=None, verbose=False):

@@ -140,7 +140,7 @@ def metric(self):
    return self.A_


-class ITML_Supervised(ITML):
+class ITML_Supervised(ITML, SupervisedMetricLearner):
  """Information Theoretic Metric Learning (ITML)"""
  def __init__(self, gamma=1., max_iter=1000, convergence_threshold=1e-3,
               num_labeled=np.inf, num_constraints=None, bounds=None, A0=None,
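
Note that the supervised variants now rely on multiple inheritance, so Python's method resolution order decides which base provides a given method; a quick way to inspect it (illustrative):

from metric_learn.itml import ITML_Supervised
print(ITML_Supervised.__mro__)
# ITML_Supervised, ITML, PairsMetricLearner, WeaklySupervisedMetricLearner,
# SupervisedMetricLearner, BaseMetricLearner, ..., object

In particular, methods defined on ITML take precedence over those on SupervisedMetricLearner.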
4 changes: 2 additions & 2 deletions metric_learn/lfda.py
@@ -18,10 +18,10 @@
from sklearn.metrics import pairwise_distances
from sklearn.utils.validation import check_X_y

-from .base_metric import BaseMetricLearner
+from .base_metric import SupervisedMetricLearner


-class LFDA(BaseMetricLearner):
+class LFDA(SupervisedMetricLearner):
  '''
  Local Fisher Discriminant Analysis for Supervised Dimensionality Reduction
  Sugiyama, ICML 2006
4 changes: 2 additions & 2 deletions metric_learn/lmnn.py
@@ -17,11 +17,11 @@
from sklearn.utils.validation import check_X_y, check_array
from sklearn.metrics import euclidean_distances

-from .base_metric import BaseMetricLearner
+from .base_metric import SupervisedMetricLearner


# commonality between LMNN implementations
-class _base_LMNN(BaseMetricLearner):
+class _base_LMNN(SupervisedMetricLearner):
  def __init__(self, k=3, min_iter=50, max_iter=1000, learn_rate=1e-7,
               regularization=0.5, convergence_tol=0.001, use_pca=True,
               verbose=False):
6 changes: 3 additions & 3 deletions metric_learn/lsml.py
@@ -13,11 +13,11 @@
from six.moves import xrange
from sklearn.utils.validation import check_array, check_X_y

-from .base_metric import BaseMetricLearner
+from .base_metric import PairsMetricLearner, SupervisedMetricLearner
from .constraints import Constraints


-class LSML(BaseMetricLearner):
+class LSML(PairsMetricLearner):
  def __init__(self, tol=1e-3, max_iter=1000, prior=None, verbose=False):
    """Initialize LSML.

@@ -131,7 +131,7 @@ def _gradient(self, metric):
    return dMetric


-class LSML_Supervised(LSML):
+class LSML_Supervised(LSML, SupervisedMetricLearner):
  def __init__(self, tol=1e-3, max_iter=1000, prior=None, num_labeled=np.inf,
               num_constraints=None, weights=None, verbose=False):
    """Initialize the learner.
4 changes: 2 additions & 2 deletions metric_learn/mlkr.py
@@ -13,12 +13,12 @@
from sklearn.decomposition import PCA
from sklearn.utils.validation import check_X_y

-from .base_metric import BaseMetricLearner
+from .base_metric import SupervisedMetricLearner

EPS = np.finfo(float).eps


-class MLKR(BaseMetricLearner):
+class MLKR(SupervisedMetricLearner):
  """Metric Learning for Kernel Regression (MLKR)"""
  def __init__(self, num_dims=None, A0=None, epsilon=0.01, alpha=0.0001,
               max_iter=1000):
6 changes: 3 additions & 3 deletions metric_learn/mmc.py
@@ -22,13 +22,13 @@
from sklearn.metrics import pairwise_distances
from sklearn.utils.validation import check_array, check_X_y

-from .base_metric import BaseMetricLearner
+from .base_metric import PairsMetricLearner, SupervisedMetricLearner
from .constraints import Constraints
from ._util import vector_norm



-class MMC(BaseMetricLearner):
+class MMC(PairsMetricLearner):
  """Mahalanobis Metric for Clustering (MMC)"""
  def __init__(self, max_iter=100, max_proj=10000, convergence_threshold=1e-3,
               A0=None, diagonal=False, diagonal_c=1.0, verbose=False):

@@ -380,7 +380,7 @@ def transformer(self):
    return V.T * np.sqrt(np.maximum(0, w[:,None]))


-class MMC_Supervised(MMC):
+class MMC_Supervised(MMC, SupervisedMetricLearner):
  """Mahalanobis Metric for Clustering (MMC)"""
  def __init__(self, max_iter=100, max_proj=10000, convergence_threshold=1e-6,
               num_labeled=np.inf, num_constraints=None,
4 changes: 2 additions & 2 deletions metric_learn/nca.py
@@ -8,12 +8,12 @@
from six.moves import xrange
from sklearn.utils.validation import check_X_y

-from .base_metric import BaseMetricLearner
+from .base_metric import SupervisedMetricLearner

EPS = np.finfo(float).eps


-class NCA(BaseMetricLearner):
+class NCA(SupervisedMetricLearner):
  def __init__(self, num_dims=None, max_iter=100, learning_rate=0.01):
    self.num_dims = num_dims
    self.max_iter = max_iter
4 changes: 2 additions & 2 deletions metric_learn/rca.py
@@ -18,7 +18,7 @@
from sklearn import decomposition
from sklearn.utils.validation import check_array

-from .base_metric import BaseMetricLearner
+from .base_metric import SupervisedMetricLearner
from .constraints import Constraints


@@ -35,7 +35,7 @@ def _chunk_mean_centering(data, chunks):
  return chunk_mask, chunk_data


-class RCA(BaseMetricLearner):
+class RCA(SupervisedMetricLearner):
  """Relevant Components Analysis (RCA)"""
  def __init__(self, num_dims=None, pca_comps=None):
    """Initialize the learner.
6 changes: 3 additions & 3 deletions metric_learn/sdml.py
@@ -15,11 +15,11 @@
from sklearn.utils.extmath import pinvh
from sklearn.utils.validation import check_array

-from .base_metric import BaseMetricLearner
+from .base_metric import PairsMetricLearner, SupervisedMetricLearner
from .constraints import Constraints


-class SDML(BaseMetricLearner):
+class SDML(PairsMetricLearner):
  def __init__(self, balance_param=0.5, sparsity_param=0.01, use_cov=True,
               verbose=False):
    """

@@ -80,7 +80,7 @@ def fit(self, X, W):
    return self


-class SDML_Supervised(SDML):
+class SDML_Supervised(SDML, SupervisedMetricLearner):
  def __init__(self, balance_param=0.5, sparsity_param=0.01, use_cov=True,
               num_labeled=np.inf, num_constraints=None, verbose=False):
    """
74 changes: 74 additions & 0 deletions test/test_constrained_dataset.py
@@ -0,0 +1,74 @@
import unittest
import numpy as np
from metric_learn.constraints import ConstrainedDataset
from numpy.testing import assert_array_equal
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.utils.testing import assert_raise_message

X = np.random.randn(20, 5)
Review comment (Contributor): It would be nice to test sparse X as well. To do so, I'd write something like this:

class _BaseTestConstrainedDataset(unittest.TestCase):
  # everything currently under TestConstrainedDataset, but using self.X instead of X

class TestDenseConstrainedDataset(_BaseTestConstrainedDataset):
  def setUp(self):
    self.X = np.random.randn(20, 5)
    self.c = ...  # and so on

class TestSparseConstrainedDataset(_BaseTestConstrainedDataset):
  # similar, but setUp creates a sparse X

Review comment (Member): maybe it would make sense to also test other data types? like lists instead of numpy arrays

Reply (Member Author): I agree, I will add these tests in a next commit

c = np.random.randint(0, X.shape[0], (15, 2))
cd = ConstrainedDataset(X, c)
y = np.random.randint(0, 2, c.shape[0])
group = np.random.randint(0, 3, c.shape[0])

c_shape = c.shape[0]
Review comment (Contributor): This is only used by test_getitem, so I'd move it there. (And probably name it num_constraints)

Reply (@bellet, Member, Mar 14, 2018): actually this is also used in test_shape (although not through this variable); it could be defined before y and group, since they also use c.shape[0]

Reply (Member Author): I agree, I will rename it and put it before the definitions that use it



class TestConstrainedDataset(unittest.TestCase):

  @staticmethod
  def check_indexing(idx):
    # checks that indexing returns the data we expect
    np.testing.assert_array_equal(cd[idx].c, c[idx])
    np.testing.assert_array_equal(cd[idx].toarray(), X[c[idx]])
    np.testing.assert_array_equal(cd[idx].toarray(), X[c][idx])

  def test_inputs(self):
    # test the allowed and forbidden ways to create a ConstrainedDataset
Review comment (Contributor): I'd split this into two separate tests, test_allowed_inputs and test_invalid_inputs.

Reply (Member Author): I agree, will do
    ConstrainedDataset(X, c)
    two_points = [[1, 2], [3, 5]]
    out_of_range_constraints = [[1, 2], [0, 1]]
    msg = "ConstrainedDataset cannot be created: the length of " \
          "the dataset is 2, so index 2 is out of " \
          "range."
Review comment (Contributor): Similar to imports, use parentheses here:

msg = ("First part of string. "
       "Second part of string.")

Reply (Member Author): thanks, will do

    assert_raise_message(IndexError, msg, ConstrainedDataset, two_points,
                         out_of_range_constraints)

  def test_getitem(self):
    # test different types of slicing
    i = np.random.randint(1, c_shape - 1)
    begin = np.random.randint(1, c_shape - 1)
    end = np.random.randint(begin + 1, c_shape)
    fancy_index = np.random.randint(0, c_shape, 20)
    binary_index = np.random.randint(0, 2, c_shape)
    boolean_index = binary_index.astype(bool)
    items = [0, c_shape - 1, i, slice(i), slice(0, begin),
             slice(begin, end), slice(end, c_shape), slice(0, c_shape),
             fancy_index, binary_index, boolean_index]
    for item in items:
      self.check_indexing(item)

  def test_repr(self):
    assert repr(cd) == repr(X[c])
Review comment (Contributor): Use self.assertEqual(...) instead of bare asserts.

Reply (Member Author): Thanks!


  def test_str(self):
    assert str(cd) == str(X[c])

  def test_shape(self):
    assert cd.shape == (c.shape[0], X.shape[1])
    assert cd[0, 0].shape == (0, X.shape[1])

  def test_toarray(self):
    assert_array_equal(cd.toarray(), cd.X[c])

  def test_folding(self):
    # test that ConstrainedDataset is compatible with scikit-learn folding
    shuffle_list = [True, False]
    groups_list = [group, None]
    for alg in [KFold, StratifiedKFold]:
      for shuffle_i in shuffle_list:
        for group_i in groups_list:
          for train_idx, test_idx in alg(
              shuffle=shuffle_i).split(cd, y, group_i):
            self.check_indexing(train_idx)
            self.check_indexing(test_idx)
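
The folding test works because scikit-learn's splitters only need a number of samples (taken here from cd.shape) and yield plain index arrays, which ConstrainedDataset.__getitem__ accepts. Standalone, the same pattern looks like this (a sketch reusing the module-level cd and y defined above):

from sklearn.model_selection import KFold

for train_idx, test_idx in KFold(n_splits=3).split(cd, y):
  cd_train, cd_test = cd[train_idx], cd[test_idx]  # constraints are split
  y_train, y_test = y[train_idx], y[test_idx]      # the underlying X is shared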