
Commit 010b34a

wdevazelhes authored and perimosocordiae committed

[MRG] Add preprocessor option (#117)
* WIP create MahalanobisMixin
* ENH Update algorithms with Mahalanobis Mixin:
  - Make them inherit from Mahalanobis Mixin, and implement the metric_ property
  - Improve metric_ property by checking if it exists and raising the appropriate warning if not
  - Make tests work, by replacing metric() with metric_
* FIX: add missing import
* FIX: update sklearn's function check_no_fit_attributes_set_in_init to the new check_no_attributes_set_in_init. This new function was introduced through PR scikit-learn/scikit-learn#9450 in scikit-learn. It also allows passing tests that would otherwise not pass: having abstract attributes as properties threw an error, but the new test function handles this property inheritance well.
* FIX: take function ``_get_args`` from scikit-learn's PR scikit-learn/scikit-learn#9450. In that PR the function is modified to support Python 2. This should solve the CI error.
* ENH: add transformer_ attribute and improve docstring
* WIP: move transform() in BaseMetricLearner to transformer_from_metric() in MahalanobisMixin
* WIP: refactor metric to original formulation: a function, with result computed from the transformer
* WIP: make all Mahalanobis Metric Learner algorithms have transformer_ and metric()
* ENH Add score_pairs function
  - Make MahalanobisMixin inherit from BaseMetricLearner to give a concrete implementation of score_pairs
  - Use score_pairs to compute predict more easily
  - Add docstring
  - TST: for every algorithm:
    - test that using score_pairs pairwise returns a euclidean distance matrix
    - test that score_pairs works for 3D arrays of several pairs as well as 2D arrays of one pair (and there returns only a scalar)
    - test that score_pairs always returns a finite output
* TST add test on toy example for score_pairs
* ENH Add embed function
  - add the function and docstring
  - use it for score_pairs
  - TST:
    - should be finite
    - have right output dimension
    - embedding should be linear
    - should work on a toy example
* FIX fix error in slicing of quadruplets
* FIX minor corrections
* FIX minor corrections
  - remove unusual s to test functions
  - remove redundant parenthesis
* FIX fix PEP8 errors
* FIX remove possible one-sample scoring from docstring for now
* REF rename n_features_out to num_dims to be more coherent with current algorithms
* MAINT: Address #96 (review)
  - replace embed by transform and always add the input X when calling the function
  - mutualize _transformer_from_metric so it is not overwritten in MMC
  - improve test_mahalanobis_mixin.test_score_pairs_pairwise according to #96 (comment)
  - improve test_mahalanobis_mixin.check_is_distance_matrix
  - correct typos and nitpicks
* ENH: Add check_tuples
* FIX: fix parenthesis
* ENH: First commit adding a preprocessor
* ENH: Improve check_tuples with more comments and deal better with ensure_min_features
* STY: remove unexpected spaces
* FIX: Raise a more appropriate error message. The previous error message would have said "[...], shape=(shape_of_the_2D_array_extracted_from_3D)", but it is clearer to print the shape of the actual 3D initial array (the input tuples).
* FIX: fix string formatting and refactor name and context to use context in custom functions but give name to check_array
* FIX: only allow 2D if preprocessor and 3D if no preprocessor
* FIX: put format arguments in the right order
* MAINT: better to say the preprocessor than a preprocessor in messages
* FIX: numeric should be the default if NO preprocessor
* FIX: preprocessor argument has to be a boolean (before this change it was a callable or an array)
* FIX: fix preprocessor argument passed to check_tuples in base_metric
* MAINT: say a preprocessor rather than the preprocessor
* DOC: fix docstring of t in check_tuples
* MAINT: make error messages better by only printing the presence of a preprocessor if the error is due to a shape that is not 2D or 3D
* TST: Add tests for check_tuples
* TST: simplify tests by removing the test for messages with the estimator name, incorporating it in all other tests through parametrization
* STY: remove unnecessary parenthesis
* FIX: put back else statement that was probably wrongfully merged
* TST: add tests for weakly supervised estimators and a preprocessor that fetches indices
* TST: add tests for preprocessor
* FIX: remove redundant metric and transformer function, wrongly merged
* MAINT: rename format_input into preprocess_tuples and input into tuples
* MAINT: fixes and enhancements
  - fix the previous incomplete refactoring of format_input
  - mutualize check_tuples for Weakly Supervised Algorithms
  - fix test of string representations
* MAINT: mutualize check_tuples
* MAINT: refactor SimplePreprocessor into ArrayIndexer
* MAINT: improve check_tuples and tests
* TST: add random seed for _Supervised classes
* TST: Adapt test pipeline
  - use random state for _Supervised classes
  - call transform only for pipelines with a TransformerMixin
* TST: fix test_progress_message_preprocessor_tuples by making func return an np.array
* Remove deprecated cross_validation import and use model_selection instead
* WIP replace checks by a unique check_input function
* Fix some tests:
  - fix dtype checks and conversion by ensuring the checks and conversions are made on the preprocessed array and not the input array (which can be an array of indices)
  - fix tests that were failing due to the wrong error message
* TST: Cherry-pick from new sklearn version ac0e230 [MRG] Quick fix of failed tests due to new scikit-learn version (0.20.0) (#130)
* TST: Quick fix of failed tests due to new scikit-learn version (0.20.0)
* FIX update values to pass test
* FIX: get changes from master to pass test iris for NCA
* FIX fix tests that were failing due to the error message; run check_tuples at the end and only at the end, because the number of tuples should not be modified from the beginning, and we want to check them also at the end
* TST: fix test_check_input_invalid_t that changed since we test t at the end now
* TST fix NCA's iris test taking code from master
* FIX fix tests:
  - use check_array instead of conversion to numpy array to ensure that sparse arrays are kept as such and not wrapped into a regular numpy array
  - check_array the metric with the right arguments for covariance in order not to fail if the array is a scalar (for one-feature samples)
* FIX fix previous modification that removed self.X_ but was modifying the input at fit time
* FIX ensure at least 2d only for checking the metric, because after check_array of inputs the metric should be a scalar or an array, so we only need to ensure it is an array (2D array)
* STY: Fix PEP8 violations
* MAINT: Refactor error messages with the help of numerical codes
* MAINT: mutualize check_preprocessor and check_input for every estimator
* FIX: remove format_map for python2.7 compatibility
* DOC: Add docstring for check_input and fix some bugs
* DOC: add docstrings
* MAINT: Remove changes not related to this PR, and fix a previous probably unsuccessful merge
* STY: Fix PEP8 errors
* STY: fix indent problems
* Fix docstring spaces
* DOC: add preprocessor docstring when missing
* STY: PEP8 fixes
* MAINT: refactor the global check function into _prepare_input
* FIX: fix quadruplets scoring and delete useless comments
* MAINT: remove some enhancements to be coherent with previous code and simplify review
* MAINT: Improve test messages
* MAINT: reorganize tests
* FIX: fix typo in LMNN shogun and clean todo for the equivalent code in python_LMNN
* MAINT: Rename inputs and input into input_data
* STY: add backticks to `None`
* MAINT: add a more detailed comment on the first checks and remove an old comment
* MAINT: improve comments for checking num_features
* MAINT: Refactor t into tuple_size
* MAINT: Fix small PEP8 error
* MAINT: FIX remaining t into tuple_size and replace hasattr-if-None by getattr with default None
* MAINT: remove misplaced comment
* MAINT: Put back/add docstrings for decision_function/predict
* MAINT: remove unnecessary ellipsis and update docstring of decision_function
* Add comments in LMNN for arguments useful for the shogun version that are not used in the python version
* MAINT: Remove useless mock_preprocessor
* MAINT: Remove useless loop
* MAINT: refactor test_dict_unchanged
* MAINT: remove _get_args copied from scikit-learn and replace it by an import
* MAINT: Fragment check_input by extracting blocks into check_input_classic and check_input_tuples
* MAINT: ensure min_samples=2 for supervised learning algorithms (we should have at least two datapoints to learn something)
* ENH: Return a custom error when an error is due to the preprocessor
* MAINT: Refactor algorithms' preprocessing steps
  - extract preprocessing steps into the main fit function
  - remove self.X_ when found and replace it by X (Fixes #134)
  - extract the function to test collapsed pairs as _utils.check_collapsed_pairs and test it
  - note that the function is now not used where it was used before; the user is responsible for having coherent input (they can use the helper function as a preprocessing step if desired)
* MAINT: finish the work of the previous commit
* TST: add test for cross-validation: comparison of manual cross-val and scikit-learn cross-val
* ENH: put y=None by default in LSML for better compatibility. This also allows simplifying some tests by removing the need for a separate case for LSML.
* ENH: add error message when the type of inputs is not some expected type
* TST: add test that checks that 'classic' is the default behaviour
* TST: remove unnecessary conversion to vertical vector of y
* FIX: remove wrong condition hasattr 'score' at top of loop
* MAINT: Add comment to explain why we return X twice for build_regression and build_classification
* ENH: improve test for preprocessor and return an error message if the given preprocessor has the wrong type
* FIX: fix wrong type_of_inputs in a test
* FIX: deal with the case where preprocessor is None
* WIP refactor build_dataset
* MAINT: refactor bool preprocessor to with_preprocessor
* FIX: fix build_pairs and build_quadruplets because 'only named arguments should follow expression'
* STY: fix PEP8 error
* MAINT: mutualize test_same_with_or_without_preprocessor_tuples and test_same_with_or_without_preprocessor_classic
* TST: give better names in test_same_with_or_without_preprocessor
* MAINT: refactor list_estimators into metric_learners
* TST: uniformize names input_data - tuples, labels - y
* FIX: fix build_pairs and build_quadruplets
* MAINT: remove forgotten code duplication
* MAINT: address #117 (review)
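The core idea of the preprocessor option can be illustrated with a small self-contained sketch. The `ArrayIndexer` class below is a minimal re-implementation (for illustration only, without the `check_array` validation) of the class this commit adds to `metric_learn/_util.py`: a callable preprocessor that maps indices to stored points. The stacking step mirrors the `np.column_stack` logic of the new `preprocess_tuples` helper, which turns a 2D array of indicators into a 3D array of formed tuples.

```python
import numpy as np


class ArrayIndexer:
    """Minimal sketch of the ArrayIndexer added in this commit: a callable
    preprocessor that maps an array of indices to rows of a stored array."""

    def __init__(self, X):
        self.X = np.asarray(X)

    def __call__(self, indices):
        return self.X[indices]


# Five 2-feature points, and pairs given as indices into those points.
points = np.arange(10.0).reshape(5, 2)
preprocessor = ArrayIndexer(points)

pairs_of_indices = np.array([[0, 1], [2, 4]])  # 2D array of indicators

# Mirrors preprocess_tuples: form each column of the tuples with the
# preprocessor, then stack into shape (n_tuples, tuple_size, n_features).
formed_pairs = np.column_stack(
    [preprocessor(pairs_of_indices[:, i])[:, np.newaxis]
     for i in range(pairs_of_indices.shape[1])])

print(formed_pairs.shape)  # (2, 2, 2)
```

This is why, with a preprocessor, estimators can accept 2D arrays of indicators where they would otherwise require 3D arrays of formed tuples (and 1D arrays of indicators instead of 2D arrays of points in the 'classic' case).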
1 parent e4685b1 · commit 010b34a

19 files changed: +2035 additions, −690 deletions

metric_learn/_util.py

Lines changed: 303 additions & 29 deletions
```diff
@@ -1,5 +1,8 @@
 import numpy as np
-
+import six
+from sklearn.utils import check_array
+from sklearn.utils.validation import check_X_y
+from metric_learn.exceptions import PreprocessorError
 
 # hack around lack of axis kwarg in older numpy versions
 try:
@@ -12,39 +15,310 @@ def vector_norm(X):
     return np.linalg.norm(X, axis=1)
 
 
-def check_tuples(tuples):
-  """Check that the input is a valid 3D array representing a dataset of tuples.
-
-  Equivalent of `check_array` in scikit-learn.
+def check_input(input_data, y=None, preprocessor=None,
+                type_of_inputs='classic', tuple_size=None, accept_sparse=False,
+                dtype='numeric', order=None,
+                copy=False, force_all_finite=True,
+                multi_output=False, ensure_min_samples=1,
+                ensure_min_features=1, y_numeric=False,
+                warn_on_dtype=False, estimator=None):
+  """Checks that the input format is valid, and converts it if specified
+  (this is the equivalent of scikit-learn's `check_array` or `check_X_y`).
+  All arguments following tuple_size are scikit-learn's `check_X_y`
+  arguments that will be enforced on the data and labels array. If
+  indicators are given as an input data array, the returned data array
+  will be the formed points/tuples, using the given preprocessor.
 
   Parameters
   ----------
-  tuples : object
-    The tuples to check.
+  input : array-like
+    The input data array to check.
+
+  y : array-like
+    The input labels array to check.
+
+  preprocessor : callable (default=`None`)
+    The preprocessor to use. If None, no preprocessor is used.
+
+  type_of_inputs : `str` {'classic', 'tuples'}
+    The type of inputs to check. If 'classic', the input should be
+    a 2D array-like of points or a 1D array like of indicators of points. If
+    'tuples', the input should be a 3D array-like of tuples or a 2D
+    array-like of indicators of tuples.
+
+  accept_sparse : `bool`
+    Set to true to allow sparse inputs (only works for sparse inputs with
+    dim < 3).
+
+  tuple_size : int
+    The number of elements in a tuple (e.g. 2 for pairs).
+
+  dtype : string, type, list of types or None (default='numeric')
+    Data type of result. If None, the dtype of the input is preserved.
+    If 'numeric', dtype is preserved unless array.dtype is object.
+    If dtype is a list of types, conversion on the first type is only
+    performed if the dtype of the input is not in the list.
+
+  order : 'F', 'C' or None (default=`None`)
+    Whether an array will be forced to be fortran or c-style.
+
+  copy : boolean (default=False)
+    Whether a forced copy will be triggered. If copy=False, a copy might
+    be triggered by a conversion.
+
+  force_all_finite : boolean or 'allow-nan', (default=True)
+    Whether to raise an error on np.inf and np.nan in X. This parameter
+    does not influence whether y can have np.inf or np.nan values.
+    The possibilities are:
+    - True: Force all values of X to be finite.
+    - False: accept both np.inf and np.nan in X.
+    - 'allow-nan': accept only np.nan values in X. Values cannot be
+      infinite.
+
+  ensure_min_samples : int (default=1)
+    Make sure that X has a minimum number of samples in its first
+    axis (rows for a 2D array).
+
+  ensure_min_features : int (default=1)
+    Make sure that the 2D array has some minimum number of features
+    (columns). The default value of 1 rejects empty datasets.
+    This check is only enforced when X has effectively 2 dimensions or
+    is originally 1D and ``ensure_2d`` is True. Setting to 0 disables
+    this check.
+
+  warn_on_dtype : boolean (default=False)
+    Raise DataConversionWarning if the dtype of the input data structure
+    does not match the requested dtype, causing a memory copy.
+
+  estimator : str or estimator instance (default=`None`)
+    If passed, include the name of the estimator in warning messages.
 
   Returns
   -------
-  tuples_valid : object
-    The validated input.
+  X : `numpy.ndarray`
+    The checked input data array.
+
+  y: `numpy.ndarray` (optional)
+    The checked input labels array.
   """
-  # If input is scalar raise error
-  if np.isscalar(tuples):
-    raise ValueError(
-        "Expected 3D array, got scalar instead. Cannot apply this function on "
-        "scalars.")
-  # If input is 1D raise error
-  if len(tuples.shape) == 1:
-    raise ValueError(
-        "Expected 3D array, got 1D array instead:\ntuples={}.\n"
-        "Reshape your data using tuples.reshape(1, -1, 1) if it contains a "
-        "single tuple and the points in the tuple have a single "
-        "feature.".format(tuples))
-  # If input is 2D raise error
-  if len(tuples.shape) == 2:
-    raise ValueError(
-        "Expected 3D array, got 2D array instead:\ntuples={}.\n"
-        "Reshape your data either using tuples.reshape(-1, {}, 1) if "
-        "your data has a single feature or tuples.reshape(1, {}, -1) "
-        "if it contains a single tuple.".format(tuples, tuples.shape[1],
-                                                tuples.shape[0]))
+
+  context = make_context(estimator)
+
+  args_for_sk_checks = dict(accept_sparse=accept_sparse,
+                            dtype=dtype, order=order,
+                            copy=copy, force_all_finite=force_all_finite,
+                            ensure_min_samples=ensure_min_samples,
+                            ensure_min_features=ensure_min_features,
+                            warn_on_dtype=warn_on_dtype, estimator=estimator)
+
+  # We need to convert input_data into a numpy.ndarray if possible, before
+  # any further checks or conversions, and deal with y if needed. Therefore
+  # we use check_array/check_X_y with fixed permissive arguments.
+  if y is None:
+    input_data = check_array(input_data, ensure_2d=False, allow_nd=True,
+                             copy=False, force_all_finite=False,
+                             accept_sparse=True, dtype=None,
+                             ensure_min_features=0, ensure_min_samples=0)
+  else:
+    input_data, y = check_X_y(input_data, y, ensure_2d=False, allow_nd=True,
+                              copy=False, force_all_finite=False,
+                              accept_sparse=True, dtype=None,
+                              ensure_min_features=0, ensure_min_samples=0,
+                              multi_output=multi_output,
+                              y_numeric=y_numeric)
+
+  if type_of_inputs == 'classic':
+    input_data = check_input_classic(input_data, context, preprocessor,
+                                     args_for_sk_checks)
+
+  elif type_of_inputs == 'tuples':
+    input_data = check_input_tuples(input_data, context, preprocessor,
+                                    args_for_sk_checks, tuple_size)
+
+  else:
+    raise ValueError("Unknown value {} for type_of_inputs. Valid values are "
+                     "'classic' or 'tuples'.".format(type_of_inputs))
+
+  return input_data if y is None else (input_data, y)
+
+
+def check_input_tuples(input_data, context, preprocessor, args_for_sk_checks,
+                       tuple_size):
+  preprocessor_has_been_applied = False
+  if input_data.ndim == 2:
+    if preprocessor is not None:
+      input_data = preprocess_tuples(input_data, preprocessor)
+      preprocessor_has_been_applied = True
+    else:
+      make_error_input(201, input_data, context)
+  elif input_data.ndim == 3:
+    pass
+  else:
+    if preprocessor is not None:
+      make_error_input(420, input_data, context)
+    else:
+      make_error_input(200, input_data, context)
+  input_data = check_array(input_data, allow_nd=True, ensure_2d=False,
+                           **args_for_sk_checks)
+  # we need to check num_features because check_array does not check it
+  # for 3D inputs:
+  if args_for_sk_checks['ensure_min_features'] > 0:
+    n_features = input_data.shape[2]
+    if n_features < args_for_sk_checks['ensure_min_features']:
+      raise ValueError("Found array with {} feature(s) (shape={}) while"
+                       " a minimum of {} is required{}."
+                       .format(n_features, input_data.shape,
+                               args_for_sk_checks['ensure_min_features'],
+                               context))
+  # normally we don't need to check_tuple_size too because tuple_size
+  # shouldn't be able to be modified by any preprocessor
+  if input_data.ndim != 3:
+    # we have to ensure this because check_array above does not
+    if preprocessor_has_been_applied:
+      make_error_input(211, input_data, context)
+    else:
+      make_error_input(201, input_data, context)
+  check_tuple_size(input_data, tuple_size, context)
+  return input_data
+
+
+def check_input_classic(input_data, context, preprocessor, args_for_sk_checks):
+  preprocessor_has_been_applied = False
+  if input_data.ndim == 1:
+    if preprocessor is not None:
+      input_data = preprocess_points(input_data, preprocessor)
+      preprocessor_has_been_applied = True
+    else:
+      make_error_input(101, input_data, context)
+  elif input_data.ndim == 2:
+    pass  # OK
+  else:
+    if preprocessor is not None:
+      make_error_input(320, input_data, context)
+    else:
+      make_error_input(100, input_data, context)
+
+  input_data = check_array(input_data, allow_nd=True, ensure_2d=False,
+                           **args_for_sk_checks)
+  if input_data.ndim != 2:
+    # we have to ensure this because check_array above does not
+    if preprocessor_has_been_applied:
+      make_error_input(111, input_data, context)
+    else:
+      make_error_input(101, input_data, context)
+  return input_data
+
+
+def make_error_input(code, input_data, context):
+  code_str = {'expected_input': {'1': '2D array of formed points',
+                                 '2': '3D array of formed tuples',
+                                 '3': ('1D array of indicators or 2D array of '
+                                       'formed points'),
+                                 '4': ('2D array of indicators or 3D array '
+                                       'of formed tuples')},
+              'additional_context': {'0': '',
+                                     '2': ' when using a preprocessor',
+                                     '1': (' after the preprocessor has been '
+                                           'applied')},
+              'possible_preprocessor': {'0': '',
+                                        '1': ' and/or use a preprocessor'
+                                        }}
+  code_list = str(code)
+  err_args = dict(expected_input=code_str['expected_input'][code_list[0]],
+                  additional_context=code_str['additional_context']
+                  [code_list[1]],
+                  possible_preprocessor=code_str['possible_preprocessor']
+                  [code_list[2]],
+                  input_data=input_data, context=context,
+                  found_size=input_data.ndim)
+  err_msg = ('{expected_input} expected'
+             '{context}{additional_context}. Found {found_size}D array '
+             'instead:\ninput={input_data}. Reshape your data'
+             '{possible_preprocessor}.\n')
+  raise ValueError(err_msg.format(**err_args))
+
+
+def preprocess_tuples(tuples, preprocessor):
+  try:
+    tuples = np.column_stack([preprocessor(tuples[:, i])[:, np.newaxis] for
+                              i in range(tuples.shape[1])])
+  except Exception as e:
+    raise PreprocessorError(e)
   return tuples
+
+
+def preprocess_points(points, preprocessor):
+  """form points if there is a preprocessor else keep them as such (assumes
+  that check_points has already been called)"""
+  try:
+    points = preprocessor(points)
+  except Exception as e:
+    raise PreprocessorError(e)
+  return points
+
+
+def make_context(estimator):
+  """Helper function to create a string with the estimator name.
+  Taken from check_array function in scikit-learn.
+  Will return the following for instance:
+  NCA: ' by NCA'
+  'NCA': ' by NCA'
+  None: ''
+  """
+  estimator_name = make_name(estimator)
+  context = (' by ' + estimator_name) if estimator_name is not None else ''
+  return context
+
+
+def make_name(estimator):
+  """Helper function that returns the name of estimator or the given string
+  if a string is given
+  """
+  if estimator is not None:
+    if isinstance(estimator, six.string_types):
+      estimator_name = estimator
+    else:
+      estimator_name = estimator.__class__.__name__
+  else:
+    estimator_name = None
+  return estimator_name
+
+
+def check_tuple_size(tuples, tuple_size, context):
+  """Helper function to check that the number of points in each tuple is
+  equal to tuple_size (e.g. 2 for pairs), and raise a `ValueError` otherwise"""
+  if tuple_size is not None and tuples.shape[1] != tuple_size:
+    msg_t = (("Tuples of {} element(s) expected{}. Got tuples of {} "
+              "element(s) instead (shape={}):\ninput={}.\n")
+             .format(tuple_size, context, tuples.shape[1], tuples.shape,
+                     tuples))
+    raise ValueError(msg_t)
+
+
+class ArrayIndexer:
+
+  def __init__(self, X):
+    # we check the array-like preprocessor here, and we as much permissive
+    # as possible (because the user will check for the desired
+    # format with arguments in check_input, and only this latter function
+    # should return the appropriate errors). We do this only to have a numpy
+    # array object which can be indexed by another numpy array object.
+    X = check_array(X,
+                    accept_sparse=True, dtype=None,
+                    force_all_finite=False,
+                    ensure_2d=False, allow_nd=True,
+                    ensure_min_samples=0,
+                    ensure_min_features=0,
+                    warn_on_dtype=False, estimator=None)
+    self.X = X
+
+  def __call__(self, indices):
+    return self.X[indices]
+
+
+def check_collapsed_pairs(pairs):
+  num_ident = (vector_norm(pairs[:, 0] - pairs[:, 1]) < 1e-9).sum()
+  if num_ident:
+    raise ValueError("{} collapsed pairs found (where the left element is "
+                     "the same as the right element), out of {} pairs "
+                     "in total.".format(num_ident, pairs.shape[0]))
```
