Break chunks generation in RCA when not enough possible chunks, fixes issue #200 (#254)

Merged: 8 commits, Nov 13, 2019
Changes from 6 commits

metric_learn/constraints.py: 8 changes (5 additions & 3 deletions)
@@ -79,6 +79,11 @@ def chunks(self, num_chunks=100, chunk_size=2, random_state=None):
     chunks = -np.ones_like(self.known_label_idx, dtype=int)
     uniq, lookup = np.unique(self.known_labels, return_inverse=True)
     all_inds = [set(np.where(lookup == c)[0]) for c in xrange(len(uniq))]
+    max_chunks = int(np.sum([len(s) // chunk_size for s in all_inds]))
+    if max_chunks < num_chunks:
+      raise ValueError(('Not enough examples in each class to form %d chunks '
+                        'of %d examples - maximum number of chunks is %d'
+                        ) % (num_chunks, chunk_size, max_chunks))
     idx = 0
     while idx < num_chunks and all_inds:
       if len(all_inds) == 1:
@@ -93,9 +98,6 @@ def chunks(self, num_chunks=100, chunk_size=2, random_state=None):
         inds.difference_update(ii)
         chunks[ii] = idx
         idx += 1
-    if idx < num_chunks:
-      raise ValueError('Unable to make %d chunks of %d examples each' %
-                       (num_chunks, chunk_size))
     return chunks


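To illustrate the new check (a sketch, not part of this diff; the labels below are made up): with class sizes 5, 3 and 2 and chunk_size=2, max_chunks = 5 // 2 + 3 // 2 + 2 // 2 = 4, so requesting more chunks than that now fails up front with the message above.

import numpy as np
from metric_learn.constraints import Constraints

# Hypothetical labels: three classes of sizes 5, 3 and 2, no unlabeled (-1) points.
labels = np.array([0] * 5 + [1] * 3 + [2] * 2)
cons = Constraints(labels)

cons.chunks(num_chunks=4, chunk_size=2)  # at most 4 disjoint chunks of 2 exist, so this works
cons.chunks(num_chunks=5, chunk_size=2)  # now raises ValueError: maximum number of chunks is 4
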
test/test_constraints.py: 57 changes (57 additions & 0 deletions)
@@ -0,0 +1,57 @@
import pytest
import numpy as np
from sklearn.utils import shuffle
from metric_learn.constraints import Constraints

SEED = 42


def gen_labels_for_chunks(num_chunks, chunk_size,
                          n_classes=10, n_unknown_labels=5):
  """Generates num_chunks*chunk_size labels that split in num_chunks chunks,
  that are homogeneous in the label."""
  assert min(num_chunks, chunk_size) > 0
  classes = shuffle(np.arange(n_classes), random_state=SEED)
  n_per_class = chunk_size * (num_chunks // n_classes)
  n_maj_class = n_per_class + chunk_size * num_chunks - n_per_class * n_classes

  first_labels = classes[0] * np.ones(n_maj_class, dtype=int)
  remaining_labels = np.concatenate([k * np.ones(n_per_class, dtype=int)
                                     for k in classes[1:]])
  unknown_labels = -1 * np.ones(n_unknown_labels, dtype=int)

  labels = np.concatenate([first_labels, remaining_labels, unknown_labels])
  return shuffle(labels, random_state=SEED)
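
To make the helper's arithmetic concrete, take the first parametrized case below (num_chunks=11, chunk_size=5, n_classes=10): n_per_class = 5 * (11 // 10) = 5 and n_maj_class = 5 + 5 * 11 - 5 * 10 = 10, so one class receives 10 labels (two possible chunks) and the nine other classes receive 5 labels each (one possible chunk each), i.e. exactly 11 possible chunks, plus 5 points labeled -1.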


@pytest.mark.parametrize('num_chunks, chunk_size', [(11, 5), (115, 12)])
def test_chunk_case_exact_num_points(num_chunks, chunk_size,
                                     n_classes=10, n_unknown_labels=5):
  """Checks that the chunk generation works well with just enough points."""
  labels = gen_labels_for_chunks(num_chunks, chunk_size,
                                 n_classes=n_classes,
                                 n_unknown_labels=n_unknown_labels)
  constraints = Constraints(labels)
  chunks = constraints.chunks(num_chunks=num_chunks, chunk_size=chunk_size,
                              random_state=SEED)
  return chunks
Member:

Instead of simply returning chunks, it would be nice to test that chunks contains num_chunks chunks, each composed of chunk_size points
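
A sketch of the kind of assertion being suggested, assuming the chunks array and the num_chunks / chunk_size parameters of the test above (illustrative only, not part of this diff):

# Count how many points were assigned to each chunk id, ignoring unassigned (-1) entries.
chunk_ids, size_per_chunk = np.unique(chunks[chunks >= 0], return_counts=True)
assert len(chunk_ids) == num_chunks
assert (size_per_chunk == chunk_size).all()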

Contributor Author (RobinVogel):

I will add it.

A remark: I expected chunks to map each datapoint index (in input order) to its chunk number, but chunks drops the -1 instances (unknown labels, per line 21), so one has to keep track of the original datapoint indexes once the -1's are removed. This must be taken into account in the implementation of RCA.

Member (@bellet, Nov 13, 2019):

Can you elaborate on this? Not sure I understand what you mean.

Do we test RCA_Supervised at all in the presence of unlabeled points? At first sight, it looks like it might fail due to unlabeled points being removed from the chunk array, as pointed out by @RobinVogel. @wdevazelhes, any comment?

Contributor Author (@RobinVogel, Nov 13, 2019):

I'll add an example to make it clearer. Below is a comparison, in the presence of unknown labels, between the output I expected from Constraints.chunks (expected_chunk) and the one I actually obtain (chunks):

In [1]: from metric_learn.constraints import Constraints                                                                                                                                                        

In [2]: partial_labels = [1, 2, 2, 1, -1, 3, 3]                                                        

In [3]: cons = Constraints(partial_labels)                                                             

In [4]: cons.chunks(num_chunks=3, chunk_size=2)                                                        
Out[4]: array([0, 1, 1, 0, 2, 2])

In [5]: chunks = cons.chunks(num_chunks=3, chunk_size=2)                                               

In [6]: len(chunks), len(partial_labels)                                                               
Out[6]: (6, 7)

In [7]: expected_chunk = [0, 1, 1, 0, -1, 2, 2]

The output is not a map where entry i gives the chunk of element i (that is what expected_chunk shows and what I expected), but this might not matter in RCA.

Member:

Thanks. And this is indeed a problem for RCA_Supervised, see the issue #260 that I just added. I have also added another issue (#261) to make clear in the doc that -1 is interpreted as unlabeled. You're very welcome to address those @RobinVogel while you have everything well in mind :-)
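
One way to recover the per-datapoint mapping discussed above, keeping -1 for unlabeled points, is to scatter the returned chunk ids back onto the original indices (a sketch built on the example above, not an existing metric-learn helper):

import numpy as np

partial_labels = np.array([1, 2, 2, 1, -1, 3, 3])
chunks = np.array([0, 1, 1, 0, 2, 2])  # the output of cons.chunks(num_chunks=3, chunk_size=2) shown above

# Place each chunk id back at its original position, keeping -1 where the label is unknown.
full_chunks = -np.ones(len(partial_labels), dtype=int)
full_chunks[partial_labels != -1] = chunks

print(full_chunks)  # [ 0  1  1  0 -1  2  2], i.e. the expected_chunk above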



@pytest.mark.parametrize('num_chunks, chunk_size', [(5, 10), (10, 50)])
def test_chunk_case_one_miss_point(num_chunks, chunk_size,
                                   n_classes=10, n_unknown_labels=5):
  """Checks that the chunk generation breaks when one point is missing."""
  labels = gen_labels_for_chunks(num_chunks, chunk_size,
                                 n_classes=n_classes,
                                 n_unknown_labels=n_unknown_labels)
  assert len(labels) >= 1
  constraints = Constraints(labels[1:])
  with pytest.raises(ValueError) as e:
    constraints.chunks(num_chunks=num_chunks, chunk_size=chunk_size,
                       random_state=SEED)

  expected_message = (('Not enough examples in each class to form %d chunks '
                       'of %d examples - maximum number of chunks is %d'
                       ) % (num_chunks, chunk_size, num_chunks - 1))

  assert str(e.value) == expected_message
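
Assuming the repository layout used in this diff, the new tests can be run on their own with pytest, for example:

pytest test/test_constraints.py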