Break chunks generation in RCA when not enough possible chunks, fixes issue #200 #254
Changes from 6 commits
c36d998
a286d5c
93fbe80
0d8a521
d663608
053da02
4f8c247
8a6d8bc
@@ -0,0 +1,57 @@
import pytest
import numpy as np
from sklearn.utils import shuffle
from metric_learn.constraints import Constraints

SEED = 42


def gen_labels_for_chunks(num_chunks, chunk_size,
                          n_classes=10, n_unknown_labels=5):
  """Generates num_chunks*chunk_size labels that split in num_chunks chunks,
  that are homogeneous in the label."""
  assert min(num_chunks, chunk_size) > 0
  classes = shuffle(np.arange(n_classes), random_state=SEED)
  n_per_class = chunk_size * (num_chunks // n_classes)
  n_maj_class = n_per_class + chunk_size * num_chunks - n_per_class * n_classes

  first_labels = classes[0] * np.ones(n_maj_class, dtype=int)
  remaining_labels = np.concatenate([k * np.ones(n_per_class, dtype=int)
                                     for k in classes[1:]])
  unknown_labels = -1 * np.ones(n_unknown_labels, dtype=int)

  labels = np.concatenate([first_labels, remaining_labels, unknown_labels])
  return shuffle(labels, random_state=SEED)


@pytest.mark.parametrize('num_chunks, chunk_size', [(11, 5), (115, 12)])
def test_chunk_case_exact_num_points(num_chunks, chunk_size,
                                     n_classes=10, n_unknown_labels=5):
  """Checks that the chunk generation works well with just enough points."""
  labels = gen_labels_for_chunks(num_chunks, chunk_size,
                                 n_classes=n_classes,
                                 n_unknown_labels=n_unknown_labels)
  constraints = Constraints(labels)
  chunks = constraints.chunks(num_chunks=num_chunks, chunk_size=chunk_size,
                              random_state=SEED)
  return chunks

Instead of simply returning

I will add it. One remark: I expected chunks to map each datapoint index (in input order) to its chunk number, but chunks removes the -1 instances (unknown labels, according to line 21), so one has to keep track of the datapoint indexes when removing the -1's. This must be taken into account in the implementation of RCA.

Can you elaborate on this? Not sure I understand what you mean. Do we test at all

I'll just add an example to make it clearer. Below is a comparison in the presence of unknown labels between the output that I expected from

In [1]: from metric_learn.constraints import Constraints
In [2]: partial_labels = [1, 2, 2, 1, -1, 3, 3]
In [3]: cons = Constraints(partial_labels)
In [4]: cons.chunks(num_chunks=3, chunk_size=2)
Out[4]: array([0, 1, 1, 0, 2, 2])
In [5]: chunks = cons.chunks(num_chunks=3, chunk_size=2)
In [6]: len(chunks), len(partial_labels)
Out[6]: (6, 7)
In [7]: expected_chunk = [0, 1, 1, 0, -1, 2, 2]

The output is not a map such that

Thanks. And this is indeed a problem for
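
The mismatch in the session above can be sketched in a few lines. This is only an illustration, not metric-learn code: expand_chunks is a hypothetical helper, and it assumes the compact array lists the known-label points in their input order, as the session output suggests.

```python
def expand_chunks(labels, compact_chunks, unknown=-1):
  """Hypothetical helper: re-align compact chunk ids with the input order.

  Positions whose label is `unknown` get `unknown` as their chunk id.
  """
  # Input positions of the points that carry a known label.
  known_idx = [i for i, lab in enumerate(labels) if lab != unknown]
  if len(known_idx) != len(compact_chunks):
    raise ValueError('compact_chunks must have one entry per known label')

  aligned = [unknown] * len(labels)
  for idx, chunk_id in zip(known_idx, compact_chunks):
    aligned[idx] = chunk_id
  return aligned


partial_labels = [1, 2, 2, 1, -1, 3, 3]
compact = [0, 1, 1, 0, 2, 2]  # values copied from Out[4] above
aligned = expand_chunks(partial_labels, compact)
print(aligned)
```

With this alignment, aligned[i] is the chunk of the i-th input point, which is the mapping the comment above expected.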


@pytest.mark.parametrize('num_chunks, chunk_size', [(5, 10), (10, 50)])
def test_chunk_case_one_miss_point(num_chunks, chunk_size,
                                   n_classes=10, n_unknown_labels=5):
  """Checks that the chunk generation breaks when one point is missing."""
  labels = gen_labels_for_chunks(num_chunks, chunk_size,
                                 n_classes=n_classes,
                                 n_unknown_labels=n_unknown_labels)
  assert len(labels) >= 1
  constraints = Constraints(labels[1:])
  with pytest.raises(ValueError) as e:
    constraints.chunks(num_chunks=num_chunks, chunk_size=chunk_size,
                       random_state=SEED)

  expected_message = (('Not enough examples in each class to form %d chunks '
                       'of %d examples - maximum number of chunks is %d'
                       ) % (num_chunks, chunk_size, num_chunks - 1))

  assert str(e.value) == expected_message
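
The expected message relies on a "maximum number of chunks". A rough sketch of that arithmetic follows, assuming the maximum is the sum over classes of the class size floor-divided by chunk_size. The diff does not show the implementation, so this rule is an assumption, and max_chunks is a hypothetical name.

```python
from collections import Counter


def max_chunks(labels, chunk_size, unknown=-1):
  """Hypothetical helper: how many label-homogeneous chunks fit (assumed rule)."""
  # Count points per known class, ignoring unknown labels.
  counts = Counter(lab for lab in labels if lab != unknown)
  # Each class contributes as many full chunks as its size allows.
  return sum(n // chunk_size for n in counts.values())


# Same class sizes as gen_labels_for_chunks(11, 5), left unshuffled:
# one class with 10 points, nine classes with 5 points, 5 unknown labels.
labels = [0] * 10 + [k for k in range(1, 10) for _ in range(5)] + [-1] * 5

exact = max_chunks(labels, chunk_size=5)
one_missing = max_chunks(labels[1:], chunk_size=5)  # drops one known point
print(exact, one_missing)
```

Under that assumption, removing a single known-label point lowers the maximum from num_chunks to num_chunks - 1, which is the value the expected message interpolates.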