Skip to content

BUG: Fix duplicates in intersection of multiindexes #36927

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Nov 29, 2020
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
cdefaae
Fix duplicates in intersectin of multiindexes
phofl Oct 6, 2020
fbd63f2
Fix duplicates in index intersection
phofl Oct 6, 2020
53a37d1
Modify test and avoid None issues
phofl Oct 6, 2020
5675a4e
Fix failing test
phofl Oct 6, 2020
134936c
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Oct 9, 2020
582c0b9
Change comment
phofl Oct 9, 2020
7805de5
Add unique after intersection
phofl Oct 11, 2020
67691df
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Oct 11, 2020
8fb0055
Merge branch '31326' into 36915
phofl Oct 11, 2020
66b519f
Fix merge bug
phofl Oct 11, 2020
cb1477b
Add tests and whatsnew
phofl Oct 11, 2020
0fb2561
Add rename
phofl Oct 26, 2020
3c19d57
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Oct 26, 2020
10524fd
Fix check in merge operation
phofl Oct 26, 2020
3dde0ee
Exit set ops when nonunique
phofl Nov 12, 2020
a0a1a33
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Nov 12, 2020
45dfb84
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Nov 22, 2020
d71a499
Roll back to initial version
phofl Nov 22, 2020
d873d5a
Change whatsnew
phofl Nov 22, 2020
e90239a
Move whatsnew
phofl Nov 27, 2020
c2b448a
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36915
phofl Nov 27, 2020
742716e
Change gh reference
phofl Nov 27, 2020
321797a
Remove pd
phofl Nov 27, 2020
a980ec0
Remove whatsnew from 1.2
phofl Nov 28, 2020
972fd48
Fix test
phofl Nov 28, 2020
fe1ded4
Make condition more clear and add assert
phofl Nov 28, 2020
8e4d47b
Use shape for equality check
phofl Nov 28, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -405,7 +405,8 @@ MultiIndex

- Bug in :meth:`DataFrame.xs` when used with :class:`IndexSlice` raises ``TypeError`` with message ``"Expected label or tuple of labels"`` (:issue:`35301`)
- Bug in :meth:`DataFrame.reset_index` with ``NaT`` values in index raises ``ValueError`` with message ``"cannot convert float NaN to integer"`` (:issue:`36541`)
-
- Bug in :meth:`MultiIndex.intersection` returned duplicates when at least one of the indexes had duplicates (:issue:`36915`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #36915 (comment) This is fixing a regression from 1.0.5? Can we target 1.1.5 with this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the whatsnew, but this may be tricky to backport.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not yet reviewed this PR. are there any changes other than those required to fix the regression here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge and setops relied partially on the wrong behavior, had to fix this too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, not investigated fully, but this PR may sit on top of #37171 (which incidentally caused a perf regression #37171 (comment) so maybe that we want to revert that first before committing the changes here anyhow)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this may be tricky to backport.

opened #38133 to test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not checked all the envs, but a quick check on a couple shows the same tests failing on the backport as here. so that's encouraging.

I think we should target as a backport. and if issues down the road will just have to move the release note back.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thats a good sign. Lets try this then, when everything is green here



I/O
^^^
Expand Down Expand Up @@ -485,6 +486,7 @@ Other
- Fixed metadata propagation in the :class:`Series.dt` and :class:`Series.str` accessors (:issue:`28283`)
- Bug in :meth:`Index.union` behaving differently depending on whether operand is a :class:`Index` or other list-like (:issue:`36384`)
- Passing an array with 2 or more dimensions to the :class:`Series` constructor now raises the more specific ``ValueError``, from a bare ``Exception`` previously (:issue:`35744`)
- Bug in :meth:`Index.intersection` returned duplicates when at least one of the indexes had duplicates (:issue:`31326`)

.. ---------------------------------------------------------------------------

Expand Down
5 changes: 3 additions & 2 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2727,7 +2727,7 @@ def intersection(self, other, sort=False):
self._assert_can_do_setop(other)
other = ensure_index(other)

if self.equals(other):
if self.equals(other) and not self.has_duplicates:
return self._get_reconciled_name_object(other)

if not is_dtype_equal(self.dtype, other.dtype):
Expand All @@ -2745,6 +2745,7 @@ def intersection(self, other, sort=False):
except TypeError:
pass
else:
result = algos.unique1d(result)
return self._wrap_setop_result(other, result)

try:
Expand All @@ -2756,7 +2757,7 @@ def intersection(self, other, sort=False):
indexer = algos.unique1d(Index(rvals).get_indexer_non_unique(lvals)[0])
indexer = indexer[indexer != -1]

result = other.take(indexer)
result = other.take(indexer).unique()

if sort is None:
result = algos.safe_sort(result.values)
Expand Down
8 changes: 6 additions & 2 deletions pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -3475,6 +3475,8 @@ def intersection(self, other, sort=False):
other, result_names = self._convert_can_do_setop(other)

if self.equals(other):
if self.has_duplicates:
return self.unique()
return self.rename(result_names)

if not is_object_dtype(other.dtype):
Expand All @@ -3493,10 +3495,12 @@ def intersection(self, other, sort=False):
uniq_tuples = None # flag whether _inner_indexer was successful
if self.is_monotonic and other.is_monotonic:
try:
uniq_tuples = self._inner_indexer(lvals, rvals)[0]
sort = False # uniq_tuples is already sorted
inner_tuples = self._inner_indexer(lvals, rvals)[0]
sort = False # inner_tuples is already sorted
except TypeError:
pass
else:
uniq_tuples = algos.unique(inner_tuples)

if uniq_tuples is None:
other_uniq = set(rvals)
Expand Down
4 changes: 3 additions & 1 deletion pandas/core/ops/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -457,7 +457,9 @@ def _should_reindex_frame_op(
# TODO: any other cases we should handle here?
cols = left.columns.intersection(right.columns)

if len(cols) and not (cols.equals(left.columns) and cols.equals(right.columns)):
if len(cols) and not (
cols.equals(left.columns.unique()) and cols.equals(right.columns.unique())
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a test that fails without this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the builds are gone. I will run the tests in the evening to find the relevant tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def test_column_dups_operations(self):

This one is crashing in line 255 because of memory issues. The condition is never fullfilled, so it blows up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the condition in master is never fulfilled or the condition in the PR?

btw, usually let the person who asked the question hit the "resolve conversation" button

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, did not know that. Will do that in the future.

The input df has columns [A,A]. When intersection is unique, then cols=[A], while left.columns=[A,A] and right.columns=[A,A]. Without the change adding the unique, cols.equals(left.columns) and cols.equals(right.columns) will always be False. So we always return True, which results in duplicating the columns for every passthrough, hence the memory explosion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still needed?

we now guaranteee that intersection returns unique columns, right so this should no longer be the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is the problem. Is intersection is unique, cols.equals(left.columns) won't be True. This leads to a recursion in the test mentioned which blows up the memory, because we can not exit this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok pls add some comments to this effect then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would prob calculate left_uniques and right_uniques and make them variables as a bit simpler to read

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, changed it and added a comment

# TODO: is there a shortcut available when len(cols) == 0?
return True

Expand Down
23 changes: 23 additions & 0 deletions pandas/tests/indexes/multi/test_setops.py
Original file line number Diff line number Diff line change
Expand Up @@ -376,3 +376,26 @@ def test_setops_disallow_true(method):

with pytest.raises(ValueError, match="The 'sort' keyword only takes"):
getattr(idx1, method)(idx2, sort=True)


@pytest.mark.parametrize(
("tuples", "exp_tuples"),
[
([("val1", "test1")], [("val1", "test1")]),
([("val1", "test1"), ("val1", "test1")], [("val1", "test1")]),
(
[("val2", "test2"), ("val1", "test1")],
[("val2", "test2"), ("val1", "test1")],
),
],
)
def test_intersect_with_duplicates(tuples, exp_tuples):
# GH: 36915
left = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
right = pd.MultiIndex.from_tuples(
[("val1", "test1"), ("val1", "test1"), ("val2", "test2")],
names=["first", "second"],
)
result = left.intersection(right)
expected = pd.MultiIndex.from_tuples(exp_tuples, names=["first", "second"])
tm.assert_index_equal(result, expected)
2 changes: 1 addition & 1 deletion pandas/tests/indexes/test_base.py
Original file line number Diff line number Diff line change
Expand Up @@ -683,7 +683,7 @@ def test_intersection_monotonic(self, index2, keeps_name, sort):

@pytest.mark.parametrize(
"index2,expected_arr",
[(Index(["B", "D"]), ["B"]), (Index(["B", "D", "A"]), ["A", "B", "A"])],
[(Index(["B", "D"]), ["B"]), (Index(["B", "D", "A"]), ["A", "B"])],
)
def test_intersection_non_monotonic_non_unique(self, index2, expected_arr, sort):
# non-monotonic non-unique
Expand Down
10 changes: 10 additions & 0 deletions pandas/tests/indexes/test_setops.py
Original file line number Diff line number Diff line change
Expand Up @@ -95,3 +95,13 @@ def test_union_dtypes(left, right, expected):
b = pd.Index([], dtype=right)
result = (a | b).dtype
assert result == expected


@pytest.mark.parametrize("values", [[1, 2, 2, 3], [3, 3]])
def test_intersection_duplicates(values):
# GH: 31326
a = pd.Index(values)
b = pd.Index([3, 3])
result = a.intersection(b)
expected = pd.Index([3])
tm.assert_index_equal(result, expected)