Skip to content

Bug in DataFrame.drop_duplicates for empty DataFrame throws error #22394

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Aug 23, 2018
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -711,7 +711,7 @@ Reshaping
- Bug in :func:`get_dummies` with Unicode attributes in Python 2 (:issue:`22084`)
- Bug in :meth:`DataFrame.replace` raises ``RecursionError`` when replacing empty lists (:issue:`22083`)
- Bug in :meth:`Series.replace` and meth:`DataFrame.replace` when dict is used as the `to_replace` value and one key in the dict is is another key's value, the results were inconsistent between using integer key and using string key (:issue:`20656`)
-
- Bug in :meth:`DataFrame.drop_duplicates` for empty ``DataFrame`` which incorrectly raises error (:issue:`20516`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raises an error


Build Changes
^^^^^^^^^^^^^
Expand Down
3 changes: 3 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -4369,6 +4369,9 @@ def duplicated(self, subset=None, keep='first'):
from pandas.core.sorting import get_group_index
from pandas._libs.hashtable import duplicated_int64, _SIZE_HINT_LIMIT

if self.empty:
return Series()

def f(vals):
labels, shape = algorithms.factorize(
vals, size_hint=min(len(self), _SIZE_HINT_LIMIT))
Expand Down
15 changes: 15 additions & 0 deletions pandas/tests/frame/test_duplicates.py
Original file line number Diff line number Diff line change
Expand Up @@ -263,6 +263,21 @@ def test_drop_duplicates_tuple():
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize('df', [
DataFrame(),
DataFrame(columns=[]),
DataFrame(columns=['A', 'B', 'C']),
DataFrame(index=[]),
DataFrame(index=['A', 'B', 'C'])
])
def test_drop_duplicates_empty(df):
# GH 20516
result = df.drop_duplicates()
if df.columns.empty is False:
df = DataFrame(columns=[])
tm.assert_frame_equal(result, df)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tests feels a bit tricky to me. So, when there are columns, we set them to an empty list. So, that would be the same as simply using as expected value DataFrame(columns=[]), or am I missing something?

But not sure if I'm missing something, but I'd expect DataFrame(columns=['A', 'B', 'C']).drop_duplicates() to keep the columns. Any reason to drop them?

Copy link
Contributor Author

@HyunTruth HyunTruth Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the closed issue of #22409. DataFrame(columns=['A', 'B', 'C']).drop_duplicates() passes through empty selection, and whilst index remains, column infos do not as the selection itself is done on the column basis if I understood the comment correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the problem may be your approach implementing this story. I agree on what @WillAyd said in #22409, but not in the side effect in this PR.

I guess df[pandas.Series()] is the same as df[[]], so you're telling the DataFrame to select an empty list of columns, so none of them. So, if we have a DataFrame with 4 columns and 10 rows, I expect this to return the 10 rows, but 0 columns.

But the case here with drop_duplicates, if I understand correctly, is that you have a DataFrame with 4 columns and 0 rows, you remove the duplicates, that there are none, as the DataFrame has 0 rows, so I should still have 4 columns and 0 rows.

And in your test you're asserting that the return will be 0 columns and 0 rows.

Am I right? Does it makes sense?

Copy link
Contributor Author

@HyunTruth HyunTruth Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. However, if we want to leave the column when there exists no value, then we'd need to change the DataFrame.drop_duplicates itself to not use the logic of self[~duplicated], as it currently does, as this is what is causing this problem at least at this moment with the empty return of DataFrame.duplicated... I thought that DataFrame.duplicated returns a Series, and then if DataFrame has no value, returning an empty Series is logical. However, since DataFrame.drop_duplicates use the logic of self[~duplicated] (which works for every other case except on empty DataFrame with only columns) returning empty Series leads to df[[]], resulting in a DataFrame with no value without the columns (as it is used in selection which has no value in it). Maybe returning an empty wasn't a good idea. Completely stuck then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I understand the problem now. There is some magic on df[something] in pandas. In some cases something would be understood as the list of columns, like in df[[]] or df[['A', 'B', 'C']], but if something is a list of boolean values, then it filters on rows instead of columns, like in df[df['A'].isnull()] (df['A'].isnull() returns a list (Series) of booleans).

That's why your approach is not working as expected. You may try to check with df.iloc[something], which should always filter on rows.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test case with inplace=True please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.



def test_drop_duplicates_NA():
# none
df = DataFrame({'A': [None, None, 'foo', 'bar',
Expand Down