Skip to content

BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. #32546

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jun 14, 2020

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Mar 8, 2020

@MarcoGorelli MarcoGorelli changed the title first attempt (wip) BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. Mar 8, 2020
@WillAyd WillAyd added Categorical Categorical Data Type Groupby labels Mar 9, 2020
@MarcoGorelli MarcoGorelli changed the title (wip) BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. Mar 9, 2020
@@ -1452,6 +1453,12 @@ def _transform_fast(self, result: DataFrame, func_nm: str) -> DataFrame:
# for each col, reshape to to size of original frame
# by take operation
ids, _, ngroup = self.grouper.group_info

if any(is_categorical(ping.grouper) for ping in self.grouper.groupings):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be in .group_info itself

@MarcoGorelli
Copy link
Member Author

@WillAyd @jreback thanks for your reviews.

The solution I've pushed can either be:

if any(is_categorical(ping.grouper) for ping in self.grouper.groupings):
    result = result.reindex(self.grouper.result_index)

(as this is only necessary if there's a categorical grouping), or, if the if statement adds unnecessary complexity, just

result = result.reindex(self.grouper.result_index)

(which works all the time but in some cases isn't necessary), as I've currently done.

@MarcoGorelli MarcoGorelli changed the title BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. (wip) BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. Mar 24, 2020
@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Mar 24, 2020

This is currently wrong: it works for

>>> df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']), "B": [1, 2, 3], "C": ['a', 'b', 'a']})
>>> df.groupby(['A', 'C']).transform('sum')

but not for

>>> df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']), "B": [1, 2, 3], "C": ['a', 'b', 'a']})
>>> df.groupby(['A', 'C'])['B'].transform('sum')

will update when this is addressed

UPDATE

I think it's fine now

@MarcoGorelli MarcoGorelli changed the title (wip) BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. Mar 24, 2020
@MarcoGorelli MarcoGorelli requested a review from WillAyd May 9, 2020 07:25
@@ -1475,6 +1481,11 @@ def _transform_fast(self, result: DataFrame, func_nm: str) -> DataFrame:
# for each col, reshape to to size of original frame
# by take operation
ids, _, ngroup = self.grouper.group_info

# in categorical case there may be unobserved categories in index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this depend on if observed=True is passed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unnecessary if observed is True - have taken care of that now, thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we *always just do
result = result.reindex(self.groupber.result_index, copy=False) ?

what breaks if we do that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback that's fine, we can do that always, I just thought it would be better not to if it wasn't necessary. Green now

@jreback jreback added this to the 1.1 milestone May 18, 2020
@jreback
Copy link
Contributor

jreback commented May 18, 2020

ok looks good. can you run the groupby benchmarks to make sure this doesn't regress anything. ping on green (and post results)

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented May 18, 2020

@jreback here are the results of

$ asv continuous -f 1.1 upstream/master HEAD -b ^groupby

:

       before           after         ratio
     [dede2c7a]       [d6043ecb]
     <32494^2>        <32494>   
+         185±7μs          216±8μs     1.17  groupby.GroupByMethods.time_dtype_as_field('object', 'count', 'direct')
-        826±60μs          746±6μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'cumprod', 'transformation')
-        208±20μs          186±1μs     0.89  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'direct')
-      1.02±0.1ms         862±10μs     0.84  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'transformation')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@MarcoGorelli
Copy link
Member Author

@jreback

I just merged master and this time got

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

running the same command

@MarcoGorelli MarcoGorelli requested a review from jreback June 12, 2020 13:08
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. comment on using a fixture, merge master and ping on green.

@MarcoGorelli MarcoGorelli requested a review from jreback June 13, 2020 11:50
@jreback jreback merged commit 624a1be into pandas-dev:master Jun 14, 2020
@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

thanks @MarcoGorelli

@MarcoGorelli MarcoGorelli deleted the 32494 branch June 14, 2020 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataframe.groupby aggregations with categorical columns lead to incorrect results.
3 participants