BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. #32546

MarcoGorelli · 2020-03-08T19:51:00Z

- closes Dataframe.groupby aggregations with categorical columns lead to incorrect results. #32494
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

pandas/tests/groupby/test_transform.py

pandas/core/groupby/generic.py

jreback · 2020-03-11T03:38:42Z

pandas/core/groupby/generic.py

@@ -1452,6 +1453,12 @@ def _transform_fast(self, result: DataFrame, func_nm: str) -> DataFrame:
        # for each col, reshape to to size of original frame
        # by take operation
        ids, _, ngroup = self.grouper.group_info
+
+        if any(is_categorical(ping.grouper) for ping in self.grouper.groupings):


this should be in .group_info itself

MarcoGorelli · 2020-03-22T22:15:53Z

@WillAyd @jreback thanks for your reviews.

The solution I've pushed can either be:

if any(is_categorical(ping.grouper) for ping in self.grouper.groupings):
    result = result.reindex(self.grouper.result_index)

(as this is only necessary if there's a categorical grouping), or, if the if statement adds unnecessary complexity, just

result = result.reindex(self.grouper.result_index)

(which works all the time but in some cases isn't necessary), as I've currently done.
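
To make the role of the unobserved categories concrete, here is a minimal sketch that uses only public pandas API rather than the internal self.grouper.result_index. It assumes a single categorical grouper for brevity, and the names agg_all, agg_obs and aligned are illustrative, not part of the patch.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
        "B": [1, 2, 3],
    }
)

# observed=False keeps a row for every category, including the unused "c".
agg_all = df.groupby("A", observed=False)["B"].sum()

# observed=True keeps only the categories that actually occur in the data.
agg_obs = df.groupby("A", observed=True)["B"].sum()

# Reindexing to the observed groups drops the extra row; when there are no
# unobserved categories the reindex leaves the result unchanged.
aligned = agg_all.reindex(agg_obs.index)

print(len(agg_all), len(agg_obs), aligned.tolist())
```

The extra row for "c" is what shifts positions when the aggregated result is later expanded back to the shape of the original frame.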

MarcoGorelli · 2020-03-24T21:14:56Z

This is currently wrong: it works for

>>> df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']), "B": [1, 2, 3], "C": ['a', 'b', 'a']})
>>> df.groupby(['A', 'C']).transform('sum')

but not for

>>> df = pd.DataFrame({"A": pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']), "B": [1, 2, 3], "C": ['a', 'b', 'a']})
>>> df.groupby(['A', 'C'])['B'].transform('sum')

will update when this is addressed

UPDATE

I think it's fine now
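
A quick check that the DataFrame and Series paths agree might look like the following. This is a sketch against the public API, assuming a pandas version that includes this fix; frame_result and series_result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
        "B": [1, 2, 3],
        "C": ["a", "b", "a"],
    }
)

grouped = df.groupby(["A", "C"], observed=False)

# DataFrame path: the grouping columns are dropped, only "B" is transformed.
frame_result = grouped.transform("sum")

# Series path: select "B" first, then transform.
series_result = grouped["B"].transform("sum")

# Rows 0 and 2 share the group ("a", "a"), so both paths should give 4, 2, 4.
print(frame_result["B"].tolist())
print(series_result.tolist())
```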

pandas/tests/groupby/transform/test_transform.py

jreback · 2020-05-09T19:46:07Z

pandas/core/groupby/generic.py

@@ -1475,6 +1481,11 @@ def _transform_fast(self, result: DataFrame, func_nm: str) -> DataFrame:
        # for each col, reshape to to size of original frame
        # by take operation
        ids, _, ngroup = self.grouper.group_info
+
+        # in categorical case there may be unobserved categories in index


does this depend on whether observed=True is passed?

MarcoGorelli · 2020-05-10

It's unnecessary if observed is True. I've taken care of that now, thanks.

jreback · 2020-05-17

can we always just do
result = result.reindex(self.grouper.result_index, copy=False) ?

what breaks if we do that?

MarcoGorelli · 2020-05-18

@jreback that's fine, we can do that always; I just thought it would be better not to if it wasn't necessary. Green now.
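
Why the reindex is needed, and why applying it unconditionally is harmless, can be modelled without pandas. The sketch below is a deliberately simplified stand-in for the take-by-group-id step, not the actual _transform_fast code; broadcast, ids, result_index and full_result are hypothetical names.

```python
def broadcast(values, ids):
    """Spread one aggregated value per group back onto the original rows."""
    return [values[i] for i in ids]


# Per-row group ids: rows 0 and 2 belong to group 0, row 1 to group 1.
ids = [0, 1, 0]

# Groups the ids refer to, in order (only observed combinations).
result_index = [("a", "a"), ("b", "b")]

# Aggregated sums when unobserved category combinations are kept as well.
full_result = {
    ("a", "a"): 4,
    ("a", "b"): 0,
    ("b", "a"): 0,
    ("b", "b"): 2,
    ("c", "a"): 0,
    ("c", "b"): 0,
}

# Taking positions straight from the full result misaligns the groups.
naive = broadcast(list(full_result.values()), ids)

# "Reindexing" to result_index first restores the expected positions.
fixed = broadcast([full_result[key] for key in result_index], ids)

# If the result already matches result_index, the reindex changes nothing.
observed_result = {("a", "a"): 4, ("b", "b"): 2}
unchanged = broadcast([observed_result[key] for key in result_index], ids)

print(naive)      # [4, 0, 4]
print(fixed)      # [4, 2, 4]
print(unchanged)  # [4, 2, 4]
```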

jreback · 2020-05-18T13:06:01Z

ok looks good. can you run the groupby benchmarks to make sure this doesn't regress anything. ping on green (and post results)

MarcoGorelli · 2020-05-18T17:23:37Z

@jreback here are the results of

$ asv continuous -f 1.1 upstream/master HEAD -b ^groupby

       before           after         ratio
     [dede2c7a]       [d6043ecb]
     <32494^2>        <32494>   
+         185±7μs          216±8μs     1.17  groupby.GroupByMethods.time_dtype_as_field('object', 'count', 'direct')
-        826±60μs          746±6μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'cumprod', 'transformation')
-        208±20μs          186±1μs     0.89  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'direct')
-      1.02±0.1ms         862±10μs     0.84  groupby.GroupByMethods.time_dtype_as_group('int', 'rank', 'transformation')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
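
For reading the table: the ratio column is the after time divided by the before time, and -f 1.1 is the factor beyond which asv lists a benchmark. A rough sketch of that arithmetic on the first three rows follows. It ignores asv's statistical checks on the ± spread, and is_flagged is a hypothetical helper, not an asv function.

```python
FACTOR = 1.1  # value passed via -f in the command above

# (before, after) mean timings in microseconds, copied from the table.
timings = {
    "time_dtype_as_field('object', 'count', 'direct')": (185, 216),
    "time_dtype_as_group('float', 'cumprod', 'transformation')": (826, 746),
    "time_dtype_as_group('float', 'cumcount', 'direct')": (208, 186),
}


def is_flagged(before, after, factor=FACTOR):
    """Return True when the change exceeds the factor in either direction."""
    ratio = after / before
    return ratio > factor or ratio < 1 / factor


ratios = {name: round(after / before, 2) for name, (before, after) in timings.items()}
for name, ratio in ratios.items():
    print(f"{ratio:.2f}  {name}")
```

Only the first row is a slowdown; the other listed rows got faster by more than the same factor.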

MarcoGorelli · 2020-05-18T18:08:51Z

@jreback

I just merged master and reran the same command; this time I got

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

jreback

lgtm. comment on using a fixture, merge master and ping on green.

pandas/tests/groupby/transform/test_transform.py

jreback · 2020-06-14T15:02:08Z

thanks @MarcoGorelli

first attempt

16cb58e

MarcoGorelli changed the title ~~first attempt~~ (wip) BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. Mar 8, 2020

MarcoGorelli commented Mar 8, 2020

pandas/tests/groupby/test_transform.py (outdated)

MarcoGorelli added 2 commits March 8, 2020 22:18

condition for categorical

fff989f

gh number

1356da9

WillAyd requested changes Mar 9, 2020

pandas/core/groupby/generic.py (outdated)

WillAyd added the Categorical and Groupby labels Mar 9, 2020

Merge remote-tracking branch 'upstream/master' into 32494

f0a3cb1

MarcoGorelli changed the title ~~(wip) BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results.~~ BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. Mar 9, 2020

Marco Gorelli added 3 commits March 9, 2020 12:28

use is_categorical

690591b

whatsnew

fa0cb85

Merge remote-tracking branch 'upstream/master' into 32494

967c2e8

jreback requested changes Mar 11, 2020

MarcoGorelli added 3 commits March 22, 2020 21:45

Merge remote-tracking branch 'upstream/master' into 32494

08632f2

reindex result

9d4d86c

remove blank lines

69c9513

Merge remote-tracking branch 'upstream/master' into 32494

da62141

MarcoGorelli changed the title ~~BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results.~~ (wip) BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. Mar 24, 2020

fix for series case too

c9d9f81

MarcoGorelli changed the title ~~(wip) BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results.~~ BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. Mar 24, 2020

MarcoGorelli added 4 commits March 24, 2020 21:31

correct test

5586631

Merge remote-tracking branch 'upstream/master' into 32494

dd14e52

add comment about unobserved categories in categorical case

fc66150

Merge remote-tracking branch 'upstream/master' into 32494

32bd5b6

MarcoGorelli requested a review from WillAyd May 9, 2020 07:25

MarcoGorelli added 2 commits May 9, 2020 18:20

Merge remote-tracking branch 'upstream/master' into 32494

cc5022e

is_categorical -> is_categorical_dtype

b9cdde9

jreback requested changes May 9, 2020

MarcoGorelli added 4 commits May 10, 2020 11:19

dont reindex if observed is True, add short description of test, parametrize over observed

016f5fa

assert frame equal -> assert series equal

869e1f5

don't special case the reindexing

c3db9a7

Merge remote-tracking branch 'upstream/master' into 32494

d6043ec

jreback added this to the 1.1 milestone May 18, 2020

Merge remote-tracking branch 'upstream/master' into 32494

0725ddf

MarcoGorelli requested a review from jreback June 12, 2020 13:08

jreback requested changes Jun 12, 2020

pandas/tests/groupby/transform/test_transform.py (resolved)

MarcoGorelli added 3 commits June 12, 2020 19:03

use observed fixture

c9b9881

Merge remote-tracking branch 'upstream/master' into 32494

c8a248b

Merge remote-tracking branch 'upstream/master' into 32494

c335221

MarcoGorelli requested a review from jreback June 13, 2020 11:50

jreback approved these changes Jun 14, 2020

jreback merged commit 624a1be into pandas-dev:master Jun 14, 2020

MarcoGorelli deleted the 32494 branch June 14, 2020 16:31
