-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. #32546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/groupby/generic.py
Outdated
@@ -1452,6 +1453,12 @@ def _transform_fast(self, result: DataFrame, func_nm: str) -> DataFrame: | |||
# for each col, reshape to to size of original frame | |||
# by take operation | |||
ids, _, ngroup = self.grouper.group_info | |||
|
|||
if any(is_categorical(ping.grouper) for ping in self.grouper.groupings): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be in .group_info itself
@WillAyd @jreback thanks for your reviews. The solution I've pushed can either be:
(as this is only necessary if there's a categorical grouping), or, if the
(which works all the time but in some cases isn't necessary), as I've currently done. |
This is currently wrong: it works for
but not for
will update when this is addressed UPDATEI think it's fine now |
pandas/core/groupby/generic.py
Outdated
@@ -1475,6 +1481,11 @@ def _transform_fast(self, result: DataFrame, func_nm: str) -> DataFrame: | |||
# for each col, reshape to to size of original frame | |||
# by take operation | |||
ids, _, ngroup = self.grouper.group_info | |||
|
|||
# in categorical case there may be unobserved categories in index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this depend on if observed=True is passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unnecessary if observed
is True - have taken care of that now, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we *always just do
result = result.reindex(self.groupber.result_index, copy=False)
?
what breaks if we do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback that's fine, we can do that always, I just thought it would be better not to if it wasn't necessary. Green now
ok looks good. can you run the groupby benchmarks to make sure this doesn't regress anything. ping on green (and post results) |
@jreback here are the results of $ asv continuous -f 1.1 upstream/master HEAD -b ^groupby :
|
I just merged master and this time got
running the same command |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. comment on using a fixture, merge master and ping on green.
thanks @MarcoGorelli |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff