BUG: Dataframe.groupby aggregations with categorical columns lead to incorrect results. #32546

Merged: 26 commits, Jun 14, 2020
Commits
16cb58e  first attempt (MarcoGorelli, Mar 8, 2020)
fff989f  condition for categorical (MarcoGorelli, Mar 8, 2020)
1356da9  gh number (MarcoGorelli, Mar 8, 2020)
f0a3cb1  Merge remote-tracking branch 'upstream/master' into 32494 (Mar 9, 2020)
690591b  use is_categorical (Mar 9, 2020)
fa0cb85  whatsnew (Mar 9, 2020)
967c2e8  Merge remote-tracking branch 'upstream/master' into 32494 (Mar 10, 2020)
08632f2  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, Mar 22, 2020)
9d4d86c  reindex result (MarcoGorelli, Mar 22, 2020)
69c9513  remove blank lines (MarcoGorelli, Mar 22, 2020)
da62141  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, Mar 24, 2020)
c9d9f81  fix for series case too (MarcoGorelli, Mar 24, 2020)
5586631  correct test (MarcoGorelli, Mar 24, 2020)
dd14e52  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, Mar 25, 2020)
fc66150  add comment about unobserved categories in categorical case (MarcoGorelli, Mar 25, 2020)
32bd5b6  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, May 9, 2020)
cc5022e  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, May 9, 2020)
b9cdde9  is_categorical -> is_categorical_dtype (MarcoGorelli, May 9, 2020)
016f5fa  dont reindex if observed is True, add short description of test, para… (MarcoGorelli, May 10, 2020)
869e1f5  assert frame equal -> assert series equal (MarcoGorelli, May 10, 2020)
c3db9a7  don't special case the reindexing (MarcoGorelli, May 18, 2020)
d6043ec  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, May 18, 2020)
0725ddf  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, May 18, 2020)
c9b9881  use observed fixture (MarcoGorelli, Jun 12, 2020)
c8a248b  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, Jun 12, 2020)
c335221  Merge remote-tracking branch 'upstream/master' into 32494 (MarcoGorelli, Jun 13, 2020)
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -976,6 +976,7 @@ Groupby/resample/rolling

 - Bug in :meth:`GroupBy.apply` raises ``ValueError`` when the ``by`` axis is not sorted and has duplicates and the applied ``func`` does not mutate passed in objects (:issue:`30667`)
 - Bug in :meth:`DataFrameGroupby.transform` produces incorrect result with transformation functions (:issue:`30918`)
+- Bug in :meth:`Groupby.transform` was returning the wrong result when grouping by multiple keys of which some were categorical and others not (:issue:`32494`)
 - Bug in :meth:`GroupBy.count` causes segmentation fault when grouped-by column contains NaNs (:issue:`32841`)
 - Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` produces inconsistent type when aggregating Boolean series (:issue:`32894`)
 - Bug in :meth:`DataFrameGroupBy.sum` and :meth:`SeriesGroupBy.sum` where a large negative number would be returned when the number of non-null values was below ``min_count`` for nullable integer dtypes (:issue:`32861`)
2 changes: 2 additions & 0 deletions pandas/core/groupby/generic.py
@@ -546,6 +546,7 @@ def _transform_fast(self, result, func_nm: str) -> Series:
         builtin/cythonizable functions
         """
         ids, _, ngroup = self.grouper.group_info
+        result = result.reindex(self.grouper.result_index, copy=False)
         cast = self._transform_should_cast(func_nm)
         out = algorithms.take_1d(result._values, ids)
         if cast:
@@ -1496,6 +1497,7 @@ def _transform_fast(self, result: DataFrame, func_nm: str) -> DataFrame:
         # for each col, reshape to to size of original frame
         # by take operation
         ids, _, ngroup = self.grouper.group_info
+        result = result.reindex(self.grouper.result_index, copy=False)
         output = []
         for i, _ in enumerate(result.columns):
             res = algorithms.take_1d(result.iloc[:, i].values, ids)
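
The added `reindex` call can be read as an alignment step: `ids` holds one positional group id per input row, so the aggregated `result` needs exactly one row per group, in group-id order, before `take_1d` gathers from it. The sketch below illustrates that idea with public pandas and NumPy calls only. The names `expanded`, `observed_index`, and `ids` are hypothetical stand-ins, the zero fill values are assumed for illustration, and none of this is the pandas internal code path.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated result that was expanded to every combination
# of the two keys, including combinations that never occur in the data
# (assumed here to be filled with 0).
full_index = pd.MultiIndex.from_product(
    [["a", "b", "c"], ["a", "b"]], names=["A", "C"]
)
expanded = pd.Series([4, 0, 0, 2, 0, 0], index=full_index, name="B")

# Hypothetical index of the groups that actually occur, in group-id order.
observed_index = pd.MultiIndex.from_tuples(
    [("a", "a"), ("b", "b")], names=["A", "C"]
)

# One group id per input row: rows 0 and 2 are in group 0, row 1 in group 1.
ids = np.array([0, 1, 0])

# Gathering positionally from the expanded result reads the wrong rows.
misaligned = expanded.to_numpy()[ids]

# Reindexing to the observed groups first makes positions match the ids.
aligned = expanded.reindex(observed_index).to_numpy()[ids]

print(misaligned.tolist())  # [4, 0, 4]
print(aligned.tolist())     # [4, 2, 4]
```

In this toy setup the second row is wrong without the reindex because position 1 of the expanded result is the unobserved pair `("a", "b")` rather than `("b", "b")`.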
33 changes: 33 additions & 0 deletions pandas/tests/groupby/transform/test_transform.py
@@ -1205,3 +1205,36 @@ def test_transform_lambda_indexing():
         ),
     )
     tm.assert_frame_equal(result, expected)
+
+
+def test_categorical_and_not_categorical_key(observed):
+    # Checks that groupby-transform, when grouping by both a categorical
+    # and a non-categorical key, doesn't try to expand the output to include
+    # non-observed categories but instead matches the input shape.
+    # GH 32494
+    df_with_categorical = pd.DataFrame(
+        {
+            "A": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
+            "B": [1, 2, 3],
+            "C": ["a", "b", "a"],
+        }
+    )
+    df_without_categorical = pd.DataFrame(
+        {"A": ["a", "b", "a"], "B": [1, 2, 3], "C": ["a", "b", "a"]}
+    )
+
+    # DataFrame case
+    result = df_with_categorical.groupby(["A", "C"], observed=observed).transform("sum")
+    expected = df_without_categorical.groupby(["A", "C"]).transform("sum")
+    tm.assert_frame_equal(result, expected)
+    expected_explicit = pd.DataFrame({"B": [4, 2, 4]})
+    tm.assert_frame_equal(result, expected_explicit)
+
+    # Series case
+    result = df_with_categorical.groupby(["A", "C"], observed=observed)["B"].transform(
+        "sum"
+    )
+    expected = df_without_categorical.groupby(["A", "C"])["B"].transform("sum")
+    tm.assert_series_equal(result, expected)
+    expected_explicit = pd.Series([4, 2, 4], name="B")
+    tm.assert_series_equal(result, expected_explicit)
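
A minimal standalone check of the behaviour this test covers, assuming a pandas version that includes this fix. It uses only the public `groupby(...).transform` API and the same data as the test; the loop over `observed` stands in for the `observed` fixture, and the `results` dictionary is an illustrative name.

```python
import pandas as pd

# Same data as the test: "A" is categorical with an unused category "c",
# "C" is a plain object column.
df = pd.DataFrame(
    {
        "A": pd.Categorical(["a", "b", "a"], categories=["a", "b", "c"]),
        "B": [1, 2, 3],
        "C": ["a", "b", "a"],
    }
)

results = {}
for observed in (True, False):
    # Group by one categorical and one non-categorical key, then broadcast
    # each group's sum back onto the original rows.
    out = df.groupby(["A", "C"], observed=observed)["B"].transform("sum")
    results[observed] = out.tolist()
    print(observed, results[observed])
```

With the fix, both settings of `observed` should give `[4, 2, 4]`, matching the `expected_explicit` values in the test.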