BUG: groupby.describe on a frame with duplicate column names #50846

Merged on Feb 3, 2023 (34 commits)
Commits
aa9c9e1
REF: groupby Series selection with as_index=False
rhshadrach Dec 28, 2022
7d00d07
GH#
rhshadrach Jan 14, 2023
fd62b4e
Merge branch 'main' of https://github.com/pandas-dev/pandas into seri…
rhshadrach Jan 14, 2023
c0891db
Merge branch 'main' into series_as_index_false
rhshadrach Jan 14, 2023
6bcfb12
Merge branch 'main' of https://github.com/pandas-dev/pandas into seri…
rhshadrach Jan 16, 2023
41399ad
type-hinting fixes
rhshadrach Jan 16, 2023
c26957d
WIP
rhshadrach Jan 17, 2023
f2b538e
Merge branch 'main' of https://github.com/pandas-dev/pandas into owe_…
rhshadrach Jan 17, 2023
1860c4d
WIP
rhshadrach Jan 18, 2023
e42e222
WIP
rhshadrach Jan 18, 2023
0bdf009
BUG: groupby.describe on a frame with duplicate column names
rhshadrach Dec 28, 2022
185e4f8
cleanup
rhshadrach Jan 18, 2023
d2b965f
test fixup
rhshadrach Jan 19, 2023
932e3c8
Fix type-hint for _group_selection
rhshadrach Jan 19, 2023
5139df8
Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
rhshadrach Jan 19, 2023
8f132cd
Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…
rhshadrach Jan 20, 2023
eeea6fc
Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…
rhshadrach Jan 20, 2023
feb6661
Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
rhshadrach Jan 20, 2023
83f12b7
Speedup
rhshadrach Jan 20, 2023
c37a1ab
refinement
rhshadrach Jan 20, 2023
973b893
Merge branch 'main' into groupby_select_obj_dup_cols
rhshadrach Jan 24, 2023
78a3d5f
Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
rhshadrach Jan 25, 2023
4dafe5a
cleanup, faster implementation
rhshadrach Jan 25, 2023
0959c1b
Merge branch 'main' into groupby_select_obj_dup_cols
rhshadrach Jan 29, 2023
2fc97b2
Merge branch 'main' into groupby_select_obj_dup_cols
rhshadrach Jan 30, 2023
d5df78c
Make group_selection a Boolean flag
rhshadrach Jan 31, 2023
62bb1fb
Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…
rhshadrach Jan 31, 2023
8d6df54
Avoid resetting cache
rhshadrach Jan 31, 2023
62540af
Improve test
rhshadrach Feb 1, 2023
f7a6973
Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
rhshadrach Feb 1, 2023
615d9c6
Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
rhshadrach Feb 1, 2023
88a9ec9
Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…
rhshadrach Feb 3, 2023
359d7ff
Rework test
rhshadrach Feb 3, 2023
d1d2610
Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…
rhshadrach Feb 3, 2023
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -1263,6 +1263,7 @@ Groupby/resample/rolling
- Bug in :meth:`.SeriesGroupBy.value_counts` did not respect ``sort=False`` (:issue:`50482`)
- Bug in :meth:`.DataFrameGroupBy.resample` raises ``KeyError`` when getting the result from a key list when resampling on time index (:issue:`50840`)
- Bug in :meth:`.DataFrameGroupBy.transform` and :meth:`.SeriesGroupBy.transform` would raise incorrectly when grouper had ``axis=1`` for ``"ngroup"`` argument (:issue:`45986`)
- Bug in :meth:`.DataFrameGroupBy.describe` produced incorrect results when data had duplicate columns (:issue:`50806`)
-

Reshaping
40 changes: 40 additions & 0 deletions pandas/tests/groupby/test_function.py
@@ -1256,6 +1256,27 @@ def test_describe_with_duplicate_output_column_names(as_index, keys):
    tm.assert_frame_equal(result, expected)


def test_describe_duplicate_columns():
    # GH#50806
    df = DataFrame([[0, 1, 2, 3]])
    df.columns = [0, 1, 2, 0]
    gb = df.groupby(df[1])
    result = gb.describe(percentiles=[])

    columns = ["count", "mean", "std", "min", "50%", "max"]
    frames = [
        DataFrame([[1.0, val, np.nan, val, val, val]], index=[1], columns=columns)
        for val in (0.0, 2.0, 3.0)
    ]
    expected = pd.concat(frames, axis=1)
    expected.columns = MultiIndex(
        levels=[[0, 2], columns],
        codes=[6 * [0] + 6 * [1] + 6 * [0], 3 * list(range(6))],
    )
    expected.index.names = [1]
    tm.assert_frame_equal(result, expected)
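
The `codes` argument is probably the least obvious part of the expected columns in this test. The following is a minimal sketch, using only the public `MultiIndex` constructor, of how those codes expand into the duplicated outer labels. The names `stats`, `outer_labels`, and `inner_labels` are illustrative and do not appear in the PR.

```python
import pandas as pd

stats = ["count", "mean", "std", "min", "50%", "max"]

# Level 0 stores each distinct column label once ([0, 2]); the codes index
# into that level, so the pattern 0, 1, 0 yields the labels 0, 2, 0 in the
# same order as the non-grouping columns of the test frame.
columns = pd.MultiIndex(
    levels=[[0, 2], stats],
    codes=[6 * [0] + 6 * [1] + 6 * [0], 3 * list(range(6))],
)

outer_labels = columns.get_level_values(0).tolist()
inner_labels = columns.get_level_values(1).tolist()

print(outer_labels)
print(columns.is_unique)
```

Because the label `0` occurs twice in the frame, the resulting column index is intentionally non-unique, which is why the test builds it from `levels` and `codes` rather than from a product of unique labels.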


def test_groupby_mean_no_overflow():
    # Regression test for (#22487)
    df = DataFrame(
@@ -1596,3 +1617,22 @@ def test_multiindex_group_all_columns_when_empty(groupby_func):
    result = method(*args).index
    expected = df.index
    tm.assert_index_equal(result, expected)


def test_duplicate_columns(request, groupby_func, as_index):
    # GH#50806
    if groupby_func == "corrwith":
        msg = "GH#50845 - corrwith fails when there are duplicate columns"
        request.node.add_marker(pytest.mark.xfail(reason=msg))
    df = DataFrame([[1, 3, 6], [1, 4, 7], [2, 5, 8]], columns=list("abb"))
    args = get_groupby_method_args(groupby_func, df)
    gb = df.groupby("a", as_index=as_index)
    result = getattr(gb, groupby_func)(*args)

    expected_df = df.set_axis(["a", "b", "c"], axis=1)
    expected_args = get_groupby_method_args(groupby_func, expected_df)
    expected_gb = expected_df.groupby("a", as_index=as_index)
    expected = getattr(expected_gb, groupby_func)(*expected_args)
    if groupby_func not in ("size", "ngroup", "cumcount"):
        expected = expected.rename(columns={"c": "b"})
    tm.assert_equal(result, expected)
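
The test above compares each groupby method on a frame with duplicate labels against the same method on a uniquely labelled copy, renamed back afterwards. A minimal standalone sketch of that strategy for a single method, `sum`, is given below. It uses only public pandas API and leaves out the `groupby_func` and `as_index` fixtures and the `get_groupby_method_args` helper from the pandas test suite; picking `sum` is an assumption made for illustration.

```python
import pandas as pd

# Same data as the test: the label "b" is used for two different columns.
df = pd.DataFrame([[1, 3, 6], [1, 4, 7], [2, 5, 8]], columns=list("abb"))
result = df.groupby("a").sum()

# Reference computation: give the columns unique labels, aggregate, then
# rename "c" back to "b" so the labels match the duplicated original.
unique_df = df.set_axis(["a", "b", "c"], axis=1)
expected = unique_df.groupby("a").sum().rename(columns={"c": "b"})

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(result, expected)
print(result)
```

The appeal of this design is that the expected result never has to be written out by hand for each method: the uniquely labelled frame acts as the oracle.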
10 changes: 10 additions & 0 deletions pandas/tests/groupby/test_groupby.py
@@ -2828,3 +2828,13 @@ def test_groupby_reduce_period():
    expected = ser[:10]
    expected.index = Index(range(10), dtype=np.int_)
    tm.assert_series_equal(res, expected)


def test_obj_with_exclusions_duplicate_columns():
    # GH#50806
    df = DataFrame([[0, 1, 2, 3]])
    df.columns = [0, 1, 2, 0]
    gb = df.groupby(df[1])
    result = gb._obj_with_exclusions
    expected = df.take([0, 2, 3], axis=1)
    tm.assert_frame_equal(result, expected)
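
The expected value in this test is written with positional `take` rather than a label lookup. The sketch below, which uses only public pandas API, shows how the two kinds of selection differ on this frame. It illustrates why a positional expectation is unambiguous here and is not a description of the PR's internal change; the names `by_position` and `by_label` are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3]])
df.columns = [0, 1, 2, 0]

# Positional selection keeps the remaining columns in their original order.
by_position = df.take([0, 2, 3], axis=1)

# Label-based selection returns every column matching the first label
# before moving on to the next label, so the two 0 columns end up adjacent.
by_label = df[[0, 2]]

print(by_position.columns.tolist(), by_position.iloc[0].tolist())
print(by_label.columns.tolist(), by_label.iloc[0].tolist())
```

With duplicate labels, a list of labels is not enough to say which columns are meant or in what order, whereas a list of positions is.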