BUG: groupby.describe on a frame with duplicate column names #50846

rhshadrach · 2023-01-18T22:30:27Z

closes BUG: groupby.describe on a frame with duplicate column names #50806
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
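For readers without the linked issue at hand, the following is a minimal sketch of the kind of call the title refers to: `describe` on a grouped frame whose column labels are not unique. The data values are made up for illustration, and the pre-fix output is not shown here because this page does not include it.

```python
import pandas as pd

# Hypothetical data: two columns share the label "b".
df = pd.DataFrame(
    [[1, 10, 100], [1, 20, 200], [2, 30, 300]],
    columns=["a", "b", "b"],
)

# Group on "a" and summarize the remaining (duplicate-labelled) columns.
result = df.groupby("a").describe()

# One row per group; the columns pair each column label with a statistic.
print(result.shape)
print(result.columns.get_level_values(0).unique().tolist())
```

What the result looks like depends on the installed pandas version, so treat this only as a reproduction starting point.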

ASVs are below.

       before           after         ratio
     [a063af0e]       [185e4f8e]
     <groupby_select_obj_dup_cols~2>       <groupby_select_obj_dup_cols>
-        769±10μs          694±6μs     0.90  groupby.GroupByMethods.time_dtype_as_group('uint', 'first', 'transformation', 5)
-         758±3μs          683±9μs     0.90  groupby.GroupByMethods.time_dtype_as_group('int16', 'last', 'transformation', 5)
-     11.9±0.04ms       10.7±0.2ms     0.90  groupby.AggEngine.time_dataframe_cython(True)
-         846±4μs          759±6μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'transformation', 5)
-         857±5μs          769±3μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'max', 'transformation', 5)
-         758±3μs          678±2μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'last', 'transformation', 5)
-         763±3μs          682±8μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'first', 'transformation', 5)
-         766±1μs          683±4μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'max', 'transformation', 5)
-         775±2μs          691±7μs     0.89  groupby.GroupByMethods.time_dtype_as_group('int16', 'min', 'transformation', 5)
-         860±8μs          766±7μs     0.89  groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'transformation', 5)
-         775±3μs          689±8μs     0.89  groupby.GroupByMethods.time_dtype_as_group('int16', 'first', 'transformation', 5)
-         847±4μs          752±9μs     0.89  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'transformation', 5)
-         759±2μs          674±2μs     0.89  groupby.GroupByMethods.time_dtype_as_group('uint', 'prod', 'transformation', 5)
-         709±3μs         627±10μs     0.88  groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'transformation', 5)
-         765±2μs          676±4μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int16', 'prod', 'transformation', 5)
-        854±10μs         754±10μs     0.88  groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'transformation', 5)
-         766±4μs         675±20μs     0.88  groupby.GroupByMethods.time_dtype_as_group('uint', 'max', 'transformation', 5)
-         775±3μs          683±6μs     0.88  groupby.GroupByMethods.time_dtype_as_group('datetime', 'min', 'transformation', 5)
-         670±1μs          590±9μs     0.88  groupby.GroupByMethods.time_dtype_as_group('uint', 'cumcount', 'transformation', 5)
-         767±4μs         674±20μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int16', 'sum', 'transformation', 5)
-         777±2μs         683±10μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'transformation', 5)
-         858±7μs          753±7μs     0.88  groupby.GroupByMethods.time_dtype_as_group('float', 'last', 'transformation', 5)
-         772±5μs          678±1μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int', 'last', 'transformation', 5)
-         661±2μs          579±1μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int16', 'cumcount', 'transformation', 5)
-         588±2μs          513±1μs     0.87  groupby.GroupByMethods.time_dtype_as_group('object', 'cumcount', 'transformation', 5)
-         779±3μs          680±9μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int16', 'max', 'transformation', 5)
-         781±3μs         677±20μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'max', 'transformation', 5)
-        772±10μs          669±9μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'sum', 'transformation', 5)
-         663±3μs          574±9μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'cumcount', 'transformation', 5)
-         769±7μs         665±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('uint', 'sum', 'transformation', 5)
-         770±7μs         666±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('int', 'prod', 'transformation', 5)
-         645±2μs          557±1μs     0.86  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cumcount', 'transformation', 5)
-         729±4μs         629±10μs     0.86  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'transformation', 5)
-         779±2μs         673±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('uint', 'min', 'transformation', 5)
-         717±3μs         618±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('object', 'first', 'transformation', 5)
-         781±2μs         665±20μs     0.85  groupby.GroupByMethods.time_dtype_as_group('int', 'min', 'transformation', 5)
-        782±20μs         665±20μs     0.85  groupby.GroupByMethods.time_dtype_as_group('uint', 'last', 'transformation', 5)
-         580±6μs          491±7μs     0.85  groupby.TransformNaN.time_first
-         448±2μs         353±10μs     0.79  groupby.SumBools.time_groupby_sum_booleans
-         374±1μs          292±1μs     0.78  groupby.GroupByMethods.time_dtype_as_group('int', 'cumcount', 'direct', 5)
-         372±3μs          284±2μs     0.76  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cumcount', 'direct', 5)
-         358±3μs          272±1μs     0.76  groupby.GroupByMethods.time_dtype_as_group('object', 'cumcount', 'direct', 5)
-         376±7μs        283±0.7μs     0.75  groupby.GroupByMethods.time_dtype_as_group('uint', 'cumcount', 'direct', 5)
-         374±2μs          279±3μs     0.75  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'direct', 5)
-         382±7μs          283±7μs     0.74  groupby.GroupByMethods.time_dtype_as_group('int16', 'cumcount', 'direct', 5)
-         181±2μs         85.9±1μs     0.47  groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'direct', 5)
-         175±3μs       82.9±0.3μs     0.47  groupby.GroupByMethods.time_dtype_as_group('int', 'sum', 'direct', 5)
-         175±2μs       82.9±0.2μs     0.47  groupby.GroupByMethods.time_dtype_as_group('int16', 'first', 'direct', 5)
-       178±0.2μs       84.1±0.9μs     0.47  groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'direct', 5)
-         180±3μs         83.8±2μs     0.47  groupby.GroupByMethods.time_dtype_as_group('float', 'max', 'direct', 5)
-       177±0.9μs       82.3±0.2μs     0.47  groupby.GroupByMethods.time_dtype_as_group('uint', 'min', 'direct', 5)
-         175±1μs         81.4±1μs     0.47  groupby.GroupByMethods.time_dtype_as_group('int16', 'sum', 'direct', 5)
-         179±1μs       83.2±0.9μs     0.46  groupby.GroupByMethods.time_dtype_as_group('int16', 'max', 'direct', 5)
-         179±3μs         82.7±1μs     0.46  groupby.GroupByMethods.time_dtype_as_group('uint', 'max', 'direct', 5)
-         181±2μs       83.6±0.6μs     0.46  groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'direct', 5)
-         180±1μs       82.9±0.6μs     0.46  groupby.GroupByMethods.time_dtype_as_group('uint', 'first', 'direct', 5)
-         178±1μs       81.9±0.3μs     0.46  groupby.GroupByMethods.time_dtype_as_group('int', 'min', 'direct', 5)
-         178±1μs       81.4±0.7μs     0.46  groupby.GroupByMethods.time_dtype_as_group('float', 'last', 'direct', 5)
-         177±1μs       81.3±0.5μs     0.46  groupby.GroupByMethods.time_dtype_as_group('uint', 'sum', 'direct', 5)
-         180±2μs       82.4±0.1μs     0.46  groupby.GroupByMethods.time_dtype_as_group('int', 'max', 'direct', 5)
-         176±1μs         80.6±2μs     0.46  groupby.GroupByMethods.time_dtype_as_group('datetime', 'max', 'direct', 5)
-       170±0.4μs         77.5±1μs     0.46  groupby.GroupByMethods.time_dtype_as_group('datetime', 'last', 'direct', 5)
-         175±3μs       79.7±0.2μs     0.46  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'direct', 5)
-         179±1μs       81.5±0.3μs     0.46  groupby.GroupByMethods.time_dtype_as_group('int16', 'min', 'direct', 5)
-         179±4μs         81.5±2μs     0.45  groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'direct', 5)
-       174±0.5μs         79.0±3μs     0.45  groupby.GroupByMethods.time_dtype_as_group('datetime', 'first', 'direct', 5)
-         175±1μs         79.4±2μs     0.45  groupby.GroupByMethods.time_dtype_as_group('datetime', 'min', 'direct', 5)
-         172±2μs         77.2±1μs     0.45  groupby.GroupByMethods.time_dtype_as_group('int', 'last', 'direct', 5)
-         170±2μs       75.6±0.3μs     0.44  groupby.GroupByMethods.time_dtype_as_group('uint', 'prod', 'direct', 5)
-       170±0.9μs       75.5±0.3μs     0.44  groupby.GroupByMethods.time_dtype_as_group('int', 'prod', 'direct', 5)
-         170±2μs       75.5±0.7μs     0.44  groupby.GroupByMethods.time_dtype_as_group('int16', 'prod', 'direct', 5)
-         172±4μs       76.2±0.2μs     0.44  groupby.GroupByMethods.time_dtype_as_group('int16', 'last', 'direct', 5)
-       175±0.6μs         75.7±2μs     0.43  groupby.GroupByMethods.time_dtype_as_group('uint', 'last', 'direct', 5)
-         158±1μs         66.0±2μs     0.42  groupby.GroupByMethods.time_dtype_as_group('object', 'first', 'direct', 5)
-         160±2μs         65.5±1μs     0.41  groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'direct', 5)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

rhshadrach · 2023-01-24T21:47:55Z

@mroeschke - what do you think about the conditional logic based on the number of columns here?

jbrockmendel · 2023-01-31T02:46:24Z

pandas/core/groupby/groupby.py

@@ -726,7 +726,9 @@ def _selected_obj(self):

        if self._selection is None or isinstance(self.obj, Series):
            if self._group_selection is not None:
-                return self.obj[self._group_selection]
+                return self.obj._take(


I think this is the same as _obj_with_exclusions, which should already be cached, so we could avoid making a copy.

rhshadrach · 2023-01-31

This is great - not only that, but we can also avoid all the code that determines _group_selection. I've turned _group_selection into a Boolean flag.

jbrockmendel · 2023-01-31

Should we expect this change to affect the timings in the OP?

rhshadrach · 2023-02-01

ASVs updated; essentially the same results.
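The hunk in this thread swaps a label-based lookup (`self.obj[...]`) for a positional take. A small sketch of why that distinction matters when column labels repeat is below. It uses the public `DataFrame.take` as a stand-in for the private `_take` call in the diff, and the frame, `labels`, and `positions` are illustrative assumptions rather than values from the PR.

```python
import pandas as pd

# Hypothetical frame: the label "b" appears twice.
df = pd.DataFrame([[1, 10, 100], [2, 20, 200]], columns=["a", "b", "b"])

# Label-based selection: each requested "b" matches both "b" columns,
# so asking for ["b", "b"] yields every match for every label.
labels = ["b", "b"]
by_label = df[labels]

# Positional selection: each integer position picks exactly one column.
positions = [1, 2]
by_position = df.take(positions, axis=1)

print(by_label.shape)
print(by_position.shape)
```

Selecting by position keeps one output column per input column, which is the property a groupby needs when it drops the grouping column and keeps the rest.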

jbrockmendel · 2023-01-31T20:12:12Z

pandas/tests/groupby/test_function.py

+
+    if groupby_func in ("size", "ngroup", "cumcount"):
+        expected = getattr(
+            df.take([0, 1], axis=1).groupby("a", as_index=as_index), groupby_func


Nitpick: can you avoid chaining take/groupby/getattr here (and in L1639)? It's easier to grok if something goes wrong.
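One way to read the request above is to give each step its own name, so a failing traceback points at a single operation. The sketch below is not the test as merged: `expected_for` is a hypothetical helper and the frame is made-up data; only the `take([0, 1], axis=1)`, `groupby("a", as_index=...)`, and `getattr` steps come from the hunk.

```python
import pandas as pd


def expected_for(df, groupby_func, as_index):
    """Compute the expected result one step at a time (hypothetical helper)."""
    # Step 1: keep only the first two columns, selected by position.
    subset = df.take([0, 1], axis=1)
    # Step 2: build the groupby object on column "a".
    gb = subset.groupby("a", as_index=as_index)
    # Step 3: look up the method by name, then call it.
    method = getattr(gb, groupby_func)
    return method()


# Hypothetical frame with a duplicated column label.
df = pd.DataFrame([[1, 2, 3], [1, 5, 6]], columns=["a", "b", "b"])
for name in ("size", "ngroup", "cumcount"):
    print(name, expected_for(df, name, as_index=True).tolist())
```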

rhshadrach · 2023-02-03T02:53:31Z

@mroeschke @jbrockmendel - friendly ping.

mroeschke · 2023-02-03T18:13:01Z

Thanks @rhshadrach

rhshadrach and others added 11 commits January 14, 2023 10:39

REF: groupby Series selection with as_index=False

aa9c9e1

GH#

7d00d07

Merge branch 'main' of https://github.com/pandas-dev/pandas into seri…

fd62b4e

…es_as_index_false

Merge branch 'main' into series_as_index_false

c0891db

Merge branch 'main' of https://github.com/pandas-dev/pandas into seri…

6bcfb12

…es_as_index_false

type-hinting fixes

41399ad

WIP

c26957d

Merge branch 'main' of https://github.com/pandas-dev/pandas into owe_…

f2b538e

…vs_so

WIP

1860c4d

WIP

e42e222

BUG: groupby.describe on a frame with duplicate column names

0bdf009

rhshadrach added Bug Groupby labels Jan 18, 2023

cleanup

185e4f8

rhshadrach commented Jan 18, 2023

pandas/core/groupby/groupby.py (outdated, resolved)

rhshadrach marked this pull request as draft January 19, 2023 14:44

rhshadrach added 3 commits January 19, 2023 16:22

test fixup

d2b965f

Fix type-hint for _group_selection

932e3c8

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

5139df8

…pby_select_obj_dup_cols

rhshadrach mentioned this pull request Jan 19, 2023

Remove obj_with_exclusions #50878

Closed

5 tasks

rhshadrach added 5 commits January 19, 2023 22:07

Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…

8f132cd

…adrach/pandas into groupby_select_obj_dup_cols # Conflicts: # pandas/core/groupby/groupby.py

Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…

eeea6fc

…adrach/pandas into groupby_select_obj_dup_cols # Conflicts: # pandas/core/groupby/groupby.py

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

feb6661

…pby_select_obj_dup_cols

Speedup

83f12b7

refinement

c37a1ab

rhshadrach marked this pull request as ready for review January 20, 2023 21:47

Merge branch 'main' into groupby_select_obj_dup_cols

973b893

mroeschke reviewed Jan 25, 2023

pandas/tests/groupby/test_groupby_dropna.py (outdated, resolved)

mroeschke reviewed Jan 25, 2023

pandas/core/groupby/groupby.py (outdated, resolved)

rhshadrach added 2 commits January 25, 2023 15:46

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

78a3d5f

…pby_select_obj_dup_cols

cleanup, faster implementation

4dafe5a

rhshadrach requested a review from mroeschke January 26, 2023 22:39

rhshadrach added 2 commits January 29, 2023 07:57

Merge branch 'main' into groupby_select_obj_dup_cols

0959c1b

Merge branch 'main' into groupby_select_obj_dup_cols

2fc97b2

rhshadrach mentioned this pull request Jan 31, 2023

CLN/DOC: _selected_obj vs _obj_with_exclusions #46944

Open

jbrockmendel reviewed Jan 31, 2023

rhshadrach added 3 commits January 30, 2023 22:51

Make group_selection a Boolean flag

d5df78c

Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…

62bb1fb

…adrach/pandas into groupby_select_obj_dup_cols # Conflicts: # pandas/core/groupby/groupby.py

Avoid resetting cache

8d6df54

jbrockmendel reviewed Jan 31, 2023

rhshadrach added 3 commits January 31, 2023 20:19

Improve test

62540af

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

f7a6973

…pby_select_obj_dup_cols

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

615d9c6

…pby_select_obj_dup_cols # Conflicts: # pandas/core/groupby/groupby.py

rhshadrach added 3 commits February 2, 2023 22:49

Merge branch 'main' of https://github.com/pandas-dev/pandas into grou…

88a9ec9

…pby_select_obj_dup_cols # Conflicts: # pandas/core/groupby/groupby.py

Rework test

359d7ff

Merge branch 'groupby_select_obj_dup_cols' of https://github.com/rhsh…

d1d2610

…adrach/pandas into groupby_select_obj_dup_cols # Conflicts: # pandas/core/groupby/groupby.py

mroeschke approved these changes Feb 3, 2023

mroeschke added this to the 2.0 milestone Feb 3, 2023

mroeschke merged commit 50d288e into pandas-dev:main Feb 3, 2023

rhshadrach deleted the groupby_select_obj_dup_cols branch April 2, 2023 14:21
