Skip to content

BUG: groupby.describe on a frame with duplicate column names #50846

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Feb 3, 2023

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Jan 18, 2023

ASVs are below.

       before           after         ratio
     [a063af0e]       [185e4f8e]
     <groupby_select_obj_dup_cols~2>       <groupby_select_obj_dup_cols>
-        769±10μs          694±6μs     0.90  groupby.GroupByMethods.time_dtype_as_group('uint', 'first', 'transformation', 5)
-         758±3μs          683±9μs     0.90  groupby.GroupByMethods.time_dtype_as_group('int16', 'last', 'transformation', 5)
-     11.9±0.04ms       10.7±0.2ms     0.90  groupby.AggEngine.time_dataframe_cython(True)
-         846±4μs          759±6μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'transformation', 5)
-         857±5μs          769±3μs     0.90  groupby.GroupByMethods.time_dtype_as_group('float', 'max', 'transformation', 5)
-         758±3μs          678±2μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'last', 'transformation', 5)
-         763±3μs          682±8μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'first', 'transformation', 5)
-         766±1μs          683±4μs     0.89  groupby.GroupByMethods.time_dtype_as_group('datetime', 'max', 'transformation', 5)
-         775±2μs          691±7μs     0.89  groupby.GroupByMethods.time_dtype_as_group('int16', 'min', 'transformation', 5)
-         860±8μs          766±7μs     0.89  groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'transformation', 5)
-         775±3μs          689±8μs     0.89  groupby.GroupByMethods.time_dtype_as_group('int16', 'first', 'transformation', 5)
-         847±4μs          752±9μs     0.89  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'transformation', 5)
-         759±2μs          674±2μs     0.89  groupby.GroupByMethods.time_dtype_as_group('uint', 'prod', 'transformation', 5)
-         709±3μs         627±10μs     0.88  groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'transformation', 5)
-         765±2μs          676±4μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int16', 'prod', 'transformation', 5)
-        854±10μs         754±10μs     0.88  groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'transformation', 5)
-         766±4μs         675±20μs     0.88  groupby.GroupByMethods.time_dtype_as_group('uint', 'max', 'transformation', 5)
-         775±3μs          683±6μs     0.88  groupby.GroupByMethods.time_dtype_as_group('datetime', 'min', 'transformation', 5)
-         670±1μs          590±9μs     0.88  groupby.GroupByMethods.time_dtype_as_group('uint', 'cumcount', 'transformation', 5)
-         767±4μs         674±20μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int16', 'sum', 'transformation', 5)
-         777±2μs         683±10μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'transformation', 5)
-         858±7μs          753±7μs     0.88  groupby.GroupByMethods.time_dtype_as_group('float', 'last', 'transformation', 5)
-         772±5μs          678±1μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int', 'last', 'transformation', 5)
-         661±2μs          579±1μs     0.88  groupby.GroupByMethods.time_dtype_as_group('int16', 'cumcount', 'transformation', 5)
-         588±2μs          513±1μs     0.87  groupby.GroupByMethods.time_dtype_as_group('object', 'cumcount', 'transformation', 5)
-         779±3μs          680±9μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int16', 'max', 'transformation', 5)
-         781±3μs         677±20μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'max', 'transformation', 5)
-        772±10μs          669±9μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'sum', 'transformation', 5)
-         663±3μs          574±9μs     0.87  groupby.GroupByMethods.time_dtype_as_group('int', 'cumcount', 'transformation', 5)
-         769±7μs         665±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('uint', 'sum', 'transformation', 5)
-         770±7μs         666±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('int', 'prod', 'transformation', 5)
-         645±2μs          557±1μs     0.86  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cumcount', 'transformation', 5)
-         729±4μs         629±10μs     0.86  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'transformation', 5)
-         779±2μs         673±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('uint', 'min', 'transformation', 5)
-         717±3μs         618±20μs     0.86  groupby.GroupByMethods.time_dtype_as_group('object', 'first', 'transformation', 5)
-         781±2μs         665±20μs     0.85  groupby.GroupByMethods.time_dtype_as_group('int', 'min', 'transformation', 5)
-        782±20μs         665±20μs     0.85  groupby.GroupByMethods.time_dtype_as_group('uint', 'last', 'transformation', 5)
-         580±6μs          491±7μs     0.85  groupby.TransformNaN.time_first
-         448±2μs         353±10μs     0.79  groupby.SumBools.time_groupby_sum_booleans
-         374±1μs          292±1μs     0.78  groupby.GroupByMethods.time_dtype_as_group('int', 'cumcount', 'direct', 5)
-         372±3μs          284±2μs     0.76  groupby.GroupByMethods.time_dtype_as_group('datetime', 'cumcount', 'direct', 5)
-         358±3μs          272±1μs     0.76  groupby.GroupByMethods.time_dtype_as_group('object', 'cumcount', 'direct', 5)
-         376±7μs        283±0.7μs     0.75  groupby.GroupByMethods.time_dtype_as_group('uint', 'cumcount', 'direct', 5)
-         374±2μs          279±3μs     0.75  groupby.GroupByMethods.time_dtype_as_group('float', 'cumcount', 'direct', 5)
-         382±7μs          283±7μs     0.74  groupby.GroupByMethods.time_dtype_as_group('int16', 'cumcount', 'direct', 5)
-         181±2μs         85.9±1μs     0.47  groupby.GroupByMethods.time_dtype_as_group('float', 'sum', 'direct', 5)
-         175±3μs       82.9±0.3μs     0.47  groupby.GroupByMethods.time_dtype_as_group('int', 'sum', 'direct', 5)
-         175±2μs       82.9±0.2μs     0.47  groupby.GroupByMethods.time_dtype_as_group('int16', 'first', 'direct', 5)
-       178±0.2μs       84.1±0.9μs     0.47  groupby.GroupByMethods.time_dtype_as_group('int', 'first', 'direct', 5)
-         180±3μs         83.8±2μs     0.47  groupby.GroupByMethods.time_dtype_as_group('float', 'max', 'direct', 5)
-       177±0.9μs       82.3±0.2μs     0.47  groupby.GroupByMethods.time_dtype_as_group('uint', 'min', 'direct', 5)
-         175±1μs         81.4±1μs     0.47  groupby.GroupByMethods.time_dtype_as_group('int16', 'sum', 'direct', 5)
-         179±1μs       83.2±0.9μs     0.46  groupby.GroupByMethods.time_dtype_as_group('int16', 'max', 'direct', 5)
-         179±3μs         82.7±1μs     0.46  groupby.GroupByMethods.time_dtype_as_group('uint', 'max', 'direct', 5)
-         181±2μs       83.6±0.6μs     0.46  groupby.GroupByMethods.time_dtype_as_group('float', 'min', 'direct', 5)
-         180±1μs       82.9±0.6μs     0.46  groupby.GroupByMethods.time_dtype_as_group('uint', 'first', 'direct', 5)
-         178±1μs       81.9±0.3μs     0.46  groupby.GroupByMethods.time_dtype_as_group('int', 'min', 'direct', 5)
-         178±1μs       81.4±0.7μs     0.46  groupby.GroupByMethods.time_dtype_as_group('float', 'last', 'direct', 5)
-         177±1μs       81.3±0.5μs     0.46  groupby.GroupByMethods.time_dtype_as_group('uint', 'sum', 'direct', 5)
-         180±2μs       82.4±0.1μs     0.46  groupby.GroupByMethods.time_dtype_as_group('int', 'max', 'direct', 5)
-         176±1μs         80.6±2μs     0.46  groupby.GroupByMethods.time_dtype_as_group('datetime', 'max', 'direct', 5)
-       170±0.4μs         77.5±1μs     0.46  groupby.GroupByMethods.time_dtype_as_group('datetime', 'last', 'direct', 5)
-         175±3μs       79.7±0.2μs     0.46  groupby.GroupByMethods.time_dtype_as_group('float', 'prod', 'direct', 5)
-         179±1μs       81.5±0.3μs     0.46  groupby.GroupByMethods.time_dtype_as_group('int16', 'min', 'direct', 5)
-         179±4μs         81.5±2μs     0.45  groupby.GroupByMethods.time_dtype_as_group('float', 'first', 'direct', 5)
-       174±0.5μs         79.0±3μs     0.45  groupby.GroupByMethods.time_dtype_as_group('datetime', 'first', 'direct', 5)
-         175±1μs         79.4±2μs     0.45  groupby.GroupByMethods.time_dtype_as_group('datetime', 'min', 'direct', 5)
-         172±2μs         77.2±1μs     0.45  groupby.GroupByMethods.time_dtype_as_group('int', 'last', 'direct', 5)
-         170±2μs       75.6±0.3μs     0.44  groupby.GroupByMethods.time_dtype_as_group('uint', 'prod', 'direct', 5)
-       170±0.9μs       75.5±0.3μs     0.44  groupby.GroupByMethods.time_dtype_as_group('int', 'prod', 'direct', 5)
-         170±2μs       75.5±0.7μs     0.44  groupby.GroupByMethods.time_dtype_as_group('int16', 'prod', 'direct', 5)
-         172±4μs       76.2±0.2μs     0.44  groupby.GroupByMethods.time_dtype_as_group('int16', 'last', 'direct', 5)
-       175±0.6μs         75.7±2μs     0.43  groupby.GroupByMethods.time_dtype_as_group('uint', 'last', 'direct', 5)
-         158±1μs         66.0±2μs     0.42  groupby.GroupByMethods.time_dtype_as_group('object', 'first', 'direct', 5)
-         160±2μs         65.5±1μs     0.41  groupby.GroupByMethods.time_dtype_as_group('object', 'last', 'direct', 5)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@rhshadrach rhshadrach marked this pull request as draft January 19, 2023 14:44
@rhshadrach rhshadrach mentioned this pull request Jan 19, 2023
5 tasks
…adrach/pandas into groupby_select_obj_dup_cols

# Conflicts:
#	pandas/core/groupby/groupby.py
…adrach/pandas into groupby_select_obj_dup_cols

# Conflicts:
#	pandas/core/groupby/groupby.py
@rhshadrach rhshadrach marked this pull request as ready for review January 20, 2023 21:47
@rhshadrach
Copy link
Member Author

rhshadrach commented Jan 24, 2023

@mroeschke - what do you think about the conditional logic based on the number of columns here?

@rhshadrach rhshadrach requested a review from mroeschke January 26, 2023 22:39
@@ -726,7 +726,9 @@ def _selected_obj(self):

if self._selection is None or isinstance(self.obj, Series):
if self._group_selection is not None:
return self.obj[self._group_selection]
return self.obj._take(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the same as _obj_with_exclusions, which should already be cached, so we could avoid making a copy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great - not only that, but we can also avoid all the code that determines _group_selection. I've turned _group_selection into a Boolean flag.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we expect this change to affect the timings in the OP?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASVs updated; essentially the same results.


if groupby_func in ("size", "ngroup", "cumcount"):
expected = getattr(
df.take([0, 1], axis=1).groupby("a", as_index=as_index), groupby_func
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: can you avoid chaining take/gropby/getattr here (and in L1639)? easier to grok if something goes wrong

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@rhshadrach
Copy link
Member Author

@mroeschke @jbrockmendel - friendly ping.

…pby_select_obj_dup_cols

# Conflicts:
#	pandas/core/groupby/groupby.py
…adrach/pandas into groupby_select_obj_dup_cols

# Conflicts:
#	pandas/core/groupby/groupby.py
@mroeschke mroeschke added this to the 2.0 milestone Feb 3, 2023
@mroeschke mroeschke merged commit 50d288e into pandas-dev:main Feb 3, 2023
@mroeschke
Copy link
Member

Thanks @rhshadrach

@rhshadrach rhshadrach deleted the groupby_select_obj_dup_cols branch April 2, 2023 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby.describe on a frame with duplicate column names
4 participants