Description
I propose adding a pipe method to GroupBy objects, with the same interface as DataFrame.pipe/Series.pipe.
The use case is reusing a GroupBy object when doing calculations; see the use case example below.
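To make "the same interface as DataFrame.pipe" concrete, here is a minimal sketch of the calling convention written as a free function. The name pipe_groupby is hypothetical and is not part of pandas; the sketch only assumes that the method would forward the object to func, and that func may also be given as a (callable, data_keyword) tuple, as DataFrame.pipe accepts.

```python
def pipe_groupby(grouped, func, *args, **kwargs):
    """Hypothetical helper sketching the proposed GroupBy.pipe behaviour."""
    if isinstance(func, tuple):
        # (callable, data_keyword): pass the object under that keyword.
        func, target = func
        if target in kwargs:
            raise ValueError(
                f"{target} is both the pipe target and a keyword argument"
            )
        kwargs[target] = grouped
        return func(*args, **kwargs)
    # Default: the object becomes the first positional argument.
    return func(grouped, *args, **kwargs)


def scale(factor, data):
    """Example callable that expects the piped object as `data`."""
    return [factor * x for x in data]


# Stand-in demo with a plain list; any object works, including a GroupBy.
total = pipe_groupby([1, 2, 3], sum)
scaled = pipe_groupby([1, 2, 3], (scale, "data"), 2)
```

As a method, the first argument would simply be self, so a call would read df.groupby(...).pipe(func, *args, **kwargs).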
Use case
A pipe is useful for succinctly reusing GroupBy objects in calculations, for example calculating prices given a column of revenue and a column of quantity sold:
>>> import numpy as np
>>> import pandas as pd
>>> from numpy.random import choice
>>> n = 100_000
>>> df = pd.DataFrame({'Store': choice(['Store_1', 'Store_2'], n),
...                    'Year': choice(['Year_1', 'Year_2', 'Year_3', 'Year_4'], n),
...                    'Revenue': (np.random.random(n)*50+10).round(2),
...                    'Quantity': np.random.randint(1, 10, size=n)})
>>> df.head(2)
   Quantity  Revenue    Store    Year
0         2    14.69  Store_1  Year_1
1         9    25.89  Store_2  Year_4
Then, with .pipe, we could for example get prices per store/year like so:
>>> (df.groupby(['Store', 'Year'])
... .pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
... .unstack().round(2))
Year     Year_1  Year_2  Year_3  Year_4
Store
Store_1    6.99    6.99    7.01    6.92
Store_2    6.95    6.98    6.97    6.96
Note that the above is vectorized, and piping makes the code succinct and clear.
Alternatives to .pipe
The alternatives would be:
- use .apply,
- create a function and call it with a GroupBy object as its argument,
- create a price column.
Option 1 is not good because it is slow: .apply calls a Python function once per group instead of using the vectorized group aggregations.
A pipe is just syntactic sugar for option 2, but piping would be more readable, especially if you're already piping other steps.
Creating a concrete calculated column is in some instances the right approach, but in other cases it is better to calculate the value on the fly from the grouped data.
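The three alternatives can be compared side by side on a tiny frame. This is only a sketch: the numbers are made up so the results can be checked by hand, the function name price is hypothetical, and the third variant is one possible reading of "create a price column" rather than necessarily what was meant.

```python
import pandas as pd

# Made-up numbers chosen so the results are easy to check by hand.
df = pd.DataFrame({
    "Store": ["Store_1", "Store_1", "Store_2", "Store_2"],
    "Year": ["Year_1", "Year_1", "Year_1", "Year_1"],
    "Revenue": [10.0, 20.0, 30.0, 50.0],
    "Quantity": [1, 2, 4, 6],
})
grouped = df.groupby(["Store", "Year"])


def price(grp):
    """Option 2: a plain function taking a GroupBy object (hypothetical name)."""
    return grp["Revenue"].sum() / grp["Quantity"].sum()


# Option 1: .apply calls the Python function once per group.
via_apply = grouped[["Revenue", "Quantity"]].apply(
    lambda g: g["Revenue"].sum() / g["Quantity"].sum()
)

# Option 2: two vectorized group sums, then one division.
via_function = price(grouped)

# Option 3 (one possible reading): store a per-row price, then average it.
# This is an unweighted mean of row prices, so it can differ from
# total revenue divided by total quantity.
via_column = (
    df.assign(Price=df["Revenue"] / df["Quantity"])
      .groupby(["Store", "Year"])["Price"]
      .mean()
)
```

Options 1 and 2 give the same prices, while the per-row column in this reading answers a slightly different question, which is one reason the on-the-fly calculation can be preferable. With the proposed method, option 2 would read df.groupby(['Store', 'Year']).pipe(price).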