Description
I think the current group_by design is too limited. For example, trying to implement tpc-h queries, how are we supposed to write
SELECT l_returnflag,
l_linestatus,
Sum(l_extendedprice * ( 1 - l_discount )) AS sum_disc_price,
Avg(l_quantity) AS avg_qty,
Count(*) AS count_order
FROM lineitem
GROUP BY l_returnflag,
l_linestatus
?
There's two problems with the current API:
- we can't mix and match
sum
andavg
within the same groupby - we can't transform columns within the group by aggregation. For example, tpc-h would require
l_extendedprice*(1-l_discount)
to be written inside the group by, rather than outside it
There's a couple of solutions I can think of
1. Expressions
Once again, the expressions API may help us here (#249)
We could introduce GroupBy.aggregate
, which you can pass an expression to and which must reduce itself to a 1-row column per group
So, for example, the above could be expressed with:
lineitem: DataFrame
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
(col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price'),
col('l_quantity').mean().rename('avg_qty'),
col('l_return_flag').count().rename('count_order'),
)
In the pandas implementation:
col('l_quantity').mean().rename('avg_qty')
could get mapped toavg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean")
(col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price')
would probably require a custom lambda function
This would be maximally flexible, but it would mean that people would write code which could be inefficient in the pandas implementation
2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples
Alternatively, .aggregate
could take a tuple of (output_column_name, input_column_name, aggregation_name), and in the case above then (col('l_extendedprice')*(1-col('discount')))
would have had to be computed before entering the group_by
. E.g.:
lineitem: DataFrame
lineitem = lineitem.assign(
(col('l_extendedprice')*(1-col('discount'))).rename('disc_price')
)
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
('sum_disc_price', 'disc_price', 'sum', {}),
('avg_qty', 'l_quantity', 'mean', {}),
('count_order', 'l_return_flag', 'count', {}),
)
Maybe we could also define our own dataclass to make the above more readable
Technically this might not be totally in accordance to tpc-h rules on not reordering, but it would avoid requiring lambda functions for the pandas implementation
Which solution is best?
- if we're going for "how should the API of a new dataframe library look like?", then I prefer option 1
- if we're going for "a minimal API which all major dataframe libraries today can agree on" then I prefer option 2
All things considered, I might be leaning more towards option 2