GroupBy.aggregate

I think the current group_by design is too limited. For example, trying to implement tpc-h queries, how are we supposed to write
```sql
SELECT l_returnflag,
       l_linestatus,
       Sum(l_extendedprice * ( 1 - l_discount )) AS sum_disc_price,
       Avg(l_quantity)                           AS avg_qty,
       Count(*)                                  AS count_order
FROM   lineitem
GROUP  BY l_returnflag,
          l_linestatus  
```
?

There's two problems with the current API:
1. we can't mix and match `sum` and `avg` within the same groupby
2. we can't transform columns within the group by aggregation. For example, tpc-h would require `l_extendedprice*(1-l_discount)` to be written inside the group by, rather than outside it

There's a couple of solutions I can think of

## 1. Expressions

Once again, the expressions API may help us here (https://github.com/data-apis/dataframe-api/pull/249)

We could introduce `GroupBy.aggregate`, which you can pass an expression to and which must reduce itself to a 1-row column per group

So, for example, the above could be expressed with:
```python
lineitem: DataFrame
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
    (col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price'),
    col('l_quantity').mean().rename('avg_qty'),
    col('l_return_flag').count().rename('count_order'),
)
```

In the pandas implementation:
- `col('l_quantity').mean().rename('avg_qty')` could get mapped to `avg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean")`
- `(col('l_extendedprice')*(1-col('discount'))).sum().rename('sum_disc_price')` would probably require a custom lambda function

This would be maximally flexible, but it would mean that people would write code which could be inefficient in the pandas implementation

## 2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples

Alternatively, `.aggregate` could take a tuple of (output_column_name, input_column_name, aggregation_name), and in the case above then `(col('l_extendedprice')*(1-col('discount')))` would have had to be computed before entering the `group_by`. E.g.:
```python
lineitem: DataFrame
lineitem = lineitem.assign(
    (col('l_extendedprice')*(1-col('discount'))).rename('disc_price')
)
lineitem.group_by('l_return_flag', 'l_linestatus').aggregate(
    ('sum_disc_price', 'disc_price', 'sum', {}),
    ('avg_qty', 'l_quantity', 'mean', {}),
    ('count_order', 'l_return_flag', 'count', {}),
)
```
Maybe we could also define our own dataclass to make the above more readable

Technically this might not be totally in accordance to tpc-h rules on not reordering, but it would avoid requiring lambda functions for the pandas implementation

---

Which solution is best?
- if we're going for "how should the API of a new dataframe library look like?", then I prefer option 1
- if we're going for "a minimal API which all major dataframe libraries today can agree on" then I prefer option 2

All things considered, I might be leaning more towards option 2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GroupBy.aggregate #274

1. Expressions

2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

GroupBy.aggregate #274

Description

1. Expressions

2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions