
GroupBy.aggregate #274

Closed

@MarcoGorelli

Description

I think the current group_by design is too limited. For example, when trying to implement the TPC-H queries, how are we supposed to write

SELECT l_returnflag,
       l_linestatus,
       Sum(l_extendedprice * ( 1 - l_discount )) AS sum_disc_price,
       Avg(l_quantity)                           AS avg_qty,
       Count(*)                                  AS count_order
FROM   lineitem
GROUP  BY l_returnflag,
          l_linestatus  

?

There are two problems with the current API:

  1. we can't mix and match sum and avg within the same group-by
  2. we can't transform columns within the group-by aggregation. For example, TPC-H requires l_extendedprice * (1 - l_discount) to be computed inside the group-by, rather than outside it

There are a couple of solutions I can think of:

1. Expressions

Once again, the expressions API may help us here (#249)

We could introduce GroupBy.aggregate, to which you can pass expressions, each of which must reduce to a one-row column per group.

So, for example, the above could be expressed with:

lineitem: DataFrame
lineitem.group_by('l_returnflag', 'l_linestatus').aggregate(
    (col('l_extendedprice') * (1 - col('l_discount'))).sum().rename('sum_disc_price'),
    col('l_quantity').mean().rename('avg_qty'),
    col('l_returnflag').count().rename('count_order'),
)

In the pandas implementation:

  • col('l_quantity').mean().rename('avg_qty') could get mapped to avg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean")
  • (col('l_extendedprice') * (1 - col('l_discount'))).sum().rename('sum_disc_price') would probably require a custom lambda function

This would be maximally flexible, but it would mean that people could write code which ends up being inefficient in the pandas implementation.
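To make that translation concrete, here is a rough sketch of how a pandas backend might handle both cases. The toy lineitem_pd frame, the grouped variable, and the final join are illustrative assumptions, not part of any existing implementation:

import pandas as pd

# Toy lineitem frame, just to keep the sketch self-contained.
lineitem_pd = pd.DataFrame(
    {
        "l_returnflag": ["A", "A", "N"],
        "l_linestatus": ["F", "F", "O"],
        "l_extendedprice": [100.0, 200.0, 300.0],
        "l_discount": [0.1, 0.2, 0.0],
        "l_quantity": [1.0, 2.0, 3.0],
    }
)

grouped = lineitem_pd.groupby(["l_returnflag", "l_linestatus"])

# The plain column aggregations map cleanly onto named aggregations...
simple_aggs = grouped.agg(
    avg_qty=pd.NamedAgg(column="l_quantity", aggfunc="mean"),
    count_order=pd.NamedAgg(column="l_returnflag", aggfunc="count"),
)

# ...whereas the derived expression needs a per-group callable, which is the
# potentially inefficient path mentioned above.
sum_disc_price = grouped.apply(
    lambda g: (g["l_extendedprice"] * (1 - g["l_discount"])).sum()
).rename("sum_disc_price")

result = simple_aggs.join(sum_disc_price)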

2. (output_column_name, input_column_name, aggregation_name, **kwargs) tuples

Alternatively, .aggregate could take tuples of (output_column_name, input_column_name, aggregation_name, kwargs), and in the case above col('l_extendedprice') * (1 - col('l_discount')) would have to be computed before entering the group_by. E.g.:

lineitem: DataFrame
lineitem = lineitem.assign(
    (col('l_extendedprice') * (1 - col('l_discount'))).rename('disc_price')
)
lineitem.group_by('l_returnflag', 'l_linestatus').aggregate(
    ('sum_disc_price', 'disc_price', 'sum', {}),
    ('avg_qty', 'l_quantity', 'mean', {}),
    ('count_order', 'l_returnflag', 'count', {}),
)

Maybe we could also define our own dataclass to make the above more readable
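For example, a minimal sketch of what such a dataclass could look like; the name Aggregation and its fields are placeholders, and the usage reuses the proposed group_by/aggregate API from above:

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Aggregation:
    # Hypothetical container mirroring the (output, input, aggregation, kwargs) tuple.
    output_column_name: str
    input_column_name: str
    aggregation_name: str
    kwargs: dict[str, Any] = field(default_factory=dict)

lineitem.group_by('l_returnflag', 'l_linestatus').aggregate(
    Aggregation('sum_disc_price', 'disc_price', 'sum'),
    Aggregation('avg_qty', 'l_quantity', 'mean'),
    Aggregation('count_order', 'l_returnflag', 'count'),
)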

Technically this might not be totally in accordance with the TPC-H rules on not reordering, but it would avoid requiring lambda functions for the pandas implementation.


Which solution is best?

  • if we're going for "what should the API of a new dataframe library look like?", then I prefer option 1
  • if we're going for "a minimal API which all major dataframe libraries today can agree on" then I prefer option 2

All things considered, I might be leaning more towards option 2
