Description
Let's talk column reductions.
I see two use cases for them:
- a user wants the exact value of a scalar, right now:

  ```python
  df: DataFrame
  df.col('a').mean()
  ```
- a user just needs to use the scalar as part of another operation, so it can stay lazy if necessary:

  ```python
  df: DataFrame
  df.assign((df.col('a') - df.col('a').mean()).rename('a_centered'))
  ```
The Standard currently defines the return value of `Column.mean` to be `Scalar`. Implementations are supposed to figure out which of the two cases above the user wants.
I have two problems with this:
- I really don't like anything related to implicit materialisation (so long as we're defining a top-level Python API)
- we have an inconsistency with the DataFrame case: `DataFrame.mean` returns a 1-row DataFrame, whereas `Column.mean` returns a `Scalar`
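
For concreteness, here is the current behaviour side by side (a sketch only, reusing the `df` from the examples above):

```python
df.mean()           # 1-row DataFrame
df.col('a').mean()  # Scalar, under the current Standard
```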
Proposal
Column reductions return 1-row Columns (just like how DataFrame reductions return 1-row DataFrames).
Broadcasting rules: in a binary operation between an n-row Column and a 1-row Column, the 1-row Column is broadcast to length n. So `column - column.mean()` is well-defined, and everything can stay lazy if necessary.
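
As a sketch of how the centering example above would read under this proposal (same assumed `df` and method names as in the examples above; nothing here forces materialisation):

```python
df: DataFrame
mean_a = df.col('a').mean()                    # 1-row Column under this proposal
centered = df.col('a') - mean_a                # 1-row Column broadcast to length n
df = df.assign(centered.rename('a_centered'))  # can stay lazy end to end
```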
If someone really needs the value of a reduction now, they can call `.get_value(0)`. The behaviour of the returned scalar may vary between implementations, but I think that's fine.
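
And a sketch of the eager case, assuming `get_value` takes a row position and returns whatever scalar type the implementation chooses:

```python
mean_a = df.col('a').mean()  # 1-row Column, possibly lazy
value = mean_a.get_value(0)  # triggers materialisation; concrete scalar type may vary
```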
At the very least, for the (much more common) case where reductions are used as part of other operations, everything can now stay completely within the DataFrame API, the rules become predictable, and the behaviour is well-defined.