Column reductions should return 1-row Column #297

Closed
@MarcoGorelli

Description

Let's talk column reductions.

I see two use cases for them:

  • a user wants the exact value of a scalar, right now:
    df: DataFrame
    df.col('a').mean()
  • a user just needs to use the scalar as part of another operation, so it can stay lazy if necessary:
    df: DataFrame
    df.assign((df.col('a') - df.col('a').mean()).rename('a_centered'))

The Standard currently defines the return value of Column.mean to be Scalar. Implementations are supposed to figure out which of the two cases above the user wants.

I have two problems with this:

  • I really don't like anything related to implicit materialisation (so long as we're defining a top-level Python API)
  • we have an inconsistency with the DataFrame case:
    • DataFrame.mean returns a 1-row DataFrame
    • Column.mean returns a Scalar

Proposal

Column reductions return 1-row Columns (just like how DataFrame reductions return 1-row DataFrames).

Broadcasting rules: in a binary operation between an n-row Column and a 1-row Column, the 1-row Column is broadcast to length n. So column - column.mean() is well-defined, and everything can stay lazy if necessary.
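
To make the broadcasting rule concrete, here is a minimal eager sketch. `ToyColumn` and its `_binary` helper are hypothetical names made up for illustration; this is not the Standard's `Column` class, only a toy that follows the rule as described above.

```python
from __future__ import annotations

import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyColumn:
    """Hypothetical eager column, used only to illustrate the proposal."""

    values: tuple[float, ...]

    def mean(self) -> ToyColumn:
        # The reduction returns a 1-row column rather than a scalar.
        return ToyColumn((sum(self.values) / len(self.values),))

    def _binary(
        self, other: ToyColumn, op: Callable[[float, float], float]
    ) -> ToyColumn:
        left, right = self.values, other.values
        # Broadcast a 1-row operand to the length of the other operand.
        if len(left) == 1 and len(right) != 1:
            left = left * len(right)
        elif len(right) == 1 and len(left) != 1:
            right = right * len(left)
        if len(left) != len(right):
            raise ValueError("lengths must match, or one operand must have 1 row")
        return ToyColumn(tuple(op(a, b) for a, b in zip(left, right)))

    def __sub__(self, other: ToyColumn) -> ToyColumn:
        return self._binary(other, operator.sub)


col = ToyColumn((1.0, 2.0, 6.0))
centered = col - col.mean()  # 3-row column minus 1-row column
print(centered.values)
```

Because `mean()` returns the same type as its input, the subtraction never leaves the column API, and two columns of mismatched lengths (neither of them 1-row) still raise an error.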

If someone really needs the value of a reduction right now, they can call .get_value(0). The behaviour of the resulting scalar may vary between implementations, but I think that's fine.
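
As a rough sketch of how a lazy implementation could defer work until `.get_value(0)` is called: `LazyColumn`, `load`, and `calls` below are hypothetical names for illustration only, and a real implementation would build a query plan rather than wrap a Python callable.

```python
from __future__ import annotations

from typing import Callable


class LazyColumn:
    """Hypothetical lazy column: stores a thunk and computes only on demand."""

    def __init__(self, compute: Callable[[], list[float]]) -> None:
        self._compute = compute

    def mean(self) -> LazyColumn:
        # Build a new deferred 1-row column; nothing is evaluated here.
        def thunk() -> list[float]:
            values = self._compute()
            return [sum(values) / len(values)]

        return LazyColumn(thunk)

    def get_value(self, row: int) -> float:
        # Explicit materialisation point: the user asked for the value now.
        return self._compute()[row]


calls: list[str] = []


def load() -> list[float]:
    # Records each time the underlying data is actually read.
    calls.append("load")
    return [1.0, 2.0, 6.0]


col = LazyColumn(load)
mean_col = col.mean()          # still lazy: load() has not run
calls_before = list(calls)
value = mean_col.get_value(0)  # materialises the 1-row column
```

The point of the sketch is that materialisation happens only where the user writes `.get_value(0)`, so the implementation never has to guess which of the two use cases is intended.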

At least for the (much more common) case where reductions are used as part of other operations, the operations can now stay completely within the DataFrame API, the rules become predictable, and everything is well-defined.
