ENH: Method for selecting columns from DataFrameGroupBy (and maybe DataFrame) #40322

Open
@mwaskom

Is your feature request related to a problem?

When using a "fluent"/"method chaining" style of programming with pandas, a common hiccup is the lack of a method for selecting columns from a DataFrameGroupBy object. For example, take this DataFrame

import pandas as pd
df = pd.DataFrame(dict(
    x=[1, 2, 3, 4, 5, 6],
    y=[2, 4, 6, 1, 2, 3],
    a=["a", "b", "c", "a", "b", "c"],
    b=["x", "x", "y", "z", "z", "z"],
))

To group by one column and then return the mean of another column within each group, one has to use multiple syntactical constructs:

(
    df
    .groupby("a")
    ["x"]
    .mean()
)

There is also dot-attribute access to columns:

(
    df
    .groupby("a")
    .x
    .mean()
)

but that is not fully general: it won't work if your column name collides with an existing method, and it won't work if the column is defined by a variable.
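To make both limitations concrete, here is a small sketch. The frame and its column named `count` are made up for illustration; the point is only that attribute lookup finds the existing `DataFrameGroupBy.count` method before it ever considers the column, while bracket access works with any name, including one held in a variable.

```python
import pandas as pd

# Illustrative frame (not from the example above): one column is named "count",
# which collides with the DataFrameGroupBy.count method.
df_collide = pd.DataFrame({
    "a": ["a", "b", "a"],
    "count": [1, 2, 3],
})
grouped = df_collide.groupby("a")

# Attribute access resolves to the bound method, not to the column.
attr = grouped.count
attr_is_method = callable(attr)

# Bracket access works for any column name, including one stored in a variable.
col = "count"
totals = grouped[col].sum()
```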

Another option is to select the column first, then group by a Series rather than a column name:

(
    df["x"]
    .groupby(df["a"])
    .mean()
)

This is not so bad. Well, ideally each step in the pipeline would be on a new line, and it requires some duplicated typing. But the bigger issue is that it fails when you want to group by more than one column:

(
    df["x"]
    .groupby(df[["a", "b"]])
    .mean()
)
Traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-edc38a1f02b1> in <module>
      1 (
----> 2     df["x"]
      3     .groupby(df[["a", "b"]])
      4     .mean()
      5 )

~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/series.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   1689         axis = self._get_axis_number(axis)
   1690 
-> 1691         return SeriesGroupBy(
   1692             obj=self,
   1693             keys=by,

~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    558             from pandas.core.groupby.grouper import get_grouper
    559 
--> 560             grouper, exclusions, obj = get_grouper(
    561                 obj,
    562                 keys,

~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    826         # allow us to passing the actual Grouping as the gpr
    827         ping = (
--> 828             Grouping(
    829                 group_axis,
    830                 gpr,

~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis, dropna)
    541                 if getattr(self.grouper, "ndim", 1) != 1:
    542                     t = self.name or str(type(self.grouper))
--> 543                     raise ValueError(f"Grouper for '{t}' not 1-dimensional")
    544                 self.grouper = self.index.map(self.grouper)
    545                 if not (

ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional

(I actually feel like this really ought to work, independently of whether it's the best solution to this Feature Request, but let's leave it aside for now).
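For reference, one spelling that does run is to pass a list of Series instead of a two-column DataFrame. This is a sketch of a workaround rather than a fix for the complaint above: it still repeats `df` for every key, and the column selection still happens with brackets.

```python
import pandas as pd

df = pd.DataFrame(dict(
    x=[1, 2, 3, 4, 5, 6],
    y=[2, 4, 6, 1, 2, 3],
    a=["a", "b", "c", "a", "b", "c"],
    b=["x", "x", "y", "z", "z", "z"],
))

# Series.groupby accepts a list of 1-dimensional groupers, one per key.
by_series = (
    df["x"]
    .groupby([df["a"], df["b"]])
    .mean()
)

# The bracket-based spelling on the DataFrame, for comparison.
by_names = df.groupby(["a", "b"])["x"].mean()
```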

So in summary: There's no fully-general, non-awkward method for selecting a column from DataFrameGroupBy. While certainly not fatal, it does make fluent pandas harder to read[1] and write[2]. And this is not an esoteric application: groupby/select/apply must be one of the most common pandas workflows.

[1] The great advantage of this method chaining approach for understanding code is that you can read off the sequence of verbs that comprise the operation. Introducing the [column] syntax requires your brain to shift from comprehension mode to production mode — you need to generate a verb for what's happening — which is a costly operation that likely (on the margin) impairs comprehension.

[2] I really like fluent pandas but it currently can be a little bit annoying to fight with autoindent/autoformat tools when writing it. It likely would be easier to improve that tooling if one could assume each pipeline step begins with a dotted method access.

Describe the solution you'd like

The DataFrameGroupBy object could have a column selection method.

Possible names:

DataFrameGroupBy.get

This exists in DataFrame as a thin wrapper around self[arg]. Possibly adds confusion with DataFrameGroupBy.get_group

DataFrameGroupBy.select

This is the method I have reached for more than once. I'm broadly aware of the history of NDFrame.select; this name should now be available, though it could cause some confusion to reintroduce it with different semantics (but xref #26642).

It probably wouldn't make sense to add .select only to DataFrameGroupBy and not to DataFrame/NDFrame. So this would be more work (but also more benefit?)

In code, that might look like

(
    df
    .groupby("a")
    .select("x")
    .mean()
)
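Until something like that exists, the closest approximation I know of goes through the existing `DataFrameGroupBy.pipe` method. The sketch below assumes a helper named `select`, which is hypothetical and not part of pandas; it just wraps bracket indexing so that every step in the chain starts with a dot.

```python
import pandas as pd

df = pd.DataFrame(dict(
    x=[1, 2, 3, 4, 5, 6],
    y=[2, 4, 6, 1, 2, 3],
    a=["a", "b", "c", "a", "b", "c"],
    b=["x", "x", "y", "z", "z", "z"],
))


def select(grouped, columns):
    """Hypothetical helper (not a pandas API): select columns from a groupby object."""
    return grouped[columns]


# pipe calls select(grouped, "x"), so the chain stays method-only.
via_pipe = (
    df
    .groupby("a")
    .pipe(select, "x")
    .mean()
)

# The bracket-based spelling, for comparison.
via_brackets = df.groupby("a")["x"].mean()
```

This reads as a sequence of verbs, but it needs an extra helper and an extra level of indirection, which is the kind of awkwardness a built-in method would remove.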

API breaking implications

No breakage, unless someone has very old code using DataFrame.select that completely missed the original deprecation cycle and gets resurrected now.

Adding new methods (especially new ways of indexing pandas objects) definitely has an API complexity cost. But I would argue that, by converging on "one way to do it" in method chains, it could provide a net reduction in complexity from the user perspective.

Additional context

This originally got some discussion on twitter, note especially @TomAugspurger's response.

Other relevant context would be the select verb in dplyr.
