Description
Is your feature request related to a problem?
When using a "fluent"/"method chaining" style of programming with pandas, a common hiccup is the lack of a method for selecting columns from a DataFrameGroupBy object. For example, take this DataFrame:
import pandas as pd
df = pd.DataFrame(dict(
    x=[1, 2, 3, 4, 5, 6],
    y=[2, 4, 6, 1, 2, 3],
    a=["a", "b", "c", "a", "b", "c"],
    b=["x", "x", "y", "z", "z", "z"],
))
To group by one column and then return the mean of another column within each group, one has to use multiple syntactical constructs:
(
    df
    .groupby("a")
    ["x"]
    .mean()
)
There is also dot-attribute access to columns:
(
    df
    .groupby("a")
    .x
    .mean()
)
but that is not fully general: it won't work if your column name collides with an existing method, and it won't work if the column is defined by a variable.
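To make the two failure modes concrete, here is a minimal sketch. The frame `df_collide` and its `count` column are hypothetical, chosen only because `count` is also the name of a GroupBy method.

```python
import pandas as pd

# Hypothetical frame whose value column shares its name with a GroupBy method.
df_collide = pd.DataFrame({
    "count": [1, 2, 3, 4],
    "a": ["a", "b", "a", "b"],
})
grouped = df_collide.groupby("a")

# Attribute access resolves to the GroupBy.count method, not the "count" column.
attr_is_method = callable(grouped.count)

# Bracket access selects the column, and it also accepts a variable.
col = "count"
sums = grouped[col].sum()
```

Here `grouped.col` would not help either, since it looks for a column literally named "col".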
Another option is to select the column first, then group by a Series rather than a column name:
(
    df["x"]
    .groupby(df["a"])
    .mean()
)
This is not so bad. Well, ideally each step in the pipeline would be on a new line, and it requires some duplicated typing. But the bigger issue is that it fails when you want to group by more than one column:
(
    df["x"]
    .groupby(df[["a", "b"]])
    .mean()
)
Traceback
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-edc38a1f02b1> in <module>
1 (
----> 2 df["x"]
3 .groupby(df[["a", "b"]])
4 .mean()
5 )
~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/series.py in groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
1689 axis = self._get_axis_number(axis)
1690
-> 1691 return SeriesGroupBy(
1692 obj=self,
1693 keys=by,
~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/groupby.py in __init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
558 from pandas.core.groupby.grouper import get_grouper
559
--> 560 grouper, exclusions, obj = get_grouper(
561 obj,
562 keys,
~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/grouper.py in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
826 # allow us to passing the actual Grouping as the gpr
827 ping = (
--> 828 Grouping(
829 group_axis,
830 gpr,
~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/core/groupby/grouper.py in __init__(self, index, grouper, obj, name, level, sort, observed, in_axis, dropna)
541 if getattr(self.grouper, "ndim", 1) != 1:
542 t = self.name or str(type(self.grouper))
--> 543 raise ValueError(f"Grouper for '{t}' not 1-dimensional")
544 self.grouper = self.index.map(self.grouper)
545 if not (
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
(I actually feel like this really ought to work, independently of whether it's the best solution to this Feature Request, but let's leave it aside for now).
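For reference, one workaround that does run is to pass a list of one-dimensional Series instead of a two-column DataFrame. This is only a sketch of the existing behavior, not the proposed feature, and it still requires repeating `df` for every key.

```python
import pandas as pd

df = pd.DataFrame(dict(
    x=[1, 2, 3, 4, 5, 6],
    y=[2, 4, 6, 1, 2, 3],
    a=["a", "b", "c", "a", "b", "c"],
    b=["x", "x", "y", "z", "z", "z"],
))

# Each grouping key is passed as its own 1-D Series, so no
# "not 1-dimensional" error is raised.
result = (
    df["x"]
    .groupby([df["a"], df["b"]])
    .mean()
)
```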
So in summary: there's no fully general, non-awkward method for selecting a column from a DataFrameGroupBy. While certainly not fatal, it does make fluent pandas harder to read[1] and write[2]. And this is not an esoteric application: groupby/select/apply must be one of the most common pandas workflows.
[1] The great advantage of this method chaining approach for understanding code is that you can read off the sequence of verbs that comprise the operation. Introducing the [column] syntax requires your brain to shift from comprehension mode to production mode (you need to generate a verb for what's happening), which is a costly operation that likely (on the margin) impairs comprehension.
[2] I really like fluent pandas but it currently can be a little bit annoying to fight with autoindent/autoformat tools when writing it. It likely would be easier to improve that tooling if one could assume each pipeline step begins with a dotted method access.
Describe the solution you'd like
The DataFrameGroupBy object could have a column selection method.
Possible names:
DataFrameGroupBy.get
This exists in DataFrame as a thin wrapper around self[arg]. Possibly adds confusion with DataFrameGroupBy.get_group.
DataFrameGroupBy.select
This is the method I have reached for more than once. I'm broadly aware of the history of NDFrame.select; this name should now be available, though it could cause some confusion to reintroduce it with different semantics (but xref #26642).
It probably wouldn't make sense to add .select only to DataFrameGroupBy and not to DataFrame/NDFrame. So this would be more work (but also more benefit?)
In code, that might look like
(
    df
    .groupby("a")
    .select("x")
    .mean()
)
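Until such a method exists, the same shape can be approximated with the existing `pipe` method on groupby objects. The `select` function below is a hypothetical stand-in for the proposed method, not part of pandas; it simply delegates to bracket indexing.

```python
import pandas as pd

def select(grouped, columns):
    """Hypothetical helper: delegate to grouped[columns]."""
    return grouped[columns]

df = pd.DataFrame(dict(
    x=[1, 2, 3, 4, 5, 6],
    y=[2, 4, 6, 1, 2, 3],
    a=["a", "b", "c", "a", "b", "c"],
    b=["x", "x", "y", "z", "z", "z"],
))

# Every step in the chain now begins with a dotted method access.
result = (
    df
    .groupby("a")
    .pipe(select, "x")
    .mean()
)
```

This keeps the chain uniform, though `.pipe(select, "x")` is arguably noisier than a dedicated `.select("x")` would be.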
API breaking implications
No breakage, unless someone has very old code using DataFrame.select that completely missed the original deprecation cycle and gets resurrected now.
Adding new methods (especially new ways of indexing pandas objects) definitely has an API complexity cost. But I would argue that, by converging on "one way to do it" in method chains, it could provide a net reduction in complexity from the user perspective.
Additional context
This originally got some discussion on Twitter; note especially @TomAugspurger's response.
Other relevant context would be the select verb in dplyr.