Description
The docs for groupby say (http://pandas.pydata.org/pandas-docs/stable/groupby.html):
Note:
Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
From the section, it's implied that this is talking about builtins and the aggregate functionality, but I very often find myself applying complicated functions to the groups themselves, so apply is my bread and butter (and this is part of a larger issue: groupby.apply has some inconsistent behavior).
import numpy as np
import pandas as pd

N = 10
df = pd.DataFrame(index=range(N), columns=['id', 'x', 'y', 'z'])
df.loc[:, ['x', 'y', 'z']] = np.arange(N*3).reshape(N, 3)
df.id = np.random.randint(0, int(N/3), (N,)) + 10
df
# id x y z
# 0 12 0 1 2
# 1 12 3 4 5
# 2 11 6 7 8
# 3 10 9 10 11
# 4 12 12 13 14
# 5 12 15 16 17
# 6 12 18 19 20
# 7 11 21 22 23
# 8 10 24 25 26
# 9 10 27 28 29
For something like sum, the groupby column gets dropped, as described:
df.groupby('id').sum()
# x y z
# id
# 10 60 63 66
# 11 27 29 31
# 12 48 53 58
But using the same function in apply gives a different result: mainly, the groupby column does not get removed (and the dtype also changes to float):
df.groupby('id', as_index=True).apply(lambda gr: gr.sum())
# id x y z
# id
# 10 30.0 60.0 63.0 66.0
# 11 22.0 27.0 29.0 31.0
# 12 60.0 48.0 53.0 58.0
Ideally, I'd like to make the behaviour of groupby.apply more consistent in a number of cases, and this is one of them.
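For reference, here is a minimal sketch of one way to get apply to drop the groupby column today: select the value columns explicitly before calling apply. The fixed ids and the names value_cols, via_sum and via_apply are my own choices for illustration (the ids copy the random draw shown above so the example is deterministic), and the frame is built with integer columns, so this does not reproduce the dtype change.

```python
import pandas as pd

# Fixed ids (copied from the random draw shown above) so the result is deterministic.
df = pd.DataFrame({
    'id': [12, 12, 11, 10, 12, 12, 12, 11, 10, 10],
    'x': list(range(0, 30, 3)),
    'y': list(range(1, 30, 3)),
    'z': list(range(2, 30, 3)),
})

# Hypothetical helper name: the columns we actually want to aggregate.
value_cols = ['x', 'y', 'z']

# Built-in aggregation: 'id' becomes the index and is not a column.
via_sum = df.groupby('id')[value_cols].sum()

# apply on the same column selection: the function never sees 'id',
# so it cannot end up summed into the result.
via_apply = df.groupby('id')[value_cols].apply(lambda gr: gr.sum())

print(via_sum)
print(via_apply)
```

This only sidesteps the inconsistency by hiding the grouping column from the applied function; it does not change what apply does when no columns are selected.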