Description
xref #13992
Some of the issues and inconsistencies I noticed.
(Apologies for a bit lengthy input.)
Settings:
df = pd.DataFrame({"A": [2, 1, 2, 2],
"B": [3, 3, 4, 4],
"C1": pd.Categorical([5, 6, 6, 6], [5, 6, 7]),
"C2": pd.Categorical(list("aaab"), list("abc")),
"D": pd.date_range('2011-11-11', periods=4),
"E": [10, 20, 30, 40]})
df
Out[7]:
A B C1 C2 D E
0 2 3 5 a 2011-11-11 10
1 1 3 6 a 2011-11-12 20
2 2 4 6 a 2011-11-13 30
3 2 4 6 b 2011-11-14 40
df.dtypes
Out[8]:
A int64
B int64
C1 category
C2 category
D datetime64[ns]
E int64
dtype: object
df_ac = df[['A', 'C1', 'C2']]
df_ad = df[['A', 'D']]
df_abc = df[['A', 'B', 'C1', 'C2']]
df_abd = df[['A', 'B', 'D']]
df_abe = df[['A', 'B', 'E']]
df_acd = df[['A', 'C1', 'C2', 'D']]
df_abcd = df[['A', 'B', 'C1', 'C2', 'D']]
Usually, non-numeric types are skipped in aggregation functions:
df.groupby('A').mean()
Out[16]:
B E
A
1 3.000000 20.000000
2 3.666667 26.666667
df.groupby('A').sum()
Out[17]:
B E
A
1 3 20
2 11 80
df.groupby(['A', 'B']).mean()
Out[18]:
E
A B
1 3 20
2 3 10
4 35
df.groupby(['A', 'B'], as_index=False).sum()
Out[19]:
A B E
0 1 3 20
1 2 3 10
2 2 4 70
Issues:
But if there are no numeric types, an output varies. (I use here subframes df_xxx
of the original data frame.)
# .mean() always raises.
df_ac.groupby('A').mean() # (1)
pandas.core.base.DataError: No numeric types to aggregate
# .sum() adds categoricals
df_ac.groupby('A').sum() # (2)
Out[21]:
C1 C2
A
1 6 a
2 17 aab
# and tries to do something with datetimes.
df_ad.groupby('A').sum() # (3)
Out[22]:
D
A
1 2011-11-12
2 NaT
df_acd.groupby('A').sum() # (4)
Out[23]:
C1 C2 D
A
1 6.0 a 2011-11-12 00:00:00
2 NaN NaN NaN
# It's even worse for multiple groupers.
df_abcd.groupby(['A', 'B']).sum() # (5)
Out[24]:
A B
1 3 C1 6
C2 a
D 2011-11-12 00:00:00
2 3 C1 5
C2 a
D 2011-11-11 00:00:00
dtype: object
df_abcd.groupby(['A', 'B'], as_index=False).sum() # (6)
Out[25]:
0 A 1
B 3
C1 6
C2 a
D 2011-11-12 00:00:00
1 A 2
B 3
C1 5
C2 a
D 2011-11-11 00:00:00
2 A 4
B 8
dtype: object
# Additionally, the index is not reset here and the grouper columns are transformed.
df_abc.groupby(['A', 'B'], as_index=False).sum() # (7)
Out[26]:
A B C1 C2
A B
1 3 1 3 6 a
2 3 2 3 5 a
4 4 8 12 ab
# Sometimes an empty data frame is returned (a reasonable output):
df_abd.groupby(['A', 'B']).sum() # (8)
Out[27]:
Empty DataFrame
Columns: []
Index: []
# but not always:
df_abd.groupby(['A', 'B'], as_index=False).sum() # (9)
Out[28]:
0 A 1
B 3
D 2011-11-12 00:00:00
1 A 2
B 3
D 2011-11-11 00:00:00
2 A 4
B 8
dtype: object
df_abcd.groupby(['A', 'B']).sum() # (10)
Out[29]:
A B
1 3 C1 6
C2 a
D 2011-11-12 00:00:00
2 3 C1 5
C2 a
D 2011-11-11 00:00:00
dtype: object
Some other issues:
Multiple groupers with a categrical one (it's already addressed in #13204).
df.groupby(['A', 'C1'], as_index=False).sum() # (11)
Out[36]:
A C1 B E
A C1
1 5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
2 5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
apply
# mean() as expected:
df.groupby(['A', 'B']).apply(lambda x: np.mean(x))
Out[30]:
A B E
A B
1 3 1.0 3.0 20.0
2 3 2.0 3.0 10.0
4 2.0 4.0 35.0
# sum() not:
df.groupby(['A', 'B']).apply(lambda x: np.sum(x)) # (12)
Out[31]:
A B
1 3 A 1
B 3
C1 6
C2 a
D 2011-11-12 00:00:00
E 20
2 3 A 2
B 3
C1 5
C2 a
D 2011-11-11 00:00:00
E 10
4 A 4
B 8
E 70
dtype: object
transform
# it's ok with numeric types only
df_abe.groupby(['A', 'B']).transform('sum')
Out[32]:
E
0 10
1 20
2 70
3 70
# Doesn't transform anything with mixed types:
df.groupby(['A', 'B']).transform('sum') # (13)
Out[33]:
C1 C2 D E
0 5 a 2011-11-11 10
1 6 a 2011-11-12 20
2 6 a 2011-11-13 30
3 6 b 2011-11-14 40
# but someimtes transforms categoricals
df_abc.groupby(['A', 'B']).transform('sum') # (14)
Out[34]:
C1 C2
0 5 a
1 6 a
2 12 ab
3 12 ab
# and sometimes not:
df_abc.groupby(['A', 'B'], as_index=False).transform('sum') # (15)
Out[35]:
C1 C2
0 5 a
1 6 a
2 6 a
3 6 b
What should be the expected output?
Some ideas for aggregation (with sum()
) when there's no numeric types (1)-(10):
- always raise (as
.mean()
does) - return an empty DataFrame as in (8)
(but (a) should.mean()
do the same? (b) should groupers be returned whenas_index=False
?) - ???
output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
machine: x86_64
pandas: 0.18.1
Cython: 0.24
numpy: 1.11.0
IPython: 4.0.1
Exactly the same output with:
pandas: 0.18.1+119.gd405bf2
Just a thought about a possible approach:
A call to _python_apply_general
inside _python_agg_general
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L795
seems to trigger some of the issues. (It doesn't exclude grouper columns when as_index=False
, treats categoricals as their underlying types, and possibly just replicates some actions from earlier part of _python_agg_general
.)
The following change affects (and possibly solves, at least partially)
issues (5), (6), (7), (9), (10), (14):
diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index bea62e9..0d401e2 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -791,10 +791,7 @@ class _GroupBy(PandasObject, SelectionMixin):
except TypeError:
continue
- if len(output) == 0:
- return self._python_apply_general(f)
-
- if self.grouper._filter_empty_groups:
+ if len(output) > 0 and self.grouper._filter_empty_groups:
mask = counts.ravel() > 0
for name, result in compat.iteritems(output):