Skip to content

Inconsistencies in groupby aggregation with non-numeric types #13416

Closed
@pijucha

Description

@pijucha

xref #13992

Some of the issues and inconsistencies I noticed.
(Apologies for a bit lengthy input.)

Settings:
df = pd.DataFrame({"A": [2, 1, 2, 2],
                   "B": [3, 3, 4, 4],
                   "C1": pd.Categorical([5, 6, 6, 6], [5, 6, 7]),
                   "C2": pd.Categorical(list("aaab"), list("abc")),
                   "D": pd.date_range('2011-11-11', periods=4),
                   "E": [10, 20, 30, 40]})
df
Out[7]: 
   A  B C1 C2          D   E
0  2  3  5  a 2011-11-11  10
1  1  3  6  a 2011-11-12  20
2  2  4  6  a 2011-11-13  30
3  2  4  6  b 2011-11-14  40

df.dtypes
Out[8]: 
A              int64
B              int64
C1          category
C2          category
D     datetime64[ns]
E              int64
dtype: object

df_ac = df[['A', 'C1', 'C2']]
df_ad = df[['A', 'D']]
df_abc = df[['A', 'B', 'C1', 'C2']]
df_abd = df[['A', 'B', 'D']]
df_abe = df[['A', 'B', 'E']]
df_acd = df[['A', 'C1', 'C2', 'D']]
df_abcd = df[['A', 'B', 'C1', 'C2', 'D']]

Usually, non-numeric types are skipped in aggregation functions:

df.groupby('A').mean()
Out[16]: 
          B          E
A                     
1  3.000000  20.000000
2  3.666667  26.666667

df.groupby('A').sum()
Out[17]: 
    B   E
A        
1   3  20
2  11  80

df.groupby(['A', 'B']).mean()
Out[18]: 
      E
A B    
1 3  20
2 3  10
  4  35

df.groupby(['A', 'B'], as_index=False).sum()
Out[19]: 
   A  B   E
0  1  3  20
1  2  3  10
2  2  4  70

Issues:

But if there are no numeric types, an output varies. (I use here subframes df_xxx of the original data frame.)

# .mean() always raises.
df_ac.groupby('A').mean()                               # (1)
pandas.core.base.DataError: No numeric types to aggregate

# .sum() adds categoricals
df_ac.groupby('A').sum()                                # (2)
Out[21]: 
   C1   C2
A         
1   6    a
2  17  aab

# and tries to do something with datetimes.
df_ad.groupby('A').sum()                                # (3)
Out[22]: 
           D
A           
1 2011-11-12
2        NaT

df_acd.groupby('A').sum()                               # (4)
Out[23]: 
    C1   C2                    D
A                               
1  6.0    a  2011-11-12 00:00:00
2  NaN  NaN                  NaN

# It's even worse for multiple groupers.
df_abcd.groupby(['A', 'B']).sum()                       # (5)
Out[24]: 
A  B    
1  3  C1                      6
      C2                      a
      D     2011-11-12 00:00:00
2  3  C1                      5
      C2                      a
      D     2011-11-11 00:00:00
dtype: object

df_abcd.groupby(['A', 'B'], as_index=False).sum()       # (6)
Out[25]: 
0  A                       1
   B                       3
   C1                      6
   C2                      a
   D     2011-11-12 00:00:00
1  A                       2
   B                       3
   C1                      5
   C2                      a
   D     2011-11-11 00:00:00
2  A                       4
   B                       8
dtype: object

# Additionally, the index is not reset here and the grouper columns are transformed.
df_abc.groupby(['A', 'B'], as_index=False).sum()        # (7)
Out[26]: 
     A  B  C1  C2
A B              
1 3  1  3   6   a
2 3  2  3   5   a
  4  4  8  12  ab

# Sometimes an empty data frame is returned  (a reasonable output):
df_abd.groupby(['A', 'B']).sum()                        # (8)
Out[27]: 
Empty DataFrame
Columns: []
Index: []

# but not always:
df_abd.groupby(['A', 'B'], as_index=False).sum()        # (9)
Out[28]: 
0  A                      1
   B                      3
   D    2011-11-12 00:00:00
1  A                      2
   B                      3
   D    2011-11-11 00:00:00
2  A                      4
   B                      8
dtype: object

df_abcd.groupby(['A', 'B']).sum()                       # (10)
Out[29]: 
A  B    
1  3  C1                      6
      C2                      a
      D     2011-11-12 00:00:00
2  3  C1                      5
      C2                      a
      D     2011-11-11 00:00:00
dtype: object
Some other issues:

Multiple groupers with a categrical one (it's already addressed in #13204).

df.groupby(['A', 'C1'], as_index=False).sum()           # (11)
Out[36]: 
       A  C1   B   E
A C1                
1 5  NaN NaN NaN NaN
  6  NaN NaN NaN NaN
  7  NaN NaN NaN NaN
2 5  NaN NaN NaN NaN
  6  NaN NaN NaN NaN
  7  NaN NaN NaN NaN

apply

# mean() as expected:
df.groupby(['A', 'B']).apply(lambda x: np.mean(x))
Out[30]: 
       A    B     E
A B                
1 3  1.0  3.0  20.0
2 3  2.0  3.0  10.0
  4  2.0  4.0  35.0

# sum() not:
df.groupby(['A', 'B']).apply(lambda x: np.sum(x))       # (12)
Out[31]: 
A  B    
1  3  A                       1
      B                       3
      C1                      6
      C2                      a
      D     2011-11-12 00:00:00
      E                      20
2  3  A                       2
      B                       3
      C1                      5
      C2                      a
      D     2011-11-11 00:00:00
      E                      10
   4  A                       4
      B                       8
      E                      70
dtype: object

transform

# it's ok with numeric types only
df_abe.groupby(['A', 'B']).transform('sum')
Out[32]: 
    E
0  10
1  20
2  70
3  70

# Doesn't transform anything with mixed types:
df.groupby(['A', 'B']).transform('sum')                 # (13)
Out[33]: 
  C1 C2          D   E
0  5  a 2011-11-11  10
1  6  a 2011-11-12  20
2  6  a 2011-11-13  30
3  6  b 2011-11-14  40

# but someimtes transforms categoricals
df_abc.groupby(['A', 'B']).transform('sum')             # (14)
Out[34]: 
   C1  C2
0   5   a
1   6   a
2  12  ab
3  12  ab

# and sometimes not:
df_abc.groupby(['A', 'B'], as_index=False).transform('sum')     # (15)
Out[35]: 
  C1 C2
0  5  a
1  6  a
2  6  a
3  6  b

What should be the expected output?

Some ideas for aggregation (with sum()) when there's no numeric types (1)-(10):

  1. always raise (as .mean() does)
  2. return an empty DataFrame as in (8)
    (but (a) should .mean() do the same? (b) should groupers be returned when as_index=False?)
  3. ???

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
machine: x86_64

pandas: 0.18.1
Cython: 0.24
numpy: 1.11.0
IPython: 4.0.1

Exactly the same output with:

pandas: 0.18.1+119.gd405bf2
Just a thought about a possible approach:

A call to _python_apply_general inside _python_agg_general
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L795
seems to trigger some of the issues. (It doesn't exclude grouper columns when as_index=False, treats categoricals as their underlying types, and possibly just replicates some actions from earlier part of _python_agg_general.)

The following change affects (and possibly solves, at least partially)
issues (5), (6), (7), (9), (10), (14):

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index bea62e9..0d401e2 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -791,10 +791,7 @@ class _GroupBy(PandasObject, SelectionMixin):
             except TypeError:
                 continue

-        if len(output) == 0:
-            return self._python_apply_general(f)
-
-        if self.grouper._filter_empty_groups:
+        if len(output) > 0 and self.grouper._filter_empty_groups:

             mask = counts.ravel() > 0
             for name, result in compat.iteritems(output):

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions