Series groupby does not include zero or nan counts for all categorical labels, unlike DataFrame groupby #17605

Closed
@nmusolino

Description

Steps to reproduce

In [1]: import pandas

In [2]: df = pandas.DataFrame({'type': pandas.Categorical(['AAA', 'AAA', 'B', 'C']),
   ...:                        'voltage': pandas.Series([1.5, 1.5, 1.5, 1.5]),
   ...:                        'treatment': pandas.Categorical(['T', 'C', 'T', 'C'])})

In [3]: df.groupby(['treatment', 'type']).count()
Out[3]:
                voltage
treatment type
C         AAA       1.0
          B         NaN
          C         1.0
T         AAA       1.0
          B         1.0
          C         NaN

In [4]: df.groupby(['treatment', 'type'])['voltage'].count()
Out[4]:
treatment  type
C          AAA     1
           C       1
T          AAA     1
           B       1
Name: voltage, dtype: int64

Problem description

When performing a groupby on categorical columns, categories with empty groups should be present in output. That is, the multi-index of the object returned by count() should contain the Cartesian product of all the labels of the first categorical column ("treatment" in the example above) and the second categorical column ("type") by which the grouping was performed.

The behavior in cell [3] above is correct. But in cell [4], after obtaining a pandas.core.groupby.SeriesGroupBy object, the series returned by the count() method does not have entries for every combination of categories: the index values (C, B) and (T, C) are missing.
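
To make the expected index concrete, here is a minimal sketch that builds the Cartesian product of the category labels with `pandas.MultiIndex.from_product`. The variable names `treatment`, `type_` and `expected_index` are illustrative only and do not appear in the original report.

```python
import pandas

# Same categorical data as in the reproduction steps.
treatment = pandas.Categorical(['T', 'C', 'T', 'C'])
type_ = pandas.Categorical(['AAA', 'AAA', 'B', 'C'])

# Cartesian product of the category labels of both grouping columns.
expected_index = pandas.MultiIndex.from_product(
    [treatment.categories, type_.categories],
    names=['treatment', 'type'],
)

# Lists all six (treatment, type) pairs, including the empty groups.
print(list(expected_index))
```

The pairs (C, B) and (T, C) are part of this index even though no row of the DataFrame falls into those groups.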

Expected Output

The output from cell [4] should be equivalent to the output below: a series of length 6 that includes entries for the index values (C, B) and (T, C).

In [5]: df.groupby(['treatment', 'type']).count().squeeze()
Out[5]:
treatment  type
C          AAA     1.0
           B       NaN
           C       1.0
T          AAA     1.0
           B       1.0
           C       NaN
Name: voltage, dtype: float64

Workaround

Perform column access after calling count():

In [7]: df.groupby(['treatment', 'type']).count()['voltage']
Out[7]:
treatment  type
C          AAA     1.0
           B       NaN
           C       1.0
T          AAA     1.0
           B       1.0
           C       NaN
Name: voltage, dtype: float64
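
Another possible workaround, sketched here under the assumption that a count of 0 (rather than NaN) is acceptable for empty groups, is to reindex the series result against the full Cartesian product of the categories. The names `full_index`, `counts` and `full_counts` are hypothetical and chosen for illustration.

```python
import pandas

df = pandas.DataFrame({
    'type': pandas.Categorical(['AAA', 'AAA', 'B', 'C']),
    'voltage': pandas.Series([1.5, 1.5, 1.5, 1.5]),
    'treatment': pandas.Categorical(['T', 'C', 'T', 'C']),
})

# Every (treatment, type) combination, whether or not it occurs in the data.
full_index = pandas.MultiIndex.from_product(
    [df['treatment'].cat.categories, df['type'].cat.categories],
    names=['treatment', 'type'],
)

# Series-level count, which may omit empty groups as in cell [4].
counts = df.groupby(['treatment', 'type'])['voltage'].count()

# Reindex so that any missing combination appears with a count of 0.
full_counts = counts.reindex(full_index, fill_value=0)
print(full_counts)
```

Because the reindex is driven by the categories rather than by the groupby result, this sketch does not depend on whether the groupby itself emits the empty groups.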

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None
