Skip to content

Inconsistant behaviour of empty groups when grouping with one vs. many  #23865

Closed
@mojones

Description

@mojones

Not sure if this is a bug or expected behaviour, but it's something that catches me out constantly.

import pandas as pd
# Dummy data. Each row is a day and I'm recording where I worked, 
# what percentage of my time was spent coding, 
# and how many cups of coffee I drank
df = pd.DataFrame({
    'place' : ['office','cafe','office','cafe','office','cafe','cafe','cafe','office','office'],
    'percentage':[5, 15, 72,24, 34, 35, 37, 47, 78, 67],
    'coffee': [1,2,0,0,1,2,0,0,1,1] })

# this includes zero counts
df.groupby(pd.cut(df['percentage'], range(0, 100, 10))).size()

# this doesn't 
df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)),
    'place']
     ).size()

Problem description

I want to look at the distribution of my percentage columns, so I do groupby so that I can use sensible bins. This is my first gropuby in the code example. The output is as expected:

percentage
(0, 10]     1
(10, 20]    1
(20, 30]    1
(30, 40]    3
(40, 50]    1
(50, 60]    0
(60, 70]    1
(70, 80]    2
(80, 90]    0
dtype: int64

Including the bins for which there were zero observations (50-60% and 80-90%).

Now I want to also group by place. This is my second groupby. Now my empty groups disappear:

percentage  place 
(0, 10]     office    1
(10, 20]    cafe      1
(20, 30]    cafe      1
(30, 40]    cafe      2
                 office    1
(40, 50]    cafe      1
(60, 70]    office    1
(70, 80]    office    2
dtype: int64

When I unstack this to make a summary table the empty percentage bins are missing:

place       cafe  office
percentage              
(0, 10]      NaN     1.0
(10, 20]     1.0     NaN
(20, 30]     1.0     NaN
(30, 40]     2.0     1.0
(40, 50]     1.0     NaN
(60, 70]     NaN     1.0
(70, 80]     NaN     2.0

This will cause a problem when I go to plot this data, as the spacing on my 'percentage' axis will not be consistent. If I'm just counting the size of the bins then I can get round the problem by reindexing with the original categories:

df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)),
    'place']
     ).size().unstack().reindex(pd.cut(df['percentage'], range(0, 100, 10)).cat.categories).fillna(0).stack()

percentage  place 
(0, 10]     cafe      0.0
            office    1.0
(10, 20]    cafe      1.0
            office    0.0
(20, 30]    cafe      1.0
            office    0.0
(30, 40]    cafe      2.0
            office    1.0
(40, 50]    cafe      1.0
            office    0.0
(50, 60]    cafe      0.0
            office    0.0
(60, 70]    cafe      0.0
            office    1.0
(70, 80]    cafe      0.0
            office    2.0
(80, 90]    cafe      0.0
            office    0.0
dtype: float64

but this seems like a complicated solution for what should be a relatively common problem. And if I want to do some other type of aggregation, e.g. calculate the mean number of cups of coffee for each group, I can't figure it out:

df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)),
    'place']
     )['coffee'].mean()

percentage  place 
(0, 10]     office    1.0
(10, 20]    cafe      2.0
(20, 30]    cafe      0.0
(30, 40]    cafe      1.0
            office    1.0
(40, 50]    cafe      0.0
(60, 70]    office    1.0
(70, 80]    office    0.5

I have missing groups, but I can't fix it with stack/fillna/unstack as I don't want to fill in a value - I want to leave it as missing data, but still have the group appear.

Reading through previous issues, it sounds like I am describing this:

#8138

but the thread says it's fixed, so I can't figure out what is different in my situation.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.8.1
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions