Description
Not sure if this is a bug or expected behaviour, but it's something that catches me out constantly.
import pandas as pd

# Dummy data. Each row is a day and I'm recording where I worked,
# what percentage of my time was spent coding,
# and how many cups of coffee I drank
df = pd.DataFrame({
    'place': ['office', 'cafe', 'office', 'cafe', 'office',
              'cafe', 'cafe', 'cafe', 'office', 'office'],
    'percentage': [5, 15, 72, 24, 34, 35, 37, 47, 78, 67],
    'coffee': [1, 2, 0, 0, 1, 2, 0, 0, 1, 1]})

# this includes zero counts
df.groupby(pd.cut(df['percentage'], range(0, 100, 10))).size()

# this doesn't
df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)),
     'place']
).size()
Problem description
I want to look at the distribution of my percentage column, so I group by a pd.cut of it so that I can use sensible bins. This is the first groupby in the code example. The output is as expected:
percentage
(0, 10] 1
(10, 20] 1
(20, 30] 1
(30, 40] 3
(40, 50] 1
(50, 60] 0
(60, 70] 1
(70, 80] 2
(80, 90] 0
dtype: int64
It includes the bins for which there were zero observations, (50, 60] and (80, 90].
Now I also want to group by place. This is the second groupby in the code example, and now my empty groups disappear:
percentage place
(0, 10] office 1
(10, 20] cafe 1
(20, 30] cafe 1
(30, 40] cafe 2
office 1
(40, 50] cafe 1
(60, 70] office 1
(70, 80] office 2
dtype: int64
When I unstack this to make a summary table, the empty percentage bins are missing.
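Concretely, this is just the second groupby from above followed by unstack:

df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)),
     'place']
).size().unstack()

which gives: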
place cafe office
percentage
(0, 10] NaN 1.0
(10, 20] 1.0 NaN
(20, 30] 1.0 NaN
(30, 40] 2.0 1.0
(40, 50] 1.0 NaN
(60, 70] NaN 1.0
(70, 80] NaN 2.0
This will cause a problem when I go to plot this data, as the spacing on my 'percentage' axis will not be consistent. If I'm just counting the size of the bins then I can get round the problem by reindexing with the original categories:
df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)),
     'place']
).size().unstack().reindex(
    pd.cut(df['percentage'], range(0, 100, 10)).cat.categories
).fillna(0).stack()
percentage place
(0, 10] cafe 0.0
office 1.0
(10, 20] cafe 1.0
office 0.0
(20, 30] cafe 1.0
office 0.0
(30, 40] cafe 2.0
office 1.0
(40, 50] cafe 1.0
office 0.0
(50, 60] cafe 0.0
office 0.0
(60, 70] cafe 0.0
office 1.0
(70, 80] cafe 0.0
office 2.0
(80, 90] cafe 0.0
office 0.0
dtype: float64
but this seems like a complicated solution for what should be a relatively common problem. And if I want to do some other type of aggregation, e.g. calculate the mean number of cups of coffee for each group, I can't figure it out:
df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)),
     'place']
)['coffee'].mean()
percentage place
(0, 10] office 1.0
(10, 20] cafe 2.0
(20, 30] cafe 0.0
(30, 40] cafe 1.0
office 1.0
(40, 50] cafe 0.0
(60, 70] office 1.0
(70, 80] office 0.5
I have missing groups again, but I can't fix this with the unstack/reindex/fillna/stack approach above, because here I don't want to fill in a value: I want to leave it as missing data, but still have the group appear.
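For illustration, the closest I can get to the shape I'm after is to reuse that reindex trick but stack with dropna=False instead of filling (assuming I have understood dropna=False correctly); it keeps the NaNs, but it is still the same kind of workaround:

# keeps every (bin, place) pair, with NaN for the empty groups
df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)),
     'place']
)['coffee'].mean().unstack().reindex(
    pd.cut(df['percentage'], range(0, 100, 10)).cat.categories
).stack(dropna=False)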
Reading through previous issues, it sounds like I am describing the same thing as an earlier issue, but that thread says it's fixed, so I can't figure out what is different in my situation.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.8.1
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None