Groupby ignores unobserved combinations when passing more than one categorical column even if observed=True #27075

Closed
@harmbuisman

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df_test = pd.DataFrame()
df_test['A'] = pd.Series(np.arange(0,2), dtype='category').cat.set_categories(list(range(0,3)))
df_test['B'] = pd.Series(np.arange(10,12), dtype='category').cat.set_categories(list(range(10,13)))

print("Test DF:")
print(df_test)

print("\nThe following are as expected, unobserved categories have size = 0:")
print(df_test.groupby('A').size())
print(df_test.groupby('B').size())

print("\nThe following does not consider categories, I would expect 9 result lines here:")
print(df_test.groupby(['A','B']).size())

print("\nExpected:")
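# All 3 x 3 combinations of A and B; only (0, 10) and (1, 11) are observed.
expected = pd.DataFrame({'A': np.repeat(np.arange(0, 3), 3), 'B': list(range(10, 13)) * 3, '': [1, 0, 0, 0, 1, 0, 0, 0, 0]}).set_index(['A', 'B'])
print(expected)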

Problem description

groupby([cols]) returns a result for every category, including unobserved ones, when only one categorical column is provided (e.g. ['A']). When multiple categorical columns are provided (e.g. ['A', 'B']), it returns only the observed combinations, regardless of the setting of observed. I would expect a result for every combination of the categories of the groupby columns.

Expected Output

A result for every combination of the categories of the groupby columns. For the example above:
A B
0 10 1
0 11 0
0 12 0
1 10 0
1 11 1
1 12 0
2 10 0
2 11 0
2 12 0
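
To make the expected result concrete, here is a minimal sketch that builds it explicitly: count the observed pairs, then reindex against the full product of the declared categories. It is only an illustration of the output I have in mind, not a claim about how groupby should produce it. The names `full_index`, `sizes` and `full_sizes` are introduced here for illustration, and the keys are cast to plain integers so the result does not depend on how groupby handles categoricals.

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame()
df_test['A'] = pd.Series(np.arange(0, 2), dtype='category').cat.set_categories(list(range(0, 3)))
df_test['B'] = pd.Series(np.arange(10, 12), dtype='category').cat.set_categories(list(range(10, 13)))

# Every combination of the declared categories, observed or not.
full_index = pd.MultiIndex.from_product(
    [df_test['A'].cat.categories, df_test['B'].cat.categories],
    names=['A', 'B'],
)

# Group on plain integer keys so only observed pairs are counted,
# whatever the pandas version does with categorical groupers.
plain = df_test.astype({'A': 'int64', 'B': 'int64'})
sizes = plain.groupby(['A', 'B']).size()

# Fill in the unobserved combinations with a count of 0.
full_sizes = sizes.reindex(full_index, fill_value=0)
print(full_sizes)
```

This yields nine rows, with a count of 1 for (0, 10) and (1, 11) and 0 for the other seven combinations.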

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None

Labels

Categorical (Categorical Data Type), Groupby, Needs Tests (Unit test(s) needed to prevent regressions)
