Description
Code Sample:
import pandas as pd
# Create a DataFrame with an ordered categorical column, one category not present
df = pd.DataFrame(
dict(cat = pd.Series([3, 1, 2, 1, 3, 2],
dtype=pd.CategoricalDtype(
categories=[1, 2, 3, 4],
ordered=True)
),
val = pd.Series([1.5, 0.5, 1.0, 0.5, 1.5, 1.0])
)
)
Including unobserved categories gives correct groups:
# Sum 'val' grouped by 'cat', including unobserved categories
df.groupby('cat', observed=False)['val'].agg('sum')
cat
1 1.0
2 2.0
3 3.0
4 0.0
Name: val, dtype: float64
Excluding unobserved categories changes the order, groups are wrong:
# Sum 'val' grouped by 'cat', excluding unobserved categories
df.groupby('cat', observed=True)['val'].agg('sum')
cat
3 1.0
1 2.0
2 3.0
Name: val, dtype: float64
Problem description:
The sample code shows that grouping with an ordered factor does not respect the factor's order when 'observed=True'
is set. Instead, group labels are in order of first occurrence in the Categorical, as if it were unordered. The aggregation results, however, are in the Categorical's order.
Thus, the result is wrong.
Related issues: #25167
There, the Categorical was unordered, and the sort=True
parameter did not work as expected in combination with observed=True
. In my case, sort
makes no difference:
df.groupby('cat', observed=True, sort=True)['val'].agg('sum')
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')
both give the same, wrong result as shown above.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.11.3
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None