Skip to content

groupby aggregation on ordered Categorial with 'observed=True' breaks order #25871

Closed
kpflugshaupt/pandas
#1
@kpflugshaupt

Description

@kpflugshaupt

Code Sample:

import pandas as pd

# Create a DataFrame with an ordered categorical column, one category not present
df = pd.DataFrame(
            dict(cat = pd.Series([3, 1, 2, 1, 3, 2], 
                                 dtype=pd.CategoricalDtype(
                                              categories=[1, 2, 3, 4], 
                                              ordered=True)
                                ), 
                 val = pd.Series([1.5, 0.5, 1.0, 0.5, 1.5, 1.0])
            )
)

Including unobserved categories gives correct groups:

# Sum 'val' grouped by 'cat', including unobserved categories
df.groupby('cat', observed=False)['val'].agg('sum')
cat
1    1.0
2    2.0
3    3.0
4    0.0
Name: val, dtype: float64

Excluding unobserved categories changes the order, groups are wrong:

# Sum 'val' grouped by 'cat', excluding unobserved categories
df.groupby('cat', observed=True)['val'].agg('sum')
cat
3    1.0
1    2.0
2    3.0
Name: val, dtype: float64

Problem description:

The sample code shows that grouping with an ordered factor does not respect the factor's order when 'observed=True' is set. Instead, group labels are in order of first occurrence in the Categorical, as if it were unordered. The aggregation results, however, are in the Categorical's order.
Thus, the result is wrong.

Related issues: #25167
There, the Categorical was unordered, and the sort=True parameter did not work as expected in combination with observed=True. In my case, sort makes no difference:

df.groupby('cat', observed=True, sort=True)['val'].agg('sum')
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')

both give the same, wrong result as shown above.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.11.3
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions