
BUG: Groupby aggregate functions (e.g. min/max) fail to preserve categorical dtype on single categorical column #37275

Closed
@btw08

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Set-up Code

import pandas as pd

dtype = pd.CategoricalDtype(categories=['small', 'big'], ordered=True)
df = pd.DataFrame([
    [1, 'small'],
    [1, 'big'],
    [2, 'small']
], columns=['grp', 'description']).astype({'description': dtype})


biggest = df.groupby('grp')['description'].max()

Outputs

Here, I'm expecting a categorical dtype: the same ordered dtype defined at the top of the set-up code. Instead, I get an object dtype.

>>> biggest.dtype
dtype('O')

... and that causes problems down the line: with object dtype the values compare as plain strings rather than by category order, so the overall max of "biggest" turns out to be "small" instead of "big".

>>> biggest.max()
'small'
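To illustrate why the dtype matters here, the following minimal sketch compares the same two values stored as object and as the ordered categorical. The names `as_object`, `as_categorical`, `object_max` and `categorical_max` are made up for this example and do not appear in the report.

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=['small', 'big'], ordered=True)

# Same values, two dtypes (variable names are illustrative only).
as_object = pd.Series(['big', 'small'], dtype=object)
as_categorical = as_object.astype(dtype)

# Object dtype: plain string comparison, and 's' sorts after 'b'.
object_max = as_object.max()

# Ordered categorical: comparison follows the declared order small < big.
categorical_max = as_categorical.max()

print(object_max, categorical_max)
```

So once the aggregation result falls back to object, any later ordering operation silently uses alphabetical order.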

Problem description

After performing a groupby (using a column or columns of any type as the "by"), I'd expect taking the min or max (or first or last, and probably other aggregations that make sense) of a column of categorical dtype to return a series of that same dtype. Instead, the dtype appears to be converted to object during the aggregation.
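As a possible stopgap (an assumption on my part, not something confirmed as the intended fix), the result can be cast back to the original dtype after the aggregation. The names `restored` and `overall` are hypothetical.

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=['small', 'big'], ordered=True)
df = pd.DataFrame([
    [1, 'small'],
    [1, 'big'],
    [2, 'small']
], columns=['grp', 'description']).astype({'description': dtype})

# Cast the aggregated Series back to the ordered categorical dtype.
# If the dtype was already preserved, this cast changes nothing.
restored = df.groupby('grp')['description'].max().astype(dtype)

# The overall max now follows the category order again.
overall = restored.max()
print(restored.dtype, overall)
```

This only works when every aggregated value is one of the declared categories, which holds for min/max/first/last since they return existing values.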

For what it's worth, the expected behavior seems to happen if I do this as a frame operation as opposed to a series operation. That is, if I replace this line from above

biggest = df.groupby('grp')['description'].max()

with

df_biggest = df.groupby('grp')[['description']].max()

the dtype does seem to be preserved.
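Based on that observation, one way to get a Series with the dtype intact might be to aggregate through the frame path and then select the column. This is only a sketch of the idea; `via_frame` is a hypothetical name, and the dtype printed may depend on the pandas version in use.

```python
import pandas as pd

dtype = pd.CategoricalDtype(categories=['small', 'big'], ordered=True)
df = pd.DataFrame([
    [1, 'small'],
    [1, 'big'],
    [2, 'small']
], columns=['grp', 'description']).astype({'description': dtype})

# Aggregate as a one-column DataFrame, then pull the column out as a Series.
via_frame = df.groupby('grp')[['description']].max()['description']

print(via_frame.dtype)
print(via_frame.max())
```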

Output of pd.show_versions()

INSTALLED VERSIONS

commit : db08276
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.1.3
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0.post20201006
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.7
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
