BUG: GroupBy.ngroup dropna=False inconsistency when using Categorical #50100

Closed
@batterseapower

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

# This returns 0 on any pandas version, as expected.
pd.DataFrame.from_dict({'a': [np.nan], 'b': [1]}).groupby(['a', 'b'], dropna=False).ngroup().iloc[0]

# These return -1 on pandas 1.4.4, and nan on pandas 1.5.0 and 1.5.2.
# From the docs I'm not sure what is meant to happen. Arguably returning -1 is the right behaviour
# (in particular, it avoids casting the group numbers to float), in which case returning nan is a bug.
pd.DataFrame.from_dict({'a': [np.nan], 'b': [1]}).groupby(['a', 'b']).ngroup().iloc[0]
pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b']).ngroup().iloc[0]

# This returns -1 on pandas 1.4.4, 0 on pandas 1.5.0 and nan on pandas 1.5.2.
# I'm pretty sure that the correct answer is 0 (for consistency with the float case above).
pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b'], dropna=False).ngroup().iloc[0]

Issue Description

Probably related to the recent categorical/groupby changes e.g. #48702. I'm raising this as a separate issue because I don't see any ticket specifically discussing the impact on ngroup.

Expected Behavior

With dropna=False, groups containing missing values should get their own ngroup number. The result of ngroup should never be a float; i.e. with dropna=True, rows whose group key contains a missing value should be labelled -1 rather than nan.
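To compare the four cases from the reproducible example side by side, a minimal sketch along these lines may help. The helper name `first_ngroup` is hypothetical (it is not part of pandas), and the printed labels depend on the installed pandas version, so no expected output is listed beyond what is reported above.

```python
import numpy as np
import pandas as pd


def first_ngroup(a_values, dropna):
    """Hypothetical helper: ngroup label of the single row when grouping by ['a', 'b']."""
    df = pd.DataFrame({"a": a_values, "b": [1]})
    return df.groupby(["a", "b"], dropna=dropna).ngroup().iloc[0]


# The four combinations discussed in this issue: float vs. Categorical key,
# with dropna=False vs. dropna=True (the groupby default).
cases = {
    ("float", False): first_ngroup([np.nan], dropna=False),
    ("float", True): first_ngroup([np.nan], dropna=True),
    ("categorical", False): first_ngroup(pd.Categorical([np.nan]), dropna=False),
    ("categorical", True): first_ngroup(pd.Categorical([np.nan]), dropna=True),
}

# Print each label with repr so that an integer 0, an integer -1 and a float nan
# are easy to tell apart.
for (kind, dropna), label in cases.items():
    print(f"{kind:<12} dropna={dropna!s:<5} -> {label!r}")
```

Under the expected behaviour described above, both dropna=False rows would show the same integer label, and no row would show a float.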

Installed Versions

INSTALLED VERSIONS
------------------
commit : 8dab54d
python : 3.10.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-348.20.1.el8_5.x86_64
Version : #1 SMP Thu Mar 10 20:59:28 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.5.2
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

/home/mboling/opt/conda/envs/pandas1.5.2/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
