Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd

# This returns 0 on any pandas version, as expected.
pd.DataFrame.from_dict({'a': [np.nan], 'b': [1]}).groupby(['a', 'b'], dropna=False).ngroup().iloc[0]
# These return -1 on pandas 1.4.4, and nan on pandas 1.5.0 and 1.5.2.
# From the docs I'm not sure what is meant to happen. Arguably returning -1 is the right behaviour (in particular, it avoids casting the group numbers to float), so returning nan looks like a bug.
pd.DataFrame.from_dict({'a': [np.nan], 'b': [1]}).groupby(['a', 'b']).ngroup().iloc[0]
pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b']).ngroup().iloc[0]
# This returns -1 on pandas 1.4.4, 0 on pandas 1.5.0 and nan on pandas 1.5.2.
# I'm pretty sure that the correct answer is 0 (for consistency with the float case with dropna=False above).
pd.DataFrame.from_dict({'a': pd.Categorical([np.nan]), 'b': [1]}).groupby(['a', 'b'], dropna=False).ngroup().iloc[0]
Issue Description
Probably related to the recent categorical/groupby changes, e.g. #48702. I'm raising this as a separate issue because I don't see any ticket specifically discussing the impact on ngroup.
Expected Behavior
With dropna=False, groups containing missing values should get their own ngroup number. The ngroup result should never be a float: if dropna=True and any group key contains a missing value, the affected rows should get -1 rather than nan.
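The expectations above can be written down as a small check. This is only a sketch of how I read them: `check_ngroup` is a hypothetical helper, not part of pandas, and it is applied here to the one case that the example above reports as stable across versions.

```python
import numpy as np
import pandas as pd


def check_ngroup(result: pd.Series, dropna: bool) -> bool:
    """Hypothetical helper: True if `result` meets the expectations above."""
    # Group numbers must stay integers; a float dtype means nan leaked in.
    if not pd.api.types.is_integer_dtype(result.dtype):
        return False
    # With dropna=False every row belongs to a group, so -1 must not appear.
    if not dropna and (result < 0).any():
        return False
    # With dropna=True, -1 is the only negative value allowed.
    return bool((result >= -1).all())


df = pd.DataFrame({"a": [np.nan], "b": [1]})
# Float key with dropna=False: reported above as returning 0.
ok = check_ngroup(df.groupby(["a", "b"], dropna=False).ngroup(), dropna=False)
print(ok)
```

The same helper can be pointed at the other three calls from the reproducible example to see which expectation each pandas version violates.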
Installed Versions
pandas : 1.5.2
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.2.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
/home/mboling/opt/conda/envs/pandas1.5.2/lib/python3.10/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")