Skip to content

groupby on 2 categorical columns, when one categorical is based on datetimes, incorrectly returns all NaN dataframe #21390

Closed
@rogeriomgatto

Description

@rogeriomgatto

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'label1': list('abcbabcba'),
    'label2': list('xyxyxyxyx'),
    'minute': list(pd.date_range('2018-06-01 00', freq='1T', periods=3)) * 3,
    'n1': np.arange(9, dtype='float'),
    'n2': np.arange(9, dtype='float') ** 2
})

# this is correct
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

# convert to categoricals
df['label1'] = df['label1'].astype('category')
df['label2'] = df['label2'].astype('category')
df['minute'] = df['minute'].astype('category')

# this is wrong, returns all NaNs
df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()

Problem description

When grouping by [str, datetime] columns, results are as expected:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                             n1    n2
label1 minute                        
a      2018-06-01 00:00:00  0.0   0.0
       2018-06-01 00:01:00  4.0  16.0
       2018-06-01 00:02:00  8.0  64.0
b      2018-06-01 00:00:00  3.0   9.0
       2018-06-01 00:01:00  4.0  25.0
       2018-06-01 00:02:00  5.0  25.0
c      2018-06-01 00:00:00  6.0  36.0
       2018-06-01 00:02:00  2.0   4.0

After converting label1, label2, and minute to categoricals, that same groupby returns all NaNs:

>>> df.groupby(['label1', 'minute'])[['n1', 'n2']].mean()
                            n1  n2
label1 minute                     
a      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
b      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN
c      2018-06-01 00:00:00 NaN NaN
       2018-06-01 00:01:00 NaN NaN
       2018-06-01 00:02:00 NaN NaN

I only got this bug when grouping on 2 categoricals with one of them being datetime based (order is irrelevant). Grouping by ['label1', 'label2'] and 'minute' by itself works as expected.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-22-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    CategoricalCategorical Data TypeGroupbyRegressionFunctionality that used to work in a prior pandas version

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions