BUG: wrong df.groupby().groups when grouping with [Grouper(freq=), ...] #33132

Closed
@falcaopetri

Code

import pandas as pd
from datetime import datetime
mi = pd.MultiIndex.from_product([pd.date_range(datetime.today(), periods=2),
                                    ["C", "D"]], names=["alpha", "beta"])
df = pd.DataFrame({"foo": [1, 2, 1, 2], "bar": [1, 2, 3, 4]}, index=mi)
result = df.groupby([pd.Grouper(level="alpha", freq='D'), "beta"])

print(len(result), result.ngroups)
# 2 4
print(result.groups)
# {(Timestamp('2020-04-05 00:00:00'), 'C'): MultiIndex([('2020-04-05 15:04:51.580573', 'C')], names=['alpha', 'beta']), 
#  (Timestamp('2020-04-06 00:00:00'), 'D'): MultiIndex([('2020-04-05 15:04:51.580573', 'D')], names=['alpha', 'beta'])}

Problem description

This issue extends the bug reported in #26326. PR #26374 fixed the case of a nested BaseGrouper, but a nested BinGrouper still produces wrong results, as the code above shows: result.groups contains only 2 of the 4 groups, and the key (Timestamp('2020-04-06 00:00:00'), 'D') is paired with an index entry dated 2020-04-05.

Note that len(result) is computed from len(result.groups), which is why it reports 2 while result.ngroups reports 4. result.groups should instead return the following:

# {(Timestamp('2020-04-05 00:00:00'), 'C'): MultiIndex([('2020-04-05 15:04:51.580573', 'C')], names=['alpha', 'beta']),
#  (Timestamp('2020-04-05 00:00:00'), 'D'): MultiIndex([('2020-04-05 15:04:51.580573', 'D')], names=['alpha', 'beta']),
#  (Timestamp('2020-04-06 00:00:00'), 'C'): MultiIndex([('2020-04-06 15:04:51.580573', 'C')], names=['alpha', 'beta']),
#  (Timestamp('2020-04-06 00:00:00'), 'D'): MultiIndex([('2020-04-06 15:04:51.580573', 'D')], names=['alpha', 'beta'])}
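
For reference, the expected mapping can be derived without pandas. The following is only a minimal sketch using the standard library: the fixed start timestamp and the helper names floor_to_day and expected_groups are assumptions made for illustration, not part of pandas or of the original report. It floors each "alpha" value to its day and pairs it with "beta", which is what grouping by [Grouper(level="alpha", freq='D'), "beta"] is expected to do here.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from itertools import product

# Fixed start time (an assumption; the report uses datetime.today()).
start = datetime(2020, 4, 5, 15, 4, 51, 580573)
alpha = [start + timedelta(days=i) for i in range(2)]
beta = ["C", "D"]

# Same row ordering as pd.MultiIndex.from_product([alpha, beta]).
index = list(product(alpha, beta))


def floor_to_day(ts):
    """Truncate a timestamp to midnight, i.e. the left edge of its daily bin."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def expected_groups(rows):
    """Map (day, beta) -> list of index entries that belong to that group."""
    groups = defaultdict(list)
    for ts, label in rows:
        groups[(floor_to_day(ts), label)].append((ts, label))
    return dict(groups)


groups = expected_groups(index)
for key, members in groups.items():
    print(key, members)
```

Under these assumptions the mapping has four keys, one per (day, beta) pair, and each key holds exactly one index entry whose date matches the key's day. This is the shape result.groups is expected to have, and its length agrees with result.ngroups.
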
INSTALLED VERSIONS
------------------
commit           : 7673357191709036faad361cbb5f31a802703249
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.5.8-1-MANJARO
Version          : #1 SMP PREEMPT Thu Mar 5 20:29:51 UTC 2020
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : C.UTF-8
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0.dev0+1027.g767335719.dirty
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.4.1
hypothesis       : 5.6.0
sphinx           : 2.4.4
blosc            : None
feather          : None
xlsxwriter       : 1.2.8
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.13.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.2
fastparquet      : 0.3.3
gcsfs            : None
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.1
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
pyxlsb           : None
s3fs             : 0.4.0
scipy            : 1.4.1
sqlalchemy       : 1.3.15
tables           : 3.6.1
tabulate         : 0.8.6
xarray           : 0.15.0
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.48.0
