pd.core.groupby.groupby.DataFrameGroupBy.nth yielding false values with Interval MultiIndex Interval #24205

Closed
@JoElfner

Description

Hi,

using a solution for my problem posted on SO, I stumbled upon this bug. Thanks @jorisvandenbossche for looking into that matter and confirming this to be a bug.

import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(50) * 50 + 20,
                   'temp_b': np.random.rand(50) * 30 + 40,
                   'power_deg': np.random.rand(50),
                   'eta': 1 - np.random.rand(50) / 5},
                  index=pd.date_range(start='20181201', freq='T', periods=50))

df_grpd = df.groupby(
    [pd.cut(df.temp_a, np.arange(0, 100, 5)),  # categorical for temp_a
     pd.cut(df.temp_b, np.arange(0, 100, 5)),   # categorical for temp_b
     pd.cut(df.power_deg, np.arange(0, 1, 1 / 20))  # categorical for power_deg
    ])

df_grpd.nth(1).T
# Out:
temp_a       (65, 70]
temp_b       (50, 55]
power_deg (0.8, 0.85]
temp_a      36.147946
temp_b      57.832956
power_deg    0.983631
eta          0.974274

df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T

# Out:
temp_a       (65, 70]            
temp_b       (50, 55]            
power_deg (0.8, 0.85] (0.8, 0.85]
temp_a      69.038210   46.591379
temp_b      52.510666   57.173709
power_deg    0.846506    0.988345
eta          0.867026    0.885187

Problem description

When grouping by the output of pd.cut, taking the n-th row of each group with df_grpd.nth(n) for n >= 1 seems to return values that lie outside the group boundaries, as df_grpd.nth(1).T shows in my code example. Furthermore, for n=0 some groups return multiple rows, as shown with df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T, again with values outside the interval.

Expected Output

Each group should return at most one row per n, and every returned value should lie within that group's intervals.
For df_grpd.nth(1).T:

temp_a       (65, 70]
temp_b       (50, 55]
power_deg (0.8, 0.85]
temp_a      66.147946  # some values within the interval...
temp_b      53.832956  # some values within the interval...
power_deg    0.833631  # some values within the interval...
eta          0.974274

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.11
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Is there currently any workaround for this problem? Can I somehow access the group values in any other way?
Thanks in advance!
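
A possible workaround, sketched here under assumptions rather than confirmed by the maintainers: skip nth entirely and select rows by their position within each group using cumcount. The helper names nth_per_group and all_within and the bin_a, bin_b and bin_p columns are made up for this illustration. I have not verified that cumcount avoids the bug on pandas 0.23.4. The date index uses freq='min' in place of 'T' so the snippet also runs on newer pandas versions.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(50) * 50 + 20,
                   'temp_b': np.random.rand(50) * 30 + 40,
                   'power_deg': np.random.rand(50),
                   'eta': 1 - np.random.rand(50) / 5},
                  index=pd.date_range(start='20181201', freq='min', periods=50))

# Keep the interval bins as ordinary columns (hypothetical names).
bins = pd.DataFrame({
    'bin_a': pd.cut(df.temp_a, np.arange(0, 100, 5)),
    'bin_b': pd.cut(df.temp_b, np.arange(0, 100, 5)),
    'bin_p': pd.cut(df.power_deg, np.arange(0, 1, 1 / 20)),
})


def nth_per_group(frame, keys, n):
    """Return the n-th row (0-based) of every observed group, with its bins."""
    # Rows falling outside all bins have NaN keys; drop them up front.
    valid = keys.notna().all(axis=1)
    frame, keys = frame[valid], keys[valid]
    # cumcount numbers the rows inside each group 0, 1, 2, ...
    position = frame.groupby([keys[c] for c in keys.columns],
                             observed=True).cumcount()
    picked = position == n
    return pd.concat([keys[picked], frame[picked]], axis=1)


def all_within(rows):
    """Check that each value lies inside the interval it was binned into."""
    return all(r.temp_a in r.bin_a and r.temp_b in r.bin_b
               and r.power_deg in r.bin_p
               for r in rows.itertuples(index=False))


first = nth_per_group(df, bins, 0)
second = nth_per_group(df, bins, 1)
print(all_within(first), all_within(second))
```

Keeping the bins as plain columns instead of a MultiIndex makes it easy to check each row against its own intervals, and rows can still be filtered by comparing a bin column with a pd.Interval.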
