Description
Hi,
using a solution for my problem posted on SO, I stumbled upon this bug. Thanks @jorisvandenbossche for looking into that matter and confirming this to be a bug.
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'temp_a': np.random.rand(50) * 50 + 20,
                   'temp_b': np.random.rand(50) * 30 + 40,
                   'power_deg': np.random.rand(50),
                   'eta': 1 - np.random.rand(50) / 5},
                  index=pd.date_range(start='20181201', freq='T', periods=50))
# group by the binned values of three columns
df_grpd = df.groupby(
    [pd.cut(df.temp_a, np.arange(0, 100, 5)),  # categorical for temp_a
     pd.cut(df.temp_b, np.arange(0, 100, 5)),  # categorical for temp_b
     pd.cut(df.power_deg, np.arange(0, 1, 1 / 20))  # categorical for power_deg
     ])
df_grpd.nth(1).T
# Out:
temp_a (65, 70]
temp_b (50, 55]
power_deg (0.8, 0.85]
temp_a 36.147946
temp_b 57.832956
power_deg 0.983631
eta 0.974274
df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T
# Out:
temp_a (65, 70]
temp_b (50, 55]
power_deg (0.8, 0.85] (0.8, 0.85]
temp_a 69.038210 46.591379
temp_b 52.510666 57.173709
power_deg 0.846506 0.988345
eta 0.867026 0.885187
Problem description
When using groupby with pd.cut, taking the n-th row of each group with df_grpd.nth(n) for n >= 1 returns values that lie outside the group boundaries, as shown with df_grpd.nth(1).T in my code example. Furthermore, for n=0 there are sometimes multiple rows per group, as shown with df_grpd.nth(0).loc[(pd.Interval(65, 70), pd.Interval(50, 55), pd.Interval(0.8, 0.85))].T, again with values outside the interval.
Expected Output
I guess the expected output is quite clear in this case...
For df_grpd.nth(1).T:
temp_a (65, 70]
temp_b (50, 55]
power_deg (0.8, 0.85]
temp_a 66.147946 # some values within the interval...
temp_b 53.832956 # some values within the interval...
power_deg 0.833631 # some values within the interval...
eta 0.974274
Output of pd.show_versions()
pandas: 0.23.4
pytest: 4.0.1
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.11
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Is there currently any workaround for this problem? Can I somehow access the group values in any other way?
Thanks in advance!
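A possible workaround, as a minimal sketch only: instead of calling nth on a groupby over the categoricals returned by pd.cut, number the rows within each group with cumcount and select by that number. The sketch assumes the problem is tied to grouping directly on the categorical bins, which has not been confirmed here, so it groups on the string labels of the bins instead. The names bins, keys, pos and nth_rows are hypothetical helpers introduced for illustration, and freq='min' is used as the equivalent spelling of 'T'. Rows that fall outside every bin (for example power_deg above 0.95, where pd.cut returns NaN) are dropped explicitly.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(
    {"temp_a": np.random.rand(50) * 50 + 20,
     "temp_b": np.random.rand(50) * 30 + 40,
     "power_deg": np.random.rand(50),
     "eta": 1 - np.random.rand(50) / 5},
    index=pd.date_range(start="20181201", freq="min", periods=50),
)

# Bin each column once and keep the bins next to the data (hypothetical names).
bins = pd.DataFrame({
    "bin_a": pd.cut(df.temp_a, np.arange(0, 100, 5)),
    "bin_b": pd.cut(df.temp_b, np.arange(0, 100, 5)),
    "bin_p": pd.cut(df.power_deg, np.arange(0, 1, 1 / 20)),
})

# Keep only rows that fall into a bin for all three columns.
valid = bins.notna().all(axis=1)

# Group on plain string labels rather than on the categoricals themselves.
keys = bins[valid].astype(str)

# Position of every row inside its group: 0 for the first row, 1 for the second, ...
pos = keys.groupby(list(keys.columns)).cumcount()


def nth_rows(n):
    """Return the n-th row of every group, together with its interval bins."""
    selected = pos[pos == n].index
    return df.loc[selected].join(bins)


first = nth_rows(0)   # one row per non-empty group
second = nth_rows(1)  # only groups that contain at least two rows
print(first.head())
print(second)
```

Because every returned row carries its own bin_a, bin_b and bin_p, it is easy to check that the values really lie inside the intervals, for example with `row.temp_a in row.bin_a` while iterating over `first.itertuples()`.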