Indexing broken inside groupby - apply #33058

Closed
@diegodlh


Code Sample, a copy-pastable example if possible

import pandas as pd
import pdb

df = pd.DataFrame(
	{
		'col1': ['A', 'A', 'A', 'B', 'B', 'B'],
		'col2': [1, 2, 3, 4, 5, 6],
	}
)

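def fn(x):
	# Breakpoint used to inspect each group interactively
	pdb.set_trace()
	# Set the last row of col2 in this group to 0, looked up by index label
	x.col2[x.index[-1]] = 0
	return x.col2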

result = df.groupby(['col1'], as_index=False).apply(fn)
print(result)

Problem description

The expected output is:

0  0    1
   1    2
   2    0
1  3    4
   4    5
   5    0

Instead, I get a Series one row longer than expected:

0  0    1
   1    2
   2    0
1  3    4
   4    5
   5    6
   5    0

The problem seems to arise while processing the second group (col1 == 'B'), where the index labels (3, 4, 5) do not match the row positions (0, 1, 2). Stopping at the breakpoint (pdb.set_trace()) for that group, I get the following results:

-> x.col2[x.index[-1]] = 0
(Pdb) x.col2     
3    4
4    5
5    6
Name: col2, dtype: int64
(Pdb) x.col2[5]
*** KeyError: 5
(Pdb) x.col2[5] = 0
(Pdb) x.col2
3    4
4    5
5    6
5    0
Name: col2, dtype: int64
(Pdb) x.col2[5]
5    6
5    0
Name: col2, dtype: int64
(Pdb) x.col2[5] = 0
(Pdb) x.col2
3    4
4    5
5    0
5    0
Name: col2, dtype: int64
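
For comparison, here is a minimal sketch of the same lookup and assignment on a standalone Series with the same labels, using the explicit .loc (label) and .iloc (position) accessors. The Series `s` and the variable names are made up for illustration. Whether .loc avoids the behaviour shown above inside groupby.apply on pandas 1.0.3 has not been verified here.

```python
import pandas as pd

# Same labels and values as the second group in the pdb session above
s = pd.Series([4, 5, 6], index=[3, 4, 5], name="col2")

last_label = s.index[-1]       # label of the last row (5 here)
by_label = s.loc[last_label]   # explicit label-based lookup
by_position = s.iloc[-1]       # explicit position-based lookup

# Explicit label-based assignment overwrites the existing row
# instead of appending a new one
s.loc[last_label] = 0
print(s)
```

On a standalone Series, both lookups return the same element and the assignment leaves the length at three rows, which is what the session above was expected to do.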

Expected output

0  0    1
   1    2
   2    0
1  3    4
   4    5
   5    0
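
As a possible workaround, the same values can be produced without mutating the group inside apply. The sketch below is only an assumption about what might be acceptable here, not a fix for the regression. It uses groupby(...).tail(1) to find the index label of the last row in each group, then zeroes those labels on a copy of col2. The names `last_labels` and `zeroed` are illustrative, and the result keeps the original flat index rather than the two-level index shown above.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["A", "A", "A", "B", "B", "B"],
        "col2": [1, 2, 3, 4, 5, 6],
    }
)

# Index labels of the last row in each group; tail() keeps the original index
last_labels = df.groupby("col1").tail(1).index

# Work on a copy so the original column is left untouched
zeroed = df["col2"].copy()
zeroed.loc[last_labels] = 0
print(zeroed)
```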

This used to work in an earlier pandas version. Unfortunately, I do not know which version that was.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.5.13-050513-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 40.6.2
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.45.1

Labels: Groupby; Needs Tests (unit test(s) needed to prevent regressions); Regression (functionality that used to work in a prior pandas version)
