Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0, 1, 2, 3],
        "col4": [17, 13, 16, 15],
        "col5": [-4, -5, -6, -7],
    }
)
by = ["col4", "col5"]
apply_function = min
gb = df.groupby(by, as_index=True)
df1 = gb.apply(apply_function)
print(df1)
df2 = gb.min()
print(df2)
df3 = gb.apply(apply_function)
print(df3)
Problem description
In the code above, the two calls to gb.apply(apply_function) produce different output (df1 and df3 differ).
The only operation between them is the call to gb.min(), so calling groupby.min
on the same groupby object before the second apply
appears to change its result and make it incorrect.
Expected Output
Both calls to gb.apply(apply_function) should
produce the same output, regardless of the gb.min() call made between them.
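The expectation above could be written as a small check, sketched here under a few assumptions: the helper name apply_is_stable is hypothetical and not part of pandas, the comparison uses the equals method, and warnings are silenced because newer pandas releases may emit deprecation warnings for this call pattern. What the apply calls actually return may vary between pandas versions; the sketch only compares the result before and after gb.min().

```python
import warnings

import pandas as pd


def apply_is_stable(df, by, func):
    """Return True if gb.apply(func) gives the same result before and after gb.min().

    Hypothetical helper for illustration only; it is not a pandas API.
    """
    gb = df.groupby(by, as_index=True)
    with warnings.catch_warnings():
        # Assumption: newer pandas releases may warn about this call pattern,
        # so warnings are ignored to keep the comparison readable.
        warnings.simplefilter("ignore")
        before = gb.apply(func)
        gb.min()  # the intervening call from the report; its result is discarded
        after = gb.apply(func)
    return bool(before.equals(after))


df = pd.DataFrame(
    {
        "col1": [0, 1, 2, 3],
        "col4": [17, 13, 16, 15],
        "col5": [-4, -5, -6, -7],
    }
)
stable = apply_is_stable(df, ["col4", "col5"], min)
print("apply output unchanged by gb.min():", stable)
```

On a version affected by this report the check would presumably print False; on a version where the two calls agree it prints True.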
Output of pd.show_versions()
pandas : 1.0.4
numpy : 1.18.4
pytz : 2019.2
dateutil : 2.7.3
pip : 20.1.1
setuptools : 47.1.0
Cython : 0.29.17
pytest : 5.4.2
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : 0.13.2
pyarrow : 0.16.0
pytables : None
pytest : 5.4.2
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : None
xarray : 0.15.1
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.46.0