Description
Code Sample, a copy-pastable example:
import pandas as pd
import numpy as np
pd.show_versions()
dates = ['2019-05-09', '2019-05-09', '2019-05-09']
date_series = pd.Series(dates)
# the problem appears when aggregating an object-dtype column
# while another object-dtype column in the frame has missing values,
# hence column 'a' mixes strings with NaN
# and column 'c' holds the datetime.date objects produced by the dt.date accessor
date_series_parsed = pd.to_datetime(date_series, format='%Y-%m-%d').dt.date
df = pd.DataFrame({
'a': [np.nan, '1', np.nan],
'b': [0, 1, 1],
'c': date_series_parsed,
})
# as expected the series values returned are each populated with non missing data
# I'm ignoring that this is returning a Series
# and the other method returns a DataFrame
# the logic is my real concern
print(df.groupby('b')['c'].min())
# output (as is and as expected):
# b
# 0    2019-05-09
# 1    2019-05-09
# Name: c, dtype: object
# looks like it's dependent on column 'a' being fully populated for some reason
# as the group df['b'] == 1 min value is not returned appropriately here
# but the group df['b'] == 0 min value does appear to be working
print(df.groupby('b', as_index=False)['c'].min())
# output (as is):
#    b           c
# 0  0  2019-05-09
# 1  1         NaN
#
# output (as expected):
#    b           c
# 0  0  2019-05-09
# 1  1  2019-05-09
Problem description
As there is no missing value in column 'c', I would expect the minimum of every group to be well defined and non-missing. This is what happens when as_index in the groupby method is left at its default value of True. With as_index=False, however, the minimum of column 'c' clearly depends on column 'a' in some capacity: the group where 'a' contains a non-missing value (b == 1) returns NaN, while the group where 'a' is entirely missing (b == 0) returns the correct date. It appears that at some point the aggregation is looking at column 'a' instead of column 'c', though I'm not sure why or where.
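Until the cause is understood, one possible workaround is to aggregate with the default as_index=True and then call reset_index(), which gives the same frame layout without passing as_index=False. This is a minimal sketch that rebuilds the frame from the code sample; it assumes the as_index=True path is unaffected, as the first output above suggests, and the variable name workaround is only illustrative.

```python
import numpy as np
import pandas as pd

# Same frame as in the code sample above.
dates = pd.Series(['2019-05-09', '2019-05-09', '2019-05-09'])
df = pd.DataFrame({
    'a': [np.nan, '1', np.nan],
    'b': [0, 1, 1],
    'c': pd.to_datetime(dates, format='%Y-%m-%d').dt.date,
})

# Aggregate with 'b' as the index (the call that behaves as expected),
# then move 'b' back into a regular column.
workaround = df.groupby('b')['c'].min().reset_index()
print(workaround)
```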
I have tested this both on the version of pandas available on PyPI (0.24.2 at the time of writing) and on the pandas-dev master branch. The issue occurs in both versions, and it has been reproduced on several machines.
Expected Output
I would expect the results to be consistent with the calculation performed when as_index is left at its default value of True, since the minimum of the values in the group where column 'b' equals 1 is well defined (explicitly 2019-05-09).
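The expectation can also be phrased as a small check that both call styles return the same per-group minimum. This is only a sketch: results_agree is a hypothetical helper name, not part of pandas, and the frame is the one from the code sample.

```python
import numpy as np
import pandas as pd


def results_agree(frame):
    """Return True if both as_index settings give the same minimum of 'c' per group."""
    with_index = frame.groupby('b')['c'].min()
    without_index = frame.groupby('b', as_index=False)['c'].min()
    # NaN never compares equal to a date, so a missing value makes this False.
    return list(with_index) == list(without_index['c'])


dates = pd.Series(['2019-05-09', '2019-05-09', '2019-05-09'])
df = pd.DataFrame({
    'a': [np.nan, '1', np.nan],
    'b': [0, 1, 1],
    'c': pd.to_datetime(dates, format='%Y-%m-%d').dt.date,
})

print(results_agree(df))
```

With the "as is" output reported above the check would come out False; the expected behavior corresponds to True.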
Thank you for the consideration!
Output of pd.show_versions()
Note: I ran this on both the master branch and 0.24.2; the output below is from the master branch. If this is insufficient, please let me know and I will add the output from the PyPI release of 0.24.2 when able.
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.4-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: C
LOCALE: None.None
pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.8.0
Cython: 0.29.7
numpy: 1.16.1
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.2
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None