BUG: cummin, cummax, cumsum not valid when used with agg + NaN not skipped. #34047

Open
@yohplala

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
from random import seed, randint

# Data
kp = pd.period_range(start='2020-01-01 00:00', end='2020-01-01 00:25', freq='5T')
sp = pd.period_range(start='2020-01-01 00:00', end='2020-01-01 00:25', freq='1h')
seed(1)
values = [randint(0,10) for p in kp]
dft = pd.DataFrame({'Values' : values}, index=kp)
dft.loc[kp[-2]] = np.nan

# Trouble 1: `cummin`, `cummax`, `cumsum` not available through `agg`?
resampler = dft.resample(sp.freqstr)
progress = resampler.agg('cummin')

This raises the following error (same with cummax and cumsum):
AttributeError: 'cummin' is not a valid function for 'PeriodIndexResampler' object.

# Trouble 2: `cummin`, `cummax`, `cumsum` appear to work when used in a dict,
# but the `skipna` parameter is ignored: the result is the same whether
# `skipna` is `True` or `False`.

resampler = dft.resample(sp.freqstr)
progress = resampler.agg({('Values','cummin')},skipna=False)

Output obtained

progress
Values
Values
2020-01-01 00:00 2.0
2020-01-01 00:05 2.0
2020-01-01 00:10 1.0
2020-01-01 00:15 1.0
2020-01-01 00:20 NaN
2020-01-01 00:25 1.0

Problem description

It appears that cummin, cummax and cumsum cannot be used directly with agg. The documentation appears to state otherwise:
"Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply."
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.resample.Resampler.aggregate.html
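
One possible reading of the error in trouble 1 is that cummin, cummax and cumsum are transformations (one output row per input row) rather than reductions (one row per bin). The sketch below is a workaround under stated assumptions, not a fix for agg: it uses a DatetimeIndex instead of the PeriodIndex above, hard-coded illustrative values instead of the seeded random ones, and groupby with pd.Grouper(freq='1h') as a stand-in for dft.resample.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): six 5-minute points, one of them NaN.
idx = pd.date_range(start='2020-01-01 00:00', periods=6, freq='5min')
df = pd.DataFrame({'Values': [2, 9, 1, 4, np.nan, 7]}, index=idx)

# Group rows into hourly bins, then take the running minimum inside each bin.
# The result keeps the original 5-minute index (a transform, not a reduction).
hourly = df.groupby(pd.Grouper(freq='1h'))
progress = hourly['Values'].cummin()

print(progress)
```

With the default NaN handling, the NaN row stays NaN and the running minimum continues after it, which matches the expected output described for trouble 1.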

Expected Output

For trouble 1:
progress (with cummin)
Values
2020-01-01 00:00 2.0
2020-01-01 00:05 2.0
2020-01-01 00:10 1.0
2020-01-01 00:15 1.0
2020-01-01 00:20 NaN
2020-01-01 00:25 1.0

For trouble 2:
progress (with cummin & skipna=False)
Values
2020-01-01 00:00 2.0
2020-01-01 00:05 2.0
2020-01-01 00:10 1.0
2020-01-01 00:15 1.0
2020-01-01 00:20 NaN
2020-01-01 00:25 NaN
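
Until skipna is honoured through agg, the expected output for trouble 2 can be approximated by calling Series.cummin(skipna=False) on each hourly group. This is a minimal sketch, assuming a DatetimeIndex, illustrative hard-coded values, and groupby with pd.Grouper(freq='1h') in place of the resampler.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): six 5-minute points, one of them NaN.
idx = pd.date_range(start='2020-01-01 00:00', periods=6, freq='5min')
df = pd.DataFrame({'Values': [2, 9, 1, 4, np.nan, 7]}, index=idx)

hourly = df.groupby(pd.Grouper(freq='1h'))

# skipna=False lets the NaN propagate: every value after it becomes NaN.
strict = hourly['Values'].transform(lambda s: s.cummin(skipna=False))

print(strict)
```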

As a side question, would it be possible to have an additional parameter fill_value=0 that gives the following output?

progress (with cummin & fill_value=0)
Values
2020-01-01 00:00 2.0
2020-01-01 00:05 2.0
2020-01-01 00:10 1.0
2020-01-01 00:15 1.0
2020-01-01 00:20 1.0
2020-01-01 00:25 1.0
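
The output requested above carries the running minimum across the NaN row. Without a fill_value parameter, one way to approximate it is to forward-fill the cumulative result inside each group. This is a sketch under the same assumptions as before (DatetimeIndex, illustrative values, groupby with pd.Grouper(freq='1h') in place of the resampler). Note that replacing the NaN with 0 before calling cummin would give a different result, because 0 would become the new minimum.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumption): six 5-minute points, one of them NaN.
idx = pd.date_range(start='2020-01-01 00:00', periods=6, freq='5min')
df = pd.DataFrame({'Values': [2, 9, 1, 4, np.nan, 7]}, index=idx)

hourly = df.groupby(pd.Grouper(freq='1h'))

# Running minimum per hourly bin, then carry the last valid minimum
# forward over the NaN row (still within the same bin).
filled = hourly['Values'].transform(lambda s: s.cummin().ffill())

print(filled)
```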

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-51-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8

pandas : 1.0.3
numpy : 1.16.3
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.1.3.post20200330
Cython : None
pytest : None
hypothesis : None
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : 0.3.3
gcsfs : None
lxml.etree : None
matplotlib : 3.0.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : 0.48.0
