Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas
from pandas.testing import assert_frame_equal
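data = {"col1": [np.nan, 1, 2]}
pd_df = pandas.DataFrame(data)
# min_periods=None is the default; it is spelled out here because it triggers the bug.
pd_rolled = pd_df.rolling(window=2, min_periods=None)
res1 = pd_rolled.sum()
pd_rolled.count()  # result discarded; the call itself changes pd_rolled.min_periods
res2 = pd_rolled.sum()
assert_frame_equal(res1, res2)  # AssertionError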
Output
AssertionError: DataFrame.iloc[:, 0] (column name="col1") are different
DataFrame.iloc[:, 0] (column name="col1") values are different (66.66667 %)
[index]: [0, 1, 2]
[left]: [nan, nan, 3.0]
[right]: [0.0, 1.0, 3.0]
Problem description
Two sequential calls of .sum on the same rolling object produce different results if .count is called between them and min_periods=None. When min_periods is None, Rolling.sum treats it as equal to the window size. Rolling.count currently behaves differently and treats a min_periods of None as 0. #36649 added a warning that this behavior is deprecated and also refactored the .count implementation. Right after emitting the warning, .count overwrites the min_periods attribute of the rolling object, so subsequent calls to .sum and other operations give incorrect results.
See pandas/pandas/core/window/rolling.py, lines 2138 to 2149 in 5e02681.
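The side effect can be illustrated without pandas. The sketch below is a toy model, not the pandas source: `ToyRolling` and all of its methods are hypothetical names invented for illustration, and the only thing it assumes is the behavior described in this report (`count` writes a default of 0 back onto the object when `min_periods` is None).

```python
import math


class ToyRolling:
    """Hypothetical stand-in for a rolling object; this is not pandas code."""

    def __init__(self, values, window, min_periods=None):
        self.values = list(values)
        self.window = window
        self.min_periods = min_periods

    def _windows(self):
        # Trailing windows of at most `window` elements, one per position.
        for i in range(len(self.values)):
            yield self.values[max(0, i - self.window + 1): i + 1]

    def sum(self):
        # None means "require a full window"; resolved locally, no mutation.
        required = self.window if self.min_periods is None else self.min_periods
        out = []
        for win in self._windows():
            valid = [v for v in win if not math.isnan(v)]
            out.append(float(sum(valid)) if len(valid) >= required else math.nan)
        return out

    def count(self):
        if self.min_periods is None:
            # The reported side effect: the count-specific default is written
            # back onto the object and never restored.
            self.min_periods = 0
        return [
            float(sum(not math.isnan(v) for v in win)) for win in self._windows()
        ]


rolled = ToyRolling([math.nan, 1, 2], window=2)
before = rolled.sum()  # computed with min_periods still None
rolled.count()         # silently sets rolled.min_periods to 0
after = rolled.sum()   # now computed with min_periods == 0
print(before, after)
```

In this toy model, `before` and `after` differ in their first two entries in the same way as `[left]` and `[right]` in the output above.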
Expected Output
Rolling.count should not modify the min_periods attribute of the rolling object or, if it does, it should restore the original value of min_periods after performing the count.
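One possible shape for the second option (restore the original value) is a save-and-restore around the computation. This is only a sketch under assumptions: `ToyRolling`, `_apply`, and the rest are hypothetical names, not pandas internals, and an actual fix in pandas may look quite different.

```python
import math


class ToyRolling:
    """Hypothetical stand-in for a rolling object; this is not pandas code."""

    def __init__(self, values, window, min_periods=None):
        self.values = list(values)
        self.window = window
        self.min_periods = min_periods

    def _apply(self, func):
        # Shared helper that reads self.min_periods; None means a full window.
        required = self.window if self.min_periods is None else self.min_periods
        out = []
        for i in range(len(self.values)):
            win = self.values[max(0, i - self.window + 1): i + 1]
            valid = [v for v in win if not math.isnan(v)]
            out.append(float(func(valid)) if len(valid) >= required else math.nan)
        return out

    def sum(self):
        return self._apply(sum)

    def count(self):
        original = self.min_periods
        if self.min_periods is None:
            self.min_periods = 0  # count-specific default, applied temporarily
        try:
            return self._apply(len)
        finally:
            # Restore the caller-supplied value, even if _apply raises.
            self.min_periods = original


rolled = ToyRolling([math.nan, 1, 2], window=2)
res1 = rolled.sum()
counts = rolled.count()
res2 = rolled.sum()
print(res1, counts, res2, rolled.min_periods)
```

The `try`/`finally` keeps the attribute intact on error paths as well. Resolving the default into a local variable instead of touching the attribute at all would avoid the mutation entirely, which corresponds to the first option above.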
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 9d598a5e1eee26df95b3910e3f2934890d062caa
python : 3.7.7.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-50-generic
Version : #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.1
numpy : 1.19.0
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200622
Cython : None
pytest : 6.0.2
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : 0.7.4
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : 0.13.2
pyarrow : 1.0.1
pyxlsb : None
s3fs : 0.4.2
scipy : 1.5.1
sqlalchemy : 1.3.18
tables : 3.6.1
tabulate : None
xarray : 0.15.1
xlrd : 1.2.0
xlwt : None
numba : None