Description
First, thank you all for creating and maintaining this package, and for the obvious care that has gone into the numerical analysis of the rolling
methods.
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
print('Pandas Version', pd.__version__)
pd.options.display.float_format = '{:,.15f}'.format
np.random.seed(42)
index = pd.date_range(start='2020-01-01', end= '2020-01-02', freq='1s').append(pd.DatetimeIndex(['2020-01-03']))
data = np.random.rand(len(index))
df = pd.DataFrame({'data':data}, index=index)
df['mean'] = df.rolling('60s').mean()
df
Result
>>> print('Pandas Version', pd.__version__)
Pandas Version 1.0.2
...
>>> df
data mean
2020-01-01 00:00:00 0.374540118847362 0.374540118847362
2020-01-01 00:00:01 0.950714306409916 0.662627212628639
2020-01-01 00:00:02 0.731993941811405 0.685749455689561
2020-01-01 00:00:03 0.598658484197037 0.663976712816430
2020-01-01 00:00:04 0.156018640442437 0.562385098341631
... ... ...
2020-01-01 23:59:57 0.888246435024149 0.461337636515083
2020-01-01 23:59:58 0.825973496654809 0.468473569401559
2020-01-01 23:59:59 0.487629482790733 0.473417509587717
2020-01-02 00:00:00 0.705341871943242 0.480022530163534
2020-01-03 00:00:00 0.105076984427931 0.105076984428225
[86402 rows x 2 columns]
Problem description
In the example above I show a rolling mean over a 1-minute window, but the final point jumps ahead by one day. Since this is the only point in the final window, I expect the mean here to be exactly the single value in the window. You can see however that there is some error in the low-order 4 dights displayed.
I believe this is due to an omission in the function roll_mean_variable
which I am reading in the file pandas/_libs/window/aggregations.pyx
. I believe that the function should zero the running sum whenever the number of observations goes to 0 after a "remove" operation. Otherwise, some noise from previous periods will be left in the running sum.
If I apply this patch to my local copy then I get the exact mean.
diff --git a/pandas/_libs/window/aggregations.pyx b/pandas/_libs/window/aggregations.pyx
index 495b436..76ba1dd 100644
--- a/pandas/_libs/window/aggregations.pyx
+++ b/pandas/_libs/window/aggregations.pyx
@@ -344,20 +344,24 @@ def roll_mean_variable(ndarray[float64_t] values, ndarray[int64_t] start,
val = values[j]
add_mean(val, &nobs, &sum_x, &neg_ct)
else:
# calculate deletes
for j in range(start[i - 1], s):
val = values[j]
remove_mean(val, &nobs, &sum_x, &neg_ct)
+ # Reset sum if NOBS goes to 0
+ if nobs == 0:
+ sum_x = 0
+
# calculate adds
for j in range(end[i - 1], e):
val = values[j]
add_mean(val, &nobs, &sum_x, &neg_ct)
output[i] = calc_mean(minp, nobs, neg_ct, sum_x)
if not is_monotonic_bounds:
for j in range(s, e):
val = values[j]
I understand that there is a philosophical question: Given that we know that most rolling means are likely not bit-accurate anyway, is it worth doing a check for this special case? One argument would be that for time series like mine with gaps here and there, the gaps would serve to clear the accumulated errors every time a gap occurred, and in general the accuracy of the procedure would improve.
I notice that the rolling variance calculations do reset the accumulators to 0 when they completely depopulate a window. I am curious though, is there a numerical-analysis reason for doing the adds before the removes there? This order nullifies the resetting-effect of clearing the accumulators I believe, because my windows will never go to empty with this ordering. This is the code I'm referring to:
pandas/pandas/_libs/window/aggregations.pyx
Line 520 in 5da500a
Thank you,
Bishop Brock
Expected Output
I expect the rolling mean of any window specified with a time period with a single data point to be equal to that data point.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.0.2+0.g7485dbe6f.dirty
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.45.1