
Rolling mean on a time-based period with one data point does not return the exact mean #32761

Closed
@bcbrock

Description


First, thank you all for creating and maintaining this package, and for the obvious care that has gone into the numerical analysis of the rolling methods.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
print('Pandas Version', pd.__version__)
pd.options.display.float_format = '{:,.15f}'.format
np.random.seed(42)

index = pd.date_range(start='2020-01-01', end='2020-01-02', freq='1s').append(pd.DatetimeIndex(['2020-01-03']))
data = np.random.rand(len(index))

df = pd.DataFrame({'data':data}, index=index)
df['mean'] = df.rolling('60s').mean()
df

Result

>>> print('Pandas Version', pd.__version__)
Pandas Version 1.0.2
...
>>> df
                                 data              mean
2020-01-01 00:00:00 0.374540118847362 0.374540118847362
2020-01-01 00:00:01 0.950714306409916 0.662627212628639
2020-01-01 00:00:02 0.731993941811405 0.685749455689561
2020-01-01 00:00:03 0.598658484197037 0.663976712816430
2020-01-01 00:00:04 0.156018640442437 0.562385098341631
...                               ...               ...
2020-01-01 23:59:57 0.888246435024149 0.461337636515083
2020-01-01 23:59:58 0.825973496654809 0.468473569401559
2020-01-01 23:59:59 0.487629482790733 0.473417509587717
2020-01-02 00:00:00 0.705341871943242 0.480022530163534
2020-01-03 00:00:00 0.105076984427931 0.105076984428225

[86402 rows x 2 columns]

Problem description

In the example above I compute a rolling mean over a 1-minute window, but the final point jumps ahead by one day. Since that point is the only one in the final window, I expect the mean there to be exactly the single value in the window. You can see, however, that there is some error in the low-order 4 digits displayed.

I believe this is due to an omission in the function roll_mean_variable, which I am reading in the file pandas/_libs/window/aggregations.pyx. The function should zero the running sum whenever the number of observations drops to 0 after a "remove" pass; otherwise, noise from previous windows is left behind in the running sum.

If I apply this patch to my local copy, I get the exact mean.

diff --git a/pandas/_libs/window/aggregations.pyx b/pandas/_libs/window/aggregations.pyx
index 495b436..76ba1dd 100644
--- a/pandas/_libs/window/aggregations.pyx
+++ b/pandas/_libs/window/aggregations.pyx
@@ -344,20 +344,24 @@ def roll_mean_variable(ndarray[float64_t] values, ndarray[int64_t] start,
                     val = values[j]
                     add_mean(val, &nobs, &sum_x, &neg_ct)
 
             else:
 
                 # calculate deletes
                 for j in range(start[i - 1], s):
                     val = values[j]
                     remove_mean(val, &nobs, &sum_x, &neg_ct)
 
+                # Reset sum if NOBS goes to 0
+                if nobs == 0:
+                    sum_x = 0
+
                 # calculate adds
                 for j in range(end[i - 1], e):
                     val = values[j]
                     add_mean(val, &nobs, &sum_x, &neg_ct)
 
             output[i] = calc_mean(minp, nobs, neg_ct, sum_x)
 
             if not is_monotonic_bounds:
                 for j in range(s, e):
                     val = values[j]

I understand that there is a philosophical question: given that most rolling means are likely not bit-accurate anyway, is it worth checking for this special case? One argument in favor is that for time series like mine, with occasional gaps, every gap would clear the accumulated error, so the overall accuracy of the procedure would improve.

I notice that the rolling variance calculations do reset the accumulators to 0 when a window is completely depopulated. I am curious, though: is there a numerical-analysis reason for doing the adds before the removes there? I believe this ordering nullifies the resetting effect of clearing the accumulators, because with it the window never goes empty. This is the code I'm referring to:
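The ordering question can be illustrated with a hypothetical bookkeeping model (my own code, not the pandas roll_var implementation) that checks whether the observation count ever touches zero between windows:

```python
# Hypothetical model (my own bookkeeping, not the pandas roll_var code)
# comparing the two update orderings. With adds before removes, the
# running observation count never reaches zero between windows, so a
# reset-on-empty check can never fire.
def window_ever_empty(starts, ends, add_first):
    nobs = prev_s = prev_e = 0
    hit_zero = False
    for s, e in zip(starts, ends):
        if add_first:
            nobs += e - prev_e   # adds first ...
            nobs -= s - prev_s   # ... then removes
        else:
            nobs -= s - prev_s   # removes first ...
            if prev_e > 0 and nobs == 0:
                hit_zero = True  # window fully depopulated here
            nobs += e - prev_e   # ... then adds
        prev_s, prev_e = s, e
    return hit_zero

# Disjoint consecutive windows, as with a gap in a time-based index:
starts, ends = [0, 2], [2, 3]
print(window_ever_empty(starts, ends, add_first=False))  # True
print(window_ever_empty(starts, ends, add_first=True))   # False
```

With removes first, the depopulated state is visible and a reset can clear accumulated error; with adds first, the new window's values are already mixed into the accumulators before the old ones leave.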

Thank you,

Bishop Brock

Expected Output

I expect the rolling mean of any window specified with a time period with a single data point to be equal to that data point.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.2+0.g7485dbe6f.dirty
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.2.1
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.45.1

Labels: Window (rolling, ewma, expanding)