Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
# Interpolating "inside" gaps with limit gap size of 3
import pandas as pd
import numpy as np
ts_index = pd.date_range('2016-01-01', '2016-01-02', freq='H')
limit_gap_size = 3
df = pd.DataFrame(index=ts_index, data={'raw':np.random.uniform(size=ts_index.size)})
df.iloc[0:3] = np.nan
df.iloc[5:7] = np.nan
df.iloc[10:16] = np.nan
df.iloc[17:20] = np.nan
df.iloc[23:25] = np.nan
df['filled'] = df['raw'].interpolate(limit=limit_gap_size, limit_area='inside')
# solution to specific case
# credit functions below: https://stackoverflow.com/a/54512613/752092
def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out

def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    diff = bfill_nan(diff)
    return (diff <= maxgap) | ~isnan  # <= instead of < compared to SO answer
df['expected'] = df['raw'].interpolate(limit=limit_gap_size, limit_area='inside').where(calc_mask(df['raw'], limit_gap_size))
df
                          raw    filled  expected
2016-01-01 00:00:00       NaN       NaN       NaN
2016-01-01 01:00:00       NaN       NaN       NaN
2016-01-01 02:00:00       NaN       NaN       NaN
2016-01-01 03:00:00  0.781920  0.781920  0.781920
2016-01-01 04:00:00  0.732783  0.732783  0.732783
2016-01-01 05:00:00       NaN  0.545743  0.545743
2016-01-01 06:00:00       NaN  0.358704  0.358704
2016-01-01 07:00:00  0.171664  0.171664  0.171664
2016-01-01 08:00:00  0.689487  0.689487  0.689487
2016-01-01 09:00:00  0.131983  0.131983  0.131983
2016-01-01 10:00:00       NaN  0.140856       NaN
2016-01-01 11:00:00       NaN  0.149729       NaN
2016-01-01 12:00:00       NaN  0.158601       NaN
2016-01-01 13:00:00       NaN       NaN       NaN
2016-01-01 14:00:00       NaN       NaN       NaN
2016-01-01 15:00:00       NaN       NaN       NaN
2016-01-01 16:00:00  0.194093  0.194093  0.194093
2016-01-01 17:00:00       NaN  0.330719  0.330719
2016-01-01 18:00:00       NaN  0.467345  0.467345
2016-01-01 19:00:00       NaN  0.603971  0.603971
2016-01-01 20:00:00  0.740598  0.740598  0.740598
2016-01-01 21:00:00  0.223751  0.223751  0.223751
2016-01-01 22:00:00  0.383625  0.383625  0.383625
2016-01-01 23:00:00       NaN       NaN       NaN
2016-01-02 00:00:00       NaN       NaN       NaN
Problem description
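When a limit is passed, I expect it to be respected as a maximum gap size: gaps longer than the limit should not be interpolated at all. Instead, interpolate fills the first limit values of every gap (or the last ones, depending on limit_direction) and leaves the rest as NaN. In the example above, the six-value gap from 10:00 to 15:00 gets its first three values filled in the filled column. Partially filling a gap like this is not reasonable behavior.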
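For comparison, here is a smaller sketch of the same idea that measures gap lengths with a groupby instead of the NumPy helpers above. The name fill_short_gaps and its max_gap parameter are hypothetical and not part of pandas. It assumes linear interpolation and only illustrates the requested behavior next to what limit currently does.

```python
import numpy as np
import pandas as pd


def fill_short_gaps(s, max_gap):
    """Interpolate interior NaN gaps of at most `max_gap` values (hypothetical helper)."""
    isna = s.isna()
    # Give every run of consecutive NaN / non-NaN values its own label.
    run_id = isna.ne(isna.shift()).cumsum()
    # Length of the NaN run each element belongs to (0 for non-NaN runs).
    run_len = isna.astype(int).groupby(run_id).transform("sum")
    # Interpolate all interior gaps, then blank out the gaps that are too long.
    filled = s.interpolate(limit_area="inside")
    return filled.mask(isna & (run_len > max_gap))


s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0,
               np.nan, np.nan, np.nan, np.nan, 9.0, np.nan])

# Current behavior: the 4-value gap is partially filled (first 3 values).
partial = s.interpolate(limit=3, limit_area="inside")

# Requested behavior: the 4-value gap is left untouched.
result = fill_short_gaps(s, max_gap=3)

print(pd.DataFrame({"raw": s, "partial": partial, "result": result}))
```

The two-value gap is filled in both columns. The four-value gap is partially filled by limit=3 and left entirely as NaN by the helper.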
Expected Output
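See the expected column in the example output above: gaps of at most limit_gap_size values are fully interpolated, and longer gaps are left as NaN.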
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 2a7d332
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None
pandas : 1.1.2
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200814
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : None
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.19
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None