
BUG: interpolate with limit keyword partially fills gaps larger than limit #36352

Open
@rhkarls

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

# Interpolating "inside" gaps with limit gap size of 3

import pandas as pd
import numpy as np

ts_index = pd.date_range('2016-01-01', '2016-01-02', freq='H')
limit_gap_size = 3

df = pd.DataFrame(index=ts_index, data={'raw':np.random.uniform(size=ts_index.size)})

df.iloc[0:3] = np.nan
df.iloc[5:7] = np.nan
df.iloc[10:16] = np.nan
df.iloc[17:20] = np.nan
df.iloc[23:25] = np.nan

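# interpolate the 'raw' column explicitly rather than assigning a whole DataFrame to a column
df['filled'] = df['raw'].interpolate(limit=limit_gap_size, limit_area='inside')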

# solution to specific case
# credit functions below: https://stackoverflow.com/a/54512613/752092
def bfill_nan(arr):
    """ Backward-fill NaNs """
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[0]), mask.shape[0]-1)
    idx = np.minimum.accumulate(idx[::-1], axis=0)[::-1]
    out = arr[idx]
    return out

def calc_mask(arr, maxgap):
    """ Mask NaN gaps longer than `maxgap` """
    isnan = np.isnan(arr)
    cumsum = np.cumsum(isnan).astype('float')
    diff = np.zeros_like(arr)
    diff[~isnan] = np.diff(cumsum[~isnan], prepend=0)
    diff[isnan] = np.nan
    diff = bfill_nan(diff)
    return (diff <= maxgap) | ~isnan # <= instead of < compared to SO answer


# calc_mask is written for NumPy arrays, so pass the underlying values
df['expected'] = df['raw'].interpolate(limit=limit_gap_size, limit_area='inside').where(calc_mask(df['raw'].to_numpy(), limit_gap_size))

df

                          raw    filled  expected
2016-01-01 00:00:00       NaN       NaN       NaN
2016-01-01 01:00:00       NaN       NaN       NaN
2016-01-01 02:00:00       NaN       NaN       NaN
2016-01-01 03:00:00  0.781920  0.781920  0.781920
2016-01-01 04:00:00  0.732783  0.732783  0.732783
2016-01-01 05:00:00       NaN  0.545743  0.545743
2016-01-01 06:00:00       NaN  0.358704  0.358704
2016-01-01 07:00:00  0.171664  0.171664  0.171664
2016-01-01 08:00:00  0.689487  0.689487  0.689487
2016-01-01 09:00:00  0.131983  0.131983  0.131983
2016-01-01 10:00:00       NaN  0.140856       NaN
2016-01-01 11:00:00       NaN  0.149729       NaN
2016-01-01 12:00:00       NaN  0.158601       NaN
2016-01-01 13:00:00       NaN       NaN       NaN
2016-01-01 14:00:00       NaN       NaN       NaN
2016-01-01 15:00:00       NaN       NaN       NaN
2016-01-01 16:00:00  0.194093  0.194093  0.194093
2016-01-01 17:00:00       NaN  0.330719  0.330719
2016-01-01 18:00:00       NaN  0.467345  0.467345
2016-01-01 19:00:00       NaN  0.603971  0.603971
2016-01-01 20:00:00  0.740598  0.740598  0.740598
2016-01-01 21:00:00  0.223751  0.223751  0.223751
2016-01-01 22:00:00  0.383625  0.383625  0.383625
2016-01-01 23:00:00       NaN       NaN       NaN
2016-01-02 00:00:00       NaN       NaN       NaN

Problem description

When a limit is passed, I expect it to be respected: gaps larger than the limit should not be interpolated at all. Partially filling the beginning (or end, depending on limit_direction) of such a gap is not reasonable behavior.
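The behavior described above can be shown with a smaller, deterministic series. This is only a sketch: the values and the name `s` are made up for illustration and are not part of the original report.

```python
import numpy as np
import pandas as pd

# One interior gap of 2 NaNs (within the limit) and one of 5 NaNs (beyond it).
s = pd.Series([0.0, np.nan, np.nan, 3.0,
               np.nan, np.nan, np.nan, np.nan, np.nan, 9.0])

filled = s.interpolate(limit=3, limit_area='inside')

# The 2-NaN gap is filled completely. In the 5-NaN gap the first three
# values are filled and the remaining two stay NaN (partial fill).
print(filled)
```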

Expected Output

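See the expected column in the example output above: the six-value gap from 10:00 to 15:00 should stay entirely NaN, while the gaps of two and three values are filled.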
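The expected result can also be sketched with pandas operations only, by measuring the length of each NaN run and masking the runs that exceed the limit. The helper name `interpolate_small_gaps` and the sample series are hypothetical; this is one possible workaround, not pandas' own implementation.

```python
import numpy as np
import pandas as pd

def interpolate_small_gaps(s, max_gap):
    """Interpolate only interior NaN runs of at most `max_gap` values."""
    isnan = s.isna()
    # Give every run of consecutive NaN / non-NaN values its own id.
    run_id = (isnan != isnan.shift()).cumsum()
    # Length of the run each element belongs to.
    run_len = isnan.groupby(run_id).transform('size')
    # Keep original values, plus interpolated values in short gaps only.
    fillable = ~isnan | (run_len <= max_gap)
    return s.interpolate(limit_area='inside').where(fillable)

s = pd.Series([0.0, np.nan, np.nan, 3.0,
               np.nan, np.nan, np.nan, np.nan, np.nan, 9.0])
result = interpolate_small_gaps(s, 3)
print(result)
```

Here the interpolation runs without a limit and the mask decides afterwards which gaps are kept, so a gap is either filled completely or left untouched.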

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.1.2
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 49.6.0.post20200814
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : None
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.19
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

Labels

Docs, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
