BUG: Groupby rolling count gives inconsistent outputs with missing index  #43355

Closed
@LiutongZhou

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

# [1] In:
import pandas as pd

assert pd.__version__ == "1.3.2"

df = pd.DataFrame(
    {
        "id": ["a", "a", "b", "b", "b"],
        "timestamp": pd.date_range("2021-9-1", periods=5, freq="H"),
        "y": range(5),
    }
)
df
# [1] Out:
  id           timestamp  y
0  a 2021-09-01 00:00:00  0
1  a 2021-09-01 01:00:00  1
2  b 2021-09-01 02:00:00  2
3  b 2021-09-01 03:00:00  3
4  b 2021-09-01 04:00:00  4
# [2] In: 
grp = df.groupby("id").rolling("1H", on="timestamp")
grp.count()  
# [2] Out: 
# original index [0,1,2,3,4] is in output
               timestamp    y
id                           
a  0 2021-09-01 00:00:00  1.0
   1 2021-09-01 01:00:00  1.0
b  2 2021-09-01 02:00:00  1.0
   3 2021-09-01 03:00:00  1.0
   4 2021-09-01 04:00:00  1.0
# [3] In:
grp["y"].count()  
# [3] Out:
# original index [0,1,2,3,4] is missing in output as opposed to [2] Out
id  timestamp          
a   2021-09-01 00:00:00    1.0
    2021-09-01 01:00:00    1.0
b   2021-09-01 02:00:00    1.0
    2021-09-01 03:00:00    1.0
    2021-09-01 04:00:00    1.0
Name: y, dtype: float64
# [4] In:
grp.count()  # same as [2] In
# [4] Out:
# output is inconsistent with that from [2] Out
                       timestamp    y
id timestamp                         
a  2021-09-01 00:00:00       NaT  1.0
   2021-09-01 01:00:00       NaT  1.0
b  2021-09-01 02:00:00       NaT  1.0
   2021-09-01 03:00:00       NaT  1.0
   2021-09-01 04:00:00       NaT  1.0
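
The inconsistency between [2] Out and [4] Out can also be checked programmatically rather than by eye. This is a minimal sketch, assuming the same data as cell [1]. The names `before`, `after`, and `consistent` are illustrative, and `pd.offsets.Hour(1)` stands in for the "1H" string so the snippet does not depend on a particular alias spelling. Going by the outputs above, the comparison should report False on pandas 1.3.2.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", "a", "b", "b", "b"],
        "timestamp": pd.date_range(
            "2021-09-01", periods=5, freq=pd.offsets.Hour(1)
        ),
        "y": range(5),
    }
)

grp = df.groupby("id").rolling(pd.offsets.Hour(1), on="timestamp")

before = grp.count()   # same call as [2] In
_ = grp["y"].count()   # same call as [3] In (column selection)
after = grp.count()    # same call as [4] In

# True means the column selection in between did not change grp.count()
consistent = before.equals(after)
print(f"pandas {pd.__version__}: grp.count() stable across calls: {consistent}")
```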

Expected Output

  1. [4] Out should be identical to [2] Out
  2. As a user, I would like the original index to be kept in the output, at least optionally if not by default, so that I can efficiently join the rolling count results back to the original df.
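
Until the index is preserved, one possible workaround is to roll each group separately with the plain (non-groupby) `DataFrame.rolling`, then concatenate the pieces. This is only a sketch, and it assumes that plain `rolling(..., on=...)` returns results on the caller's original index. The names `window`, `parts`, `y_count`, and `joined` are illustrative and not part of the report.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", "a", "b", "b", "b"],
        "timestamp": pd.date_range(
            "2021-09-01", periods=5, freq=pd.offsets.Hour(1)
        ),
        "y": range(5),
    }
)

window = pd.offsets.Hour(1)

# Roll each group on its own; each piece is indexed by the original row labels.
parts = [
    g.rolling(window, on="timestamp")["y"].count()
    for _, g in df.groupby("id")
]

# Stack the per-group pieces and join them back by original index.
y_count = pd.concat(parts).rename("y_count")
joined = df.join(y_count)
print(joined)
```

A Python-level loop over groups is slower than a single groupby rolling call, so this trades speed for an index-aligned join.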

Problem description

  1. Groupby rolling count gives inconsistent outputs (compare [2] Out and [4] Out) when cells [1] through [4] are run in order. The same problem occurs when count is replaced by agg(len).
  2. [3] Out on pandas 1.3.2 drops the original index, whereas [2] Out (as well as [3] Out on pandas 1.1.5) keeps it. The original index (in this example, [0,1,2,3,4]) should ideally be kept in the output, at least optionally. Otherwise there is no way to trace which rolling count corresponds to which original row (see [3] Out). Joining the two data frames on ['id', 'timestamp'] afterwards is not a correct solution, because (id, timestamp) pairs need not be unique, so the match is not guaranteed to be one-to-one.
# [3] Out: pandas 1.3.2
# original index [0,1,2,3,4] is missing. 
id  timestamp          
a   2021-09-01 00:00:00    1.0
    2021-09-01 01:00:00    1.0
b   2021-09-01 02:00:00    1.0
    2021-09-01 03:00:00    1.0
    2021-09-01 04:00:00    1.0
Name: y, dtype: float64
# [3] Out: pandas 1.1.5
# original index [0,1,2,3,4] is maintained as expected, though the rolling counts are wrong (should be [1,1,1,1,1])
id   
a   0    1.0
    1    2.0
b   2    1.0
    3    2.0
    4    3.0
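
To illustrate why joining on ['id', 'timestamp'] is not a safe substitute for the original index, here is a small sketch with a repeated (id, timestamp) pair. The data and the `y_count` values are hypothetical placeholders, not actual pandas output. Only the shape of the merge matters.

```python
import pandas as pd

t0 = pd.Timestamp("2021-09-01 00:00:00")

# Two rows of group "a" share the same timestamp.
df = pd.DataFrame(
    {
        "id": ["a", "a", "b"],
        "timestamp": [t0, t0, t0 + pd.Timedelta(hours=1)],
        "y": [0, 1, 2],
    }
)

# Hypothetical per-row result keyed only by (id, timestamp);
# the y_count values are placeholders chosen for illustration.
result = df[["id", "timestamp"]].assign(y_count=[1.0, 2.0, 1.0])

# The duplicated key matches 2 x 2 times, so rows are multiplied.
merged = df.merge(result, on=["id", "timestamp"])
n_rows = len(merged)

# Asking pandas to enforce a one-to-one match makes the problem explicit.
try:
    df.merge(result, on=["id", "timestamp"], validate="one_to_one")
    one_to_one = True
except pd.errors.MergeError:
    one_to_one = False

print(f"original rows: {len(df)}, merged rows: {n_rows}, one-to-one: {one_to_one}")
```

With the original index kept in the output, the results could instead be aligned by row label, which stays unambiguous even when timestamps repeat.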

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : 5f648bf
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.225-121.362.amzn1.x86_64
Version : #1 SMP Tue Mar 23 00:29:14 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.2
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : 4.1.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
