Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
# [1] In:
import pandas as pd
assert pd.__version__ == "1.3.2"
df = pd.DataFrame(
    {
        "id": ["a", "a", "b", "b", "b"],
        "timestamp": pd.date_range("2021-9-1", periods=5, freq="H"),
        "y": range(5),
    }
)
df
# [1] Out:
  id           timestamp  y
0  a 2021-09-01 00:00:00  0
1  a 2021-09-01 01:00:00  1
2  b 2021-09-01 02:00:00  2
3  b 2021-09-01 03:00:00  3
4  b 2021-09-01 04:00:00  4
# [2] In:
grp = df.groupby("id").rolling("1H", on="timestamp")
grp.count()
# [2] Out:
# original index [0,1,2,3,4] is in output
               timestamp    y
id
a  0 2021-09-01 00:00:00  1.0
   1 2021-09-01 01:00:00  1.0
b  2 2021-09-01 02:00:00  1.0
   3 2021-09-01 03:00:00  1.0
   4 2021-09-01 04:00:00  1.0
# [3] In:
grp["y"].count()
# [3] Out:
# original index [0,1,2,3,4] is missing in output as opposed to [2] Out
id  timestamp
a   2021-09-01 00:00:00    1.0
    2021-09-01 01:00:00    1.0
b   2021-09-01 02:00:00    1.0
    2021-09-01 03:00:00    1.0
    2021-09-01 04:00:00    1.0
Name: y, dtype: float64
# [4] In:
grp.count() # same as [2] In
# [4] Out:
# output is inconsistent with that from [2] Out
                       timestamp    y
id timestamp
a  2021-09-01 00:00:00       NaT  1.0
   2021-09-01 01:00:00       NaT  1.0
b  2021-09-01 02:00:00       NaT  1.0
   2021-09-01 03:00:00       NaT  1.0
   2021-09-01 04:00:00       NaT  1.0
Expected Output
- [4] Out should be identical to [2] Out.
- As a user, I would hope the original index is kept in the output, at least optionally if not by default, so that I can efficiently join the rolling count results back to the original df.
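In the meantime, one possible workaround is to run the rolling count one group at a time and concatenate the pieces, which keeps the original row labels. This is only a sketch under some assumptions: `timestamp` is sorted within each `id`, `"60min"` is used in place of `"1H"` (same window length), the timestamps are spelled out rather than generated with `pd.date_range`, and the names `parts`, `rolling_count` and `y_count` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["a", "a", "b", "b", "b"],
        "timestamp": pd.to_datetime(
            [
                "2021-09-01 00:00",
                "2021-09-01 01:00",
                "2021-09-01 02:00",
                "2021-09-01 03:00",
                "2021-09-01 04:00",
            ]
        ),
        "y": range(5),
    }
)

# Roll within each group separately; each piece is a Series that is
# still labelled by the original row index of df.
parts = [
    g.rolling("60min", on="timestamp")["y"].count()
    for _, g in df.groupby("id")
]
rolling_count = pd.concat(parts)

# Assignment aligns on the original index, so no join on
# ("id", "timestamp") is needed.
df["y_count"] = rolling_count
print(df)
```

The Python-level loop will likely be slower than the built-in groupby rolling path on data with many groups, so this is a stopgap rather than a substitute for consistent index handling.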
Problem description
- Groupby rolling count gives inconsistent outputs (see [2] Out and [4] Out) when running cells 1 to 4 in order. The same problem exists when count is replaced by agg(len).
- [3] Out using pandas 1.3.2 drops the original index, whereas [2] Out (as well as [3] Out using pandas 1.1.5) keeps it. The original index (in this example, [0,1,2,3,4]) should ideally be kept in the outputs, at least optionally; otherwise there is no way to trace back which rolling count corresponds to which original row (see [3] Out). Joining the two data frames on ['id', 'timestamp'] afterwards is not a correct solution, because it is easy to find examples where the match is not 1-1.
# [3] Out: pandas 1.3.2
# original index [0,1,2,3,4] is missing.
id  timestamp
a   2021-09-01 00:00:00    1.0
    2021-09-01 01:00:00    1.0
b   2021-09-01 02:00:00    1.0
    2021-09-01 03:00:00    1.0
    2021-09-01 04:00:00    1.0
Name: y, dtype: float64
# [3] Out: pandas 1.1.5
# original index [0,1,2,3,4] maintained as expected though the rolling count numbers are wrong (should be [1,1,1,1,1])
id
a   0    1.0
    1    2.0
b   2    1.0
    3    2.0
    4    3.0
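To illustrate why joining on ['id', 'timestamp'] is not a safe substitute for the original index, here is a small sketch with a duplicated (id, timestamp) pair. The `counts` frame is a hypothetical stand-in for a rolling result keyed only by id and timestamp; its `y_count` values are placeholders, not actual pandas output.

```python
import pandas as pd

# Two rows share the same (id, timestamp) key.
df = pd.DataFrame(
    {
        "id": ["a", "a", "b"],
        "timestamp": pd.to_datetime(
            ["2021-09-01 00:00", "2021-09-01 00:00", "2021-09-01 01:00"]
        ),
        "y": [0, 1, 2],
    }
)

# Hypothetical stand-in for a result that carries only (id, timestamp),
# with placeholder values in y_count.
counts = df[["id", "timestamp"]].copy()
counts["y_count"] = [1.0, 2.0, 1.0]

# Each duplicated key on the left matches every duplicate on the right,
# so the merge returns more rows than the original frame.
merged = df.merge(counts, on=["id", "timestamp"], how="left")
print(len(df), len(merged))  # prints: 3 5
```

With the original index available in the result, the counts could be assigned back by label instead, and the duplicated key would not matter.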
Output of pd.show_versions()
pandas : 1.3.2
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : 4.1.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 0.4.2
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None