Skip to content

BUG: Rolling groupby should not maintain the by column in the resulting DataFrame #32332

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 41 commits into from
Closed
Show file tree
Hide file tree
Changes from 24 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
a6ad6c5
make sure exclusions are applied before the groupby object reaches ro…
Feb 28, 2020
54903a7
pass object with exclusions earlier on
MarcoGorelli Feb 29, 2020
d81c572
revert accident
MarcoGorelli Feb 29, 2020
e049205
check for side effects in obj
MarcoGorelli Feb 29, 2020
5f43f3a
wip
MarcoGorelli Mar 1, 2020
3b1c3ff
Wip
MarcoGorelli Mar 1, 2020
207a507
change _selected_obj according to self.mutated
MarcoGorelli Mar 1, 2020
c9b34b2
sanitize
MarcoGorelli Mar 1, 2020
b2ce758
update old tests
MarcoGorelli Mar 1, 2020
c307fdc
fix old docstring
MarcoGorelli Mar 1, 2020
35f2b2d
finish correcting old docstring
MarcoGorelli Mar 1, 2020
2bb7932
use getattr
MarcoGorelli Mar 1, 2020
eed4122
old fix
Mar 3, 2020
26d9dec
simpler fix
Mar 3, 2020
68b4299
try making Window's own _shallow_copy
Mar 3, 2020
159a3d6
more wip
Mar 3, 2020
0b393b6
typeerror!
Mar 3, 2020
9fc50ee
add some types, comment tests
Mar 3, 2020
7fe2fcf
fix typing
Mar 3, 2020
8a34ff4
better fix
Mar 3, 2020
8953bda
correct return annotation
Mar 3, 2020
b2b8a42
simplify code
Mar 3, 2020
d476191
fix upstream, remove try except
Mar 4, 2020
63e3d85
document change as api-breaking
Mar 4, 2020
17373ad
Merge remote-tracking branch 'upstream/master' into rolling-groupby
MarcoGorelli Mar 15, 2020
a7ca8eb
remove elements from exclusions within _obj_with_exclusions
MarcoGorelli Mar 15, 2020
35e7340
remove no-longer-necessary performance warning
MarcoGorelli Mar 15, 2020
17090da
simplify _obj_with_exclusions
MarcoGorelli Mar 15, 2020
7c7f79c
correct comment
MarcoGorelli Mar 15, 2020
9b98dad
Merge remote-tracking branch 'upstream/master' into rolling-groupby
Mar 19, 2020
4d1c5a6
deal with multiindexed columns case
Mar 21, 2020
0dde4a8
use elif in _obj_with_exclusions, add test for column multiindex, dis…
Mar 21, 2020
1e4ba56
simplify unique_column_names
Mar 21, 2020
6312725
rewrite using get_level_values
Mar 21, 2020
d9748d1
revert 'or' which was accidentally changed to 'and'
Mar 21, 2020
1623902
only patch _apply, cov and corr
Mar 21, 2020
5d7a477
reinstate change to _obj_with_exclusions
Mar 21, 2020
d117624
exclude 'by' column in Rolling.count
Mar 22, 2020
367e671
don't modify _selected_obj - instead, patch obj before we reach apply
Mar 22, 2020
d7cf9f1
slight simplification to _shallow_copy
Mar 22, 2020
d251715
comment on patch in corr and cov
Mar 22, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,31 @@ key and type of :class:`Index`. These now consistently raise ``KeyError`` (:iss
...
KeyError: Timestamp('1970-01-01 00:00:00')

GroupBy.rolling no longer returns grouped-by column in values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an expl here, a couple of sentences; this the issue number.

*Previous behavior*:

.. code-block:: ipython

In [1]: df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})

In [2]: df.groupby("A").rolling(2).sum()
Out[2]:
A B
A
1 0 NaN NaN
1 2.0 1.0
2 2 NaN NaN
3 3 NaN NaN

*New behavior*:

.. ipython:: python

df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
df.groupby("A").rolling(2).sum()

.. ---------------------------------------------------------------------------

.. _whatsnew_110.deprecations:
Expand Down
1 change: 1 addition & 0 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -1577,6 +1577,7 @@ def rolling(self, *args, **kwargs):
"""
from pandas.core.window import RollingGroupby

kwargs["exclusions"] = self.exclusions
return RollingGroupby(self, *args, **kwargs)

@Substitution(name="groupby")
Expand Down
1 change: 1 addition & 0 deletions pandas/core/window/ewm.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,6 +129,7 @@ def __init__(
adjust=True,
ignore_na=False,
axis=0,
**kwargs,
):
self.obj = obj
self.com = _get_center_of_mass(com, span, halflife, alpha)
Expand Down
12 changes: 12 additions & 0 deletions pandas/core/window/rolling.py
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,18 @@ def __init__(
self.axis = obj._get_axis_number(axis) if axis is not None else None
self.validate()
self._numba_func_cache: Dict[Optional[str], Callable] = dict()
self.exclusions = kwargs.get("exclusions", set())

def _shallow_copy(self, obj: FrameOrSeries, **kwargs) -> ShallowMixin:
if isinstance(obj, ABCDataFrame):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAT? i would much rather properly copy in the super class. really avoid if/then cases unless absolutely necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOW we can always just copy exclusions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback sure, I can remove the if-then cases here, but then I need to potentially remove elements from self.exclusions in

def _obj_with_exclusions(self)

, otherwise this could throw an error if there are elements of self.exclusions that are no longer part of self.obj.columns:

return self.obj.drop(self.exclusions, axis=1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, that's what I've done in the latest commit

# there may be elements in self.exclusions that are no longer
# in obj, see GH 32468
exclusions = self.exclusions.intersection(obj.columns)
new_obj = super()._shallow_copy(obj, exclusions=exclusions, **kwargs)
new_obj.obj = new_obj._obj_with_exclusions
else:
new_obj = super()._shallow_copy(obj, **kwargs)
return new_obj

@property
def _constructor(self):
Expand Down
10 changes: 10 additions & 0 deletions pandas/tests/window/test_grouper.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,11 +62,15 @@ def test_rolling(self):
for f in ["sum", "mean", "min", "max", "count", "kurt", "skew"]:
result = getattr(r, f)()
expected = g.apply(lambda x: getattr(x.rolling(4), f)())
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

for f in ["std", "var"]:
result = getattr(r, f)(ddof=1)
expected = g.apply(lambda x: getattr(x.rolling(4), f)(ddof=1))
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize(
Expand All @@ -79,6 +83,8 @@ def test_rolling_quantile(self, interpolation):
expected = g.apply(
lambda x: x.rolling(4).quantile(0.4, interpolation=interpolation)
)
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

def test_rolling_corr_cov(self):
Expand All @@ -92,6 +98,8 @@ def func(x):
return getattr(x.rolling(4), f)(self.frame)

expected = g.apply(func)
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

result = getattr(r.B, f)(pairwise=True)
Expand All @@ -109,6 +117,8 @@ def test_rolling_apply(self, raw):
# reduction
result = r.apply(lambda x: x.sum(), raw=raw)
expected = g.apply(lambda x: x.rolling(4).apply(lambda y: y.sum(), raw=raw))
# groupby.apply doesn't drop the grouped-by column
expected = expected.drop("A", axis=1)
tm.assert_frame_equal(result, expected)

def test_rolling_apply_mutability(self):
Expand Down
12 changes: 12 additions & 0 deletions pandas/tests/window/test_rolling.py
Original file line number Diff line number Diff line change
Expand Up @@ -465,3 +465,15 @@ def test_rolling_count_default_min_periods_with_null_values(constructor):
result = constructor(values).rolling(3).count()
expected = constructor(expected_counts)
tm.assert_equal(result, expected)


def test_by_column_not_in_values():
# GH 32262
df = pd.DataFrame({"A": [1] * 20 + [2] * 12 + [3] * 8, "B": np.arange(40)})

g = df.groupby("A")
original_obj = g.obj.copy(deep=True)
r = g.rolling(4)
result = r.sum()
assert "A" not in result.columns
tm.assert_frame_equal(g.obj, original_obj) # check for side-effects