REF: Refactor GroupBy a tiny bit #46258

NickCrews · 2022-03-08T00:45:40Z

[NA, small] closes #xxxx (Replace xxxx with the Github issue number)
[waiting for review] Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
[waiting for review] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Grouping._codes_and_uniques and BaseGrouper._get_compressed_codes were both semantically doing the same thing as core.algorithms.factorize(), so rename them so that it's just a bit easier to follow. Per a review comment in pandas-dev#46207

The return value is never used. Don't pretend that it is.

and move dropna up out of the "generated" attribute section into the "source" attribute section

jbrockmendel · 2022-03-08T02:18:54Z

pandas/core/groupby/grouper.py

@@ -413,7 +413,6 @@ def _set_grouper(self, obj: NDFrame, sort: bool = False):
        # "NDFrameT", variable has type "None")
        self.obj = obj  # type: ignore[assignment]
        self._gpr_index = ax


if you can find a way to avoid setting these attributes instead, that'd make this class easier to reason about

I agree, if I understand what you're saying. Do you mean make this whole class less stateful?

I actually started doing this because I thought the same thing. I wanted to make this Grouper be immutable. Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?

Do you mean make this whole class less stateful?

exactly. IIRC i put some effort into this last year and found it difficult.

Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?

Not sure off the top of my head. Depends in large part on if it would be user-facing, in which case it is a harder sell.

Yeah I was worried about user facing changes. What exactly do we consider user-facing? In the docs Grouper is pretty much a namedtuple, and I would argue this should be the public API (just the attributes given in the constructor).

What gets in the way of this are the ax and groups properties, which look like they are supposed to be public. But I'm pretty sure it is a mistake to expose those two, since they depend on the Grouper being bound to an obj, and there isn't a user-facing way to bind a Grouper to an obj. So I feel like any user who is relying on those two properties is already reaching too far into the Grouper and therefore that should be considered undefined.

Thoughts?

jbrockmendel · 2022-03-08T02:19:48Z

pandas/core/groupby/ops.py

        self,
    ) -> tuple[npt.NDArray[np.signedinteger], npt.NDArray[np.intp]]:
-        # The first returned ndarray may have any signed integer dtype
+        """Analogous to core.algorithms.factorize"""


not sure this is accurate; the second item returned is ndarray[intp] instead of ArrayLike

I think it still is the same pattern where uniques.take[codes] is the same as the original data, just with a slightly different type. Or is there some bigger difference I'm missing? Should I just change the docstring to be "Similar idea as core.algorithms.factorize", or you don't think algorithms.factorize should be mentioned at all?

or you don't think algorithms.factorize should be mentioned at all?

yah lets leave this case out

BTW if you're interested in a hopefully-hours-sized goal:

In all extant test cases, the second array returned by this method has length(self._get_compressed_codes()[1]) == self.ngroups. Can we confirm that this will always hold? Doing so would allow simplifying group_info.

(It is easy-ish to confirm that this holds for the len(self.groupings) == 1 case, but I haven't confirmed it for the other case)

jbrockmendel · 2022-03-08T02:20:52Z

pandas/core/groupby/grouper.py

@@ -263,7 +263,7 @@ class Grouper:
    _gpr_index: Index | None
    _grouper: Index | None

-    _attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort")
+    _attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort", "dropna")


is _attributes used for anything? if so, this should change some behavior right

I'm pretty sure after a bit of poking around it's just used in repr, but I'm not positive (TimeGrouper does some funky stuff with getattr that might somehow be getting to this). I think the only failed tests are ones that test repr()

This being the case, I think I might even rename this to _repr_attributes, what do you think? Because it is quite hard to tell what this is otherwise used for.

i think we use the name _attributes elsewhere so wouldn't change it.

NickCrews · 2022-03-08T16:56:23Z

@jbrockmendel how about I drop the commit where I add dropna, address the other tweaks you suggest, and then we can merge this as a start? And then maybe do those offshoot changes separately?

pandas/core/resample.py

github-actions · 2022-04-12T00:05:07Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

The return value is never used. Don't pretend that it is. Ideally we don't even hold this state inside the Grouper, it would make it easier to reason about. But just do this tiny change for now, still an improvement. See pandas-dev#46258 for more details.

All this changes I think is that now `dropna` is included in the `__repr__` output. See pandas-dev#46258 for more discussion. And move dropna up out of the "generated" attribute section into the "source" attribute section in the __init__

NickCrews · 2022-04-12T16:11:40Z

Looks like this is a bit more than I want to take on, but I pulled the less-controversial changes into #46754, where hopefully they can still be merged.

All this changes I think is that now `dropna` is included in the `__repr__` output. See pandas-dev#46258 for more discussion. And move dropna up out of the "generated" attribute section into the "source" attribute section in the __init__

The return value is never used. Don't pretend that it is. Ideally we don't even hold this state inside the Grouper, it would make it easier to reason about. But just do this tiny change for now, still an improvement. See pandas-dev#46258 for more details.

All this changes I think is that now `dropna` is included in the `__repr__` output. See pandas-dev#46258 for more discussion. And move dropna up out of the "generated" attribute section into the "source" attribute section in the __init__

NickCrews added 4 commits March 7, 2022 19:19

REF: Unify use of factorize() in groupby

9e51ab0

Grouping._codes_and_uniques and BaseGrouper._get_compressed_codes were both semantically doing the same thing as core.algorithms.factorize(), so rename them so that it's just a bit easier to follow. Per a review comment in pandas-dev#46207

TYP: Add Literal annotations to TimeGrouper

3d80d3d

TYP: Fix annotation of Grouper._set_grouper

28a7973

The return value is never used. Don't pretend that it is.

BUG: Include "dropna" in Grouper._attributes

8086c31

and move dropna up out of the "generated" attribute section into the "source" attribute section

NickCrews mentioned this pull request Mar 8, 2022

REF: Move Categorical logic from Grouping (#46203) #46207

Closed

4 tasks

NickCrews changed the title ~~Simple refactor groupby~~ REF: Simple GroupBy refactors Mar 8, 2022

NickCrews changed the title ~~REF: Simple GroupBy refactors~~ REF: Refactor GroupBy a tiny bit Mar 8, 2022

jbrockmendel reviewed Mar 8, 2022

View reviewed changes

jbrockmendel reviewed Mar 11, 2022

View reviewed changes

pandas/core/resample.py Show resolved Hide resolved

github-actions bot added the Stale label Apr 12, 2022

NickCrews mentioned this pull request Apr 12, 2022

BUG: REF: Groupby refactor #46754

Merged

3 tasks

NickCrews closed this Apr 12, 2022

Uh oh!

REF: Refactor GroupBy a tiny bit #46258

REF: Refactor GroupBy a tiny bit #46258

Uh oh!

Conversation

NickCrews commented Mar 8, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NickCrews commented Mar 8, 2022

Uh oh!

Uh oh!

github-actions bot commented Apr 12, 2022

Uh oh!

NickCrews commented Apr 12, 2022

Uh oh!

Uh oh!