-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
REF: Refactor GroupBy a tiny bit #46258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Grouping._codes_and_uniques and BaseGrouper._get_compressed_codes were both semantically doing the same thing as core.algorithms.factorize(), so rename them so that it's just a bit easier to follow. Per a review comment in pandas-dev#46207
The return value is never used. Don't pretend that it is.
and move dropna up out of the "generated" attribute section into the "source" attribute section
@@ -413,7 +413,6 @@ def _set_grouper(self, obj: NDFrame, sort: bool = False): | |||
# "NDFrameT", variable has type "None") | |||
self.obj = obj # type: ignore[assignment] | |||
self._gpr_index = ax |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you can find a way to avoid setting these attributes instead, that'd make this class easier to reason about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, if I understand what you're saying. Do you mean make this whole class less stateful?
I actually started doing this because I thought the same thing. I wanted to make this Grouper be immutable. Make this class, Grouper, not know anything about obj
, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj
. Does that sound like a reasonable way to organize things?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean make this whole class less stateful?
exactly. IIRC i put some effort into this last year and found it difficult.
Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?
Not sure off the top of my head. Depends in large part on if it would be user-facing, in which case it is a harder sell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was worried about user facing changes. What exactly do we consider user-facing? In the docs Grouper is pretty much a namedtuple, and I would argue this should be the public API (just the attributes given in the constructor).
What gets in the way of this are the ax
and groups
properties, which look like they are supposed to be public. But I'm pretty sure it is a mistake to expose those two, since they depend on the Grouper being bound to an obj, and there isn't a user-facing way to bind a Grouper to an obj. So I feel like any user who is relying on those two properties is already reaching too far into the Grouper and therefore that should be considered undefined.
Thoughts?
self, | ||
) -> tuple[npt.NDArray[np.signedinteger], npt.NDArray[np.intp]]: | ||
# The first returned ndarray may have any signed integer dtype | ||
"""Analogous to core.algorithms.factorize""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure this is accurate; the second item returned is ndarray[intp] instead of ArrayLike
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it still is the same pattern where uniques.take[codes]
is the same as the original data, just with a slightly different type. Or is there some bigger difference I'm missing? Should I just change the docstring to be "Similar idea as core.algorithms.factorize", or you don't think algorithms.factorize should be mentioned at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or you don't think algorithms.factorize should be mentioned at all?
yah lets leave this case out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW if you're interested in a hopefully-hours-sized goal:
In all extant test cases, the second array returned by this method has length(self._get_compressed_codes()[1]) == self.ngroups
. Can we confirm that this will always hold? Doing so would allow simplifying group_info
.
(It is easy-ish to confirm that this holds for the len(self.groupings) == 1
case, but I haven't confirmed it for the other case)
@@ -263,7 +263,7 @@ class Grouper: | |||
_gpr_index: Index | None | |||
_grouper: Index | None | |||
|
|||
_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort") | |||
_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort", "dropna") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is _attributes used for anything? if so, this should change some behavior right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure after a bit of poking around it's just used in repr, but I'm not positive (TimeGrouper does some funky stuff with getattr that might somehow be getting to this). I think the only failed tests are ones that test repr()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This being the case, I think I might even rename this to _repr_attributes
, what do you think? Because it is quite hard to tell what this is otherwise used for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we use the name _attributes
elsewhere so wouldn't change it.
@jbrockmendel how about I drop the commit where I add |
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
The return value is never used. Don't pretend that it is. Ideally we don't even hold this state inside the Grouper, it would make it easier to reason about. But just do this tiny change for now, still an improvement. See pandas-dev#46258 for more details.
All this changes I think is that now `dropna` is included in the `__repr__` output. See pandas-dev#46258 for more discussion. And move dropna up out of the "generated" attribute section into the "source" attribute section in the __init__
Looks like this is a bit more than I want to take on, but I pulled the less-controversial changes into #46754, where hopefully they can still be merged. |
All this changes I think is that now `dropna` is included in the `__repr__` output. See pandas-dev#46258 for more discussion. And move dropna up out of the "generated" attribute section into the "source" attribute section in the __init__
The return value is never used. Don't pretend that it is. Ideally we don't even hold this state inside the Grouper, it would make it easier to reason about. But just do this tiny change for now, still an improvement. See pandas-dev#46258 for more details.
All this changes I think is that now `dropna` is included in the `__repr__` output. See pandas-dev#46258 for more discussion. And move dropna up out of the "generated" attribute section into the "source" attribute section in the __init__
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.