Skip to content

REF: Refactor GroupBy a tiny bit #46258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

NickCrews
Copy link
Contributor

  • [NA, small] closes #xxxx (Replace xxxx with the Github issue number)
  • [waiting for review] Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • [waiting for review] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Grouping._codes_and_uniques and BaseGrouper._get_compressed_codes
were both semantically doing the same thing as
core.algorithms.factorize(), so rename them so that it's just a bit
easier to follow.

Per a review comment in pandas-dev#46207
The return value is never used. Don't pretend
that it is.
and move dropna up out of the "generated"
attribute section into the "source" attribute
section
@NickCrews NickCrews changed the title Simple refactor groupby REF: Simple GroupBy refactors Mar 8, 2022
@NickCrews NickCrews changed the title REF: Simple GroupBy refactors REF: Refactor GroupBy a tiny bit Mar 8, 2022
@@ -413,7 +413,6 @@ def _set_grouper(self, obj: NDFrame, sort: bool = False):
# "NDFrameT", variable has type "None")
self.obj = obj # type: ignore[assignment]
self._gpr_index = ax
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can find a way to avoid setting these attributes instead, that'd make this class easier to reason about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, if I understand what you're saying. Do you mean make this whole class less stateful?

I actually started doing this because I thought the same thing. I wanted to make this Grouper be immutable. Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean make this whole class less stateful?

exactly. IIRC i put some effort into this last year and found it difficult.

Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?

Not sure off the top of my head. Depends in large part on if it would be user-facing, in which case it is a harder sell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was worried about user facing changes. What exactly do we consider user-facing? In the docs Grouper is pretty much a namedtuple, and I would argue this should be the public API (just the attributes given in the constructor).

What gets in the way of this are the ax and groups properties, which look like they are supposed to be public. But I'm pretty sure it is a mistake to expose those two, since they depend on the Grouper being bound to an obj, and there isn't a user-facing way to bind a Grouper to an obj. So I feel like any user who is relying on those two properties is already reaching too far into the Grouper and therefore that should be considered undefined.

Thoughts?

self,
) -> tuple[npt.NDArray[np.signedinteger], npt.NDArray[np.intp]]:
# The first returned ndarray may have any signed integer dtype
"""Analogous to core.algorithms.factorize"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this is accurate; the second item returned is ndarray[intp] instead of ArrayLike

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still is the same pattern where uniques.take[codes] is the same as the original data, just with a slightly different type. Or is there some bigger difference I'm missing? Should I just change the docstring to be "Similar idea as core.algorithms.factorize", or you don't think algorithms.factorize should be mentioned at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you don't think algorithms.factorize should be mentioned at all?

yah lets leave this case out

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW if you're interested in a hopefully-hours-sized goal:

In all extant test cases, the second array returned by this method has length(self._get_compressed_codes()[1]) == self.ngroups. Can we confirm that this will always hold? Doing so would allow simplifying group_info.

(It is easy-ish to confirm that this holds for the len(self.groupings) == 1 case, but I haven't confirmed it for the other case)

@@ -263,7 +263,7 @@ class Grouper:
_gpr_index: Index | None
_grouper: Index | None

_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort")
_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort", "dropna")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is _attributes used for anything? if so, this should change some behavior right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure after a bit of poking around it's just used in repr, but I'm not positive (TimeGrouper does some funky stuff with getattr that might somehow be getting to this). I think the only failed tests are ones that test repr()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This being the case, I think I might even rename this to _repr_attributes, what do you think? Because it is quite hard to tell what this is otherwise used for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we use the name _attributes elsewhere so wouldn't change it.

@NickCrews
Copy link
Contributor Author

@jbrockmendel how about I drop the commit where I add dropna, address the other tweaks you suggest, and then we can merge this as a start? And then maybe do those offshoot changes separately?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 12, 2022
NickCrews added a commit to NickCrews/pandas that referenced this pull request Apr 12, 2022
The return value is never used. Don't pretend
that it is.

Ideally we don't even hold this state inside the Grouper, it would make
it easier to reason about. But just do this tiny change for now,
still an improvement.
See pandas-dev#46258 for more details.
NickCrews added a commit to NickCrews/pandas that referenced this pull request Apr 12, 2022
All this changes I think is that now `dropna` is included in the
`__repr__` output.
See pandas-dev#46258 for more discussion.

And move dropna up out of the "generated"
attribute section into the "source" attribute
section in the __init__
@NickCrews NickCrews mentioned this pull request Apr 12, 2022
3 tasks
@NickCrews
Copy link
Contributor Author

Looks like this is a bit more than I want to take on, but I pulled the less-controversial changes into #46754, where hopefully they can still be merged.

@NickCrews NickCrews closed this Apr 12, 2022
NickCrews added a commit to NickCrews/pandas that referenced this pull request Apr 12, 2022
All this changes I think is that now `dropna` is included in the
`__repr__` output.
See pandas-dev#46258 for more discussion.

And move dropna up out of the "generated"
attribute section into the "source" attribute
section in the __init__
NickCrews added a commit to NickCrews/pandas that referenced this pull request Apr 16, 2022
The return value is never used. Don't pretend
that it is.

Ideally we don't even hold this state inside the Grouper, it would make
it easier to reason about. But just do this tiny change for now,
still an improvement.
See pandas-dev#46258 for more details.
NickCrews added a commit to NickCrews/pandas that referenced this pull request Apr 16, 2022
All this changes I think is that now `dropna` is included in the
`__repr__` output.
See pandas-dev#46258 for more discussion.

And move dropna up out of the "generated"
attribute section into the "source" attribute
section in the __init__
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants