Skip to content

REF: Refactor GroupBy a tiny bit #46258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 8 additions & 8 deletions pandas/core/groupby/grouper.py
Original file line number Diff line number Diff line change
Expand Up @@ -263,7 +263,7 @@ class Grouper:
_gpr_index: Index | None
_grouper: Index | None

_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort")
_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort", "dropna")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is _attributes used for anything? if so, this should change some behavior right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure after a bit of poking around it's just used in repr, but I'm not positive (TimeGrouper does some funky stuff with getattr that might somehow be getting to this). I think the only failed tests are ones that test repr()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This being the case, I think I might even rename this to _repr_attributes, what do you think? Because it is quite hard to tell what this is otherwise used for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we use the name _attributes elsewhere so wouldn't change it.


def __new__(cls, *args, **kwargs):
if kwargs.get("freq") is not None:
Expand All @@ -287,6 +287,7 @@ def __init__(
self.freq = freq
self.axis = axis
self.sort = sort
self.dropna = dropna

self.grouper = None
self._gpr_index = None
Expand All @@ -295,7 +296,6 @@ def __init__(
self.binner = None
self._grouper = None
self._indexer = None
self.dropna = dropna

@final
@property
Expand Down Expand Up @@ -339,7 +339,7 @@ def _get_grouper(
return self.binner, self.grouper, self.obj # type: ignore[return-value]

@final
def _set_grouper(self, obj: NDFrame, sort: bool = False):
def _set_grouper(self, obj: NDFrame, sort: bool = False) -> None:
"""
given an object and the specifications, setup the internal grouper
for this particular specification
Expand Down Expand Up @@ -413,7 +413,6 @@ def _set_grouper(self, obj: NDFrame, sort: bool = False):
# "NDFrameT", variable has type "None")
self.obj = obj # type: ignore[assignment]
self._gpr_index = ax
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can find a way to avoid setting these attributes instead, that'd make this class easier to reason about

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, if I understand what you're saying. Do you mean make this whole class less stateful?

I actually started doing this because I thought the same thing. I wanted to make this Grouper be immutable. Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean make this whole class less stateful?

exactly. IIRC i put some effort into this last year and found it difficult.

Make this class, Grouper, not know anything about obj, and then a separate class BoundGrouper, that is the pairing of a Grouper and a specific obj. Does that sound like a reasonable way to organize things?

Not sure off the top of my head. Depends in large part on if it would be user-facing, in which case it is a harder sell.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was worried about user facing changes. What exactly do we consider user-facing? In the docs Grouper is pretty much a namedtuple, and I would argue this should be the public API (just the attributes given in the constructor).

What gets in the way of this are the ax and groups properties, which look like they are supposed to be public. But I'm pretty sure it is a mistake to expose those two, since they depend on the Grouper being bound to an obj, and there isn't a user-facing way to bind a Grouper to an obj. So I feel like any user who is relying on those two properties is already reaching too far into the Grouper and therefore that should be considered undefined.

Thoughts?

return self._gpr_index

@final
@property
Expand Down Expand Up @@ -619,7 +618,7 @@ def codes(self) -> npt.NDArray[np.signedinteger]:
# _codes is set in __init__ for MultiIndex cases
return self._codes

return self._codes_and_uniques[0]
return self._factorize[0]

@cache_readonly
def group_arraylike(self) -> ArrayLike:
Expand All @@ -635,7 +634,7 @@ def group_arraylike(self) -> ArrayLike:
# retain dtype for categories, including unobserved ones
return self.result_index._values

return self._codes_and_uniques[1]
return self._factorize[1]

@cache_readonly
def result_index(self) -> Index:
Expand All @@ -653,11 +652,12 @@ def group_index(self) -> Index:
# _group_index is set in __init__ for MultiIndex cases
return self._group_index

uniques = self._codes_and_uniques[1]
uniques = self._factorize[1]
return Index._with_infer(uniques, name=self.name)

@cache_readonly
def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]:
def _factorize(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]:
"""Analogous to core.algorithms.factorize"""
if self._passed_categorical:
# we make a CategoricalIndex out of the cat grouper
# preserving the categories / ordered attributes
Expand Down
6 changes: 3 additions & 3 deletions pandas/core/groupby/ops.py
Original file line number Diff line number Diff line change
Expand Up @@ -878,7 +878,7 @@ def has_dropped_na(self) -> bool:

@cache_readonly
def group_info(self) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]:
comp_ids, obs_group_ids = self._get_compressed_codes()
comp_ids, obs_group_ids = self._factorize()

ngroups = len(obs_group_ids)
comp_ids = ensure_platform_int(comp_ids)
Expand All @@ -899,10 +899,10 @@ def codes_info(self) -> npt.NDArray[np.intp]:
return ids

@final
def _get_compressed_codes(
def _factorize(
self,
) -> tuple[npt.NDArray[np.signedinteger], npt.NDArray[np.intp]]:
# The first returned ndarray may have any signed integer dtype
"""Analogous to core.algorithms.factorize"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this is accurate; the second item returned is ndarray[intp] instead of ArrayLike

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still is the same pattern where uniques.take[codes] is the same as the original data, just with a slightly different type. Or is there some bigger difference I'm missing? Should I just change the docstring to be "Similar idea as core.algorithms.factorize", or you don't think algorithms.factorize should be mentioned at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you don't think algorithms.factorize should be mentioned at all?

yah lets leave this case out

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW if you're interested in a hopefully-hours-sized goal:

In all extant test cases, the second array returned by this method has length(self._get_compressed_codes()[1]) == self.ngroups. Can we confirm that this will always hold? Doing so would allow simplifying group_info.

(It is easy-ish to confirm that this holds for the len(self.groupings) == 1 case, but I haven't confirmed it for the other case)

if len(self.groupings) > 1:
group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True)
return compress_group_index(group_index, sort=self._sort)
Expand Down
7 changes: 3 additions & 4 deletions pandas/core/resample.py
Original file line number Diff line number Diff line change
Expand Up @@ -1466,14 +1466,14 @@ def __init__(
self,
freq="Min",
closed: Literal["left", "right"] | None = None,
label: str | None = None,
label: Literal["left", "right"] | None = None,
how="mean",
axis=0,
fill_method=None,
limit=None,
loffset=None,
kind: str | None = None,
convention: str | None = None,
convention: Literal["start", "end", "e", "s"] | None = None,
base: int | None = None,
origin: str | TimestampConvertibleTypes = "start_day",
offset: TimedeltaConvertibleTypes | None = None,
Expand Down Expand Up @@ -1519,8 +1519,7 @@ def __init__(
self.label = label
self.kind = kind

self.convention = convention or "E"
self.convention = self.convention.lower()
self.convention = convention or "e"

self.how = how
self.fill_method = fill_method
Expand Down