-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
REF: Refactor GroupBy a tiny bit #46258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -263,7 +263,7 @@ class Grouper: | |
_gpr_index: Index | None | ||
_grouper: Index | None | ||
|
||
_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort") | ||
_attributes: tuple[str, ...] = ("key", "level", "freq", "axis", "sort", "dropna") | ||
|
||
def __new__(cls, *args, **kwargs): | ||
if kwargs.get("freq") is not None: | ||
|
@@ -287,6 +287,7 @@ def __init__( | |
self.freq = freq | ||
self.axis = axis | ||
self.sort = sort | ||
self.dropna = dropna | ||
|
||
self.grouper = None | ||
self._gpr_index = None | ||
|
@@ -295,7 +296,6 @@ def __init__( | |
self.binner = None | ||
self._grouper = None | ||
self._indexer = None | ||
self.dropna = dropna | ||
|
||
@final | ||
@property | ||
|
@@ -339,7 +339,7 @@ def _get_grouper( | |
return self.binner, self.grouper, self.obj # type: ignore[return-value] | ||
|
||
@final | ||
def _set_grouper(self, obj: NDFrame, sort: bool = False): | ||
def _set_grouper(self, obj: NDFrame, sort: bool = False) -> None: | ||
""" | ||
given an object and the specifications, setup the internal grouper | ||
for this particular specification | ||
|
@@ -413,7 +413,6 @@ def _set_grouper(self, obj: NDFrame, sort: bool = False): | |
# "NDFrameT", variable has type "None") | ||
self.obj = obj # type: ignore[assignment] | ||
self._gpr_index = ax | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if you can find a way to avoid setting these attributes instead, that'd make this class easier to reason about There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, if I understand what you're saying. Do you mean make this whole class less stateful? I actually started doing this because I thought the same thing. I wanted to make this Grouper be immutable. Make this class, Grouper, not know anything about There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
exactly. IIRC i put some effort into this last year and found it difficult.
Not sure off the top of my head. Depends in large part on if it would be user-facing, in which case it is a harder sell. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I was worried about user facing changes. What exactly do we consider user-facing? In the docs Grouper is pretty much a namedtuple, and I would argue this should be the public API (just the attributes given in the constructor). What gets in the way of this are the Thoughts? |
||
return self._gpr_index | ||
|
||
@final | ||
@property | ||
|
@@ -619,7 +618,7 @@ def codes(self) -> npt.NDArray[np.signedinteger]: | |
# _codes is set in __init__ for MultiIndex cases | ||
return self._codes | ||
|
||
return self._codes_and_uniques[0] | ||
return self._factorize[0] | ||
|
||
@cache_readonly | ||
def group_arraylike(self) -> ArrayLike: | ||
|
@@ -635,7 +634,7 @@ def group_arraylike(self) -> ArrayLike: | |
# retain dtype for categories, including unobserved ones | ||
return self.result_index._values | ||
|
||
return self._codes_and_uniques[1] | ||
return self._factorize[1] | ||
|
||
@cache_readonly | ||
def result_index(self) -> Index: | ||
|
@@ -653,11 +652,12 @@ def group_index(self) -> Index: | |
# _group_index is set in __init__ for MultiIndex cases | ||
return self._group_index | ||
|
||
uniques = self._codes_and_uniques[1] | ||
uniques = self._factorize[1] | ||
return Index._with_infer(uniques, name=self.name) | ||
|
||
@cache_readonly | ||
def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]: | ||
def _factorize(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]: | ||
"""Analogous to core.algorithms.factorize""" | ||
if self._passed_categorical: | ||
# we make a CategoricalIndex out of the cat grouper | ||
# preserving the categories / ordered attributes | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -878,7 +878,7 @@ def has_dropped_na(self) -> bool: | |
|
||
@cache_readonly | ||
def group_info(self) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp], int]: | ||
comp_ids, obs_group_ids = self._get_compressed_codes() | ||
comp_ids, obs_group_ids = self._factorize() | ||
|
||
ngroups = len(obs_group_ids) | ||
comp_ids = ensure_platform_int(comp_ids) | ||
|
@@ -899,10 +899,10 @@ def codes_info(self) -> npt.NDArray[np.intp]: | |
return ids | ||
|
||
@final | ||
def _get_compressed_codes( | ||
def _factorize( | ||
self, | ||
) -> tuple[npt.NDArray[np.signedinteger], npt.NDArray[np.intp]]: | ||
# The first returned ndarray may have any signed integer dtype | ||
"""Analogous to core.algorithms.factorize""" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. not sure this is accurate; the second item returned is ndarray[intp] instead of ArrayLike There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it still is the same pattern where There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
yah lets leave this case out There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. BTW if you're interested in a hopefully-hours-sized goal: In all extant test cases, the second array returned by this method has (It is easy-ish to confirm that this holds for the |
||
if len(self.groupings) > 1: | ||
group_index = get_group_index(self.codes, self.shape, sort=True, xnull=True) | ||
return compress_group_index(group_index, sort=self._sort) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is _attributes used for anything? if so, this should change some behavior right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure after a bit of poking around it's just used in repr, but I'm not positive (TimeGrouper does some funky stuff with getattr that might somehow be getting to this). I think the only failed tests are ones that test repr()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This being the case, I think I might even rename this to
_repr_attributes
, what do you think? Because it is quite hard to tell what this is otherwise used for.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we use the name
_attributes
elsewhere so wouldn't change it.