-
-
Notifications
You must be signed in to change notification settings - Fork 143
groupby.__iter__() fix types #148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
stubtest will be really useful to (slowly) align the type stubs with the implementation structure. |
tests/test_frame.py
Outdated
def test_groupby_result() -> None: | ||
# GH 142 | ||
df = pd.DataFrame({"a": [0, 1, 2], "b": [4, 5, 6], "c": [7, 8, 9]}) | ||
lresult = [(cast(Tuple[int, int], k), g) for k, g in df.groupby(["a", "b"])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be very cautious adding cast
s to the type tests, if we add them we might in end up testing (only) the runtime API of pandas
index, value = next(assert_type(df.groupby(["a", "b"]), Iterator[tuple[Hashable, DataFrame]]))
assert isinstance(index.__hash__(), int) # if possible isinstance(index, Hashable)
assert isinstance(value, DataFrame)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did something like this in the next commit. cast
has been removed. It's a bit tricky to test __iter__()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
Also did a scheme to address the original issue.
tests/test_frame.py
Outdated
def test_groupby_result() -> None: | ||
# GH 142 | ||
df = pd.DataFrame({"a": [0, 1, 2], "b": [4, 5, 6], "c": [7, 8, 9]}) | ||
index, value = next(df.groupby(["a", "b"]).__iter__()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small change: a two-element list can also be unpacked into two variables.
I tested the following small change on your branch:
Iterable,
+ Iterator,
List,
...
- index, value = next(df.groupby(["a", "b"]).__iter__())
+ iterator = df.groupby(["a", "b"]).__iter__()
+ assert_type(iterator, Iterator[Tuple[Tuple, pd.DataFrame]])
+ index, value = next(iterator)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small change: a two-element list can also be unpacked into two variables.
I tested the following small change on your branch:
Iterable, + Iterator, List, ... - index, value = next(df.groupby(["a", "b"]).__iter__()) + iterator = df.groupby(["a", "b"]).__iter__() + assert_type(iterator, Iterator[Tuple[Tuple, pd.DataFrame]]) + index, value = next(iterator)
OK, that is in the next commit.
def __iter__(self) -> Iterator[Tuple[Scalar, Series]]: ... | ||
|
||
class SeriesGroupByNonScalar(SeriesGroupBy): | ||
def __iter__(self) -> Iterator[Tuple[Tuple, Series]]: ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion: I see the point from a user-perspective, but I think deviating too much from the pandas implementation will make it challenging to use stubtest later (and maybe also maintaining the stubs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably make these classes that are only in the stubs but not in pandas private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion
I think I develop a slightly stronger opinion. I would prefer not to introduce too many non-pandas classes: I believe that this will keep pandas-stubs more maintainable in the long-run.
Personally, I prefer the previous state of the PR where you returned tuple[Hashable, NDFrameT]
(or making SeriesGroupBy generic, might be more invasive).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I develop a slightly stronger opinion. I would prefer not to introduce too many non-pandas classes: I believe that this will keep pandas-stubs more maintainable in the long-run.
Personally, I prefer the previous state of the PR where you returned
tuple[Hashable, NDFrameT]
(or making SeriesGroupBy generic, might be more invasive).
This is a tough call. There are lots of things in pandas that are designed without static typing taken into consideration. In this case, we want to differentiate when someone writes df.groupby(["a", "b"])
versus df.groupby("a")
so that __iter__()
returns a different result. The implementation has __iter__()
in a base class, so there isn't an easy way from a static typing perspective to differentiate between the different kinds of arguments of groupby()
when you get down to the base class.
My philosophy on the stubs has been to make it useful for end users with the most common ways of using pandas. To do that, we have to deviate from the implementation. A good example of this is Series.dt
, where the accessors are all dynamically hooked in, but we have to create static declarations for each of the accessors. I think this groupby()
example is similar.
I tried some experiments making DataFrameGroupBy
generic, but the types that are then returned by __iter__()
become too wide. Now we just make the return types either Iterator[Tuple[Tuple, DataFrame]]
or Iterator[Tuple[Scalar, DataFrame]]
, which covers the majority of use cases.
My suggestion is you accept this PR as it is - it addresses the issue reported, and we could create a new issue to see if we can figure out a way to make a generic implementation work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should probably make these classes that are only in the stubs but not in pandas private
Did that in the next commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I prefer pandas and pandas-stubs to be aligned (I would like if pandas and pandas-stub could in the future converge and only be different in challenging cases: __getitem__, Timestamp, ...
- might be a long list) I'm fine with introducing classes/variables that are not in pandas as long as they are private (or some other prefix).
Thanks @Dr-Irv ! |
test_frame.py:test_groupby_result()
The
groupby
internal class hierarchies are not what is currently in pandas. They were changed in v1.2, but the type stubs were initially created on top of pandas 1.1. Will create an issue about that.