Skip to content

groupby.__iter__() fix types #148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 18, 2022
Merged

groupby.__iter__() fix types #148

merged 5 commits into from
Jul 18, 2022

Conversation

Dr-Irv
Copy link
Collaborator

@Dr-Irv Dr-Irv commented Jul 18, 2022

The groupby internal class hierarchies are not what is currently in pandas. They were changed in v1.2, but the type stubs were initially created on top of pandas 1.1. Will create an issue about that.

@twoertwein
Copy link
Member

The groupby internal class hierarchies are not what is currently in pandas. They were changed in v1.2, but the type stubs were initially created on top of pandas 1.1. Will create an issue about that.

stubtest will be really useful to (slowly) align the type stubs with the implementation structure.

def test_groupby_result() -> None:
# GH 142
df = pd.DataFrame({"a": [0, 1, 2], "b": [4, 5, 6], "c": [7, 8, 9]})
lresult = [(cast(Tuple[int, int], k), g) for k, g in df.groupby(["a", "b"])]
Copy link
Member

@twoertwein twoertwein Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be very cautious adding casts to the type tests, if we add them we might in end up testing (only) the runtime API of pandas

index, value = next(assert_type(df.groupby(["a", "b"]), Iterator[tuple[Hashable, DataFrame]]))
assert isinstance(index.__hash__(), int)  # if possible isinstance(index, Hashable) 
assert isinstance(value, DataFrame)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did something like this in the next commit. cast has been removed. It's a bit tricky to test __iter__()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

Also did a scheme to address the original issue.

def test_groupby_result() -> None:
# GH 142
df = pd.DataFrame({"a": [0, 1, 2], "b": [4, 5, 6], "c": [7, 8, 9]})
index, value = next(df.groupby(["a", "b"]).__iter__())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change: a two-element list can also be unpacked into two variables.

I tested the following small change on your branch:

     Iterable,
+    Iterator,
     List,

...
-    index, value = next(df.groupby(["a", "b"]).__iter__())
+    iterator = df.groupby(["a", "b"]).__iter__()
+    assert_type(iterator, Iterator[Tuple[Tuple, pd.DataFrame]])
+    index, value = next(iterator)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change: a two-element list can also be unpacked into two variables.

I tested the following small change on your branch:

     Iterable,
+    Iterator,
     List,

...
-    index, value = next(df.groupby(["a", "b"]).__iter__())
+    iterator = df.groupby(["a", "b"]).__iter__()
+    assert_type(iterator, Iterator[Tuple[Tuple, pd.DataFrame]])
+    index, value = next(iterator)

OK, that is in the next commit.

def __iter__(self) -> Iterator[Tuple[Scalar, Series]]: ...

class SeriesGroupByNonScalar(SeriesGroupBy):
def __iter__(self) -> Iterator[Tuple[Tuple, Series]]: ...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion: I see the point from a user-perspective, but I think deviating too much from the pandas implementation will make it challenging to use stubtest later (and maybe also maintaining the stubs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably make these classes that are only in the stubs but not in pandas private

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong opinion

I think I develop a slightly stronger opinion. I would prefer not to introduce too many non-pandas classes: I believe that this will keep pandas-stubs more maintainable in the long-run.

Personally, I prefer the previous state of the PR where you returned tuple[Hashable, NDFrameT] (or making SeriesGroupBy generic, might be more invasive).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I develop a slightly stronger opinion. I would prefer not to introduce too many non-pandas classes: I believe that this will keep pandas-stubs more maintainable in the long-run.

Personally, I prefer the previous state of the PR where you returned tuple[Hashable, NDFrameT] (or making SeriesGroupBy generic, might be more invasive).

This is a tough call. There are lots of things in pandas that are designed without static typing taken into consideration. In this case, we want to differentiate when someone writes df.groupby(["a", "b"]) versus df.groupby("a") so that __iter__() returns a different result. The implementation has __iter__() in a base class, so there isn't an easy way from a static typing perspective to differentiate between the different kinds of arguments of groupby() when you get down to the base class.

My philosophy on the stubs has been to make it useful for end users with the most common ways of using pandas. To do that, we have to deviate from the implementation. A good example of this is Series.dt, where the accessors are all dynamically hooked in, but we have to create static declarations for each of the accessors. I think this groupby() example is similar.

I tried some experiments making DataFrameGroupBy generic, but the types that are then returned by __iter__() become too wide. Now we just make the return types either Iterator[Tuple[Tuple, DataFrame]] or Iterator[Tuple[Scalar, DataFrame]], which covers the majority of use cases.

My suggestion is you accept this PR as it is - it addresses the issue reported, and we could create a new issue to see if we can figure out a way to make a generic implementation work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably make these classes that are only in the stubs but not in pandas private

Did that in the next commit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I prefer pandas and pandas-stubs to be aligned (I would like if pandas and pandas-stub could in the future converge and only be different in challenging cases: __getitem__, Timestamp, ... - might be a long list) I'm fine with introducing classes/variables that are not in pandas as long as they are private (or some other prefix).

@twoertwein twoertwein merged commit a6dd774 into pandas-dev:main Jul 18, 2022
@twoertwein
Copy link
Member

Thanks @Dr-Irv !

@Dr-Irv Dr-Irv deleted the fixgroupby branch December 28, 2022 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unpacking group keys is refused in multi-columns groups
2 participants