groupby.iter() fix types #148

Dr-Irv · 2022-07-18T00:26:23Z

Closes Unpacking group keys is refused in multi-columns groups #142
Tests added:
- test_frame.py:test_groupby_result()

The groupby internal class hierarchies are not what is currently in pandas. They were changed in v1.2, but the type stubs were initially created on top of pandas 1.1. Will create an issue about that.

twoertwein · 2022-07-18T00:40:35Z

The groupby internal class hierarchies are not what is currently in pandas. They were changed in v1.2, but the type stubs were initially created on top of pandas 1.1. Will create an issue about that.

stubtest will be really useful to (slowly) align the type stubs with the implementation structure.

twoertwein · 2022-07-18T00:49:04Z

tests/test_frame.py

+def test_groupby_result() -> None:
+    # GH 142
+    df = pd.DataFrame({"a": [0, 1, 2], "b": [4, 5, 6], "c": [7, 8, 9]})
+    lresult = [(cast(Tuple[int, int], k), g) for k, g in df.groupby(["a", "b"])]


I would be very cautious adding casts to the type tests, if we add them we might in end up testing (only) the runtime API of pandas

index, value = next(assert_type(df.groupby(["a", "b"]), Iterator[tuple[Hashable, DataFrame]])) assert isinstance(index.__hash__(), int) # if possible isinstance(index, Hashable) assert isinstance(value, DataFrame)

I did something like this in the next commit. cast has been removed. It's a bit tricky to test __iter__()

Changed.

Also did a scheme to address the original issue.

twoertwein · 2022-07-18T02:18:09Z

tests/test_frame.py

+def test_groupby_result() -> None:
+    # GH 142
+    df = pd.DataFrame({"a": [0, 1, 2], "b": [4, 5, 6], "c": [7, 8, 9]})
+    index, value = next(df.groupby(["a", "b"]).__iter__())


One small change: a two-element list can also be unpacked into two variables.

I tested the following small change on your branch:

Iterable, + Iterator, List, ... - index, value = next(df.groupby(["a", "b"]).__iter__()) + iterator = df.groupby(["a", "b"]).__iter__() + assert_type(iterator, Iterator[Tuple[Tuple, pd.DataFrame]]) + index, value = next(iterator)

One small change: a two-element list can also be unpacked into two variables.

I tested the following small change on your branch:

Iterable, + Iterator, List, ... - index, value = next(df.groupby(["a", "b"]).__iter__()) + iterator = df.groupby(["a", "b"]).__iter__() + assert_type(iterator, Iterator[Tuple[Tuple, pd.DataFrame]]) + index, value = next(iterator)

OK, that is in the next commit.

twoertwein · 2022-07-18T02:22:39Z

pandas-stubs/core/groupby/generic.pyi

+    def __iter__(self) -> Iterator[Tuple[Scalar, Series]]: ...
+
+class SeriesGroupByNonScalar(SeriesGroupBy):
+    def __iter__(self) -> Iterator[Tuple[Tuple, Series]]: ...


No strong opinion: I see the point from a user-perspective, but I think deviating too much from the pandas implementation will make it challenging to use stubtest later (and maybe also maintaining the stubs).

should probably make these classes that are only in the stubs but not in pandas private

No strong opinion

I think I develop a slightly stronger opinion. I would prefer not to introduce too many non-pandas classes: I believe that this will keep pandas-stubs more maintainable in the long-run.

Personally, I prefer the previous state of the PR where you returned tuple[Hashable, NDFrameT] (or making SeriesGroupBy generic, might be more invasive).

I think I develop a slightly stronger opinion. I would prefer not to introduce too many non-pandas classes: I believe that this will keep pandas-stubs more maintainable in the long-run.

Personally, I prefer the previous state of the PR where you returned tuple[Hashable, NDFrameT] (or making SeriesGroupBy generic, might be more invasive).

This is a tough call. There are lots of things in pandas that are designed without static typing taken into consideration. In this case, we want to differentiate when someone writes df.groupby(["a", "b"]) versus df.groupby("a") so that __iter__() returns a different result. The implementation has __iter__() in a base class, so there isn't an easy way from a static typing perspective to differentiate between the different kinds of arguments of groupby() when you get down to the base class.

My philosophy on the stubs has been to make it useful for end users with the most common ways of using pandas. To do that, we have to deviate from the implementation. A good example of this is Series.dt, where the accessors are all dynamically hooked in, but we have to create static declarations for each of the accessors. I think this groupby() example is similar.

I tried some experiments making DataFrameGroupBy generic, but the types that are then returned by __iter__() become too wide. Now we just make the return types either Iterator[Tuple[Tuple, DataFrame]] or Iterator[Tuple[Scalar, DataFrame]], which covers the majority of use cases.

My suggestion is you accept this PR as it is - it addresses the issue reported, and we could create a new issue to see if we can figure out a way to make a generic implementation work.

should probably make these classes that are only in the stubs but not in pandas private

Did that in the next commit

While I prefer pandas and pandas-stubs to be aligned (I would like if pandas and pandas-stub could in the future converge and only be different in challenging cases: __getitem__, Timestamp, ... - might be a long list) I'm fine with introducing classes/variables that are not in pandas as long as they are private (or some other prefix).

twoertwein · 2022-07-18T12:53:33Z

Thanks @Dr-Irv !

groupby.__iter__() fix types

1ab02f5

Dr-Irv requested a review from twoertwein July 18, 2022 00:26

Dr-Irv mentioned this pull request Jul 18, 2022

Unpacking group keys is refused in multi-columns groups #142

Closed

Dr-Irv mentioned this pull request Jul 18, 2022

CLEAN: Fix up Groupby classes #149

Closed

twoertwein reviewed Jul 18, 2022

View reviewed changes

Dr-Irv added 3 commits July 17, 2022 20:54

WIP: try splitting by label and otherwise

62fd358

fix tests to avoid cast

167ba50

differentiate by list or scalar

383e1bf

twoertwein reviewed Jul 18, 2022

View reviewed changes

make new classes private. Change tests to test iterator and next

981bf9a

twoertwein merged commit a6dd774 into pandas-dev:main Jul 18, 2022

Dr-Irv deleted the fixgroupby branch December 28, 2022 15:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

groupby.iter() fix types #148

groupby.iter() fix types #148

Uh oh!

Dr-Irv commented Jul 18, 2022

Uh oh!

twoertwein commented Jul 18, 2022

Uh oh!

twoertwein Jul 18, 2022 •

edited

Loading

Uh oh!

Dr-Irv Jul 18, 2022

Uh oh!

Dr-Irv Jul 18, 2022

Uh oh!

twoertwein Jul 18, 2022

Uh oh!

Dr-Irv Jul 18, 2022

Uh oh!

twoertwein Jul 18, 2022

Uh oh!

twoertwein Jul 18, 2022

Uh oh!

twoertwein Jul 18, 2022

Uh oh!

Dr-Irv Jul 18, 2022

Uh oh!

Dr-Irv Jul 18, 2022

Uh oh!

twoertwein Jul 18, 2022

Uh oh!

twoertwein commented Jul 18, 2022

Uh oh!

Uh oh!

Uh oh!

groupby.__iter__() fix types #148

groupby.__iter__() fix types #148

Uh oh!

Conversation

Dr-Irv commented Jul 18, 2022

Uh oh!

twoertwein commented Jul 18, 2022

Uh oh!

twoertwein Jul 18, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

twoertwein commented Jul 18, 2022

Uh oh!

Uh oh!

groupby.iter() fix types #148

groupby.iter() fix types #148

twoertwein Jul 18, 2022 •

edited

Loading