Fix groupby.apply() and sum(axis=1) #168

Dr-Irv · 2022-07-25T00:14:49Z

Closes [BUG] DataFrameGroupBy.apply can return Series #167
Tests added:
- test_frame.py:test_groupby_apply()

This revealed an issue with DataFrame.sum(axis=1). We probably need to fix similar methods such as DataFrame.mean(axis=1) as well, but I think we should just create a separate issue for that.

twoertwein · 2022-07-25T00:28:31Z

pandas-stubs/core/frame.pyi

@@ -1902,14 +1902,24 @@ class DataFrame(NDFrame, OpsMixin):
        fill_value: float | None = ...,
    ) -> DataFrame: ...
    @overload
+    def sum(
+        self,
+        axis: Literal[1, "columns"],


Isn't the 3rd overload the same as this one, but with a broader axis annotation? Is it possible to simply move the third overload up?

Yes, that's right. Changed in the next commit.
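
A minimal sketch of the ordering point above, assuming hypothetical stand-in classes (`Frame`, `Column`) and a hypothetical `total` method rather than the real pandas-stubs signatures. Type checkers try overloads from top to bottom and use the first one that matches, so a single broad `axis` overload placed first can cover `axis=1` without a separate `Literal[1, "columns"]` overload. The `axis=None` branch returning a float is also an assumption made only for illustration.

```python
from __future__ import annotations

from typing import Literal, overload


class Column:
    """Hypothetical stand-in for a one-dimensional result."""

    def __init__(self, values: list[float]) -> None:
        self.values = values


class Frame:
    """Hypothetical stand-in for a two-dimensional table (a list of rows)."""

    def __init__(self, rows: list[list[float]]) -> None:
        self.rows = rows

    # Overloads are tried top to bottom; the first match decides the type.
    # The broad axis overload comes first, so axis=1 resolves to Column.
    @overload
    def total(self, axis: Literal[0, 1, "index", "columns"] = ...) -> Column: ...
    @overload
    def total(self, axis: None) -> float: ...

    def total(self, axis=0):
        if axis is None:
            # Sum over every cell and return a plain scalar.
            return float(sum(sum(row) for row in self.rows))
        if axis in (0, "index"):
            # Sum down each column.
            return Column([float(sum(col)) for col in zip(*self.rows)])
        # Sum across each row.
        return Column([float(sum(row)) for row in self.rows])


frame = Frame([[1, 4], [2, 5], [3, 6]])
row_totals = frame.total(axis=1)      # a checker infers Column here
grand_total = frame.total(axis=None)  # a checker infers float here
```

Running `reveal_type` on `row_totals` and `grand_total` under mypy or pyright is one way to confirm which overload was selected.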

danielroseman · 2022-07-25T09:10:27Z

Thanks for looking into this so quickly. Unfortunately I don't think this is quite right.

df.sum() with no parameters returns a Series. The summean function here returns a float, not a Series - I'm not sure why this isn't showing a type error in your test. (In fact, using the latest head, reveal_type(df.sum()) gives Any, which is a bit strange.)

But returning a scalar is the only situation in which apply would return a Series. In all other cases, including using the built-in sum (which maps to np.sum internally), apply returns a DataFrame:

In [2]: df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

In [3]: df.groupby('col1').apply(sum)
Out[3]:
      col1  col2
col1
1        1     4
2        2     5
3        3     6


In [4]: df.sum().mean()   # returns float, not Series
Out[4]: 10.5

In [5]: df.groupby('col1').apply(lambda x: x.sum().mean())
Out[5]:
col1
1    2.5
2    3.5
3    4.5
dtype: float64

So I think there are three cases here:

- func returns a DataFrame or a Series: apply(func) returns a DataFrame
- func is the built-in sum, min or max [i.e. the keys of pandas.core.common._builtin_table]: apply(func) returns a DataFrame
- func returns a scalar (actually, anything that can go into a single DF element, including e.g. a string, list, tuple or dict): apply(func) returns a Series

Dr-Irv · 2022-07-25T12:43:20Z

> So I think there are three cases here:
>
> - func returns a DataFrame or a Series: apply(func) returns a DataFrame
> - func is the built-in sum, min or max [i.e. the keys of pandas.core.common._builtin_table]: apply(func) returns a DataFrame
> - func returns a scalar (actually, anything that can go into a single DF element, including e.g. a string, list, tuple or dict): apply(func) returns a Series

I'm not sure I got all the cases right. Would need better tests. Current version passes the example you initially supplied.

One thing to be aware of is that using a lambda gives you a Callable that is not typed. So I changed the test to type the callable. It's inconvenient from a coding standpoint, but there is no way for a type checker to know the arguments and return type of a lambda.
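
A minimal sketch of the typed-callable approach, assuming hypothetical stand-in classes (`Table`, `Grouped`) and a hypothetical function `mean_of_sums`. None of these are the pandas-stubs definitions. The overloads key on the callable's declared return type, and the annotated function gives the checker an explicit return type to match against them.

```python
from __future__ import annotations

from typing import Callable, Union, overload

Scalar = Union[int, float, str]


class Table:
    """Hypothetical stand-in for a DataFrame: column name -> values."""

    def __init__(self, data: dict[str, list[float]]) -> None:
        self.data = data


class Grouped:
    """Hypothetical stand-in for a grouped table: group key -> Table."""

    def __init__(self, groups: dict[int, Table]) -> None:
        self.groups = groups

    # The callable's return annotation selects the overload.
    @overload
    def apply(self, func: Callable[[Table], Table]) -> list[Table]: ...
    @overload
    def apply(self, func: Callable[[Table], Scalar]) -> dict[int, Scalar]: ...

    def apply(self, func):
        results = {key: func(table) for key, table in self.groups.items()}
        # Table results stay table-like; scalar results become one value per group.
        if all(isinstance(r, Table) for r in results.values()):
            return list(results.values())
        return results


def mean_of_sums(table: Table) -> float:
    """Annotated callable: the checker can read the float return type."""
    sums = [sum(values) for values in table.data.values()]
    return sum(sums) / len(sums)


grouped = Grouped({
    1: Table({"col1": [1], "col2": [4]}),
    2: Table({"col1": [2], "col2": [5]}),
    3: Table({"col1": [3], "col2": [6]}),
})
per_group = grouped.apply(mean_of_sums)  # a checker infers dict[int, Scalar]
```

The per-group values here match the `Out[5]` output quoted earlier in the thread (2.5, 3.5, 4.5), but only because the toy data mirrors that example.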


twoertwein · 2022-07-25T15:26:19Z

pandas-stubs/core/groupby/generic.pyi

+        self, func: Callable[[DataFrame], Series | Scalar], *args, **kwargs
+    ) -> Series: ...
+    @overload
+    def apply(  # type: ignore[misc]


I replaced my local version of generic.pyi with this one, and it seems that neither pyright nor mypy needs these two ignores (I would have thought both of them need ignores, since a DataFrame is Iterable).

If I remove both of these ignores, I get errors from both pyright and mypy.

You are right, not sure what I messed up. Unless you add a test covering this second overload, I would be tempted to remove it.

> You are right, not sure what I messed up. Unless you add a test covering this second overload, I would be tempted to remove it.

There is a test already. That's why I needed to add the second overload.

pandas-stubs/tests/test_frame.py

Line 558 in 77b357f

df7: pd.DataFrame = df.groupby(by="col1").apply(sum)

Okay, I'm not a fan of this overload but it seems to work :)

twoertwein · 2022-07-25T15:53:06Z

Thanks @Dr-Irv

Fix groupby.apply() and sum(axis=1)

b1d2e30

Dr-Irv requested a review from twoertwein July 25, 2022 00:14

twoertwein reviewed Jul 25, 2022

reorder sum overloads, change apply overloads

2a6b37e

twoertwein reviewed Jul 25, 2022

make the catchall return a Union

999f147

twoertwein reviewed Jul 25, 2022

twoertwein reviewed Jul 25, 2022

twoertwein merged commit cdec539 into pandas-dev:main Jul 25, 2022

danielroseman mentioned this pull request Aug 1, 2022

More specific types for GroupBy.apply. #177

Merged

2 tasks

Dr-Irv deleted the gbapply branch December 28, 2022 15:18
