Skip to content

Fix groupby.apply() and sum(axis=1) #168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 25, 2022
Merged

Conversation

Dr-Irv
Copy link
Collaborator

@Dr-Irv Dr-Irv commented Jul 25, 2022

This revealed an issue with DataFrame.sum(axis=1) and we probably need to fix things like DataFrame.mean(axis=1), etc., but I think we should just create an issue for that.

@Dr-Irv Dr-Irv requested a review from twoertwein July 25, 2022 00:14
@@ -1902,14 +1902,24 @@ class DataFrame(NDFrame, OpsMixin):
fill_value: float | None = ...,
) -> DataFrame: ...
@overload
def sum(
self,
axis: Literal[1, "columns"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the 3rd overload the same as this one but has a broader axis annotation? Is it possible to simply move the third overload up?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right. Changed in next commit

@danielroseman
Copy link
Contributor

Thanks for looking into this so quickly. Unfortunately I don't think this is quite right.

df.sum() with no parameters returns a Series. The summean function here returns a float, not a series - I'm not sure why this isn't showing a type error in your test. (In fact using latest head, reveal_type(df.sum()) gives Any, which is a bit strange.)

But returning a scalar is the only situation in which apply would return a Series. In all other cases, including using the built-in sum (which maps to np.sum internally), apply returns a DataFrame:

In [2]: df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

In [3]: df.groupby('col1').apply(sum)
Out[3]:
      col1  col2
col1
1        1     4
2        2     5
3        3     6


In [4]: df.sum().mean()   # returns float, not Series
Out[4]: 10.5

In [5]: df.gropuby('col1').apply(lambda x: x.sum().mean())
Out[5]:
col1
1    2.5
2    3.5
3    4.5
dtype: float64

So I think there are three cases here:

  1. func returns a DataFrame or a Series: apply(func) returns a DataFrame
  2. func is the built-in sum, min or max [ie the keys of pandas.core.common._builtin_table]: apply(func) returns a DataFrame
  3. func returns a scalar (actually, anything that can go into a single DF element, including eg a string, list, tuple or dict): apply(func) returns a Series

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Jul 25, 2022

So I think there are three cases here:

  1. func returns a DataFrame or a Series: apply(func) returns a DataFrame
  2. func is the built-in sum, min or max [ie the keys of pandas.core.common._builtin_table]: apply(func) returns a DataFrame
  3. func returns a scalar (actually, anything that can go into a single DF element, including eg a string, list, tuple or dict): apply(func) returns a Series

I'm not sure I got all the cases right. Would need better tests. Current version passes the example you initially supplied.

One thing to be aware of is that using a lambda gives you a Callable that is not typed. So I changed the test to type the callable. It's inconvenient from a coding standpoint, but there is no way for a type checker to know the arguments and return type of a lambda.

self, func: Callable[[DataFrame], Series | Scalar], *args, **kwargs
) -> Series: ...
@overload
def apply( # type: ignore[misc]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced my local version of generic.pyi with this one and it seems that neither pyright nor mypy need these two ignores (I would have thought both of them need ignores, a DataFrame is Iterable).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I remove both of these ignores, I get errors from both pyright and mypy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, not sure what I messed up. Unless you add a test covering this second overload, I would be tempted to remove it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, not sure what I messed up. Unless you add a test covering this second overload, I would be tempted to remove it.

There is a test already. That's why I needed to add the second overload.

df7: pd.DataFrame = df.groupby(by="col1").apply(sum)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'm not a fan of this overload but it seems to work :)

@twoertwein twoertwein merged commit cdec539 into pandas-dev:main Jul 25, 2022
@twoertwein
Copy link
Member

Thanks @Dr-Irv

@Dr-Irv Dr-Irv deleted the gbapply branch December 28, 2022 15:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] DataFrameGroupBy.apply can return Series
3 participants