Skip to content

REF: avoid unnecessary argument to groupby numba funcs #43487

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 10, 2021

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

cc @mroeschke im pretty sure this is slightly more correct than the status quo, but i could use some help in coming up with test cases. Two main cases to cover:

  1. res = df.groupby(more_than_one_column).transform(..., engine="numba", i_dont_know_off_the_top_the_calling_convention)

I'm pretty sure in this case res.index.names will be wrong in master.

  1. Cases with BinGrouper where there is a length mismatch (also an issue with non-numba paths)
import pandas as pd

df = pd.DataFrame(
        {
            "Buyer": "Carl Carl Carl Carl Joe Carl".split(),
            "Quantity": [18, 3, 5, 1, 9, 3],
            "Date": [
                pd.Timestamp(2013, 9, 1, 13, 0),
                pd.Timestamp(2013, 9, 1, 13, 5),
                pd.Timestamp(2013, 10, 1, 20, 0),
                pd.Timestamp(2013, 10, 3, 10, 0),
                pd.NaT,
                pd.Timestamp(2013, 9, 2, 14, 0),
            ],
        }
    )


gb = df.groupby(pd.Grouper(key="Date", freq="5D"))

gb.transform(lambda x: x, engine="numba", ...)  # <- pretty sure this raises

Or are these not reachable for some reason?

@mroeschke
Copy link
Member

For 1, these two examples are equivalent, but you mentioned the issue is in the index.names?

In [7]: def f(values, index):
   ...:     return values + 1
   ...:

In [8]: df = pd.DataFrame(np.eye(3))

In [9]: df.groupby([0, 1]).transform(f, engine="numba")
Out[9]:
     2
0  1.0
1  1.0
2  2.0

In [10]: df.groupby([0, 1]).transform(lambda x: x + 1)
Out[10]:
     2
0  1.0
1  1.0
2  2.0

For 2, that example should not work because the groups contain string data (also while debuging the timestamp data is also brought into the group). The numba engine can only work on numeric data.

@jbrockmendel
Copy link
Member Author

For 1, these two examples are equivalent, but you mentioned the issue is in the index.names?

Looks like is just for aggregate, not transform. My bad.

For 2, that example should not work because the groups contain string data (also while debuging the timestamp data is also brought into the group). The numba engine can only work on numeric data.

OK. If we just use the Quantity column:

>>> gb["Quantity"].aggregate(lambda values, index: values[0], engine="numba")
Date
NaT           18.0
2013-09-01     9.0
2013-09-06     9.0
2013-09-11     9.0
2013-09-16     9.0
2013-09-21     9.0
2013-09-26     5.0
2013-10-01     9.0
Name: Quantity, dtype: float64

NaT is in the resulting index, which I think is wrong, at least doesn't match the behavior of essentially all the non-numba aggregations. (then again, that aggregation raises in this PR, which isn't a great sign).

Can you suggest an aggregation that we can do on gb["Quantity"] that we can do with and without engine="numba" and expect the answers to match?

@mroeschke
Copy link
Member

So the thought is that NaT should be dropped since it was a "null" group? If so, probably a bug in the numba engine.

I think for this particular df and gb["Quantity"], the numba and non-numba engine probably cannot be equal because of the bug above.

e.g. these are almost similar

In [10]: gb["Quantity"].aggregate(lambda values, index: np.nanmean(values), engine="numba")
Out[10]:
Date
NaT           8.0
2013-09-01    NaN
2013-09-06    NaN
2013-09-11    NaN
2013-09-16    NaN
2013-09-21    NaN
2013-09-26    3.0
2013-10-01    NaN
Name: Quantity, dtype: float64

In [11]: gb["Quantity"].aggregate(lambda values: np.nanmean(values))
Out[11]:
Date
2013-09-01    8.0
2013-09-06    NaN
2013-09-11    NaN
2013-09-16    NaN
2013-09-21    NaN
2013-09-26    NaN
2013-10-01    3.0
Name: Quantity, dtype: float64

@jbrockmendel
Copy link
Member Author

Thanks these examples are useful. I'll use them as test cases in the next PR. Any objections to this PR as it stands?

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, looks fine.

@jreback jreback added Groupby Refactor Internal refactoring of code labels Sep 10, 2021
@jreback jreback added this to the 1.4 milestone Sep 10, 2021
@jreback jreback merged commit 1ec2d1d into pandas-dev:master Sep 10, 2021
@jbrockmendel jbrockmendel deleted the ref-gb-numba branch September 10, 2021 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants