REF: avoid unnecessary argument to groupby numba funcs #43487

jbrockmendel · 2021-09-10T00:01:18Z

closes #xxxx
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

cc @mroeschke im pretty sure this is slightly more correct than the status quo, but i could use some help in coming up with test cases. Two main cases to cover:

res = df.groupby(more_than_one_column).transform(..., engine="numba", i_dont_know_off_the_top_the_calling_convention)

I'm pretty sure in this case res.index.names will be wrong in master.

Cases with BinGrouper where there is a length mismatch (also an issue with non-numba paths)

import pandas as pd

df = pd.DataFrame(
        {
            "Buyer": "Carl Carl Carl Carl Joe Carl".split(),
            "Quantity": [18, 3, 5, 1, 9, 3],
            "Date": [
                pd.Timestamp(2013, 9, 1, 13, 0),
                pd.Timestamp(2013, 9, 1, 13, 5),
                pd.Timestamp(2013, 10, 1, 20, 0),
                pd.Timestamp(2013, 10, 3, 10, 0),
                pd.NaT,
                pd.Timestamp(2013, 9, 2, 14, 0),
            ],
        }
    )


gb = df.groupby(pd.Grouper(key="Date", freq="5D"))

gb.transform(lambda x: x, engine="numba", ...)  # <- pretty sure this raises

Or are these not reachable for some reason?

mroeschke · 2021-09-10T02:23:00Z

For 1, these two examples are equivalent, but you mentioned the issue is in the index.names?

In [7]: def f(values, index):
   ...:     return values + 1
   ...:

In [8]: df = pd.DataFrame(np.eye(3))

In [9]: df.groupby([0, 1]).transform(f, engine="numba")
Out[9]:
     2
0  1.0
1  1.0
2  2.0

In [10]: df.groupby([0, 1]).transform(lambda x: x + 1)
Out[10]:
     2
0  1.0
1  1.0
2  2.0

For 2, that example should not work because the groups contain string data (also while debuging the timestamp data is also brought into the group). The numba engine can only work on numeric data.

jbrockmendel · 2021-09-10T02:57:22Z

For 1, these two examples are equivalent, but you mentioned the issue is in the index.names?

Looks like is just for aggregate, not transform. My bad.

For 2, that example should not work because the groups contain string data (also while debuging the timestamp data is also brought into the group). The numba engine can only work on numeric data.

OK. If we just use the Quantity column:

>>> gb["Quantity"].aggregate(lambda values, index: values[0], engine="numba")
Date
NaT           18.0
2013-09-01     9.0
2013-09-06     9.0
2013-09-11     9.0
2013-09-16     9.0
2013-09-21     9.0
2013-09-26     5.0
2013-10-01     9.0
Name: Quantity, dtype: float64

NaT is in the resulting index, which I think is wrong, at least doesn't match the behavior of essentially all the non-numba aggregations. (then again, that aggregation raises in this PR, which isn't a great sign).

Can you suggest an aggregation that we can do on gb["Quantity"] that we can do with and without engine="numba" and expect the answers to match?

mroeschke · 2021-09-10T20:51:52Z

So the thought is that NaT should be dropped since it was a "null" group? If so, probably a bug in the numba engine.

I think for this particular df and gb["Quantity"], the numba and non-numba engine probably cannot be equal because of the bug above.

e.g. these are almost similar

In [10]: gb["Quantity"].aggregate(lambda values, index: np.nanmean(values), engine="numba")
Out[10]:
Date
NaT           8.0
2013-09-01    NaN
2013-09-06    NaN
2013-09-11    NaN
2013-09-16    NaN
2013-09-21    NaN
2013-09-26    3.0
2013-10-01    NaN
Name: Quantity, dtype: float64

In [11]: gb["Quantity"].aggregate(lambda values: np.nanmean(values))
Out[11]:
Date
2013-09-01    8.0
2013-09-06    NaN
2013-09-11    NaN
2013-09-16    NaN
2013-09-21    NaN
2013-09-26    NaN
2013-10-01    3.0
Name: Quantity, dtype: float64

jbrockmendel · 2021-09-10T20:59:24Z

Thanks these examples are useful. I'll use them as test cases in the next PR. Any objections to this PR as it stands?

mroeschke

Yup, looks fine.

REF: avoid unnecessary argument to groupby numba funcs

0d336de

mroeschke approved these changes Sep 10, 2021

View reviewed changes

jreback added Groupby Refactor Internal refactoring of code labels Sep 10, 2021

jreback added this to the 1.4 milestone Sep 10, 2021

jreback merged commit 1ec2d1d into pandas-dev:master Sep 10, 2021

jbrockmendel deleted the ref-gb-numba branch September 10, 2021 21:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

REF: avoid unnecessary argument to groupby numba funcs #43487

REF: avoid unnecessary argument to groupby numba funcs #43487

Uh oh!

jbrockmendel commented Sep 10, 2021

Uh oh!

mroeschke commented Sep 10, 2021

Uh oh!

jbrockmendel commented Sep 10, 2021

Uh oh!

mroeschke commented Sep 10, 2021

Uh oh!

jbrockmendel commented Sep 10, 2021

Uh oh!

mroeschke left a comment

Uh oh!

Uh oh!

Uh oh!

REF: avoid unnecessary argument to groupby numba funcs #43487

REF: avoid unnecessary argument to groupby numba funcs #43487

Uh oh!

Conversation

jbrockmendel commented Sep 10, 2021

Uh oh!

mroeschke commented Sep 10, 2021

Uh oh!

jbrockmendel commented Sep 10, 2021

Uh oh!

mroeschke commented Sep 10, 2021

Uh oh!

jbrockmendel commented Sep 10, 2021

Uh oh!

mroeschke left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!