Description
From #39146 (comment) (discovered while investigating a benchmark difference). It seems that in groupby/ops.py, fast_apply
(using libreduction) and the generic Python apply give different results when the applied function returns output with the same index as its input.
Using a small example dataframe and a function to be applied which simply copies the input:
import numpy as np
import pandas as pd

N = 10
df = pd.DataFrame(
    {
        "key": np.random.randint(0, 3, size=N),
        "value1": np.random.randn(N),
        "value2": ["foo", "bar"] * (N // 2),
    }
)

def df_copy_function(g):
    # ensure that the group name is available (see GH #15062)
    g.name
    return g.copy()
By default you get this result:
In [3]: df.groupby("key").apply(df_copy_function)
Out[3]:
       key    value1 value2
key
0   8    0 -0.149534    foo
    9    0 -0.391135    bar
1   1    1 -0.581107    bar
    2    1 -0.338278    foo
    3    1  0.768924    bar
    6    1 -0.778718    foo
2   0    2  0.196477    foo
    4    2 -0.364822    foo
    5    2 -0.976079    bar
    7    2 -2.671668    bar
But if I force the fast apply path to be skipped (in this case by converting one column to an extension dtype), the result is different: the group key is no longer added as an index level.
In [4]: df["value2"] = df["value2"].astype("string")
In [5]: df.groupby("key").apply(df_copy_function)
Out[5]:
   key    value1 value2
0    2  0.196477    foo
1    1 -0.581107    bar
2    1 -0.338278    foo
3    1  0.768924    bar
4    2 -0.364822    foo
5    2 -0.976079    bar
6    1 -0.778718    foo
7    2 -2.671668    bar
8    0 -0.149534    foo
9    0 -0.391135    bar
This might be another manifestation of #34998 and the issues linked from that PR.
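To check quickly whether the two paths agree on a given pandas version, one option is to compare the number of index levels in each result. The following is only a rough sketch: `compare_apply_index_levels` and `copy_group` are hypothetical names introduced here, a seeded generator replaces the unseeded `np.random` calls so the run is repeatable, and the `g.name` access is left out. Whether the two numbers differ depends on the installed pandas version, so no expected output is given.

```python
import warnings

import numpy as np
import pandas as pd


def compare_apply_index_levels(n=10, seed=0):
    """Hypothetical helper: return the index depth of the apply result
    for an object-dtype frame and for a frame with a "string" column."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        {
            "key": rng.integers(0, 3, size=n),
            "value1": rng.standard_normal(n),
            "value2": ["foo", "bar"] * (n // 2),  # n is assumed to be even
        }
    )

    def copy_group(g):
        # Same-indexed output: the group is returned unchanged.
        return g.copy()

    with warnings.catch_warnings():
        # Newer pandas versions may emit deprecation warnings for apply.
        warnings.simplefilter("ignore")
        res_object = df.groupby("key").apply(copy_group)
        df_ext = df.assign(value2=df["value2"].astype("string"))
        res_string = df_ext.groupby("key").apply(copy_group)

    return res_object.index.nlevels, res_string.index.nlevels


levels_object, levels_string = compare_apply_index_levels()
print("object column:", levels_object, "index level(s)")
print("string column:", levels_string, "index level(s)")
print("consistent:", levels_object == levels_string)
```

A depth of 2 means the group key was prepended as an index level (as in Out[3]); a depth of 1 means the original index was kept (as in Out[5]).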