Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import pandas as pd  # 2.1.4 or 2.2.0.dev0+925.gac170fdc35 (current master)

df1 = pd.DataFrame({
    "key": [x // 100 for x in range(1_000_000)],
    "val1": 10
}).set_index("key")

df2 = pd.DataFrame({
    "key": [x // 100 for x in range(200_000, 800_000)],
    "val2": 20
}).set_index("key")

%timeit -r7 -n3 pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
# 2.1.4: 1.57 s ± 247 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
# main: 2.1 s ± 366 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

%timeit -r7 -n3 pd.merge(df1, df2, on=["key"], how="inner")
# 2.1.4: 1.25 s ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
# main: 2.06 s ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
Installed Versions
Exceptions:
ModuleNotFoundError: No module named 'PySide6'
QtBindingsNotFoundError: No Qt bindings could be found
Prior Performance
Both merges give the same output (verified using pd.testing.assert_frame_equal).
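For anyone who wants to repeat that equality check, here is a minimal sketch on a scaled-down version of the frames above. `build_frames` and `merged_both_ways` are hypothetical helper names used only for illustration. The stable `sort_index` before comparing is an extra precaution added here and was not part of the original check.

```python
import pandas as pd


def build_frames(n=1_000):
    # Hypothetical helper: same key layout as the reproducer, scaled down to run quickly.
    df1 = pd.DataFrame(
        {"key": [x // 100 for x in range(n)], "val1": 10}
    ).set_index("key")
    df2 = pd.DataFrame(
        {"key": [x // 100 for x in range(n // 5, 4 * n // 5)], "val2": 20}
    ).set_index("key")
    return df1, df2


def merged_both_ways(df1, df2):
    # Hypothetical helper: run the two call styles compared in this issue.
    by_index = pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
    by_name = pd.merge(df1, df2, on=["key"], how="inner")
    return by_index, by_name


df1, df2 = build_frames()
by_index, by_name = merged_both_ways(df1, df2)

# Raises AssertionError if the two results differ.
# The stable sort is a precaution added in this sketch, not part of the report.
pd.testing.assert_frame_equal(
    by_index.sort_index(kind="stable"),
    by_name.sort_index(kind="stable"),
)
```

With `n=1_000` the overlap covers 6 keys with 100 rows on each side, so each result has 60,000 rows.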
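pd.merge(..., on=["key"]) compared with pd.merge(..., left_index=True, right_index=True):
- 2.1.4: about 20% faster (1.25 s vs 1.57 s) and more stable (14.3 ms vs 247 ms std. dev.).
- master: the average time is almost the same (2.06 s vs 2.1 s), but there is a massive regression: it is almost 65% slower than on 2.1.4 (2.06 s vs 1.25 s).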
This behaviour is observed only when the index has duplicated values in both DataFrames. The expected behaviour is that execution time is at least the same regardless of which arguments are used to specify the join key.
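A rough way to check the duplicated-versus-unique observation outside IPython is sketched below, using the standard-library `timeit` module on much smaller frames. `make_frame` and `time_merges` are hypothetical helpers, and the sizes are arbitrary assumptions chosen to keep the run short, so absolute numbers will not match the timings reported above.

```python
import timeit

import pandas as pd


def make_frame(keys, column, value):
    # Hypothetical helper: one constant-valued column, indexed by "key".
    return pd.DataFrame({"key": keys, column: value}).set_index("key")


def time_merges(left, right, repeat=3, number=1):
    # Best wall-clock time in seconds per call, for each call style.
    by_index = min(timeit.repeat(
        lambda: pd.merge(left, right, left_index=True, right_index=True, how="inner"),
        repeat=repeat, number=number,
    )) / number
    by_name = min(timeit.repeat(
        lambda: pd.merge(left, right, on=["key"], how="inner"),
        repeat=repeat, number=number,
    )) / number
    return {"left_index/right_index": by_index, "on": by_name}


n = 20_000  # assumption: far smaller than the 1_000_000 rows in the report

# Duplicated index on both sides (100 rows per key), as in the reproducer.
dup_left = make_frame([x // 100 for x in range(n)], "val1", 10)
dup_right = make_frame([x // 100 for x in range(n // 5, 4 * n // 5)], "val2", 20)

# Unique index on both sides, for contrast.
uniq_left = make_frame(list(range(n)), "val1", 10)
uniq_right = make_frame(list(range(n // 5, 4 * n // 5)), "val2", 20)

timings = {
    "duplicated index": time_merges(dup_left, dup_right),
    "unique index": time_merges(uniq_left, uniq_right),
}
for label, result in timings.items():
    print(label, result)
```

The two cases produce results of very different sizes, so only the two call styles within the same case are meaningfully comparable.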
The regression on master may be caused by #56523 (@lukemanley @phofl FYI).