
PERF: pd.merge(on=index) is faster than pd.merge(left_index=True, right_index=True) if index is duplicated #56564

Open
@starhel

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd  # 2.1.4 or 2.2.0.dev0+925.gac170fdc35 (current master)

df1 = pd.DataFrame({
    "key": [x // 100 for x in range(1_000_000)],
    "val1": 10
}).set_index("key")

df2 = pd.DataFrame({
    "key": [x // 100 for x in range(200_000, 800_000)],
    "val2": 20
}).set_index("key")

%timeit -r7 -n3 pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
# 2.1.4: 1.57 s ± 247 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
# main: 2.1 s ± 366 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

%timeit -r7 -n3 pd.merge(df1, df2, on=["key"], how="inner")
# 2.1.4: 1.25 s ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
# main: 2.06 s ± 19.8 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)

Installed Versions

Exceptions:
ModuleNotFoundError: No module named 'PySide6'
QtBindingsNotFoundError: No Qt bindings could be found

Prior Performance

Both merges give the same output (verified using pd.testing.assert_frame_equal).
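The equivalence check can be sketched on a scaled-down version of the frames. The sizes and the names `by_index` and `by_name` are assumptions chosen for illustration; only the use of `pd.testing.assert_frame_equal` comes from the report.

```python
import pandas as pd

# Scaled-down frames (assumed sizes): every key appears 3 times on each side.
df1 = pd.DataFrame({
    "key": [x // 3 for x in range(30)],
    "val1": 10
}).set_index("key")

df2 = pd.DataFrame({
    "key": [x // 3 for x in range(6, 24)],
    "val2": 20
}).set_index("key")

# Same inner join, expressed through the two call forms compared in this issue.
by_index = pd.merge(df1, df2, left_index=True, right_index=True, how="inner")
by_name = pd.merge(df1, df2, on=["key"], how="inner")

# Raises AssertionError if the two results differ in values, index, or dtypes.
pd.testing.assert_frame_equal(by_index, by_name)
```

With duplicated keys on both sides, each shared key contributes the product of its left and right counts to the result, which is why the output grows much faster than the inputs.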

pd.merge(..., on=index)

  • 2.1.4: about 20% faster than the left_index/right_index form, and more stable (14.3 ms vs 247 ms std. dev.)
  • master: the average time is almost the same for both forms, but there is a large regression: it is almost 65% slower than on 2.1.4.
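The percentages above follow from the mean timings in the reproducible example. A minimal sketch of the arithmetic, where the variable names and the `pct_change` helper are hypothetical and the numbers are copied from the `%timeit` comments:

```python
# Mean timings in seconds, copied from the %timeit comments above.
index_214, on_214 = 1.57, 1.25    # pandas 2.1.4
index_main, on_main = 2.10, 2.06  # main branch


def pct_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100


# Negative value: on=["key"] is faster than the index form on 2.1.4.
on_vs_index_214 = pct_change(on_214, index_214)

# Positive value: on=["key"] is slower on main than on 2.1.4.
on_main_vs_214 = pct_change(on_main, on_214)

print(f"on= vs index form, 2.1.4: {on_vs_index_214:.1f}%")
print(f"on= on main vs 2.1.4:     {on_main_vs_214:.1f}%")
```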

This behaviour is observed only when the index has duplicated values in both DataFrames. The expected behaviour is that execution time is at least the same regardless of which arguments are used to specify the join.
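To check the dependence on duplicated index values outside IPython, something like the following harness could be used. It is a sketch under assumptions: `make_frames`, `time_merges`, and the row counts are hypothetical, and the standard-library `timeit` module stands in for the `%timeit` magic. No timings are claimed here; the numbers depend on the machine and pandas version.

```python
import timeit

import pandas as pd


def make_frames(n_rows, repeats):
    """Build two indexed frames; each key appears `repeats` times per frame."""
    df1 = pd.DataFrame({
        "key": [x // repeats for x in range(n_rows)],
        "val1": 10
    }).set_index("key")
    # The right frame covers the middle 60% of the left frame's rows,
    # mirroring range(200_000, 800_000) out of 1_000_000 in the report.
    lo, hi = n_rows // 5, 4 * n_rows // 5
    df2 = pd.DataFrame({
        "key": [x // repeats for x in range(lo, hi)],
        "val2": 20
    }).set_index("key")
    return df1, df2


def time_merges(n_rows, repeats, number=3, repeat=3):
    """Return the best per-call time in seconds for both merge call forms."""
    df1, df2 = make_frames(n_rows, repeats)
    by_index = timeit.repeat(
        lambda: pd.merge(df1, df2, left_index=True, right_index=True, how="inner"),
        number=number, repeat=repeat,
    )
    by_name = timeit.repeat(
        lambda: pd.merge(df1, df2, on=["key"], how="inner"),
        number=number, repeat=repeat,
    )
    return {
        "left_index/right_index": min(by_index) / number,
        "on": min(by_name) / number,
    }


# repeats=1 gives a unique index; repeats=20 gives duplicates in both frames.
for repeats in (1, 20):
    print(repeats, time_merges(n_rows=20_000, repeats=repeats))
```

The row count is kept small because the inner join output grows with the square of `repeats` per shared key; the sizes from the report produce a much larger result.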

The regression on master may be caused by #56523 (@lukemanley @phofl FYI).

Metadata

Labels

Performance (Memory or execution speed performance), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
