PERF: Slowdowns with .isin() on columns typed as np.uint64 #60098

Closed
@adrian17

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd, numpy as np
data = pd.DataFrame({
    "uints": np.random.randint(10000, size=300000, dtype=np.uint64),
    "ints": np.random.randint(10000, size=300000, dtype=np.int64),
})

%timeit data["ints"].isin([1, 2]) # 3ms
%timeit data["ints"].isin(np.array([1, 2], dtype=np.int64)) # 3ms
%timeit data["ints"].isin(np.array([1, 2], dtype=np.uint64)) # 5ms
%timeit data["ints"].isin([np.int64(1), np.int64(2)]) # 3ms
%timeit data["ints"].isin([np.uint64(1), np.uint64(2)]) # 5ms

%timeit data["uints"].isin([1, 2]) # 14ms (!)
%timeit data["uints"].isin(np.array([1, 2], dtype=np.int64)) # 5ms
%timeit data["uints"].isin(np.array([1, 2], dtype=np.uint64)) # 3ms
%timeit data["uints"].isin([np.int64(1), np.int64(2)])  # 17ms (!)
%timeit data["uints"].isin([np.uint64(1), np.uint64(2)]) # 17ms (!)

With the older numpy==1.26.4 (the last release before 2.0), the last line is even slower: about 200ms.
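The `%timeit` lines above need IPython. For running the same comparison as a plain script, here is a minimal sketch using the standard `timeit` module. The helper name `time_isin` and the `repeat`/`number` settings are assumptions chosen for illustration, and it only covers the `uints` column; absolute timings will vary by machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd


def time_isin(series, values, repeat=3, number=5):
    """Hypothetical helper: best per-call time in ms for series.isin(values)."""
    timer = timeit.Timer(lambda: series.isin(values))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1000.0


# Same shape of data as the uint64 column in the report.
uints = pd.Series(np.random.randint(10000, size=300000, dtype=np.uint64))

# The value containers compared in the report.
cases = {
    "python ints": [1, 2],
    "int64 array": np.array([1, 2], dtype=np.int64),
    "uint64 array": np.array([1, 2], dtype=np.uint64),
    "int64 scalars": [np.int64(1), np.int64(2)],
    "uint64 scalars": [np.uint64(1), np.uint64(2)],
}

timings = {label: time_isin(uints, values) for label, values in cases.items()}
for label, ms in timings.items():
    print(f"uints.isin({label}): {ms:.2f} ms")
```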

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.10.12
python-bits : 64
OS : Linux
OS-release : 6.5.0-27-generic
Version : #28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.1.2

Prior Performance

With pandas 1.4.4 and numpy 1.26.4, every benchmark above takes 3-5ms (3ms when signedness matches, 5ms when it does not). So although upgrading numpy mitigates the worst case (~200ms), this still looks like a roughly 5x performance regression on the pandas side since 1.4.4.

I'm guessing the regression could be related to PR #46693, which landed in the 1.5.0 release.
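In the timings above, the fast `uints` case is the one where the values arrive as an ndarray of the column's own dtype (3ms, versus 14-17ms for lists). A possible workaround, sketched under that observation, is to convert the values up front. The helper `isin_matching_dtype` is hypothetical, not a pandas API, and it assumes every value is representable in the column dtype (for example, no negative numbers for a uint64 column).

```python
import numpy as np
import pandas as pd


def isin_matching_dtype(series, values):
    """Hypothetical helper: cast values to the series dtype, then call isin.

    Assumes every value fits in series.dtype; a negative value cast to
    uint64 would wrap around and give wrong matches.
    """
    arr = np.asarray(values).astype(series.dtype)
    return series.isin(arr)


# Small usage example on a uint64 column.
col = pd.Series(np.array([1, 5, 2, 7], dtype=np.uint64))
mask = isin_matching_dtype(col, [np.uint64(1), np.uint64(2)])
print(mask.tolist())
```

This only sidesteps the slow path on the caller's side; it does not address the underlying regression.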

Labels: Performance (memory or execution speed performance), Regression (functionality that used to work in a prior pandas version), isin (isin method)