Performance regression in DataFrame[bool_indexer] #33924

Closed
@TomAugspurger

Description

import numpy as np
import pandas as pd

idx_dupe = np.array(range(30)) * 99
df = pd.DataFrame(np.random.randn(10000, 5))
bool_indexer = [True] * 5000 + [False] * 5000


%timeit df[bool_indexer]
# 1.0.2
2.58 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# master
5.47 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note that this only affects the case where the indexer is a list. Performance looks fine when an ndarray is passed.
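For anyone who wants to check the list-versus-ndarray difference outside IPython, here is a minimal sketch using the standard-library `timeit` module. The names `bool_list`, `bool_array`, and `n` are chosen for illustration and are not part of the original report; absolute timings will depend on the pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 5))
bool_list = [True] * 5000 + [False] * 5000   # list indexer (the case reported as slow)
bool_array = np.array(bool_list)             # same mask as an ndarray

# Both indexer types should select exactly the same rows.
assert df[bool_list].equals(df[bool_array])

n = 100  # loops per measurement (assumption: enough for a rough comparison)
t_list = timeit.timeit(lambda: df[bool_list], number=n) / n
t_array = timeit.timeit(lambda: df[bool_array], number=n) / n

print(f"list indexer:    {t_list * 1e3:.2f} ms per loop")
print(f"ndarray indexer: {t_array * 1e3:.2f} ms per loop")
```

If the list case turns out slower on an affected version, converting the mask once with `np.array(...)` before indexing is a possible workaround, since the two forms return the same result.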

The asv benchmark history at https://pandas.pydata.org/speed/pandas/index.html#indexing.DataFrameNumericIndexing.time_bool_indexer?commits=80d37adcc3d9bfbbe17e8aa626d6b5873465ca98-4f89c261f624305fc7bae6c43ae862663994be34 narrows the regression to somewhere in the range 80d37ad..4f89c26, which still contains a bunch of commits.

Labels: Indexing, Performance, Regression
