PERF: Categorical indexing performance regression #30744

Closed
@jorisvandenbossche

Description

Recent regression in the categoricals.CategoricalSlicing.time_getitem_list benchmark: https://pandas.pydata.org/speed/pandas/#categoricals.CategoricalSlicing.time_getitem_list?commits=6efc2379-b9de33e3

Reproducible example for this benchmark:

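import pandas as pd

N = 10 ** 6
categories = ["a", "b", "c"]
values = [0] * N + [1] * N + [2] * N
data = pd.Categorical.from_codes(values, categories=categories)

# Index with a plain Python list of 10,000 integer positions
list_ = list(range(10000))

%timeit data[list_]  # IPython magic; run in IPython or Jupyter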
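
For anyone who wants to run this outside IPython, here is a rough equivalent using the standard-library timeit module. The helper names build_categorical and time_getitem_list are made up for this sketch and are not part of the asv benchmark suite; the numbers it prints will not match the asv results exactly.

```python
import timeit

import pandas as pd


def build_categorical(n):
    """Build the benchmark Categorical: n codes each of 0, 1 and 2."""
    categories = ["a", "b", "c"]
    codes = [0] * n + [1] * n + [2] * n
    return pd.Categorical.from_codes(codes, categories=categories)


def time_getitem_list(data, key, number=100, repeat=5):
    """Return the best observed time in seconds for one data[key] call."""
    timings = timeit.repeat(lambda: data[key], number=number, repeat=repeat)
    return min(timings) / number


data = build_categorical(10 ** 6)
list_ = list(range(10000))

per_call = time_getitem_list(data, list_)
print(f"data[list_]: {per_call * 1e6:.1f} us per call")
```

Running the same script on the commits before and after the change should show whether the regression reproduces locally.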

This slowdown is caused by the changes in #30308: Categorical.__getitem__ now checks whether the key is a boolean indexer: https://github.com/pandas-dev/pandas/pull/30308/files#diff-f3b2ea15ba728b55cab4a1acd97d996d
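
To illustrate why such a check costs more for a list key than for an array key, here is a minimal sketch. The function is_bool_key is hypothetical and is not the check pandas uses; it only assumes one simple strategy, converting the key to a NumPy array and inspecting the dtype.

```python
import numpy as np


def is_bool_key(key):
    """Hypothetical boolean-indexer check (not the pandas implementation).

    An ndarray already carries its dtype, so the check is cheap.
    A Python list has to be converted first, which touches every element.
    """
    arr = key if isinstance(key, np.ndarray) else np.asarray(key)
    return arr.dtype == np.bool_


int_list = list(range(10000))            # the kind of key used in the benchmark
bool_array = np.array([True, False, True])

print(is_bool_key(int_list))    # integer positions, not a boolean mask
print(is_bool_key(bool_array))  # boolean mask
```

With this strategy the conversion cost grows with the length of the list, which is the kind of overhead that shows up when indexing with a 10,000-element list.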

The slowdown is therefore expected, and it only affects indexing a Categorical directly (e.g. pd.Series indexing already does this boolean check). In that light, we can certainly ignore this regression.
But this led me to think: maybe the ExtensionArrays are a good place to start not supporting object dtype as a boolean indexer? (That is, not adding support for it now, which would also avoid this performance regression.)
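
As a sketch of what that would mean for users: anyone holding an object-dtype mask could cast it to a real boolean array before indexing. The example below only shows that explicit cast; it makes no claim about how any particular pandas version treats the uncast object-dtype mask.

```python
import numpy as np
import pandas as pd

data = pd.Categorical(["a", "b", "c", "a"])

# A mask that holds Python bools but has object dtype
object_mask = np.array([True, False, True, False], dtype=object)

# Explicit cast to a proper boolean dtype before indexing
bool_mask = object_mask.astype(bool)

selected = data[bool_mask]
print(list(selected))  # ['a', 'c']
```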


Labels: Performance (Memory or execution speed performance)
