PERF: Avoid materializing values in Categorical.set_categories #17508

Closed
@TomAugspurger

In Categorical.set_categories, we allocate an array of the values, which may be expensive:

values = cat.__array__()

It should be possible to do this operation by just manipulating the codes.

In [6]: c = pd.Categorical(['a'] * 100000)

In [7]: c.set_categories(['a', 'b'])
Out[7]:
[a, a, a, a, a, ..., a, a, a, a, a]
Length: 100000
Categories (2, object): [a, b]
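A minimal sketch of the codes-only idea, for illustration. The helper name `recode_for_new_categories` is hypothetical and is not taken from the pandas code base or from the commit referenced in this issue. It assumes the new categories are unique, and that any value whose old category is missing from the new categories should become missing (code `-1`), which matches what `set_categories` does today.

```python
import numpy as np
import pandas as pd


def recode_for_new_categories(codes, old_categories, new_categories):
    """Hypothetical helper: remap integer codes without building the values array."""
    old_categories = pd.Index(old_categories)
    new_categories = pd.Index(new_categories)
    # Position of each old category within the new categories (-1 if dropped).
    indexer = new_categories.get_indexer(old_categories)
    # Append a trailing -1 so an existing missing code (-1) indexes the last
    # slot and stays -1 after the lookup.
    lookup = np.append(indexer, -1)
    # Work is O(len(categories)) for the indexer plus one integer take over codes.
    return lookup[codes]


c = pd.Categorical(["a", "b", "a", None])
new_codes = recode_for_new_categories(c.codes, c.categories, ["b", "c"])
result = pd.Categorical.from_codes(new_codes, categories=["b", "c"])
```

The only per-element work here is the integer lookup on `codes`; the object array of values is never created.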

See 5ab0123 for how this might work. That commit will probably be squashed, but it contains the implementation of Categorical._set_dtype in #16015.

I may get to this as a followup to that PR.

Labels: Categorical (Categorical Data Type); Performance (Memory or execution speed performance)
