PERF: Avoid materializing entire IntervalIndex when using cut #27668

Closed
@jschendel

When using cut with an IntervalIndex for bins, the result of the cut is first materialized as an IntervalIndex and then converted to a Categorical:

if isinstance(bins, IntervalIndex):
    # we have a fast-path here
    ids = bins.get_indexer(x)
    result = algos.take_nd(bins, ids)
    result = Categorical(result, categories=bins, ordered=True)
    return result, bins

It seems like it would be faster and use less memory to skip the intermediate IntervalIndex that take_nd builds, and instead construct the Categorical directly from the indexer codes via Categorical.from_codes.
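
A minimal sketch of the idea, using only public pandas API rather than the internal code path quoted above. The sample values and the variable names (x, ids, via_codes, via_cut) are made up for illustration; this is not the actual patch. The codes returned by get_indexer are passed straight to Categorical.from_codes, where -1 marks values that fall in no bin:

```python
import numpy as np
import pandas as pd

# Bins (0, 1], (1, 2], ..., (19, 20]; closed on the right by default.
bins = pd.interval_range(0, 20)

# Illustrative values, including some that fall outside every bin.
x = np.array([0.0, 0.5, 1.0, 1.5, 19.5, 20.0, 25.0])

# Position of the bin containing each value, or -1 if there is none.
ids = bins.get_indexer(x)

# Build the Categorical straight from the codes; no intermediate
# IntervalIndex holding one Interval per input value is created.
via_codes = pd.Categorical.from_codes(ids, categories=bins, ordered=True)

# For comparison: the public entry point.
via_cut = pd.cut(x, bins)

print(via_codes.codes)
print(via_cut.codes)
```

Both results should carry the same codes and categories; values outside every bin get code -1 and show up as NaN.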

Some ad hoc measurements on master:

In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
7.69 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 278.39 MiB, increment: 130.76 MiB

And the same measurements with the Categorical.from_codes fix:

In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
1.02 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 145.81 MiB, increment: 15.98 MiB
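
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using only the standard library timeit and tracemalloc modules. The helper name measure_cut and its parameters are hypothetical. tracemalloc reports traced allocations rather than process memory, so its numbers are not directly comparable to the %memit figures above. The default input is smaller than in the measurements above; pass repeat=10**4 to match them.

```python
import timeit
import tracemalloc

import numpy as np
import pandas as pd


def measure_cut(repeat=10**3, runs=3):
    """Return (best seconds per call, peak traced bytes) for pd.cut.

    Hypothetical helper: `repeat` controls how many times each of the
    100 sample points is repeated, `runs` how many timings are taken.
    """
    ii = pd.interval_range(0, 20)
    values = np.linspace(0, 20, 100).repeat(repeat)

    # Best of several single-call timings.
    best = min(timeit.repeat(lambda: pd.cut(values, ii), number=1, repeat=runs))

    # Peak memory allocated while one call is running.
    tracemalloc.start()
    pd.cut(values, ii)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return best, peak


seconds, peak_bytes = measure_cut()
print(f"{seconds:.4f} s per call, peak traced memory {peak_bytes / 2**20:.2f} MiB")
```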

Labels: Interval (Interval data type), Performance (Memory or execution speed performance)
