PERF: Avoid materializing entire IntervalIndex when using cut

When `cut` is called with an `IntervalIndex` as `bins`, the result is first materialized as an `IntervalIndex` and then converted to a `Categorical`:

https://github.com/pandas-dev/pandas/blob/143bc34aa8f068b18e7137df7ca91b9929cc1389/pandas/core/reshape/tile.py#L373-L378

It seems like it would be faster and use less memory to skip the intermediate `IntervalIndex` built via `take_nd` and instead construct the `Categorical` directly from the indexer codes via `Categorical.from_codes`.
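A minimal sketch of the idea using only public API, not necessarily the exact patch: the positions returned by `IntervalIndex.get_indexer` already follow the `Categorical` codes convention (`-1` for values that fall in no bin), so they can be passed straight to `Categorical.from_codes`. The variable names `codes`, `direct`, and `reference` are just for illustration.

```python
import numpy as np
import pandas as pd

bins = pd.interval_range(0, 20)      # right-closed bins: (0, 1], (1, 2], ..., (19, 20]
values = np.linspace(0, 20, 100)     # includes 0.0, which falls outside every bin

# Position of the bin containing each value; -1 where no bin matches.
codes = bins.get_indexer(values)

# Build the Categorical directly from the codes, without first
# materializing an IntervalIndex of the same length as `values`.
direct = pd.Categorical.from_codes(codes, categories=bins, ordered=True)

# Reference result from the public entry point, for comparison.
reference = pd.cut(values, bins)
```

Here `direct` and `reference` should carry the same codes and categories; the only difference is that `direct` never builds the per-element `IntervalIndex`.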

Some ad hoc measurements on `master`:
```python
In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
7.69 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 278.39 MiB, increment: 130.76 MiB
```
And the same measurements with the `Categorical.from_codes` fix:
```python
In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
1.02 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 145.81 MiB, increment: 15.98 MiB
```

The snippet referenced above (`pandas/core/reshape/tile.py`, lines 373-378):

	if isinstance(bins, IntervalIndex):
		# we have a fast-path here
		ids = bins.get_indexer(x)
		result = algos.take_nd(bins, ids)
		result = Categorical(result, categories=bins, ordered=True)
		return result, bins

PERF: Avoid materializing entire IntervalIndex when using cut #27668
