Closed
Description
When using cut with an IntervalIndex for bins, the result of the cut is first materialized as an IntervalIndex and then converted to a Categorical:
pandas/pandas/core/reshape/tile.py
Lines 373 to 378 in 143bc34
It seems like it'd be more performant from a computational and memory standpoint to bypass the intermediate construction of an IntervalIndex via take_nd and instead directly construct the Categorical via Categorical.from_codes.
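A minimal sketch of the two routes, using only public pandas API, might look like the following. It is not the code in tile.py: IntervalIndex.take stands in for the internal take_nd call, and the names codes, indirect, and direct are illustrative. The values are chosen so that every one falls inside a bin, which sidesteps missing-value handling.

```python
import numpy as np
import pandas as pd

ii = pd.interval_range(0, 20)            # bins (0, 1], (1, 2], ..., (19, 20]
values = np.linspace(0.5, 19.5, 100)     # every value falls inside some bin

# Position of the bin containing each value (-1 would mean "no bin").
codes = ii.get_indexer(values)

# Indirect route: build one interval per value, then categorize them.
indirect = pd.Categorical(ii.take(codes), categories=ii, ordered=True)

# Direct route: reuse the integer positions as the categorical codes.
direct = pd.Categorical.from_codes(codes, categories=ii, ordered=True)

# Both routes should describe the same categorical.
print(direct.equals(indirect))
```

The direct route skips allocating an interval per input value, which is where the time and memory savings would be expected to come from. Categorical.from_codes also treats a code of -1 as missing, so values outside every bin would not need separate handling.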
Some ad hoc measurements on master:
In [3]: ii = pd.interval_range(0, 20)
In [4]: values = np.linspace(0, 20, 100).repeat(10**4)
In [5]: %timeit pd.cut(values, ii)
7.69 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %memit pd.cut(values, ii)
peak memory: 278.39 MiB, increment: 130.76 MiB
And the same measurements with the Categorical.from_codes fix:
In [3]: ii = pd.interval_range(0, 20)
In [4]: values = np.linspace(0, 20, 100).repeat(10**4)
In [5]: %timeit pd.cut(values, ii)
1.02 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %memit pd.cut(values, ii)
peak memory: 145.81 MiB, increment: 15.98 MiB