Description
The fix in #42089 (or possibly the PR that it was fixing) seems to have caused a large slowdown in the get_indexer benchmarks:
https://pandas.pydata.org/speed/pandas/#indexing.CategoricalIndexIndexing.time_get_indexer_list?python=3.8&Cython=0.29.21&p-index='monotonic_incr'&commits=cf5852bf-fce7f9eb
The regression overview (https://pandas.pydata.org/speed/pandas/#regressions?sort=1&dir=desc) lists it as a 1000x slowdown, but that's only because #42042 first improved the performance a lot (which might be a bit suspicious?). Compared to the timing before that, it's only a 4-5x slowdown. With the code below, I see a ~9x slowdown locally on master compared to 1.2.5.
import itertools
import string
import pandas as pd
data_unique = pd.CategoricalIndex(
["".join(perm) for perm in itertools.permutations(string.printable, 3)]
)
cat_list = ["a", "c"]
%timeit data_unique.get_indexer(cat_list)
52.8 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) # <-- pandas 1.2.5
417 ms ± 22.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # <-- master
I think it has to do with the fact that before, we called the Engine.get_indexer on the codes, while now in the base class version we do that with the .categories, which means in this case that both self and target are cast to object dtype and thus use the Engine.get_indexer for object dtype.
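To make the two lookup strategies concrete, here is a rough sketch using only public pandas API. It is not the actual internal code path of either version: the names via_object and via_codes are made up for illustration, and a smaller index is used so that it runs quickly. It only shows that matching on the integer codes gives the same positions as matching on the object-dtype values.

```python
import itertools
import string

import pandas as pd

# Smaller stand-in for data_unique: 720 unique three-letter strings.
values = ["".join(p) for p in itertools.permutations(string.ascii_lowercase[:10], 3)]
ci = pd.CategoricalIndex(values)
target = ["abc", "acb"]

# Object-dtype route (roughly what the description above says master does):
# compare the string values themselves.
via_object = pd.Index(ci.astype(object)).get_indexer(target)

# Codes route (roughly what the description above says 1.2.5 did):
# translate the target into category codes, then match on the integer codes.
# Assumption: the index holds no missing values, so no code equals -1 and an
# unmatched target (code -1) cannot be confused with a NaN entry.
target_codes = ci.categories.get_indexer(target)
via_codes = pd.Index(ci.codes).get_indexer(target_codes)

print(via_object)
print(via_codes)
```

Timing the two routes separately with %timeit on the full data_unique index should show whether the object-dtype engine accounts for the difference.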
Originally posted by @jorisvandenbossche in #42089 (comment)