PERF: Regression with indexing with ExtensionEngine #45652

@dalejung

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from itertools import permutations, chain
import string

string_values = chain(
    string.ascii_uppercase,
    permutations(string.ascii_uppercase, 2),
    permutations(string.ascii_uppercase, 3),
)

string_index = pd.Index(map(''.join, string_values)).astype('string')

df = pd.DataFrame({'ints': range(len(string_index))}, index=string_index)

subset_index = string_index[string_index.str.startswith('A')]

# %time slow_result = df.loc[subset_index]
# CPU times: user 1.93 s, sys: 0 ns, total: 1.93 s
# Wall time: 1.93 s
slow_result = df.loc[subset_index]

# %time fast_result = df.loc[subset_index.values]
# CPU times: user 2.85 ms, sys: 0 ns, total: 2.85 ms
# Wall time: 2.82 ms
fast_result = df.loc[subset_index.values]

# results are the same.
pd.testing.assert_frame_equal(slow_result, fast_result)

# Old object indexes don't have this issue.
object_index_df = df.copy()
object_index_df.index = object_index_df.index.astype(object)

# %time obj_result = object_index_df.loc[subset_index]
# CPU times: user 945 µs, sys: 19 µs, total: 964 µs
# Wall time: 939 µs
obj_result = object_index_df.loc[subset_index]

This only happens when both the index and the indexer have dtype='string'. Note here that df.index and subset_index are both dtype='string'. What is odd is that just accessing .values on the indexer, or converting it to a list, makes indexing fast again. Old object indexes don't have the same issue.
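
To compare the indexer variants side by side, something like the following sketch may help. It assumes a reduced index (one- and two-letter labels only) so it runs quickly, and `time_loc` is a hypothetical helper, not part of pandas. Timings will vary by machine and pandas version, so none are shown here.

```python
import string
import time
from itertools import chain, permutations

import pandas as pd

# Assumption: a smaller index than in the report, so the script finishes quickly.
labels = [
    "".join(p)
    for p in chain(string.ascii_uppercase, permutations(string.ascii_uppercase, 2))
]
string_index = pd.Index(labels).astype("string")
df = pd.DataFrame({"ints": range(len(string_index))}, index=string_index)
subset_index = string_index[string_index.str.startswith("A")]


def time_loc(frame, indexer):
    """Hypothetical helper: return (result, elapsed seconds) for frame.loc[indexer]."""
    start = time.perf_counter()
    result = frame.loc[indexer]
    return result, time.perf_counter() - start


indexers = {
    "index": subset_index,           # string-dtype Index (the slow case in the report)
    "values": subset_index.values,   # underlying array (fast in the report)
    "list": subset_index.tolist(),   # plain Python list of str
}
results = {name: time_loc(df, indexer) for name, indexer in indexers.items()}

# All three indexers should select the same rows; only the timing differs.
baseline = results["index"][0]["ints"].tolist()
for name, (result, elapsed) in results.items():
    assert result["ints"].tolist() == baseline
    print(f"{name:>6}: {elapsed * 1e3:.3f} ms")
```

Until the regression is fixed, passing `subset_index.values` or `subset_index.tolist()` to `.loc` is the workaround suggested by the timings above.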

Installed Versions

INSTALLED VERSIONS

commit : c5ff649
python : 3.10.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.16.0-arch1-1
Version : #1 SMP PREEMPT Mon, 10 Jan 2022 20:11:47 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0.dev0+193.gc5ff649b11
numpy : 1.23.0.dev0+512.g6077afd65
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.5.3
Cython : 0.29.26
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.55.0dev0+1077.g0994f97c3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0.dev587+g458271315
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.0b1
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None

Prior Performance

No response

Labels: Indexing, Performance
