
Non-unique integer index performance depending on index creation method #26064

Closed
@quantuminternet

Description


Code Sample, a copy-pastable example if possible

import pandas
pandas.__version__
# '0.24.2'
df = pandas.DataFrame([[x // 2] for x in range(0, 10**7)], columns=['A'])
df['B'] = df['A'] + 10**15
df = df.set_index('B').sort_index()
# First access to index might be slow, so get that out of the way
df.loc[1000000000000000]

%timeit df.loc[1000000000000000]
# 71.2 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

index_name = df.index.name
df.index = df.index.values.tolist()
df.index.name = index_name

%timeit df.loc[1000000000000000]
# 422 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Problem description

The issue seems to occur when a non-unique but sorted column that is the result of a previous calculation is used as the index. Access to rows with .loc is then much slower than it could be. When all index values are copied into a Python list and then reinserted into the index, performance improves by a large factor, even though the DataFrames before and after look exactly the same.

On the first try, the lookup time depends on both the size of the DataFrame and the size of the offset. On the second try (after copying the values to a list and back), the lookup appears to take constant time. I assume the two cases use different indexing methods internally.
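
A rough sketch of how I measured the offset dependence (helper name and offsets chosen arbitrarily; timings are machine-dependent):

import timeit
import pandas

def build_frame(n, offset):
    # same construction as in the code sample above, with size and offset as parameters
    frame = pandas.DataFrame([[x // 2] for x in range(n)], columns=['A'])
    frame['B'] = frame['A'] + offset
    return frame.set_index('B').sort_index()

for offset in (10**9, 10**12, 10**15):
    frame = build_frame(10**6, offset)
    frame.loc[offset]  # warm up the index, as in the example above
    per_loop = min(timeit.repeat(lambda: frame.loc[offset], repeat=3, number=10)) / 10
    print(offset, '%0.2f ms' % (per_loop * 1000))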

I have found no way to determine in advance whether an index will be affected by this or how it will perform. For code that has to do many random-access lookups in this DataFrame, a performance difference of more than 100x determines whether it can run in reasonable time at all.
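
The closest I have found is poking at private internals, which is obviously not a supported way to check this (attribute names taken from the 0.24.x internals and may change between versions):

df.index.is_unique                     # False in both cases
df.index.is_monotonic_increasing       # True in both cases
type(df.index._engine)                 # e.g. pandas._libs.index.Int64Engine (private)
df.index._engine.is_mapping_populated  # whether the engine's hash table has been built (private)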

The workaround is easy enough, but preemptively doing it everywhere feels like bad practice, because you should not have to track the exact history of operations that were performed on an index.

There are different ways to 'rebuild' the index for this DataFrame, but as far as I can tell, all of them involve moving the index values into a non-pandas/NumPy structure and back again. Neither .reset_index().set_index('B') nor .copy(deep=True) has any effect, as shown in the helper below.
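
For reference, this is the workaround from the code sample above, wrapped as a small helper (the function name is made up):

def rebuild_index(frame):
    # Round-trip the index values through a plain Python list; this seems to
    # force a fresh Index object and restores the fast .loc behaviour.
    name = frame.index.name
    frame.index = frame.index.values.tolist()
    frame.index.name = name
    return frame

df = rebuild_index(df)
# .loc lookups are now in the microsecond range, as in the timings above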

Expected Output

Ideally, pandas would choose the right indexing method on set_index() or sort_index() and automatically arrange the data for optimized access.

If this is not possible for technical reasons, or would create slowdowns for other use cases, I would like a way to check which indexing method will be used and a way to optimize the data for indexing.
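
Something along these lines would already help (purely hypothetical API, names made up only to illustrate the request):

df.index.lookup_method     # hypothetical: report which lookup strategy .loc will use
df = df.optimize_index()   # hypothetical: rebuild internals for fast repeated .loc access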

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-142-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8

pandas: 0.24.2
pytest: 3.4.2
pip: 19.0.3
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.16.2
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.1
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.6
pymysql: 0.8.0
psycopg2: None
jinja2: 2.8.1
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None
gcsfs: None
