Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Hello, I would like to suggest a possible improvement to the time efficiency of pandas. It concerns a performance issue similar to the one reported in #54550. The PR for that issue, #54746, did not adjust the threshold used by get_indexer_non_unique in the class MaskedIndexEngine. My suggestion is to revise this limit to match the strategy adopted in #54746, namely len(targets) < (n / (2 * n.bit_length())). I believe this adjustment would improve performance in cases like the one shown in the example below.
I am willing to create a pull request for this if you believe it would be beneficial.
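To give a sense of scale, here is a minimal sketch that evaluates the proposed limit for the table size used in the example in this issue. The helper name proposed_limit is hypothetical and exists only for illustration; the formula itself is the one quoted above, and nothing here is taken from the pandas source.

```python
def proposed_limit(n: int) -> float:
    """Exclusive upper bound on len(targets), per the formula quoted in this issue.

    Hypothetical helper for illustration only; it is not a pandas function.
    """
    return n / (2 * n.bit_length())


n = 10_000_000  # same table_size as the reproducible example
limit = proposed_limit(n)

# 2**23 < 10_000_000 < 2**24, so n.bit_length() is 24 and the limit is n / 48.
print(f"bit_length={n.bit_length()}, limit={limit:.1f}")

# Every lookup size exercised in the example (1 to 10 targets) is far below the limit.
print(all(k < limit for k in range(1, 11)))
```

Under this formula, all of the lookups timed in the example would satisfy the condition, including the ones with 5 or more targets that are slow in the printed result.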
import random
import time

import numpy as np
import pandas as pd

if __name__ == "__main__":
    # Create a large pandas dataframe with non-unique indexes and some NaN values
    table_size = 10_000_000
    num_index = 1_000_000
    data = [1] * table_size
    # Introduce NaNs into the data
    for _ in range(table_size // 10):  # Introduce NaNs in 10% of the data
        data[random.randint(0, table_size - 1)] = np.nan
    df = pd.DataFrame(data)
    index = random.choices(range(num_index), k=table_size)
    df.index = index
    df = df.sort_index()

    # Pre-query the index to force optimizations.
    df.loc[[5, 6, 7, 456, 65743]]
    df.loc[[1000]]

    # Testing 'df.loc' with all at once using a list of indexes, on masked data.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        df.loc[indexes]
        measure = time.monotonic() - start
        print(f"With all at once (masked data): num_indexes={i+1} => {measure:.5f}s")

    print("---")

    # Testing 'df.loc' one at a time using a list of indexes, on masked data.
    for i in range(10):
        indexes = random.sample(list(df.index), k=i+1)
        start = time.monotonic()
        pd.concat([df.loc[[idx]] for idx in indexes])
        measure = time.monotonic() - start
        print(f"With one at a time (masked data): num_indexes={i+1} => {measure:.5f}s")
Printed result:
With all at once (masked data): num_indexes=1 => 0.00045s
With all at once (masked data): num_indexes=2 => 0.00048s
With all at once (masked data): num_indexes=3 => 0.00052s
With all at once (masked data): num_indexes=4 => 0.00050s
With all at once (masked data): num_indexes=5 => 0.64931s
With all at once (masked data): num_indexes=6 => 0.65066s
With all at once (masked data): num_indexes=7 => 0.65181s
With all at once (masked data): num_indexes=8 => 0.80003s
With all at once (masked data): num_indexes=9 => 0.65251s
With all at once (masked data): num_indexes=10 => 0.66629s
---
With one at a time (masked data): num_indexes=1 => 0.00081s
With one at a time (masked data): num_indexes=2 => 0.00134s
With one at a time (masked data): num_indexes=3 => 0.00114s
With one at a time (masked data): num_indexes=4 => 0.00173s
With one at a time (masked data): num_indexes=5 => 0.00132s
With one at a time (masked data): num_indexes=6 => 0.00191s
With one at a time (masked data): num_indexes=7 => 0.20084s
With one at a time (masked data): num_indexes=8 => 0.00180s
With one at a time (masked data): num_indexes=9 => 0.00201s
With one at a time (masked data): num_indexes=10 => 0.00169s
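The gap between the two groups of timings is what one would expect if small lookups can be served by binary search on the sorted index while larger ones fall back to a pass over all n entries. The following is only a sketch of that idea using NumPy, not the pandas implementation: lookup_sorted_non_unique is a hypothetical name, and unlike get_indexer_non_unique it simply skips missing targets instead of reporting them.

```python
import numpy as np


def lookup_sorted_non_unique(sorted_values: np.ndarray, targets) -> np.ndarray:
    """Return the positions of every occurrence of each target.

    Assumes sorted_values is sorted ascending and may contain duplicates.
    Each target costs two binary searches, O(log n), instead of an O(n) scan.
    Hypothetical helper for illustration; targets that are absent are skipped.
    """
    pieces = []
    for t in targets:
        lo = np.searchsorted(sorted_values, t, side="left")   # first occurrence
        hi = np.searchsorted(sorted_values, t, side="right")  # one past the last
        pieces.append(np.arange(lo, hi))
    if not pieces:
        return np.empty(0, dtype=np.intp)
    return np.concatenate(pieces)


values = np.array([0, 0, 1, 3, 3, 3, 7])
positions = lookup_sorted_non_unique(values, [3, 0, 5])
print(positions)  # positions of the 3s, then the 0s; 5 is absent
```

With m targets this costs roughly m * log2(n) comparisons, which is why a limit that scales with n / log2(n), rather than a small constant, looks like a reasonable switch-over point.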
Installed Versions
commit : a671b5a
python : 3.9.18.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-88-generic
Version : #98 SMP Mon Oct 9 16:43:45 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.1.4
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.18.1
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.4
qtpy : None
pyqt5 : None
Prior Performance
No response