Skip to content

PERF: hash_pandas_object performance regression between 2.1.0 and 2.1.1 #55245

Closed
@snazzer

Description

@snazzer

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

Given the following example:

from hashlib import sha256
import numpy as np
import pandas as pd

import time

size = int(1e4)
while size <= int(3e5):
    df = pd.DataFrame(np.random.normal(size=(size, 50)))
    df_t = df.T
    start = time.process_time()
    hashed_rows = pd.util.hash_pandas_object(df_t.reset_index()).values
    t = (size, time.process_time() - start)
    print(f'{t[0}, {t[1]}')
    size = size + 10000

The runtimes between version 2.1.0 and 2.1.1 are drastically different, with 2.1.1 showing polynomial/exponential runtime.

image (8)
image (7)

Installed Versions

For 2.1.0

INSTALLED VERSIONS
------------------
commit              : ba1cccd19da778f0c3a7d6a885685da16a072870
python              : 3.9.18.final.0
python-bits         : 64
OS                  : Linux
OS-release          : 5.15.49-linuxkit-pr
Version             : #1 SMP PREEMPT Thu May 25 07:27:39 UTC 2023
machine             : aarch64
processor           : 
byteorder           : little
LC_ALL              : None
LANG                : C.UTF-8
LOCALE              : en_US.UTF-8

pandas              : 2.1.0
numpy               : 1.26.0
pytz                : 2023.3.post1
dateutil            : 2.8.2
setuptools          : 67.8.0
pip                 : 23.2.1
Cython              : 3.0.2
pytest              : None
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : None
html5lib            : None
pymysql             : None
psycopg2            : None
jinja2              : 3.1.2
IPython             : None
pandas_datareader   : None
bs4                 : None
bottleneck          : None
dataframe-api-compat: None
fastparquet         : None
fsspec              : 2023.9.1
gcsfs               : None
matplotlib          : 3.8.0
numba               : None
numexpr             : None
odfpy               : None
openpyxl            : None
pandas_gbq          : None
pyarrow             : 13.0.0
pyreadstat          : None
pyxlsb              : None
s3fs                : None
scipy               : 1.11.2
sqlalchemy          : None
tables              : None
tabulate            : None
xarray              : None
xlrd                : None
zstandard           : None
tzdata              : 2023.3
qtpy                : None
pyqt5               : None

For 2.1.1

INSTALLED VERSIONS
------------------
commit              : e86ed377639948c64c429059127bcf5b359ab6be
python              : 3.9.18.final.0
python-bits         : 64
OS                  : Linux
OS-release          : 5.15.49-linuxkit-pr
Version             : #1 SMP PREEMPT Thu May 25 07:27:39 UTC 2023
machine             : aarch64
processor           : 
byteorder           : little
LC_ALL              : None
LANG                : C.UTF-8
LOCALE              : en_US.UTF-8

pandas              : 2.1.1
numpy               : 1.26.0
pytz                : 2023.3.post1
dateutil            : 2.8.2
setuptools          : 67.8.0
pip                 : 23.2.1
Cython              : 3.0.2
pytest              : None
hypothesis          : None
sphinx              : None
blosc               : None
feather             : None
xlsxwriter          : None
lxml.etree          : None
html5lib            : None
pymysql             : None
psycopg2            : None
jinja2              : 3.1.2
IPython             : None
pandas_datareader   : None
bs4                 : None
bottleneck          : None
dataframe-api-compat: None
fastparquet         : None
fsspec              : 2023.9.1
gcsfs               : None
matplotlib          : 3.8.0
numba               : None
numexpr             : None
odfpy               : None
openpyxl            : None
pandas_gbq          : None
pyarrow             : 13.0.0
pyreadstat          : None
pyxlsb              : None
s3fs                : None
scipy               : 1.11.2
sqlalchemy          : None
tables              : None
tabulate            : None
xarray              : None
xlrd                : None
zstandard           : None
tzdata              : 2023.3
qtpy                : None
pyqt5               : None

Prior Performance

Same example as above. 2.1.0 is fine. 2.1.1 is not.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Copy / view semanticsPerformanceMemory or execution speed performanceRegressionFunctionality that used to work in a prior pandas versionhashinghash_pandas_object

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions