
PERF: Melt 2x slower when future.infer_string option enabled #59657

Closed
@maver1ck


Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# This configuration option makes this code slow
pd.options.future.infer_string = True

# Define dimensions
n_rows = 10000
n_cols = 10000

# Generate sequential string IDs for the rows
ids = [f"string_id_{i}" for i in range(1, n_rows + 1)]

# Generate a dense float matrix in which about 10% of the values are non-NaN
data = np.random.choice([np.nan, 1], size=(n_rows, n_cols), p=[0.9, 0.1])

# Create a DataFrame from the matrix and add the 'Id' column
df = pd.DataFrame(data, columns=[f"column_name_{i}" for i in range(1, n_cols + 1)])
df.insert(0, 'Id', ids)

# Melt the DataFrame
df_melted = df.melt(id_vars=['Id'], var_name='Column', value_name='Value')

# Display the first few rows of the melted DataFrame
df_melted.head()

Installed Versions

INSTALLED VERSIONS
------------------
commit                : d9cdd2ee5a58015ef6f4d15c7226110c9aab8140
python                : 3.12.5.final.0
python-bits           : 64
OS                    : Darwin
OS-release            : 23.6.0
Version               : Darwin Kernel Version 23.6.0: Mon Jul 29 21:14:30 PDT 2024; root:xnu-10063.141.2~1/RELEASE_ARM64_T6000
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : pl_PL.UTF-8
LOCALE                : pl_PL.UTF-8

pandas                : 2.2.2
numpy                 : 2.1.0
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : 73.0.1
pip                   : 24.1.2
Cython                : None
pytest                : 8.3.2
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.4
IPython               : 8.26.0
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : 2024.6.1
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : 17.0.0
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : 2024.6.1
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

Prior Performance

With pd.options.future.infer_string = False, this code runs in:
5.23 s ± 1.35 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Memory consumption is around 14 GB.

Setting pd.options.future.infer_string = True makes it about 2 times slower:
10.6 s ± 40.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Peak memory consumption is also higher, at around 25 GB.

Labels

  • Performance: Memory or execution speed performance

  • Reshaping: Concat, Merge/Join, Stack/Unstack, Explode

  • Strings: String extension data type and string data
