Skip to content

PERF: Selecting columns from MultiIndex no longer consolidates (pandas 2.0 regression?) #53119

Closed
@muttermachine

Description

@muttermachine

Comparing vanilla installs of Pandas 1.5.3 vs 2.0.1 (and also vs 2.0.0) when selecting columns from a DataFrame with columns having a MultiIndex constructed via pd.concat the performance is noticeably slower in 2.0.0, unless manual consolidation is forced.

I was unable to fully follow through the code differences between the Pandas versions, but it appears that in 1.5.3 the first select performs consolidation whereas 2.0.0+ never automatically consolidates and thus causes each subsequent select to be much slower.

As parts of my production code follow a pattern of creating DataFrames via concat and then looping over column subsets this introduces a ~100x slowdown in parts (due to very wide multilevel DFs).

import timeit

import numpy as np
import pandas as pd


def _debug(force_consolidate):
    df = _make_column_multilevel_df_via_concat(force_consolidate)
    level_a_cols = df.columns.unique('A')

    print(f'Running once {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))

    print(f'Running once again {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))

    print(f'Running ten times {force_consolidate=}')
    print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))


def _make_column_multilevel_df_via_concat(force_consolidate):
    a_values = list(range(16))
    b_values = list(range(50))
    idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
    template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
    df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])

    if force_consolidate:
        df._consolidate_inplace()

    return df


def select_each_column(df, cols):
    # noinspection PyStatementEffect
    [df[c] for c in cols]


if __name__ == '__main__':
    pd.show_versions()
    _debug(False)
    print()
    _debug(True)

1.5.3

Running once force_consolidate=False
0.027367000002413988
Running once again force_consolidate=False
0.0026708999648690224
Running ten times force_consolidate=False
0.024182399967685342

Running once force_consolidate=True
0.0028768000192940235
Running once again force_consolidate=True
0.0026675001718103886
Running ten times force_consolidate=True
0.024797999998554587

2.0.1

Running once force_consolidate=False
0.1871038000099361
Running once again force_consolidate=False
0.18837619991973042
Running ten times force_consolidate=False
1.8612669999711215

Running once force_consolidate=True
0.002792700193822384
Running once again force_consolidate=True
0.002532100072130561
Running ten times force_consolidate=True
0.02379040000960231

You can see that force_consolidate=True makes no difference for 1.5.3 but does for 2.0.0. Further, "Running once" vs "Running once again" in 1.5.3 shows the cost of the consolidation happening where it is has not already been forced.

Note, the benchmark above ignores the cost of the consolidate in 2.0.0, but I hope the difference in behaviour between versions is clear.

Reading the release notes, a lot of work has gone into the block manager and I can see that other users might not want the previous behaviour as the cost of consolidation could outweigh the cost of selecting from the DataFrame.

I am thus unsure if the change in behaviour is expected or not. If it is expected, is there a cleaner way to force consolidation other than calling the undocumented consolidate/consolidate_inplace?

I also feel obliged to say thank you for all of the excellent work in this project -- it is greatly appreciated!

Installed Versions 1.5.3
INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.10.9.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Canada.1252

pandas : 1.5.3
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

2.0.0

INSTALLED VERSIONS
------------------
commit           : 478d340667831908b5b4bf09a2787a11a14560c9
python           : 3.10.9.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Canada.1252

pandas           : 2.0.0
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None

2.0.1

INSTALLED VERSIONS
------------------
commit           : 37ea63d540fd27274cad6585082c91b1283f963d
python           : 3.10.9.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_Canada.1252

pandas           : 2.0.1
numpy            : 1.24.3
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    Closing CandidateMay be closeable, needs more eyeballsIndexingRelated to indexing on series/frames, not to indexes themselvesPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions