Description
Comparing vanilla installs of Pandas 1.5.3 vs 2.0.1 (and also vs 2.0.0) when selecting columns from a DataFrame with columns having a MultiIndex constructed via pd.concat the performance is noticeably slower in 2.0.0, unless manual consolidation is forced.
I was unable to fully follow through the code differences between the Pandas versions, but it appears that in 1.5.3 the first select performs consolidation whereas 2.0.0+ never automatically consolidates and thus causes each subsequent select to be much slower.
As parts of my production code follow a pattern of creating DataFrames via concat and then looping over column subsets this introduces a ~100x slowdown in parts (due to very wide multilevel DFs).
import timeit
import numpy as np
import pandas as pd
def _debug(force_consolidate):
df = _make_column_multilevel_df_via_concat(force_consolidate)
level_a_cols = df.columns.unique('A')
print(f'Running once {force_consolidate=}')
print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
print(f'Running once again {force_consolidate=}')
print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=1))
print(f'Running ten times {force_consolidate=}')
print(timeit.timeit(lambda: select_each_column(df, level_a_cols), number=10))
def _make_column_multilevel_df_via_concat(force_consolidate):
a_values = list(range(16))
b_values = list(range(50))
idx = pd.bdate_range('1991-01-01', '2023-12-31', name='date')
template_frame = pd.DataFrame(np.zeros((len(idx), len(b_values))), index=idx, columns=pd.Index(b_values, name='B'))
df = pd.concat({a: template_frame for a in a_values}, axis=1, names=['A'])
if force_consolidate:
df._consolidate_inplace()
return df
def select_each_column(df, cols):
# noinspection PyStatementEffect
[df[c] for c in cols]
if __name__ == '__main__':
pd.show_versions()
_debug(False)
print()
_debug(True)
1.5.3
Running once force_consolidate=False
0.027367000002413988
Running once again force_consolidate=False
0.0026708999648690224
Running ten times force_consolidate=False
0.024182399967685342
Running once force_consolidate=True
0.0028768000192940235
Running once again force_consolidate=True
0.0026675001718103886
Running ten times force_consolidate=True
0.024797999998554587
2.0.1
Running once force_consolidate=False
0.1871038000099361
Running once again force_consolidate=False
0.18837619991973042
Running ten times force_consolidate=False
1.8612669999711215
Running once force_consolidate=True
0.002792700193822384
Running once again force_consolidate=True
0.002532100072130561
Running ten times force_consolidate=True
0.02379040000960231
You can see that force_consolidate=True makes no difference for 1.5.3 but does for 2.0.0. Further, "Running once" vs "Running once again" in 1.5.3 shows the cost of the consolidation happening where it is has not already been forced.
Note, the benchmark above ignores the cost of the consolidate in 2.0.0, but I hope the difference in behaviour between versions is clear.
Reading the release notes, a lot of work has gone into the block manager and I can see that other users might not want the previous behaviour as the cost of consolidation could outweigh the cost of selecting from the DataFrame.
I am thus unsure if the change in behaviour is expected or not. If it is expected, is there a cleaner way to force consolidation other than calling the undocumented consolidate/consolidate_inplace?
I also feel obliged to say thank you for all of the excellent work in this project -- it is greatly appreciated!
Installed Versions
1.5.3INSTALLED VERSIONS ------------------ commit : 2e218d10984e9919f0296931d92ea851c6a6faf5 python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252pandas : 1.5.3
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
2.0.0
INSTALLED VERSIONS ------------------ commit : 478d340667831908b5b4bf09a2787a11a14560c9 python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252 pandas : 2.0.0 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None
2.0.1
INSTALLED VERSIONS ------------------ commit : 37ea63d540fd27274cad6585082c91b1283f963d python : 3.10.9.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19045 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : English_Canada.1252 pandas : 2.0.1 numpy : 1.24.3 pytz : 2023.3 dateutil : 2.8.2 setuptools : 65.5.0 pip : 22.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : None gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None zstandard : None tzdata : 2023.3 qtpy : None pyqt5 : None