Problem description
When concatenating columns of the same dtype, even with the copy=False option, the columns are consolidated together, which involves a copy and a costly vstack. The performance is actually worse for copy=False than for the default copy=True, which is misleading.
There are use cases where consolidated data is not required for an application, so this unneeded performance penalty is undesired.
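To illustrate why consolidation cannot be copy-free, here is a minimal NumPy-only sketch. It does not call any pandas internals; the two arrays are just stand-ins for two same-dtype columns, and the names col_a, col_b and block are made up for this example. It shows that stacking separate arrays into one 2-D block allocates new memory rather than reusing the inputs.

```python
import numpy as np

# Two independent 1-D arrays standing in for two same-dtype columns.
col_a = np.arange(5, dtype=np.int64)
col_b = np.arange(5, 10, dtype=np.int64)

# Stacking builds one new 2-D array; the inputs are copied into it.
block = np.vstack([col_a, col_b])

# Neither input shares memory with the stacked result.
shares_a = np.shares_memory(block, col_a)
shares_b = np.shares_memory(block, col_b)
print(block.shape, shares_a, shares_b)  # (2, 5) False False
```

So any code path that ends in a stack of this kind pays for a full copy of the data, regardless of what the caller asked for.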
Sample Program
import time
import pandas as pd
template_series = pd.Series(list(range(10000)))
series_ls = []
for i in range(1000):
    series_ls.append(template_series.copy())
start_time = time.time()
df_no_copy = pd.concat(series_ls, copy=False)
print("No copy elapsed", time.time() - start_time)
start_time = time.time()
df_copy = pd.concat(series_ls, copy=True) # The default setting
print("Copy elapsed", time.time() - start_time)
Execution Time
No copy elapsed 0.07740044593811035
Copy elapsed 0.05434751510620117
Execution time is in seconds.
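A single time.time() measurement can be noisy, so the comparison may be easier to reproduce with repeated runs. The following is a sketch only: the helper time_concat and the smaller input sizes are my own choices for illustration, not part of the original report. It assumes a pandas version where pd.concat still accepts the copy keyword; newer versions may warn about it or treat it differently, which is why warnings are suppressed here.

```python
import timeit
import warnings

import pandas as pd

# Smaller inputs than the original sample so the sketch runs quickly.
template_series = pd.Series(list(range(1000)))
series_ls = [template_series.copy() for _ in range(100)]


def time_concat(copy, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` runs."""
    with warnings.catch_warnings():
        # Some pandas versions may warn about the `copy` keyword.
        warnings.simplefilter("ignore")
        timings = timeit.repeat(
            lambda: pd.concat(series_ls, copy=copy),
            number=1,
            repeat=repeat,
        )
    return min(timings)


no_copy_s = time_concat(copy=False)
copy_s = time_concat(copy=True)
print("No copy best of 5:", no_copy_s)
print("Copy best of 5:", copy_s)
```

Taking the minimum over several runs reduces the influence of one-off effects such as cache warm-up. The actual numbers will depend on the pandas version and the machine.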
Problem Trace
The consolidation occurs as a result of:
if not self.copy:
    new_data._consolidate_inplace()
located near "pandas/core/reshape/concat.py:499".