Description
Similar to an observation made on reddit, I noticed that there is a huge performance difference between the pandas default pd.options.mode.chained_assignment = 'warn' and setting it to None.
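For reference, a minimal sketch of the two settings being compared (pd.options.mode.chained_assignment is documented pandas API; the comments are mine):

```python
import pandas as pd

# Default: pandas checks for chained assignment and emits SettingWithCopyWarning.
pd.options.mode.chained_assignment = "warn"

# Disabled: the chained-assignment check is skipped entirely.
pd.options.mode.chained_assignment = None
```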
Code Sample
```python
import time

import numpy as np
import pandas as pd


def gen_data(N=10000):
    df = pd.DataFrame(index=range(N))
    for c in range(10):
        df[str(c)] = np.random.uniform(size=N)
    df["id"] = np.random.choice(range(500), size=len(df))
    return df


def do_something_on_df(df):
    """Dummy computation that contains inplace mutations."""
    for c in range(df.shape[1]):
        df[str(c)] = np.random.uniform(size=df.shape[0])
    return 42


def run_test(mode="warn"):
    pd.options.mode.chained_assignment = mode
    df = gen_data()
    t1 = time.time()
    for key, group_df in df.groupby("id"):
        do_something_on_df(group_df)
    t2 = time.time()
    print("Runtime: {:10.3f} sec".format(t2 - t1))


if __name__ == "__main__":
    run_test(mode="warn")
    run_test(mode=None)
```
Problem description
The run times vary a lot depending on whether the SettingWithCopyWarning check is enabled or disabled. I tried a few different Pandas/Python versions (in each pair, the first runtime is mode="warn", the second is mode=None):
Debian VM, Python 3.6.2, pandas 0.21.0:

```
Runtime:     46.693 sec
Runtime:      0.731 sec
```

Debian VM, Python 2.7.9, pandas 0.20.0:

```
Runtime:    101.204 sec
Runtime:      0.622 sec
```

Ubuntu (host), Python 2.7.3, pandas 0.21.0:

```
Runtime:     35.363 sec
Runtime:      0.517 sec
```
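For what it's worth, one way to sidestep the penalty without touching the global option is to hand the helper an explicit copy of each group, so the mutations never go through a view. This is my own variation on the code sample above, not part of the benchmark:

```python
# Hypothetical workaround (not benchmarked above): mutating a fresh copy means
# the chained-assignment bookkeeping has no parent frame to check, so the
# expensive path should be skipped regardless of the option value.
for key, group_df in df.groupby("id"):
    do_something_on_df(group_df.copy())
```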
Ideally, there should not be such a big penalty for SettingWithCopyWarning. From profiling results it looks like the reason might be this call to gc.collect.
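For reproducibility, a sketch of how the hot spot can be located with the standard library profiler (assumes the run_test definition from the code sample above is in scope):

```python
import cProfile
import pstats

# Profile the slow configuration; with mode="warn", most of the cumulative
# time should show up under pandas' chained-assignment check machinery.
cProfile.run('run_test(mode="warn")', "chained_assignment.prof")
pstats.Stats("chained_assignment.prof").sort_stats("cumulative").print_stats(10)
```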
Output of pd.show_versions()
```
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
```