Description
# Run on Pandas 0.19.1
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000,5000), columns=['C'+str(c) for c in range(5000)])
%%prun
for col in df:
    df[col]
65011 function calls (65010 primitive calls) in 2.012 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
5000 1.935 0.000 1.955 0.000 base.py:1362(__contains__)
5000 0.025 0.000 2.002 0.000 frame.py:2033(__getitem__)
1 0.011 0.011 2.012 2.012 <string>:2(<module>)
5000 0.010 0.000 0.016 0.000 frame.py:2059(_getitem_column)
5001 0.007 0.000 0.007 0.000 {method 'view' of 'numpy.ndarray' objects}
5001 0.004 0.000 0.012 0.000 base.py:492(values)
5000 0.004 0.000 0.007 0.000 generic.py:1381(_get_item_cache)
5000 0.004 0.000 0.018 0.000 base.py:1275(<lambda>)
5000 0.003 0.000 0.014 0.000 base.py:874(_values)
5000 0.002 0.000 0.002 0.000 {built-in method builtins.isinstance}
5000 0.002 0.000 0.003 0.000 common.py:438(_apply_if_callable)
5000 0.002 0.000 0.002 0.000 {method 'get' of 'dict' objects}
5000 0.001 0.000 0.001 0.000 {built-in method builtins.hash}
5000 0.001 0.000 0.001 0.000 {built-in method builtins.callable}
1 0.000 0.000 2.012 2.012 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 generic.py:833(__iter__)
1 0.000 0.000 0.000 0.000 generic.py:401(_info_axis)
2/1 0.000 0.000 0.000 0.000 {built-in method builtins.iter}
1 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
1 0.000 0.000 0.000 0.000 base.py:1315(__iter__)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 4.39 s
Problem description
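Selecting columns from a wide DataFrame is much slower on 0.19.1 than on 0.18.1: the loop above takes 2.012 seconds instead of 0.032 seconds. The profile points at __contains__ (base.py:1362), which accounts for 1.935 of those seconds. _get_item_cache() may also be involved, but it costs only a few milliseconds in either version. This affects other functionality such as df.info(), which is now also ~100X slower (4.39 s vs. 40 ms).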
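For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch that times the same column loop and df.info() with the standard library only. The helper names (build_frame, access_all_columns, time_call, profile_column_access) are made up for this example and are not part of pandas; absolute timings will of course differ by machine.

```python
import cProfile
import io
import pstats
import time

import numpy as np
import pandas as pd


def build_frame(n_rows=1000, n_cols=5000):
    """Build the same kind of wide float64 frame as in the report."""
    columns = ['C' + str(c) for c in range(n_cols)]
    return pd.DataFrame(np.random.randn(n_rows, n_cols), columns=columns)


def access_all_columns(df):
    """Select every column once, as in the profiled loop."""
    for col in df:
        df[col]


def time_call(func, *args, **kwargs):
    """Return the wall-clock seconds taken by a single call of func."""
    start = time.perf_counter()
    func(*args, **kwargs)
    return time.perf_counter() - start


def profile_column_access(df, top=5):
    """Profile the loop and return the report text, sorted by internal time."""
    profiler = cProfile.Profile()
    profiler.enable()
    access_all_columns(df)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats('tottime').print_stats(top)
    return out.getvalue()


if __name__ == '__main__':
    df = build_frame()
    print('pandas', pd.__version__)
    print('column loop: %.3f s' % time_call(access_all_columns, df))
    # Send the info() text to a buffer so only the timing is printed.
    print('df.info():   %.3f s' % time_call(df.info, buf=io.StringIO()))
    print(profile_column_access(df))
```

Running the script under each pandas version and comparing the printed timings should show whether the slowdown is present in a given environment; the top entries of the profile indicate where the time goes.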
Expected Output
# Run on Pandas 0.18.1
In [6]: %%prun
...: for col in df:
...:     df[col]
...:
45011 function calls (45010 primitive calls) in 0.032 seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
5000 0.012 0.000 0.028 0.000 frame.py:1973(__getitem__)
5000 0.005 0.000 0.008 0.000 frame.py:1999(_getitem_column)
5000 0.004 0.000 0.005 0.000 base.py:1233(__contains__)
1 0.004 0.004 0.032 0.032 <string>:2(<module>)
5000 0.002 0.000 0.003 0.000 generic.py:1345(_get_item_cache)
5000 0.001 0.000 0.002 0.000 common.py:1846(_apply_if_callable)
5000 0.001 0.000 0.001 0.000 {built-in method builtins.isinstance}
5000 0.001 0.000 0.001 0.000 {method 'get' of 'dict' objects}
5000 0.001 0.000 0.001 0.000 {built-in method builtins.hash}
5000 0.000 0.000 0.000 0.000 {built-in method builtins.callable}
1 0.000 0.000 0.032 0.032 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 generic.py:808(__iter__)
1 0.000 0.000 0.000 0.000 base.py:440(values)
2/1 0.000 0.000 0.000 0.000 {built-in method builtins.iter}
1 0.000 0.000 0.000 0.000 {method 'view' of 'numpy.ndarray' objects}
1 0.000 0.000 0.000 0.000 generic.py:381(_info_axis)
1 0.000 0.000 0.000 0.000 base.py:1186(__iter__)
1 0.000 0.000 0.000 0.000 {built-in method builtins.getattr}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 40 ms
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None