DataFrame __getitem__ ~100X slower on Pandas 0.19.1 vs 0.18.1, possibly due to getitem caching? #14930

Closed
@dragoljub

# Run on Pandas 0.19.1

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000,5000), columns=['C'+str(c) for c in range(5000)])

%%prun
for col in df:
    df[col]

         65011 function calls (65010 primitive calls) in 2.012 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    1.935    0.000    1.955    0.000 base.py:1362(__contains__)
     5000    0.025    0.000    2.002    0.000 frame.py:2033(__getitem__)
        1    0.011    0.011    2.012    2.012 <string>:2(<module>)
     5000    0.010    0.000    0.016    0.000 frame.py:2059(_getitem_column)
     5001    0.007    0.000    0.007    0.000 {method 'view' of 'numpy.ndarray' objects}
     5001    0.004    0.000    0.012    0.000 base.py:492(values)
     5000    0.004    0.000    0.007    0.000 generic.py:1381(_get_item_cache)
     5000    0.004    0.000    0.018    0.000 base.py:1275(<lambda>)
     5000    0.003    0.000    0.014    0.000 base.py:874(_values)
     5000    0.002    0.000    0.002    0.000 {built-in method builtins.isinstance}
     5000    0.002    0.000    0.003    0.000 common.py:438(_apply_if_callable)
     5000    0.002    0.000    0.002    0.000 {method 'get' of 'dict' objects}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.callable}
        1    0.000    0.000    2.012    2.012 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 generic.py:833(__iter__)
        1    0.000    0.000    0.000    0.000 generic.py:401(_info_axis)
      2/1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 base.py:1315(__iter__)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 4.39 s

Problem description

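It appears that __contains__ is responsible rather than _get_item_cache(): in the 0.19.1 profile, __contains__ (base.py:1362) accounts for 1.935 s of the 2.012 s total, while _get_item_cache takes only 0.007 s cumulative. This also affects other functionality such as df.info(), which is now ~100X slower as well (4.39 s vs 40 ms).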
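To check this outside IPython, the membership test can be timed separately from the full column lookup. The following is a minimal sketch, not a definitive benchmark: the helper name time_column_access is hypothetical, and it assumes that the __contains__ call in the profile corresponds to a membership test on df.columns, which is an inference from the profile rather than something confirmed here.

```python
import time

import numpy as np
import pandas as pd


def time_column_access(n_rows=1000, n_cols=5000):
    """Time membership checks and column lookups separately.

    Hypothetical helper for illustration; returns elapsed seconds per loop.
    """
    columns = ['C' + str(c) for c in range(n_cols)]
    df = pd.DataFrame(np.random.randn(n_rows, n_cols), columns=columns)

    # Loop 1: only the membership test on the column index
    # (assumed to be what __contains__ in the profile corresponds to).
    start = time.perf_counter()
    for col in columns:
        col in df.columns
    contains_seconds = time.perf_counter() - start

    # Loop 2: the full column lookup, as in the report.
    start = time.perf_counter()
    for col in df:
        df[col]
    getitem_seconds = time.perf_counter() - start

    return {
        'n_cols': n_cols,
        'contains': contains_seconds,
        'getitem': getitem_seconds,
    }


if __name__ == '__main__':
    timings = time_column_access()
    print('pandas', pd.__version__)
    print('membership loop: {:.3f} s'.format(timings['contains']))
    print('getitem loop:    {:.3f} s'.format(timings['getitem']))
```

If the membership loop takes nearly as long as the getitem loop on 0.19.1 but not on 0.18.1, that would support the reading that the regression sits in the index membership check.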

Expected Output

# Run on Pandas 0.18.1

%%prun
for col in df:
    df[col]

         45011 function calls (45010 primitive calls) in 0.032 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     5000    0.012    0.000    0.028    0.000 frame.py:1973(__getitem__)
     5000    0.005    0.000    0.008    0.000 frame.py:1999(_getitem_column)
     5000    0.004    0.000    0.005    0.000 base.py:1233(__contains__)
        1    0.004    0.004    0.032    0.032 <string>:2(<module>)
     5000    0.002    0.000    0.003    0.000 generic.py:1345(_get_item_cache)
     5000    0.001    0.000    0.002    0.000 common.py:1846(_apply_if_callable)
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     5000    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
     5000    0.001    0.000    0.001    0.000 {built-in method builtins.hash}
     5000    0.000    0.000    0.000    0.000 {built-in method builtins.callable}
        1    0.000    0.000    0.032    0.032 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 generic.py:808(__iter__)
        1    0.000    0.000    0.000    0.000 base.py:440(values)
      2/1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {method 'view' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 generic.py:381(_info_axis)
        1    0.000    0.000    0.000    0.000 base.py:1186(__iter__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.getattr}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

%time df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 5000 entries, C0 to C4999
dtypes: float64(5000)
memory usage: 38.1 MB
Wall time: 40 ms
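
For reference, the slowdown factors implied by the two runs can be worked out directly from the reported numbers. This is a small sketch; the variable and function names are illustrative, and the timings are copied from the profiles above.

```python
# Timings copied from the two runs in this report (seconds).
loop_total = {'0.19.1': 2.012, '0.18.1': 0.032}        # %%prun total for the df[col] loop
contains_tottime = {'0.19.1': 1.935, '0.18.1': 0.004}  # tottime of __contains__
info_wall = {'0.19.1': 4.39, '0.18.1': 0.040}          # %time df.info()


def slowdown(timings):
    """Return how many times slower 0.19.1 is than 0.18.1."""
    return timings['0.19.1'] / timings['0.18.1']


for label, timings in [
    ('df[col] loop', loop_total),
    ('__contains__', contains_tottime),
    ('df.info()', info_wall),
]:
    print('{}: ~{:.0f}x slower'.format(label, slowdown(timings)))
```

The loop as a whole is roughly 63x slower and df.info() roughly 110x slower, while __contains__ alone is roughly 480x slower, which is consistent with it dominating the regression.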

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.10.1
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.7.1
IPython: 4.2.0
sphinx: 1.3.6
patsy: 0.4.0
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.2.5
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: None
lxml: 3.6.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Labels

Indexing: Related to indexing on series/frames, not to indexes themselves
Performance: Memory or execution speed performance
Regression: Functionality that used to work in a prior pandas version
