Performance pd.HDFStore().keys() slow #17593

Closed
@exrich

Description

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

path = 'test.h5'
# 3000 DataFrames, each 500 x 100 random floats
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]
with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)
# IPython magic: time reopening the store and listing its keys
%timeit store = pd.HDFStore(path).keys()

Problem description

pd.HDFStore().keys() is very slow for a large store containing many DataFrames: the code above takes 10.6 s just to return the list of keys in the store.
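For anyone reproducing this outside IPython, here is a minimal sketch of the same measurement as a plain script. The helper name `time_keys`, the temporary file location, and the much smaller default sizes are assumptions for illustration. Raise `n_frames` and `shape` to 3000 and (500, 100) to match the report.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_keys(n_frames=50, shape=(50, 10)):
    """Write n_frames DataFrames to a temporary store, then time keys()."""
    path = os.path.join(tempfile.mkdtemp(), "test.h5")
    with pd.HDFStore(path, mode="w") as store:
        for i in range(n_frames):
            store.put("test" + str(i), pd.DataFrame(np.random.rand(*shape)))

    # Reopen read-only so only the key listing is timed, and the file is closed afterwards.
    with pd.HDFStore(path, mode="r") as store:
        start = time.perf_counter()
        keys = store.keys()
        elapsed = time.perf_counter() - start
    return keys, elapsed


keys, elapsed = time_keys()
print(f"{len(keys)} keys listed in {elapsed:.4f} s")
```

Unlike the `%timeit` line, this takes a single measurement, so expect some run-to-run noise.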

The issue appears to be related to the node walk in PyTables (tables), which loads every single node to check whether it is a group.

/tables/file.py

def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where."""

    group = self.get_node(where)  # Does the parent exist? (hot line in the profile below)
    self._check_group(group)  # Is it a group?

    return group._f_iter_nodes(classname)

%lprun -f store._handle.iter_nodes store.keys()
Timer unit: 2.56e-07 s
Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1998                                               def iter_nodes(self, where, classname=None):
  1999                                                   """Iterate over children nodes hanging from where.
  2000                                           
  2001                                                   Parameters
  2002                                                   ----------
  2003                                                   where
  2004                                                       This argument works as in :meth:`File.get_node`, referencing the
  2005                                                       node to be acted upon.
  2006                                                   classname
  2007                                                       If the name of a class derived from
  2008                                                       Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                                       that class (or subclasses of it) will be returned.
  2010                                           
  2011                                                   Notes
  2012                                                   -----
  2013                                                   The returned nodes are alphanumerically sorted by their name.
  2014                                                   This is an iterator version of :meth:`File.list_nodes`.
  2015                                           
  2016                                                   """
  2017                                           
  2018      6001       125237     20.9     75.4          group = self.get_node(where)  # Does the parent exist?
  2019      6001        26549      4.4     16.0          self._check_group(group)  # Is it a group?
  2020                                           
  2021      6001        14216      2.4      8.6          return group._f_iter_nodes(classname)

As a result, when the DataFrames are large and many of them are kept in one store, listing the keys can take a very long time (my real-life code takes about 1 min to do this). My version of pandas is old, but I don't think this has been fixed in later versions.
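One possible workaround, sketched here under assumptions rather than measured: open the file with PyTables directly and visit groups only, via `File.walk_groups`, so leaf nodes are never touched. The helper name `list_pandas_keys` is hypothetical. The sketch also assumes that pandas marks each stored object's group with a `pandas_type` attribute, which is a detail of its on-disk layout. Whether this is actually faster on a given pandas/PyTables combination would need to be timed.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tables


def list_pandas_keys(path):
    """Return the paths of groups that carry a pandas_type attribute."""
    with tables.open_file(path, mode="r") as h5:
        return [
            group._v_pathname
            for group in h5.walk_groups("/")  # visits groups only, not leaves
            # Assumption: pandas tags each stored object's group with pandas_type.
            if getattr(group._v_attrs, "pandas_type", None) is not None
        ]


# Small demo store so the sketch can be run as-is.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with pd.HDFStore(demo_path, mode="w") as store:
    for i in range(5):
        store.put("test" + str(i), pd.DataFrame(np.random.rand(4, 3)))

found = list_pandas_keys(demo_path)
print(sorted(found))
```

The root group has no `pandas_type` attribute, so it is filtered out and only the stored objects' paths are returned.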

I'm also not sure whether this should be raised against pandas or PyTables (tables).

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

Labels

IO HDF5: read_hdf, HDFStore
Performance: Memory or execution speed performance