BUG: HDF5 Files cannot be read concurrently #14692

Closed
@dragonator4

A small, complete example of the issue

import pandas as pd
import numpy as np
from multiprocessing import Pool
import warnings

# To avoid natural name warnings
warnings.filterwarnings('ignore')

def init(hdf_store):
    global hdf_buff
    hdf_buff = hdf_store

def reader(name):
    df = hdf_buff[name]
    return (name, df)

def main():
    # Creating the store
    with pd.HDFStore('storage.h5', 'w') as store:
        for i in range(100):
            df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
            store.append(str(i), df, index=False, expectedrows=5)
    # Reading concurrently with one connection
    with pd.HDFStore('storage.h5', 'r') as store:
        with Pool(4, initializer=init, initargs=(store,)) as p:
            ret = pd.concat(dict(p.map(reader, [str(i) for i in range(100)])))

if __name__ == '__main__':
    main()

The above code either fails loudly with the following error:

tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5Dio.c", line 173, in H5Dread
    can't read data
  File "H5Dio.c", line 554, in H5D__read
    can't read data
  File "H5Dchunk.c", line 1856, in H5D__chunk_read
    error looking up chunk address
  File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
    can't query chunk address
  File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
    can't get chunk info
  File "H5B.c", line 340, in H5B_find
    unable to load B-tree node
  File "H5AC.c", line 1262, in H5AC_protect
    H5C_protect() failed.
  File "H5C.c", line 3574, in H5C_protect
    can't load entry
  File "H5C.c", line 7954, in H5C_load_entry
    unable to load entry
  File "H5Bcache.c", line 143, in H5B__load
    wrong B-tree signature

End of HDF5 error back trace

Or with the following error:

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/kartik/miniconda3/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/kartik/miniconda3/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/kartik/Documents/Code/Scripts/Benchmarking and optimization stuff/HDF_Concurrent.py", line 13, in reader
    df = hdf_buff[name]
  File "/home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py", line 461, in __getitem__
    return self.get(key)
  File "/home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py", line 677, in get
    raise KeyError('No object named %s in the file' % key)
KeyError: 'No object named 7 in the file'
"""

But object 7 clearly exists in the store: the write loop creates keys 0 through 99. Any help?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-47-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.3
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Labels: Duplicate Report, IO HDF5