HDF5 (PyTables) fsync support #5364

Closed
@benalexau

Description

I'm using pandas HDF5 files on a shared network file system, with a distributed lock manager (based on Kazoo) to ensure there is only a single writer at a time.

Under significant lock contention, a rapid sequence of flush, close, lock release, and lock acquisition in a different process corrupts the HDF5 file. This appears to be caused by the operating system cache: the previous writer's data has apparently not yet reached the shared storage when the next process opens the file.

The following workaround resolves it:

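import os
import pandas as pd

store = pd.HDFStore(filename, complevel=9, complib='blosc')
f_fd = store._handle.fileno()  # file descriptor from the private PyTables handle
store.flush()                  # flush library buffers to the OS
os.fsync(f_fd)                 # ask the OS to commit its cache to storage
store.close()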

However, this relies on the private _handle attribute, which may be removed in the future.

Would you please either expose _handle for supported end-user use, expose the fileno, or modify flush() to accept an fsync=True parameter to support this use case? Thanks.
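
Until something like this is supported, the workaround above could be wrapped in a small helper so that the flush, fsync, close order is applied consistently before the lock is released. This is only a sketch: `flush_and_fsync` and `synced_store` are hypothetical names, not part of pandas, and the code assumes the store exposes `flush()`, `close()` and the private `_handle.fileno()` used in the snippet above.

```python
import os
from contextlib import contextmanager


def flush_and_fsync(store):
    """Flush library buffers, then ask the OS to commit the file to storage.

    Assumption: `store` exposes flush() and a private `_handle` object with
    fileno(), as pd.HDFStore does in the workaround above.
    """
    fd = store._handle.fileno()  # fetch the descriptor while the file is open
    store.flush()                # library buffers -> OS cache
    os.fsync(fd)                 # OS cache -> storage


@contextmanager
def synced_store(store):
    """Yield `store`; on exit flush, fsync and close it, even on error."""
    try:
        yield store
    finally:
        try:
            flush_and_fsync(store)
        finally:
            store.close()        # always close, even if the sync step fails
```

Usage would look like `with synced_store(pd.HDFStore(filename, complevel=9, complib='blosc')) as store: ...`, with the distributed lock released only after the `with` block exits. Later pandas releases document an `fsync` argument on `HDFStore.flush`; where that is available, `store.flush(fsync=True)` avoids touching `_handle` at all.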

Labels: Enhancement, IO Data, IO HDF5
