I'm using pandas HDF5 files on a shared network file system, with a distributed lock manager (based on Kazoo) ensuring a single writer at a time.
Under significant lock contention, a rapid sequence of flush, close, lock release, and lock acquisition in a different process corrupts the HDF5 file. This appears to be due to the operating system cache: the writes have not reached the network file system by the time the next writer acquires the lock.
The corruption is resolved by fsyncing the file descriptor before closing:
import os
import pandas as pd

store = pd.HDFStore(filename, complevel=9, complib='blosc')
f_fd = store._handle.fileno()   # reach into the private PyTables handle for the OS file descriptor
store.flush()                   # flush PyTables/HDF5 buffers to the OS
os.fsync(f_fd)                  # force the OS cache out to the network file system
store.close()
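For context, here is a minimal sketch of the full single-writer path under the Kazoo lock with the workaround in place. The ZooKeeper host, lock path, key name, and the df variable are illustrative assumptions, not part of the report.

import os
import pandas as pd
from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181')               # assumed ZooKeeper ensemble address
zk.start()
lock = zk.Lock('/hdf5-writer-lock', 'writer-1')  # hypothetical lock path and identifier

with lock:                                       # only one process writes at a time
    store = pd.HDFStore(filename, complevel=9, complib='blosc')
    store.put('df', df)                          # df: whatever DataFrame is being written
    f_fd = store._handle.fileno()
    store.flush()                                # flush PyTables/HDF5 buffers
    os.fsync(f_fd)                               # push the OS cache to the network file system
    store.close()
# lock released here; the next writer sees a fully synced file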
However, this requires using _handle, which may be removed in the future.
Would you please either expose _handle for supported end-user use, expose the fileno, or perhaps modify flush() to accept an fsync=True parameter to support this use case. Thanks.
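With an fsync flag on flush(), the workaround above would reduce to something like the following (a sketch of the requested behaviour, not an existing pandas API):

store = pd.HDFStore(filename, complevel=9, complib='blosc')
store.put('df', df)
store.flush(fsync=True)   # proposed: flush PyTables buffers, then os.fsync the file descriptor
store.close()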