Commit 9bac34d

ENH: HDFStore.flush() to optionally perform fsync (GH5364)
1 parent 1695320 commit 9bac34d

4 files changed: +35 -3 lines changed

doc/source/io.rst

Lines changed: 6 additions & 0 deletions
@@ -2745,6 +2745,12 @@ Notes & Caveats
   need to serialize these operations in a single thread in a single
   process. You will corrupt your data otherwise. See the issue
   (:`2397`) for more information.
+- If serializing all write operations via a single thread in a single
+  process is not an option, another alternative is to use an external
+  distributed lock manager to ensure there is only a single writer at a
+  time and all readers close the file during writes and re-open it after any
+  writes. In this case you should use ``store.flush(fsync=True)`` prior to
+  releasing any write locks. See the issue (:`5364`) for more information.
 - ``PyTables`` only supports fixed-width string columns in
   ``tables``. The sizes of a string based indexing column
   (e.g. *columns* or *minor_axis*) are determined as the maximum size
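
The locking pattern described in the added note can be sketched as follows. This is only an illustration under stated assumptions: `write_and_sync` is a hypothetical helper that is not part of this commit, and `lock` stands for whatever the external lock manager hands out (any context manager works; a `threading.Lock` is enough to try the function locally). The only pandas call assumed is `store.flush(fsync=True)`, which this commit adds.

```python
import threading


def write_and_sync(store, lock, key, value):
    """Write one object under `lock` and sync it before the lock is released.

    `store` is expected to behave like an open HDFStore: item assignment
    plus a ``flush(fsync=...)`` method. `lock` is any context manager that
    provides mutual exclusion between writers (hypothetical stand-in for a
    distributed lock).
    """
    with lock:
        store[key] = value
        # Flush and fsync while the lock is still held, so the next process
        # that opens the file does not see a partially written state.
        store.flush(fsync=True)


# Local stand-in for a distributed lock; a real deployment would obtain the
# lock from its lock manager instead.
local_lock = threading.Lock()
```

With a real store the call would look like `write_and_sync(store, lock, 'a', df)`. Readers still have to close the file during writes and re-open it afterwards, as the note says; the helper covers only the writer side.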

doc/source/release.rst

Lines changed: 2 additions & 0 deletions
@@ -275,6 +275,8 @@ API Changes
     - store `datetime.date` objects as ordinals rather then timetuples to avoid
       timezone issues (:issue:`2852`), thanks @tavistmorph and @numpand
     - ``numexpr`` 2.2.2 fixes incompatiblity in PyTables 2.4 (:issue:`4908`)
+    - ``flush`` now accepts an ``fsync`` parameter, which defaults to ``False``
+      (:issue:`5364`)
   - ``JSON``

     - added ``date_unit`` parameter to specify resolution of timestamps.

pandas/io/pytables.py

Lines changed: 21 additions & 3 deletions
@@ -10,6 +10,7 @@
 import copy
 import itertools
 import warnings
+import os

 import numpy as np
 from pandas import (Series, TimeSeries, DataFrame, Panel, Panel4D, Index,
@@ -525,12 +526,30 @@ def is_open(self):
             return False
         return bool(self._handle.isopen)

-    def flush(self):
+    def flush(self, fsync=False):
         """
-        Force all buffered modifications to be written to disk
+        Force all buffered modifications to be written to disk.
+
+        By default this method requests PyTables to flush, and PyTables in turn
+        requests the HDF5 library to flush any changes to the operating system.
+        There is no guarantee the operating system will actually commit writes
+        to disk.
+
+        To request the operating system to write the file to disk, pass
+        ``fsync=True``. The method will then block until the operating system
+        reports completion, although be aware there might be other caching
+        layers (eg disk controllers, disks themselves etc) which further delay
+        durability.
+
+        Parameters
+        ----------
+        fsync : boolean, invoke fsync for the file handle, default False
+
         """
         if self._handle is not None:
             self._handle.flush()
+            if fsync:
+                os.fsync(self._handle.fileno())

     def get(self, key):
         """
@@ -4072,5 +4091,4 @@ def timeit(key, df, fn=None, remove=True, **kwargs):
     store.close()

     if remove:
-        import os
         os.remove(fn)
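
The two-step behaviour the new docstring describes (flush to the operating system, then ask the operating system to commit) can be seen on an ordinary file using only the standard library. This is a minimal sketch rather than the pandas code path: `durable_write` is a hypothetical name, and a plain binary file stands in for the HDF5 file.

```python
import os


def durable_write(path, data):
    """Write `data` (bytes) to `path`, then flush and fsync before closing.

    Returns the size of the file on disk after the write.
    """
    with open(path, "wb") as f:
        f.write(data)
        # Step 1: push Python's userspace buffer down to the operating system.
        f.flush()
        # Step 2: ask the operating system to commit its cache to the device.
        # This blocks until the OS reports completion; caches below the OS
        # (disk controllers, the disks themselves) may still delay durability.
        os.fsync(f.fileno())
    return os.path.getsize(path)
```

Skipping step 2 corresponds to the default `fsync=False` behaviour: the data has left the process but may still sit in the operating system's cache.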

pandas/io/tests/test_pytables.py

Lines changed: 6 additions & 0 deletions
@@ -466,6 +466,12 @@ def test_flush(self):
             store['a'] = tm.makeTimeSeries()
             store.flush()

+    def test_flush_fsync(self):
+
+        with ensure_clean(self.path) as store:
+            store['a'] = tm.makeTimeSeries()
+            store.flush(fsync=True)
+
     def test_get(self):

         with ensure_clean(self.path) as store:
