Old HDF format
>>> %timeit Session('demo.h5')
2.09 s
Faster/current HDF format
>>> %timeit Session('demo_fast.h5')
1.25 s
Pure Pandas
This gives an approximate lower bound on what we could achieve via #724. Pandas may do a bit more work than strictly necessary, but I doubt we would get below 500 ms.
>>> import pandas as pd
>>> sto = pd.HDFStore('demo_fast.h5')
>>> %timeit {k: sto[k] for k in sto.keys()}
781 ms
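For anyone wanting to reproduce these measurements outside IPython (where `%timeit` is unavailable), here is a minimal sketch using the standard `timeit` module. The helper name `best_of` and the stand-in workload `load_all` are hypothetical. In practice the callable would be something like `lambda: Session('demo_fast.h5')`. Note that this reports the best run, which may differ slightly from what `%timeit` prints.

```python
import timeit


def best_of(func, repeat=5, number=1):
    """Return the best per-call wall-clock time (in seconds) over `repeat` runs."""
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number


def load_all():
    # Hypothetical stand-in for "load every array of a session".
    # Replace with the real loader when benchmarking an actual file.
    return {f"arr{i}": list(range(1000)) for i in range(10)}


elapsed = best_of(load_all)
print(f"{elapsed * 1e3:.3f} ms per load")
```

Taking the minimum rather than the mean reduces noise from other processes, which matters when comparing formats whose load times differ by only a few hundred milliseconds.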
My working proof of concept for a format based on Feather files & PyArrow
This is about 8x as fast as the current best format and at least 3x as fast as what I think we could currently achieve using raw PyTables (*).
>>> %timeit Session('demo4.laf')
152 ms
>>> Session('demo4.laf').equals(Session('demo_fast.h5'))
True
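The speedup figures quoted for the proof of concept can be checked directly from the timings reported in this issue. A quick sketch of the arithmetic follows. The 500 ms value is only my guessed floor for raw PyTables, not a measurement, and the dictionary keys are just labels chosen for this example.

```python
# Timings reported above, in milliseconds.
timings_ms = {
    "old_hdf": 2090,
    "fast_hdf": 1250,
    "pure_pandas": 781,
    "feather_poc": 152,
}

# Assumption: guessed lower bound for raw PyTables (not measured).
pytables_floor_ms = 500


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


vs_fast_hdf = speedup(timings_ms["fast_hdf"], timings_ms["feather_poc"])
vs_pytables_floor = speedup(pytables_floor_ms, timings_ms["feather_poc"])

print(f"vs. current best HDF format: {vs_fast_hdf:.1f}x")
print(f"vs. guessed PyTables floor:  {vs_pytables_floor:.1f}x")
```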
(*) There is an in-progress project to use a new HDF mechanism in PyTables to provide (much) faster I/O, but it is still a WIP and there is no guarantee it will be completed and integrated "soon" (the project is supposed to end by the end of the year).