read_hdf is much slower than it should be #563

Closed
@gdementen

Description

>>> from larray_eurostat import *
>>> arr = eurostat_get('migr_pop3ctb')
>>> arr.to_hdf('test_arr.h5', 'arr')
>>> import pandas as pd
>>> %timeit pd.read_hdf('test_arr.h5', 'arr')
114 ms ± 8.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> from larray import read_hdf
>>> %timeit read_hdf('test_arr.h5', 'arr')
1.24 s ± 35.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The problem is mostly in larray.inout.array.cartesian_product_df, which is very expensive even though it should do nothing for .h5 arrays saved from an LArray (those were already "densified" when written).
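For context, the densification that cartesian_product_df performs amounts to reindexing the frame onto the full cartesian product of its per-level labels. Here is a minimal sketch of that idea, not larray's actual implementation (larray computes the labels via df_labels rather than the raw index levels used here):

# Sketch only: "densifying" means reindexing onto the full cartesian
# product of the per-level labels, inserting NaN rows for any missing
# combination.
import pandas as pd

df = pd.read_hdf('test_arr.h5', 'arr')
labels = [list(level) for level in df.index.levels]
full = pd.MultiIndex.from_product(labels, names=df.index.names)
# Even when df is already dense, this reindex still pays the full
# alignment cost, which is why an early-out would help.
dense = df.reindex(full)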

>>> from larray.inout.array import cartesian_product_df
>>> df = pd.read_hdf('test_arr.h5', 'arr')
>>> %timeit cartesian_product_df(df)
968 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> from larray.inout.array import df_labels
>>> %timeit labels = df_labels(df, False)
288 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> labels = df_labels(df, False)
>>> from itertools import product
>>> %timeit new_index = pd.MultiIndex.from_tuples(list(product(*labels)))
440 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> new_index = pd.MultiIndex.from_tuples(list(product(*labels)))
>>> columns = list(df.columns)
>>> import numpy as np
>>> %timeit np.array_equal(df.index.values, new_index.values)
23.3 ms ± 638 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
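Taken together, these numbers suggest a cheap early-out: building the reference index with MultiIndex.from_product instead of from_tuples(list(product(*labels))), then comparing it to the existing index, costs tens of milliseconds versus roughly a second for the unconditional densification. A hedged sketch of such a fast path (maybe_densify is a hypothetical helper, not larray's API):

# Hypothetical fast path, not larray's current code: build the dense index
# cheaply, then skip densification when the existing index already matches.
import numpy as np
import pandas as pd

def maybe_densify(df, labels):
    # from_product avoids materialising every tuple in Python, unlike
    # from_tuples(list(product(*labels))) timed above.
    full = pd.MultiIndex.from_product(labels, names=df.index.names)
    if len(df.index) == len(full) and np.array_equal(df.index.values, full.values):
        return df            # already dense: the common to_hdf round-trip case
    return df.reindex(full)  # genuinely sparse: densify as before

The equality test is essentially the 23.3 ms np.array_equal timing above, about 40x cheaper than the 968 ms cartesian_product_df currently spends unconditionally.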

I think I had already fixed this, at least partially, in the very old pandasbased3 branch.
