Closed
>>> from larray_eurostat import *
>>> arr = eurostat_get('migr_pop3ctb')
>>> arr.to_hdf('test_arr.h5', 'arr')
>>> import pandas as pd
>>> %timeit pd.read_hdf('test_arr.h5', 'arr')
114 ms ± 8.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit read_hdf('test_arr.h5', 'arr')
1.24 s ± 35.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The problem is mostly in `larray.inout.array.cartesian_product_df`, which is very expensive even though it should be a no-op for .h5 arrays saved from an LArray (the data was already "densified" when it was written).
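A possible fast path, as a sketch only (`is_dense` is a hypothetical helper, not part of larray's API): whether a MultiIndex is already the full cartesian product of its levels can be checked from metadata alone, without rebuilding the product. Since every MultiIndex entry is drawn from its levels, a unique index whose length equals the product of the level sizes must cover all combinations; monotonicity additionally suggests it is in the same order `from_tuples(product(...))` would produce.

```python
import numpy as np
import pandas as pd

def is_dense(df: pd.DataFrame) -> bool:
    """Cheap check (hypothetical helper): True if df's MultiIndex already
    contains every combination of its level values, in sorted order, so
    that rebuilding the cartesian product would be a no-op."""
    idx = df.index
    if not isinstance(idx, pd.MultiIndex):
        return False
    # Every entry is a tuple of level values, so a unique index whose
    # length equals the product of level sizes covers all combinations.
    n_full = int(np.prod([len(level) for level in idx.levels]))
    return len(idx) == n_full and idx.is_unique and idx.is_monotonic_increasing
```

If such a check passed, `cartesian_product_df` could return the DataFrame unchanged instead of rebuilding and comparing the index.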
>>> from larray.inout.array import cartesian_product_df
>>> df = pd.read_hdf('test_arr.h5', 'arr')
>>> %timeit cartesian_product_df(df)
968 ms ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> from larray.inout.array import df_labels
>>> %timeit labels = df_labels(df, False)
288 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> labels = df_labels(df, False)
>>> from itertools import product
>>> %timeit new_index = pd.MultiIndex.from_tuples(list(product(*labels)))
440 ms ± 3.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> new_index = pd.MultiIndex.from_tuples(list(product(*labels)))
>>> columns = list(df.columns)
>>> import numpy as np
>>> %timeit np.array_equal(df.index.values, new_index.values)
23.3 ms ± 638 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
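Even without a no-op fast path, most of the `from_tuples` cost looks avoidable: `pd.MultiIndex.from_product` builds the same index directly from the level values, without materializing the full list of tuples first. A small self-contained check that both constructions agree (synthetic labels, not the Eurostat data):

```python
from itertools import product
import pandas as pd

labels = [['a', 'b'], [2019, 2020], ['f', 'm']]

# from_product computes the codes arithmetically instead of hashing
# every tuple, so it scales much better than from_tuples.
idx_tuples = pd.MultiIndex.from_tuples(list(product(*labels)))
idx_product = pd.MultiIndex.from_product(labels)

assert idx_tuples.equals(idx_product)
```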
I think I had already fixed this (at least partially) in the very old pandasbased3 branch.