Description
This was on the mailing list, but filing it here so it doesn't get forgotten: https://groups.google.com/d/msg/pydata/EaPRB4KNeTQ/Y-3kG5gW3xMJ
Summary: with current pandas master, storing a large multi-index takes a lot of memory once it has been indexed into: ~200 megabytes for a 1-million-entry index, as compared to ~20 megabytes pickled.
The offending dataframe is available here (this is different from the url in the email thread): http://vorpus.org/~njs/tmp/df-pandas-master.pickle.gz
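For reference, here's roughly how I'd measure the effect. This is only a sketch: the synthetic index is a stand-in for the pickled one, and it assumes a pandas recent enough to have `Index.memory_usage(deep=True)`, so the exact numbers won't match the master described above.

```python
import numpy as np
import pandas as pd

# Build a DataFrame with a ~1-million-entry MultiIndex as a stand-in for the
# pickled frame linked above.
n = 1000000
idx = pd.MultiIndex.from_arrays(
    [np.arange(n), np.repeat(np.array(["a", "b", "c", "d"]), n // 4)],
    names=["num", "tag"],
)
df = pd.DataFrame({"val": np.arange(n)}, index=idx)

print("before lookup:", df.index.memory_usage(deep=True) // 2**20, "MiB")

# A single lookup forces the index engine (and any cached structures) to be built.
df.loc[(0, "a")]

print("after lookup: ", df.index.memory_usage(deep=True) // 2**20, "MiB")
```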
Partly this is just a bug: there is a large object array of tuples that Wes says shouldn't exist, but which seems to exist anyway (rough illustration below).
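To get a feel for the scale of that array, one can materialize it via `MultiIndex.values` (which returns an object ndarray of tuples) and add up the per-tuple overhead. Again just a sketch; whether and where such an array gets cached is an internal detail that differs between versions.

```python
import sys

# df.index is the MultiIndex from the pickled frame (or the synthetic one above).
tuples = df.index.values          # object ndarray of (level0, level1, ...) tuples
print(type(tuples), tuples.dtype, len(tuples))

# Rough size: the pointer array itself plus one tuple object per entry
# (ignoring the label objects, which are mostly shared with the levels).
approx = tuples.nbytes + sum(sys.getsizeof(t) for t in tuples)
print("approx size: %.0f MB" % (approx / 2.0**20))
```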
I was also going to say that it would be nice to be able to opt out of the hash table entirely, but code inspection suggests that this already happens for indexes with more than pandas/src/engines.pyx:_SIZE_CUTOFF = 1000000 entries (roughly sketched below). The array above is just under this threshold. Perhaps this cutoff should be lowered for indexes that have higher memory overhead, like multi-indexes? Or even be user-controllable when the index is created? (The code I'm using to load in and set up these dataframes already takes elaborate measures to manage memory overhead, so this wouldn't be a hassle.)
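For context, the cutoff behaviour I'm referring to looks roughly like this. This is a plain-Python paraphrase, not the real engines.pyx code; the class and its details are made up for illustration.

```python
import numpy as np

_SIZE_CUTOFF = 1000000  # same constant as pandas/src/engines.pyx

class IndexEngineSketch:
    """Rough sketch of how the size cutoff decides between a hash table
    and a direct search over the index values."""

    def __init__(self, values):
        self.values = values    # index labels (assumed sorted for the fallback path)
        self.mapping = None     # label -> position hash table, built lazily

    def get_loc(self, key):
        if len(self.values) < _SIZE_CUTOFF:
            # Small enough: build (and keep) the hash table on first lookup.
            if self.mapping is None:
                self.mapping = {v: i for i, v in enumerate(self.values)}
            return self.mapping[key]
        # Over the cutoff: skip the hash table and search the values directly
        # (binary search when monotonic), trading lookup speed for memory.
        pos = int(np.searchsorted(self.values, key))
        if pos < len(self.values) and self.values[pos] == key:
            return pos
        raise KeyError(key)
```

The suggestion above is basically to let that cutoff (or the decision itself) vary per index, since each hash table entry for a multi-index is much more expensive than for a flat index.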