
High memory usage for MultiIndexes #1752

Closed
@njsmith

Description


This came up on the mailing list, but filing it here so it doesn't get forgotten: https://groups.google.com/d/msg/pydata/EaPRB4KNeTQ/Y-3kG5gW3xMJ

Summary: with current pandas master, storing a large MultiIndex takes a lot of memory once it has been indexed into at least once: roughly 200 MB for an index with about one million entries, compared to roughly 20 MB when pickled.
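For illustration, here is a minimal reproduction sketch, not from the original report: the names, the synthetic index, and the use of `memory_usage(deep=True)` (a modern pandas API that only approximates the engine's footprint) are all my assumptions.

```python
import pickle

import numpy as np
import pandas as pd

# Build a DataFrame with a MultiIndex just under 1M entries
# (just below the _SIZE_CUTOFF discussed later in this issue).
n = 999_999
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_arrays(
    [rng.integers(0, 1000, n), rng.integers(0, 1000, n)],
    names=["a", "b"],
)
df = pd.DataFrame({"x": np.ones(n)}, index=idx)

# Index into it once; this is what populates the engine's lookup structures.
df.loc[df.index[0]]

# Compare the in-memory footprint of the index against the pickled frame.
print("index in memory:", df.index.memory_usage(deep=True), "bytes")
print("pickled frame:  ", len(pickle.dumps(df)), "bytes")
```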

The offending dataframe is available here (this is different from the url in the email thread): http://vorpus.org/~njs/tmp/df-pandas-master.pickle.gz

Partly this is just a bug: there is a large object array of tuples that Wes says shouldn't exist, but it seems to exist anyway.

I was also going to suggest making it possible to opt out of the hash table entirely, but code inspection suggests this already happens for indexes with more than pandas/src/engines.pyx:_SIZE_CUTOFF = 1000000 entries; the index above is just under this threshold. Perhaps this cutoff should be lowered for indexes with higher memory overhead, like MultiIndexes? Or even made user-controllable when the index is created? (The code I use to load and set up these dataframes already takes elaborate measures to manage memory overhead, so this wouldn't be a hassle.) A sketch of the cutoff behavior follows.
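To make the cutoff concrete, here is a hypothetical sketch; the function and variable names are mine, and the real implementation in pandas/src/engines.pyx differs. The idea is that lookups against an index over the cutoff skip the hash table, paying a per-lookup scan instead of the table's memory overhead.

```python
import numpy as np

# Value cited above from pandas/src/engines.pyx.
_SIZE_CUTOFF = 1_000_000

def get_positions(values, key, size_cutoff=_SIZE_CUTOFF):
    """Hypothetical illustration of the cutoff: return the positions of
    `key` in `values`, avoiding the hash table for very large indexes."""
    values = np.asarray(values)
    if len(values) > size_cutoff:
        # Over the cutoff: no hash table is built, trading O(n) lookup
        # time for O(1) extra memory.
        return np.flatnonzero(values == key)
    # Under the cutoff: build a value -> positions table (the real engine
    # caches this after the first lookup, which is where the memory goes).
    table = {}
    for i, v in enumerate(values.tolist()):
        table.setdefault(v, []).append(i)
    return np.asarray(table.get(key, []), dtype=np.intp)
```

A per-index `size_cutoff` argument like the one above is one shape the user-controllable opt-out suggested here could take.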
