
High memory usage for MultiIndexes #1752

Closed
@njsmith

Description


This came up on the mailing list, but filing it here so it doesn't get forgotten: https://groups.google.com/d/msg/pydata/EaPRB4KNeTQ/Y-3kG5gW3xMJ

Summary: with current pandas master, storing a large MultiIndex takes a lot of memory once it has been indexed into at least once: roughly 200 MB for an index with about one million entries, compared to roughly 20 MB when pickled.
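For illustration, here is a minimal reproduction sketch, not from the original report: the names, the synthetic index, and the use of `memory_usage(deep=True)` (a modern pandas API that only approximates the engine's footprint) are all my assumptions.

```python
import pickle

import numpy as np
import pandas as pd

# Build a DataFrame with a MultiIndex just under 1M entries
# (just below the _SIZE_CUTOFF discussed later in this issue).
n = 999_999
rng = np.random.default_rng(0)
idx = pd.MultiIndex.from_arrays(
    [rng.integers(0, 1000, n), rng.integers(0, 1000, n)],
    names=["a", "b"],
)
df = pd.DataFrame({"x": np.ones(n)}, index=idx)

# Index into it once; this is what populates the engine's lookup structures.
df.loc[df.index[0]]

# Compare the in-memory footprint of the index against the pickled frame.
print("index in memory:", df.index.memory_usage(deep=True), "bytes")
print("pickled frame:  ", len(pickle.dumps(df)), "bytes")
```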

The offending dataframe is available here (this is different from the url in the email thread): http://vorpus.org/~njs/tmp/df-pandas-master.pickle.gz

Partly this is just a bug: there is a large object array of tuples that Wes says shouldn't exist, but it seems to exist anyway.

I was also going to suggest making it possible to opt out of the hash table entirely, but code inspection suggests this already happens for indexes with more than pandas/src/engines.pyx:_SIZE_CUTOFF = 1000000 entries; the index above is just under this threshold. Perhaps this cutoff should be lowered for indexes with higher memory overhead, like MultiIndexes? Or even made user-controllable when the index is created? (The code I use to load and set up these dataframes already takes elaborate measures to manage memory overhead, so this wouldn't be a hassle.) A sketch of the cutoff behavior follows.
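To make the cutoff concrete, here is a hypothetical sketch; the function and variable names are mine, and the real implementation in pandas/src/engines.pyx differs. The idea is that lookups against an index over the cutoff skip the hash table, paying a per-lookup scan instead of the table's memory overhead.

```python
import numpy as np

# Value cited above from pandas/src/engines.pyx.
_SIZE_CUTOFF = 1_000_000

def get_positions(values, key, size_cutoff=_SIZE_CUTOFF):
    """Hypothetical illustration of the cutoff: return the positions of
    `key` in `values`, avoiding the hash table for very large indexes."""
    values = np.asarray(values)
    if len(values) > size_cutoff:
        # Over the cutoff: no hash table is built, trading O(n) lookup
        # time for O(1) extra memory.
        return np.flatnonzero(values == key)
    # Under the cutoff: build a value -> positions table (the real engine
    # caches this after the first lookup, which is where the memory goes).
    table = {}
    for i, v in enumerate(values.tolist()):
        table.setdefault(v, []).append(i)
    return np.asarray(table.get(key, []), dtype=np.intp)
```

A per-index `size_cutoff` argument like the one above is one shape the user-controllable opt-out suggested here could take.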
