Add MultiIndex._data and MultiIndex.array #27138

Closed
@topper-123

I propose adding a MultiIndex._data attribute of type List[Categorical], in which all the underlying data of a MultiIndex would be stored. A MultiIndex.array property would also be added to access _data.

This has the advantage of collecting the data underlying a MultiIndex into one human-readable data structure, and it also makes zero-copy access to the data very easy. For example, mi.array[1] would return the data of the second level as a Categorical, in an easy-to-read form.
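To make the idea concrete, here is a rough sketch of what the proposed mi.array[1] could look like, emulated with today's public API. The name proposed_array is a hypothetical stand-in for the proposed property, and rebuilding the Categoricals via Categorical.from_codes is only an approximation: it is not guaranteed to be zero-copy the way the proposal intends.

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["letter", "number"]
)

# Hypothetical stand-in for the proposed `mi.array`: one Categorical per level,
# rebuilt from the existing public `levels` and `codes` attributes.
proposed_array = [
    pd.Categorical.from_codes(codes, categories=level)
    for level, codes in zip(mi.levels, mi.codes)
]

# The data of the second level, as a Categorical.
second_level = proposed_array[1]
print(second_level)
```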

With the above changes, a MultiIndex could be explained as just "a container over a list of Categoricals", which is easier to explain than the current model. The MultiIndex could also be related to CategoricalIndex, which is "a container over a single Categorical".

This change means that MultiIndex.levels will become a property that returns a FrozenList(cat.categories for cat in self._data), and MultiIndex.codes will be a property that returns FrozenList(cat.codes for cat in self._data).

MultiIndex.array will be added and will simply be a property that returns a FrozenList of self._data.
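The three properties described above could be sketched roughly as follows. SketchMultiIndex is a hypothetical toy class, not pandas code, and plain tuples stand in for FrozenList to keep the example free of pandas internals.

```python
import pandas as pd


class SketchMultiIndex:
    """Toy container over a list of Categoricals (illustration only)."""

    def __init__(self, categoricals):
        # All underlying data lives in one place: a list of Categoricals.
        self._data = list(categoricals)

    @property
    def levels(self):
        # Derived from the data; a tuple stands in for FrozenList here.
        return tuple(cat.categories for cat in self._data)

    @property
    def codes(self):
        return tuple(cat.codes for cat in self._data)

    @property
    def array(self):
        return tuple(self._data)


smi = SketchMultiIndex(
    [pd.Categorical(["a", "a", "b"]), pd.Categorical([10, 20, 10])]
)
print(smi.levels)
print(smi.codes)
```

In this sketch, levels and codes hold no state of their own; they are computed from _data on each access.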

Performance will not be affected, as most operations would still go through MultiIndex.codes and MultiIndex.levels.

Moving names from MultiIndex.levels to MultiIndex._names

Currently the levels' names are stored on each level's name attribute. This is not very compatible with extracting the categories from _data: .categories is actually part of the dtype, which ideally should be immutable, so we shouldn't set or change its name attribute.

To make my suggestion practically possible, the level names should be stored in MultiIndex._names instead, and MultiIndex.names will become a property that reads from and writes to MultiIndex._names. I think this change simplifies the MultiIndex a bit, as data and names are dealt with separately. This is a small backward-incompatible change, though.
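A minimal sketch of the names handling, assuming a hypothetical toy class NamesSketch rather than the real MultiIndex: the names live in _names on the container itself, and the names property validates and stores them without touching any level object. The length check is an assumption added for illustration.

```python
class NamesSketch:
    """Toy model of storing level names on the index, not on the levels."""

    def __init__(self, nlevels, names=None):
        self._nlevels = nlevels
        self._names = [None] * nlevels
        if names is not None:
            self.names = names  # goes through the setter below

    @property
    def names(self):
        # Read from _names; return a tuple so callers cannot mutate it in place.
        return tuple(self._names)

    @names.setter
    def names(self, values):
        values = list(values)
        if len(values) != self._nlevels:
            raise ValueError(
                f"expected {self._nlevels} names, got {len(values)}"
            )
        self._names = values


idx = NamesSketch(2, names=["letter", "number"])
idx.names = ["first", "second"]
print(idx.names)
```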

So, I suggest making two PRs:

  1. Separating the names from the levels (to be included in 0.25)
  2. Add _data and array, and change levels and codes into properties.

Labels: Closing Candidate (may be closeable, needs more eyeballs), Enhancement, ExtensionArray (extending pandas with custom dtypes or arrays), MultiIndex, Needs Discussion (requires discussion from core team before further action)
