Description
I propose adding a `MultiIndex._data` attribute of type `List[Categorical]`, where all the underlying data of a MultiIndex would be stored. A `MultiIndex.array` property would also be added that accesses `_data`.
This has the advantage of collecting the data underlying a MultiIndex into one human-readable data structure. It also makes zero-copy access to the data very easy: for example, `mi.array[1]` would return the data of the second level as a `Categorical`, in an easy-to-read form.
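To illustrate what `mi.array[1]` might return, here is a minimal sketch that emulates it with attributes that already exist. `mi.array` itself is only proposed and is not used here; the example data and the `second_level` name are made up for illustration, and I make no claim that this emulation is zero-copy.

```python
import pandas as pd

# Illustrative data only.
mi = pd.MultiIndex.from_arrays([["a", "a", "b"], [1, 2, 1]], names=["x", "y"])

# Emulate the proposed mi.array[1] by combining the existing
# codes and levels of the second level into one Categorical.
second_level = pd.Categorical.from_codes(mi.codes[1], categories=mi.levels[1])

print(second_level)
```

The resulting `Categorical` shows the values and the categories together, which is the readable form the proposal is after.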
With the above changes, a `MultiIndex` could be explained as just "a container over a list of Categoricals", which is easier to explain than the current model. The `MultiIndex` could also be related to `CategoricalIndex`, which is "a container over a single Categorical".
This change means that `MultiIndex.levels` will become a property that returns a `FrozenList(cat.categories for cat in self._data)`, and `MultiIndex.codes` will be a property that returns `FrozenList(cat.codes for cat in self._data)`.
`MultiIndex.array` will be added and will simply be a property that returns a `FrozenList` of `self._data`.
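The three properties described above could look roughly like the following toy sketch. `MultiIndexSketch` is a hypothetical stand-in class, not pandas code, and a plain `tuple` is used in place of pandas' `FrozenList` to keep the example on public API only.

```python
import pandas as pd


class MultiIndexSketch:
    """Hypothetical toy container; not the actual pandas implementation."""

    def __init__(self, categoricals):
        # _data holds one Categorical per level, as proposed.
        self._data = list(categoricals)

    @property
    def levels(self):
        # A tuple stands in for FrozenList in this sketch.
        return tuple(cat.categories for cat in self._data)

    @property
    def codes(self):
        return tuple(cat.codes for cat in self._data)

    @property
    def array(self):
        # Hands back the stored Categoricals themselves, not copies.
        return tuple(self._data)


sketch = MultiIndexSketch([
    pd.Categorical(["a", "a", "b"]),
    pd.Categorical([1, 2, 1]),
])

print(sketch.levels)
print(sketch.codes)
print(sketch.array[1])
```

In this sketch `levels` and `codes` are derived views over `_data`, so there is a single source of truth for the index contents.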
Performance will not be affected, as most operations would still go through `MultiIndex.codes` and `MultiIndex.levels`.
Moving names from MultiIndex.levels to MultiIndex._names
Currently the levels' names are stored in each level's `name` attribute. This is not very compatible with extracting the categories from `_data`: `.categories` is actually part of the dtype, which ideally should be immutable, so we shouldn't set or change its `name` attribute.
To make my suggestion practically possible, the level names should be stored in `MultiIndex._names` instead, and `MultiIndex.names` will become a property that reads from and writes to `MultiIndex._names`. I think this change simplifies the MultiIndex a bit, as data and names are dealt with separately. This is a small backward-incompatible change, though.
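As a rough sketch of the idea, a `names` property backed by a private `_names` list could look like this. `NamesSketch`, its `nlevels` argument, and the length check are assumptions made for illustration; they are not taken from the pandas code base.

```python
class NamesSketch:
    """Hypothetical toy class showing names stored apart from the levels."""

    def __init__(self, nlevels, names=None):
        self.nlevels = nlevels
        # Names live on the container, not on the level objects.
        self._names = [None] * nlevels
        if names is not None:
            self.names = names

    @property
    def names(self):
        # Return an immutable copy so callers cannot mutate _names directly.
        return tuple(self._names)

    @names.setter
    def names(self, values):
        values = list(values)
        if len(values) != self.nlevels:
            raise ValueError("Length of names must match number of levels.")
        self._names = values


idx = NamesSketch(2, names=["x", "y"])
idx.names = ["first", "second"]
print(idx.names)
```

Because renaming only touches `_names`, the level data (and, in the proposal, the dtype's categories) is never modified when a name changes.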
So, I suggest making two PRs:
- Separating the names from the levels (to be included in 0.25)
- Add `_data` and `array`, and change `levels` and `codes` into properties.