Description
Minimal Verifiable Complete Exemple
Below a MVCE of the behavior:
import pandas as pd
# Trial Data:
data = {
'key1': list(range(6))*2
,'key2': [100, 100, 100, 100, 200, 200, 200, 300, 300, None, None, None]
,'data': ['a']*12
}
# Load Data:
df0 = pd.DataFrame(data)
# Index (Int64 upcasted to Float64, because of None converted into NaN)
df1 = df0.set_index('key2')
# MultiIndex containing NaN on second level:
df2 = df0.set_index(['key1', 'key2'])
# NaN values are replaced by last existing value:
idx = df2.index.remove_unused_levels()
# Then, Index are not equal:
idx.equals(df2.index) # False
Problem description
Using method remove_unused_levels
on MultiIndex containing NaN
create a new MultiIndex that is not equal to the original as documentation says:
The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.
This is why I suspect it is a bug.
Float Index
Single Index
uses NaN
as modality:
>>> df1.index
Float64Index([100.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0, 300.0, 300.0,
nan, nan, nan],
dtype='float64', name='key2')
But, MultiIndex
does not, it has negative modality index instead:
>>> df2.index
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, -1, -1, -1]],
names=['key1', 'key2'])
MultiIndex corruption
When refreshed, NaN
values point to a copy of the last float
modality (here 300.0
) of the level, this lead to a kind of corrupted index because those auto-filled value do not have any meaning.
>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, 300.0]],
labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 3, 3, 3, 3, 3]],
names=['key1', 'key2'])
As a consequence Index are not equal (which contradicts documentation):
>>> df2.index.remove_unused_levels().equals(df2.index)
False
Even worse, original value (300.0
is not referenced anymore), and then it is a unused value/modality in the newly generated index.
To confirm it, lets apply the method twice, we get:
>>> df2.index.remove_unused_levels().remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0]],
labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]],
names=['key1', 'key2'])
Expected Output
I believe expected output of set_index
and remove_unused_levels
should be:
>>> df2.index.remove_unused_levels()
MultiIndex(levels=[[0, 1, 2, 3, 4, 5], [100.0, 200.0, 300.0, nan]],
labels=[[0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5], [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]],
names=['key1', 'key2'])
The problem also occurs when rows are removed from the DataFrame, and then it makes sense to use the method remove_unused_levels
to clean up index. Anyway, when building the MCVE I found it was working on the whole Index whatever the level order.
Pandas Versions
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-75-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None