BUG: reset_index on a MultiIndex with duplicate levels raises a ValueError #44410

Closed
@pymrkc

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

>>> import pandas
>>> from pandas import Timestamp
>>> df = pandas.DataFrame({'label': {1000: 'apple',
  1001: 'carrot',
  1002: 'carrot',
  1003: 'apple',
  1004: 'apple',
  1005: 'carrot'},
 'date': {1000: Timestamp('2021-10-27 00:00:00'),
  1001: Timestamp('2021-10-27 00:00:00'),
  1002: Timestamp('2021-10-28 00:00:00'),
  1003: Timestamp('2021-10-28 00:00:00'),
  1004: Timestamp('2021-10-29 00:00:00'),
  1005: Timestamp('2021-10-29 00:00:00')},
 'stock': {1000: 100,
  1001: 150,
  1002: 75,
  1003: 50,
  1004: 200,
  1005: 20}})
>>> df_rolling = df.set_index(['label', 'date']).groupby(level='label').rolling(window=7, min_periods=1).sum()
>>> df_rolling
                          stock
label  label  date             
apple  apple  2021-10-27  100.0
              2021-10-28  150.0
              2021-10-29  350.0
carrot carrot 2021-10-27  150.0
              2021-10-28  225.0
              2021-10-29  245.0
>>> df_rolling.index
MultiIndex([( 'apple',  'apple', '2021-10-27'),
            ( 'apple',  'apple', '2021-10-28'),
            ( 'apple',  'apple', '2021-10-29'),
            ('carrot', 'carrot', '2021-10-27'),
            ('carrot', 'carrot', '2021-10-28'),
            ('carrot', 'carrot', '2021-10-29')],
           names=['label', 'label', 'date'])
>>> df_rolling = df_rolling.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-8b81c1e32ea2> in <module>
----> 1 df_rolling = df_rolling.reset_index()

/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in reset_index(self, level, drop, inplace, col_level, col_fill)
   5797                     )
   5798 
-> 5799                 new_obj.insert(0, name, level_values)
   5800 
   5801         new_obj.index = new_index

/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   4412         if not allow_duplicates and column in self.columns:
   4413             # Should this be a different kind of error??
-> 4414             raise ValueError(f"cannot insert {column}, already exists")
   4415         if not isinstance(loc, int):
   4416             raise TypeError("loc must be int")

ValueError: cannot insert label, already exists
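
The traceback ends in DataFrame.insert, which rejects a column name that is already present unless allow_duplicates is set. The following is a minimal sketch of that collision on its own, without any rolling or groupby; the two-row frame is a made-up stand-in, not data from the report.

```python
import pandas as pd

# Hypothetical frame that already has a 'label' column, standing in for
# the state after one 'label' index level has been turned into a column.
frame = pd.DataFrame({"label": ["apple", "carrot"], "stock": [100.0, 150.0]})

try:
    # Inserting a second 'label' column mirrors what reset_index attempts
    # for the duplicated index level.
    frame.insert(0, "label", ["apple", "carrot"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

As the insert signature in the traceback shows, the check is skipped when allow_duplicates=True is passed, at the cost of ending up with two columns of the same name.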

Issue Description

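In pandas 1.1.5 this code works: the index of the rolling result has a single label level, and the reset_index call succeeds. In pandas 1.3.4 the MultiIndex of the rolling result contains the label level twice, so reset_index raises ValueError: cannot insert label, already exists.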

Expected Behavior

>>> import pandas
>>> from pandas import Timestamp
>>> df = pandas.DataFrame({'label': {1000: 'apple',
  1001: 'carrot',
  1002: 'carrot',
  1003: 'apple',
  1004: 'apple',
  1005: 'carrot'},
 'date': {1000: Timestamp('2021-10-27 00:00:00'),
  1001: Timestamp('2021-10-27 00:00:00'),
  1002: Timestamp('2021-10-28 00:00:00'),
  1003: Timestamp('2021-10-28 00:00:00'),
  1004: Timestamp('2021-10-29 00:00:00'),
  1005: Timestamp('2021-10-29 00:00:00')},
 'stock': {1000: 100,
  1001: 150,
  1002: 75,
  1003: 50,
  1004: 200,
  1005: 20}})
>>> df_rolling = df.set_index(['label', 'date']).groupby(level='label').rolling(window=7, min_periods=1).sum()
>>> df_rolling
        stock
label        
apple   100.0
apple   150.0
apple   350.0
carrot  150.0
carrot  225.0
carrot  245.0
>>> df_rolling.index
MultiIndex([( 'apple',),
            ( 'apple',),
            ( 'apple',),
            ('carrot',),
            ('carrot',),
            ('carrot',)],
           names=['label'])
>>> df_rolling = df_rolling.reset_index()
>>> df_rolling.index
RangeIndex(start=0, stop=6, step=1)
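
Until the behavior is settled, one possible workaround is to drop the repeated index level by position before calling reset_index. The sketch below is an assumption about how a caller might do that, not a fix in pandas itself; drop_duplicate_levels is a hypothetical helper name, and the index is built by hand to match the shape reported for 1.3.4 so the example does not depend on the rolling behavior of any particular version.

```python
import pandas as pd


def drop_duplicate_levels(obj):
    """Return obj with repeated index level names dropped, keeping the first."""
    seen = set()
    duplicate_positions = []
    for position, name in enumerate(obj.index.names):
        if name in seen:
            duplicate_positions.append(position)
        else:
            seen.add(name)
    # Drop by integer position, because a duplicated name is ambiguous.
    return obj.droplevel(duplicate_positions) if duplicate_positions else obj


# Hand-built index with the same names as the one reported for pandas 1.3.4.
index = pd.MultiIndex.from_tuples(
    [
        ("apple", "apple", pd.Timestamp("2021-10-27")),
        ("apple", "apple", pd.Timestamp("2021-10-28")),
        ("apple", "apple", pd.Timestamp("2021-10-29")),
        ("carrot", "carrot", pd.Timestamp("2021-10-27")),
        ("carrot", "carrot", pd.Timestamp("2021-10-28")),
        ("carrot", "carrot", pd.Timestamp("2021-10-29")),
    ],
    names=["label", "label", "date"],
)
rolled = pd.DataFrame(
    {"stock": [100.0, 150.0, 350.0, 150.0, 225.0, 245.0]}, index=index
)

flat = drop_duplicate_levels(rolled).reset_index()
print(flat.columns.tolist())
```

Unlike the 1.1.5 output shown above, this keeps the date level, so the flattened frame has label, date and stock columns.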

Installed Versions

In [60]: pandas.show_versions()

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.0-37-generic
Version : #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.4
pytz : 2019.3
dateutil : 2.7.3
pip : 20.0.2
setuptools : 45.2.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
