BUG: index duplicates keys with non ascii chars #57942

Open
@aquirin

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

When a DataFrame is given a multi-column index whose values contain non-ASCII characters (in this example, a lone surrogate produced by `surrogateescape`), pandas merges different keys into a single key.

import pandas as pd
car = "é".encode("latin1").decode('utf8', 'surrogateescape')
data = [(1,"a-"+car, "x-"+car), (2, "b-"+car, "y-"+car)]
df = pd.DataFrame(data, columns=["c1", "c2", "c3"]).set_index(["c2", "c3"]).reset_index()
print(list(df["c3"]))

prints the same key twice:

['x-\udce9', 'x-\udce9']

Expected behavior:

['x-\udce9', 'y-\udce9']
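
To make the example easier to read, the sketch below inspects what `car` actually holds. It uses only the standard library; the variable names `code_point`, `strict_ok`, and `original_bytes` are illustrative and not part of the original report.

```python
# Build the same character as in the report. The byte 0xE9 is not valid UTF-8
# on its own, so the surrogateescape handler maps it to the lone surrogate U+DCE9.
car = "é".encode("latin1").decode("utf8", "surrogateescape")

code_point = hex(ord(car))
print(ascii(car), len(car), code_point)  # '\udce9' 1 0xdce9

# A lone surrogate cannot be encoded with the default (strict) error handler...
try:
    car.encode("utf8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False

# ...but surrogateescape restores the original byte.
original_bytes = car.encode("utf8", "surrogateescape")
print(original_bytes)  # b'\xe9'
```

So the keys in the example are valid Python `str` objects, but they are not encodable as strict UTF-8. Whether that is what triggers the merge is not established in this report.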

Note that with ASCII characters the behavior is correct:

import pandas as pd
car = "0"
data = [(1,"a-"+car, "x-"+car), (2, "b-"+car, "y-"+car)]
df = pd.DataFrame(data, columns=["c1", "c2", "c3"]).set_index(["c2", "c3"]).reset_index()
print(list(df["c3"]))

prints two different keys:

['x-0', 'y-0']

Note that the behavior is also correct with the non-ASCII character when the index uses a single column:

import pandas as pd
car = "é".encode("latin1").decode('utf8', 'surrogateescape')
data = [(1,"a-"+car, "x-"+car), (2, "b-"+car, "y-"+car)]
df = pd.DataFrame(data, columns=["c1", "c2", "c3"]).set_index(["c3"]).reset_index()
print(list(df["c3"]))

prints two different keys:

['x-\udce9', 'y-\udce9']

Issue Description

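Creating a multi-index whose values contain non-ASCII characters (lone surrogates in the example above) does not preserve distinct keys. After `set_index(["c2", "c3"]).reset_index()`, the two different values of `c3` come back as the same value.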

Expected Behavior

Creating a multi-index with non-ASCII characters should preserve every distinct key.
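
For comparison, here is a minimal sketch of the same two key pairs in plain Python containers, without pandas. The names `keys`, `unique_keys`, and `lookup` are illustrative. It shows only that the strings themselves compare and hash as distinct values, which is the behavior the multi-index is expected to match.

```python
car = "é".encode("latin1").decode("utf8", "surrogateescape")

# The same (c2, c3) pairs used in the reproducible example.
keys = [("a-" + car, "x-" + car), ("b-" + car, "y-" + car)]

# Tuples of surrogate-escaped strings hash and compare like any other str.
unique_keys = set(keys)
lookup = {key: position for position, key in enumerate(keys)}

print(len(unique_keys))                    # 2
print(lookup[("b-" + car, "y-" + car)])    # 1
```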

Installed Versions

INSTALLED VERSIONS
------------------
commit                : bdc79c146c2e32f2cab629be240f01658cfb6cc2
python                : 3.12.1.final.0
python-bits           : 64
OS                    : Linux
OS-release            : 3.10.105
Version               : #25556 SMP Sat Aug 28 02:13:34 CST 2021
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : en_US.UTF-8
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.1
numpy                 : 1.26.4
pytz                  : 2024.1
dateutil              : 2.9.0.post0
setuptools            : None
pip                   : 24.0
Cython                : None
pytest                : None
hypothesis            : None
sphinx                : None
blosc                 : None
feather               : None
xlsxwriter            : None
lxml.etree            : None
html5lib              : None
pymysql               : None
psycopg2              : None
jinja2                : 3.1.3
IPython               : 8.22.2
pandas_datareader     : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
gcsfs                 : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
pandas_gbq            : None
pyarrow               : None
pyreadstat            : None
python-calamine       : None
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
zstandard             : None
tzdata                : 2024.1
qtpy                  : None
pyqt5                 : None

Metadata

Labels

Bug; Needs Triage (issue that has not been reviewed by a pandas team member)
