
BUG: when inner/outer-joining dataframes with categorical MultiIndex, the output index dtype depends on row ordering #50906

Closed
@mcrumiller

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# a simple dataframe with column `c` = 0, 1, 2
df1 = pd.DataFrame({
    'a': pd.Categorical([0, 1, 2]),
    'b': pd.Categorical([0, 1, 2]),
    'c': [0, 1, 2]
}).set_index(['a', 'b'])

# same index as df1 (same values, same row order), with column `d`
df2 = pd.DataFrame({
    'a': pd.Categorical([0, 1, 2]),
    'b': pd.Categorical([0, 1, 2]),
    'd': [0, 1, 2]
}).set_index(['a', 'b'])

# same index values as df1, but in a different row order
df3 = pd.DataFrame({
    'a': pd.Categorical([0, 2, 1]),
    'b': pd.Categorical([0, 2, 1]),
    'e': [0, 2, 1]
}).set_index(['a', 'b'])

# a normal join returns `category` if indexes are identical
df1.join(df2).index.dtypes               # category, category
df1.join(df2, how="outer").index.dtypes  # category, category

# if index ordering is different, dtype of index depends on join type:
df1.join(df3).index.dtypes               # category, category
df1.join(df3, how="outer").index.dtypes  # int64, int64
df1.join(df3, how="inner").index.dtypes  # int64, int64
df1.join(df3, how="left").index.dtypes   # category, category
df1.join(df3, how="right").index.dtypes  # category, category

Issue Description

If two dataframes are both indexed by a MultiIndex with categorical levels, an inner or outer join can drop the categorical dtype from the index levels, depending on the row ordering of the inputs. If the two indexes have the same ordering, the output levels stay categorical; if the orderings differ, the output levels are cast to the dtype of the underlying categories (int64 in the example above). Left and right joins keep the categorical dtype in both cases.
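Until the join itself preserves the dtype, one possible workaround is to re-cast the index levels after joining. The sketch below is not part of the original report: `restore_categorical_levels` is a hypothetical helper name, and it assumes every index level of the joined frame also exists, by name, as a categorical level in a reference frame.

```python
import pandas as pd

def restore_categorical_levels(joined, reference):
    """Hypothetical helper: re-cast each index level of `joined` to the
    categorical dtype of the same-named level in `reference`."""
    idx = joined.index
    arrays = [
        pd.Categorical(
            idx.get_level_values(name),
            dtype=reference.index.get_level_values(name).dtype,
        )
        for name in idx.names
    ]
    out = joined.copy()
    out.index = pd.MultiIndex.from_arrays(arrays, names=idx.names)
    return out

# Same frames as in the reproducible example
df1 = pd.DataFrame({
    'a': pd.Categorical([0, 1, 2]),
    'b': pd.Categorical([0, 1, 2]),
    'c': [0, 1, 2]
}).set_index(['a', 'b'])

df3 = pd.DataFrame({
    'a': pd.Categorical([0, 2, 1]),
    'b': pd.Categorical([0, 2, 1]),
    'e': [0, 2, 1]
}).set_index(['a', 'b'])

# Re-cast the outer-join result using df1's level dtypes as the reference
fixed = restore_categorical_levels(df1.join(df3, how="outer"), df1)
print(fixed.index.dtypes)
```

The helper rebuilds the whole index, so it copies the frame; for large frames that cost may matter.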

Expected Behavior

All joins shown above should produce categorical index levels.
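To compare actual and expected behavior on a given pandas version, the joins can be run in a loop and the resulting index dtypes collected. This is only a diagnostic sketch: `make_frame` and `index_dtypes_by_join` are hypothetical helper names, not pandas API, and the output is deliberately not stated here because it depends on the installed version.

```python
import pandas as pd

def make_frame(order, column):
    """Hypothetical helper: build a frame with a two-level categorical
    MultiIndex ('a', 'b') whose rows follow `order`."""
    return pd.DataFrame({
        'a': pd.Categorical(order),
        'b': pd.Categorical(order),
        column: order,
    }).set_index(['a', 'b'])

def index_dtypes_by_join(left, right, hows=("left", "right", "inner", "outer")):
    """Return {join type: [dtype name of each index level]} for left.join(right)."""
    return {
        how: [str(dt) for dt in left.join(right, how=how).index.dtypes]
        for how in hows
    }

df1 = make_frame([0, 1, 2], 'c')
df3 = make_frame([0, 2, 1], 'e')  # same index values, different row order

report = index_dtypes_by_join(df1, df3)
for how, dtypes in report.items():
    # Expected behavior per this issue: 'category' for every level and join type
    print(how, dtypes)
```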

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 2e218d10984e9919f0296931d92ea851c6a6faf5
python           : 3.9.13.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19044
machine          : AMD64
processor        : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : English_United States.1252

pandas           : 1.5.3
numpy            : 1.23.4
pytz             : 2022.6
dateutil         : 2.8.2
setuptools       : 65.5.1
pip              : 22.3.1
Cython           : None
pytest           : 7.2.0
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : 3.0.3
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.5
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.6.2
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : 3.0.10
pandas_gbq       : None
pyarrow          : 10.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : 1.4.43
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None
