Closed
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# a simple dataframe with column `c` = 0, 1, 2
df1 = pd.DataFrame({
'a': pd.Categorical([0, 1, 2]),
'b': pd.Categorical([0, 1, 2]),
'c': [0, 1, 2]
}).set_index(['a', 'b'])
# identical index and values; only the value column name differs (`d`)
df2 = pd.DataFrame({
'a': pd.Categorical([0, 1, 2]),
'b': pd.Categorical([0, 1, 2]),
'd': [0, 1, 2]
}).set_index(['a', 'b'])
# same index and values, but with the rows in a different order
df3 = pd.DataFrame({
'a': pd.Categorical([0, 2, 1]),
'b': pd.Categorical([0, 2, 1]),
'e': [0, 2, 1]
}).set_index(['a', 'b'])
# a normal join returns `category` if indexes are identical
df1.join(df2).index.dtypes # category, category
df1.join(df2, how="outer").index.dtypes # category, category
# if the index ordering differs, the index dtype depends on the join type:
df1.join(df3).index.dtypes # category, category (default how="left")
df1.join(df3, how="outer").index.dtypes # int64, int64
df1.join(df3, how="inner").index.dtypes # int64, int64
df1.join(df3, how="left").index.dtypes # category, category
df1.join(df3, how="right").index.dtypes # category, category
Issue Description
If two DataFrames both have a MultiIndex with categorical levels, joining them can drop the categorical dtype from the index levels, depending on the row ordering of the inputs and on the join type. If the indexes have the same ordering, the output levels stay categorical. If the ordering differs, `how="outer"` and `how="inner"` cast the levels to the dtype of the underlying categories (int64 in the example above), while `how="left"` and `how="right"` keep them categorical.
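To make the dependence on join type easier to compare across pandas versions, the check can be run in a loop. This is only a sketch: `make_frame` and `report` are hypothetical names introduced here, and the dtypes printed will depend on the installed pandas version.

```python
import pandas as pd


def make_frame(order, col):
    # Hypothetical helper: builds a frame like df1/df3 from the example,
    # with two categorical index levels and one value column.
    return pd.DataFrame({
        "a": pd.Categorical(order),
        "b": pd.Categorical(order),
        col: order,
    }).set_index(["a", "b"])


left = make_frame([0, 1, 2], "c")
right = make_frame([0, 2, 1], "e")  # same rows, different order

# Collect the index level dtypes produced by each join type.
report = {
    how: [str(dtype) for dtype in left.join(right, how=how).index.dtypes]
    for how in ("left", "right", "inner", "outer")
}

for how, dtypes in report.items():
    print(how, dtypes)
```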
Expected Behavior
All joins shown above should produce categorical index levels.
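Until the behavior is consistent, one possible workaround is to re-cast the index levels after the join. The sketch below is not a pandas API: `join_keep_categorical` is a hypothetical helper, and it assumes that all index levels should be categorical. Note that `astype("category")` infers the categories from the values present, so unused categories and any explicit ordering from the original levels are not preserved.

```python
import pandas as pd


def join_keep_categorical(left, right, how="left"):
    """Join on the index, then re-cast every index level to category.

    Hypothetical workaround helper, not part of pandas.
    """
    names = list(left.index.names)
    # Move the index levels into columns so they can be re-cast.
    joined = left.join(right, how=how).reset_index()
    for name in names:
        joined[name] = joined[name].astype("category")
    return joined.set_index(names)


df1 = pd.DataFrame({
    "a": pd.Categorical([0, 1, 2]),
    "b": pd.Categorical([0, 1, 2]),
    "c": [0, 1, 2],
}).set_index(["a", "b"])

df3 = pd.DataFrame({
    "a": pd.Categorical([0, 2, 1]),
    "b": pd.Categorical([0, 2, 1]),
    "e": [0, 2, 1],
}).set_index(["a", "b"])

fixed = join_keep_categorical(df1, df3, how="outer")
print(fixed.index.dtypes)
```

The round trip through `reset_index` and `set_index` copies the data, so this may be costly on large frames.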
Installed Versions
INSTALLED VERSIONS
------------------
commit : 2e218d10984e9919f0296931d92ea851c6a6faf5
python : 3.9.13.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : English_United States.1252
pandas : 1.5.3
numpy : 1.23.4
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 1.4.43
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None