Alignment of CategoricalIndex will convert the index to int type? #28397

Closed
@bingtangben

Description

Code Sample

import numpy as np
import pandas as pd

# Passing a Categorical as `index` gives each frame a CategoricalIndex.
idx1 = pd.Categorical(['A1', 'A2', 'A2'], categories=['A1', 'A2'])
idx2 = pd.Categorical(['A2', 'A1', 'A1'], categories=['A1', 'A2'])

tmp1 = pd.DataFrame(np.random.randn(6).reshape(3, 2), index=idx1)
tmp2 = pd.DataFrame(np.random.randn(6).reshape(3, 2), index=idx2)

tmp1 * tmp2

The result:

In [1]: tmp1
Out[1]: 
           0         1
A1 -0.880693  0.794644
A2 -0.671629  1.145027
A2  0.305152  0.379116

In [2]: tmp2
Out[2]: 
           0         1
A2  0.434122 -0.966894
A1 -0.162195  0.499271
A1  0.840569  0.043832

In [3]: tmp1 * tmp2
Out[3]: 
          0         1
0  0.142844  0.396742
0 -0.740283  0.034831
1 -0.291569 -1.107120
1  0.132473 -0.366565

In [4]: pd.__version__
Out[4]: '0.24.2'

Problem description

When two DataFrames whose CategoricalIndex objects share the same categories are multiplied, the result has an integer index instead of a CategoricalIndex. The problem may come from DataFrame alignment:

In [7]: tmp1_align, tmp2_align = tmp1.align(tmp2)

In [8]: tmp1_align
Out[8]: 
          0         1
0 -0.880693  0.794644
0 -0.880693  0.794644
1 -0.671629  1.145027
1  0.305152  0.379116
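
To make the index change easy to check on a given pandas version, a small sketch like the one below prints the index class and dtype before and after alignment. The helper `index_summary` and the seeded random generator are illustrative additions, not part of the original report, and the printed result is expected to differ between pandas versions.

```python
import numpy as np
import pandas as pd

categories = ['A1', 'A2']
rng = np.random.RandomState(0)  # seeded only so the frames are reproducible

tmp1 = pd.DataFrame(rng.randn(3, 2),
                    index=pd.Categorical(['A1', 'A2', 'A2'], categories=categories))
tmp2 = pd.DataFrame(rng.randn(3, 2),
                    index=pd.Categorical(['A2', 'A1', 'A1'], categories=categories))


def index_summary(frame):
    """Return (index class name, index dtype) for a DataFrame (hypothetical helper)."""
    return type(frame.index).__name__, str(frame.index.dtype)


# align() returns both frames reindexed onto a common index.
left_aligned, right_aligned = tmp1.align(tmp2)

for name, frame in [('tmp1', tmp1), ('tmp2', tmp2),
                    ('tmp1 aligned', left_aligned),
                    ('tmp1 * tmp2', tmp1 * tmp2)]:
    print(name, index_summary(frame))
```

If the aligned frame reports anything other than `CategoricalIndex`, the conversion happens during alignment rather than in the multiplication itself.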

I think simple binary operations should preserve the index type when both categorical indexes have the same categories, but they do not. I have also found that the operation sometimes does not keep the categorical index and instead converts it to a plain object (string) index, which uses a lot of memory. I am confused by this and have not found an explanation yet.
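
Until the behaviour is clarified, one possible workaround is to do the arithmetic on plain object labels and rebuild the CategoricalIndex afterwards. This is only a sketch, not a fix from the pandas maintainers: the function name `multiply_keep_categories` is hypothetical, and it assumes both frames use the same categories. The intermediate object index does cost extra memory during the operation.

```python
import numpy as np
import pandas as pd


def multiply_keep_categories(left, right):
    """Multiply two frames and restore a CategoricalIndex on the result.

    Hypothetical helper: assumes left.index and right.index are
    CategoricalIndex objects with identical categories.
    """
    categories = left.index.categories

    # Work on copies with plain object labels so alignment joins on the labels.
    left_obj = left.copy()
    left_obj.index = left.index.astype(object)
    right_obj = right.copy()
    right_obj.index = right.index.astype(object)

    result = left_obj * right_obj

    # Re-encode the aligned labels with the original categories.
    result.index = pd.CategoricalIndex(result.index, categories=categories)
    return result


categories = ['A1', 'A2']
rng = np.random.RandomState(0)  # seeded only so the frames are reproducible
tmp1 = pd.DataFrame(rng.randn(3, 2),
                    index=pd.CategoricalIndex(['A1', 'A2', 'A2'], categories=categories))
tmp2 = pd.DataFrame(rng.randn(3, 2),
                    index=pd.CategoricalIndex(['A2', 'A1', 'A1'], categories=categories))

result = multiply_keep_categories(tmp1, tmp2)
print(type(result.index).__name__)
print(result)
```

Because both indexes contain duplicate labels, alignment pairs every `A1` row of one frame with every `A1` row of the other (and likewise for `A2`), which is why the result has four rows.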

Expected Output

The result should keep the categorical index, like this:

           0         1
A1  0.142844  0.396742
A1 -0.740283  0.034831
A2 -0.291569 -1.107120
A2  0.132473 -0.366565

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.24.2
pytest: 5.0.1
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.12
numpy: 1.16.4
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.6.1
sphinx: 2.1.2
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.2
numexpr: 2.6.9
feather: None
matplotlib: 3.1.0
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.4
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.5
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
