Description
Code Sample
import pandas
import numpy
import time
numpy.random.seed(11)
N=1_000_000
### 1. Merging on category index (3.85 seconds -slower)
# Create two (random order) dataframes with a column A with unique category dtype set as index
array_1 = numpy.arange(N)
numpy.random.shuffle(array_1)
df_1 = pandas.DataFrame(array_1, columns=["A"])
df_1["B"] = df_1["A"] * 2
df_1["A"] = df_1["A"].astype("category")
df_1 = df_1.set_index("A")
array_2 = array_1.copy()
numpy.random.shuffle(array_2)
df_2 = pandas.DataFrame(array_2, columns=["A"])
df_2["B"] = df_2["A"] * 2
df_2["A"] = df_2["A"].astype("category")
df_2 = df_2.set_index("A")
# merge two dataframes and time
start = time.time()
df_1.merge(df_2, how="left", left_index=True, right_index=True)
print(time.time()-start)
### 2. Merging on category column (0.85 seconds - faster)
# Create two (random order) dataframes with a column A with unique category dtype
array_1 = numpy.arange(N)
numpy.random.shuffle(array_1)
df_1 = pandas.DataFrame(array_1, columns=["A"])
df_1["B"] = df_1["A"] * 2
df_1["A"] = df_1["A"].astype("category")
array_2 = array_1.copy()
numpy.random.shuffle(array_2)
df_2 = pandas.DataFrame(array_2, columns=["A"])
df_2["B"] = df_2["A"] * 2
df_2["A"] = df_2["A"].astype("category")
# merge two dataframes and time
start = time.time()
df_1.merge(df_2, how="left", left_on="A", right_on="A")
print(time.time()-start)
Problem description
I noticed that when I perform a merge on a unique category dtype column, the merge gets significantly (x4) slower if I set this column as the index. This was unexpected for me as for all other dtypes I have tried (str, int, datetime etc.) the merge is significantly faster when using the indexed column.
I investigated this a bit further. Even though the category column "A" in the above example is unique (no repeated categories), the merge method still calls the pandas.core.indexes.category.CategoricalIndex.get_indexer_non_unique
method.
This is because the check self.is_unique and self.equals(target)
evaluates to False on
pandas/pandas/core/indexes/category.py
Line 662 in 67ee16a
self.equals(target)
is False pandas/pandas/core/indexes/category.py
Line 309 in 67ee16a
Additionally, I am not sure why get_index_non_unique for a categorical index should be significantly slower than a merge on a categorical column.
Let me know if you have any other insights into why the merge would be slower for a category type when using the index, or if I can provide any additional information / assistance. Thanks for your help!
Output of pd.show_versions()
pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None