Skip to content

Performance of merge for categorical index vs category column #30513

Closed
@timhunderwood

Description

@timhunderwood

Code Sample

import pandas
import numpy
import time
numpy.random.seed(11)
N=1_000_000

### 1. Merging on category index (3.85 seconds -slower)
# Create two (random order) dataframes with a column A with unique category dtype set as index
array_1 = numpy.arange(N)
numpy.random.shuffle(array_1)
df_1 = pandas.DataFrame(array_1, columns=["A"])
df_1["B"] = df_1["A"] * 2
df_1["A"] = df_1["A"].astype("category")
df_1 = df_1.set_index("A")

array_2 = array_1.copy()
numpy.random.shuffle(array_2)
df_2 = pandas.DataFrame(array_2, columns=["A"])
df_2["B"] = df_2["A"] * 2
df_2["A"] = df_2["A"].astype("category")
df_2 = df_2.set_index("A")

# merge two dataframes and time
start = time.time()
df_1.merge(df_2, how="left", left_index=True, right_index=True)
print(time.time()-start)

### 2. Merging on category column (0.85 seconds - faster)
# Create two (random order) dataframes with a column A with unique category dtype
array_1 = numpy.arange(N)
numpy.random.shuffle(array_1)
df_1 = pandas.DataFrame(array_1, columns=["A"])
df_1["B"] = df_1["A"] * 2
df_1["A"] = df_1["A"].astype("category")

array_2 = array_1.copy()
numpy.random.shuffle(array_2)
df_2 = pandas.DataFrame(array_2, columns=["A"])
df_2["B"] = df_2["A"] * 2
df_2["A"] = df_2["A"].astype("category")

# merge two dataframes and time
start = time.time()
df_1.merge(df_2, how="left", left_on="A", right_on="A")
print(time.time()-start)

Problem description

I noticed that when I perform a merge on a unique category dtype column, the merge gets significantly (x4) slower if I set this column as the index. This was unexpected for me as for all other dtypes I have tried (str, int, datetime etc.) the merge is significantly faster when using the indexed column.

I investigated this a bit further. Even though the category column "A" in the above example is unique (no repeated categories), the merge method still calls the pandas.core.indexes.category.CategoricalIndex.get_indexer_non_unique method.

This is because the check self.is_unique and self.equals(target) evaluates to False on

if self.is_unique and self.equals(target):
. This is because self.equals(target) is False
def equals(self, other):
. I am not sure if this should be evaluating as True or False? Should two categorical indexes be equal if they have the exact same categories in different orders?

Additionally, I am not sure why get_index_non_unique for a categorical index should be significantly slower than a merge on a categorical column.

Let me know if you have any other insights into why the merge would be slower for a category type when using the index, or if I can provide any additional information / assistance. Thanks for your help!

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.5.final.0 python-bits : 64 OS : Windows OS-release : 10 machine : AMD64 processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Metadata

Metadata

Assignees

Labels

BenchmarkPerformance (ASV) benchmarksCategoricalCategorical Data TypeReshapingConcat, Merge/Join, Stack/Unstack, Explodegood first issue

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions