pandas.Series.isin() is slow on large sets due to conversion of set to list #25507

Open
@amerberg

Code Sample

import pandas as pd
import time

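# Build a very large set (100 million squares) and a tiny Series to test against it.
squares = set(a**2 for a in range(100000000))
series = pd.Series(range(100))

# Time element-wise set membership via apply().
start = time.time()
apply_result = series.apply(lambda x: x in squares)
apply_end = time.time()
# Time the same membership test via isin().
isin_result = series.isin(squares)
isin_end = time.time()

# Both approaches must produce the same result.
assert (apply_result == isin_result).all()

print("pandas.Series.apply() took {} seconds and pandas.Series.isin() took {} seconds.".format(apply_end - start, isin_end - apply_end))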

Output:

pandas.Series.apply() took 0.0044422149658203125 seconds and pandas.Series.isin() took 72.23143887519836 seconds.

Problem description

When a set is passed to pandas.Series.isin, it is first converted to a list and then converted back to a hash table. Consequently, the run time is linear in the size of the set, even when the Series itself is tiny (100 elements in the sample above). This is not ideal, because one of the main reasons to use a set is that membership can be tested in constant time.
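To illustrate the asymmetry without pandas, here is a minimal pure-Python sketch. It does not reproduce what isin does internally; it only compares one set lookup against one full copy of the set into a list, which is the kind of linear-time step described above. The helper name `timed` and the set size are assumptions chosen for illustration (the set is far smaller than the one in the code sample).

```python
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# One million elements: much smaller than the issue's set, to keep memory use modest.
values = set(range(1_000_000))

# A single membership test touches one hash bucket, independent of len(values).
found, lookup_seconds = timed(lambda: 999_999 in values)

# Copying the set into a list visits every element, so it is linear in len(values).
copied, copy_seconds = timed(lambda: list(values))

print("lookup: {:.6f} s, copy to list: {:.6f} s".format(lookup_seconds, copy_seconds))
```

On most machines the copy should take far longer than the lookup, and the gap grows with the size of the set.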

Suggested improvements

The quick and dirty workaround is to use pandas.Series.apply (as in the above code sample) instead of pandas.Series.isin. I'm not familiar enough with pandas internals to know whether there are edge cases where this would fail or whether it would be a bad idea to incorporate this workaround into isin directly. I would suggest, however, that at a minimum the documentation for isin be updated to mention that a set will be converted to a list and that this has performance implications, so that users can choose an alternative approach. (I am happy to contribute the documentation if this is the preferred solution.)
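The workaround could be wrapped in a small helper, sketched below. `isin_via_set` is a hypothetical name, not a pandas API, and the sketch assumes every element of the Series is hashable. It may also differ from isin on missing values: a NaN read out of a Series is generally not found by Python's `in` test on a set, whereas isin can match NaN.

```python
import pandas as pd


def isin_via_set(series, values):
    """Element-wise membership test against a set (hypothetical helper).

    Each element is looked up in `values` directly, so the cost scales with
    len(series) rather than len(values). Assumes hashable elements.
    """
    return series.apply(lambda x: x in values).astype(bool)


# Small-scale version of the code sample above.
squares = {a ** 2 for a in range(1000)}
series = pd.Series(range(100))

result = isin_via_set(series, squares)

# For plain integers the helper agrees with Series.isin.
assert result.equals(series.isin(squares))
```

Because the helper calls a Python function once per element, it is only likely to win when the Series is short relative to the set; for a long Series and a small set, isin would probably remain the better choice.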

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: 3.0.5
pip: 19.0.3
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.16.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml.etree: 3.7.2
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels: Performance (Memory or execution speed performance), isin (isin method)
