Description
Code Sample
import pandas as pd
import time
squares = set(a**2 for a in range(100000000))
series = pd.Series(range(100))
start = time.time()
apply_result = series.apply(lambda x: x in squares)
apply_end = time.time()
isin_result = series.isin(squares)
isin_end = time.time()
assert((apply_result==isin_result).all())
print("pandas.Series.apply() took {} seconds and pandas.Series.isin() took {} seconds.".format(apply_end - start, isin_end - apply_end))
Output:
pandas.Series.apply() took 0.0044422149658203125 seconds and pandas.Series.isin() took 72.23143887519836 seconds.
Problem description
When a set is passed to pandas.Series.isin, the set is converted to a list before being converted back to a hash table. Consequently, the run time is linear in the size of the set, which is not ideal because one of the main reasons to use a set is that membership can be tested in constant time.
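To illustrate the cost without pandas, here is a rough plain-Python sketch. It only mimics the round trip from set to list to hash table described above; it does not reproduce the actual pandas code path, and the helper names time_call and via_round_trip are made up for this example.

```python
import time

def time_call(fn):
    """Run fn() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# A smaller set than in the report, so the sketch runs quickly.
squares = set(a**2 for a in range(200_000))
queries = range(100)

# Direct membership: one hash lookup per query, independent of len(squares).
direct, direct_secs = time_call(lambda: [q in squares for q in queries])

def via_round_trip():
    # Assumption: imitate "set -> list -> hash table" by copying every element twice.
    rebuilt = set(list(squares))
    return [q in rebuilt for q in queries]

# Round trip: every element of the set is touched before the first lookup happens.
copied, copied_secs = time_call(via_round_trip)

assert direct == copied
print("direct lookups: {:.6f} s, round trip first: {:.6f} s".format(direct_secs, copied_secs))
```

Both approaches return the same answers; only the second pays a cost proportional to the size of the set, which is the behaviour reported for isin.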
Suggested improvements
The quick and dirty workaround is to use pandas.Series.apply (as in the above code sample) instead of pandas.Series.isin. I'm not familiar enough with pandas internals to know whether there are edge cases where this would fail or whether it would be a bad idea to incorporate this workaround into isin directly. I would suggest, however, that at a minimum the documentation for isin be updated to mention that a set will be converted to a list and that this has performance implications, so that users can choose an alternative approach. (I am happy to contribute the documentation if this is the preferred solution.)
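For anyone who wants the workaround in reusable form, a minimal sketch follows. The function name isin_hashed is hypothetical and not part of pandas. It assumes the Series elements are hashable, and it may differ from isin in edge cases such as NaN, which isin can match but a plain set lookup generally will not.

```python
import pandas as pd

def isin_hashed(series, values):
    """Hypothetical helper: test membership directly against a set.

    Falls back to Series.isin for anything that is not a set or frozenset.
    """
    if isinstance(values, (set, frozenset)):
        # One hash lookup per element; the set is never copied.
        return series.map(lambda x: x in values).astype(bool)
    return series.isin(values)

series = pd.Series(range(10))
squares = {0, 1, 4, 9, 16}

result = isin_hashed(series, squares)
# Cross-check against isin on a small list, where the conversion cost is negligible.
assert result.equals(series.isin(list(squares)))
```

The cost of this version grows with the length of the Series rather than the size of the set, which is the opposite trade-off from isin, so it is mainly useful when the set is much larger than the Series.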
Output of pd.show_versions()
pandas: 0.24.1
pytest: 3.0.5
pip: 19.0.3
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.16.2
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml.etree: 3.7.2
bs4: 4.5.3
html5lib: 0.9999999
sqlalchemy: 1.1.5
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None