Description
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
import timeit
# data set up:
data_size = 1234
query_size = 10000
data_ndarray = np.random.randint(100000, size=data_size)
data_series = pd.Series(data_ndarray)
list_ = list(range(query_size))
range_index = pd.Index(range(query_size))
array_index = pd.Index(list_)
series = array_index.to_series()
ndarray = array_index.to_numpy()
print(f"{data_series.dtype=}, {range_index.dtype=}, {array_index.dtype=}")
N = 1000
def run(name, f):
    return name, timeit.timeit(f, number=N) / N
df = pd.DataFrame(
    [
        run("list", lambda: data_series.isin(list_)),
        run("range index", lambda: data_series.isin(range_index)),
        run("array index", lambda: data_series.isin(array_index)),
        run("series", lambda: data_series.isin(series)),
        run("ndarray", lambda: data_series.isin(ndarray)),
        # variations on using indices
        run("array index.to_numpy()", lambda: data_series.isin(array_index.to_numpy())),
        run("range index.__contains__", lambda: data_series.apply(range_index.__contains__)),
        run("array index.__contains__", lambda: data_series.apply(array_index.__contains__)),
        run("array index.get_indexer", lambda: array_index.get_indexer(data_series.values) != -1),
        # poke into the internals
        run("array index._engine.mapping.__contains__", lambda: data_series.apply(array_index._engine.mapping.__contains__)),
        run("array index._engine.get_indexer", lambda: array_index._engine.get_indexer(data_series.values) != -1),
        # numpy for comparison
        run("np.isin", lambda: np.isin(data_ndarray, ndarray)),
    ],
    columns=["name", "time"]
).set_index("name")
# double check that the get_indexer version works correctly
assert (data_series.isin(array_index) == (array_index._engine.get_indexer(data_series.values) != -1)).all()
print(df.sort_values("time"))
Output:
data_series.dtype=dtype('int64'), range_index.dtype=dtype('int64'), array_index.dtype=dtype('int64')
                                              time
name
array index._engine.get_indexer           0.000016
array index.get_indexer                   0.000048
array index.to_numpy()                    0.000199
ndarray                                   0.000211
series                                    0.000224
np.isin                                   0.000227
array index._engine.mapping.__contains__  0.000292
array index.__contains__                  0.000586
list                                      0.000624
range index.__contains__                  0.000755
range index                               0.001444
array index                               0.001457
Problem description
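In the timings above, Series.isin with an Index argument (the "range index" and "array index" rows) is the slowest form: roughly 7 times slower than .isin(array_index.to_numpy()) and roughly 90 times slower than array_index._engine.get_indexer.
The exact ratios and numbers depend very much on the data, especially because the get_indexer forms are O(data_size) (with O(1) lookups into the index's hash table), while the other forms likely also depend on query_size, at least to pay the cost of converting the data.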
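To make the get_indexer form concrete, here is a minimal sketch on a tiny, hand-written data set (the values are made up for illustration and are not from the benchmark). It only uses the public Index.get_indexer, which reports -1 for values that are absent and requires the index to be unique.

```python
import pandas as pd

# Small, deterministic stand-ins for data_series and array_index (illustrative values).
data = pd.Series([3, 42, 7, 100, 5])
index = pd.Index([0, 1, 2, 3, 4, 5, 6, 7])

# get_indexer returns the position of each value in the index, or -1 if it is absent.
# Note: this requires a unique index.
positions = index.get_indexer(data.to_numpy())
via_get_indexer = positions != -1

# Reference result from Series.isin, for comparison.
via_isin = data.isin(index).to_numpy()

assert (via_get_indexer == via_isin).all()
print(via_get_indexer)
```

Both forms give the same boolean mask here; only the cost of computing it differs.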
Expected Output
I'd expect:
- using a pandas type to always be faster than a Python type like list,
- it to not be much slower than even a simple variant like .isin(index.to_numpy()),
- it to use the hash table if it exists, given how much faster that is.
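Until something like the expectations above holds, a user-side workaround could look like the sketch below. The helper name isin_index is hypothetical and not part of pandas. It assumes integer-like data without missing values, uses the index's lookup via get_indexer when the index is unique, and otherwise falls back to the .isin(index.to_numpy()) variant. Whether it is actually faster depends on data_size and query_size, as noted above.

```python
import pandas as pd


def isin_index(values: pd.Series, index: pd.Index) -> pd.Series:
    """Hypothetical helper (not a pandas API): membership test of values in index."""
    if index.is_unique:
        # get_indexer yields -1 for values that are not present in the index.
        mask = index.get_indexer(values.to_numpy()) != -1
    else:
        # get_indexer needs a unique index, so fall back to the plain ndarray path.
        mask = values.isin(index.to_numpy()).to_numpy()
    return pd.Series(mask, index=values.index, name=values.name)


data = pd.Series([3, 42, 7], index=["a", "b", "c"])
unique_result = isin_index(data, pd.Index([1, 3, 7]))
dup_result = isin_index(data, pd.Index([3, 3, 9]))

# Both code paths should agree with Series.isin.
assert unique_result.equals(data.isin(pd.Index([1, 3, 7])))
assert dup_result.equals(data.isin(pd.Index([3, 3, 9])))
```

The result keeps the labels of the input Series, so it can be used directly as a boolean mask, for example data[isin_index(data, some_index)].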
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Darwin
OS-release : 18.6.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None