Skip to content

PERF: Improve performance of Series.isin() on sets #25812

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions asv_bench/benchmarks/series_methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,10 +29,14 @@ class IsIn(object):
def setup(self, dtype):
self.s = Series(np.random.randint(1, 10, 100000)).astype(dtype)
self.values = [1, 2]
self.values_set = set(self.values)

def time_isin(self, dtypes):
self.s.isin(self.values)

def time_isin_set(self, dtypes):
self.s.isin(self.values_set)


class IsInFloat64(object):

Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -176,6 +176,7 @@ Performance Improvements
int8/int16/int32 and the searched key is within the integer bounds for the dtype (:issue:`22034`)
- Improved performance of :meth:`pandas.core.groupby.GroupBy.quantile` (:issue:`20405`)
- Improved performance of :meth:`read_csv` by faster tokenizing and faster parsing of small float numbers (:issue:`25784`)
- Improved performance of :meth:`Series.isin` and :meth:`DataFrame.isin` when passing a set (:issue:`25507`).
- Improved performance of :meth:`read_csv` by faster parsing of N/A and boolean values (:issue:`25804`)

.. _whatsnew_0250.bug_fixes:
Expand Down
14 changes: 11 additions & 3 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,9 @@
is_complex_dtype, is_datetime64_any_dtype, is_datetime64tz_dtype,
is_datetimelike, is_extension_array_dtype, is_float_dtype, is_integer,
is_integer_dtype, is_interval_dtype, is_list_like, is_numeric_dtype,
is_object_dtype, is_period_dtype, is_scalar, is_signed_integer_dtype,
is_sparse, is_timedelta64_dtype, is_unsigned_integer_dtype,
needs_i8_conversion)
is_object_dtype, is_period_dtype, is_scalar, is_set_like,
is_signed_integer_dtype, is_sparse, is_timedelta64_dtype,
is_unsigned_integer_dtype, needs_i8_conversion)
from pandas.core.dtypes.generic import ABCIndex, ABCIndexClass, ABCSeries
from pandas.core.dtypes.missing import isna, na_value_for_dtype

Expand Down Expand Up @@ -395,6 +395,14 @@ def isin(comps, values):
" to isin(), you passed a [{values_type}]"
.format(values_type=type(values).__name__))

# GH 25507
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like you to integrate this with the other impl., meaning skip the set-ifying if its a set, but otherwise dispatch the actual isin operation. I suspect this will work only in some of the tests cases (but we aren't fully testing it here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean by the other implementation or the actual isin operator. Do you mean let the rest of this function run but without the type cast? Or attempt to call comps.isin? It seems that would fail if comps was an ndarray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what i mean is you can skip the list-ifying if its a set, but don't actually do the comp in values. as I said I suspect this actually doesn't work (some tests are failing) and if we add sets to a fair number of tests they will fail. The reason is that these all must be the same types exactly e.g. comp is often a int while values maybe be an np.int. we already fully handle this path, so you need to fit in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests that failed are unrelated to my changes (conda connection issues on setup). I will add more tests for set values, but dispatching the current path of isin will destroy the performance gain this change intended to fix.

I can't find any situations which cause this change to fail. The int vs np.int works fine, and I can create tests around such cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are creating a new path which makes the codes more complex ; we already have too many paths here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay then feel free to close this pull request; the current path cannot support this performance change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said this needs to integrate with the existing isin tests. this needs to happen before this patch is considered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the new patch works well for every test case except NaN types. The current paths ensure that an array containing NaN when compared against a collection with NaN in it returns True. However using Python sets NaN != NaN so it will return False (not sure of the reasoning, but I think current implementation goes against IEEE).

This means that .isin({np.nan}) and .isin([np.nan]) will have different results, and I don't know of a way to resolve this without introducing an O(n) check that destroys the performance fix.

Closing this pull request as I don't think we can implement this performance change without this regression.

# if `values` is a set, directly use it instead of hashing a list
if is_set_like(values):
result = np.empty_like(comps, dtype=np.bool)
for i, comp in enumerate(comps):
result[i] = comp in values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for each test in test_algorithms for isin, add an arg that also is a set (for as many tests as possible)

return result

if not isinstance(values, (ABCIndex, ABCSeries, np.ndarray)):
values = construct_1d_object_array_from_listlike(list(values))

Expand Down
2 changes: 1 addition & 1 deletion pandas/core/dtypes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@
is_array_like, is_bool, is_complex, is_decimal, is_dict_like, is_file_like,
is_float, is_hashable, is_integer, is_interval, is_iterator, is_list_like,
is_named_tuple, is_nested_list_like, is_number, is_re, is_re_compilable,
is_scalar, is_sequence, is_string_like)
is_scalar, is_sequence, is_set_like, is_string_like)

_POSSIBLY_CAST_DTYPES = {np.dtype(t).name
for t in ['O', 'int8', 'uint8', 'int16', 'uint16',
Expand Down
30 changes: 30 additions & 0 deletions pandas/core/dtypes/inference.py
Original file line number Diff line number Diff line change
Expand Up @@ -301,6 +301,36 @@ def is_list_like(obj, allow_sets=True):
and not (allow_sets is False and isinstance(obj, Set)))


def is_set_like(obj):
"""
Check if the object is set-like.

Parameters
----------
obj : The object to check

Returns
-------
is_set_like : bool
Whether `obj` has set-like properties.

Examples
--------
>>> is_set_like({1, 2})
True
>>> is_set_like(frozenset([1, 2]))
True
>>> is_set_like(set())
True
>>> is_set_like(set)
False
>>> is_set_like({1: 2, 3: 4})
False
"""

return isinstance(obj, (set, frozenset))


def is_array_like(obj):
"""
Check if the object is array-like.
Expand Down
18 changes: 17 additions & 1 deletion pandas/tests/dtypes/test_inference.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@
from pandas.core.dtypes.common import (
ensure_categorical, ensure_int32, is_bool, is_datetime64_any_dtype,
is_datetime64_dtype, is_datetime64_ns_dtype, is_datetime64tz_dtype,
is_float, is_integer, is_number, is_scalar, is_scipy_sparse,
is_float, is_integer, is_number, is_scalar, is_scipy_sparse, is_set_like,
is_timedelta64_dtype, is_timedelta64_ns_dtype)

import pandas as pd
Expand Down Expand Up @@ -103,6 +103,22 @@ def test_is_list_like_disallow_sets(maybe_list_like):
assert inference.is_list_like(obj, allow_sets=False) == expected


@pytest.mark.parametrize('obj,expected', [
({1, 2}, True),
(set(), True),
(set, False),
([1, 2], False),
({1: 2}, False),
(frozenset([1, 2]), True),
(frozenset(), True),
(frozenset, False),
((1, 2), False),
([], False),
])
def test_is_set_like(obj, expected):
assert is_set_like(obj) == expected


def test_is_sequence():
is_seq = inference.is_sequence
assert (is_seq((1, 2)))
Expand Down
15 changes: 15 additions & 0 deletions pandas/tests/test_algos.py
Original file line number Diff line number Diff line change
Expand Up @@ -803,6 +803,21 @@ def test_different_nans_as_float64(self):
expected = np.array([True, True])
tm.assert_numpy_array_equal(result, expected)

@pytest.mark.parametrize("comps,values,expected", [
(np.array([1, 2, 3]), {1, np.int(2)}, np.array([True, True, False])),
(np.array(['a', 'b']), {1, 'b'}, np.array([False, True])),
(np.array([1.0, 2.0]), {1, 2}, np.array([True, True])),
(pd.date_range("2019-01-01", "2019-01-03"),
{datetime(2019, 1, 2)}, np.array([False, True, False])),
(pd.Categorical(['a', 'b']), {0, 'b'}, np.array([False, True])),
(np.array([np.nan, float('nan')]), {float('nan')},
np.array([False, False]))
])
def test_set(self, comps, values, expected):
# GH 25507
actual = algos.isin(comps, values)
assert tm.assert_numpy_array_equal(actual, expected)


class TestValueCounts(object):

Expand Down