Skip to content

fix series.isin slow issue with Dtype IntegerArray #38379

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Jan 20, 2021
Merged
Changes from 6 commits
Commits
Show all changes
39 commits
Select commit Hold shift + click to select a range
109c0e7
fix series.isin slow issue with Dtype IntegerArray
tushushu Dec 9, 2020
e9f96ea
Move isinstance(comps, IntegerArray) to algo.isin
tushushu Dec 9, 2020
a6be9c8
cannot import IntegerArray due to circular import
tushushu Dec 9, 2020
415b590
fix bug in pandas (Linux py38_np_dev)
tushushu Dec 9, 2020
f3e5afb
fix pre commit issue.
tushushu Dec 9, 2020
14579fc
fix the code style issue.
tushushu Dec 9, 2020
562c918
move the logic to elif block.
tushushu Dec 11, 2020
1449d3c
remove blank line.
tushushu Dec 11, 2020
3ccc917
copy codes from #38422
tushushu Dec 27, 2020
98a0683
make `isin` correct for pd.NA
tushushu Dec 27, 2020
6e2917e
sort imports
tushushu Dec 27, 2020
a4b6503
Avoiding import pandas as pd.
tushushu Dec 31, 2020
f95fde9
fix cannot import NA issue.
tushushu Dec 31, 2020
2348a60
Merge pull request #1 from pandas-dev/master
tushushu Jan 2, 2021
c963183
Merge remote-tracking branch 'upstream/master' into ENH-implement-fas…
tushushu Jan 7, 2021
c151102
Merge pull request #2 from pandas-dev/master
tushushu Jan 7, 2021
ef99e86
Merge branch 'ENH-implement-fast-isin' of github.com:tushushu/pandas …
tushushu Jan 7, 2021
f23ba94
Merge remote-tracking branch 'origin/master' into ENH-implement-fast-…
tushushu Jan 9, 2021
13dc64f
Adding Int64 and Float64 for benchmarks.
tushushu Jan 9, 2021
cc38088
Adding isin benchmarks for Boolean array
tushushu Jan 9, 2021
94846cc
Adding what's new note.
tushushu Jan 9, 2021
f4cb5ce
fix IsInFloat64 benchmarks
tushushu Jan 9, 2021
2ee8b05
always return false for null values.
tushushu Jan 9, 2021
7763b18
fix flake8 error.
tushushu Jan 9, 2021
87dfff9
remove unused lines.
tushushu Jan 10, 2021
d2d32d1
refactors for series benchmarks.
tushushu Jan 10, 2021
90ef57a
fix flake8 errors.
tushushu Jan 10, 2021
a48e00b
Change back to see if can pass the tests.
tushushu Jan 10, 2021
2279519
Merge remote-tracking branch 'upstream/master' into ENH-implement-fas…
tushushu Jan 14, 2021
c726d4a
makes NA isin [NA] return True.
tushushu Jan 14, 2021
1134ad6
remove redundant codes.
tushushu Jan 14, 2021
bce0e3e
makes performance better.
tushushu Jan 14, 2021
bf788e5
fix flake8 errors.
tushushu Jan 14, 2021
40950be
polish codes
tushushu Jan 16, 2021
570d640
not import NA
tushushu Jan 16, 2021
0f89578
fix code style
tushushu Jan 16, 2021
199c11c
fix black error.
tushushu Jan 16, 2021
9f35b5b
fix CI
tushushu Jan 17, 2021
2238dc5
Merge remote-tracking branch 'upstream/master' into ENH-implement-fas…
tushushu Jan 19, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -447,6 +447,8 @@ def isin(comps: AnyArrayLike, values: AnyArrayLike) -> np.ndarray:
else:
values = extract_array(values, extract_numpy=True)

if type(comps).__name__ == "IntegerArray":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work if you handle it in the extension array elif instead of separately up here?

elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
values.dtype
):
return isin(np.asarray(comps), np.asarray(values))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so the issue here is that we actually need to implement .isin on the EAs (numeric dtypes); we already do this on the internal EAs (datetime, interval etc).

cc @jbrockmendel

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work if you handle it in the extension array elif instead of separately up here?

elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype(
values.dtype
):
return isin(np.asarray(comps), np.asarray(values))

Hi Andrew,
Are you suggesting something like this:

 elif is_extension_array_dtype(comps.dtype) or is_extension_array_dtype( 
     values.dtype 
 ): 
     if type(comps).__name__ == "IntegerArray":
         comps = comps._data
     return isin(np.asarray(comps), np.asarray(values)) 

I've made a test the performance is good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that was what i meant

Does anything break, and do you get an isin speedup, if you relax the IntegerArray check and use comps._data for other EAs (such as FloatingArray)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the performance increases from 20ms to 2ms after the modification. Please refer to #38340 if you are interested in the performance tests.

Sure I could add other EAs here, will do that later. Thanks Andrew~

Copy link
Contributor Author

@tushushu tushushu Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found ExtensionArray does not has the '_data' member, and only subclass of BaseMaskedArray has the _data member, so I guess the classes below should be considered:
FloatingArray
IntegerArray
BooleanArray

comps = comps._data # type: ignore[attr-defined, assignment]
comps = _ensure_arraylike(comps)
comps = extract_array(comps, extract_numpy=True)
if is_categorical_dtype(comps.dtype):
Expand Down