Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

ser = pd.Series(range(10**6))
ser2 = ser.astype("int64[pyarrow]")
to_replace = {i: -i for i in range(100)}
%timeit ser.copy().replace(to_replace, inplace=True)
151 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit ser2.copy().replace(to_replace, inplace=True)
422 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# the copy() alone is 1.51ms for ser and 17.2µs for ser2
Issue Description
Currently, replace goes through Block.replace, which calls the EA's __eq__ (actually missing.mask_missing) and the EA's __setitem__ in a loop, once per to_replace entry. This gets expensive for ArrowEA, in part because __setitem__ makes a copy on every call and, more surprising to me, because mask_missing is slow.
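To make the cost pattern concrete, here is a toy model in plain Python. It is not pandas or pyarrow code: ToyImmutableArray, eq_mask, set_where and replace_pairwise are hypothetical names, and the only assumption carried over from the description above is that each masked assignment on immutable storage rebuilds the whole buffer. Computing every mask against the original values is also an assumption of this sketch.

```python
# Toy model only: a stand-in for an immutable-backed array, not pandas/pyarrow internals.
class ToyImmutableArray:
    copies = 0  # class-level counter of full-buffer rebuilds

    def __init__(self, values):
        self._values = tuple(values)

    def eq_mask(self, value):
        # One full scan per call, like an elementwise == comparison.
        return [v == value for v in self._values]

    def set_where(self, mask, new_value):
        # Immutable storage: every masked assignment rebuilds the whole buffer.
        ToyImmutableArray.copies += 1
        return ToyImmutableArray(
            new_value if m else v for v, m in zip(self._values, mask)
        )

    def to_list(self):
        return list(self._values)


def replace_pairwise(arr, to_replace):
    # Assumption of this sketch: masks are taken against the original values,
    # so one replacement never feeds into the next.
    original = arr
    scans = 0
    for old, new in to_replace.items():
        mask = original.eq_mask(old)  # one scan per pair
        scans += 1
        arr = arr.set_where(mask, new)  # one copy per pair
    return arr, scans


arr = ToyImmutableArray(range(10))
result, scans = replace_pairwise(arr, {i: -i for i in range(1, 4)})
print(result.to_list(), scans, ToyImmutableArray.copies)
```

With k entries in to_replace, this shape performs k full scans and k full copies, which is the pattern the profile below points at.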
In [17]: %prun -s cumtime for n in range(100): ser2.copy().replace(to_replace, inplace=True)
[...]
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 40.457 40.457 {built-in method builtins.exec}
1 0.003 0.003 40.457 40.457 <string>:1(<module>)
200/100 0.005 0.000 40.444 0.404 generic.py:7457(replace)
200 0.614 0.003 40.406 0.202 managers.py:309(apply)
100 0.001 0.000 40.403 0.404 managers.py:489(replace_list)
100 0.063 0.001 39.787 0.398 blocks.py:730(replace_list)
10100 0.037 0.000 22.112 0.002 blocks.py:784(<genexpr>)
10000 4.480 0.000 22.075 0.002 missing.py:67(mask_missing)
20000 0.048 0.000 21.941 0.001 compute.py:231(wrapper)
20000 21.860 0.001 21.884 0.001 {method 'call' of 'pyarrow._compute.Function' objects}
10000 0.021 0.000 17.562 0.002 blocks.py:835(_replace_coerce)
10000 0.059 0.000 17.474 0.002 blocks.py:576(replace)
10000 0.027 0.000 16.248 0.002 putmask.py:29(putmask_inplace)
10000 0.090 0.000 16.191 0.002 array.py:1570(__setitem__)
10000 0.051 0.000 15.533 0.002 array.py:1849(_replace_with_mask)
10000 0.080 0.000 10.005 0.001 array.py:1208(to_numpy)
10000 9.853 0.001 9.853 0.001 {method 'to_numpy' of 'pyarrow.lib.ChunkedArray' objects}
10000 0.018 0.000 6.916 0.001 common.py:71(new_method)
10000 0.012 0.000 6.874 0.001 arraylike.py:38(__eq__)
10000 0.045 0.000 6.862 0.001 array.py:609(_cmp_method)
In [18]: %prun -s cumtime for n in range(100): ser.copy().replace(to_replace, inplace=True)
[...]
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 15.838 15.838 {built-in method builtins.exec}
1 0.057 0.057 15.838 15.838 <string>:1(<module>)
200 0.577 0.003 15.736 0.079 managers.py:309(apply)
200/100 0.005 0.000 15.653 0.157 generic.py:7457(replace)
100 0.001 0.000 15.616 0.156 managers.py:489(replace_list)
100 0.039 0.000 15.039 0.150 blocks.py:730(replace_list)
10100 0.044 0.000 8.166 0.001 blocks.py:784(<genexpr>)
10000 7.466 0.001 8.122 0.001 missing.py:67(mask_missing)
10000 0.012 0.000 6.721 0.001 blocks.py:835(_replace_coerce)
10000 0.041 0.000 6.665 0.001 blocks.py:576(replace)
10000 0.022 0.000 5.530 0.001 putmask.py:29(putmask_inplace)
We'll need to investigate the mask_missing part separately. To address the setitem/copy cost, I think we need to make replace an EA method. That way ArrowEA can make a single copy at the end instead of one at each step. It would also remove the need to special-case Categorical.
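As a rough illustration of the "one copy at the end" idea, here is a plain-Python sketch. It is not a proposed implementation: replace_single_pass is a hypothetical helper, it assumes hashable values, and a real EA method would work on the array's own storage rather than a list.

```python
def replace_single_pass(values, to_replace):
    # Hypothetical helper: look each element up once and build exactly one
    # new buffer, instead of one scan plus one copy per (old, new) pair.
    # Lookups use the original element, so replacements do not chain.
    return [to_replace.get(v, v) for v in values]


replaced = replace_single_pass(range(6), {1: -1, 2: -2})
unchained = replace_single_pass([1, 2, 3], {1: 2, 2: 3})
print(replaced, unchained)
```

The point of the sketch is only the shape of the work: the number of full-array copies no longer grows with the number of entries in to_replace.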
Expected Behavior
NA
Installed Versions
Replace this line with the output of pd.show_versions()