REF/PERF: replace with ArrowEA #53711

Open
@jbrockmendel

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

ser = pd.Series(range(10**6))
ser2 = ser.astype("int64[pyarrow]")

to_replace = {i: -i for i in range(100)}

%timeit ser.copy().replace(to_replace, inplace=True)
151 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit ser2.copy().replace(to_replace, inplace=True)
422 ms ± 4.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# the copy() alone is 1.51ms for ser and 17.2µs for ser2

Issue Description

ATM replace goes through Block.replace, which calls the EA's __eq__ (actually missing.mask_missing) and the EA's __setitem__ in a loop, once per key in to_replace. This gets expensive for ArrowEA, in part because __setitem__ makes a copy on each call, and (more surprising to me) because mask_missing is slow.

In [17]: %prun -s cumtime for n in range(100): ser2.copy().replace(to_replace, inplace=True)
[...]
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   40.457   40.457 {built-in method builtins.exec}
        1    0.003    0.003   40.457   40.457 <string>:1(<module>)
  200/100    0.005    0.000   40.444    0.404 generic.py:7457(replace)
      200    0.614    0.003   40.406    0.202 managers.py:309(apply)
      100    0.001    0.000   40.403    0.404 managers.py:489(replace_list)
      100    0.063    0.001   39.787    0.398 blocks.py:730(replace_list)
    10100    0.037    0.000   22.112    0.002 blocks.py:784(<genexpr>)
    10000    4.480    0.000   22.075    0.002 missing.py:67(mask_missing)
    20000    0.048    0.000   21.941    0.001 compute.py:231(wrapper)
    20000   21.860    0.001   21.884    0.001 {method 'call' of 'pyarrow._compute.Function' objects}
    10000    0.021    0.000   17.562    0.002 blocks.py:835(_replace_coerce)
    10000    0.059    0.000   17.474    0.002 blocks.py:576(replace)
    10000    0.027    0.000   16.248    0.002 putmask.py:29(putmask_inplace)
    10000    0.090    0.000   16.191    0.002 array.py:1570(__setitem__)
    10000    0.051    0.000   15.533    0.002 array.py:1849(_replace_with_mask)
    10000    0.080    0.000   10.005    0.001 array.py:1208(to_numpy)
    10000    9.853    0.001    9.853    0.001 {method 'to_numpy' of 'pyarrow.lib.ChunkedArray' objects}
    10000    0.018    0.000    6.916    0.001 common.py:71(new_method)
    10000    0.012    0.000    6.874    0.001 arraylike.py:38(__eq__)
    10000    0.045    0.000    6.862    0.001 array.py:609(_cmp_method)

In [18]: %prun -s cumtime for n in range(100): ser.copy().replace(to_replace, inplace=True)
[...]
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   15.838   15.838 {built-in method builtins.exec}
        1    0.057    0.057   15.838   15.838 <string>:1(<module>)
      200    0.577    0.003   15.736    0.079 managers.py:309(apply)
  200/100    0.005    0.000   15.653    0.157 generic.py:7457(replace)
      100    0.001    0.000   15.616    0.156 managers.py:489(replace_list)
      100    0.039    0.000   15.039    0.150 blocks.py:730(replace_list)
    10100    0.044    0.000    8.166    0.001 blocks.py:784(<genexpr>)
    10000    7.466    0.001    8.122    0.001 missing.py:67(mask_missing)
    10000    0.012    0.000    6.721    0.001 blocks.py:835(_replace_coerce)
    10000    0.041    0.000    6.665    0.001 blocks.py:576(replace)
    10000    0.022    0.000    5.530    0.001 putmask.py:29(putmask_inplace)
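
To make the cost pattern concrete, here is a toy model of the per-key loop described above. It is only a sketch: ImmutableArray, eq_mask and set_where are hypothetical names, not pandas or pyarrow code, and the model assumes that all masks are computed against the original values and that every masked write rebuilds the whole buffer.

```python
# Toy model only: ImmutableArray is a hypothetical stand-in,
# not pandas or pyarrow code.
class ImmutableArray:
    """Array-like with immutable storage, so every masked write copies."""

    copies = 0  # class-level counter of full-buffer rebuilds

    def __init__(self, values):
        self._data = tuple(values)

    def eq_mask(self, key):
        # One full scan per key, analogous to building a boolean mask.
        return [v == key for v in self._data]

    def set_where(self, mask, value):
        # Immutable storage: rebuild the whole buffer for each write.
        ImmutableArray.copies += 1
        self._data = tuple(value if m else v for v, m in zip(self._data, mask))

    def to_list(self):
        return list(self._data)


def replace_per_key(arr, to_replace):
    """One mask and one masked write per key (assumed model of the loop)."""
    masks = [arr.eq_mask(key) for key in to_replace]
    for mask, new in zip(masks, to_replace.values()):
        if any(mask):
            arr.set_where(mask, new)
    return arr


arr = ImmutableArray(range(1000))
to_replace = {i: -i for i in range(100)}
replace_per_key(arr, to_replace)
print(ImmutableArray.copies)  # one full copy per matching key
```

In this model the number of full scans and full copies both grow linearly with len(to_replace), which matches the shape of the profile (10000 mask_missing calls and 10000 __setitem__ calls for 100 replacements over 100 runs).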

The mask_missing part we'll need to investigate. To address the setitem/copy perf, I think we need to make replace an EA method. That way ArrowEA can do a single copy at the end instead of a copy at each step.

This will also mean the existing special-casing for Categorical is no longer special: it becomes just another EA-level implementation of replace.
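
As a rough illustration of the "one copy at the end" idea, the sketch below gives a toy array its own replace method that applies all replacements in a single pass. ToyArray and its replace signature are hypothetical and are not an existing or proposed pandas ExtensionArray API; the sketch also assumes the values are hashable and that lookups are made against the original values.

```python
# Hypothetical sketch: ToyArray and replace() are illustrative names,
# not an existing pandas ExtensionArray API.
class ToyArray:
    copies = 0  # class-level counter of new buffers built by replace()

    def __init__(self, values):
        self._data = tuple(values)

    def replace(self, to_replace):
        """Apply every replacement in one pass, building one new buffer."""
        ToyArray.copies += 1
        # dict.get falls back to the original value when there is no match.
        return ToyArray(to_replace.get(v, v) for v in self._data)

    def to_list(self):
        return list(self._data)


arr = ToyArray(range(1000))
to_replace = {i: -i for i in range(100)}
result = arr.replace(to_replace)
print(ToyArray.copies, result.to_list()[:5])
```

Because the array owns the whole operation, it can choose how to match and when to materialize the result, rather than being driven through one __eq__ and one __setitem__ per key from the outside.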

Expected Behavior

NA


Labels: Arrow (pyarrow functionality), Performance (memory or execution speed performance), replace (replace method)
