ENH: Support min/max on ArrowStringArray #42597

Closed
@mrocklin

Motivation

In order for Dask to perform large shuffles (set_index, join on a non-index column, ...) on a column it needs to be able to compute quantiles.

To do this it is useful to compute min/max values.

What actually breaks

When I try this on a column of dtype string[pyarrow], I get the following exception:

import pandas as pd
s = pd.Series(["a", "b", "c"]).astype("string[pyarrow]")
s.min()
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
  10825         )
  10826         def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10827             return NDFrame.min(self, axis, skipna, level, numeric_only, **kwargs)
  10828 
  10829         setattr(cls, "min", min)

~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
  10348 
  10349     def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10350         return self._stat_function(
  10351             "min", nanops.nanmin, axis, skipna, level, numeric_only, **kwargs
  10352         )

~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
  10343                 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
  10344             )
> 10345         return self._reduce(
  10346             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  10347         )

~/miniconda/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   4380         if isinstance(delegate, ExtensionArray):
   4381             # dispatch to ExtensionArray interface
-> 4382             return delegate._reduce(name, skipna=skipna, **kwds)
   4383 
   4384         else:

~/miniconda/lib/python3.8/site-packages/pandas/core/arrays/string_arrow.py in _reduce(self, name, skipna, **kwargs)
    377     def _reduce(self, name: str, skipna: bool = True, **kwargs):
    378         if name in ["min", "max"]:
--> 379             return getattr(self, name)(skipna=skipna)
    380 
    381         raise TypeError(f"Cannot perform reduction '{name}' with string dtype")

AttributeError: 'ArrowStringArray' object has no attribute 'min'
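
The last frame of the traceback explains why this surfaces as an AttributeError rather than the TypeError raised for unsupported reductions: _reduce forwards "min" and "max" to a method of the same name, and that method is missing. Below is a minimal, dependency-free sketch of that dispatch pattern. The class names SketchStringArray and SketchStringArrayWithMinMax and the helper _non_missing are hypothetical stand-ins for illustration, not pandas code, and the skipna handling is an assumption about the desired semantics.

```python
class SketchStringArray:
    """Hypothetical stand-in for an extension array; not pandas code."""

    def __init__(self, values):
        self._values = list(values)

    def _reduce(self, name, skipna=True, **kwargs):
        # Same shape as the _reduce shown in the traceback: "min" and "max"
        # are forwarded to a method of that name, anything else is rejected.
        if name in ["min", "max"]:
            return getattr(self, name)(skipna=skipna)
        raise TypeError(f"Cannot perform reduction '{name}' with string dtype")


class SketchStringArrayWithMinMax(SketchStringArray):
    """Same dispatcher, plus the two methods it expects to find."""

    def _non_missing(self, skipna):
        present = [v for v in self._values if v is not None]
        if not skipna and len(present) != len(self._values):
            # Assumed semantics: a missing value propagates when skipna=False.
            return None
        return present

    def min(self, skipna=True):
        present = self._non_missing(skipna)
        return min(present) if present else None

    def max(self, skipna=True):
        present = self._non_missing(skipna)
        return max(present) if present else None
```

With the first class, _reduce("min") fails inside getattr, which mirrors the reported error. With the second class the same dispatcher works unchanged, which suggests the fix only needs to add the two methods.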

Solution

I am hopeful that Arrow already has a min/max implementation and that it just hasn't been hooked up yet.

Metadata

    Labels

    Arrow (pyarrow functionality), Bug, Reduction Operations (sum, mean, min, max, etc.), Strings (String extension data type and string data)
