Closed
Description
Motivation
For Dask to perform large shuffles (set_index, join on a non-index column, ...) on a column, it needs to be able to compute quantiles of that column.
Computing quantiles in turn relies on being able to compute min/max values.
What actually breaks
When I try to compute the minimum of a column of dtype string[pyarrow], I get the following exception:
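import pandas as pd
# Minimal reproducer: a small Series backed by ArrowStringArray
s = pd.Series(["a", "b", "c"]).astype("string[pyarrow]")
s.min()  # raises the AttributeError shown in the traceback below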
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
10825 )
10826 def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10827 return NDFrame.min(self, axis, skipna, level, numeric_only, **kwargs)
10828
10829 setattr(cls, "min", min)
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in min(self, axis, skipna, level, numeric_only, **kwargs)
10348
10349 def min(self, axis=None, skipna=None, level=None, numeric_only=None, **kwargs):
> 10350 return self._stat_function(
10351 "min", nanops.nanmin, axis, skipna, level, numeric_only, **kwargs
10352 )
~/miniconda/lib/python3.8/site-packages/pandas/core/generic.py in _stat_function(self, name, func, axis, skipna, level, numeric_only, **kwargs)
10343 name, axis=axis, level=level, skipna=skipna, numeric_only=numeric_only
10344 )
> 10345 return self._reduce(
10346 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
10347 )
~/miniconda/lib/python3.8/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
4380 if isinstance(delegate, ExtensionArray):
4381 # dispatch to ExtensionArray interface
-> 4382 return delegate._reduce(name, skipna=skipna, **kwds)
4383
4384 else:
~/miniconda/lib/python3.8/site-packages/pandas/core/arrays/string_arrow.py in _reduce(self, name, skipna, **kwargs)
377 def _reduce(self, name: str, skipna: bool = True, **kwargs):
378 if name in ["min", "max"]:
--> 379 return getattr(self, name)(skipna=skipna)
380
381 raise TypeError(f"Cannot perform reduction '{name}' with string dtype")
AttributeError: 'ArrowStringArray' object has no attribute 'min'
Solution
I am hopeful that Arrow already has a min/max implementation and that it just hasn't been hooked up to ArrowStringArray yet.