WIP [DOC/EA]: developer docs for implementing Series.round/sum/etc in EA #26918

Closed
wants to merge 31 commits into from
149 changes: 149 additions & 0 deletions doc/source/development/extending.rst
@@ -208,6 +208,155 @@ will
2. call ``result = op(values, ExtensionArray)``
3. re-box the result in a ``Series``

:class:`~pandas.api.extensions.ExtensionArray` Series Operations Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. versionadded:: 0.25.0

In addition to operators like ``__mul__`` and ``__add__``, the pandas Series
namespace provides a long list of useful operations such as :meth:`Series.round`,
:meth:`Series.sum`, :meth:`Series.abs`, etc. Some of these are handled by
pandas' own algorithm implementations (via a dispatch function), while others
simply call an equivalent numpy function with data from the underlying array.
In order to support these operations in a new ExtensionArray, you must provide
an implementation for them.

As of 0.25.0, pandas provides its own implementations for some
reduction operations such as ``min``, ``max``, and ``sum``. For your ExtensionArray
to support these methods, it must include an implementation of
:meth:`ExtensionArray._reduce`. See its docstring for a complete list of
the Series operations it handles. Once your EA implements
:meth:`ExtensionArray._reduce`, your implementation will be called
whenever one of the related Series methods is called. All of these
methods are reduction functions, and so are expected to return a scalar value
of some type.
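
As a rough illustration, a ``_reduce`` implementation might look like the
sketch below. ``MyArray`` is a hypothetical stand-in class (a real extension
array would subclass :class:`~pandas.api.extensions.ExtensionArray`), and the
float64 backing ndarray, the use of NaN as the missing value, and the set of
supported reduction names are all assumptions made for the example.

.. code-block:: python

    import numpy as np


    class MyArray:
        """Hypothetical stand-in for an EA backed by a float64 ndarray."""

        # Reductions this sketch chooses to support (an assumption,
        # not the list from the pandas docstring).
        _REDUCTIONS = {"sum": np.sum, "min": np.min, "max": np.max, "mean": np.mean}

        def __init__(self, values):
            self._data = np.asarray(values, dtype="float64")

        def _reduce(self, name, skipna=True, **kwargs):
            # ``name`` is the name of the Series method, e.g. "sum".
            if name not in self._REDUCTIONS:
                raise TypeError(f"cannot perform {name!r} with this array type")
            data = self._data
            if skipna:
                # Drop missing values (NaN in this sketch) before reducing.
                data = data[~np.isnan(data)]
            # Extra keyword arguments are ignored in this sketch.
            return self._REDUCTIONS[name](data)


    arr = MyArray([1.0, np.nan, 3.0])
    total = arr._reduce("sum")                          # NaN is skipped
    total_with_nan = arr._reduce("sum", skipna=False)   # NaN propagates

Raising ``TypeError`` for unsupported names is a design choice of this sketch;
the important part is that each supported name returns a scalar.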

Series operations which are not handled by :meth:`ExtensionArray._reduce`,
such as :meth:`Series.round`, will generally invoke an equivalent numpy
function with your extension array as the argument. Pandas only guarantees
that your array will be passed to a numpy function; it does not dictate
how your ExtensionArray should interact with numpy's dispatch logic
in order to achieve its goal, since there are several alternative ways
of achieving similar results.

For the most basic support, the default implementation of :meth:`ExtensionArray.__array__`
will transparently convert your EA to a numpy object array. You can also
override it to return any numpy array which suits your case. However,
this solution usually falls short, because any Series method you then
use casts your EA into an object ndarray, while you usually want the
result to remain an instance of your EA.
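
The limitation can be seen in a small sketch. ``MyArray`` is again a
hypothetical stand-in rather than a real ExtensionArray subclass, and its
``__array__`` override returning the float64 backing data is an assumption
chosen for the example.

.. code-block:: python

    import numpy as np


    class MyArray:
        """Hypothetical stand-in for an EA backed by a float64 ndarray."""

        def __init__(self, values):
            self._data = np.asarray(values, dtype="float64")

        def __array__(self, dtype=None, copy=None):
            # numpy calls this whenever it needs a plain ndarray.
            return np.asarray(self._data, dtype=dtype)


    floored = np.floor(MyArray([1.5, 2.5]))
    # The computation succeeds, but the result is a plain ndarray,
    # not a MyArray.
    result_type = type(floored)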

In most cases, you will want to provide your own implementations of the
methods. This takes more work, but does a proper job of maintaining the
ExtensionArray's dtype through operations. Understanding how to do this
requires a more detailed understanding of how numpy functions operate on
non-ndarray objects.

Just as pandas handles some operations via :meth:`ExtensionArray._reduce`
and others by delegating to numpy, numpy makes a distinction
between two types of operations: ufuncs (such as ``np.floor``, ``np.ceil``,
and ``np.abs``), and non-ufuncs (for example ``np.round`` and ``np.repeat``).

.. note::

   Although your methods will override numpy's own methods, they
   are *not* required to return numpy arrays or builtin python types. In
   fact, you will often want your method to return a new instance of your
   :class:`pandas.api.extensions.ExtensionArray` as the return value.

We will deal with ufuncs first. You can find a list of numpy's ufuncs here
(TBD). In order to support numpy ufuncs, a convenient approach is to implement
numpy's ``__array_ufunc__`` interface, specified in
`NEP-13 <https://www.numpy.org/neps/nep-0013-ufunc-overrides.html>`__.
If your ExtensionArray implements a compliant ``__array_ufunc__`` interface,
then when a numpy ufunc such as ``np.floor`` is invoked on your array, its
implementation of ``__array_ufunc__`` will be called first and given the
opportunity to compute the function. The return value needn't be a numpy
ndarray (though it can be). In general, you want the return value to be an
instance of your ExtensionArray.
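
A minimal, generic ``__array_ufunc__`` might look like the following sketch.
``MyArray`` is a hypothetical stand-in class with an assumed float64 backing
ndarray. The sketch handles only plain ufunc calls without an ``out``
argument and defers everything else by returning ``NotImplemented``.

.. code-block:: python

    import numpy as np


    class MyArray:
        """Hypothetical stand-in for an EA backed by a float64 ndarray."""

        def __init__(self, values):
            self._data = np.asarray(values, dtype="float64")

        def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
            # Only handle direct calls such as np.floor(arr); decline
            # reductions, accumulations and calls that pass ``out``.
            if method != "__call__" or kwargs.get("out") is not None:
                return NotImplemented
            # Unwrap any MyArray inputs to their backing ndarrays.
            arrays = [x._data if isinstance(x, MyArray) else x for x in inputs]
            result = ufunc(*arrays, **kwargs)
            # Re-wrap so that the array type survives the operation.
            if isinstance(result, tuple):
                return tuple(MyArray(r) for r in result)
            return MyArray(result)


    floored = np.floor(MyArray([1.5, -1.5, 2.0]))
    shifted = np.add(MyArray([1.0, 2.0]), 1.0)

Because the same method receives every ufunc, one implementation of this kind
can cover many functions without a special case for each.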

With ufuncs out of the way, we turn to the remaining numpy operations, such as
``np.round``. The simplest way to support these operations is to
implement a compatible method on your ExtensionArray. For example, suppose
your ExtensionArray has a compatible ``round`` method. When
:meth:`Series.round` is called, it in turn calls ``np.round(self.array)``,
passing your EA into numpy's dispatch logic. Numpy will detect that your EA
implements a compatible ``round`` method and use it instead of its own
version. As in the ufunc case, your implementation will perform
the calculation on its internal data, and then usually wrap the
result in a new instance of your EA class and return that as the result.

It is usually possible to write generic code to handle most ufuncs,
instead of providing a special case for each. For an example, see TBD.

.. important::

   When providing implementations of numpy functions such as ``np.round``,
   you must ensure that the method signature is compatible with the numpy function
   it implements. If the signatures do not match, numpy will ignore your method.

   For example, the signature for ``np.round`` is ``np.round(a, decimals=0, out=None)``.
   If you implement a ``round`` method which omits the ``out`` keyword,

   .. code-block:: python

      def round(self, decimals=0):
          pass


   \... numpy will ignore it. The following will work, however:

   .. code-block:: python

      def round(self, decimals=0, **kwds):
          pass
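
The effect of the signature can be checked with a small experiment. The
classes below are hypothetical stand-ins, not real ExtensionArray subclasses,
and the ``__array__`` method is included only so that numpy has something to
fall back on. The behaviour shown reflects how ``np.round`` forwards
``out=None`` to the method in the numpy versions the author of this sketch is
aware of; treat it as an illustration rather than a guarantee.

.. code-block:: python

    import numpy as np


    class _Base:
        """Hypothetical stand-in for an EA backed by a float64 ndarray."""

        def __init__(self, values):
            self._data = np.asarray(values, dtype="float64")

        def __array__(self, dtype=None, copy=None):
            # Used by numpy when it falls back to a plain ndarray.
            return np.asarray(self._data, dtype=dtype)


    class StrictRound(_Base):
        def round(self, decimals=0):
            # Rejects the ``out`` keyword that np.round passes along.
            return type(self)(np.round(self._data, decimals))


    class TolerantRound(_Base):
        def round(self, decimals=0, **kwds):
            # Accepts (and here ignores) ``out`` and any other keywords.
            return type(self)(np.round(self._data, decimals))


    strict_result = np.round(StrictRound([1.26, 2.04]), 1)
    tolerant_result = np.round(TolerantRound([1.26, 2.04]), 1)

Here ``tolerant_result`` keeps its class, while ``strict_result`` comes back
as a plain ndarray because the incompatible method was skipped.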


A second possible approach to implementing individual operations is to override
``__getattr__`` in your ExtensionArray, and to intercept requests for method
names which you wish to support (such as ``round``). For most functions,
you can return a dynamically generated function, which simply calls
the numpy function on your existing backing numeric array, wraps
the result in your ExtensionArray, and returns it. This approach can
reduce boilerplate significantly, but you do have to maintain a whitelist,
and you may need more than one case, depending on the signature.
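
One way the ``__getattr__`` approach could be sketched is shown here.
``MyArray``, the ``_FORWARDED`` whitelist and the decision to drop an unset
``out`` argument are all assumptions of the example. The sketch covers only
functions whose result has the same kind of data as the input.

.. code-block:: python

    import numpy as np


    class MyArray:
        """Hypothetical stand-in for an EA backed by a float64 ndarray."""

        # Whitelist of numpy function names forwarded to the backing array.
        _FORWARDED = {"round", "cumsum"}

        def __init__(self, values):
            self._data = np.asarray(values, dtype="float64")

        def __getattr__(self, name):
            # Only called when normal attribute lookup fails.
            if name not in type(self)._FORWARDED:
                raise AttributeError(name)
            np_func = getattr(np, name)

            def method(*args, **kwargs):
                # numpy passes out=None along; drop it when unset.
                if kwargs.get("out") is None:
                    kwargs.pop("out", None)
                # Compute on the backing data, then re-wrap the result.
                return type(self)(np_func(self._data, *args, **kwargs))

            return method


    rounded = np.round(MyArray([1.26, 2.04]), 1)
    running = np.cumsum(MyArray([1.0, 2.0, 3.0]))

Functions with unusual signatures or return types would need their own case,
which is the maintenance cost mentioned above.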

A third possible approach is to use the ``__array_function__`` mechanism
introduced by numpy's
`NEP-18 <https://www.numpy.org/neps/nep-0018-array-function-protocol.html>`__
proposal. NEP-18 is an experimental mechanism introduced in numpy 1.16, and is
enabled by default starting with numpy 1.17 (to enable it in 1.16, you must
set the environment variable ``NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1`` in your
shell). NEP-18 is an "opt-in, all-in" solution, meaning that if you choose to
make use of it in your class by implementing the ``__array_function__``
interface, it will always be used when (non-ufunc) numpy functions are called
with an instance of your EA as the argument. Numpy will not make use of an ``__array__``
method if you have one. If you include both an ``__array_function__`` and an
implementation of ``round``, for example, numpy will always invoke ``__array_function__``
when ``np.round`` is passed an instance of your EA.
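
A minimal sketch of the ``__array_function__`` approach follows, assuming
numpy 1.17 or later. ``MyArray``, the ``HANDLED`` registry and the
``implements`` decorator are hypothetical names introduced for the example
(the registry pattern mirrors the one described in NEP-18).

.. code-block:: python

    import numpy as np

    # Maps numpy functions to the implementations MyArray provides.
    HANDLED = {}


    def implements(np_func):
        """Register an implementation of ``np_func`` for MyArray."""
        def decorator(func):
            HANDLED[np_func] = func
            return func
        return decorator


    class MyArray:
        """Hypothetical stand-in for an EA backed by a float64 ndarray."""

        def __init__(self, values):
            self._data = np.asarray(values, dtype="float64")

        def __array_function__(self, func, types, args, kwargs):
            # Decline anything that has no registered implementation.
            if func not in HANDLED:
                return NotImplemented
            return HANDLED[func](*args, **kwargs)


    @implements(np.round)
    def _round(a, decimals=0, out=None):
        # Compute on the backing data and re-wrap the result.
        return MyArray(np.round(a._data, decimals))


    rounded = np.round(MyArray([1.26, 2.04]), 1)

Calling an unregistered function such as ``np.repeat`` on this class raises a
``TypeError``, which is the "all-in" side of the protocol.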

.. important::

   Even if you choose to implement ``__array_function__``, you still need to
   implement ``__array_ufunc__`` in order to override ufuncs. Each of these
   two interfaces covers a separate portion of numpy's functionality.


With this overview in hand, you should have the information you need
to develop rich, full-featured ExtensionArrays that plug seamlessly into pandas.
EA support is still being actively worked on, so if you encounter a bug, or behaviour
that differs from what is described here, please report it to the team.

.. important::

   You are not required to provide implementations for the full complement of Series
   operations in your ExtensionArray. In fact, some of them may not even make sense
   within its context. You may also choose to add implementations incrementally,
   as the need arises.


Formatting Extension Arrays
^^^^^^^^^^^^^^^^^^^^^^^^^^^

TBD

.. _extending.extension.testing:

Testing Extension Arrays