-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
WIP [DOC/EA]: developer docs for implementing Series.round/sum/etc in EA #26918
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 3 commits
0ea3548
d137a10
92d4b9b
d1bf105
81cd3cf
cb3fd56
c90c597
ee3ad20
901579e
1e253c7
696b320
e70dbb9
ab418c1
d5db7e6
d7ebacf
0fec603
54b7822
d70ffec
1296267
b8187b4
387fd68
b800b67
be69f0f
4d948fa
eef58bc
068ff61
4098825
2d94b82
5a9125b
6866b66
1f689c1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -208,6 +208,194 @@ will | |
2. call ``result = op(values, ExtensionArray)`` | ||
3. re-box the result in a ``Series`` | ||
|
||
:class:`~pandas.api.extensions.ExtensionArray` Series Operations Support | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
.. versionadded:: 0.25.0 | ||
|
||
In addition to operators like `__mul__` and `__add__`, the pandas Series | ||
namespace provides a long list of useful operations such as :meth:`Series.round`, | ||
:meth:`Series.sum`, :meth:`Series.abs`, etc'. Some of these are handled by | ||
pandas own algorithm implementations (via a dispatch function), while others | ||
simply call an equivalent numpy function with data from the underlying array. | ||
In order to support this operations in a new ExtensionArray, you must provide | ||
an implementation for them. | ||
|
||
.. note:: | ||
|
||
There is a third category of operation which live on the `pandas` | ||
namespace, for example `:meth:pd.concat`. There is an equivalent | ||
numpy function `:meth:np.concatenate`, but this is not used. n | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
general, these function should just work with your EA, you do not | ||
need to impelment more than the general EA interface. | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
|
||
As of 0.25.0, the list of series operations which pandas' provides its own | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
implementations are: :meth:`Series.any`, :meth:`Series.all`, | ||
:meth:`Series.min`, :meth:`Series.max`, :meth:`Series.sum`, | ||
:meth:`Series.mean`, :meth:`Series.median`, :meth:`Series.prod` | ||
(and its alias :meth:`Series.product`), :meth:`Series.std`, | ||
:meth:`Series.var`, :meth:`Series.sem`, | ||
:meth:`Series.kurt`, and :meth:`Series.skew`. | ||
|
||
In order to implement any of this functions, your ExtensionArray must include | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
an Implementation of :meth:`ExtensionArray._reduce`. Once you provide an | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
implementation of :meth:`ExtensionArray._reduce` which handles a particular | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
method, calling the method on the Series will invoke the implementation | ||
on your ExtensionArray. All these methods are reduction functions, and | ||
so are expected to return a scalr value of some type. However it is perfectly | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
acceptable to return some instance of an :class:`pandas.api.extensions.ExtensionDtype`. | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
|
||
Series operations which are not handled by :meth:`ExtensionArray._reduce`, | ||
such as :meth:`Series.round`, will generally invoke an equivalent numpy | ||
function with your extension array as the argument. Pandas only guarantees | ||
that your array will be passed to a numpy function, it does not dictate | ||
how your ExtensionArray should interact with numpy's dispatch logic | ||
in order to achieve its goal, since there are several alternative ways | ||
of achieving similar results. | ||
|
||
However, the details of numpy's dispatch logic are not entirely simple, and | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
there nuances which you should be aware of. For that reason, and in order to | ||
make it easier to create new pandas extensions, we will now cover some of | ||
possible approaches of dealing with numpy. | ||
|
||
The first alternative, and the simplest, is to simply provide an `__array__` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The base ExtensionArray class already provides a |
||
method for your ExtensionArray. This is a standard numpy function documented | ||
here (TBD), which must return a numpy array equivalent of your ExtensionArray. | ||
This will usually be an array whose dtype is `object` and whose values are | ||
instances of some class which your ExtensionArray wraps into an array. For | ||
example, the pandas tests include an ExtensionArray example called | ||
`DecimalArray`, if it used this method, its `__array__` method would return an | ||
ndarray of `decimal.Decimal` objects. | ||
|
||
Implementing `__array__` is easy, but it usually isn't satisfactory because | ||
it means most Series operations will return a Series of object dtype, instead | ||
of maintaining your ExtensionArray's dtype. | ||
|
||
The second approach is more involved, but it does a proper job of maintaining | ||
the ExtensionArray's dtype through operations. It requires a detailed | ||
understanding of how numpy functions operate on non ndarray objects. | ||
|
||
Just as pandas handles some operation via :meth:`ExtensionArray._reduce` | ||
and others by delegating to numpy, numpy makes a distinction between | ||
between two types of opersions: ufuncs (such as `np.floor`, `np.ceil`, | ||
and `np.abs`), and non-ufuncs (for example `np.round`, and `np.repeat`). | ||
|
||
We will deal with ufuncs first. You can find a list of numpy's ufuncs here | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
(TBD). In order to support numpy ufuncs, a convenient approach is to implement | ||
numpy's `__array_ufunc__` interface, specified in | ||
[NEP13](https://www.numpy.org/neps/nep-0013-ufunc-overrides.html). In brief, | ||
if your ExtensionArray implements a compliant `__array_ufunc__` interface, | ||
when a numpy ufunc such as `np.floor` is invoked on your array, its | ||
implementation of `__array_ufunc__` will bec called first and given the | ||
opportunity to compute the function. The return value needn't be a numpy | ||
ndarray (though it can be). In general, you want the return value to be an | ||
instance of your ExtensionArray. In some cases, your implementation can | ||
calculate the result itself (see for example TBD), or, if your ExtensionArray | ||
already has a numeric ndarray backing it, your implementation will itself | ||
invoke the numpy ufunc itself on it (see for example TBD). In either case, | ||
after computing the values, you will usually want to wrap the result as a new | ||
ExtensionArray instance and return it to the caller. Pandas will automatically | ||
use that Array as the backing ExtensionArray for a new Series object. | ||
|
||
.. note:: | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
Before [NEP13](https://www.numpy.org/neps/nep-0013-ufunc-overrides.html), | ||
numpy already provides a way of wrapping ufunc functions via `__array_prepare__` | ||
and `__array_wrap__`, as documented in the Numpy docuemntation section | ||
["Subclassing Numpy"](http://docs.python.org/doc/numpy/user/basics.subclassing.html). | ||
However, NEP13 seems to have largely superceded that mechanism. | ||
|
||
|
||
With ufuncs out of the way, we turn to the remaining numpy operations, such | ||
as `np.round`. The simplest way to support these operations is to simply | ||
implement a compatible method on your ExtensionArray. For example, if your | ||
ExtensionArray has a compatible `round` method on your ExtensionArray, | ||
When python involes `ser.round()`, :meth:``Series.round` will invoke | ||
`np.round(self.array)`, which will pass your ExtensionArray to the `np.round` | ||
method. Numpy will detect that your EA implements a compatible `round` | ||
and will invoke it to perform the operation. As in the ufunc case, | ||
your implemntation will generally perform the calculaion itself, | ||
or call numpy on its own acking numeric array, and in either case | ||
will wrap the result as a new instance of ExtensionArray and return that | ||
as a result. It is usually possible to write generic code to handle | ||
most ufuncs without having to provide a special case for each. For an example, see TBD. | ||
|
||
.. important:: | ||
|
||
When providing implementations of numpy functions such as `np.round`, | ||
It essential that function signature is compatible with the numpy original. | ||
Otherwise,, numpy will ignore it. | ||
|
||
For example, the signature for `np.round` is `np.round(a, decimals=0, out=None)`. | ||
if you implement a round function which omits the `out` keyword: | ||
|
||
.. code-block:: python | ||
|
||
def round(self, decimals=0): | ||
pass | ||
|
||
|
||
numpy will ignore it. The following will work however: | ||
|
||
.. code-block:: python | ||
|
||
def round(self, decimals=0, **kwds): | ||
pass | ||
|
||
An alternative approach to implementing individual functions, is to override | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wouldn't recommend this. |
||
`__getattr__` in your ExtensionArray, and to intercept requests for method | ||
names which you wish to support (such as `round`). For most functions, | ||
you can return a dynamiclly generated function, which simply calls | ||
the numpy function on your existing backing numeric array, wraps | ||
the result in your ExtensionArray, and returns it. This approach can | ||
reduce boilerplate significantly, but you do have to maintain a whitelist, | ||
and may require more than one case, based on signature. | ||
|
||
|
||
A third possible approach, is to use the `__array_function__` | ||
mechanism introduced by [NEP18](https://www.numpy.org/neps/nep-0018-array-function-protocol.html). | ||
This is an opt-in mechanism in numpy 1.16 (by setting an environment variable), and | ||
is enabled by default starting with numpy 1.17. As of 1.17 it is still considered | ||
experimental, and its design is actively being revised. We will not discuss it further | ||
here, but it is certainly possible to make use of it to achieve the same goal. Your | ||
mileage may vary. | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
|
||
.. important:: | ||
Implementing `__array_function__` is not a substitute for implementing `__array_ufunc__`. | ||
The `__array_function__` mechanism complements (and to a degree copies) the`__array_ufunc__` | ||
mechanism, by providing the same flexibility for non-ufuncs. | ||
|
||
.. important:: | ||
`__array_function__` is an "all-in" solution. That means that if you cannot mix it with | ||
explicit implementations for some methods and using `__array_function__` for some. | ||
If you both `__array_function__` and also provide an implementation of `round`, numpy | ||
will invoke `__array_function__` for all the operations in the specification, **including** | ||
`round`. | ||
|
||
With this overview in hand, you hopefully have the necessary information in order | ||
to develop rich, full-featured ExtensionArrays that seamlessly plug in to pandas. | ||
|
||
.. important:: | ||
You are not required to provide implementations for the full complemnt of Series | ||
operations in your ExtensionArray. In fact, some of them may not even make sense | ||
within your context. You amay also choose to ass implementations incrementally, | ||
as the need arised. | ||
|
||
TBD: should we have a standard way of signalling not supported instead of a | ||
random AttributeError exception being thrown. | ||
|
||
.. important:: | ||
|
||
The above description currently leads the state of the code considerably. Many Series | ||
This conversation was marked as resolved.
Show resolved
Hide resolved
|
||
methods need to be updated to conform to this model of EA support. If you find a | ||
bug, or something else which does not behave as described, please report it to | ||
the pandas team by opening an issue. | ||
|
||
|
||
Formatting Extension Arrays | ||
^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
TBD | ||
|
||
.. _extending.extension.testing: | ||
|
||
Testing Extension Arrays | ||
|
Uh oh!
There was an error while loading. Please reload this page.