Description
This issue is meant to summarize the current status and likely future direction of the NumPy array protocols, and their relevance to the array API standard.
What are these array protocols?
In summary, they are dispatching mechanisms that allow calling the public NumPy API with other numpy.ndarray-like arrays (e.g. CuPy or Dask arrays, or any other array that implements the protocols) and have the function call dispatch to that library. There are two protocols, __array_ufunc__ and __array_function__, that are very similar - the difference is that with __array_ufunc__ the library being dispatched to knows it's getting a ufunc and it can therefore make use of some properties all ufuncs have. The dispatching works the same for both protocols though.
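As a rough illustration of the two entry points, here is a minimal sketch of a duck array that implements both protocols. The class name LoggedArray, its log attribute, and the choice to support only plain ufunc calls and np.concatenate are assumptions made for this example; a real library would cover far more of the API.

```python
import numpy as np


class LoggedArray:
    """Hypothetical duck array that wraps an ndarray and records dispatches."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.log = []  # which NumPy callables were dispatched to this object

    @staticmethod
    def _unwrap(obj):
        # Replace LoggedArray instances with the ndarray they wrap.
        return obj.data if isinstance(obj, LoggedArray) else obj

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Here we know we received a ufunc, and how it is being used
        # ("__call__", "reduce", "accumulate", ...).
        self.log.append(f"ufunc:{ufunc.__name__}:{method}")
        if method != "__call__":
            return NotImplemented  # this sketch only handles plain calls
        result = ufunc(*[self._unwrap(x) for x in inputs], **kwargs)
        return LoggedArray(result)

    def __array_function__(self, func, types, args, kwargs):
        # Generic entry point for non-ufunc functions in the NumPy API.
        self.log.append(f"function:{func.__name__}")
        if func is not np.concatenate:
            return NotImplemented  # this sketch supports a single function
        arrays = [self._unwrap(a) for a in args[0]]
        return LoggedArray(np.concatenate(arrays, *args[1:], **kwargs))


a = LoggedArray([1.0, 2.0])
b = np.add(a, 1.0)          # routed through __array_ufunc__
c = np.concatenate([a, a])  # routed through __array_function__
```

In both calls the user writes ordinary NumPy code; the only difference on the receiving side is that __array_ufunc__ is also told which ufunc method was invoked.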
Why were they created?
__array_ufunc__ was created first; the original driver was to be able to call NumPy ufuncs on scipy.sparse matrices. __array_function__ was created later, to be able to cover most of the NumPy API (every function that takes an array as input) and use the NumPy API with other array/tensor implementations.
What is the current status?
The protocols have been adopted by:
- CuPy
- Dask
- Xarray
- MXNet
- PyData Sparse
- Pint
They have not (or not yet) been adopted by:
- TensorFlow (because no compatible API to dispatch to, interest of maintainers unclear)
- PyTorch (because no compatible API to dispatch to, maintainers do have interest)
- JAX (concerns about added value and backwards compatibility - see NEP 37 introduction)
- scipy.sparse (semantics not compatible)
The RAPIDS ecosystem, which builds on Dask and CuPy, has been particularly happy with these protocols and uses them heavily. RAPIDS developers have also run into some of the limitations, the most painful one being that array creation functions cannot be dispatched on.
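The creation-function limitation can be seen in a small sketch. The DuckArray class below is hypothetical and handles only np.zeros_like; the point is the contrast between a function that receives an array argument and one that does not.

```python
import numpy as np


class DuckArray:
    """Hypothetical minimal duck array; it handles np.zeros_like only."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func is np.zeros_like:
            return DuckArray(np.zeros_like(self.data))
        return NotImplemented


x = DuckArray([1, 2, 3])

# An array argument is present, so NumPy can find the override on it.
like_x = np.zeros_like(x)

# A creation function only receives a shape, so there is no array to
# dispatch on: the result is a plain ndarray regardless of what the
# caller is working with.
plain = np.zeros(3)
```

Code that needs a fresh array of the same kind as x therefore has to special-case the library, which is the pain point described above.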
What is likely to change in the near future?
There is still active exploration of new ideas and design alternatives to (or additions to) the array protocols. There are three main "contenders":
1. extend the protocols to cover the most painful shortcomings: NEP 30 (__duckarray__) + NEP 35 (like=).
2. use a separate module namespace: NEP 37 (__array_module__)
3. use a multiple dispatch library: NEP 31 (unumpy)
At the moment, the most likely outcome is doing both (1) and (2). It needs prototyping and testing though - any solution should only be accepted when it's clear that it not only solves the immediate pain points RAPIDS ran into, but also that libraries like scikit-learn and SciPy can then adopt it.
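To make the separate-namespace idea more concrete, here is a toy model of an __array_module__-style lookup in pure Python. It loosely follows the mechanism described in NEP 37 but is not NumPy code: get_array_module is a simplified stand-in written for this example, and default_module, duck_module, DuckArray and zeros_matching are all hypothetical names.

```python
from types import SimpleNamespace

# Stand-ins for the namespaces of two array libraries (hypothetical).
default_module = SimpleNamespace(name="default", zeros=lambda n: [0.0] * n)
duck_module = SimpleNamespace(name="duck", zeros=lambda n: DuckArray([0.0] * n))


class DuckArray:
    def __init__(self, values):
        self.values = list(values)

    def __array_module__(self, types):
        # Claim the call only if every participating type is one we know.
        if all(issubclass(t, (DuckArray, list, float, int)) for t in types):
            return duck_module
        return NotImplemented


def get_array_module(*arrays, default=default_module):
    """Simplified stand-in for the lookup function proposed in NEP 37."""
    types = {type(a) for a in arrays}
    implementers = [a for a in arrays if hasattr(type(a), "__array_module__")]
    for array in implementers:
        module = array.__array_module__(types)
        if module is not NotImplemented:
            return module
    if implementers:
        raise TypeError("no common array module found")
    return default


def zeros_matching(x, n):
    # Library code asks for a namespace once, then calls ordinary functions
    # on it, so array creation is covered as well.
    xp = get_array_module(x)
    return xp.zeros(n)
```

Because the caller ends up holding a whole namespace rather than overriding individual functions, creation functions need no special treatment in this design.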
What is the relationship of the array protocols with an API standard?
There are several connections:
- The original idea of __array_function__ (figure above) doesn't require an API that's the same as the NumPy one, but in practice the protocols can only be adopted when there's an API with matching signatures and semantics.
- The lack of an API standard has meant that it's hard to predict what NumPy functions will work for another array library that implements the protocols.
- The separate namespaces (__array_module__, unumpy) provide a good opportunity to introduce a new API standard once that's agreed on.
References
- NEP 13 - A Mechanism for Overriding Ufuncs
- NEP 18 - A dispatch mechanism for NumPy’s high level array functions
- NEP 22 - Duck typing for NumPy arrays – high level overview
- NEP 30 - https://numpy.org/neps/nep-0030-duck-array-protocol.html
- NEP 31 - Context-local and global overrides of the NumPy API
- NEP 35 - Array Creation Dispatching With __array_function__
- NEP 37 - A dispatch protocol for NumPy-like modules
- Meeting minutes of a recent conversation on NEPs 30, 31, 35 and 37