-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Support ExtensionArray operators via a mixin #21261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 5 commits
5b0ebc7
d7596c6
7f2b0a1
ec96841
a07bb49
1d7b2b3
7bad559
dfcda3b
aaaa8fd
4bcf978
f958d7b
ef83c3a
41dc5ca
be6656b
a0f503c
700d75b
87e8f55
97bd291
8fc93e4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,27 @@ New features | |
|
||
.. _whatsnew_0240.enhancements.other: | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add a ref here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
``ExtensionArray`` operator support | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
An ``ExtensionArray`` subclass consisting of objects that have arithmetic and | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you include most of this in It'd be good to mention the performance implications of using these mixins. |
||
comparison operators defined on the underlying objects can easily support | ||
those operators on the ``ExtensionArray``, and therefore the operators | ||
on ``Series`` built on those ``ExtensionArray`` classes will work as expected. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same comment as above (use bullets / numbers) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
|
||
Two new mixin classes, :class:`ExtensionScalarArithmeticMixin` and | ||
:class:`ExtensionScalarComparisonMixin`, support this capability. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. put a ref to the extending.rst doc section for this (e.g. see docs) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
If developing an ``ExtensionArray`` subclass, for example ``MyExtensionArray``, | ||
simply include ``ExtensionScalarArithmeticMixin`` and/or | ||
``ExtensionScalarComparisonMixin`` as parent classes of ``MyExtensionArray`` | ||
as follows: | ||
|
||
.. code-block:: python | ||
|
||
class MyExtensionArray(ExtensionArray, ExtensionScalarArithmeticMixin, | ||
ExtensionScalarComparisonMixin): | ||
|
||
|
||
Other Enhancements | ||
^^^^^^^^^^^^^^^^^^ | ||
- :func:`to_datetime` now supports the ``%Z`` and ``%z`` directive when passed into ``format`` (:issue:`13486`) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,5 @@ | ||
from .base import ExtensionArray # noqa | ||
from .base import (ExtensionArray, # noqa | ||
ExtensionScalarArithmeticMixin, | ||
ExtensionScalarComparisonMixin, | ||
ExtensionScalarOpsMixin) | ||
from .categorical import Categorical # noqa |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -7,8 +7,15 @@ | |
""" | ||
import numpy as np | ||
|
||
import operator | ||
|
||
from pandas.errors import AbstractMethodError | ||
from pandas.compat.numpy import function as nv | ||
from pandas.compat import set_function_name, PY3 | ||
from pandas.core.dtypes.common import ( | ||
is_extension_array_dtype, | ||
is_list_like) | ||
from pandas.core import ops | ||
|
||
_not_implemented_message = "{} does not implement {}." | ||
|
||
|
@@ -610,3 +617,124 @@ def _ndarray_values(self): | |
used for interacting with our indexers. | ||
""" | ||
return np.array(self) | ||
|
||
|
||
class ExtensionScalarOpsMixin(object): | ||
""" | ||
A base class for the mixins for different operators. | ||
Can also be used to define an individual method for a specific | ||
operator using the class method create_method() | ||
""" | ||
|
||
@classmethod | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. blank line above this for PEP8? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It passed the PEP8 tests....... But I will add it in. |
||
def _create_method(cls, op, coerce_to_dtype=True): | ||
""" | ||
A class method that returns a method that will correspond to an | ||
operator for an ExtensionArray subclass, by dispatching to the | ||
relevant operator defined on the individual elements of the | ||
ExtensionArray. | ||
|
||
Parameters | ||
---------- | ||
op: An operator that takes arguments op(a, b) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you use numpydoc standard formatting?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
coerce_to_dtype: boolean indicating whether to attempt to convert | ||
the result to the underlying ExtensionArray dtype | ||
(default True) | ||
|
||
Returns | ||
------- | ||
A method that can be bound to a method of a class | ||
|
||
Example | ||
------- | ||
Given an ExtensionArray subclass called MyClass, use | ||
|
||
>>> __add__ = ExtensionScalarOpsMixin.create_method(operator.add) | ||
|
||
in the class definition of MyClass to create the operator | ||
for addition. | ||
|
||
""" | ||
|
||
op_name = ops._get_op_name(op, False) | ||
|
||
def _binop(self, other): | ||
def convert_values(parm): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. minor detail, but I find 'values' (or 'param') is clearer than 'parm' There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changed |
||
if isinstance(parm, ExtensionArray) or is_list_like(parm): | ||
ovalues = parm | ||
elif is_extension_array_dtype(parm): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A Series is also 'list-like', so I think this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
ovalues = parm.values | ||
else: # Assume its an object | ||
ovalues = [parm] * len(self) | ||
return ovalues | ||
lvalues = convert_values(self) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should not be needed as the calling object will always be already an ExtensionArray? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. Fixed |
||
rvalues = convert_values(other) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should probably do alignment as well? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It doesn't need to, for a few reasons:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, OK, this is also tested? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jorisvandenbossche Yes, this is tested because I am using the same tests that are used for general operators on Series, which tests things that are misaligned, etc. See As an aside, doing it this way uncovered the issues with |
||
|
||
# If the operator is not defined for the underlying objects, | ||
# a TypeError should be raised | ||
res = [op(a, b) for (a, b) in zip(lvalues, rvalues)] | ||
|
||
if coerce_to_dtype: | ||
try: | ||
res = self._from_sequence(res) | ||
except TypeError: | ||
pass | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if this fails, I think we should still convert it to an array instead of keeping it as a list? Or does that happen on another level? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we convert to an array, then we could have a dtype problem. This allows the result to be of any type. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
It will be converted to an array anyhow, if not here, then at the level above when the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jorisvandenbossche But why repeat that logic? If we leave it as a list, then the Series constructor will do the inference on the dtype. |
||
|
||
return res | ||
|
||
name = '__{name}__'.format(name=op_name) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why are you defining create_method as anything other than NotImplementedError this is way too opionated - if u want to define a function to do this and an author can use this ok maybe if u want to create an ExtensionOpsMixinForScalars (long name though) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please take a closer look at the diff if you give such an opinionated (pun intended :-)) comment
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jorisvandenbossche maybe I wasn't clear.
I think the ops implementation is so important that this must be an explicit choice. The point of the mixin is to avoid the boilerplate of writing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Again, this is what this PR is doing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback @jorisvandenbossche Maybe the naming is misleading. So let me clarify the intent of the PR. The goal here is that if someone extends An example is The use of these 3 classes ( __add__ = ExtensionOpsBase.create_method(operator.add) IMHO, it seems that @jorisvandenbossche understands what I'm trying to do, but @jreback may not fully understand it. If you think there is a better name for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jorisvandenbossche I've tried a variety of implementations without the base class, and couldn't get it to work. I'll rename as you suggest. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. renaming looks ok to me There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Next commit will use the renaming There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed on the naming conventions. Maybe the NotImplemented versions of the methods belong on the base There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel The NotImplemented versions are there by default for any class, so I don't think we need to do anything. In other words, if you write a class There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you can also pass True to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changed |
||
return set_function_name(_binop, name, cls) | ||
|
||
|
||
class ExtensionScalarArithmeticMixin(ExtensionScalarOpsMixin): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. still not clear to why this inherits the way this is written it is entirely too error prone. You drop in these classes, then it might work incorrectly! e.g. maybe I happen to have a scalar ops defined. This should be very explict. I dont' see ANY advantage of defaulting to a scalar ops schema here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback I think I see what you are saying, so I have pushed a commit that I think does what you suggest. I've created infrastructure that allows a developer to define their own I hope this addresses your concern. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Personally, I find this much more complicated than what you had before, and I am also not really sure if it will be that useful in practice (i.e. the usecase of wanting to implement your own There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
@jreback again, 1) it's not defaulting for the generic ExtensionArray, the scalar ops is only implemented for this optional mixin 2) this is entirely the point of this PR. This PR is not intending to solve the general ops issue for ExtensionArrays, its purpose was to provide a easy solution for a very specific use case. |
||
"""A mixin for defining the arithmetic operations on an ExtensionArray | ||
class, where it assumed that the underlying objects have the operators | ||
already defined. | ||
|
||
Usage | ||
------ | ||
If you have defined a subclass MyClass(ExtensionArray), then | ||
use MyClass(ExtensionArray, ExtensionScalarArithmeticMixin) to | ||
get the arithmetic operators | ||
""" | ||
|
||
__add__ = ExtensionScalarOpsMixin._create_method(operator.add) | ||
__radd__ = ExtensionScalarOpsMixin._create_method(ops.radd) | ||
__sub__ = ExtensionScalarOpsMixin._create_method(operator.sub) | ||
__rsub__ = ExtensionScalarOpsMixin._create_method(ops.rsub) | ||
__mul__ = ExtensionScalarOpsMixin._create_method(operator.mul) | ||
__rmul__ = ExtensionScalarOpsMixin._create_method(ops.rmul) | ||
__pow__ = ExtensionScalarOpsMixin._create_method(operator.pow) | ||
__rpow__ = ExtensionScalarOpsMixin._create_method(ops.rpow) | ||
__mod__ = ExtensionScalarOpsMixin._create_method(operator.mod) | ||
__rmod__ = ExtensionScalarOpsMixin._create_method(ops.rmod) | ||
__floordiv__ = ExtensionScalarOpsMixin._create_method(operator.floordiv) | ||
__rfloordiv__ = ExtensionScalarOpsMixin._create_method(ops.rfloordiv) | ||
__truediv__ = ExtensionScalarOpsMixin._create_method(operator.truediv) | ||
__rtruediv__ = ExtensionScalarOpsMixin._create_method(ops.rtruediv) | ||
if not PY3: | ||
__div__ = ExtensionScalarOpsMixin._create_method(operator.div) | ||
__rdiv__ = ExtensionScalarOpsMixin._create_method(ops.rdiv) | ||
|
||
__divmod__ = ExtensionScalarOpsMixin._create_method(divmod) | ||
__rdivmod__ = ExtensionScalarOpsMixin._create_method(ops.rdivmod) | ||
|
||
|
||
class ExtensionScalarComparisonMixin(ExtensionScalarOpsMixin): | ||
"""A mixin for defining the comparison operations on an ExtensionArray | ||
class, where it assumed that the underlying objects have the operators | ||
already defined. | ||
|
||
Usage | ||
------ | ||
If you have defined a subclass MyClass(ExtensionArray), then | ||
use MyClass(ExtensionArray, ExtensionComparisonMixin) to | ||
get the comparison operators | ||
""" | ||
|
||
__eq__ = ExtensionScalarOpsMixin._create_method(operator.eq, False) | ||
__ne__ = ExtensionScalarOpsMixin._create_method(operator.ne, False) | ||
__lt__ = ExtensionScalarOpsMixin._create_method(operator.lt, False) | ||
__gt__ = ExtensionScalarOpsMixin._create_method(operator.gt, False) | ||
__le__ = ExtensionScalarOpsMixin._create_method(operator.le, False) | ||
__ge__ = ExtensionScalarOpsMixin._create_method(operator.ge, False) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2972,16 +2972,20 @@ def get_value(self, series, key): | |
# use this, e.g. DatetimeIndex | ||
s = getattr(series, '_values', None) | ||
if isinstance(s, (ExtensionArray, Index)) and is_scalar(key): | ||
# GH 20825 | ||
# GH 20882, 21257 | ||
# Unify Index and ExtensionArray treatment | ||
# First try to convert the key to a location | ||
# If that fails, see if key is an integer, and | ||
# If that fails, raise a KeyError if an integer | ||
# index, otherwise, see if key is an integer, and | ||
# try that | ||
try: | ||
iloc = self.get_loc(key) | ||
return s[iloc] | ||
except KeyError: | ||
if is_integer(key): | ||
if (len(self) > 0 and | ||
self.inferred_type in ['integer', 'boolean']): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What makes this necessary? I'm not clear on how it relates to the Mixin classes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For this you can comment on the relevant other PR (#21260) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel the other PR (#21260) was needed to make the tests here work right. |
||
raise | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pls move all non-ops code to another PR There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The same answer as above: this is needed for the test to pass untill the other PRs are merged (to say: it is already moved to separate PRs) |
||
elif is_integer(key): | ||
return s[key] | ||
|
||
s = com._values_from_object(series) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2203,23 +2203,22 @@ def _binop(self, other, func, level=None, fill_value=None): | |
result.name = None | ||
return result | ||
|
||
def combine(self, other, func, fill_value=np.nan): | ||
def combine(self, other, func, fill_value=None): | ||
""" | ||
Perform elementwise binary operation on two Series using given function | ||
with optional fill value when an index is missing from one Series or | ||
the other | ||
|
||
Parameters | ||
---------- | ||
other : Series or scalar value | ||
func : function | ||
Function that takes two scalars as inputs and return a scalar | ||
fill_value : scalar value | ||
|
||
The default specifies to use the appropriate NaN value for | ||
the underlying dtype of the Series | ||
Returns | ||
------- | ||
result : Series | ||
|
||
Examples | ||
-------- | ||
>>> s1 = Series([1, 2]) | ||
|
@@ -2228,26 +2227,36 @@ def combine(self, other, func, fill_value=np.nan): | |
0 0 | ||
1 2 | ||
dtype: int64 | ||
|
||
See Also | ||
-------- | ||
Series.combine_first : Combine Series values, choosing the calling | ||
Series's values first | ||
""" | ||
self_is_ext = is_extension_array_dtype(self.values) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would move all of this (combine stuff) to another PR on top of this one. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
if fill_value is None: | ||
fill_value = na_value_for_dtype(self.dtype, False) | ||
|
||
if isinstance(other, Series): | ||
new_index = self.index.union(other.index) | ||
new_name = ops.get_op_result_name(self, other) | ||
new_values = np.empty(len(new_index), dtype=self.dtype) | ||
for i, idx in enumerate(new_index): | ||
new_values = [] | ||
for idx in new_index: | ||
lv = self.get(idx, fill_value) | ||
rv = other.get(idx, fill_value) | ||
with np.errstate(all='ignore'): | ||
new_values[i] = func(lv, rv) | ||
new_values.append(func(lv, rv)) | ||
else: | ||
new_index = self.index | ||
with np.errstate(all='ignore'): | ||
new_values = func(self._values, other) | ||
new_values = [func(lv, other) for lv in self._values] | ||
new_name = self.name | ||
|
||
if self_is_ext and not is_categorical_dtype(self.values): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. don't use shorthands like this make this simpler, something like
much more readable There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Comment that in #21183 |
||
try: | ||
new_values = self._values._from_sequence(new_values) | ||
except TypeError: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why are you catching a TypeError? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is also for the other PR, but: because the result might not necessarily be an ExtensionArray. This is of course a bit an unclear area of |
||
pass | ||
|
||
return self._constructor(new_values, index=new_index, name=new_name) | ||
|
||
def combine_first(self, other): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this label belongs to the title below your added content
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed