ENH: Implement Kleene logic for BooleanArray #29842
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
.. currentmodule:: pandas | ||
|
||
.. ipython:: python | ||
:suppress: | ||
|
||
import pandas as pd | ||
import numpy as np | ||
|
||
.. _boolean: | ||
|
||
************************** | ||
Nullable Boolean Data Type | ||
************************** | ||
|
||
.. versionadded:: 1.0.0 | ||
|
||
.. _boolean.kleene: | ||
|
||
Kleene Logic | ||
|
||
------------ | ||
|
||
:class:`arrays.BooleanArray` implements Kleene logic (sometimes called three-value logic) for | ||
|
||
logical operations like ``&`` (and), ``|`` (or) and ``^`` (exclusive-or). | ||
|
||
This table demonstrates the results for every combination. These operations are symmetrical, | ||
so flipping the left- and right-hand side makes no difference in the result. | ||
|
||
================= ========= | ||
Expression Result | ||
================= ========= | ||
``True & True`` ``True`` | ||
``True & False`` ``False`` | ||
``True & NA`` ``NA`` | ||
``False & False`` ``False`` | ||
``False & NA`` ``False`` | ||
``NA & NA`` ``NA`` | ||
``True | True`` ``True`` | ||
``True | False`` ``True`` | ||
``True | NA`` ``True`` | ||
``False | False`` ``False`` | ||
``False | NA`` ``NA`` | ||
``NA | NA`` ``NA`` | ||
``True ^ True`` ``False`` | ||
``True ^ False`` ``True`` | ||
``True ^ NA`` ``NA`` | ||
``False ^ False`` ``False`` | ||
``False ^ NA`` ``NA`` | ||
``NA ^ NA`` ``NA`` | ||
================= ========= | ||
|
||
When an ``NA`` is present in an operation, the output value is ``NA`` only if | ||
the result cannot be determined solely based on the other input. For example, | ||
|
||
``True | NA`` is ``True``, because both ``True | True`` and ``True | False`` | ||
are ``True``. In that case, we don't actually need to consider the value | ||
of the ``NA``. | ||
|
||
On the other hand, ``True & NA`` is ``NA``. The result depends on whether | ||
the ``NA`` really is ``True`` or ``False``, since ``True & True`` is ``True``, | ||
but ``True & False`` is ``False``, so we can't determine the output. | ||
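The reasoning in the two paragraphs above can be sketched in plain Python. This is only an illustration, not how pandas implements it: ``NA`` is modelled as ``None``, and the helper name ``kleene`` is hypothetical. Each ``NA`` is replaced by both possible values; if every substitution gives the same answer, that answer is the result, otherwise the result is ``NA``.

```python
import operator
from itertools import product

NA = None  # stand-in for a missing value, used in this sketch only


def kleene(op, a, b):
    """Apply a binary boolean operator under three-valued (Kleene) logic."""
    # An NA operand could be either True or False, so try both.
    a_choices = (True, False) if a is NA else (a,)
    b_choices = (True, False) if b is NA else (b,)
    outcomes = {op(x, y) for x, y in product(a_choices, b_choices)}
    # If all substitutions agree the result is known; otherwise it is NA.
    return outcomes.pop() if len(outcomes) == 1 else NA


# True | NA is known, True & NA is not.
known = kleene(operator.or_, True, NA)
unknown = kleene(operator.and_, True, NA)
```

The same helper reproduces every row of the table when called with ``operator.and_``, ``operator.or_`` or ``operator.xor``.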
|
||
|
||
This differs from how ``np.nan`` behaves in logical operations. pandas treats | ||
``np.nan`` as *always false in the output*. | ||
|
||
|
||
In ``or`` | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([True, False, np.nan], dtype="object") | True | ||
pd.Series([True, False, np.nan], dtype="boolean") | True | ||
|
||
In ``and`` | ||
|
||
.. ipython:: python | ||
|
||
pd.Series([True, False, np.nan], dtype="object") & True | ||
pd.Series([True, False, np.nan], dtype="boolean") & True |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,5 @@ | ||
import numbers | ||
from typing import TYPE_CHECKING, Type | ||
from typing import TYPE_CHECKING, Optional, Type, Union | ||
import warnings | ||
|
||
import numpy as np | ||
|
@@ -184,6 +184,9 @@ class BooleanArray(ExtensionArray, ExtensionOpsMixin): | |
represented by 2 numpy arrays: a boolean array with the data and | ||
a boolean array with the mask (True indicating missing). | ||
|
||
BooleanArray implements Kleene logic (sometimes called three-value | ||
logic) for logical operations. See :ref:`` for more. | ||
Review comment: is "ref" here a placeholder?
||
|
||
To construct a BooleanArray from generic array-like input, use | ||
:func:`pandas.array` specifying ``dtype="boolean"`` (see examples | ||
below). | ||
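As a rough illustration of the two-array layout described in this docstring (a data array plus a mask in which True marks a missing entry), the following standalone sketch uses plain NumPy. The helper `to_objects` is hypothetical and is not part of the BooleanArray API.

```python
import numpy as np

# Hypothetical standalone illustration of the data/mask layout.
data = np.array([True, False, False])
mask = np.array([False, False, True])  # True marks a missing entry


def to_objects(data, mask):
    """Render the pair as a list, showing masked positions as None."""
    # The data value under a masked position carries no meaning.
    return [None if m else bool(d) for d, m in zip(data, mask)]


values = to_objects(data, mask)
```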
|
@@ -283,7 +286,7 @@ def __getitem__(self, item): | |
|
||
def _coerce_to_ndarray(self, force_bool: bool = False): | ||
""" | ||
Coerce to an ndarary of object dtype or bool dtype (if force_bool=True). | ||
Coerce to an ndarray of object dtype or bool dtype (if force_bool=True). | ||
|
||
Parameters | ||
---------- | ||
|
@@ -559,35 +562,34 @@ def logical_method(self, other): | |
# Rely on pandas to unbox and dispatch to us. | ||
return NotImplemented | ||
|
||
assert op.__name__ in {"or_", "ror_", "and_", "rand_", "xor", "rxor"} | ||
other = lib.item_from_zerodim(other) | ||
other_is_booleanarray = isinstance(other, BooleanArray) | ||
other_is_scalar = lib.is_scalar(other) | ||
mask = None | ||
|
||
if isinstance(other, BooleanArray): | ||
if other_is_booleanarray: | ||
other, mask = other._data, other._mask | ||
elif is_list_like(other): | ||
other = np.asarray(other, dtype="bool") | ||
if other.ndim > 1: | ||
raise NotImplementedError( | ||
"can only perform ops with 1-d structures" | ||
) | ||
if len(self) != len(other): | ||
raise ValueError("Lengths must match to compare") | ||
other, mask = coerce_to_array(other, copy=False) | ||
|
||
# numpy will show a DeprecationWarning on invalid elementwise | ||
# comparisons, this will raise in the future | ||
with warnings.catch_warnings(): | ||
warnings.filterwarnings("ignore", "elementwise", FutureWarning) | ||
with np.errstate(all="ignore"): | ||
result = op(self._data, other) | ||
|
||
# nans propagate | ||
if mask is None: | ||
mask = self._mask | ||
else: | ||
mask = self._mask | mask | ||
if not other_is_scalar and len(self) != len(other): | ||
raise ValueError("Lengths must match to compare") | ||
|
||
return BooleanArray(result, mask) | ||
if op.__name__ in {"or_", "ror_"}: | ||
result, mask = kleene_or(self._data, other, self._mask, mask) | ||
return BooleanArray(result, mask) | ||
|
||
elif op.__name__ in {"and_", "rand_"}: | ||
result, mask = kleene_and(self._data, other, self._mask, mask) | ||
return BooleanArray(result, mask) | ||
elif op.__name__ in {"xor", "rxor"}: | ||
result, mask = kleene_xor(self._data, other, self._mask, mask) | ||
return BooleanArray(result, mask) | ||
|
||
name = "__{name}__".format(name=op.__name__) | ||
return set_function_name(logical_method, name, cls) | ||
|
@@ -740,6 +742,171 @@ def boolean_arithmetic_method(self, other): | |
return set_function_name(boolean_arithmetic_method, name, cls) | ||
|
||
|
||
def kleene_or( | ||
Review comment: Would these be better in a separate module (in case we decide to re-use them), so that this becomes less cluttered?
||
left: Union[bool, np.ndarray], | ||
right: Union[bool, np.ndarray], | ||
left_mask: Optional[np.ndarray], | ||
right_mask: Optional[np.ndarray], | ||
): | ||
""" | ||
Boolean ``or`` using Kleene logic. | ||
|
||
Values are NA where we have ``NA | NA`` or ``NA | False``. | ||
``NA | True`` is considered True. | ||
|
||
Parameters | ||
---------- | ||
left, right : ndarray, NA, or bool | ||
The values of the array. | ||
left_mask, right_mask : ndarray, optional | ||
The masks. When | ||
|
||
Returns | ||
------- | ||
result, mask: ndarray[bool] | ||
The result of the logical or, and the new mask. | ||
""" | ||
# To reduce the number of cases, we ensure that `left` & `left_mask` | ||
# always come from an array, not a scalar. This is safe because | ||
# A | B == B | A | ||
if left_mask is None: | ||
return kleene_or(right, left, right_mask, left_mask) | ||
|
||
raise_for_nan(right, method="or") | ||
|
||
mask = left_mask | ||
|
||
if right_mask is not None: | ||
mask = mask | right_mask | ||
Review comment: Is this still needed with the new code below to create the mask?
Review comment: I believe so, though I may be wrong...
Review comment: Ah sorry, not needed after using your code. Thanks.
||
else: | ||
mask = mask.copy() | ||
|
||
# handle scalars: | ||
# if right_is_scalar and right is libmissing.NA: | ||
# result = left.copy() | ||
# mask = left_mask.copy() | ||
# mask[~result] = True | ||
# return result, mask | ||
|
||
result = left | right | ||
mask[left & ~left_mask] = False | ||
Review comment: A logical op is quite a bit faster than a setitem like this, so if we can get the final result by combining different logical ops, that might be more performant.
||
if right_mask is not None: | ||
mask[right & ~right_mask] = False | ||
elif right is True: | ||
mask[:] = False | ||
|
||
# update | ||
|
||
return result, mask | ||
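One way to follow the reviewer's suggestion of combining logical ops instead of using setitem is sketched below. It covers only the array-versus-array case, and the function name `kleene_or_arrays` is hypothetical; this is an illustration of the idea, not necessarily the code the PR ended up with.

```python
import numpy as np


def kleene_or_arrays(left, right, left_mask, right_mask):
    """Array-vs-array Kleene ``or`` built from logical ops only (sketch)."""
    result = left | right
    # Positions that are known to be False: value False and not masked.
    left_false = ~(left | left_mask)
    right_false = ~(right | right_mask)
    # The output is NA for NA | False, False | NA and NA | NA.
    mask = (
        (left_false & right_mask)
        | (right_false & left_mask)
        | (left_mask & right_mask)
    )
    return result, mask
```

Values of `result` at positions where `mask` is True are not meaningful and should be ignored by callers.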
|
||
|
||
def kleene_xor( | ||
left: Union[bool, np.ndarray], | ||
right: Union[bool, np.ndarray], | ||
left_mask: Optional[np.ndarray], | ||
right_mask: Optional[np.ndarray], | ||
): | ||
""" | ||
Boolean ``xor`` using Kleene logic. | ||
|
||
This is the same as ``or``, with the following adjustments | ||
|
||
* True, True -> False | ||
* True, NA -> NA | ||
|
||
Parameters | ||
---------- | ||
left, right : ndarray, NA, or bool | ||
The values of the array. | ||
left_mask, right_mask : ndarray, optional | ||
The masks. When | ||
|
||
Returns | ||
------- | ||
result, mask: ndarray[bool] | ||
The result of the logical xor, and the new mask. | ||
""" | ||
if left_mask is None: | ||
return kleene_xor(right, left, right_mask, left_mask) | ||
|
||
raise_for_nan(right, method="xor") | ||
# Re-use or, and update with adjustments. | ||
result, mask = kleene_or(left, right, left_mask, right_mask) | ||
|
||
# # TODO(pd.NA): change to pd.NA | ||
# if lib.is_scalar(right) and right is libmissing.NA: | ||
# # True | NA == True | ||
# # True ^ NA == NA | ||
# mask[result] = True | ||
|
||
result[left & right] = False | ||
mask[right & left_mask] = True | ||
if right_mask is not None: | ||
mask[left & right_mask] = True | ||
|
||
result[mask] = False | ||
return result, mask | ||
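The truth table in the documentation shows that ``xor`` yields NA whenever either operand is NA, so for two arrays the mask can be a plain union of the input masks. The sketch below assumes that observation; the name `kleene_xor_arrays` is hypothetical and the scalar case is not handled.

```python
import numpy as np


def kleene_xor_arrays(left, right, left_mask, right_mask):
    """Array-vs-array Kleene ``xor`` (sketch): NA if either side is NA."""
    result = left ^ right
    # Unlike ``or`` and ``and``, no known value can decide the result alone.
    mask = left_mask | right_mask
    return result, mask
```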
|
||
|
||
def kleene_and( | ||
left: Union[bool, np.ndarray], | ||
right: Union[bool, np.ndarray], | ||
left_mask: Optional[np.ndarray], | ||
right_mask: Optional[np.ndarray], | ||
): | ||
""" | ||
Boolean ``and`` using Kleene logic. | ||
|
||
Values are ``NA`` for ``NA & NA`` or ``True & NA``. | ||
|
||
Parameters | ||
---------- | ||
left, right : ndarray, NA, or bool | ||
The values of the array. | ||
left_mask, right_mask : ndarray, optional | ||
The masks. When | ||
|
||
Returns | ||
------- | ||
result, mask: ndarray[bool] | ||
The result of the logical and, and the new mask. | ||
""" | ||
if left_mask is None: | ||
return kleene_and(right, left, right_mask, left_mask) | ||
|
||
raise_for_nan(right, method="and") | ||
mask = left_mask | ||
|
||
if right_mask is not None: | ||
mask = mask | right_mask | ||
else: | ||
mask = mask.copy() | ||
|
||
if lib.is_scalar(right): | ||
result = left.copy() | ||
mask = left_mask.copy() | ||
if np.isnan(right): | ||
# TODO(pd.NA): change to NA | ||
mask[result] = True | ||
else: | ||
result = result & right # already copied. | ||
if right is False: | ||
# unmask everything | ||
mask[:] = False | ||
else: | ||
result = left & right | ||
# unmask where either left or right is False | ||
mask[~left & ~left_mask] = False | ||
mask[~right & ~right_mask] = False | ||
Review comment: Similar comment here, I think that something like this can be faster:
(avoiding setitem) And need to think if we can avoid some
Review comment: Further optimization:
Timing comparison:
Review comment: And on bigger arrays, the difference is much bigger, it seems. For 100_000 elements, I get 775 µs vs 45 µs
Review comment: Thanks. I'll also add an asv for these ops.
||
|
||
result[mask] = False | ||
Review comment: I don't think we need to do this (we can't rely on that in general anyway, and then this just gives a performance degradation)
||
return result, mask | ||
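The review thread above asks for a setitem-free mask computation, but the suggested snippets did not survive on this page. The following is a possible reconstruction of the idea for the array-versus-array case only, under the assumption that a known False on either side unmasks the result. The name `kleene_and_arrays` is hypothetical.

```python
import numpy as np


def kleene_and_arrays(left, right, left_mask, right_mask):
    """Array-vs-array Kleene ``and`` built from logical ops only (sketch)."""
    result = left & right
    # Positions that are known to be False: value False and not masked.
    left_false = ~(left | left_mask)
    right_false = ~(right | right_mask)
    # NA on one side stays NA unless the other side is a known False.
    mask = (left_mask & ~right_false) | (right_mask & ~left_false)
    return result, mask
```

As with the other sketches, entries of `result` under a True mask are placeholders and carry no meaning.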
|
||
|
||
def raise_for_nan(value, method): | ||
if lib.is_scalar(value) and isinstance(value, float) and np.isnan(value): | ||
raise ValueError(f"Cannot perform logical '{method}' with NaN") | ||
|
||
|
||
|
||
BooleanArray._add_logical_ops() | ||
BooleanArray._add_comparison_ops() | ||
BooleanArray._add_arithmetic_ops() |