ENH: Implement Kleene logic for BooleanArray #29842

Merged
Changes from 12 commits
38 commits
bb904cb
ENH: add BooleanArray extension array (#29555)
jorisvandenbossche Nov 25, 2019
13c7ea3
move
TomAugspurger Nov 26, 2019
fff786f
doc fixup
TomAugspurger Nov 26, 2019
4067e7f
Merge remote-tracking branch 'upstream/master' into boolean-array-kleene
TomAugspurger Nov 26, 2019
708c553
working
TomAugspurger Nov 26, 2019
c56894e
Merge remote-tracking branch 'upstream/master' into boolean-array-kleene
TomAugspurger Nov 27, 2019
2e9d547
updates
TomAugspurger Nov 27, 2019
373aaab
updates
TomAugspurger Nov 27, 2019
7f78a64
Raise for NaN
TomAugspurger Nov 27, 2019
36b171b
added tests for empty
TomAugspurger Nov 27, 2019
747e046
added tests for inplace mutation
TomAugspurger Nov 27, 2019
d0a8cca
Do not assume masked values are False
TomAugspurger Nov 27, 2019
fe061b0
Merge remote-tracking branch 'upstream/master' into boolean-array-kleene
TomAugspurger Nov 27, 2019
9f9e44c
mypy
TomAugspurger Nov 27, 2019
0a34257
doc fixups
TomAugspurger Nov 27, 2019
2ba0034
Added benchmarks
TomAugspurger Nov 27, 2019
2d1129a
update tests
TomAugspurger Nov 27, 2019
a24fc22
Merge remote-tracking branch 'upstream/master' into boolean-array-kleene
TomAugspurger Nov 27, 2019
77dd1fc
remove unneded setitem
TomAugspurger Nov 27, 2019
7b9002c
optimize
TomAugspurger Nov 27, 2019
c18046b
comments
TomAugspurger Nov 27, 2019
1237caa
just do the xor
TomAugspurger Nov 27, 2019
2ecf9b8
Merge remote-tracking branch 'upstream/master' into boolean-array-kleene
TomAugspurger Dec 2, 2019
87aeb09
fixup docstring
TomAugspurger Dec 2, 2019
969b6dc
fix label
TomAugspurger Dec 2, 2019
1c9ba49
PERF: faster or
TomAugspurger Dec 2, 2019
8eec954
Merge remote-tracking branch 'upstream/master' into boolean-array-kleene
TomAugspurger Dec 4, 2019
cb47b6a
handle pd.NA
TomAugspurger Dec 4, 2019
2a946b9
validate
TomAugspurger Dec 4, 2019
efb6f8b
please mypy
TomAugspurger Dec 4, 2019
004238e
move to nanops
TomAugspurger Dec 4, 2019
5a2c81c
Merge remote-tracking branch 'upstream/master' into boolean-array-kleene
TomAugspurger Dec 5, 2019
7032318
move
TomAugspurger Dec 5, 2019
bbb7f9b
numpy scalars
TomAugspurger Dec 5, 2019
ce763b4
doc note
TomAugspurger Dec 5, 2019
5bc5328
handle numpy bool
TomAugspurger Dec 5, 2019
457bd08
Merge remote-tracking branch 'upstream/master' into boolean-array-kleene
TomAugspurger Dec 6, 2019
31c2bc6
cleanup
TomAugspurger Dec 6, 2019
1 change: 1 addition & 0 deletions doc/source/index.rst.template
@@ -73,6 +73,7 @@ See the :ref:`overview` for more detail about what's in the library.
 * :doc:`user_guide/missing_data`
 * :doc:`user_guide/categorical`
 * :doc:`user_guide/integer_na`
+* :doc:`user_guide/boolean`
 * :doc:`user_guide/visualization`
 * :doc:`user_guide/computation`
 * :doc:`user_guide/groupby`
77 changes: 77 additions & 0 deletions doc/source/user_guide/boolean.rst
@@ -0,0 +1,77 @@
.. currentmodule:: pandas

.. ipython:: python
   :suppress:

   import pandas as pd
   import numpy as np

.. _boolean:

**************************
Nullable Boolean Data Type
**************************

.. versionadded:: 1.0.0

.. _boolean.klean:
Review comment (Member), suggested change:
- .. _boolean.klean:
+ .. _boolean.kleene:


Kleene Logic
------------

:class:`arrays.BooleanArray` implements Kleene logic (sometimes called three-value logic) for
logical operations like ``&`` (and), ``|`` (or) and ``^`` (exclusive-or).

This table demonstrates the results for every combination. These operations are symmetrical,
so flipping the left- and right-hand side makes no difference in the result.

================= =========
Expression        Result
================= =========
``True & True``   ``True``
``True & False``  ``False``
``True & NA``     ``NA``
``False & False`` ``False``
``False & NA``    ``False``
``NA & NA``       ``NA``
``True | True``   ``True``
``True | False``  ``True``
``True | NA``     ``True``
``False | False`` ``False``
``False | NA``    ``NA``
``NA | NA``       ``NA``
``True ^ True``   ``False``
``True ^ False``  ``True``
``True ^ NA``     ``NA``
``False ^ False`` ``False``
``False ^ NA``    ``NA``
``NA ^ NA``       ``NA``
================= =========

When an ``NA`` is present in an operation, the output value is ``NA`` only if
the result cannot be determined solely based on the other input. For example,
``True | NA`` is ``True``, because both ``True | True`` and ``True | False``
are ``True``. In that case, we don't actually need to consider the value
of the ``NA``.

On the other hand, ``True & NA`` is ``NA``. The result depends on whether
the ``NA`` really is ``True`` or ``False``, since ``True & True`` is ``True``,
but ``True & False`` is ``False``, so we can't determine the output.
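
The rule described above can be checked by brute force. The following is a minimal sketch in plain Python, not the pandas implementation: ``NA`` is modelled with ``None`` (an assumption made only for this illustration), and the helper name ``kleene`` is hypothetical. It substitutes both ``True`` and ``False`` for every ``NA`` and returns ``NA`` only when the possible outcomes disagree.

```python
from itertools import product
import operator

NA = None  # stand-in for pd.NA in this sketch (an assumption, not the pandas object)


def kleene(op, left, right):
    """Apply a binary boolean op under Kleene logic by trying every
    concrete value that each NA operand could take."""
    left_options = [True, False] if left is NA else [left]
    right_options = [True, False] if right is NA else [right]
    outcomes = {op(a, b) for a, b in product(left_options, right_options)}
    # A single possible outcome means the NA did not matter.
    return outcomes.pop() if len(outcomes) == 1 else NA


# Print the full truth table for and, or, and exclusive-or.
for symbol, op in [("&", operator.and_), ("|", operator.or_), ("^", operator.xor)]:
    for left, right in product([True, False, NA], repeat=2):
        print(f"{left} {symbol} {right} -> {kleene(op, left, right)}")
```

Printing ``None`` in place of ``NA``, the output should match the table in this section, including the rows with the operands swapped.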


This differs from how ``np.nan`` behaves in logical operations. For ``np.nan``,
pandas treats the value as *always false in the output*.

In ``or``

.. ipython:: python

   pd.Series([True, False, np.nan], dtype="object") | True
   pd.Series([True, False, np.nan], dtype="boolean") | True

In ``and``

.. ipython:: python

   pd.Series([True, False, np.nan], dtype="object") & True
   pd.Series([True, False, np.nan], dtype="boolean") & True
203 changes: 185 additions & 18 deletions pandas/core/arrays/boolean.py
@@ -1,5 +1,5 @@
 import numbers
-from typing import TYPE_CHECKING, Type
+from typing import TYPE_CHECKING, Optional, Type, Union
 import warnings

 import numpy as np
@@ -184,6 +184,9 @@ class BooleanArray(ExtensionArray, ExtensionOpsMixin):
     represented by 2 numpy arrays: a boolean array with the data and
     a boolean array with the mask (True indicating missing).

+    BooleanArray implements Kleene logic (sometimes called three-value
+    logic) for logical operations. See :ref:`` for more.
Review comment (Member):
is "ref" here a placeholder?


     To construct an BooleanArray from generic array-like input, use
     :func:`pandas.array` specifying ``dtype="boolean"`` (see examples
     below).
@@ -283,7 +286,7 @@ def __getitem__(self, item):

     def _coerce_to_ndarray(self, force_bool: bool = False):
         """
-        Coerce to an ndarary of object dtype or bool dtype (if force_bool=True).
+        Coerce to an ndarray of object dtype or bool dtype (if force_bool=True).

         Parameters
         ----------
@@ -559,35 +562,34 @@ def logical_method(self, other):
                 # Rely on pandas to unbox and dispatch to us.
                 return NotImplemented

+            assert op.__name__ in {"or_", "ror_", "and_", "rand_", "xor", "rxor"}
             other = lib.item_from_zerodim(other)
+            other_is_booleanarray = isinstance(other, BooleanArray)
+            other_is_scalar = lib.is_scalar(other)
             mask = None

-            if isinstance(other, BooleanArray):
+            if other_is_booleanarray:
                 other, mask = other._data, other._mask
             elif is_list_like(other):
                 other = np.asarray(other, dtype="bool")
                 if other.ndim > 1:
                     raise NotImplementedError(
                         "can only perform ops with 1-d structures"
                     )
-                if len(self) != len(other):
-                    raise ValueError("Lengths must match to compare")
                 other, mask = coerce_to_array(other, copy=False)

-            # numpy will show a DeprecationWarning on invalid elementwise
-            # comparisons, this will raise in the future
-            with warnings.catch_warnings():
-                warnings.filterwarnings("ignore", "elementwise", FutureWarning)
-                with np.errstate(all="ignore"):
-                    result = op(self._data, other)
-
-            # nans propagate
-            if mask is None:
-                mask = self._mask
-            else:
-                mask = self._mask | mask
+            if not other_is_scalar and len(self) != len(other):
+                raise ValueError("Lengths must match to compare")

-            return BooleanArray(result, mask)
+            if op.__name__ in {"or_", "ror_"}:
+                result, mask = kleene_or(self._data, other, self._mask, mask)
+                return BooleanArray(result, mask)
+            elif op.__name__ in {"and_", "rand_"}:
+                result, mask = kleene_and(self._data, other, self._mask, mask)
+                return BooleanArray(result, mask)
+            elif op.__name__ in {"xor", "rxor"}:
+                result, mask = kleene_xor(self._data, other, self._mask, mask)
+                return BooleanArray(result, mask)

         name = "__{name}__".format(name=op.__name__)
         return set_function_name(logical_method, name, cls)
@@ -740,6 +742,171 @@ def boolean_arithmetic_method(self, other):
         return set_function_name(boolean_arithmetic_method, name, cls)


def kleene_or(
Review comment (Contributor):
would these be better in a separate module (in case we decide to re-use them) and this becomes less cluttered

    left: Union[bool, np.ndarray],
    right: Union[bool, np.ndarray],
    left_mask: Optional[np.ndarray],
    right_mask: Optional[np.ndarray],
):
    """
    Boolean ``or`` using Kleene logic.

    Values are NA where we have ``NA | NA`` or ``NA | False``.
    ``NA | True`` is considered True.

    Parameters
    ----------
    left, right : ndarray, NA, or bool
        The values of the array.
    left_mask, right_mask : ndarray, optional
        The masks. When

    Returns
    -------
    result, mask: ndarray[bool]
        The result of the logical or, and the new mask.
    """
    # To reduce the number of cases, we ensure that `left` & `left_mask`
    # always come from an array, not a scalar. This is safe because
    # A | B == B | A
    if left_mask is None:
        return kleene_or(right, left, right_mask, left_mask)

    raise_for_nan(right, method="or")

    mask = left_mask

    if right_mask is not None:
        mask = mask | right_mask
Review comment (Member):
Is this still needed with the new code below to create the mask?

Review comment (Contributor Author):
I believe so, though I may be wrong...

Review comment (Contributor Author):
Ah sorry, not needed after using your code. Thanks.

    else:
        mask = mask.copy()

    # handle scalars:
    # if right_is_scalar and right is libmissing.NA:
    #     result = left.copy()
    #     mask = left_mask.copy()
    #     mask[~result] = True
    #     return result, mask

    result = left | right
    mask[left & ~left_mask] = False
Review comment (Member):
A logical op is quite a bit faster than a setitem like this, so if we can get the final result by combining different logical ops, that might be more performant

    if right_mask is not None:
        mask[right & ~right_mask] = False
    elif right is True:
        mask[:] = False

    # update
    return result, mask
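
The review comment above suggests building the ``or`` mask from logical ops rather than setitem. A minimal sketch of one way to do that follows, assuming both operands are arrays with masks. The function name ``kleene_or_logical`` and the reference helper ``scalar_or`` are hypothetical, and this is not necessarily the code that was merged. The sketch checks every combination of value and mask bits against a scalar reference in which ``None`` stands in for ``NA``.

```python
from itertools import product

import numpy as np


def kleene_or_logical(left, right, left_mask, right_mask):
    """Hypothetical variant of kleene_or that avoids setitem on the mask."""
    left_false = ~(left | left_mask)     # positions that are definitely False
    right_false = ~(right | right_mask)
    result = left | right
    # NA only for (False, NA), (NA, False) and (NA, NA).
    mask = (
        (left_false & right_mask)
        | (right_false & left_mask)
        | (left_mask & right_mask)
    )
    return result, mask


def scalar_or(a, b):
    """Reference Kleene ``or`` on Python scalars; None stands in for NA."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


# All 16 combinations of (left, right, left_mask, right_mask) bits.
combos = list(product([False, True], repeat=4))
left, right, left_mask, right_mask = (np.array(col) for col in zip(*combos))
result, mask = kleene_or_logical(left, right, left_mask, right_mask)

mismatches = []
for i, (l, r, lm, rm) in enumerate(combos):
    expected = scalar_or(None if lm else l, None if rm else r)
    actual = None if mask[i] else bool(result[i])
    if actual is not expected:
        mismatches.append((l, r, lm, rm))

print("mismatches:", mismatches)
```

The check deliberately includes masked positions whose underlying data is ``True``, since the data under the mask cannot be assumed to be ``False``.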


def kleene_xor(
    left: Union[bool, np.ndarray],
    right: Union[bool, np.ndarray],
    left_mask: Optional[np.ndarray],
    right_mask: Optional[np.ndarray],
):
    """
    Boolean ``xor`` using Kleene logic.

    This is the same as ``or``, with the following adjustments

    * True, True -> False
    * True, NA -> NA

    Parameters
    ----------
    left, right : ndarray, NA, or bool
        The values of the array.
    left_mask, right_mask : ndarray, optional
        The masks. When

    Returns
    -------
    result, mask: ndarray[bool]
        The result of the logical xor, and the new mask.
    """
    if left_mask is None:
        return kleene_xor(right, left, right_mask, left_mask)

    raise_for_nan(right, method="xor")
    # Re-use or, and update with adjustments.
    result, mask = kleene_or(left, right, left_mask, right_mask)

    # # TODO(pd.NA): change to pd.NA
    # if lib.is_scalar(right) and right is libmissing.NA:
    #     # True | NA == True
    #     # True ^ NA == NA
    #     mask[result] = True

    result[left & right] = False
    mask[right & left_mask] = True
    if right_mask is not None:
        mask[left & right_mask] = True

    result[mask] = False
    return result, mask
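
According to the truth table in the user guide, ``xor`` is ``NA`` whenever either operand is ``NA``, so the mask can be the plain union of the two input masks. The sketch below illustrates that reading for the array-and-array case only. The names ``kleene_xor_logical`` and ``to_objects`` are hypothetical, and this is not claimed to be the merged implementation.

```python
import numpy as np


def kleene_xor_logical(left, right, left_mask, right_mask):
    """Hypothetical xor: plain xor on the data, union of the masks."""
    result = left ^ right
    mask = left_mask | right_mask
    return result, mask


def to_objects(values, mask):
    """Render (values, mask) as a list, with None standing in for NA."""
    return [None if m else bool(v) for v, m in zip(values, mask)]


# Represents [True, True, False, NA] ^ [True, False, NA, True]
left = np.array([True, True, False, False])
left_mask = np.array([False, False, False, True])
right = np.array([True, False, False, True])
right_mask = np.array([False, False, True, False])

result, mask = kleene_xor_logical(left, right, left_mask, right_mask)
print(to_objects(result, mask))  # [False, True, None, None]
```

Unlike ``or`` and ``and``, no known operand value can decide an ``xor`` on its own, which is why no unmasking step is needed here.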


def kleene_and(
    left: Union[bool, np.ndarray],
    right: Union[bool, np.ndarray],
    left_mask: Optional[np.ndarray],
    right_mask: Optional[np.ndarray],
):
    """
    Boolean ``and`` using Kleene logic.

    Values are ``NA`` for ``NA & NA`` or ``True & NA``.

    Parameters
    ----------
    left, right : ndarray, NA, or bool
        The values of the array.
    left_mask, right_mask : ndarray, optional
        The masks. When

    Returns
    -------
    result, mask: ndarray[bool]
        The result of the logical and, and the new mask.
    """
    if left_mask is None:
        return kleene_and(right, left, right_mask, left_mask)

    raise_for_nan(right, method="and")
    mask = left_mask

    if right_mask is not None:
        mask = mask | right_mask
    else:
        mask = mask.copy()

    if lib.is_scalar(right):
        result = left.copy()
        mask = left_mask.copy()
        if np.isnan(right):
            # TODO(pd.NA): change to NA
            mask[result] = True
        else:
            result = result & right  # already copied.
            if right is False:
                # unmask everything
                mask[:] = False
    else:
        result = left & right
        # unmask where either left or right is False
        mask[~left & ~left_mask] = False
        mask[~right & ~right_mask] = False
Review comment (Member):
Similar comment here, I think that something like this can be faster:

left_false = ~left & ~left_mask
right_false= ~right & ~right_mask

mask = (left_mask & ~right_false) | (right_mask & ~left_false)

(avoiding setitem)

And need to think if we can avoid some ~

Review comment (Member):
Further optimization:

left_false = ~(left | left_mask)
right_false= ~(right | right_mask)
mask = (left_mask & ~right_false) | (right_mask & ~left_false)

Timing comparison:

left = np.random.randint(0, 2, 1000).astype(bool)
right = np.random.randint(0, 2, 1000).astype(bool)
left_mask = np.random.randint(0, 2, 1000).astype(bool)
right_mask = np.random.randint(0, 2, 1000).astype(bool)

In [47]: %%timeit
    ...: mask = left_mask | right_mask
    ...: mask[~left & ~left_mask] = False
    ...: mask[~right & ~right_mask] = False

7.2 µs ± 106 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [58]: %%timeit
    ...: left_false = ~(left | left_mask)
    ...: right_false = ~(right | right_mask)
    ...:
    ...: mask = (left_mask & ~right_false) | (right_mask & ~left_false)

3.73 µs ± 275 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Review comment (Member):
And on bigger arrays, the difference is much bigger, it seems. For 100_000 elements, I get 775 µs vs 45 µs

Review comment (Contributor Author):
Thanks. I'll also add an asv for these ops.
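
The timing comparison in this thread covers speed only. The sketch below checks that the two mask computations also agree, using random inputs of the same shape as the reviewer's setup. The wrapper names ``and_mask_setitem`` and ``and_mask_logical`` and the fixed seed are assumptions added for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable
n = 1000
left, right, left_mask, right_mask = (
    rng.integers(0, 2, n).astype(bool) for _ in range(4)
)


def and_mask_setitem(left, right, left_mask, right_mask):
    """Mask for Kleene ``and`` using setitem, as in the diff."""
    mask = left_mask | right_mask
    mask[~left & ~left_mask] = False    # a definite False on the left decides
    mask[~right & ~right_mask] = False  # likewise on the right
    return mask


def and_mask_logical(left, right, left_mask, right_mask):
    """Mask for Kleene ``and`` using only logical ops, as suggested in review."""
    left_false = ~(left | left_mask)
    right_false = ~(right | right_mask)
    return (left_mask & ~right_false) | (right_mask & ~left_false)


masks_agree = np.array_equal(
    and_mask_setitem(left, right, left_mask, right_mask),
    and_mask_logical(left, right, left_mask, right_mask),
)
print(masks_agree)
```

The two forms agree because a masked position is never a definite ``False``, so ``left_mask & ~right_false`` already excludes every position that the setitem version unmasks.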


    result[mask] = False
Review comment (Member):
I don't think we need to do this (we can't rely on that in general anyway, and then this just gives a performance degradation)

    return result, mask


def raise_for_nan(value, method):
    if lib.is_scalar(value) and isinstance(value, float) and np.isnan(value):
        raise ValueError(f"Cannot perform logical '{method}' with NaN")


BooleanArray._add_logical_ops()
BooleanArray._add_comparison_ops()
BooleanArray._add_arithmetic_ops()