API: Uses pd.NA in IntegerArray #29964

Merged
merged 57 commits into from
Dec 30, 2019
Changes from 11 commits
1eec965
API: Uses pd.NA in IntegerArray
TomAugspurger Dec 2, 2019
f5f61ea
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 2, 2019
c569562
wip
TomAugspurger Dec 2, 2019
a8261a4
wip
TomAugspurger Dec 3, 2019
c8ff04f
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 3, 2019
cddc9df
fixup value counts
TomAugspurger Dec 3, 2019
9488d34
fixed to_numpy
TomAugspurger Dec 3, 2019
0d5aab8
doc
TomAugspurger Dec 3, 2019
fa61a6d
wip
TomAugspurger Dec 3, 2019
de2c6c6
wip
TomAugspurger Dec 3, 2019
60d7663
wip
TomAugspurger Dec 3, 2019
a4c4618
fixup extension
TomAugspurger Dec 3, 2019
0a500be
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 4, 2019
1c716f3
update tests
TomAugspurger Dec 4, 2019
67c8d51
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 4, 2019
22a2bc7
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 4, 2019
34de18e
updates
TomAugspurger Dec 4, 2019
78944d1
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 5, 2019
ffbe299
wip
TomAugspurger Dec 5, 2019
7abf40e
API: Handle pow & rpow special cases
TomAugspurger Dec 5, 2019
36d403d
move
TomAugspurger Dec 6, 2019
f6b4062
Merge remote-tracking branch 'upstream/master' into na-pow
TomAugspurger Dec 6, 2019
945e8cd
revert
TomAugspurger Dec 6, 2019
04546f3
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 6, 2019
a493965
Merge remote-tracking branch 'upstream/master' into na-pow
TomAugspurger Dec 6, 2019
8fc8b3a
fixup
TomAugspurger Dec 6, 2019
a49aa65
handle negative
TomAugspurger Dec 6, 2019
8ad166d
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 6, 2019
dd745c3
Merge branch 'na-pow' into NA-scalar+IntegerArray
TomAugspurger Dec 6, 2019
88fa412
expand test
TomAugspurger Dec 6, 2019
0902eef
wip
TomAugspurger Dec 6, 2019
721a1ea
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 9, 2019
c658307
fixup
TomAugspurger Dec 9, 2019
4f9d775
exceptions
TomAugspurger Dec 9, 2019
1244ef4
wip
TomAugspurger Dec 9, 2019
4a34b45
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 9, 2019
5293d87
fixup
TomAugspurger Dec 9, 2019
39f225a
arrow
TomAugspurger Dec 9, 2019
ea19b2d
update
TomAugspurger Dec 9, 2019
fe2d98e
fixup
TomAugspurger Dec 10, 2019
68fe155
update
TomAugspurger Dec 10, 2019
f27a5c2
fixup
TomAugspurger Dec 10, 2019
b97450b
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 16, 2019
5d62af8
updates
TomAugspurger Dec 16, 2019
2bf57d6
test, repr
TomAugspurger Dec 16, 2019
2f4e1cd
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 17, 2019
021dc7b
fixup
TomAugspurger Dec 17, 2019
197f18b
enable
TomAugspurger Dec 17, 2019
259b779
fixup
TomAugspurger Dec 17, 2019
c0cfef9
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 18, 2019
3183d53
ints
TomAugspurger Dec 18, 2019
4986d84
restore comment
TomAugspurger Dec 18, 2019
76806e9
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 30, 2019
64b4ccc
Merge remote-tracking branch 'upstream/master' into NA-scalar+Integer…
TomAugspurger Dec 30, 2019
b39dc60
docs
TomAugspurger Dec 30, 2019
800158d
docs
TomAugspurger Dec 30, 2019
e5d6832
fixup
TomAugspurger Dec 30, 2019
25 changes: 24 additions & 1 deletion doc/source/user_guide/integer_na.rst
@@ -15,14 +15,16 @@ Nullable integer data type
IntegerArray is currently experimental. Its API or implementation may
change without warning.


In :ref:`missing_data`, we saw that pandas primarily uses ``NaN`` to represent
missing data. Because ``NaN`` is a float, this forces an array of integers with
any missing values to become floating point. In some cases, this may not matter
much. But if your integer column is, say, an identifier, casting to float can
be problematic. Some integers cannot even be represented as floating point
numbers.

Construction
------------

Pandas can represent integer data with possibly missing values using
:class:`arrays.IntegerArray`. This is an :ref:`extension types <extending.extension-types>`
implemented within pandas.
@@ -39,6 +41,12 @@ NumPy's ``'int64'`` dtype:

pd.array([1, 2, np.nan], dtype="Int64")

All NA-like values are replaced with :attr:`pandas.NA`.
Contributor

may want to add a versionchanged tag here (and below)


.. ipython:: python

pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")

This array can be stored in a :class:`DataFrame` or :class:`Series` like any
NumPy array.

@@ -78,6 +86,9 @@ with the dtype.
In the future, we may provide an option for :class:`Series` to infer a
nullable-integer dtype.

Operations
----------

Operations involving an integer array will behave similar to NumPy arrays.
Missing values will be propagated, and the data will be coerced to another
dtype if needed.
@@ -123,3 +134,15 @@ Reduction and groupby operations such as 'sum' work as well.

df.sum()
df.groupby('B').A.sum()

Scalar NA Value
---------------

:class:`arrays.IntegerArray` uses :attr:`pandas.NA` as its scalar
missing value. Slicing a single element that's missing will return
:attr:`pandas.NA`:

.. ipython:: python

a = pd.array([1, None], dtype="Int64")
a[1]
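
The documentation change above can be checked with a minimal sketch, assuming pandas 1.0 or later (the release this PR targets) is installed. The variable names are illustrative and not part of the PR.

```python
import numpy as np
import pandas as pd

# NaN, None and pd.NA are all accepted as "missing" on construction.
arr = pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")

# isna() exposes the mask as a plain NumPy boolean array.
missing = arr.isna()

# Indexing a missing position returns the pd.NA singleton, not NaN.
scalars = [arr[i] for i in (2, 3, 4)]

# The example from the docs: a single missing element.
a = pd.array([1, None], dtype="Int64")
missing_scalar = a[1]
```

Because `pd.NA` is a singleton, an identity check (`value is pd.NA`) is a reliable way to test a scalar pulled out of the array.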
17 changes: 15 additions & 2 deletions pandas/core/arrays/boolean.py
@@ -6,6 +6,7 @@

from pandas._libs import lib
from pandas.compat import set_function_name
from pandas.compat.numpy import function as nv

from pandas.core.dtypes.base import ExtensionDtype
from pandas.core.dtypes.cast import astype_nansafe
@@ -554,7 +555,6 @@ def _values_for_argsort(self) -> np.ndarray:
@classmethod
def _create_logical_method(cls, op):
def logical_method(self, other):

if isinstance(other, (ABCDataFrame, ABCSeries, ABCIndexClass)):
# Rely on pandas to unbox and dispatch to us.
return NotImplemented
@@ -597,8 +597,11 @@ def _create_comparison_method(cls, op):
op_name = op.__name__

def cmp_method(self, other):
from pandas.arrays import IntegerArray

if isinstance(other, (ABCDataFrame, ABCSeries, ABCIndexClass)):
if isinstance(
other, (ABCDataFrame, ABCSeries, ABCIndexClass, IntegerArray)
):
# Rely on pandas to unbox and dispatch to us.
return NotImplemented

@@ -663,6 +666,16 @@ def _reduce(self, name, skipna=True, **kwargs):

return result

def any(self, axis=None, out=None, keepdims=False, skipna=True):
# Note: needed to implement for
# pandas/tests/arrays/test_integer.py::test_preserve_dtypes[sum]
nv.validate_any((), dict(out=out, keepdims=keepdims))
return self._reduce("any", skipna=skipna)

def all(self, axis=None, out=None, keepdims=False, skipna=True):
nv.validate_any((), dict(out=out, keepdims=keepdims))
return self._reduce("all", skipna=skipna)

def _maybe_mask_result(self, result, mask, other, op_name):
"""
Parameters
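
As a rough illustration of the `any` and `all` methods added to `BooleanArray` in this file, the sketch below calls them through the public API. It assumes pandas 1.0 or later and relies on the default `skipna=True` shown in the signatures above; the variable names are hypothetical.

```python
import pandas as pd

flags = pd.array([True, False, None], dtype="boolean")

# With skipna=True (the default) the missing entry is ignored.
any_true = flags.any()
all_true = flags.all()

# Once the missing entry is skipped, only True values remain here.
all_true_without_false = pd.array([True, None], dtype="boolean").all()
```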
64 changes: 47 additions & 17 deletions pandas/core/arrays/integer.py
@@ -1,10 +1,10 @@
import numbers
from typing import Type
from typing import Any, Tuple, Type
import warnings

import numpy as np

from pandas._libs import lib
from pandas._libs import lib, missing as libmissing
from pandas.compat import set_function_name
from pandas.util._decorators import cache_readonly

@@ -43,7 +43,7 @@ class _IntegerDtype(ExtensionDtype):
name: str
base = None
type: Type
na_value = np.nan
na_value = libmissing.NA

def __repr__(self) -> str:
sign = "U" if self.is_unsigned_integer else ""
@@ -267,6 +267,11 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):

.. versionadded:: 0.24.0

.. versionchanged:: 1.0.0

Now uses :attr:`pandas.NA` as its missing value, rather
than :attr:`numpy.nan`.

.. warning::

IntegerArray is currently experimental, and its API or internal
@@ -377,14 +382,28 @@ def __getitem__(self, item):
return self._data[item]
return type(self)(self._data[item], self._mask[item])

def _coerce_to_ndarray(self):
def _coerce_to_ndarray(self, dtype=None, na_value=libmissing.NA):
Contributor Author

I think we'll want to make a to_array that's basically this method.

Contributor

We already have to_numeric, which is the canonical form of to_array (I'd rather do conversions there).

Member

Jeff, the discussion about to_numpy (the to_array was a typo, I think) moved to #30038 in the meantime. Can you move your comment there if relevant?

Note that to_numeric is a function that converts anything to a numeric type, while this function converts a numeric type (this IntegerArray) to any other NumPy dtype.

Contributor Author

Sorry, to_array should have been to_numpy.
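
To make the distinction drawn in this thread concrete, here is a hedged sketch contrasting the two directions of conversion. It uses the public `to_numpy(dtype=..., na_value=...)` method from released pandas (1.0 or later), which is an assumption on my part: it is not the private `_coerce_to_ndarray` shown in this diff, and its design was discussed in #30038 rather than here.

```python
import numpy as np
import pandas as pd

# pd.to_numeric: arbitrary input (here, strings) -> numeric values.
parsed = pd.to_numeric(["1", "2", "3"])

# to_numpy: an already-numeric nullable array -> NumPy array of a chosen dtype.
arr = pd.array([1, 2, None], dtype="Int64")

# Object dtype can hold pd.NA itself.
as_object = arr.to_numpy(dtype=object)

# A float dtype cannot hold pd.NA, so the replacement value is stated explicitly.
as_float = arr.to_numpy(dtype="float64", na_value=np.nan)
```

Passing `na_value` explicitly sidesteps the question raised in the `XXX` comment in the diff about whether NA should be implicitly treated as NaN.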

"""
coerce to an ndarray of object dtype
"""

# TODO(jreback) make this better
data = self._data.astype(object)
data[self._mask] = self._na_value
if dtype is None:
dtype = object
elif is_float_dtype(dtype) and na_value is libmissing.NA:
# XXX: Do we want to implicitly treat NA as NaN here?
# We should be deliberate in this decision.
na_value = np.nan

data = self._data.astype(dtype)

if (
is_integer_dtype(dtype)
and na_value is libmissing.NA
and not self._mask.any()
):
return data
else:
data[self._mask] = na_value
return data

__array_priority__ = 1000 # higher than ndarray so ops dispatch to us
@@ -394,7 +413,7 @@ def __array__(self, dtype=None):
the array interface, return my values
We return an object array here to preserve our scalar values
"""
return self._coerce_to_ndarray()
return self._coerce_to_ndarray(dtype=dtype)

def __arrow_array__(self, type=None):
"""
@@ -510,7 +529,7 @@ def isna(self):

@property
def _na_value(self):
return np.nan
return self.dtype.na_value

@classmethod
def _concat_same_type(cls, to_concat):
@@ -549,7 +568,7 @@ def astype(self, dtype, copy=True):
return type(self)(result, mask=self._mask, copy=False)

# coerce
data = self._coerce_to_ndarray()
data = self._coerce_to_ndarray(dtype=dtype)
return astype_nansafe(data, dtype, copy=None)

@property
@@ -604,12 +623,17 @@ def value_counts(self, dropna=True):
# w/o passing the dtype
array = np.append(array, [self._mask.sum()])
index = Index(
np.concatenate([index.values, np.array([np.nan], dtype=object)]),
np.concatenate(
[index.values, np.array([self.dtype.na_value], dtype=object)]
),
dtype=object,
)

return Series(array, index=index)

def _values_for_factorize(self) -> Tuple[np.ndarray, Any]:
return self._coerce_to_ndarray(na_value=np.nan), np.nan

def _values_for_argsort(self) -> np.ndarray:
"""Return values for sorting.

@@ -629,13 +653,13 @@ def _values_for_argsort(self) -> np.ndarray:

@classmethod
def _create_comparison_method(cls, op):
op_name = op.__name__

@unpack_zerodim_and_defer(op.__name__)
def cmp_method(self, other):
from pandas.arrays import BooleanArray

mask = None

if isinstance(other, IntegerArray):
if isinstance(other, (BooleanArray, IntegerArray)):
other, mask = other._data, other._mask

elif is_list_like(other):
@@ -660,8 +684,7 @@ def cmp_method(self, other):
else:
mask = self._mask | mask

result[mask] = op_name == "ne"
return result
return BooleanArray(result, mask)

name = "__{name}__".format(name=op.__name__)
return set_function_name(cmp_method, name, cls)
@@ -673,7 +696,8 @@ def _reduce(self, name, skipna=True, **kwargs):
# coerce to a nan-aware float if needed
if mask.any():
data = self._data.astype("float64")
data[mask] = self._na_value
# We explicitly use NaN within reductions.
data[mask] = np.nan

op = getattr(nanops, "nan" + name)
result = op(data, axis=0, skipna=skipna, mask=mask, **kwargs)
@@ -784,6 +808,12 @@ def integer_arithmetic_method(self, other):
_dtype_docstring = """
An ExtensionDtype for {dtype} integer data.

.. versionchanged:: 1.0.0

Now uses :attr:`pandas.NA` as its missing value,
rather than :attr:`numpy.nan`.


Attributes
----------
None
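
The user-visible effect of the `cmp_method` and `_reduce` changes in this file can be sketched as follows, assuming pandas 1.0 or later. Before this change, a comparison against a missing value produced a plain boolean (`True` only for `!=`). The diff now wraps the result in a `BooleanArray` that carries the mask. The names below are illustrative only.

```python
import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")

# Comparisons return a nullable BooleanArray, so missing stays missing.
equal_two = arr == 2
not_two = arr != 2

# Reductions still skip missing values by default (NaN is used internally).
total = arr.sum()
```

Note that `not_two[2]` is `pd.NA` rather than `True`, which is the behavioural change callers are most likely to notice.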