SparseArray is an ExtensionArray #22325
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -320,6 +320,22 @@ is the case with :attr:`Period.end_time`, for example | |
|
||
p.end_time | ||
|
||
.. _whatsnew_0240.api_breaking.sparse_values: | ||
|
||
``SparseArray`` is now an ``ExtensionArray`` | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
This has some notable changes | ||
|
||
- ``SparseArray`` is no longer a subclass of :class:`numpy.ndarray` | ||
Review comment: Do we know of specific consequences that people might run into because of this?
Review comment: mmm not sure. The main thing is that there are many methods implemented in ndarray that are not on SparseArray. Hard to say what's most used.
||
- ``SparseArray.dtype`` and ``SparseSeries.dtype`` are now instances of ``SparseDtype``, rather than ``np.dtype``. Access the underlying dtype with ``SparseDtype.subdtype``. | ||
- :meth:`numpy.asarray(sparse_array)` now returns a dense array with all the values, not just the non-fill-value values (:issue:`todo`) | ||
- Providing a ``sparse_index`` to the SparseArray constructor no longer defaults the na-value to ``np.nan`` for all dtypes. The correct na_value for ``data.dtype`` is now used. | ||
- passing ``fill_value`` to ``SparseArray.take`` no longer implies ``allow_fill=True``. | ||
- ``SparseArray.astype(np.dtype)`` will create a dense NumPy array. To keep astype to a SparseArray with a different subdtype, use ``.astype(sparse_dtype)`` or a string like ``.astype('Sparse[float32]')``. | ||
Review comment: What should the
||
- Setting ``SparseArray.fill_value`` to a fill value with a different dtype is now allowed. | ||
Review comment: Will this then change the dtype of the SparseArray?
Review comment: Yes. It's a bad idea though (see SparseArray.fill_value.setter).
Review comment: I may have misunderstood your question earlier. The answer may be no. SparseArray.dtype is a SparseDtype, which consists of two fields: the array dtype (
Review comment: It seems a bit strange though to have sp_values and a fill_value that don't have compatible dtypes?
||
- Bug in ``SparseArray.nbytes`` under-reporting its memory usage by not including the size of its sparse index. | ||
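Several of the changes listed above can be observed directly. The snippet below is a minimal sketch, assuming a released pandas (0.24 or later) in which the class is reachable as `pd.arrays.SparseArray` and the wrapped NumPy dtype is exposed as `SparseDtype.subtype`; the draft text above spells that attribute `subdtype`, so treat the exact name as release-dependent.

```python
import numpy as np
import pandas as pd

# Five positions, but only the two non-zero values are stored explicitly.
arr = pd.arrays.SparseArray([0, 0, 1, 2, 0], fill_value=0)

# The array is no longer an ndarray subclass, and its dtype is a SparseDtype.
is_ndarray = isinstance(arr, np.ndarray)
has_sparse_dtype = isinstance(arr.dtype, pd.SparseDtype)

# np.asarray densifies: fill values are materialised alongside stored values.
dense = np.asarray(arr)

# Casting with a SparseDtype keeps the result sparse with a new subtype.
as_float = arr.astype(pd.SparseDtype("float32", fill_value=0.0))

# nbytes counts the sparse index as well as the stored values.
index_overhead = arr.nbytes - arr.sp_values.nbytes
```

Passing an explicit `fill_value` in the target dtype avoids relying on the default fill value for the new subtype.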
|
||
.. _whatsnew_0240.api.datetimelike.normalize: | ||
|
||
Tick DateOffset Normalize Restrictions | ||
|
@@ -418,7 +434,7 @@ ExtensionType Changes | |
- Bug in :meth:`Series.get` for ``Series`` using ``ExtensionArray`` and integer index (:issue:`21257`) | ||
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`) | ||
- :meth:`Series.combine()` with scalar argument now works for any function type (:issue:`21248`) | ||
- | ||
- Added ``ExtensionDtype._is_numeric`` for controlling whether an extension dtype is considered numeric. | ||
Review comment: this should not be here anymore (since the other PRs are already merged?) (the same for the shift entry above)
||
|
||
.. _whatsnew_0240.api.incompatibilities: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -15,7 +15,7 @@ | |
from pandas import compat | ||
from pandas.compat import iteritems, PY36, OrderedDict | ||
from pandas.core.dtypes.generic import ABCSeries, ABCIndex, ABCIndexClass | ||
from pandas.core.dtypes.common import is_integer | ||
from pandas.core.dtypes.common import is_integer, is_bool_dtype | ||
from pandas.core.dtypes.inference import _iterable_not_string | ||
from pandas.core.dtypes.missing import isna, isnull, notnull # noqa | ||
from pandas.core.dtypes.cast import construct_1d_object_array_from_listlike | ||
|
@@ -100,7 +100,12 @@ def maybe_box_datetimelike(value): | |
|
||
|
||
def is_bool_indexer(key): | ||
if isinstance(key, (ABCSeries, np.ndarray, ABCIndex)): | ||
# TODO: This is currently broken for ExtensionArrays. | ||
||
# We currently special case SparseArray, but that should *maybe* be | ||
# just ExtensionArray. | ||
from pandas.core.sparse.api import SparseArray | ||
Review comment: I think we have an ABCSparseArray
||
|
||
if isinstance(key, (ABCSeries, np.ndarray, ABCIndex, SparseArray)): | ||
if key.dtype == np.object_: | ||
key = np.asarray(values_from_object(key)) | ||
|
||
|
@@ -110,7 +115,7 @@ def is_bool_indexer(key): | |
'NA / NaN values') | ||
return False | ||
return True | ||
elif key.dtype == np.bool_: | ||
elif is_bool_dtype(key.dtype): | ||
return True | ||
elif isinstance(key, list): | ||
try: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,6 +11,7 @@ | |
DatetimeTZDtypeType, PeriodDtype, PeriodDtypeType, IntervalDtype, | ||
IntervalDtypeType, PandasExtensionDtype, ExtensionDtype, | ||
_pandas_registry) | ||
from pandas.core.sparse.dtype import SparseDtype | ||
from pandas.core.dtypes.generic import ( | ||
ABCCategorical, ABCPeriodIndex, ABCDatetimeIndex, ABCSeries, | ||
ABCSparseArray, ABCSparseSeries, ABCCategoricalIndex, ABCIndexClass, | ||
|
@@ -152,8 +153,22 @@ def is_sparse(arr): | |
>>> is_sparse(bsr_matrix([1, 2, 3])) | ||
False | ||
""" | ||
from pandas.core.sparse.array import SparseArray | ||
from pandas.core.sparse.dtype import SparseDtype | ||
from pandas.core.generic import ABCSeries | ||
from pandas.core.internals import BlockManager, Block | ||
|
||
return isinstance(arr, (ABCSparseArray, ABCSparseSeries)) | ||
|
||
if isinstance(arr, BlockManager): | ||
# SparseArrays are only 1d | ||
if arr.ndim == 1: | ||
arr = arr.blocks[0] | ||
else: | ||
return False | ||
|
||
if isinstance(arr, (ABCSeries, Block)): | ||
Review comment: can you use
Review comment: Don't think so, since that would densify a
||
arr = arr.values | ||
|
||
return isinstance(arr, (SparseArray, ABCSparseSeries, SparseDtype)) | ||
Review comment: Is it needed this function accepts dtype objects? (and also BlockManagers?)
Review comment: I think so, though if we work on the Series constructor, this could maybe be avoided. This is so that pd.DataFrame({"A": pd.SparseSeries([1, 2])})['A'] is a SparseSeries. If we exclude the block handling, we get back a For including
Review comment: Ah, I suppose it's https://github.com/pandas-dev/pandas/pull/22325/files/fde19d74678507ae99f790c97189f030850c0250#diff-20f4f5e320ea4b1f8a4323e070f6883eR821, which could be changed.
||
|
||
|
||
def is_scipy_sparse(arr): | ||
|
@@ -1608,8 +1623,9 @@ def is_bool_dtype(arr_or_dtype): | |
False | ||
>>> is_bool_dtype(np.array([True, False])) | ||
True | ||
>>> is_bool_dtype(pd.SparseArray([True, False])) | ||
True | ||
""" | ||
|
||
if arr_or_dtype is None: | ||
return False | ||
try: | ||
|
@@ -1626,7 +1642,8 @@ def is_bool_dtype(arr_or_dtype): | |
# guess this | ||
return (arr_or_dtype.is_object and | ||
arr_or_dtype.inferred_type == 'boolean') | ||
|
||
elif isinstance(arr_or_dtype, SparseDtype): | ||
Review comment: what is this for?
Review comment: This is for
||
return issubclass(arr_or_dtype.subdtype.type, np.bool_) | ||
return issubclass(tipo, np.bool_) | ||
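The effect of the new `SparseDtype` branch can be checked from the public API. This is a minimal sketch, assuming a released pandas where `pandas.api.types.is_bool_dtype` and `pd.arrays.SparseArray` are available; it shows the outcome of the check and does not reproduce the internal code path in this diff.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype

# A sparse boolean mask and a sparse integer array for comparison.
mask = pd.arrays.SparseArray([True, False, True])
ints = pd.arrays.SparseArray([1, 0, 2])

# The answer follows the dtype wrapped by the SparseDtype,
# whether the array or its dtype is passed in.
mask_is_bool = is_bool_dtype(mask)
mask_dtype_is_bool = is_bool_dtype(mask.dtype)
ints_is_bool = is_bool_dtype(ints)
```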
|
||
|
||
|
@@ -1706,6 +1723,8 @@ def is_extension_array_dtype(arr_or_dtype): | |
array interface. In pandas, this includes: | ||
|
||
* Categorical | ||
* Sparse | ||
* Interval | ||
|
||
Third-party libraries may implement arrays or types satisfying | ||
this interface as well. | ||
|
@@ -1828,7 +1847,8 @@ def _get_dtype(arr_or_dtype): | |
return PeriodDtype.construct_from_string(arr_or_dtype) | ||
elif is_interval_dtype(arr_or_dtype): | ||
return IntervalDtype.construct_from_string(arr_or_dtype) | ||
elif isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex)): | ||
elif isinstance(arr_or_dtype, (ABCCategorical, ABCCategoricalIndex, | ||
ABCSparseArray, ABCSparseSeries)): | ||
return arr_or_dtype.dtype | ||
|
||
if hasattr(arr_or_dtype, 'dtype'): | ||
|
@@ -1876,6 +1896,10 @@ def _get_dtype_type(arr_or_dtype): | |
elif is_interval_dtype(arr_or_dtype): | ||
return IntervalDtypeType | ||
return _get_dtype_type(np.dtype(arr_or_dtype)) | ||
elif isinstance(arr_or_dtype, (ABCSparseSeries, ABCSparseArray, | ||
SparseDtype)): | ||
dtype = getattr(arr_or_dtype, 'dtype', arr_or_dtype) | ||
return dtype.type | ||
try: | ||
return arr_or_dtype.dtype.type | ||
except AttributeError: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -97,7 +97,9 @@ def _get_frame_result_type(result, objs): | |
otherwise, return 1st obj | ||
""" | ||
|
||
if result.blocks and all(b.is_sparse for b in result.blocks): | ||
if (result.blocks and ( | ||
all(is_sparse(b) for b in result.blocks) or | ||
Review comment: related to my comment above. cannot is_sparse not simply check if it's an EA and if it has a Sparse Dtype? then you simply need to pass the
Review comment: I'll give that a shot.
Review comment: can you add a comment here, it's not obvious what you are doing
Review comment: how can obj be a SparseFrame here? is this tested?
Review comment: I think a comment of mine may have been lost. This is hit in several places (e.g. What part can I clarify here?
||
all(isinstance(obj, ABCSparseDataFrame) for obj in objs))): | ||
from pandas.core.sparse.api import SparseDataFrame | ||
return SparseDataFrame | ||
else: | ||
|
@@ -554,61 +556,23 @@ def _concat_sparse(to_concat, axis=0, typs=None): | |
a single array, preserving the combined dtypes | ||
""" | ||
|
||
from pandas.core.sparse.array import SparseArray, _make_index | ||
from pandas.core.sparse.array import SparseArray | ||
|
||
|
||
def convert_sparse(x, axis): | ||
# coerce to native type | ||
if isinstance(x, SparseArray): | ||
x = x.get_values() | ||
else: | ||
x = np.asarray(x) | ||
x = x.ravel() | ||
if axis > 0: | ||
x = np.atleast_2d(x) | ||
return x | ||
fill_values = [x.fill_value for x in to_concat | ||
if isinstance(x, SparseArray)] | ||
|
||
if typs is None: | ||
typs = get_dtype_kinds(to_concat) | ||
if len(set(fill_values)) > 1: | ||
raise ValueError("Cannot concatenate SparseArrays with different " | ||
"fill values") | ||
|
||
|
||
if len(typs) == 1: | ||
# concat input as it is if all inputs are sparse | ||
# and have the same fill_value | ||
fill_values = {c.fill_value for c in to_concat} | ||
if len(fill_values) == 1: | ||
sp_values = [c.sp_values for c in to_concat] | ||
indexes = [c.sp_index.to_int_index() for c in to_concat] | ||
|
||
indices = [] | ||
loc = 0 | ||
for idx in indexes: | ||
indices.append(idx.indices + loc) | ||
loc += idx.length | ||
sp_values = np.concatenate(sp_values) | ||
indices = np.concatenate(indices) | ||
sp_index = _make_index(loc, indices, kind=to_concat[0].sp_index) | ||
|
||
return SparseArray(sp_values, sparse_index=sp_index, | ||
fill_value=to_concat[0].fill_value) | ||
|
||
# input may be sparse / dense mixed and may have different fill_value | ||
# input must contain sparse at least 1 | ||
sparses = [c for c in to_concat if is_sparse(c)] | ||
fill_values = [c.fill_value for c in sparses] | ||
sp_indexes = [c.sp_index for c in sparses] | ||
|
||
# densify and regular concat | ||
to_concat = [convert_sparse(x, axis) for x in to_concat] | ||
result = np.concatenate(to_concat, axis=axis) | ||
|
||
if not len(typs - {'sparse', 'f', 'i'}): | ||
# sparsify if inputs are sparse and dense numerics | ||
# first sparse input's fill_value and SparseIndex is used | ||
result = SparseArray(result.ravel(), fill_value=fill_values[0], | ||
kind=sp_indexes[0]) | ||
else: | ||
# coerce to object if needed | ||
result = result.astype('object') | ||
return result | ||
fill_value = list(fill_values)[0] | ||
|
||
|
||
# TODO: Fix join unit generation so we aren't passed this. | ||
to_concat = [x if isinstance(x, SparseArray) | ||
else SparseArray(x.squeeze(), fill_value=fill_value) | ||
for x in to_concat] | ||
|
||
return SparseArray._concat_same_type(to_concat) | ||
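The two ideas visible in this hunk, rejecting mixed fill values and shifting each input's stored positions by the running length (as the removed fast path did), can be sketched in plain Python. `ToySparse` and `concat_same_fill` are hypothetical names used only for illustration; they are not pandas classes or functions, and this is not a claim about how `SparseArray._concat_same_type` is implemented.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ToySparse:
    """Hypothetical stand-in for a sparse array: stored values plus positions."""
    length: int
    indices: List[int]
    sp_values: List[float]
    fill_value: float = 0.0

    def to_dense(self) -> List[float]:
        # Start from the fill value, then overwrite the stored positions.
        dense = [self.fill_value] * self.length
        for i, v in zip(self.indices, self.sp_values):
            dense[i] = v
        return dense


def concat_same_fill(arrays: Sequence[ToySparse]) -> ToySparse:
    if not arrays:
        raise ValueError("Need at least one array to concatenate")
    fill_values = {a.fill_value for a in arrays}
    if len(fill_values) > 1:
        raise ValueError("Cannot concatenate sparse arrays with different "
                         "fill values")
    indices: List[int] = []
    values: List[float] = []
    offset = 0
    for a in arrays:
        # Shift this array's positions past everything concatenated so far.
        indices.extend(i + offset for i in a.indices)
        values.extend(a.sp_values)
        offset += a.length
    return ToySparse(offset, indices, values, fill_values.pop())
```

For example, concatenating a length-3 array with a value at position 1 and a length-2 array with a value at position 0 gives stored positions 1 and 3 in a length-5 result.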
|
||
|
||
def _concat_rangeindex_same_dtype(indexes): | ||
|