API: added array #23581
Changes from 43 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -152,6 +152,28 @@ Reduction and groupby operations such as 'sum' work. | |
|
||
The Integer NA support currently uses the capitalized dtype version, e.g. ``Int8`` as compared to the traditional ``int8``. This may be changed at a future date. | ||
|
||
.. _whatsnew_0240.enhancements.array: | ||
|
||
A new top-level method :func:`array` has been added for creating arrays (:issue:`22860`). | ||
This can be used to create any :ref:`extension array <extending.extension-types>`, including | ||
extension arrays registered by :ref:`3rd party libraries <ecosystem.extensions>`, or to | ||
create NumPy arrays. | ||
[Review comment] specify this as 1D here?
||
|
||
.. ipython:: python | ||
|
||
pd.array([1, 2, np.nan], dtype='Int64') | ||
pd.array(['a', 'b', 'c'], dtype='category') | ||
pd.array([1, 2]) | ||
|
||
Notice that if no ``dtype`` is specified, the type of array is inferred | ||
from the data. In particular, the data from the first example, | ||
``[1, 2, np.nan]``, returns a floating-point NumPy array when no ``dtype`` | ||
is given, since ``NaN`` is a float. | ||
|
||
.. ipython:: python | ||
|
||
pd.array([1, 2, np.nan]) | ||
|
||
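As a hedged illustration of the difference between passing and omitting ``dtype``, the sketch below builds the same data both ways. It assumes a pandas release that provides ``pd.array`` (0.24.0 or later). The result of the call without ``dtype`` is printed rather than checked, because later pandas releases changed what is inferred for this input.

```python
import numpy as np
import pandas as pd

data = [1, 2, np.nan]

# Explicit nullable-integer dtype: the missing value is kept as NA
# and the two valid values stay integers.
explicit = pd.array(data, dtype="Int64")
print(explicit.dtype)
print(explicit.isna().tolist())

# No dtype: the array type is inferred. The text above describes the
# behaviour at the time of this change (a float NumPy array); newer
# pandas releases may infer a different array type.
inferred = pd.array(data)
print(type(inferred).__name__, inferred.dtype)
```

Passing ``dtype`` explicitly is the only one of the two calls whose result type does not depend on the installed pandas version.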
.. _whatsnew_0240.enhancements.read_html: | ||
|
||
``read_html`` Enhancements | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
""" | ||
All of pandas' ExtensionArrays and ExtensionDtypes. | ||
|
||
See :ref:`extending.extension-types` for more. | ||
""" | ||
from pandas.core.arrays import ( | ||
IntervalArray, PeriodArray, Categorical, SparseArray, IntegerArray, | ||
) | ||
|
||
|
||
__all__ = [ | ||
'Categorical', | ||
'IntegerArray', | ||
'IntervalArray', | ||
'PeriodArray', | ||
'SparseArray', | ||
] |
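One use for a module that collects the array classes in a single place is type checking. The sketch below assumes the module in this file is importable as ``pandas.arrays`` (as it is in released pandas); the helper ``is_nullable_integer`` is a hypothetical name introduced only for illustration.

```python
import numpy as np
import pandas as pd
from pandas.arrays import IntegerArray, PeriodArray

ints = pd.array([1, 2, np.nan], dtype="Int64")
periods = pd.array([pd.Period("2000-01-01", freq="D"),
                    pd.Period("2000-01-02", freq="D")])


def is_nullable_integer(values):
    """Hypothetical helper: True if `values` is pandas' nullable integer array."""
    return isinstance(values, IntegerArray)


print(is_nullable_integer(ints))
print(isinstance(periods, PeriodArray))
```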
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,184 @@ | ||
import numpy as np | ||
|
||
from pandas._libs import lib, tslibs | ||
|
||
from pandas.core.dtypes.common import is_extension_array_dtype | ||
from pandas.core.dtypes.dtypes import registry | ||
from pandas.core.dtypes.generic import ABCIndexClass, ABCSeries | ||
|
||
from pandas import compat | ||
|
||
|
||
def array(data, # type: Sequence[object] | ||
dtype=None, # type: Optional[Union[str, np.dtype, ExtensionDtype]] | ||
copy=True, # type: bool | ||
): | ||
# type: (...) -> Union[np.ndarray, ExtensionArray] | ||
""" | ||
Create an array. | ||
|
||
.. versionadded:: 0.24.0 | ||
|
||
Parameters | ||
---------- | ||
data : Sequence of objects | ||
The scalars inside `data` should be instances of the | ||
scalar type for `dtype`. | ||
|
||
When `data` is an Index or Series, the underlying array | ||
will be extracted from `data`. | ||
|
||
dtype : str, np.dtype, or ExtensionDtype, optional | ||
The dtype to use for the array. This may be a NumPy | ||
dtype or an extension type registered with pandas using | ||
:meth:`pandas.api.extensions.register_extension_dtype`. | ||
|
||
If not specified, there are two possibilities: | ||
|
||
1. When `data` is a :class:`Series`, :class:`Index`, or | ||
:class:`ExtensionArray`, the `dtype` will be taken | ||
from the data. | ||
2. Otherwise, pandas will attempt to infer the `dtype` | ||
from the data. | ||
[Review comment] I find this statement a bit misleading, as we actually don't infer from scalars (at least not in the sense of e.g. how Series does it); we only use NumPy's inference? If we say that we infer, I would expect those to do the same:
[Review comment] Ah, looking now further down in the implementation :-) I see you do handle this case for Period, so here it indeed infers:
[Review comment] I haven't stated it explicitly, but it would be nice if 3rd parties could eventually hook into this as well. Right now I think it's just Period (and maybe interval?) that get inferred. Maybe timestamps with timezones once DatetimeArray is done.
[Review comment] And not Timestamps without timezones?
[Review comment] It seems that
In [5]: lib.infer_dtype([pd.Timestamp('2017', tz='utc')])
Out[5]: 'datetime'
In [6]: lib.infer_dtype([pd.Timestamp('2017', tz='US/Central')])
Out[6]: 'datetime'
I've added interval to what we'll infer. Perhaps we should be explicit in the docs for that? Though that kinda closes the door to inference for 3rd party arrays.
[Review comment] Yes, I would do that.
How would you envision third party arrays participate in inference? That seems a bit difficult in any case (trying out all registered ones, ..?), and IMO more error prone for users (if you forget to import the 3rd party library, you silently get different results)
[Review comment] That's what I want (long term).
Haven't thought about it beyond "check for 3rd party scalar types". I'm not familiar with how infer_dtype works.
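To make the discussion above concrete, the sketch below asks pandas which label it infers for a few scalar types. It uses the public ``pandas.api.types.infer_dtype`` rather than the private ``lib.infer_dtype`` quoted in the thread, and passes ``skipna=True`` explicitly; the sample values are illustrative assumptions, not taken from the PR.

```python
import pandas as pd
from pandas.api.types import infer_dtype

samples = {
    "period": [pd.Period("2000", freq="D"), pd.Period("2001", freq="D")],
    "interval": [pd.Interval(0, 1), pd.Interval(1, 2)],
    "tz-aware timestamp": [pd.Timestamp("2017", tz="UTC")],
    "naive timestamp": [pd.Timestamp("2017")],
}

# Map each sample name to the string label returned by infer_dtype.
labels = {name: infer_dtype(values, skipna=True)
          for name, values in samples.items()}

for name, label in labels.items():
    print(name, "->", label)
```

The label does not distinguish tz-aware from tz-naive timestamps, which is the point raised in the thread: the label alone is not enough to pick a timezone-aware array type.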
||
|
||
Note that when `data` is a NumPy array, ``data.dtype`` is | ||
*not* used for inferring the array type. This is because | ||
NumPy cannot represent all the types of data that can be | ||
held in extension arrays. | ||
|
||
Currently, pandas will infer an extension dtype for sequences of | ||
|
||
========================== ================================== | ||
scalar type Array Type | ||
========================== ================================== | ||
* :class:`pandas.Interval` :class:`pandas.IntervalArray` | ||
* :class:`pandas.Period` :class:`pandas.arrays.PeriodArray` | ||
========================== ================================== | ||
|
||
For all other cases, NumPy's usual inference rules will be used. | ||
|
||
To avoid *future* breaking changes, pandas recommends using actual | ||
dtypes, and not string aliases, for `dtype`. In other words, use | ||
|
||
>>> pd.array([1, 2, 3], dtype=np.dtype("int32")) | ||
array([1, 2, 3], dtype=int32) | ||
|
||
rather than | ||
|
||
>>> pd.array([1, 2, 3], dtype="int32") | ||
array([1, 2, 3], dtype=int32) | ||
|
||
If and when pandas switches to a different backend for storing arrays, | ||
the meaning of the string aliases will change, while the actual | ||
dtypes will be unambiguous. | ||
|
||
copy : bool, default True | ||
Whether to copy the data, even if not necessary. Depending | ||
on the type of `data`, creating the new array may require | ||
copying data, even if ``copy=False``. | ||
|
||
Returns | ||
------- | ||
array : Union[numpy.ndarray, ExtensionArray] | ||
|
||
See Also | ||
-------- | ||
numpy.array : Construct a NumPy array. | ||
Series : Construct a pandas Series. | ||
|
||
Notes | ||
----- | ||
Omitting the `dtype` argument means pandas will attempt to infer the | ||
best array type from the values in the data. As new array types are | ||
added by pandas and 3rd party libraries, the "best" array type may | ||
change. We recommend specifying `dtype` to ensure that | ||
|
||
1. the correct array type for the data is returned | ||
2. the returned array type doesn't change as new extension types | ||
are added by pandas and third-party libraries | ||
|
||
Examples | ||
-------- | ||
If a dtype is not specified, `data` is passed through to | ||
:meth:`numpy.array`, and an ``ndarray`` is returned. | ||
|
||
>>> pd.array([1, 2]) | ||
array([1, 2]) | ||
|
||
Or the NumPy dtype can be specified | ||
|
||
>>> pd.array([1, 2], dtype=np.dtype("int32")) | ||
array([1, 2], dtype=int32) | ||
|
||
You can use the string alias for `dtype` | ||
|
||
>>> pd.array(['a', 'b', 'a'], dtype='category') | ||
[a, b, a] | ||
Categories (2, object): [a, b] | ||
|
||
Or specify the actual dtype | ||
|
||
>>> pd.array(['a', 'b', 'a'], | ||
... dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True)) | ||
[a, b, a] | ||
Categories (3, object): [a < b < c] | ||
|
||
Because omitting the `dtype` passes the data through to NumPy, | ||
a mixture of valid integers and NA will return a floating-point | ||
NumPy array. | ||
|
||
>>> pd.array([1, 2, np.nan]) | ||
array([ 1., 2., nan]) | ||
|
||
To use pandas' nullable :class:`pandas.arrays.IntegerArray`, specify | ||
the dtype: | ||
|
||
>>> pd.array([1, 2, np.nan], dtype='Int64') | ||
IntegerArray([1, 2, nan], dtype='Int64') | ||
|
||
Pandas will infer an ExtensionArray for some types of data: | ||
|
||
>>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")]) | ||
<PeriodArray> | ||
['2000-01-01', '2000-01-01'] | ||
Length: 2, dtype: period[D] | ||
[Review comment] can you add an example which raises for scalars, 2d?
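The revision of the function shown in this diff contains no check for scalar input. As far as I can tell, released versions of pandas reject a bare scalar with a ``ValueError``; the sketch below assumes that behaviour and only probes the scalar case, not two-dimensional input. The helper ``array_or_error`` is a hypothetical name used for illustration.

```python
import pandas as pd


def array_or_error(data):
    """Return (array, None) on success, or (None, message) on ValueError."""
    try:
        return pd.array(data), None
    except ValueError as exc:
        return None, str(exc)


# A list is a valid one-dimensional input.
ok, ok_error = array_or_error([1, 2])

# A bare scalar is not a sequence; this assumes the installed pandas
# raises ValueError for it, which the code in this diff does not do.
scalar, scalar_error = array_or_error(1)
print(scalar_error)
```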
""" | ||
from pandas.core.arrays import ( | ||
period_array, ExtensionArray, IntervalArray | ||
) | ||
|
||
if isinstance(data, (ABCSeries, ABCIndexClass)): | ||
data = data._values | ||
|
||
if dtype is None and isinstance(data, ExtensionArray): | ||
dtype = data.dtype | ||
|
||
# this returns None for not-found dtypes. | ||
if isinstance(dtype, compat.string_types): | ||
dtype = registry.find(dtype) or dtype | ||
[Review comment] @TomAugspurger do you remember if there was any particular reason for using this pattern instead of
[Review comment] I don't recall. I wonder if this predates
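The ``registry.find(dtype) or dtype`` pattern can be sketched without pandas. Everything below is a toy stand-in: ``ToyRegistry``, ``ToyDtype`` and ``resolve`` are hypothetical names, and the only behaviour borrowed from the diff is that ``find`` returns ``None`` for names it does not know.

```python
class ToyRegistry:
    """Hypothetical stand-in for the dtype registry; not pandas code."""

    def __init__(self):
        self._by_name = {}

    def register(self, name, dtype):
        self._by_name[name] = dtype

    def find(self, name):
        # Mirrors the comment in the diff: None for names that are not found.
        return self._by_name.get(name)


class ToyDtype:
    """Hypothetical extension dtype object."""


def resolve(registry, dtype):
    # Only strings are looked up; anything else passes through unchanged.
    if isinstance(dtype, str):
        # Fall back to the original string when the registry has no match,
        # so that NumPy can interpret aliases such as "int32" later.
        dtype = registry.find(dtype) or dtype
    return dtype


toy_registry = ToyRegistry()
toy_int64 = ToyDtype()
toy_registry.register("Int64", toy_int64)

print(resolve(toy_registry, "Int64") is toy_int64)
print(resolve(toy_registry, "int32"))
```

One property of ``or`` worth keeping in mind is that it also falls back when the found object is falsy, not only when it is ``None``.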
||
|
||
if is_extension_array_dtype(dtype): | ||
cls = dtype.construct_array_type() | ||
return cls._from_sequence(data, dtype=dtype, copy=copy) | ||
|
||
if dtype is None: | ||
inferred_dtype = lib.infer_dtype(data) | ||
if inferred_dtype == 'period': | ||
try: | ||
return period_array(data, copy=copy) | ||
except tslibs.IncompatibleFrequency: | ||
# We may have a mixture of frequencies. | ||
# We choose to return an ndarray, rather than raising. | ||
pass | ||
elif inferred_dtype == 'interval': | ||
try: | ||
return IntervalArray(data, copy=copy) | ||
except ValueError: | ||
# We may have a mixture of `closed` here. | ||
# We choose to return an ndarray, rather than raising. | ||
pass | ||
|
||
# TODO(DatetimeArray): handle this type | ||
# TODO(BooleanArray): handle this type | ||
|
||
return np.array(data, dtype=dtype, copy=copy) |
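The final line hands ``copy`` straight to ``np.array``. The sketch below illustrates that NumPy call on its own, not ``pd.array``, using an input for which no conversion is needed. What happens when a copy is unavoidable differs between NumPy versions, so that case is deliberately not exercised here.

```python
import numpy as np

source = np.arange(3)

# copy=True always yields an independent buffer.
copied = np.array(source, copy=True)

# copy=False reuses the buffer here because the input is already an
# ndarray of the requested (default) dtype.
shared = np.array(source, copy=False)

print(np.shares_memory(source, copied))
print(np.shares_memory(source, shared))
```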
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -81,7 +81,9 @@ | |
from_arrays | ||
from_tuples | ||
from_breaks | ||
overlaps | ||
set_closed | ||
to_tuples | ||
%(extra_methods)s\ | ||
|
||
See Also | ||
|