[ArrowStringArray] API: StringDtype parameterized by storage (python or pyarrow) #39908
Changes from 65 commits
4cb60e6
d242f2d
d39ab2c
2367810
9166d3b
8760705
d5b3fec
2c657df
647a6c2
0596fd7
c5a19c5
99680c9
69a6cc1
bd147ba
830275f
214e524
c9ba03c
7425536
68ac391
5cfa97a
74dbf96
3985943
3bda421
0c108a4
523e24c
279624c
80d231e
c5ced5a
459812c
d707b6b
71ccf24
daaac06
46626d1
3677bfa
42d382f
4fb1a0d
5d4eac1
15efb2e
b53cfe0
b7db53f
3399f08
e365f01
71d1e6c
9e23c35
c69a611
64b3206
d83a4ff
ef38660
aef1162
6247a5b
a6d066c
8adb08d
3ad0638
56714c9
6a1cc2b
1761a84
3e26baa
6b470b1
2ec6de0
a0b7a70
d9dcd20
4a37470
1d59c7a
e57c850
51f1b1d
fc95c06
ef02a43
@@ -171,6 +171,60 @@ a copy will no longer be made (:issue:`32960`)
The default behavior when not passing ``copy`` will remain unchanged, i.e.
a copy will be made.

.. _whatsnew_130.arrow_string:

PyArrow backed string data type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We've enhanced the :class:`StringDtype`, an extension type dedicated to string data.
(:issue:`39908`)

It is now possible to specify a ``storage`` keyword option to :class:`StringDtype`. Use
pandas options or specify the dtype using ``dtype='string[pyarrow]'`` to allow the
StringArray to be backed by a PyArrow array instead of a NumPy array of Python objects.

The PyArrow backed StringArray requires pyarrow 1.0.0 or greater to be installed.

.. warning::

   ``string[pyarrow]`` is currently considered experimental. The implementation
   and parts of the API may change without warning.

.. ipython:: python

   pd.Series(['abc', None, 'def'], dtype=pd.StringDtype(storage="pyarrow"))

You can use the alias ``"string[pyarrow]"`` as well.

.. ipython:: python

   s = pd.Series(['abc', None, 'def'], dtype="string[pyarrow]")
   s

You can also create a PyArrow backed string array using pandas options.

.. ipython:: python

   with pd.option_context("string_storage", "pyarrow"):
       s = pd.Series(['abc', None, 'def'], dtype="string")
   s

The usual string accessor methods work. Where appropriate, the return type of the Series
or columns of a DataFrame will also have string dtype.

.. ipython:: python

   s.str.upper()
   s.str.split('b', expand=True).dtypes

String accessor methods returning integers will return a value with :class:`Int64Dtype`

.. ipython:: python

   s.str.count("a")

See :ref:`text.types` for more.
Is there actually more content about ArrowString in that doc page?

not yet. updating text.rst will be done once we have the dtype sorted and have agreed on the array naming. will open a draft PR shortly with the changes so far (even though it's fluid) (and add the stuff removed from this PR #39908 (comment)) could remove this link till then.

We shouldn't need to update any user facing documentation if using pyarrow can be implemented on StringArray itself.

If there is not yet more content on the other page, I would just remove the link for now.

done in ef02a43
||
|
||
Centered Datetime-Like Rolling Windows | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
|
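The pyarrow requirement stated in the PyArrow backed string entry of the diff can be handled defensively in user code. The following is a minimal sketch, not part of the PR: the helper name `string_dtype_prefer_pyarrow` is hypothetical, and it assumes that constructing `pd.StringDtype(storage="pyarrow")` raises `ImportError` when a suitable pyarrow is not installed.

```python
import pandas as pd


def string_dtype_prefer_pyarrow():
    """Return a pyarrow-backed StringDtype if possible, else a python-backed one.

    Hypothetical helper for illustration only. Assumes pandas raises
    ImportError when pyarrow is missing or too old.
    """
    try:
        return pd.StringDtype(storage="pyarrow")
    except ImportError:
        return pd.StringDtype(storage="python")


dtype = string_dtype_prefer_pyarrow()

# The same construction and accessor calls work with either storage.
s = pd.Series(["abc", None, "def"], dtype=dtype)
counts = s.str.count("a")

print(dtype.storage)
print(counts)
```

Because the string accessor behaves the same for both storage options, the rest of the code does not need to know which backend was selected.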
I guess we should be consistent about using `string[python]` to be more explicit (rather than 'string'). I think it's worth it in benchmarks, for example (and you do it in other benchmarks).
the "string" used here denotes an object Index. These are not dtypes, but dictionary keys. There is no benchmark for StringArray.factorize
The last 3 are benchmarking arrays, all the others are Indexes
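To make the distinction in this thread concrete, here is a minimal sketch. It is not the actual asv benchmark code: the dictionary `data` and its keys are hypothetical. It shows a label such as "string" used purely as a dictionary key for an object-dtype Index, next to an explicit `string[python]` dtype alias used for an array.

```python
import pandas as pd

# Hypothetical benchmark-style lookup table. The keys are only labels,
# not dtypes: "string" selects an object-dtype Index of str values, while
# "string[python]" selects an array that really has a StringDtype.
data = {
    "string": pd.Index(["a", "b", "a", "c"], dtype=object),
    "string[python]": pd.array(["a", "b", "a", "c"], dtype="string[python]"),
}

index_obj = data["string"]
string_arr = data["string[python]"]

# The Index is plain object dtype; the array carries an explicit storage.
print(index_obj.dtype)
print(string_arr.dtype.storage)

# factorize works on the extension array as well as on the Index.
codes, uniques = pd.factorize(string_arr)
print(codes, uniques)
```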
I could maybe call it ArrowStringArray and rename the others for clarity. (does that affect the asv history?)
not sure about asv history, but nbd.
`string` -> `object` in 3e26baa