Commit e871245

Merge pull request #8423 from jreback/str
DOC: create text.rst with string methods (GH8416)
2 parents 5cfc9cf + a95d84a commit e871245

8 files changed

+255
-163
lines changed

doc/source/10min.rst

Lines changed: 6 additions & 1 deletion
@@ -433,7 +433,12 @@ See more at :ref:`Histogramming and Discretization <basics.discretization>`
 String Methods
 ~~~~~~~~~~~~~~
 
-See more at :ref:`Vectorized String Methods <basics.string_methods>`
+Series is equipped with a set of string processing methods in the `str`
+attribute that make it easy to operate on each element of the array, as in the
+code snippet below. Note that pattern-matching in `str` generally uses `regular
+expressions <https://docs.python.org/2/library/re.html>`__ by default (and in
+some cases always uses them). See more at :ref:`Vectorized String Methods
+<text.string_methods>`.
 
 .. ipython:: python
 

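The note added to ``10min.rst`` above says that pattern matching in ``str`` generally treats the pattern as a regular expression. The following is a minimal standard-library sketch of why that matters. It calls ``re`` directly rather than the pandas ``str`` accessor, and the sample strings are made up for illustration. A pattern containing a metacharacter such as ``.`` matches more than the literal character unless it is escaped.

```python
import re

values = ["a.b", "c.d"]  # hypothetical sample data, not from the commit

# As a regular expression, "." matches any character,
# so every character is replaced and the strings become empty.
as_regex = [re.sub(".", "", v) for v in values]

# re.escape turns the pattern into a literal dot,
# so only the dots are removed.
as_literal = [re.sub(re.escape("."), "", v) for v in values]

print(as_regex)
print(as_literal)
```

Escaping the pattern is one way to get literal matching when a method always interprets its argument as a regular expression.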
doc/source/api.rst

Lines changed: 1 addition & 1 deletion
@@ -1410,7 +1410,7 @@ Computations / Descriptive Stats
 GroupBy.mean
 GroupBy.median
 GroupBy.min
-GroupBy.nth
+GroupBy.nth
 GroupBy.ohlc
 GroupBy.prod
 GroupBy.size

doc/source/basics.rst

Lines changed: 14 additions & 159 deletions
@@ -1159,172 +1159,27 @@ The ``.dt`` accessor works for period and timedelta dtypes.
 
 ``Series.dt`` will raise a ``TypeError`` if you access with a non-datetimelike values
 
-.. _basics.string_methods:
-
 Vectorized string methods
 -------------------------
 
-Series is equipped (as of pandas 0.8.1) with a set of string processing methods
-that make it easy to operate on each element of the array. Perhaps most
-importantly, these methods exclude missing/NA values automatically. These are
-accessed via the Series's ``str`` attribute and generally have names matching
-the equivalent (scalar) build-in string methods:
-
-Splitting and Replacing Strings
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. ipython:: python
-
-   s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
-   s.str.lower()
-   s.str.upper()
-   s.str.len()
-
-Methods like ``split`` return a Series of lists:
-
-.. ipython:: python
-
-   s2 = Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
-   s2.str.split('_')
-
-Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
-
-.. ipython:: python
-
-   s2.str.split('_').str.get(1)
-   s2.str.split('_').str[1]
-
-Methods like ``replace`` and ``findall`` take regular expressions, too:
-
-.. ipython:: python
-
-   s3 = Series(['A', 'B', 'C', 'Aaba', 'Baca',
-                '', np.nan, 'CABA', 'dog', 'cat'])
-   s3
-   s3.str.replace('^.a|dog', 'XX-XX ', case=False)
-
-Extracting Substrings
-~~~~~~~~~~~~~~~~~~~~~
-
-The method ``extract`` (introduced in version 0.13) accepts regular expressions
-with match groups. Extracting a regular expression with one group returns
-a Series of strings.
-
-.. ipython:: python
-
-   Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
-
-Elements that do not match return ``NaN``. Extracting a regular expression
-with more than one group returns a DataFrame with one column per group.
-
-.. ipython:: python
-
-   Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
-
-Elements that do not match return a row filled with ``NaN``.
-Thus, a Series of messy strings can be "converted" into a
-like-indexed Series or DataFrame of cleaned-up or more useful strings,
-without necessitating ``get()`` to access tuples or ``re.match`` objects.
+Series is equipped with a set of string processing methods that make it easy to
+operate on each element of the array. Perhaps most importantly, these methods
+exclude missing/NA values automatically. These are accessed via the Series's
+``str`` attribute and generally have names matching the equivalent (scalar)
+built-in string methods. For example:
 
-The results dtype always is object, even if no match is found and the result
-only contains ``NaN``.
+.. ipython:: python
 
-Named groups like
+   s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
+   s.str.lower()
 
-.. ipython:: python
-
-   Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)')
-
-and optional groups like
-
-.. ipython:: python
-
-   Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
-
-can also be used.
-
-Testing for Strings that Match or Contain a Pattern
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You can check whether elements contain a pattern:
-
-.. ipython:: python
-
-   pattern = r'[a-z][0-9]'
-   Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
-
-or match a pattern:
-
-
-.. ipython:: python
-
-   Series(['1', '2', '3a', '3b', '03c']).str.match(pattern, as_indexer=True)
-
-The distinction between ``match`` and ``contains`` is strictness: ``match``
-relies on strict ``re.match``, while ``contains`` relies on ``re.search``.
-
-.. warning::
-
-   In previous versions, ``match`` was for *extracting* groups,
-   returning a not-so-convenient Series of tuples. The new method ``extract``
-   (described in the previous section) is now preferred.
-
-   This old, deprecated behavior of ``match`` is still the default. As
-   demonstrated above, use the new behavior by setting ``as_indexer=True``.
-   In this mode, ``match`` is analogous to ``contains``, returning a boolean
-   Series. The new behavior will become the default behavior in a future
-   release.
-
-Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
-an extra ``na`` argument so missing values can be considered True or False:
-
-.. ipython:: python
-
-   s4 = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
-   s4.str.contains('A', na=False)
-
-.. csv-table::
-   :header: "Method", "Description"
-   :widths: 20, 80
+Powerful pattern-matching methods are provided as well, but note that
+pattern-matching generally uses `regular expressions
+<https://docs.python.org/2/library/re.html>`__ by default (and in some cases
+always uses them).
 
-   ``cat``,Concatenate strings
-   ``split``,Split strings on delimiter
-   ``get``,Index into each element (retrieve i-th element)
-   ``join``,Join strings in each element of the Series with passed separator
-   ``contains``,Return boolean array if each string contains pattern/regex
-   ``replace``,Replace occurrences of pattern/regex with some other string
-   ``repeat``,Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
-   ``pad``,"Add whitespace to left, right, or both sides of strings"
-   ``center``,Equivalent to ``pad(side='both')``
-   ``wrap``,Split long strings into lines with length less than a given width
-   ``slice``,Slice each string in the Series
-   ``slice_replace``,Replace slice in each string with passed value
-   ``count``,Count occurrences of pattern
-   ``startswith``,Equivalent to ``str.startswith(pat)`` for each element
-   ``endswith``,Equivalent to ``str.endswith(pat)`` for each element
-   ``findall``,Compute list of all occurrences of pattern/regex for each string
-   ``match``,"Call ``re.match`` on each element, returning matched groups as list"
-   ``extract``,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
-   ``len``,Compute string lengths
-   ``strip``,Equivalent to ``str.strip``
-   ``rstrip``,Equivalent to ``str.rstrip``
-   ``lstrip``,Equivalent to ``str.lstrip``
-   ``lower``,Equivalent to ``str.lower``
-   ``upper``,Equivalent to ``str.upper``
-
-Getting indicator variables from separated strings
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You can extract dummy variables from string columns.
-For example if they are separated by a ``'|'``:
-
-.. ipython:: python
-
-   s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
-   s.str.get_dummies(sep='|')
-
-See also :func:`~pandas.get_dummies`.
-
+Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
+description.
 
 .. _basics.sorting:
 

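The passage removed from ``basics.rst`` above describes the difference between ``match`` and ``contains`` as ``re.match`` versus ``re.search``. The following is a small standard-library sketch of that distinction only. The pattern and sample strings are made up for illustration, and the code calls ``re`` directly, so it does not reproduce pandas behavior such as NA handling or the ``as_indexer`` flag.

```python
import re

pattern = r"[0-9][a-z]"      # hypothetical pattern: a digit followed by a lowercase letter
values = ["1", "3a", "03c"]  # hypothetical sample data, not from the commit

# re.search scans the whole string; this is what ``contains`` is described as relying on.
found_anywhere = [bool(re.search(pattern, v)) for v in values]

# re.match succeeds only if the pattern matches at the start of the string;
# this is what ``match`` is described as relying on.
found_at_start = [bool(re.match(pattern, v)) for v in values]

print(found_anywhere)
print(found_at_start)
```

The string ``"03c"`` separates the two: the pattern occurs inside it but not at its start, so the search succeeds while the anchored match does not.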
doc/source/index.rst.template

Lines changed: 1 addition & 0 deletions
@@ -122,6 +122,7 @@ See the package overview for more detail about what's in the library.
 cookbook
 dsintro
 basics
+text
 options
 indexing
 advanced
