@@ -1159,172 +1159,27 @@ The ``.dt`` accessor works for period and timedelta dtypes.
 
 ``Series.dt`` will raise a ``TypeError`` if you access with a non-datetimelike values
 
-.. _basics.string_methods:
-
 Vectorized string methods
 -------------------------
 
-Series is equipped (as of pandas 0.8.1) with a set of string processing methods
-that make it easy to operate on each element of the array. Perhaps most
-importantly, these methods exclude missing/NA values automatically. These are
-accessed via the Series's ``str`` attribute and generally have names matching
-the equivalent (scalar) build-in string methods:
-
-Splitting and Replacing Strings
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. ipython:: python
-
-   s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
-   s.str.lower()
-   s.str.upper()
-   s.str.len()
-
-Methods like ``split`` return a Series of lists:
-
-.. ipython:: python
-
-   s2 = Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'])
-   s2.str.split('_')
-
-Elements in the split lists can be accessed using ``get`` or ``[]`` notation:
-
-.. ipython:: python
-
-   s2.str.split('_').str.get(1)
-   s2.str.split('_').str[1]
-
-Methods like ``replace`` and ``findall`` take regular expressions, too:
-
-.. ipython:: python
-
-   s3 = Series(['A', 'B', 'C', 'Aaba', 'Baca',
-               '', np.nan, 'CABA', 'dog', 'cat'])
-   s3
-   s3.str.replace('^.a|dog', 'XX-XX ', case=False)
-
-Extracting Substrings
-~~~~~~~~~~~~~~~~~~~~~
-
-The method ``extract`` (introduced in version 0.13) accepts regular expressions
-with match groups. Extracting a regular expression with one group returns
-a Series of strings.
-
-.. ipython:: python
-
-   Series(['a1', 'b2', 'c3']).str.extract('[ab](\d)')
-
-Elements that do not match return ``NaN``. Extracting a regular expression
-with more than one group returns a DataFrame with one column per group.
-
-.. ipython:: python
-
-   Series(['a1', 'b2', 'c3']).str.extract('([ab])(\d)')
-
-Elements that do not match return a row filled with ``NaN``.
-Thus, a Series of messy strings can be "converted" into a
-like-indexed Series or DataFrame of cleaned-up or more useful strings,
-without necessitating ``get()`` to access tuples or ``re.match`` objects.
+Series is equipped with a set of string processing methods that make it easy to
+operate on each element of the array. Perhaps most importantly, these methods
+exclude missing/NA values automatically. These are accessed via the Series's
+``str`` attribute and generally have names matching the equivalent (scalar)
+built-in string methods. For example:
 
-The results dtype always is object, even if no match is found and the result
-only contains ``NaN``.
+.. ipython:: python
 
-Named groups like
+   s = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
+   s.str.lower()
 
-.. ipython:: python
-
-   Series(['a1', 'b2', 'c3']).str.extract('(?P<letter>[ab])(?P<digit>\d)')
-
-and optional groups like
-
-.. ipython:: python
-
-   Series(['a1', 'b2', '3']).str.extract('(?P<letter>[ab])?(?P<digit>\d)')
-
-can also be used.
-
-Testing for Strings that Match or Contain a Pattern
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You can check whether elements contain a pattern:
-
-.. ipython:: python
-
-   pattern = r'[a-z][0-9]'
-   Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern)
-
-or match a pattern:
-
-
-.. ipython:: python
-
-   Series(['1', '2', '3a', '3b', '03c']).str.match(pattern, as_indexer=True)
-
-The distinction between ``match`` and ``contains`` is strictness: ``match``
-relies on strict ``re.match``, while ``contains`` relies on ``re.search``.
-
-.. warning::
-
-   In previous versions, ``match`` was for *extracting* groups,
-   returning a not-so-convenient Series of tuples. The new method ``extract``
-   (described in the previous section) is now preferred.
-
-   This old, deprecated behavior of ``match`` is still the default. As
-   demonstrated above, use the new behavior by setting ``as_indexer=True``.
-   In this mode, ``match`` is analogous to ``contains``, returning a boolean
-   Series. The new behavior will become the default behavior in a future
-   release.
-
-Methods like ``match``, ``contains``, ``startswith``, and ``endswith`` take
-an extra ``na`` argument so missing values can be considered True or False:
-
-.. ipython:: python
-
-   s4 = Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
-   s4.str.contains('A', na=False)
-
-.. csv-table::
-    :header: "Method", "Description"
-    :widths: 20, 80
+Powerful pattern-matching methods are provided as well, but note that
+pattern-matching generally uses `regular expressions
+<https://docs.python.org/2/library/re.html>`__ by default (and in some cases
+always uses them).
 
-    ``cat``,Concatenate strings
-    ``split``,Split strings on delimiter
-    ``get``,Index into each element (retrieve i-th element)
-    ``join``,Join strings in each element of the Series with passed separator
-    ``contains``,Return boolean array if each string contains pattern/regex
-    ``replace``,Replace occurrences of pattern/regex with some other string
-    ``repeat``,Duplicate values (``s.str.repeat(3)`` equivalent to ``x * 3``)
-    ``pad``,"Add whitespace to left, right, or both sides of strings"
-    ``center``,Equivalent to ``pad(side='both')``
-    ``wrap``,Split long strings into lines with length less than a given width
-    ``slice``,Slice each string in the Series
-    ``slice_replace``,Replace slice in each string with passed value
-    ``count``,Count occurrences of pattern
-    ``startswith``,Equivalent to ``str.startswith(pat)`` for each element
-    ``endswith``,Equivalent to ``str.endswith(pat)`` for each element
-    ``findall``,Compute list of all occurrences of pattern/regex for each string
-    ``match``,"Call ``re.match`` on each element, returning matched groups as list"
-    ``extract``,"Call ``re.match`` on each element, as ``match`` does, but return matched groups as strings for convenience."
-    ``len``,Compute string lengths
-    ``strip``,Equivalent to ``str.strip``
-    ``rstrip``,Equivalent to ``str.rstrip``
-    ``lstrip``,Equivalent to ``str.lstrip``
-    ``lower``,Equivalent to ``str.lower``
-    ``upper``,Equivalent to ``str.upper``
-
-
-Getting indicator variables from separated strings
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You can extract dummy variables from string columns.
-For example if they are separated by a ``'|'``:
-
-.. ipython:: python
-
-   s = pd.Series(['a', 'a|b', np.nan, 'a|c'])
-   s.str.get_dummies(sep='|')
-
-See also :func:`~pandas.get_dummies`.
+Please see :ref:`Vectorized String Methods <text.string_methods>` for a complete
+description.
 
 .. _basics.sorting:
 
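Reviewer's sketch (not part of the commit): the added paragraph notes that the ``str`` methods skip missing values and that pattern matching uses regular expressions by default. The snippet below illustrates both points on the same sample data. It assumes a reasonably recent pandas in which ``Series.str.contains`` accepts the ``regex`` and ``na`` keyword arguments. The ``pd``/``np`` import aliases and the names ``lowered``, ``regex_hits``, and ``literal_hits`` are illustrative choices, not taken from the diff.

```python
import numpy as np
import pandas as pd

# Same sample data as the documentation example, including one missing value.
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

# Element-wise lowercasing; the missing value is passed through, not an error.
lowered = s.str.lower()

# By default the pattern is treated as a regular expression:
# '^.a' means "any first character followed by a lowercase 'a'".
# na=False makes the missing entry count as "no match".
regex_hits = s.str.contains('^.a', na=False)

# With regex=False the same characters are matched literally,
# so '.a' only matches an actual dot followed by 'a'.
literal_hits = s.str.contains('.a', regex=False, na=False)

print(lowered)
print(regex_hits)
print(literal_hits)
```

The contrast between ``regex_hits`` and ``literal_hits`` is the practical reason for the warning in the new text: a pattern containing characters such as ``.`` or ``|`` may match more than intended unless literal matching is requested.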