Commit 574a9df

Merge pull request #10680 from jorisvandenbossche/doc-iter
DOC: improve docs on iteration
2 parents 4024ec2 + 2226780 commit 574a9df

2 files changed

+174
-48
lines changed

doc/source/basics.rst

Lines changed: 109 additions & 35 deletions
@@ -1151,34 +1151,91 @@ parameter that is by default ``False`` and copies the underlying data. Pass
11511151
The Panel class has a related :meth:`~Panel.rename_axis` method which can rename
11521152
any of its three axes.
11531153

1154+
.. _basics.iteration:
1155+
11541156
Iteration
11551157
---------
11561158

1157-
Because Series is array-like, basic iteration produces the values. Other data
1158-
structures follow the dict-like convention of iterating over the "keys" of the
1159-
objects. In short:
1159+
The behavior of basic iteration over pandas objects depends on the type.
1160+
When iterating over a Series, it is regarded as array-like, and basic iteration
1161+
produces the values. Other data structures, like DataFrame and Panel,
1162+
follow the dict-like convention of iterating over the "keys" of the
1163+
objects.
1164+
1165+
In short, basic iteration (``for i in object``) produces:
11601166

1161-
* **Series**: values
1162-
* **DataFrame**: column labels
1163-
* **Panel**: item labels
1167+
* **Series**: values
1168+
* **DataFrame**: column labels
1169+
* **Panel**: item labels
11641170

1165-
Thus, for example:
1171+
Thus, for example, iterating over a DataFrame gives you the column names:
11661172

11671173
.. ipython::
11681174

1169-
In [0]: for col in df:
1170-
...: print(col)
1171-
...:
1175+
In [0]: df = pd.DataFrame({'col1' : np.random.randn(3), 'col2' : np.random.randn(3)},
1176+
...: index=['a', 'b', 'c'])
1177+
1178+
In [0]: for col in df:
1179+
...: print(col)
1180+
...:
1181+
1182+
Pandas objects also have the dict-like :meth:`~DataFrame.iteritems` method to
1183+
iterate over the (key, value) pairs.
1184+
1185+
To iterate over the rows of a DataFrame, you can use the following methods:
1186+
1187+
* :meth:`~DataFrame.iterrows`: Iterate over the rows of a DataFrame as (index, Series) pairs.
1188+
This converts the rows to Series objects, which can change the dtypes and has some
1189+
performance implications.
1190+
* :meth:`~DataFrame.itertuples`: Iterate over the rows of a DataFrame as tuples of the values.
1191+
This is a lot faster than :meth:`~DataFrame.iterrows`, and is in most cases the preferable
1192+
way to iterate over the values of a DataFrame.
1193+
1194+
.. warning::
1195+
1196+
Iterating through pandas objects is generally **slow**. In many cases,
1197+
iterating manually over the rows is not needed and can be avoided with
1198+
one of the following approaches:
1199+
1200+
* Look for a *vectorized* solution: many operations can be performed using
1201+
built-in methods or numpy functions, (boolean) indexing, ...
1202+
1203+
* When you have a function that cannot work on the full DataFrame/Series
1204+
at once, it is better to use :meth:`~DataFrame.apply` instead of iterating
1205+
over the values. See the docs on :ref:`function application <basics.apply>`.
1206+
1207+
* If you need to do iterative manipulations on the values but performance is
1208+
important, consider writing the inner loop using e.g. Cython or Numba.
1209+
See the :ref:`enhancing performance <enhancingperf>` section for some
1210+
examples of this approach.
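
As a rough illustration of the first two approaches in the warning above, the sketch below computes a per-row sum three ways: with an explicit loop, with a vectorized expression, and with ``apply``. The frame contents and the helper name ``row_total`` are made up for this example and are not part of the commit.

```python
import pandas as pd

# Small illustrative frame (values chosen only for this sketch).
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [10.0, 20.0, 30.0]},
                  index=['a', 'b', 'c'])

# 1. Explicit Python loop over the rows: works, but is generally slow.
loop_values = []
for row in df.itertuples():
    # row[0] is the index label; row[1] and row[2] are the column values.
    loop_values.append(row[1] + row[2])
loop_result = pd.Series(loop_values, index=df.index)

# 2. Vectorized: operate on whole columns at once.
vectorized_result = df['col1'] + df['col2']

# 3. apply: call a function once per row when no vectorized form exists.
def row_total(row):
    # Hypothetical helper; receives each row as a Series.
    return row['col1'] + row['col2']

applied_result = df.apply(row_total, axis=1)
```

All three produce the same values; the vectorized form is usually the one to reach for first.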
1211+
1212+
.. warning::
1213+
1214+
You should **never modify** something you are iterating over.
1215+
This is not guaranteed to work in all cases. Depending on the
1216+
data types, the iterator returns a copy and not a view, and writing
1217+
to it will have no effect!
1218+
1219+
For example, in the following case setting the value has no effect:
1220+
1221+
.. ipython:: python
1222+
1223+
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
1224+
1225+
for index, row in df.iterrows():
1226+
row['a'] = 10
1227+
1228+
df
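
A minimal sketch of what could be done instead, assuming the goal is simply to set values in column ``a``: assign to the DataFrame itself rather than to the row yielded by the iterator. The specific values written here are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

# Writing to the row yielded by iterrows only changes a temporary Series.
for index, row in df.iterrows():
    row['a'] = 10

unchanged = df['a'].tolist()  # the original frame is not modified

# Assign through the DataFrame instead.
df['a'] = 10          # set the whole column at once
df.loc[1, 'a'] = 20   # set a single cell, addressed by row label and column
```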
11721229
11731230
iteritems
11741231
~~~~~~~~~
11751232

11761233
Consistent with the dict-like interface, :meth:`~DataFrame.iteritems` iterates
11771234
through key-value pairs:
11781235

1179-
* **Series**: (index, scalar value) pairs
1180-
* **DataFrame**: (column, Series) pairs
1181-
* **Panel**: (item, DataFrame) pairs
1236+
* **Series**: (index, scalar value) pairs
1237+
* **DataFrame**: (column, Series) pairs
1238+
* **Panel**: (item, DataFrame) pairs
11821239

11831240
For example:
11841241

@@ -1189,22 +1246,46 @@ For example:
11891246
...: print(frame)
11901247
...:
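
For readers trying this on a recent pandas release, the following is a small sketch of the same key-value iteration. It assumes the ``items`` spelling, since ``iteritems`` was later deprecated and removed from pandas; the frame contents are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

# Assumption: DataFrame.items is used in place of DataFrame.iteritems.
# On a DataFrame, each pair is (column label, Series holding that column).
column_pairs = list(df.items())
labels = [label for label, column in column_pairs]

# On a Series, each pair is (index label, scalar value).
series_pairs = [(label, int(value)) for label, value in df['col1'].items()]
```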
11911248

1192-
11931249
.. _basics.iterrows:
11941250

11951251
iterrows
11961252
~~~~~~~~
11971253

1198-
New in v0.7 is the ability to iterate efficiently through rows of a
1199-
DataFrame with :meth:`~DataFrame.iterrows`. It returns an iterator yielding each
1254+
:meth:`~DataFrame.iterrows` allows you to iterate through the rows of a
1255+
DataFrame as Series objects. It returns an iterator yielding each
12001256
index value along with a Series containing the data in each row:
12011257

12021258
.. ipython::
12031259

1204-
In [0]: for row_index, row in df2.iterrows():
1260+
In [0]: for row_index, row in df.iterrows():
12051261
...: print('%s\n%s' % (row_index, row))
12061262
...:
12071263

1264+
.. note::
1265+
1266+
Because :meth:`~DataFrame.iterrows` returns a Series for each row,
1267+
it does **not** preserve dtypes across the rows (dtypes are
1268+
preserved across columns for DataFrames). For example,
1269+
1270+
.. ipython:: python
1271+
1272+
df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
1273+
df_orig.dtypes
1274+
row = next(df_orig.iterrows())[1]
1275+
row
1276+
1277+
All values in ``row``, returned as a Series, are now upcast
1278+
to floats, including the original integer value in column ``int``:
1279+
1280+
.. ipython:: python
1281+
1282+
row['int'].dtype
1283+
df_orig['int'].dtype
1284+
1285+
To preserve dtypes while iterating over the rows, it is better
1286+
to use :meth:`~DataFrame.itertuples` which returns tuples of the values
1287+
and which is generally much faster than ``iterrows``.
1288+
12081289
For instance, a contrived way to transpose the DataFrame would be:
12091290

12101291
.. ipython:: python
@@ -1216,36 +1297,29 @@ For instance, a contrived way to transpose the DataFrame would be:
12161297
df2_t = pd.DataFrame(dict((idx,values) for idx, values in df2.iterrows()))
12171298
print(df2_t)
12181299
1219-
.. note::
1220-
1221-
``iterrows`` does **not** preserve dtypes across the rows (dtypes are
1222-
preserved across columns for DataFrames). For example,
1223-
1224-
.. ipython:: python
1225-
1226-
df_iter = pd.DataFrame([[1, 1.0]], columns=['x', 'y'])
1227-
row = next(df_iter.iterrows())[1]
1228-
print(row['x'].dtype)
1229-
print(df_iter['x'].dtype)
1230-
12311300
itertuples
12321301
~~~~~~~~~~
12331302

1234-
The :meth:`~DataFrame.itertuples` method will return an iterator yielding a tuple for each row in the
1235-
DataFrame. The first element of the tuple will be the row's corresponding index
1236-
value, while the remaining values are the row values proper.
1303+
The :meth:`~DataFrame.itertuples` method will return an iterator
1304+
yielding a tuple for each row in the DataFrame. The first element
1305+
of the tuple will be the row's corresponding index value,
1306+
while the remaining values are the row values.
12371307

12381308
For instance,
12391309

12401310
.. ipython:: python
12411311
1242-
for r in df2.itertuples():
1243-
print(r)
1312+
for row in df.itertuples():
1313+
print(row)
1314+
1315+
This method does not convert the row to a Series object but just returns the
1316+
values inside a tuple. Therefore, :meth:`~DataFrame.itertuples` preserves the
1317+
data type of the values and is generally faster than :meth:`~DataFrame.iterrows`.
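
The dtype difference between the two methods can be checked directly. The sketch below reuses the one-row frame from the ``iterrows`` note; the variable names are chosen only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

# iterrows: the row becomes a single Series, so the integer is upcast to float.
_, row_series = next(df.iterrows())
series_value_is_float = isinstance(row_series['int'], float)

# itertuples: each value keeps the type it had in its own column.
row_tuple = next(df.itertuples(index=False))
tuple_value_is_int = isinstance(row_tuple[0], (int, np.integer))
```

Passing ``index=False`` drops the index label, so ``row_tuple[0]`` is the value from the ``int`` column.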
12441318

12451319
.. _basics.dt_accessors:
12461320

12471321
.dt accessor
1248-
~~~~~~~~~~~~
1322+
------------
12491323

12501324
``Series`` has an accessor to succinctly return datetime like properties for the
12511325
*values* of the Series, if it is a datetime/period-like Series.

pandas/core/frame.py

Lines changed: 65 additions & 13 deletions
@@ -546,7 +546,15 @@ def _repr_html_(self):
546546
return None
547547

548548
def iteritems(self):
549-
"""Iterator over (column, series) pairs"""
549+
"""
550+
Iterator over (column name, Series) pairs.
551+
552+
See also
553+
--------
554+
iterrows : Iterate over the rows of a DataFrame as (index, Series) pairs.
555+
itertuples : Iterate over the rows of a DataFrame as tuples of the values.
556+
557+
"""
550558
if self.columns.is_unique and hasattr(self, '_item_cache'):
551559
for k in self.columns:
552560
yield k, self._get_item_cache(k)
@@ -556,25 +564,45 @@ def iteritems(self):
556564

557565
def iterrows(self):
558566
"""
559-
Iterate over rows of DataFrame as (index, Series) pairs.
567+
Iterate over the rows of a DataFrame as (index, Series) pairs.
560568
561569
Notes
562570
-----
563571
564-
* ``iterrows`` does **not** preserve dtypes across the rows (dtypes
565-
are preserved across columns for DataFrames). For example,
566-
567-
>>> df = DataFrame([[1, 1.0]], columns=['x', 'y'])
568-
>>> row = next(df.iterrows())[1]
569-
>>> print(row['x'].dtype)
570-
float64
571-
>>> print(df['x'].dtype)
572-
int64
572+
1. Because ``iterrows`` returns a Series for each row,
573+
it does **not** preserve dtypes across the rows (dtypes are
574+
preserved across columns for DataFrames). For example,
575+
576+
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
577+
>>> row = next(df.iterrows())[1]
578+
>>> row
579+
int 1.0
580+
float 1.5
581+
Name: 0, dtype: float64
582+
>>> print(row['int'].dtype)
583+
float64
584+
>>> print(df['int'].dtype)
585+
int64
586+
587+
To preserve dtypes while iterating over the rows, it is better
588+
to use :meth:`itertuples` which returns tuples of the values
589+
and which is generally faster than ``iterrows``.
590+
591+
2. You should **never modify** something you are iterating over.
592+
This is not guaranteed to work in all cases. Depending on the
593+
data types, the iterator returns a copy and not a view, and writing
594+
to it will have no effect.
573595
574596
Returns
575597
-------
576598
it : generator
577599
A generator that iterates over the rows of the frame.
600+
601+
See also
602+
--------
603+
itertuples : Iterate over the rows of a DataFrame as tuples of the values.
604+
iteritems : Iterate over (column name, Series) pairs.
605+
578606
"""
579607
columns = self.columns
580608
for k, v in zip(self.index, self.values):
@@ -583,8 +611,32 @@ def iterrows(self):
583611

584612
def itertuples(self, index=True):
585613
"""
586-
Iterate over rows of DataFrame as tuples, with index value
587-
as first element of the tuple
614+
Iterate over the rows of a DataFrame as tuples, with the index value
615+
as the first element of the tuple.
616+
617+
Parameters
618+
----------
619+
index : boolean, default True
620+
If True, return the index as the first element of the tuple.
621+
622+
See also
623+
--------
624+
iterrows : Iterate over the rows of a DataFrame as (index, Series) pairs.
625+
iteritems : Iterate over (column name, Series) pairs.
626+
627+
Examples
628+
--------
629+
630+
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
631+
>>> df
632+
col1 col2
633+
a 1 0.1
634+
b 2 0.2
635+
>>> for row in df.itertuples():
636+
... print(row)
637+
('a', 1, 0.10000000000000001)
638+
('b', 2, 0.20000000000000001)
639+
588640
"""
589641
arrays = []
590642
if index:
