
Commit 00b83bf

formatting
1 parent ac602d6 commit 00b83bf

1 file changed (+15, -12 lines)

recipes_source/recipes/amp_recipe.py

Lines changed: 15 additions & 12 deletions
@@ -8,17 +8,17 @@
 where some operations use the ``torch.float32`` (``float``) datatype and other operations
 use ``torch.float16`` (``half``). Some ops, like linear layers and convolutions,
 are much faster in ``float16``. Other ops, like reductions, often require the dynamic
-range of ``float32``. Mixed precision tries to match each op to its appropriate datatype.
+range of ``float32``. Mixed precision tries to match each op to its appropriate datatype,
 which can reduce your network's runtime and memory footprint.

 Ordinarily, "automatic mixed precision training" uses `torch.cuda.amp.autocast <https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast>`_ and
 `torch.cuda.amp.GradScaler <https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler>`_ together.

-This tutorial measures the performance of a simple network in default precision,
+This recipe measures the performance of a simple network in default precision,
 then walks through adding ``autocast`` and ``GradScaler`` to run the same network in
 mixed precision with improved performance.

-You may download and run this tutorial as a standalone Python script.
+You may download and run this recipe as a standalone Python script.
 The only requirements are Pytorch 1.6+ and a CUDA-capable GPU.

 Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere).
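For orientation, the pattern the recipe builds toward looks roughly like the sketch below: ``autocast`` wraps only the forward pass (including the loss computation), while ``GradScaler`` manages the backward pass and optimizer step. The tiny model, sizes, and step count here are illustrative stand-ins, not the recipe's ``make_model`` network.

import torch

# Hypothetical stand-in setup; like the recipe, this assumes a CUDA-capable GPU.
device = "cuda"
net = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()
x = torch.randn(32, 64, device=device)
y = torch.randn(32, 64, device=device)

scaler = torch.cuda.amp.GradScaler()

for step in range(3):
    # autocast runs eligible ops (e.g., linear layers) in float16 and the rest in float32.
    with torch.cuda.amp.autocast():
        loss = loss_fn(net(x), y)
    # GradScaler scales the loss, unscales before the optimizer step, and updates its scale factor.
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()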
@@ -62,7 +62,7 @@ def make_model(in_size, out_size, num_layers):

 ##########################################################
 # ``batch_size``, ``in_size``, ``out_size``, and ``num_layers`` are chosen to be large enough to saturate the GPU with work.
-# Typically, mixed precision provides the greatest speedup when GPU is saturated.
+# Typically, mixed precision provides the greatest speedup when the GPU is saturated.
 # Small networks may be CPU bound, in which case mixed precision won't improve performance.
 # Sizes are also chosen such that linear layers' participating dimensions are multiples of 8,
 # to permit Tensor Core usage on Tensor Core-capable GPUs (see :ref:`Troubleshooting<troubleshooting>` below).
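As a concrete illustration of that sizing rule (the values below are placeholders in the spirit of the recipe, not necessarily its exact constants), every dimension that participates in the linear layers' matmuls is kept a multiple of 8:

# Illustrative sizes: multiples of 8 permit Tensor Core use on capable GPUs,
# and large values help saturate the GPU so mixed precision pays off.
batch_size = 512   # multiple of 8
in_size = 4096     # multiple of 8
out_size = 4096    # multiple of 8
num_layers = 3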
@@ -87,7 +87,7 @@ def make_model(in_size, out_size, num_layers):
 ##########################################################
 # Default Precision
 # -----------------
-# Without torch.cuda.amp, the following simple network executes all ops in default precision (``torch.float32``):
+# Without ``torch.cuda.amp``, the following simple network executes all ops in default precision (``torch.float32``):

 net = make_model(in_size, out_size, num_layers)
 opt = torch.optim.SGD(net.parameters(), lr=0.001)
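For reference, the default-precision baseline the recipe times follows the plain loop pattern sketched below; the model, data, and loss function are small stand-ins for the recipe's variables of the same names.

import torch

# Stand-in model and synthetic data (hypothetical sizes).
device = "cuda"
net = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()
data = [torch.randn(32, 64, device=device) for _ in range(4)]
targets = [torch.randn(32, 64, device=device) for _ in range(4)]

# Default precision: every op runs in torch.float32.
for epoch in range(1):
    for input, target in zip(data, targets):
        output = net(input)
        loss = loss_fn(output, target)
        loss.backward()
        opt.step()
        opt.zero_grad()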
@@ -139,7 +139,8 @@ def make_model(in_size, out_size, num_layers):
 # helps prevent gradients with small magnitudes from flushing to zero
 # ("underflowing") when training with mixed precision.
 #
-# ``torch.cuda.amp.GradScaler`` performs the steps of gradient scaling conveniently.
+# `torch.cuda.amp.GradScaler <https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler>`_
+# performs the steps of gradient scaling conveniently.

 # Constructs scaler once, at the beginning of the convergence run, using default args.
 # If your network fails to converge with default GradScaler args, please file an issue.
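A minimal sketch of the scaled backward pass those comments describe (stand-in model and data again): the loss is scaled before ``backward()``, ``scaler.step`` unscales the gradients and skips the optimizer step if they contain infs or NaNs, and ``scaler.update`` adjusts the scale factor for the next iteration.

import torch

# Stand-in model and synthetic data (hypothetical sizes).
device = "cuda"
net = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()
data = [torch.randn(32, 64, device=device) for _ in range(4)]
targets = [torch.randn(32, 64, device=device) for _ in range(4)]

scaler = torch.cuda.amp.GradScaler()  # constructed once, default args

for epoch in range(1):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()  # gradients are now scaled
        scaler.step(opt)               # unscales grads, then opt.step() unless infs/NaNs appear
        scaler.update()                # grows or shrinks the scale factor as needed
        opt.zero_grad()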
@@ -170,9 +171,9 @@ def make_model(in_size, out_size, num_layers):
 ##########################################################
 # All together ("Automatic Mixed Precision")
 # ------------------------------------------
-# The following also demonstrates ``enabled``, an optional convenience argument to ``autocast`` and ``GradScaler``.
+# (The following also demonstrates ``enabled``, an optional convenience argument to ``autocast`` and ``GradScaler``.
 # If False, ``autocast`` and ``GradScaler``\ 's calls become no-ops.
-# This allows switching between default precision and mixed precision without if/else statements.
+# This allows switching between default precision and mixed precision without if/else statements.)

 use_amp = True

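A sketch of that ``enabled`` flag in use (stand-in model and data): flipping ``use_amp`` to ``False`` turns ``autocast`` and ``GradScaler`` into no-ops, so the identical loop runs in default precision.

import torch

device = "cuda"
net = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()
x = torch.randn(32, 64, device=device)
y = torch.randn(32, 64, device=device)

use_amp = True  # set False to run the same loop in default precision

scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for step in range(3):
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = loss_fn(net(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()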
@@ -196,8 +197,8 @@ def make_model(in_size, out_size, num_layers):
 # Inspecting/modifying gradients (e.g., clipping)
 # --------------------------------------------------------
 # All gradients produced by ``scaler.scale(loss).backward()`` are scaled. If you wish to modify or inspect
-# the parameters' ``.grad`` attributes between ``backward()`` and ``scaler.step(optimizer)``, you should
-# unscale them first using ``scaler.unscale_(optimizer)``.
+# the parameters' ``.grad`` attributes between ``backward()`` and ``scaler.step(optimizer)``, you should
+# unscale them first using `scaler.unscale_(optimizer) <https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.unscale_>`_.

 for epoch in range(0): # 0 epochs, this section is for illustration only
     for input, target in zip(data, targets):
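For example, gradient clipping under Amp follows the pattern sketched below: unscale first so the gradients are clipped at their true magnitudes, then step (stand-in model; the clipping threshold is arbitrary).

import torch

device = "cuda"
net = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
loss_fn = torch.nn.MSELoss()
x = torch.randn(32, 64, device=device)
y = torch.randn(32, 64, device=device)

scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = loss_fn(net(x), y)
scaler.scale(loss).backward()

scaler.unscale_(opt)  # .grad attributes now hold unscaled gradients
torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=0.1)  # arbitrary threshold

scaler.step(opt)   # knows grads were already unscaled; won't unscale them again
scaler.update()
opt.zero_grad()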
@@ -232,6 +233,7 @@ def make_model(in_size, out_size, num_layers):
               "optimizer": opt.state_dict(),
               "scaler": scaler.state_dict()}

+##########################################################
 # (write checkpoint as desired, e.g., ``torch.save(checkpoint, "filename")``.)
 #
 # When resuming, load the scaler state dict alongside the model and optimizer state dicts.
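A sketch of the save side (the model and file name are placeholders): the scaler state dict rides along with the model and optimizer states so the current scale factor survives the restart.

import torch

device = "cuda"
net = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler()

checkpoint = {"model": net.state_dict(),
              "optimizer": opt.state_dict(),
              "scaler": scaler.state_dict()}
torch.save(checkpoint, "amp_checkpoint.pt")  # placeholder path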
@@ -242,11 +244,12 @@ def make_model(in_size, out_size, num_layers):
 opt.load_state_dict(checkpoint["optimizer"])
 scaler.load_state_dict(checkpoint["scaler"])

-# If a checkpoint was created from a run _without_ mixed precision, and you want to resume training _with_ mixed precision,
+##########################################################
+# If a checkpoint was created from a run *without* Amp, and you want to resume training *with* Amp,
 # load model and optimizer states from the checkpoint as usual. The checkpoint won't contain a saved scaler state, so
 # use a fresh instance of ``GradScaler``.
 #
-# If a checkpoint was created from a run _with_ mixed precision and you want to resume training _without_ mixed precision,
+# If a checkpoint was created from a run *with* Amp and you want to resume training *without* Amp,
 # load model and optimizer states from the checkpoint as usual, and ignore the saved scaler state.

 ##########################################################
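And a sketch of the resume cases described in that hunk (placeholder path and stand-in model): restore the saved scaler state when it exists, otherwise start from a fresh ``GradScaler``; when resuming without Amp, simply ignore the saved scaler entry.

import torch

device = "cuda"
net = torch.nn.Linear(64, 64).to(device)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler()

checkpoint = torch.load("amp_checkpoint.pt")  # placeholder path
net.load_state_dict(checkpoint["model"])
opt.load_state_dict(checkpoint["optimizer"])

if "scaler" in checkpoint:
    # Checkpoint came from an Amp run: restore the scale factor and its history.
    scaler.load_state_dict(checkpoint["scaler"])
# Otherwise the run was saved without Amp, so the fresh GradScaler above is used as-is.
# To resume *without* Amp, load model/optimizer as usual and ignore any "scaler" entry.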
