11 | 11 | range of ``float32``. Mixed precision tries to match each op to its appropriate datatype,
12 | 12 | which can reduce your network's runtime and memory footprint.
13 | 13 |
14 |    | -Ordinarily, "automatic mixed precision training" uses `torch.autocast <https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast>`_ and
   | 14 | +Ordinarily, "automatic mixed precision training" uses `torch.autocast <https://pytorch.org/docs/stable/amp.html#torch.autocast>`_ and
15 | 15 | `torch.cuda.amp.GradScaler <https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler>`_ together.
16 | 16 |
17 | 17 | This recipe measures the performance of a simple network in default precision,
18 | 18 | then walks through adding ``autocast`` and ``GradScaler`` to run the same network in
19 | 19 | mixed precision with improved performance.
20 | 20 |
21 | 21 | You may download and run this recipe as a standalone Python script.
22 |    | -The only requirements are Pytorch 1.6+ and a CUDA-capable GPU.
   | 22 | +The only requirements are PyTorch 1.6 or later and a CUDA-capable GPU.
23 | 23 |
24 | 24 | Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere).
25 | 25 | This recipe should show significant (2-3X) speedup on those architectures.
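For context on the pairing these lines describe: ``torch.autocast`` chooses per-op dtypes for the forward pass, while ``torch.cuda.amp.GradScaler`` scales the loss so small ``float16`` gradients don't flush to zero. A minimal sketch of how the two are typically combined in a training loop, using an illustrative linear model, SGD optimizer, MSE loss, and random CUDA data (none of these specifics are taken from the recipe itself)::

    import torch

    # Illustrative setup; any model, optimizer, loss, and data would do.
    net = torch.nn.Linear(256, 256).cuda()
    opt = torch.optim.SGD(net.parameters(), lr=0.001)
    loss_fn = torch.nn.MSELoss()
    data = [(torch.randn(64, 256, device="cuda"),
             torch.randn(64, 256, device="cuda")) for _ in range(10)]

    scaler = torch.cuda.amp.GradScaler()

    for input, target in data:
        opt.zero_grad()
        # Forward pass under autocast: eligible CUDA ops run in float16.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            output = net(input)
            loss = loss_fn(output, target)
        # Scale the loss before backward, then step through the scaler,
        # which unscales gradients and skips the step if any are inf/NaN.
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()

``scaler.update()`` then adjusts the scale factor for the next iteration, shrinking it when inf/NaN gradients were found and growing it after a run of successful steps.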
@@ -105,7 +105,7 @@ def make_model(in_size, out_size, num_layers):
105 | 105 | ##########################################################
106 | 106 | # Adding autocast
107 | 107 | # ---------------
108 |     | -# Instances of `torch.cuda.amp.autocast <https://pytorch.org/docs/stable/amp.html#autocasting>`_
    | 108 | +# Instances of `torch.autocast <https://pytorch.org/docs/stable/amp.html#autocasting>`_
109 | 109 | # serve as context managers that allow regions of your script to run in mixed precision.
110 | 110 | #
111 | 111 | # In these regions, CUDA ops run in a dtype chosen by autocast
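A minimal sketch of the context-manager behavior described above, using an illustrative linear layer and MSE loss on random CUDA tensors (shapes and names are not from the recipe): under autocast, the linear layer runs in ``float16`` while ``mse_loss`` runs in ``float32``::

    import torch

    net = torch.nn.Linear(64, 64).cuda()
    input = torch.randn(16, 64, device="cuda")
    target = torch.randn(16, 64, device="cuda")

    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = net(input)
        # Linear layers are autocast-eligible, so this matmul ran in float16.
        assert output.dtype is torch.float16
        loss = torch.nn.functional.mse_loss(output, target)
        # mse_loss is run in float32 by autocast for numerical accuracy.
        assert loss.dtype is torch.float32

    # backward() is called outside the autocast region; gradients use the
    # dtypes autocast chose for the corresponding forward ops.
    loss.backward()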
@@ -310,7 +310,7 @@ def make_model(in_size, out_size, num_layers):
310 | 310 | # 1. Disable ``autocast`` or ``GradScaler`` individually (by passing ``enabled=False`` to their constructor) and see if infs/NaNs persist.
311 | 311 | # 2. If you suspect part of your network (e.g., a complicated loss function) overflows, run that forward region in ``float32``
312 | 312 | # and see if infs/NaNs persist.
313 |     | -# `The autocast docstring <https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast>`_'s last code snippet
    | 313 | +# `The autocast docstring <https://pytorch.org/docs/stable/amp.html#torch.autocast>`_'s last code snippet
314 | 314 | # shows forcing a subregion to run in ``float32`` (by locally disabling autocast and casting the subregion's inputs).
315 | 315 | #
316 | 316 | # Type mismatch error (may manifest as CUDNN_STATUS_BAD_PARAM)
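A minimal sketch of troubleshooting step 2, the pattern those docstring lines describe: a hypothetical, numerically sensitive ``tricky_loss`` is forced to run in ``float32`` by locally disabling autocast and casting the subregion's inputs (every name here is illustrative, not from the recipe)::

    import torch

    def tricky_loss(output, target):
        # Hypothetical stand-in for a complicated loss suspected of overflowing.
        return torch.log1p(((output - target) ** 2).sum())

    net = torch.nn.Linear(64, 64).cuda()
    input = torch.randn(16, 64, device="cuda")
    target = torch.randn(16, 64, device="cuda")

    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = net(input)  # float16 under autocast

        # Locally disable autocast and cast the subregion's inputs so the
        # suspect loss runs entirely in float32.
        with torch.autocast(device_type="cuda", enabled=False):
            loss = tricky_loss(output.float(), target.float())

If the infs/NaNs disappear with the subregion in ``float32``, that region is the likely culprit. Step 1 works similarly at the whole-run level: constructing ``torch.autocast`` or ``torch.cuda.amp.GradScaler`` with ``enabled=False`` turns that piece of mixed precision off while leaving the surrounding code unchanged.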