* Lots of minor grammar, clarity, brevity, or other style improvements.
* Make all links consistently point to stable documents rather than a mix of
  stable and master.
* Make certain link targets relative instead of absolute.
Co-authored-by: Dmytro Dzhulgakov <dzhulgakov@users.noreply.github.com>
Co-authored-by: Holly Sweeney <77758406+holly1238@users.noreply.github.com>

3. Use single-machine multi-GPU `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
   if you would like to further speed up training and are willing to write a
   little more code to set it up.
4. Use multi-machine `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
   and the `launching script <https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md>`__,
   if the application needs to scale across machine boundaries.
5. Use `torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
   to launch distributed training if errors (e.g., out-of-memory) are expected or if
   resources can join and leave dynamically during training (see the launch sketch
   after this list).
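
The sketch below is a minimal illustration of options 4 and 5. The toy model, the
tensor sizes, and the rendezvous host/port are placeholder assumptions, and it
presumes a PyTorch build that ships the ``torch.distributed.run`` (elastic) launcher.

.. code:: python

    # minimal_ddp.py -- illustrative multi-machine DDP script, started by a launcher
    # that sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, e.g.:
    #
    #   python -m torch.distributed.run --nnodes=2 --nproc_per_node=4 \
    #       --rdzv_backend=c10d --rdzv_endpoint=HOST:PORT minimal_ddp.py
    #
    # (run the same command on every participating machine)
    import os

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP


    def main():
        # env:// (the default) reads the rendezvous info set by the launcher.
        dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
        local_rank = int(os.environ["LOCAL_RANK"])
        device = torch.device(f"cuda:{local_rank}" if torch.cuda.is_available() else "cpu")

        model = nn.Linear(10, 10).to(device)          # placeholder model
        ddp_model = DDP(model, device_ids=[local_rank] if device.type == "cuda" else None)
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

        for _ in range(10):                           # placeholder training loop
            optimizer.zero_grad()
            loss = ddp_model(torch.randn(32, 10, device=device)).sum()
            loss.backward()                           # gradients are all-reduced here
            optimizer.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()
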
.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.
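
As a rough illustration of the note above, one training step that combines DDP with
``torch.cuda.amp`` is sketched below; the model, optimizer, loss function, and data
loader are assumed to be defined elsewhere (for example, as in the DDP sketch earlier).

.. code:: python

    import torch
    from torch.cuda.amp import GradScaler, autocast


    def train_one_epoch(ddp_model, optimizer, loss_fn, data_loader, scaler: GradScaler):
        for inputs, targets in data_loader:
            optimizer.zero_grad()
            with autocast():                      # forward pass runs in mixed precision
                loss = loss_fn(ddp_model(inputs), targets)
            scaler.scale(loss).backward()         # scaled backward; DDP all-reduces as usual
            scaler.step(optimizer)                # unscales gradients, then steps
            scaler.update()
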
``torch.nn.DataParallel``
~~~~~~~~~~~~~~~~~~~~~~~~~

The `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
package enables single-machine multi-GPU parallelism with the lowest coding
hurdle. It only requires a one-line change to the application code (see the sketch
below). The tutorial
`Optional: Data Parallelism <https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html>`__
shows an example. The caveat is that, although ``DataParallel`` is very easy to
use, it usually does not offer the best performance. This is because the
implementation of ``DataParallel`` replicates the model in every forward pass,
and its single-process multi-thread parallelism naturally suffers from GIL
contention. To get better performance, consider using
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__.
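
For reference, a minimal sketch of the one-line change mentioned above; the toy
model and tensor sizes are placeholders.

.. code:: python

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 10)                     # placeholder single-device model

    # The one-line change: wrap the model so each forward pass is split across
    # all visible GPUs (single process, multiple threads).
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    outputs = model(torch.randn(32, 10, device=device))
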

``torch.nn.parallel.DistributedDataParallel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compared to ``DataParallel``, `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
requires one more step to set up, i.e., calling
`init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
DDP uses multi-process parallelism, and hence there is no GIL contention across
model replicas. Moreover, the model is broadcast at DDP construction time instead
of in every forward pass, which also helps to speed up training.

DDP materials are listed below:

1. `DDP notes <https://pytorch.org/docs/stable/notes/ddp.html>`__
   offer a starter example and some brief descriptions of its design and
   implementation. If this is your first time using DDP, start from this
   document.
2. `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__
   explains some common problems with DDP training, including unbalanced
   workload, checkpointing, and multi-device models. Note that DDP can be
   easily combined with single-machine multi-device model parallelism which is
   described in the
   `Single-Machine Model Parallel Best Practices <../intermediate/model_parallel_tutorial.html>`__
   tutorial.
3. The `Launching and configuring distributed data parallel applications <https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md>`__
   document shows how to use the DDP launching script.
4. The `Shard Optimizer States With ZeroRedundancyOptimizer <../recipes/zero_redundancy_optimizer.html>`__
   recipe demonstrates how `ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html>`__
   helps to reduce optimizer memory footprint (see the sketch after this list).
5. The `Distributed Training with Uneven Inputs Using the Join Context Manager <../advanced/generic_join.html>`__
   tutorial walks through using the generic join context for distributed training with uneven inputs.
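
As a brief sketch of item 4, wrapping a standard optimizer with
``ZeroRedundancyOptimizer`` inside an existing DDP setup looks roughly as follows;
the model, the choice of ``torch.optim.Adam``, and the learning rate are placeholder
assumptions.

.. code:: python

    import torch
    from torch.distributed.optim import ZeroRedundancyOptimizer
    from torch.nn.parallel import DistributedDataParallel as DDP


    def build_ddp_with_zero(model: torch.nn.Module):
        # Assumes torch.distributed.init_process_group() has already been called
        # (e.g., by the launcher) and that `model` sits on this rank's device.
        ddp_model = DDP(model)
        # Each rank keeps only its own shard of the optimizer states instead of a
        # full replica, which reduces per-rank optimizer memory.
        optimizer = ZeroRedundancyOptimizer(
            ddp_model.parameters(),
            optimizer_class=torch.optim.Adam,
            lr=1e-3,
        )
        return ddp_model, optimizer
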

torch.distributed.elastic
~~~~~~~~~~~~~~~~~~~~~~~~~

With the growth of application complexity and scale, failure recovery becomes a
requirement. Sometimes it is inevitable to hit errors like out-of-memory (OOM) when
using DDP, but DDP itself cannot recover from those errors, and it is not possible
to handle them using a standard ``try-except`` construct. This is because DDP
requires all processes to operate in a closely synchronized manner and all
``AllReduce`` communications launched in different processes must match. If one of
the processes in the group throws an exception, it is likely to lead to
desynchronization (mismatched ``AllReduce`` operations), which would then cause a
crash or hang. `torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
adds fault tolerance and the ability to make use of a dynamic pool of machines
(elasticity).
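
As a concrete but illustrative example, an elastic launch that tolerates between one
and four nodes and up to three restarts could look like the command below; the
endpoint, node and process counts, job id, and script name are placeholder values,
and the flags assume the ``torch.distributed.run`` launcher::

    python -m torch.distributed.run \
        --nnodes=1:4 --nproc_per_node=8 --max_restarts=3 \
        --rdzv_backend=c10d --rdzv_endpoint=HOST:PORT --rdzv_id=JOB_ID \
        train_script.py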