
Commit bbcf868

Authored by garymm, dzhulgakov, and holly1238
Distributed Overview: clean up (#1733)
* Lots of minor grammar, clarity, brevity or other style improvements.
* Make all links consistently point to stable documents rather than a mix of stable and master.
* Make certain link targets relative instead of absolute.

Co-authored-by: Dmytro Dzhulgakov <dzhulgakov@users.noreply.github.com>
Co-authored-by: Holly Sweeney <77758406+holly1238@users.noreply.github.com>
1 parent 18905c9 commit bbcf868

File tree: 1 file changed (+71, -78)

beginner_source/dist_overview.rst

Lines changed: 71 additions & 78 deletions
@@ -3,11 +3,8 @@ PyTorch Distributed Overview
**Author**: `Shen Li <https://mrshenli.github.io/>`_


-This is the overview page for the ``torch.distributed`` package. As there are
-more and more documents, examples and tutorials added at different locations,
-it becomes unclear which document or tutorial to consult for a specific problem
-or what is the best order to read these contents. The goal of this page is to
-address this problem by categorizing documents into different topics and briefly
+This is the overview page for the ``torch.distributed`` package. The goal of
+this page is to categorize documents into different topics and briefly
describe each of them. If this is your first time building distributed training
applications using PyTorch, it is recommended to use this document to navigate
to the technology that can best serve your use case.

@@ -19,108 +16,106 @@ Introduction
As of PyTorch v1.6.0, features in ``torch.distributed`` can be categorized into
three main components:

-* `Distributed Data-Parallel Training <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__
+* `Distributed Data-Parallel Training <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
(DDP) is a widely adopted single-program multiple-data training paradigm. With
DDP, the model is replicated on every process, and every model replica will be
fed with a different set of input data samples. DDP takes care of gradient
-communications to keep model replicas synchronized and overlaps it with the
+communication to keep model replicas synchronized and overlaps it with the
gradient computations to speed up training.
-* `RPC-Based Distributed Training <https://pytorch.org/docs/master/rpc.html>`__
-(RPC) is developed to support general training structures that cannot fit into
-data-parallel training, such as distributed pipeline parallelism, parameter
-server paradigm, and combination of DDP with other training paradigms. It
-helps manage remote object lifetime and extend autograd engine to beyond
+* `RPC-Based Distributed Training <https://pytorch.org/docs/stable/rpc.html>`__
+(RPC) supports general training structures that cannot fit into
+data-parallel training such as distributed pipeline parallelism, parameter
+server paradigm, and combinations of DDP with other training paradigms. It
+helps manage remote object lifetime and extends the
+`autograd engine <https://pytorch.org/docs/stable/autograd.html>`__ beyond
machine boundaries.
* `Collective Communication <https://pytorch.org/docs/stable/distributed.html>`__
-(c10d) library support sending tensors across processes within a group. It
+(c10d) library supports sending tensors across processes within a group. It
offers both collective communication APIs (e.g.,
`all_reduce <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`__
and `all_gather <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather>`__)
and P2P communication APIs (e.g.,
`send <https://pytorch.org/docs/stable/distributed.html#torch.distributed.send>`__
and `isend <https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend>`__).
-DDP and RPC (`ProcessGroup Backend <https://pytorch.org/docs/master/rpc.html#process-group-backend>`__)
-are built on c10d as of v1.6.0, where the former uses collective communications
+DDP and RPC (`ProcessGroup Backend <https://pytorch.org/docs/stable/rpc.html#process-group-backend>`__)
+are built on c10d, where the former uses collective communications
and the latter uses P2P communications. Usually, developers do not need to
-directly use this raw communication API, as DDP and RPC features above can serve
+directly use this raw communication API, as the DDP and RPC APIs can serve
many distributed training scenarios. However, there are use cases where this API
is still helpful. One example would be distributed parameter averaging, where
applications would like to compute the average values of all model parameters
after the backward pass instead of using DDP to communicate gradients. This can
decouple communications from computations and allow finer-grain control over
what to communicate, but on the other hand, it also gives up the performance
optimizations offered by DDP. The
-`Writing Distributed Applications with PyTorch <https://pytorch.org/tutorials/intermediate/dist_tuto.html>`__
+`Writing Distributed Applications with PyTorch <../intermediate/dist_tuto.html>`__
shows examples of using c10d communication APIs.
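
For context, the distributed parameter averaging use case described in the paragraph above can be sketched roughly as follows. This is an illustrative aside, not part of the commit; it assumes the process group has already been initialized (e.g., by a launcher).

```python
import torch
import torch.distributed as dist


def average_parameters(model: torch.nn.Module) -> None:
    """Average every model parameter across all ranks after the backward pass."""
    # Assumes dist.init_process_group(...) has already been called on every rank.
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)  # sum the parameter across ranks
        param.data /= world_size                           # then average in place
```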


-Most of the existing documents are written for either DDP or RPC, the remainder
-of this page will elaborate materials for these two components.
-
-
Data Parallel Training
----------------------

PyTorch provides several options for data-parallel training. For applications
that gradually grow from simple to complex and from prototype to production, the
common development trajectory would be:

-1. Use single-device training, if the data and model can fit in one GPU, and the
+1. Use single-device training if the data and model can fit in one GPU, and
training speed is not a concern.
2. Use single-machine multi-GPU
-`DataParallel <https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html>`__,
-if there are multiple GPUs on the server, and you would like to speed up
-training with the minimum code change.
+`DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
+to make use of multiple GPUs on a single machine to speed up training with
+minimal code changes.
3. Use single-machine multi-GPU
-`DistributedDataParallel <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
+`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
if you would like to further speed up training and are willing to write a
little more code to set it up.
-4. Use multi-machine `DistributedDataParallel <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__
+4. Use multi-machine `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
and the `launching script <https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md>`__,
if the application needs to scale across machine boundaries.
-5. Use `torchelastic <https://pytorch.org/elastic>`__ to launch distributed
-training, if errors (e.g., OOM) are expected or if the resources can join and
-leave dynamically during the training.
+5. Use `torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
+to launch distributed training if errors (e.g., out-of-memory) are expected or if
+resources can join and leave dynamically during training.


-.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/master/notes/amp_examples.html#working-with-multiple-gpus>`__.
+.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.


``torch.nn.DataParallel``
~~~~~~~~~~~~~~~~~~~~~~~~~

-The `DataParallel <https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html>`__
+The `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
package enables single-machine multi-GPU parallelism with the lowest coding
hurdle. It only requires a one-line change to the application code. The tutorial
-`Optional: Data Parallelism <https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html>`__
-shows an example. The caveat is that, although ``DataParallel`` is very easy to
-use, it usually does not offer the best performance. This is because the
-implementation of ``DataParallel`` replicates the model in every forward pass,
-and its single-process multi-thread parallelism naturally suffers from GIL
-contentions. To get better performance, please consider using
-`DistributedDataParallel <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__.
+`Optional: Data Parallelism <../beginner/blitz/data_parallel_tutorial.html>`__
+shows an example. Although ``DataParallel`` is very easy to
+use, it usually does not offer the best performance because it replicates the
+model in every forward pass, and its single-process multi-thread parallelism
+naturally suffers from
+`GIL <https://wiki.python.org/moin/GlobalInterpreterLock>`__ contention. To get
+better performance, consider using
+`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__.
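
For context, the "one-line change" mentioned above is wrapping the model in ``nn.DataParallel``. The sketch below is illustrative only; the model, sizes, and device handling are placeholders, not code from the tutorial.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)                      # placeholder model
if torch.cuda.device_count() > 1:
    # The one-line change: replicate the module across visible GPUs and
    # split each input batch among them during the forward pass.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
output = model(torch.randn(64, 512).to(device))
print(output.shape)
```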


``torch.nn.parallel.DistributedDataParallel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Compared to `DataParallel <https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html>`__,
-`DistributedDataParallel <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__
+Compared to `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__,
+`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
requires one more step to set up, i.e., calling
`init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
DDP uses multi-process parallelism, and hence there is no GIL contention across
model replicas. Moreover, the model is broadcast at DDP construction time instead
of in every forward pass, which also helps to speed up training. DDP is shipped
with several performance optimization technologies. For a more in-depth
-explanation, please refer to this
-`DDP paper <http://www.vldb.org/pvldb/vol13/p3005-li.pdf>`__ (VLDB'20).
+explanation, refer to this
+`paper <http://www.vldb.org/pvldb/vol13/p3005-li.pdf>`__ (VLDB'20).
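
For context, a minimal sketch of the extra setup step and one DDP training step follows. It is illustrative only: the ``gloo`` backend, hard-coded address and port, single-process world size, and toy model are assumptions made so the snippet can run standalone; in a real job the launching script supplies the rank and world size per process.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hard-coded only for a standalone, single-process run; a launcher normally
# sets MASTER_ADDR/MASTER_PORT and the per-process RANK/WORLD_SIZE.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # the extra setup step

model = nn.Linear(10, 10)        # placeholder model
ddp_model = DDP(model)           # model states are broadcast once, at construction

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss = ddp_model(torch.randn(20, 10)).sum()
loss.backward()                  # gradient all_reduce overlaps with the backward pass
optimizer.step()

dist.destroy_process_group()
```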


DDP materials are listed below:

1. `DDP notes <https://pytorch.org/docs/stable/notes/ddp.html>`__
offer a starter example and some brief descriptions of its design and
-implementation. If this is your first time using DDP, please start from this
+implementation. If this is your first time using DDP, start from this
document.
2. `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__
explains some common problems with DDP training, including unbalanced

@@ -129,54 +124,52 @@ DDP materials are listed below:
described in the
`Single-Machine Model Parallel Best Practices <../intermediate/model_parallel_tutorial.html>`__
tutorial.
-3. The `Launching and configuring distributed data parallel applications <https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md>`__
+3. The `Launching and configuring distributed data parallel applications <https://github.com/pytorch/examples/blob/stable/distributed/ddp/README.md>`__
document shows how to use the DDP launching script.
-4. The `Shard Optimizer States With ZeroRedundancyOptimizer <https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html>`__
-recipe demonstrates how `ZeroRedundancyOptimizer <https://pytorch.org/docs/master/distributed.optim.html>`__
-helps to reduce optimizer memory footprint for distributed data-parallel
-training.
-5. The `Distributed Training with Uneven Inputs Using the Join Context Manager <https://pytorch.org/tutorials/advanced/generic_oin.html>`__
+4. The `Shard Optimizer States With ZeroRedundancyOptimizer <../recipes/zero_redundancy_optimizer.html>`__
+recipe demonstrates how `ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html>`__
+helps to reduce optimizer memory footprint.
+5. The `Distributed Training with Uneven Inputs Using the Join Context Manager <../advanced/generic_oin.html>`__
tutorial walks through using the generic join context for distributed training with uneven inputs.

-TorchElastic
-~~~~~~~~~~~~
+torch.distributed.elastic
+~~~~~~~~~~~~~~~~~~~~~~~~~

With the growth of the application complexity and scale, failure recovery
-becomes an imperative requirement. Sometimes, it is inevitable to hit errors
-like OOM when using DDP, but DDP itself cannot recover from those errors nor
-does basic ``try-except`` block work. This is because DDP requires all processes
-to operate in a closely synchronized manner and all ``AllReduce`` communications
-launched in different processes must match. If one of the processes in the group
-throws an OOM exception, it is likely to lead to desynchronization (mismatched
-``AllReduce`` operations) which would then cause a crash or hang. If you expect
-failures to occur during training or if resources might leave and join
-dynamically, please launch distributed data-parallel training using
-`torchelastic <https://pytorch.org/elastic>`__.
-
-
-General Distributed Training
+becomes a requirement. Sometimes it is inevitable to hit errors
+like out-of-memory (OOM) when using DDP, but DDP itself cannot recover from those errors,
+and it is not possible to handle them using a standard ``try-except`` construct.
+This is because DDP requires all processes to operate in a closely synchronized manner
+and all ``AllReduce`` communications launched in different processes must match.
+If one of the processes in the group
+throws an exception, it is likely to lead to desynchronization (mismatched
+``AllReduce`` operations) which would then cause a crash or hang.
+`torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
+adds fault tolerance and the ability to make use of a dynamic pool of machines (elasticity).
+
+RPC-Based Distributed Training
----------------------------

Many training paradigms do not fit into data parallelism, e.g.,
parameter server paradigm, distributed pipeline parallelism, reinforcement
-learning applications with multiple observers or agents, etc. The
-`torch.distributed.rpc <https://pytorch.org/docs/master/rpc.html>`__ aims at
+learning applications with multiple observers or agents, etc.
+`torch.distributed.rpc <https://pytorch.org/docs/stable/rpc.html>`__ aims at
supporting general distributed training scenarios.

-The `torch.distributed.rpc <https://pytorch.org/docs/master/rpc.html>`__ package
+`torch.distributed.rpc <https://pytorch.org/docs/stable/rpc.html>`__
has four main pillars:

-* `RPC <https://pytorch.org/docs/master/rpc.html#rpc>`__ supports running
+* `RPC <https://pytorch.org/docs/stable/rpc.html#rpc>`__ supports running
a given function on a remote worker.
-* `RRef <https://pytorch.org/docs/master/rpc.html#rref>`__ helps to manage the
+* `RRef <https://pytorch.org/docs/stable/rpc.html#rref>`__ helps to manage the
lifetime of a remote object. The reference counting protocol is presented in the
-`RRef notes <https://pytorch.org/docs/master/rpc/rref.html#remote-reference-protocol>`__.
-* `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__
+`RRef notes <https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol>`__.
+* `Distributed Autograd <https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework>`__
extends the autograd engine beyond machine boundaries. Please refer to
-`Distributed Autograd Design <https://pytorch.org/docs/master/rpc/distributed_autograd.html#distributed-autograd-design>`__
+`Distributed Autograd Design <https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design>`__
for more details.
-* `Distributed Optimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__
-that automatically reaches out to all participating workers to update
+* `Distributed Optimizer <https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim>`__
+automatically reaches out to all participating workers to update
parameters using gradients computed by the distributed autograd engine.
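
For context, a rough sketch touching the four pillars follows. It is illustrative only: the single-process world size, the worker name ``trainer``, and the toy tensors are assumptions made so the snippet can run standalone (RPC targets the local worker); a real job would target peer workers.

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

# Single-process setup so the sketch is self-contained; each real worker would
# call init_rpc with its own name/rank and a shared world_size.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
rpc.init_rpc("trainer", rank=0, world_size=1)

t = torch.ones(2, requires_grad=True)

# RRef: a reference to a value owned by a (possibly remote) worker.
rref = rpc.remote("trainer", torch.add, args=(torch.ones(2), 1.0))

with dist_autograd.context() as context_id:
    # RPC: run a function on a worker and wait for the result.
    ret = rpc.rpc_sync("trainer", torch.add, args=(t, t))
    loss = (ret + rref.to_here()).sum()
    # Distributed Autograd: run backward across workers within this context.
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)  # gradients keyed by tensor
    # The Distributed Optimizer (fourth pillar) would take RRefs to remote
    # parameters and apply these gradients; it is omitted here for brevity.

rpc.shutdown()
```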

RPC Tutorials are listed below:

@@ -196,9 +189,9 @@ RPC Tutorials are listed below:
to a distributed environment and shows how to implement it using RPC.
4. The `Implementing Batch RPC Processing Using Asynchronous Executions <../intermediate/rpc_async_execution.html>`__
tutorial demonstrates how to implement RPC batch processing using the
-`@rpc.functions.async_execution <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.functions.async_execution>`__
-decorator, which can help speed up inference and training. It uses similar
-RL and PS examples employed in the above tutorials 1 and 2.
+`@rpc.functions.async_execution <https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution>`__
+decorator, which can help speed up inference and training. It uses
+RL and PS examples similar to those in the above tutorials 1 and 2.
5. The `Combining Distributed DataParallel with Distributed RPC Framework <../advanced/rpc_ddp_tutorial.html>`__
tutorial demonstrates how to combine DDP with RPC to train a model using
distributed data parallelism combined with distributed model parallelism.
