
Commit 506b16d

Merge branch 'main' into 2905-dep-loading-data-redirect
2 parents cc5e8ec + 7a44230

File tree: 2 files changed (+65, -186 lines)

beginner_source/dist_overview.rst

Lines changed: 59 additions & 186 deletions
@@ -1,6 +1,6 @@
 PyTorch Distributed Overview
 ============================
-**Author**: `Shen Li <https://mrshenli.github.io/>`_
+**Author**: `Will Constable <https://github.com/wconstab/>`_
 
 .. note::
    |edit| View and edit this tutorial in `github <https://github.com/pytorch/tutorials/blob/main/beginner_source/dist_overview.rst>`__.
@@ -15,203 +15,76 @@ to the technology that can best serve your use case.
 Introduction
 ------------
 
-As of PyTorch v1.6.0, features in ``torch.distributed`` can be categorized into
-three main components:
-
-* `Distributed Data-Parallel Training <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
-  (DDP) is a widely adopted single-program multiple-data training paradigm. With
-  DDP, the model is replicated on every process, and every model replica will be
-  fed with a different set of input data samples. DDP takes care of gradient
-  communication to keep model replicas synchronized and overlaps it with the
-  gradient computations to speed up training.
-* `RPC-Based Distributed Training <https://pytorch.org/docs/stable/rpc.html>`__
-  (RPC) supports general training structures that cannot fit into
-  data-parallel training such as distributed pipeline parallelism, parameter
-  server paradigm, and combinations of DDP with other training paradigms. It
-  helps manage remote object lifetime and extends the
-  `autograd engine <https://pytorch.org/docs/stable/autograd.html>`__ beyond
-  machine boundaries.
-* `Collective Communication <https://pytorch.org/docs/stable/distributed.html>`__
-  (c10d) library supports sending tensors across processes within a group. It
-  offers both collective communication APIs (e.g.,
-  `all_reduce <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`__
+The PyTorch Distributed library includes a collection of parallelism modules,
+a communications layer, and infrastructure for launching and
+debugging large training jobs.
+
+
+Parallelism APIs
+****************
+
+These parallelism modules offer high-level functionality and compose with existing models:
+
+- `Distributed Data-Parallel (DDP) <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
+- `Fully Sharded Data-Parallel Training (FSDP) <https://pytorch.org/docs/stable/fsdp.html>`__
+- `Tensor Parallel (TP) <https://pytorch.org/docs/stable/distributed.tensor.parallel.html>`__
+- `Pipeline Parallel (PP) <https://pytorch.org/docs/main/distributed.pipelining.html>`__
+
+Sharding primitives
+*******************
+
+``DTensor`` and ``DeviceMesh`` are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.
+
+- `DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md>`__ represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
+- `DeviceMesh <https://pytorch.org/docs/stable/distributed.html#devicemesh>`__ abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ``ProcessGroup`` instances for collective communications in multi-dimensional parallelisms. Try out our `Device Mesh Recipe <https://pytorch.org/tutorials/recipes/distributed_device_mesh.html>`__ to learn more.
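To make these primitives concrete, here is a minimal sketch of building a ``DeviceMesh`` and sharding a tensor with ``distribute_tensor``. It assumes four local GPUs, a script launched via ``torchrun``, and the module paths shown below, which may differ between PyTorch releases:

.. code-block:: python

    # dtensor_sketch.py -- run with e.g. `torchrun --nproc-per-node=4 dtensor_sketch.py`
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed._tensor import Shard, distribute_tensor

    # Build a 1-D mesh over the four local GPUs; the underlying ProcessGroup
    # is created and managed for us.
    mesh = init_device_mesh("cuda", (4,))

    # Shard a tensor along dim 0; each rank materializes one 1024 x 8192 shard.
    full = torch.randn(4096, 8192)
    sharded = distribute_tensor(full, mesh, placements=[Shard(0)])

    # Operations on DTensors trigger any resharding/communication automatically.
    total = sharded.sum()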
+
+Communications APIs
+*******************
+
+The `PyTorch distributed communication layer (C10D) <https://pytorch.org/docs/stable/distributed.html>`__ offers both collective communication APIs (e.g., `all_reduce <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`__
 and `all_gather <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather>`__)
 and P2P communication APIs (e.g.,
 `send <https://pytorch.org/docs/stable/distributed.html#torch.distributed.send>`__
-and `isend <https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend>`__).
-DDP and RPC (`ProcessGroup Backend <https://pytorch.org/docs/stable/rpc.html#process-group-backend>`__)
-are built on c10d, where the former uses collective communications
-and the latter uses P2P communications. Usually, developers do not need to
-directly use this raw communication API, as the DDP and RPC APIs can serve
-many distributed training scenarios. However, there are use cases where this API
-is still helpful. One example would be distributed parameter averaging, where
-applications would like to compute the average values of all model parameters
-after the backward pass instead of using DDP to communicate gradients. This can
-decouple communications from computations and allow finer-grain control over
-what to communicate, but on the other hand, it also gives up the performance
-optimizations offered by DDP.
+and `isend <https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend>`__),
+which are used under the hood in all of the parallelism implementations.
 `Writing Distributed Applications with PyTorch <../intermediate/dist_tuto.html>`__
 shows examples of using c10d communication APIs.
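To illustrate the collective APIs mentioned above, a minimal ``all_reduce`` example is sketched below; it assumes the script is launched with ``torchrun`` so that the rank and world-size environment variables are already set:

.. code-block:: python

    # allreduce_sketch.py -- run with e.g. `torchrun --nproc-per-node=2 allreduce_sketch.py`
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="gloo")  # use "nccl" for GPU tensors
    rank = dist.get_rank()

    # Every rank contributes its own value; all_reduce sums them in place on all ranks.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: sum over all ranks = {t.item()}")

    dist.destroy_process_group()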
 
+Launcher
+********
 
-Data Parallel Training
-----------------------
-
-PyTorch provides several options for data-parallel training. For applications
-that gradually grow from simple to complex and from prototype to production, the
-common development trajectory would be:
-
-1. Use single-device training if the data and model can fit in one GPU, and
-   training speed is not a concern.
-2. Use single-machine multi-GPU
-   `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
-   to make use of multiple GPUs on a single machine to speed up training with
-   minimal code changes.
-3. Use single-machine multi-GPU
-   `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
-   if you would like to further speed up training and are willing to write a
-   little more code to set it up.
-4. Use multi-machine `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
-   and the `launching script <https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md>`__,
-   if the application needs to scale across machine boundaries.
-5. Use multi-GPU `FullyShardedDataParallel <https://pytorch.org/docs/stable/fsdp.html>`__
-   training on a single-machine or multi-machine when the data and model cannot
-   fit on one GPU.
-6. Use `torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
-   to launch distributed training if errors (e.g., out-of-memory) are expected or if
-   resources can join and leave dynamically during training.
+`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ is a widely-used launcher script, which spawns processes on the local and remote machines for running distributed PyTorch programs.
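For reference, a minimal ``torchrun``-compatible entry point is sketched below; the script name, endpoint, and process counts are only illustrative:

.. code-block:: python

    # train.py -- illustrative entry point for torchrun.
    #
    # Single node, 8 GPUs:
    #   torchrun --nproc-per-node=8 train.py
    # Two nodes (run the same command on each node):
    #   torchrun --nnodes=2 --nproc-per-node=8 \
    #       --rdzv-backend=c10d --rdzv-endpoint=<host>:29500 train.py
    import os
    import torch
    import torch.distributed as dist

    def main():
        # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
        dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
        if torch.cuda.is_available():
            torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()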
 
 
-.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.
+Applying Parallelism To Scale Your Model
+----------------------------------------
 
+Data Parallelism is a widely adopted single-program multiple-data training paradigm
+in which the model is replicated on every process, each model replica computes local gradients for
+a different set of input data samples, and gradients are averaged within the data-parallel communicator group before each optimizer step.
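A minimal sketch of this pattern with ``DistributedDataParallel`` follows; the model, data, and hyperparameters are placeholders, and the script is assumed to be launched with ``torchrun``:

.. code-block:: python

    # ddp_sketch.py -- run with e.g. `torchrun --nproc-per-node=2 ddp_sketch.py`
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="gloo")  # use "nccl" when each rank owns a GPU

    model = torch.nn.Linear(16, 4)   # placeholder model
    ddp_model = DDP(model)           # gradients are all-reduced (averaged) during backward()
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # Each rank feeds its own batch of samples; DDP keeps the replicas in sync.
    x = torch.randn(8, 16)
    loss = ddp_model(x).sum()
    loss.backward()
    opt.step()

    dist.destroy_process_group()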
 
-``torch.nn.DataParallel``
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
-package enables single-machine multi-GPU parallelism with the lowest coding
-hurdle. It only requires a one-line change to the application code. The tutorial
-`Optional: Data Parallelism <../beginner/blitz/data_parallel_tutorial.html>`__
-shows an example. Although ``DataParallel`` is very easy to
-use, it usually does not offer the best performance because it replicates the
-model in every forward pass, and its single-process multi-thread parallelism
-naturally suffers from
-`GIL <https://wiki.python.org/moin/GlobalInterpreterLock>`__ contention. To get
-better performance, consider using
-`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__.
-
-
-``torch.nn.parallel.DistributedDataParallel``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Compared to `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__,
-`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
-requires one more step to set up, i.e., calling
-`init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
-DDP uses multi-process parallelism, and hence there is no GIL contention across
-model replicas. Moreover, the model is broadcast at DDP construction time instead
-of in every forward pass, which also helps to speed up training. DDP is shipped
-with several performance optimization technologies. For a more in-depth
-explanation, refer to this
-`paper <http://www.vldb.org/pvldb/vol13/p3005-li.pdf>`__ (VLDB'20).
-
-
-DDP materials are listed below:
-
-1. `DDP notes <https://pytorch.org/docs/stable/notes/ddp.html>`__
-   offer a starter example and some brief descriptions of its design and
-   implementation. If this is your first time using DDP, start from this
-   document.
-2. `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__
-   explains some common problems with DDP training, including unbalanced
-   workload, checkpointing, and multi-device models. Note that, DDP can be
-   easily combined with single-machine multi-device model parallelism which is
-   described in the
-   `Single-Machine Model Parallel Best Practices <../intermediate/model_parallel_tutorial.html>`__
-   tutorial.
-3. The `Launching and configuring distributed data parallel applications <https://github.com/pytorch/examples/blob/main/distributed/ddp/README.md>`__
-   document shows how to use the DDP launching script.
-4. The `Shard Optimizer States With ZeroRedundancyOptimizer <../recipes/zero_redundancy_optimizer.html>`__
-   recipe demonstrates how `ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html>`__
-   helps to reduce optimizer memory footprint.
-5. The `Distributed Training with Uneven Inputs Using the Join Context Manager <../advanced/generic_join.html>`__
-   tutorial walks through using the generic join context for distributed training with uneven inputs.
-
-
-``torch.distributed.FullyShardedDataParallel``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The `FullyShardedDataParallel <https://pytorch.org/docs/stable/fsdp.html>`__
-(FSDP) is a type of data parallelism paradigm which maintains a per-GPU copy of a model’s
-parameters, gradients and optimizer states, it shards all of these states across
-data-parallel workers. The support for FSDP was added starting PyTorch v1.11. The tutorial
-`Getting Started with FSDP <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__
-provides in depth explanation and example of how FSDP works.
-
-
-torch.distributed.elastic
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-With the growth of the application complexity and scale, failure recovery
-becomes a requirement. Sometimes it is inevitable to hit errors
-like out-of-memory (OOM) when using DDP, but DDP itself cannot recover from those errors,
-and it is not possible to handle them using a standard ``try-except`` construct.
-This is because DDP requires all processes to operate in a closely synchronized manner
-and all ``AllReduce`` communications launched in different processes must match.
-If one of the processes in the group
-throws an exception, it is likely to lead to desynchronization (mismatched
-``AllReduce`` operations) which would then cause a crash or hang.
-`torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
-adds fault tolerance and the ability to make use of a dynamic pool of machines (elasticity).
-
-RPC-Based Distributed Training
-------------------------------
+Model Parallelism techniques (or Sharded Data Parallelism) are required when a model doesn't fit on a single GPU, and can be combined to form multi-dimensional (N-D) parallelism techniques.
 
-Many training paradigms do not fit into data parallelism, e.g.,
-parameter server paradigm, distributed pipeline parallelism, reinforcement
-learning applications with multiple observers or agents, etc.
-`torch.distributed.rpc <https://pytorch.org/docs/stable/rpc.html>`__ aims at
-supporting general distributed training scenarios.
-
-`torch.distributed.rpc <https://pytorch.org/docs/stable/rpc.html>`__
-has four main pillars:
-
-* `RPC <https://pytorch.org/docs/stable/rpc.html#rpc>`__ supports running
-  a given function on a remote worker.
-* `RRef <https://pytorch.org/docs/stable/rpc.html#rref>`__ helps to manage the
-  lifetime of a remote object. The reference counting protocol is presented in the
-  `RRef notes <https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol>`__.
-* `Distributed Autograd <https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework>`__
-  extends the autograd engine beyond machine boundaries. Please refer to
-  `Distributed Autograd Design <https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design>`__
-  for more details.
-* `Distributed Optimizer <https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim>`__
-  automatically reaches out to all participating workers to update
-  parameters using gradients computed by the distributed autograd engine.
-
-RPC Tutorials are listed below:
-
-1. The `Getting Started with Distributed RPC Framework <../intermediate/rpc_tutorial.html>`__
-   tutorial first uses a simple Reinforcement Learning (RL) example to
-   demonstrate RPC and RRef. Then, it applies a basic distributed model
-   parallelism to an RNN example to show how to use distributed autograd and
-   distributed optimizer.
-2. The `Implementing a Parameter Server Using Distributed RPC Framework <../intermediate/rpc_param_server_tutorial.html>`__
-   tutorial borrows the spirit of
-   `HogWild! training <https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf>`__
-   and applies it to an asynchronous parameter server (PS) training application.
-3. The `Distributed Pipeline Parallelism Using RPC <../intermediate/dist_pipeline_parallel_tutorial.html>`__
-   tutorial extends the single-machine pipeline parallel example (presented in
-   `Single-Machine Model Parallel Best Practices <../intermediate/model_parallel_tutorial.html>`__)
-   to a distributed environment and shows how to implement it using RPC.
-4. The `Implementing Batch RPC Processing Using Asynchronous Executions <../intermediate/rpc_async_execution.html>`__
-   tutorial demonstrates how to implement RPC batch processing using the
-   `@rpc.functions.async_execution <https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution>`__
-   decorator, which can help speed up inference and training. It uses
-   RL and PS examples similar to those in the above tutorials 1 and 2.
-5. The `Combining Distributed DataParallel with Distributed RPC Framework <../advanced/rpc_ddp_tutorial.html>`__
-   tutorial demonstrates how to combine DDP with RPC to train a model using
-   distributed data parallelism combined with distributed model parallelism.
+When deciding what parallelism techniques to choose for your model, use these common guidelines:
+
+#. Use `DistributedDataParallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`__,
+   if your model fits on a single GPU but you want to easily scale up training using multiple GPUs.
+
+   * Use `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ to launch multiple PyTorch processes if you are using more than one node.
+
+   * See also: `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__
+
+#. Use `FullyShardedDataParallel (FSDP) <https://pytorch.org/docs/stable/fsdp.html>`__ when your model cannot fit on one GPU.
+
+   * See also: `Getting Started with FSDP <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__
+
+#. Use `Tensor Parallel (TP) <https://pytorch.org/docs/stable/distributed.tensor.parallel.html>`__ and/or `Pipeline Parallel (PP) <https://pytorch.org/docs/main/distributed.pipelining.html>`__ if you reach scaling limitations with FSDP.
+
+   * Try our `Tensor Parallelism Tutorial <https://pytorch.org/tutorials/intermediate/TP_tutorial.html>`__
+
+   * See also: `TorchTitan end to end example of 3D parallelism <https://github.com/pytorch/torchtitan>`__
+
+.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.
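As a companion to guideline 2 above, a minimal FSDP sketch is shown below; the model is a placeholder and the wrapping policy is omitted, so treat it as a starting point rather than a tuned configuration (the linked FSDP and TP tutorials show complete examples):

.. code-block:: python

    # fsdp_sketch.py -- run with e.g. `torchrun --nproc-per-node=2 fsdp_sketch.py`
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # FSDP shards parameters, gradients, and optimizer states across ranks,
    # gathering parameters on demand during forward/backward.
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).cuda()
    fsdp_model = FSDP(model)
    opt = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    fsdp_model(x).sum().backward()
    opt.step()

    dist.destroy_process_group()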
 
 
 PyTorch Distributed Developers

en-wordlist.txt

Lines changed: 6 additions & 0 deletions
@@ -335,6 +335,7 @@ dataset’s
 deallocation
 decompositions
 decorrelated
+devicemesh
 deserialize
 deserialized
 desynchronization
@@ -346,6 +347,7 @@ distractor
 downsample
 downsamples
 dropdown
+dtensor
 duration
 elementwise
 embeddings
@@ -482,6 +484,7 @@ prespecified
 pretrained
 prewritten
 primals
+processgroup
 profiler
 profilers
 protobuf
@@ -503,6 +506,7 @@ relu
 reproducibility
 rescale
 rescaling
+reshard
 resnet
 restride
 rewinded
@@ -515,6 +519,8 @@ runtime
 runtime
 runtimes
 scalable
+sharded
+Sharding
 softmax
 sparsified
 sparsifier
