``DTensor`` and ``DeviceMesh`` are primitives used to build parallelism in terms of sharded or replicated tensors on N-dimensional process groups.

- `DTensor <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md>`__ represents a tensor that is sharded and/or replicated, and communicates automatically to reshard tensors as needed by operations.
- `DeviceMesh <https://pytorch.org/docs/stable/distributed.html#devicemesh>`__ abstracts the accelerator device communicators into a multi-dimensional array, which manages the underlying ``ProcessGroup`` instances for collective communications in multi-dimensional parallelisms. Try out our `Device Mesh Recipe <https://pytorch.org/tutorials/recipes/distributed_device_mesh.html>`__ to learn more. A short sketch of both primitives follows this list.
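
A minimal sketch of how the two primitives fit together, assuming a recent PyTorch release (where DTensor is exposed as ``torch.distributed.tensor``; older releases use ``torch.distributed._tensor``), a 4-GPU host, and a script launched with ``torchrun``:

.. code-block:: python

    # Launch with: torchrun --nproc_per_node=4 dtensor_sketch.py
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import Shard, distribute_tensor

    # Under torchrun, init_device_mesh() also sets up the default process group,
    # so no explicit init_process_group() call is needed here.
    mesh = init_device_mesh("cuda", (4,))  # 1-D mesh; N-D meshes work the same way

    # Shard a tensor along dim 0 across the mesh: each rank stores one shard, and
    # DTensor inserts the collectives needed to reshard as operations require.
    weight = torch.randn(1024, 1024)
    dweight = distribute_tensor(weight, mesh, placements=[Shard(0)])
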
Communications APIs
*******************

The `PyTorch distributed communication layer (C10D) <https://pytorch.org/docs/stable/distributed.html>`__ offers both collective communication APIs (e.g., `all_reduce <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`__
and `all_gather <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather>`__)
and P2P communication APIs (e.g., `send <https://pytorch.org/docs/stable/distributed.html#torch.distributed.send>`__ and `isend <https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend>`__), which are used under the hood in all of the parallelism implementations mentioned above.
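
For example, a minimal ``all_reduce`` call on the default process group (a sketch that assumes the process group has already been initialized on every rank, e.g., by a ``torchrun``-launched script as shown below):

.. code-block:: python

    import torch
    import torch.distributed as dist

    # Assumes dist.init_process_group() has already been called on every rank.
    # (With the NCCL backend the tensor must live on this rank's GPU.)
    rank = dist.get_rank()
    t = torch.ones(2) * rank                   # each rank contributes its own values
    dist.all_reduce(t, op=dist.ReduceOp.SUM)   # in-place sum across all ranks
    # After the call, every rank holds the same reduced tensor.
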
`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ is a widely used launcher script that spawns processes on the local and remote machines for running distributed PyTorch programs. It is the command-line entry point for `torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__, which adds fault tolerance and the ability to make use of a dynamic pool of machines (elasticity).
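
On the script side, a program launched with ``torchrun`` can rely on the environment variables the launcher sets for every process it spawns (the script name and flag values below are illustrative):

.. code-block:: python

    # Launch with: torchrun --nproc_per_node=4 train.py
    import os
    import torch
    import torch.distributed as dist

    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")  # reads rank/world size from the env

    print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
    dist.destroy_process_group()
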
Applying Parallelism To Scale Your Model
----------------------------------------

Data Parallelism is a widely adopted single-program multiple-data training paradigm
where the model is replicated on every process, each model replica computes local gradients for
a different set of input data samples, and gradients are averaged within the data-parallel communicator group before each optimizer step.
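
A minimal sketch of this pattern with ``DistributedDataParallel`` (the model, data, and ``torchrun`` flags are placeholders):

.. code-block:: python

    # Launch with: torchrun --nproc_per_node=4 ddp_sketch.py
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda()            # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])   # one replica per process
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(32, 10, device="cuda")       # each rank loads different samples
    loss = ddp_model(inputs).sum()
    loss.backward()   # gradients are averaged across ranks during backward
    opt.step()

    dist.destroy_process_group()
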
Model Parallelism techniques (or Sharded Data Parallelism) are required when a model does not fit on a single GPU, and they can be combined to form multi-dimensional (N-D) parallelism.

When deciding what parallelism techniques to choose for your model, use these common guidelines:

#. Use `DistributedDataParallel (DDP) <https://pytorch.org/docs/stable/notes/ddp.html>`__
   if your model fits in a single GPU but you want to easily scale up training using multiple GPUs.

   * Use `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ to launch multiple PyTorch processes if you are using more than one node.

   * See also: `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__

#. Use `FullyShardedDataParallel (FSDP) <https://pytorch.org/docs/stable/fsdp.html>`__ when your model cannot fit on one GPU (see the FSDP sketch after this list).

   * See also: `Getting Started with FSDP <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__

#. Use `Tensor Parallel (TP) <https://pytorch.org/docs/stable/distributed.tensor.parallel.html>`__ and/or `Pipeline Parallel (PP) <https://pytorch.org/docs/main/distributed.pipelining.html>`__ if you reach scaling limitations with FSDP (see the Tensor Parallel sketch after this list).

   * See also: `TorchTitan end-to-end example of 3D parallelism <https://github.com/pytorch/torchtitan>`__
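
A minimal FSDP sketch for the second guideline (the model and sizes are placeholders; real models would typically also pass an ``auto_wrap_policy`` so that individual submodules are sharded separately):

.. code-block:: python

    # Launch with: torchrun --nproc_per_node=8 fsdp_sketch.py
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                      # placeholder model
        torch.nn.Linear(4096, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # Parameters, gradients, and optimizer state are sharded across ranks and
    # gathered on the fly for each forward/backward pass.
    sharded = FSDP(model, device_id=local_rank)
    opt = torch.optim.Adam(sharded.parameters(), lr=1e-4)

    out = sharded(torch.randn(8, 4096, device="cuda"))
    out.sum().backward()
    opt.step()

    dist.destroy_process_group()
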
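And a minimal Tensor Parallel sketch for the third guideline (the ``FeedForward`` module, sizes, and mesh shape are illustrative):

.. code-block:: python

    # Launch with: torchrun --nproc_per_node=4 tp_sketch.py
    import os
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    tp_mesh = init_device_mesh("cuda", (4,))  # 1-D mesh used as the TP dimension

    class FeedForward(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.w_in = torch.nn.Linear(1024, 4096)
            self.w_out = torch.nn.Linear(4096, 1024)

        def forward(self, x):
            return self.w_out(torch.relu(self.w_in(x)))

    block = FeedForward().cuda()
    # Column-shard the first projection and row-shard the second so the pair
    # needs only one all-reduce in the forward pass.
    parallelize_module(block, tp_mesh, {"w_in": ColwiseParallel(), "w_out": RowwiseParallel()})

    out = block(torch.randn(8, 1024, device="cuda"))  # inputs are treated as replicated
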
.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.