
Commit bbcf868

Authored by garymm, dzhulgakov, and holly1238
Distributed Overview: clean up (#1733)
* Lots of minor grammar, clarity, brevity or other style improvements.
* Make all links consistently point to stable documents rather than a mix of stable and master.
* Make certain link targets relative instead of absolute.

Co-authored-by: Dmytro Dzhulgakov <dzhulgakov@users.noreply.github.com>
Co-authored-by: Holly Sweeney <77758406+holly1238@users.noreply.github.com>
1 parent 18905c9 commit bbcf868

File tree: 1 file changed (+71, -78)

beginner_source/dist_overview.rst

Lines changed: 71 additions & 78 deletions
@@ -3,11 +3,8 @@ PyTorch Distributed Overview
**Author**: `Shen Li <https://mrshenli.github.io/>`_


-This is the overview page for the ``torch.distributed`` package. As there are
-more and more documents, examples and tutorials added at different locations,
-it becomes unclear which document or tutorial to consult for a specific problem
-or what is the best order to read these contents. The goal of this page is to
-address this problem by categorizing documents into different topics and briefly
+This is the overview page for the ``torch.distributed`` package. The goal of
+this page is to categorize documents into different topics and briefly
describe each of them. If this is your first time building distributed training
applications using PyTorch, it is recommended to use this document to navigate
to the technology that can best serve your use case.

@@ -19,108 +16,106 @@ Introduction
As of PyTorch v1.6.0, features in ``torch.distributed`` can be categorized into
three main components:

-* `Distributed Data-Parallel Training <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__
+* `Distributed Data-Parallel Training <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
(DDP) is a widely adopted single-program multiple-data training paradigm. With
DDP, the model is replicated on every process, and every model replica will be
fed with a different set of input data samples. DDP takes care of gradient
-communications to keep model replicas synchronized and overlaps it with the
+communication to keep model replicas synchronized and overlaps it with the
gradient computations to speed up training.
-* `RPC-Based Distributed Training <https://pytorch.org/docs/master/rpc.html>`__
-(RPC) is developed to support general training structures that cannot fit into
-data-parallel training, such as distributed pipeline parallelism, parameter
-server paradigm, and combination of DDP with other training paradigms. It
-helps manage remote object lifetime and extend autograd engine to beyond
+* `RPC-Based Distributed Training <https://pytorch.org/docs/stable/rpc.html>`__
+(RPC) supports general training structures that cannot fit into
+data-parallel training such as distributed pipeline parallelism, parameter
+server paradigm, and combinations of DDP with other training paradigms. It
+helps manage remote object lifetime and extends the
+`autograd engine <https://pytorch.org/docs/stable/autograd.html>`__ beyond
machine boundaries.
* `Collective Communication <https://pytorch.org/docs/stable/distributed.html>`__
-(c10d) library support sending tensors across processes within a group. It
+(c10d) library supports sending tensors across processes within a group. It
offers both collective communication APIs (e.g.,
`all_reduce <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce>`__
and `all_gather <https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_gather>`__)
and P2P communication APIs (e.g.,
`send <https://pytorch.org/docs/stable/distributed.html#torch.distributed.send>`__
and `isend <https://pytorch.org/docs/stable/distributed.html#torch.distributed.isend>`__).
-DDP and RPC (`ProcessGroup Backend <https://pytorch.org/docs/master/rpc.html#process-group-backend>`__)
-are built on c10d as of v1.6.0, where the former uses collective communications
+DDP and RPC (`ProcessGroup Backend <https://pytorch.org/docs/stable/rpc.html#process-group-backend>`__)
+are built on c10d, where the former uses collective communications
and the latter uses P2P communications. Usually, developers do not need to
-directly use this raw communication API, as DDP and RPC features above can serve
+directly use this raw communication API, as the DDP and RPC APIs can serve
many distributed training scenarios. However, there are use cases where this API
is still helpful. One example would be distributed parameter averaging, where
applications would like to compute the average values of all model parameters
after the backward pass instead of using DDP to communicate gradients. This can
decouple communications from computations and allow finer-grain control over
what to communicate, but on the other hand, it also gives up the performance
optimizations offered by DDP. The
-`Writing Distributed Applications with PyTorch <https://pytorch.org/tutorials/intermediate/dist_tuto.html>`__
+`Writing Distributed Applications with PyTorch <../intermediate/dist_tuto.html>`__
shows examples of using c10d communication APIs.
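
For context, the distributed parameter averaging use case described in the paragraph above can be sketched roughly as follows. This is an illustrative aside, not part of the commit; it assumes the process group has already been initialized (e.g., by a launcher).

```python
import torch
import torch.distributed as dist


def average_parameters(model: torch.nn.Module) -> None:
    """Average every model parameter across all ranks after the backward pass."""
    # Assumes dist.init_process_group(...) has already been called on every rank.
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)  # sum the parameter across ranks
        param.data /= world_size                           # then average in place
```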


-Most of the existing documents are written for either DDP or RPC, the remainder
-of this page will elaborate materials for these two components.
-
-
Data Parallel Training
----------------------

PyTorch provides several options for data-parallel training. For applications
that gradually grow from simple to complex and from prototype to production, the
common development trajectory would be:

-1. Use single-device training, if the data and model can fit in one GPU, and the
+1. Use single-device training if the data and model can fit in one GPU, and
training speed is not a concern.
2. Use single-machine multi-GPU
-`DataParallel <https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html>`__,
-if there are multiple GPUs on the server, and you would like to speed up
-training with the minimum code change.
+`DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
+to make use of multiple GPUs on a single machine to speed up training with
+minimal code changes.
3. Use single-machine multi-GPU
-`DistributedDataParallel <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
+`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__,
if you would like to further speed up training and are willing to write a
little more code to set it up.
-4. Use multi-machine `DistributedDataParallel <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__
+4. Use multi-machine `DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
and the `launching script <https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md>`__,
if the application needs to scale across machine boundaries.
-5. Use `torchelastic <https://pytorch.org/elastic>`__ to launch distributed
-training, if errors (e.g., OOM) are expected or if the resources can join and
-leave dynamically during the training.
+5. Use `torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
+to launch distributed training if errors (e.g., out-of-memory) are expected or if
+resources can join and leave dynamically during training.


-.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/master/notes/amp_examples.html#working-with-multiple-gpus>`__.
+.. note:: Data-parallel training also works with `Automatic Mixed Precision (AMP) <https://pytorch.org/docs/stable/notes/amp_examples.html#working-with-multiple-gpus>`__.


``torch.nn.DataParallel``
~~~~~~~~~~~~~~~~~~~~~~~~~

-The `DataParallel <https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html>`__
+The `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__
package enables single-machine multi-GPU parallelism with the lowest coding
hurdle. It only requires a one-line change to the application code. The tutorial
-`Optional: Data Parallelism <https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html>`__
-shows an example. The caveat is that, although ``DataParallel`` is very easy to
-use, it usually does not offer the best performance. This is because the
-implementation of ``DataParallel`` replicates the model in every forward pass,
-and its single-process multi-thread parallelism naturally suffers from GIL
-contentions. To get better performance, please consider using
-`DistributedDataParallel <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__.
+`Optional: Data Parallelism <../beginner/blitz/data_parallel_tutorial.html>`__
+shows an example. Although ``DataParallel`` is very easy to
+use, it usually does not offer the best performance because it replicates the
+model in every forward pass, and its single-process multi-thread parallelism
+naturally suffers from
+`GIL <https://wiki.python.org/moin/GlobalInterpreterLock>`__ contention. To get
+better performance, consider using
+`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__.
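
For context, the "one-line change" mentioned above is wrapping the model in ``nn.DataParallel``. The sketch below is illustrative only; the model, sizes, and device handling are placeholders, not code from the tutorial.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 10)                      # placeholder model
if torch.cuda.device_count() > 1:
    # The one-line change: replicate the module across visible GPUs and
    # split each input batch among them during the forward pass.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
output = model(torch.randn(64, 512).to(device))
print(output.shape)
```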


``torch.nn.parallel.DistributedDataParallel``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Compared to `DataParallel <https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html>`__,
-`DistributedDataParallel <https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html>`__
+Compared to `DataParallel <https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html>`__,
+`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`__
requires one more step to set up, i.e., calling
`init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
DDP uses multi-process parallelism, and hence there is no GIL contention across
model replicas. Moreover, the model is broadcast at DDP construction time instead
of in every forward pass, which also helps to speed up training. DDP is shipped
with several performance optimization technologies. For a more in-depth
-explanation, please refer to this
-`DDP paper <http://www.vldb.org/pvldb/vol13/p3005-li.pdf>`__ (VLDB'20).
+explanation, refer to this
+`paper <http://www.vldb.org/pvldb/vol13/p3005-li.pdf>`__ (VLDB'20).
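
For context, a minimal sketch of the extra setup step and one DDP training step follows. It is illustrative only: the ``gloo`` backend, hard-coded address and port, single-process world size, and toy model are assumptions made so the snippet can run standalone; in a real job the launching script supplies the rank and world size per process.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hard-coded only for a standalone, single-process run; a launcher normally
# sets MASTER_ADDR/MASTER_PORT and the per-process RANK/WORLD_SIZE.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)  # the extra setup step

model = nn.Linear(10, 10)        # placeholder model
ddp_model = DDP(model)           # model states are broadcast once, at construction

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
loss = ddp_model(torch.randn(20, 10)).sum()
loss.backward()                  # gradient all_reduce overlaps with the backward pass
optimizer.step()

dist.destroy_process_group()
```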


DDP materials are listed below:

1. `DDP notes <https://pytorch.org/docs/stable/notes/ddp.html>`__
offer a starter example and some brief descriptions of its design and
-implementation. If this is your first time using DDP, please start from this
+implementation. If this is your first time using DDP, start from this
document.
2. `Getting Started with Distributed Data Parallel <../intermediate/ddp_tutorial.html>`__
explains some common problems with DDP training, including unbalanced

@@ -129,54 +124,52 @@ DDP materials are listed below:
described in the
`Single-Machine Model Parallel Best Practices <../intermediate/model_parallel_tutorial.html>`__
tutorial.
-3. The `Launching and configuring distributed data parallel applications <https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md>`__
+3. The `Launching and configuring distributed data parallel applications <https://github.com/pytorch/examples/blob/stable/distributed/ddp/README.md>`__
document shows how to use the DDP launching script.
-4. The `Shard Optimizer States With ZeroRedundancyOptimizer <https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html>`__
-recipe demonstrates how `ZeroRedundancyOptimizer <https://pytorch.org/docs/master/distributed.optim.html>`__
-helps to reduce optimizer memory footprint for distributed data-parallel
-training.
-5. The `Distributed Training with Uneven Inputs Using the Join Context Manager <https://pytorch.org/tutorials/advanced/generic_oin.html>`__
+4. The `Shard Optimizer States With ZeroRedundancyOptimizer <../recipes/zero_redundancy_optimizer.html>`__
+recipe demonstrates how `ZeroRedundancyOptimizer <https://pytorch.org/docs/stable/distributed.optim.html>`__
+helps to reduce optimizer memory footprint.
+5. The `Distributed Training with Uneven Inputs Using the Join Context Manager <../advanced/generic_oin.html>`__
tutorial walks through using the generic join context for distributed training with uneven inputs.

-TorchElastic
-~~~~~~~~~~~~
+torch.distributed.elastic
+~~~~~~~~~~~~~~~~~~~~~~~~~

With the growth of the application complexity and scale, failure recovery
-becomes an imperative requirement. Sometimes, it is inevitable to hit errors
-like OOM when using DDP, but DDP itself cannot recover from those errors nor
-does basic ``try-except`` block work. This is because DDP requires all processes
-to operate in a closely synchronized manner and all ``AllReduce`` communications
-launched in different processes must match. If one of the processes in the group
-throws an OOM exception, it is likely to lead to desynchronization (mismatched
-``AllReduce`` operations) which would then cause a crash or hang. If you expect
-failures to occur during training or if resources might leave and join
-dynamically, please launch distributed data-parallel training using
-`torchelastic <https://pytorch.org/elastic>`__.
-
-
-General Distributed Training
+becomes a requirement. Sometimes it is inevitable to hit errors
+like out-of-memory (OOM) when using DDP, but DDP itself cannot recover from those errors,
+and it is not possible to handle them using a standard ``try-except`` construct.
+This is because DDP requires all processes to operate in a closely synchronized manner
+and all ``AllReduce`` communications launched in different processes must match.
+If one of the processes in the group
+throws an exception, it is likely to lead to desynchronization (mismatched
+``AllReduce`` operations) which would then cause a crash or hang.
+`torch.distributed.elastic <https://pytorch.org/docs/stable/distributed.elastic.html>`__
+adds fault tolerance and the ability to make use of a dynamic pool of machines (elasticity).
+
+RPC-Based Distributed Training
----------------------------

Many training paradigms do not fit into data parallelism, e.g.,
parameter server paradigm, distributed pipeline parallelism, reinforcement
-learning applications with multiple observers or agents, etc. The
-`torch.distributed.rpc <https://pytorch.org/docs/master/rpc.html>`__ aims at
+learning applications with multiple observers or agents, etc.
+`torch.distributed.rpc <https://pytorch.org/docs/stable/rpc.html>`__ aims at
supporting general distributed training scenarios.

-The `torch.distributed.rpc <https://pytorch.org/docs/master/rpc.html>`__ package
+`torch.distributed.rpc <https://pytorch.org/docs/stable/rpc.html>`__
has four main pillars:

-* `RPC <https://pytorch.org/docs/master/rpc.html#rpc>`__ supports running
+* `RPC <https://pytorch.org/docs/stable/rpc.html#rpc>`__ supports running
a given function on a remote worker.
-* `RRef <https://pytorch.org/docs/master/rpc.html#rref>`__ helps to manage the
+* `RRef <https://pytorch.org/docs/stable/rpc.html#rref>`__ helps to manage the
lifetime of a remote object. The reference counting protocol is presented in the
-`RRef notes <https://pytorch.org/docs/master/rpc/rref.html#remote-reference-protocol>`__.
-* `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__
+`RRef notes <https://pytorch.org/docs/stable/rpc/rref.html#remote-reference-protocol>`__.
+* `Distributed Autograd <https://pytorch.org/docs/stable/rpc.html#distributed-autograd-framework>`__
extends the autograd engine beyond machine boundaries. Please refer to
-`Distributed Autograd Design <https://pytorch.org/docs/master/rpc/distributed_autograd.html#distributed-autograd-design>`__
+`Distributed Autograd Design <https://pytorch.org/docs/stable/rpc/distributed_autograd.html#distributed-autograd-design>`__
for more details.
-* `Distributed Optimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__
-that automatically reaches out to all participating workers to update
+* `Distributed Optimizer <https://pytorch.org/docs/stable/rpc.html#module-torch.distributed.optim>`__
+automatically reaches out to all participating workers to update
parameters using gradients computed by the distributed autograd engine.
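
For context, a rough sketch touching the four pillars follows. It is illustrative only: the single-process world size, the worker name ``trainer``, and the toy tensors are assumptions made so the snippet can run standalone (RPC targets the local worker); a real job would target peer workers.

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

# Single-process setup so the sketch is self-contained; each real worker would
# call init_rpc with its own name/rank and a shared world_size.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
rpc.init_rpc("trainer", rank=0, world_size=1)

t = torch.ones(2, requires_grad=True)

# RRef: a reference to a value owned by a (possibly remote) worker.
rref = rpc.remote("trainer", torch.add, args=(torch.ones(2), 1.0))

with dist_autograd.context() as context_id:
    # RPC: run a function on a worker and wait for the result.
    ret = rpc.rpc_sync("trainer", torch.add, args=(t, t))
    loss = (ret + rref.to_here()).sum()
    # Distributed Autograd: run backward across workers within this context.
    dist_autograd.backward(context_id, [loss])
    grads = dist_autograd.get_gradients(context_id)  # gradients keyed by tensor
    # The Distributed Optimizer (fourth pillar) would take RRefs to remote
    # parameters and apply these gradients; it is omitted here for brevity.

rpc.shutdown()
```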

RPC Tutorials are listed below:

@@ -196,9 +189,9 @@ RPC Tutorials are listed below:
to a distributed environment and shows how to implement it using RPC.
4. The `Implementing Batch RPC Processing Using Asynchronous Executions <../intermediate/rpc_async_execution.html>`__
tutorial demonstrates how to implement RPC batch processing using the
-`@rpc.functions.async_execution <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.functions.async_execution>`__
-decorator, which can help speed up inference and training. It uses similar
-RL and PS examples employed in the above tutorials 1 and 2.
+`@rpc.functions.async_execution <https://pytorch.org/docs/stable/rpc.html#torch.distributed.rpc.functions.async_execution>`__
+decorator, which can help speed up inference and training. It uses
+RL and PS examples similar to those in the above tutorials 1 and 2.
5. The `Combining Distributed DataParallel with Distributed RPC Framework <../advanced/rpc_ddp_tutorial.html>`__
tutorial demonstrates how to combine DDP with RPC to train a model using
distributed data parallelism combined with distributed model parallelism.
