Fix minor typos, grammar, and formatting errors in the DDP video series #2197

Merged 2 commits on Feb 10, 2023

6 changes: 3 additions & 3 deletions beginner_source/ddp_series_fault_tolerance.rst
@@ -42,8 +42,8 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/

In distributed training, a single process failure can
disrupt the entire training job. Since the susceptibility for failure can be higher here, making your training
-script robust is particularly important here. You might also prefer your training job to be *elastic* i.e.
+script robust is particularly important here. You might also prefer your training job to be *elastic*, for example,
compute resources can join and leave dynamically over the course of the job.

PyTorch offers a utility called ``torchrun`` that provides fault-tolerance and
elastic training. When a failure occurs, ``torchrun`` logs the errors and
@@ -60,7 +60,7 @@ Why use ``torchrun``
``torchrun`` handles the minutiae of distributed training so that you
don't need to. For instance,

-- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; torchrun assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
+- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; ``torchrun`` assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
- No need to call ``mp.spawn`` in your script; you only need a generic ``main()`` entrypoint, and launch the script with ``torchrun``. This way the same script can be run in non-distributed as well as single-node and multinode setups.
- Gracefully restarting training from the last saved training snapshot

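As a side note (not part of this PR's diff), a minimal sketch of the kind of generic ``main()`` entrypoint that ``torchrun`` expects; the script name and toy model are illustrative assumptions:

.. code:: python

   import os
   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP

   def main():
       # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE (and more) for every process it launches,
       # so the script does not need to pass rank or world_size explicitly.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       model = DDP(torch.nn.Linear(10, 10).to(local_rank), device_ids=[local_rank])
       # ... training loop and periodic snapshot saving would go here ...
       dist.destroy_process_group()

   if __name__ == "__main__":
       main()

A single-node launch could then look like ``torchrun --standalone --nproc_per_node=4 main.py`` (flag values illustrative).
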
4 changes: 2 additions & 2 deletions beginner_source/ddp_series_multigpu.rst
@@ -41,7 +41,7 @@ In this tutorial, we start with a single-GPU training script and migrate that to
Along the way, we will talk through important concepts in distributed training while implementing them in our code.

.. note::
-If your model contains any ``BatchNorm`` layer, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
+If your model contains any ``BatchNorm`` layers, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
layers across replicas.

Use the helper function
@@ -57,7 +57,7 @@ Imports
~~~~~~~
- ``torch.multiprocessing`` is a PyTorch wrapper around Python's native
multiprocessing
-- The dsitributed process group contains all the processes that can
+- The distributed process group contains all the processes that can
communicate and synchronize with each other.

.. code:: diff
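
Outside the diff above, a small sketch (our illustration, not part of this PR) of spawning one process per GPU and converting ``BatchNorm`` layers as the note describes; the address, port, and toy model are assumptions:

.. code:: python

   import torch
   import torch.distributed as dist
   import torch.multiprocessing as mp
   from torch.nn.parallel import DistributedDataParallel as DDP

   def worker(rank: int, world_size: int):
       # One process per GPU; rank identifies this process within the group.
       dist.init_process_group(backend="nccl", init_method="tcp://localhost:12355",
                               rank=rank, world_size=world_size)
       torch.cuda.set_device(rank)

       model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8)).to(rank)
       # Convert BatchNorm layers so their running stats are synced across replicas.
       model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
       model = DDP(model, device_ids=[rank])
       # ... training loop would go here ...
       dist.destroy_process_group()

   if __name__ == "__main__":
       world_size = torch.cuda.device_count()
       mp.spawn(worker, args=(world_size,), nprocs=world_size)
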
4 changes: 2 additions & 2 deletions beginner_source/ddp_series_theory.rst
@@ -54,8 +54,8 @@ DDP improves upon the architecture in a few ways:
| | machines |
+---------------------------------------+------------------------------+
| Slower; uses multithreading on a | Faster (no GIL contention) |
-| single process and runs into GIL | because it uses |
-| contention | multiprocessing |
+| single process and runs into Global | because it uses |
+| Interpreter Lock (GIL) contention | multiprocessing |
+---------------------------------------+------------------------------+
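
To illustrate the contrast in this table (our sketch, not part of the PR): ``DataParallel`` is driven by a single process and its threads, while DDP uses one process per GPU and needs an initialized process group:

.. code:: python

   import torch
   from torch.nn.parallel import DataParallel, DistributedDataParallel as DDP

   model = torch.nn.Linear(20, 5)

   # DataParallel: one Python process fans work out to all visible GPUs via threads,
   # so it is subject to GIL contention.
   dp_model = DataParallel(model.cuda())

   # DDP: valid only inside a process that has already called
   # torch.distributed.init_process_group (e.g. one launched by torchrun).
   # ddp_model = DDP(model.to(local_rank), device_ids=[local_rank])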

Further Reading
8 changes: 4 additions & 4 deletions intermediate_source/ddp_series_minGPT.rst
@@ -48,9 +48,9 @@ Files used for training
~~~~~~~~~~~~~~~~~~~~~~~~
- `trainer.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/trainer.py>`__ includes the Trainer class that runs the distributed training iterations on the model with the provided dataset.
- `model.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/model.py>`__ defines the model architecture.
-- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the `Dataset`class for a character-level dataset.
+- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the ``Dataset`` class for a character-level dataset.
- `gpt2_train_cfg.yaml <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/gpt2_train_cfg.yaml>`__ contains the configurations for data, model, optimizer, and training run.
-- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the trainig job. It sets up the DDP process group, reads all the configurations and runs the training job.
+- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the training job. It sets up the DDP process group, reads all the configurations and runs the training job.


Saving and Loading from the cloud
@@ -72,8 +72,8 @@ A typical training run's memory footprint consists of model weights, activations
Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accomodate the full footprint.
When models grow larger, more aggressive techniques might be useful:

-- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
+- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
+- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
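
As an aside (our illustration, independent of this PR), a tiny sketch of the activation-checkpointing idea from the first bullet; the block and shapes are made up:

.. code:: python

   import torch
   from torch.utils.checkpoint import checkpoint

   block = torch.nn.Sequential(
       torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
   )
   x = torch.randn(8, 512, requires_grad=True)

   # Activations inside block are not stored during the forward pass;
   # they are recomputed when backward reaches this segment.
   y = checkpoint(block, x, use_reentrant=False)
   y.sum().backward()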


Further Reading
4 changes: 2 additions & 2 deletions intermediate_source/ddp_series_multinode.rst
@@ -38,7 +38,7 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/
Multinode training involves deploying a training job across several
machines. There are two ways to do this:

-- running a torchrun command on each machine with identical rendezvous arguments, or
+- running a ``torchrun`` command on each machine with identical rendezvous arguments, or
- deploying it on a compute cluster using a workload manager (like SLURM)

In this video we will go over the (minimal) code changes required to move from single-node multigpu to
@@ -50,7 +50,7 @@ on 4 GPUs on a single node will be faster than running it on 4 nodes with 1 GPU
Local and Global ranks
~~~~~~~~~~~~~~~~~~~~~~~~
In single-node settings, we were tracking the
-``gpu_id``s of the devices running our training processes. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``
+``gpu_id`` of each device running our training process. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``
which uniquely identifies each GPU-process on a node. For a unique identifier across all the nodes, ``torchrun`` provides another variable
``RANK`` which refers to the global rank of a process.

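For context (ours, not from the diff), a sketch of reading both ranks inside a script launched by ``torchrun``, plus an example multinode launch; hostnames, ports, and the script name are placeholders:

.. code:: python

   import os
   import torch
   import torch.distributed as dist

   dist.init_process_group(backend="nccl")
   local_rank = int(os.environ["LOCAL_RANK"])   # unique per GPU within a node
   global_rank = int(os.environ["RANK"])        # unique across all nodes
   torch.cuda.set_device(local_rank)
   print(f"global rank {global_rank}, local rank {local_rank}")
   dist.destroy_process_group()

Run on every node with identical rendezvous arguments, for example:
``torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=node0.example.com:29500 multinode.py``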