
Commit 9ec2625

Fix minor typos, grammar, and formatting errors in the DDP video series (#2197)
* Fix minor typos, grammar, and formatting errors
1 parent 6b325fc commit 9ec2625

File tree

5 files changed, +13 -13 lines changed


beginner_source/ddp_series_fault_tolerance.rst

Lines changed: 3 additions & 3 deletions
@@ -42,8 +42,8 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/
 
 In distributed training, a single process failure can
 disrupt the entire training job. Since the susceptibility for failure can be higher here, making your training
-script robust is particularly important here. You might also prefer your training job to be *elastic* i.e.
-compute resources can join and leave dynamically over the course of the job.
+script robust is particularly important here. You might also prefer your training job to be *elastic*, for example,
+compute resources can join and leave dynamically over the course of the job.
 
 PyTorch offers a utility called ``torchrun`` that provides fault-tolerance and
 elastic training. When a failure occurs, ``torchrun`` logs the errors and
@@ -60,7 +60,7 @@ Why use ``torchrun``
 ``torchrun`` handles the minutiae of distributed training so that you
 don't need to. For instance,
 
-- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; torchrun assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
+- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; ``torchrun`` assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
 - No need to call ``mp.spawn`` in your script; you only need a generic ``main()`` entrypoint, and launch the script with ``torchrun``. This way the same script can be run in non-distributed as well as single-node and multinode setups.
 - Gracefully restarting training from the last saved training snapshot
 
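
The bullets above describe the contract ``torchrun`` expects from a training script. As a rough sketch (not the tutorial's actual script; the backend choice and launch command are illustrative assumptions), a generic ``main()`` entrypoint only needs to read the environment variables ``torchrun`` sets:

.. code:: python

   # Minimal sketch of a torchrun-friendly entrypoint: no explicit rank/world_size
   # arguments; everything comes from the environment variables torchrun exports.
   import os

   import torch
   import torch.distributed as dist


   def main():
       local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each worker
       if torch.cuda.is_available():
           torch.cuda.set_device(local_rank)
       # RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are also exported by torchrun,
       # so init_process_group can pick them up without explicit arguments.
       dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
       # ... build the model and dataloader, resume from the last snapshot if one exists ...
       dist.destroy_process_group()


   if __name__ == "__main__":
       main()  # e.g. torchrun --standalone --nproc_per_node=4 this_script.py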

beginner_source/ddp_series_multigpu.rst

Lines changed: 2 additions & 2 deletions
@@ -41,7 +41,7 @@ In this tutorial, we start with a single-GPU training script and migrate that to
 Along the way, we will talk through important concepts in distributed training while implementing them in our code.
 
 .. note::
-If your model contains any ``BatchNorm`` layer, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
+If your model contains any ``BatchNorm`` layers, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
 layers across replicas.
 
 Use the helper function
@@ -57,7 +57,7 @@ Imports
 ~~~~~~~
 - ``torch.multiprocessing`` is a PyTorch wrapper around Python's native
 multiprocessing
-- The dsitributed process group contains all the processes that can
+- The distributed process group contains all the processes that can
 communicate and synchronize with each other.
 
 .. code:: diff
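
The ``BatchNorm`` note above boils down to a single helper call before wrapping the model in DDP. A minimal sketch (the ``gpu_id`` argument and the surrounding setup are assumed, not quoted from the tutorial):

.. code:: python

   # Sketch: convert BatchNorm layers to SyncBatchNorm so their running stats are
   # synced across replicas, then wrap the model in DDP. Assumes the distributed
   # process group has already been initialized.
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP


   def prepare_model(model: nn.Module, gpu_id: int) -> nn.Module:
       model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
       return DDP(model.to(gpu_id), device_ids=[gpu_id])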

beginner_source/ddp_series_theory.rst

Lines changed: 2 additions & 2 deletions
@@ -54,8 +54,8 @@ DDP improves upon the architecture in a few ways:
 |                                       | machines                     |
 +---------------------------------------+------------------------------+
 | Slower; uses multithreading on a      | Faster (no GIL contention)   |
-| single process and runs into GIL      | because it uses              |
-| contention                            | multiprocessing              |
+| single process and runs into Global   | because it uses              |
+| Interpreter Lock (GIL) contention     | multiprocessing              |
 +---------------------------------------+------------------------------+
 
 Further Reading
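
To make the comparison in the table above concrete, the two approaches differ in how the model is wrapped: one process driving all GPUs with threads versus one process per GPU. A brief sketch (the helper names and placeholder setup are illustrative, not from the tutorial):

.. code:: python

   # Sketch: DataParallel runs a single process and fans out over GPUs with threads
   # (subject to GIL contention), while DistributedDataParallel runs one process per GPU.
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP


   def wrap_with_data_parallel(model: nn.Module) -> nn.Module:
       return nn.DataParallel(model.cuda())  # single process, multithreaded


   def wrap_with_ddp(model: nn.Module, rank: int) -> nn.Module:
       # One process per GPU; assumes init_process_group() has already been called.
       return DDP(model.to(rank), device_ids=[rank])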

intermediate_source/ddp_series_minGPT.rst

Lines changed: 4 additions & 4 deletions
@@ -48,9 +48,9 @@ Files used for training
 ~~~~~~~~~~~~~~~~~~~~~~~~
 - `trainer.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/trainer.py>`__ includes the Trainer class that runs the distributed training iterations on the model with the provided dataset.
 - `model.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/model.py>`__ defines the model architecture.
-- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the `Dataset`class for a character-level dataset.
+- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the ``Dataset`` class for a character-level dataset.
 - `gpt2_train_cfg.yaml <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/gpt2_train_cfg.yaml>`__ contains the configurations for data, model, optimizer, and training run.
-- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the trainig job. It sets up the DDP process group, reads all the configurations and runs the training job.
+- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the training job. It sets up the DDP process group, reads all the configurations and runs the training job.
 
 
 Saving and Loading from the cloud
@@ -72,8 +72,8 @@ A typical training run's memory footprint consists of model weights, activations
 Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accomodate the full footprint.
 When models grow larger, more aggressive techniques might be useful:
 
-- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
+- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
+- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
 
 
 Further Reading
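
The activation-checkpointing bullet above maps to a small utility in ``torch.utils.checkpoint``. A hedged sketch with a placeholder block (not the minGPT model):

.. code:: python

   # Sketch: recompute a block's activations during the backward pass instead of
   # storing them, trading extra compute for a smaller memory footprint.
   import torch
   import torch.nn as nn
   from torch.utils.checkpoint import checkpoint


   class CheckpointedBlock(nn.Module):
       def __init__(self, dim: int = 128):
           super().__init__()
           self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

       def forward(self, x: torch.Tensor) -> torch.Tensor:
           # Intermediate activations inside self.block are not kept for backward.
           return checkpoint(self.block, x, use_reentrant=False)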

intermediate_source/ddp_series_multinode.rst

Lines changed: 2 additions & 2 deletions
@@ -38,7 +38,7 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/
 Multinode training involves deploying a training job across several
 machines. There are two ways to do this:
 
-- running a torchrun command on each machine with identical rendezvous arguments, or
+- running a ``torchrun`` command on each machine with identical rendezvous arguments, or
 - deploying it on a compute cluster using a workload manager (like SLURM)
 
 In this video we will go over the (minimal) code changes required to move from single-node multigpu to
@@ -50,7 +50,7 @@ on 4 GPUs on a single node will be faster than running it on 4 nodes with 1 GPU
 Local and Global ranks
 ~~~~~~~~~~~~~~~~~~~~~~~~
 In single-node settings, we were tracking the
-``gpu_id``s of the devices running our training processes. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``
+``gpu_id`` of each device running our training process. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``
 which uniquely identifies each GPU-process on a node. For a unique identifier across all the nodes, ``torchrun`` provides another variable
 ``RANK`` which refers to the global rank of a process.
 
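
As a small illustration of the local/global distinction above (function and variable names are illustrative, not quoted from the tutorial), both values are read from the environment under ``torchrun``:

.. code:: python

   # Sketch: LOCAL_RANK picks the GPU on this node; RANK identifies the process
   # uniquely across all nodes in the job.
   import os

   import torch


   def get_ranks() -> tuple[int, int]:
       local_rank = int(os.environ["LOCAL_RANK"])  # 0 .. nproc_per_node - 1 on each node
       global_rank = int(os.environ["RANK"])       # 0 .. world_size - 1 across all nodes
       if torch.cuda.is_available():
           torch.cuda.set_device(local_rank)
       return local_rank, global_rank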
