Fix minor typos, grammar, and formatting errors in the DDP video series #2197

Merged 2 commits on Feb 10, 2023

6 changes: 3 additions & 3 deletions beginner_source/ddp_series_fault_tolerance.rst
@@ -42,8 +42,8 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/

In distributed training, a single process failure can
disrupt the entire training job. Since the susceptibility for failure can be higher here, making your training
-script robust is particularly important here. You might also prefer your training job to be *elastic* i.e.
+script robust is particularly important here. You might also prefer your training job to be *elastic*, for example,
compute resources can join and leave dynamically over the course of the job.

PyTorch offers a utility called ``torchrun`` that provides fault-tolerance and
elastic training. When a failure occurs, ``torchrun`` logs the errors and
@@ -60,7 +60,7 @@ Why use ``torchrun``
``torchrun`` handles the minutiae of distributed training so that you
don't need to. For instance,

-- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; torchrun assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
+- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; ``torchrun`` assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
- No need to call ``mp.spawn`` in your script; you only need a generic ``main()`` entrypoint, and launch the script with ``torchrun``. This way the same script can be run in non-distributed as well as single-node and multinode setups.
- Gracefully restarting training from the last saved training snapshot

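As a side note (not part of this PR's diff), a minimal sketch of the kind of generic ``main()`` entrypoint that ``torchrun`` expects; the script name and toy model are illustrative assumptions:

.. code:: python

   import os
   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP

   def main():
       # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE (and more) for every process it launches,
       # so the script does not need to pass rank or world_size explicitly.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       model = DDP(torch.nn.Linear(10, 10).to(local_rank), device_ids=[local_rank])
       # ... training loop and periodic snapshot saving would go here ...
       dist.destroy_process_group()

   if __name__ == "__main__":
       main()

A single-node launch could then look like ``torchrun --standalone --nproc_per_node=4 main.py`` (flag values illustrative).
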
4 changes: 2 additions & 2 deletions beginner_source/ddp_series_multigpu.rst
@@ -41,7 +41,7 @@ In this tutorial, we start with a single-GPU training script and migrate that to
Along the way, we will talk through important concepts in distributed training while implementing them in our code.

.. note::
-If your model contains any ``BatchNorm`` layer, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
+If your model contains any ``BatchNorm`` layers, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
layers across replicas.

Use the helper function
@@ -57,7 +57,7 @@ Imports
~~~~~~~
- ``torch.multiprocessing`` is a PyTorch wrapper around Python's native
multiprocessing
-- The dsitributed process group contains all the processes that can
+- The distributed process group contains all the processes that can
communicate and synchronize with each other.

.. code:: diff
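
Outside the diff above, a small sketch (our illustration, not part of this PR) of spawning one process per GPU and converting ``BatchNorm`` layers as the note describes; the address, port, and toy model are assumptions:

.. code:: python

   import torch
   import torch.distributed as dist
   import torch.multiprocessing as mp
   from torch.nn.parallel import DistributedDataParallel as DDP

   def worker(rank: int, world_size: int):
       # One process per GPU; rank identifies this process within the group.
       dist.init_process_group(backend="nccl", init_method="tcp://localhost:12355",
                               rank=rank, world_size=world_size)
       torch.cuda.set_device(rank)

       model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8)).to(rank)
       # Convert BatchNorm layers so their running stats are synced across replicas.
       model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
       model = DDP(model, device_ids=[rank])
       # ... training loop would go here ...
       dist.destroy_process_group()

   if __name__ == "__main__":
       world_size = torch.cuda.device_count()
       mp.spawn(worker, args=(world_size,), nprocs=world_size)
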
4 changes: 2 additions & 2 deletions beginner_source/ddp_series_theory.rst
@@ -54,8 +54,8 @@ DDP improves upon the architecture in a few ways:
| | machines |
+---------------------------------------+------------------------------+
| Slower; uses multithreading on a | Faster (no GIL contention) |
-| single process and runs into GIL | because it uses |
-| contention | multiprocessing |
+| single process and runs into Global | because it uses |
+| Interpreter Lock (GIL) contention | multiprocessing |
+---------------------------------------+------------------------------+
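
To illustrate the contrast in this table (our sketch, not part of the PR): ``DataParallel`` is driven by a single process and its threads, while DDP uses one process per GPU and needs an initialized process group:

.. code:: python

   import torch
   from torch.nn.parallel import DataParallel, DistributedDataParallel as DDP

   model = torch.nn.Linear(20, 5)

   # DataParallel: one Python process fans work out to all visible GPUs via threads,
   # so it is subject to GIL contention.
   dp_model = DataParallel(model.cuda())

   # DDP: valid only inside a process that has already called
   # torch.distributed.init_process_group (e.g. one launched by torchrun).
   # ddp_model = DDP(model.to(local_rank), device_ids=[local_rank])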

Further Reading
8 changes: 4 additions & 4 deletions intermediate_source/ddp_series_minGPT.rst
@@ -48,9 +48,9 @@ Files used for training
~~~~~~~~~~~~~~~~~~~~~~~~
- `trainer.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/trainer.py>`__ includes the Trainer class that runs the distributed training iterations on the model with the provided dataset.
- `model.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/model.py>`__ defines the model architecture.
-- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the `Dataset`class for a character-level dataset.
+- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the ``Dataset`` class for a character-level dataset.
- `gpt2_train_cfg.yaml <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/gpt2_train_cfg.yaml>`__ contains the configurations for data, model, optimizer, and training run.
-- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the trainig job. It sets up the DDP process group, reads all the configurations and runs the training job.
+- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the training job. It sets up the DDP process group, reads all the configurations and runs the training job.


Saving and Loading from the cloud
@@ -72,8 +72,8 @@ A typical training run's memory footprint consists of model weights, activations
Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accomodate the full footprint.
When models grow larger, more aggressive techniques might be useful:

-- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
+- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
+- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
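
As an aside (our illustration, independent of this PR), a tiny sketch of the activation-checkpointing idea from the first bullet; the block and shapes are made up:

.. code:: python

   import torch
   from torch.utils.checkpoint import checkpoint

   block = torch.nn.Sequential(
       torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 512)
   )
   x = torch.randn(8, 512, requires_grad=True)

   # Activations inside block are not stored during the forward pass;
   # they are recomputed when backward reaches this segment.
   y = checkpoint(block, x, use_reentrant=False)
   y.sum().backward()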


Further Reading
4 changes: 2 additions & 2 deletions intermediate_source/ddp_series_multinode.rst
@@ -38,7 +38,7 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/
Multinode training involves deploying a training job across several
machines. There are two ways to do this:

-- running a torchrun command on each machine with identical rendezvous arguments, or
+- running a ``torchrun`` command on each machine with identical rendezvous arguments, or
- deploying it on a compute cluster using a workload manager (like SLURM)

In this video we will go over the (minimal) code changes required to move from single-node multigpu to
@@ -50,7 +50,7 @@ on 4 GPUs on a single node will be faster than running it on 4 nodes with 1 GPU
Local and Global ranks
~~~~~~~~~~~~~~~~~~~~~~~~
In single-node settings, we were tracking the
-``gpu_id``s of the devices running our training processes. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``
+``gpu_id`` of each device running our training process. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``
which uniquely identifies each GPU-process on a node. For a unique identifier across all the nodes, ``torchrun`` provides another variable
``RANK`` which refers to the global rank of a process.

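For context (ours, not from the diff), a sketch of reading both ranks inside a script launched by ``torchrun``, plus an example multinode launch; hostnames, ports, and the script name are placeholders:

.. code:: python

   import os
   import torch
   import torch.distributed as dist

   dist.init_process_group(backend="nccl")
   local_rank = int(os.environ["LOCAL_RANK"])   # unique per GPU within a node
   global_rank = int(os.environ["RANK"])        # unique across all nodes
   torch.cuda.set_device(local_rank)
   print(f"global rank {global_rank}, local rank {local_rank}")
   dist.destroy_process_group()

Run on every node with identical rendezvous arguments, for example:
``torchrun --nnodes=2 --nproc_per_node=4 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=node0.example.com:29500 multinode.py``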