diff --git a/beginner_source/ddp_series_fault_tolerance.rst b/beginner_source/ddp_series_fault_tolerance.rst
index e141b4a7ff3..a05c2e1a9ca 100644
--- a/beginner_source/ddp_series_fault_tolerance.rst
+++ b/beginner_source/ddp_series_fault_tolerance.rst
@@ -42,8 +42,8 @@ Follow along with the video below or on `youtube `__.
+- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; ``torchrun`` assigns this along with several other `environment variables `__.
 - No need to call ``mp.spawn`` in your script; you only need a generic ``main()`` entrypoint, and launch the script with ``torchrun``. This way the same script can be run in non-distributed as well as single-node and multinode setups.
 - Gracefully restarting training from the last saved training snapshot
diff --git a/beginner_source/ddp_series_multigpu.rst b/beginner_source/ddp_series_multigpu.rst
index baf92d8f8af..46059f286b1 100644
--- a/beginner_source/ddp_series_multigpu.rst
+++ b/beginner_source/ddp_series_multigpu.rst
@@ -41,7 +41,7 @@ In this tutorial, we start with a single-GPU training script and migrate that to
 Along the way, we will talk through important concepts in distributed training while implementing them in our code.
 .. note::
- If your model contains any ``BatchNorm`` layer, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
+ If your model contains any ``BatchNorm`` layers, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
 layers across replicas.
 Use the helper function
@@ -57,7 +57,7 @@ Imports
 ~~~~~~~
 - ``torch.multiprocessing`` is a PyTorch wrapper around Python's native multiprocessing
-- The dsitributed process group contains all the processes that can
+- The distributed process group contains all the processes that can
 communicate and synchronize with each other.
 .. code:: diff
diff --git a/beginner_source/ddp_series_theory.rst b/beginner_source/ddp_series_theory.rst
index 8e493523140..7c22352f70e 100644
--- a/beginner_source/ddp_series_theory.rst
+++ b/beginner_source/ddp_series_theory.rst
@@ -54,8 +54,8 @@ DDP improves upon the architecture in a few ways:
 |                                       | machines                     |
 +---------------------------------------+------------------------------+
 | Slower; uses multithreading on a      | Faster (no GIL contention)   |
-| single process and runs into GIL      | because it uses              |
-| contention                            | multiprocessing              |
+| single process and runs into Global   | because it uses              |
+| Interpreter Lock (GIL) contention     | multiprocessing              |
 +---------------------------------------+------------------------------+

 Further Reading
diff --git a/intermediate_source/ddp_series_minGPT.rst b/intermediate_source/ddp_series_minGPT.rst
index 0493db8d62e..1d1f809e434 100644
--- a/intermediate_source/ddp_series_minGPT.rst
+++ b/intermediate_source/ddp_series_minGPT.rst
@@ -48,9 +48,9 @@ Files used for training
 ~~~~~~~~~~~~~~~~~~~~~~~~
 - `trainer.py `__ includes the Trainer class that runs the distributed training iterations on the model with the provided dataset.
 - `model.py `__ defines the model architecture.
-- `char_dataset.py `__ contains the `Dataset`class for a character-level dataset.
+- `char_dataset.py `__ contains the ``Dataset`` class for a character-level dataset.
 - `gpt2_train_cfg.yaml `__ contains the configurations for data, model, optimizer, and training run.
-- `main.py `__ is the entry point to the trainig job. It sets up the DDP process group, reads all the configurations and runs the training job.
+- `main.py `__ is the entry point to the training job. It sets up the DDP process group, reads all the configurations and runs the training job.

 Saving and Loading from the cloud
@@ -72,8 +72,8 @@ A typical training run's memory footprint consists of model weights, activations
 Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accomodate the full footprint.
 When models grow larger, more aggressive techniques might be useful:
- - `activation checkpointing `__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
- - `Fully-Sharded Data Parallel `__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog `__ to learn how we trained a 1 Trillion parameter model with FSDP.
+- `activation checkpointing `__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
+- `Fully-Sharded Data Parallel `__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog `__ to learn how we trained a 1 Trillion parameter model with FSDP.

 Further Reading
diff --git a/intermediate_source/ddp_series_multinode.rst b/intermediate_source/ddp_series_multinode.rst
index f7638366992..721c5580f6c 100644
--- a/intermediate_source/ddp_series_multinode.rst
+++ b/intermediate_source/ddp_series_multinode.rst
@@ -38,7 +38,7 @@ Follow along with the video below or on `youtube
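The hunks above document ``torchrun``'s environment variables and the ``SyncBatchNorm`` conversion note. A minimal sketch of how those pieces fit together, assuming a script launched with ``torchrun``; the toy model, layer sizes, and script name are placeholders, not taken from the tutorials:

.. code:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP


   def main():
       # torchrun supplies RANK, WORLD_SIZE, and LOCAL_RANK, so nothing is passed explicitly.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       # Placeholder model; the point is that it contains BatchNorm layers.
       model = torch.nn.Sequential(
           torch.nn.Conv2d(3, 8, kernel_size=3),
           torch.nn.BatchNorm2d(8),
           torch.nn.ReLU(),
       ).to(local_rank)

       # Convert BatchNorm layers to SyncBatchNorm so running stats are synced across replicas.
       model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
       model = DDP(model, device_ids=[local_rank])

       # ... training loop and snapshot saving/loading would go here ...

       dist.destroy_process_group()


   if __name__ == "__main__":
       main()

Such a script would be launched with, for example, ``torchrun --standalone --nproc_per_node=4 train.py``, where ``train.py`` is whatever the script happens to be named.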
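The minGPT hunk describes activation checkpointing as recomputing activations in the backward pass instead of storing them. A small sketch of that trade-off using ``torch.utils.checkpoint``; the block and tensor shapes are illustrative only:

.. code:: python

   import torch
   from torch.utils.checkpoint import checkpoint

   # Illustrative feed-forward block; the sizes are arbitrary.
   block = torch.nn.Sequential(
       torch.nn.Linear(1024, 4096),
       torch.nn.GELU(),
       torch.nn.Linear(4096, 1024),
   )

   x = torch.randn(8, 1024, requires_grad=True)

   # The block's intermediate activations are not stored for backward;
   # they are recomputed during the backward pass, trading extra compute for memory.
   y = checkpoint(block, x, use_reentrant=False)
   y.sum().backward()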