beginner_source/ddp_series_fault_tolerance.rst (3 additions & 3 deletions)
@@ -42,8 +42,8 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/
 In distributed training, a single process failure can
 disrupt the entire training job. Since the susceptibility for failure can be higher here, making your training
-script robust is particularly important here. You might also prefer your training job to be *elastic* i.e.
-compute resources can join and leave dynamically over the course of the job.
+script robust is particularly important here. You might also prefer your training job to be *elastic*, for example,
+compute resources can join and leave dynamically over the course of the job.

 PyTorch offers a utility called ``torchrun`` that provides fault-tolerance and
 elastic training. When a failure occurs, ``torchrun`` logs the errors and
@@ -60,7 +60,7 @@ Why use ``torchrun``
 ``torchrun`` handles the minutiae of distributed training so that you
 don't need to. For instance,

-- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; torchrun assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
+- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; ``torchrun`` assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
 - No need to call ``mp.spawn`` in your script; you only need a generic ``main()`` entrypoint, and launch the script with ``torchrun``. This way the same script can be run in non-distributed as well as single-node and multinode setups.
 - Gracefully restarting training from the last saved training snapshot
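As a rough companion to the list in this hunk, here is a minimal sketch of an entrypoint that relies on the environment variables ``torchrun`` sets. It is not the tutorial's actual script; the ``main.py`` filename, the ``nccl`` backend, and the launch flags below are assumptions.

.. code-block:: python

    # Minimal sketch of a torchrun-friendly entrypoint (assumed file name: main.py).
    # torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT,
    # so the script never passes rank/world_size explicitly and never calls mp.spawn.
    import os

    import torch
    import torch.distributed as dist


    def main():
        # The default "env://" init method reads MASTER_ADDR, MASTER_PORT,
        # RANK, and WORLD_SIZE from the environment set by torchrun.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        print(f"rank {dist.get_rank()} of world size {dist.get_world_size()} is ready")
        # ... build the model, wrap it in DDP, train, save/load snapshots ...
        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

A launch such as ``torchrun --standalone --nproc_per_node=4 main.py`` then supplies the environment that ``init_process_group`` consumes, with no ``mp.spawn`` call anywhere in the script.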
intermediate_source/ddp_series_minGPT.rst (4 additions & 4 deletions)
@@ -48,9 +48,9 @@ Files used for training
 ~~~~~~~~~~~~~~~~~~~~~~~~
 - `trainer.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/trainer.py>`__ includes the Trainer class that runs the distributed training iterations on the model with the provided dataset.
 - `model.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/model.py>`__ defines the model architecture.
-- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the `Dataset`class for a character-level dataset.
+- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the ``Dataset`` class for a character-level dataset.
 - `gpt2_train_cfg.yaml <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/gpt2_train_cfg.yaml>`__ contains the configurations for data, model, optimizer, and training run.
-- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the trainig job. It sets up the DDP process group, reads all the configurations and runs the training job.
+- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the training job. It sets up the DDP process group, reads all the configurations and runs the training job.

 Saving and Loading from the cloud
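To make the role of ``char_dataset.py`` concrete, the following is only an illustrative sketch of a character-level ``Dataset`` (assuming next-character prediction with a fixed block size); it is an approximation, not the exact code in the minGPT-ddp repository.

.. code-block:: python

    # Illustrative character-level Dataset (an approximation, not the exact
    # implementation in minGPT-ddp/mingpt/char_dataset.py).
    import torch
    from torch.utils.data import Dataset


    class CharDataset(Dataset):
        def __init__(self, text: str, block_size: int):
            chars = sorted(set(text))
            self.stoi = {ch: i for i, ch in enumerate(chars)}  # char -> integer id
            self.block_size = block_size
            self.data = [self.stoi[ch] for ch in text]

        def __len__(self):
            return len(self.data) - self.block_size

        def __getitem__(self, idx):
            # The input is a block of characters; the target is the same block
            # shifted one position to the right (next-character prediction).
            chunk = self.data[idx : idx + self.block_size + 1]
            x = torch.tensor(chunk[:-1], dtype=torch.long)
            y = torch.tensor(chunk[1:], dtype=torch.long)
            return x, y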
@@ -72,8 +72,8 @@ A typical training run's memory footprint consists of model weights, activations
 Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accommodate the full footprint.
 When models grow larger, more aggressive techniques might be useful:

-- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
+- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
+- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
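For readers who want to see what the two techniques in this hunk look like in code, here is a hedged sketch using the standard PyTorch entry points (``torch.utils.checkpoint`` and ``FullyShardedDataParallel``); the ``Block`` module and its sizes are invented for illustration and this is not code from the tutorial or from minGPT-ddp.

.. code-block:: python

    # Sketch of activation checkpointing and FSDP wrapping; the Block module
    # and its dimensions are invented for illustration.
    import torch
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.utils.checkpoint import checkpoint


    class Block(nn.Module):
        def __init__(self, dim: int = 1024):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

        def forward(self, x):
            # Activation checkpointing: this block's intermediate activations are
            # recomputed during the backward pass instead of being stored.
            return x + checkpoint(self.ff, x, use_reentrant=False)


    # Fully-Sharded Data Parallel: parameters, gradients, and optimizer state are
    # sharded across ranks rather than replicated as in DDP. This assumes a
    # process group has already been initialized (for example, under torchrun).
    # model = FSDP(nn.Sequential(*[Block() for _ in range(12)]).cuda())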