Commit 248324a
Author: Svetlana Karslioglu
Merge branch 'master' into distr-tutorial-landing
2 parents 2200ff1 + 6b3af99

File tree: 11 files changed, +727 −21 lines changed

_static/css/custom.css

Lines changed: 11 additions & 4 deletions

@@ -2,11 +2,12 @@
 */
 
 :root {
+    --sd-color-info: #ee4c2c;
     --sd-color-primary: #6c6c6d;
     --sd-color-primary-highlight: #f3f4f7;
     --sd-color-card-border-hover: #ee4c2c;
     --sd-color-card-border: #f3f4f7;
-    --sd-color-card-background: #f3f4f7;
+    --sd-color-card-background: #fff;
     --sd-color-card-text: inherit;
     --sd-color-card-header: transparent;
     --sd-color-card-footer: transparent;
@@ -20,13 +21,19 @@
     --sd-color-tabs-underline: rgb(222, 222, 222);
 }
 
+.sd-text-info {
+    color: #ee4c2c;
+}
+
+
 .sd-card {
     position: relative;
-    background-color: #f3f4f7;
-    opacity: 0.5;
+    background-color: #fff;
+    opacity: 1.0;
     border-radius: 0px;
     width: 30%;
-    border: none
+    border: none;
+    padding-bottom: 0px;
 }

advanced_source/ddp_pipeline.py

Lines changed: 1 addition & 1 deletion

@@ -52,7 +52,7 @@ def __init__(self, d_model, dropout=0.1, max_len=5000):
         pe[:, 0::2] = torch.sin(position * div_term)
         pe[:, 1::2] = torch.cos(position * div_term)
         pe = pe.unsqueeze(0).transpose(0, 1)
-        self.register_buffer('pe', pe)
+        self.pe = nn.Parameter(pe, requires_grad=False)
 
     def forward(self, x):
         x = x + self.pe[:x.size(0), :]
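
(A note on this change, as an inference from the diff rather than a stated rationale: an ``nn.Parameter`` with ``requires_grad=False`` behaves like the old buffer in ``forward`` and still lands in the ``state_dict``, but unlike a buffer it is visible to ``model.parameters()``, which can matter for utilities that partition or move a model by its parameters.)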
beginner_source/ddp_series_fault_tolerance.rst

Lines changed: 212 additions & 0 deletions

`Introduction <ddp_series_intro.html>`__ \|\| `What is DDP <ddp_series_theory.html>`__ \|\| `Single-Node
Multi-GPU Training <ddp_series_multigpu.html>`__ \|\| **Fault
Tolerance** \|\| `Multi-Node
training <../intermediate/ddp_series_multinode.html>`__ \|\| `minGPT Training <../intermediate/ddp_series_minGPT.html>`__

Fault-tolerant Distributed Training with ``torchrun``
=====================================================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
      :margin: 0

      - Launching multi-GPU training jobs with ``torchrun``
      - Saving and loading snapshots of your training job
      - Structuring your training script for graceful restarts

.. grid:: 1

   .. grid-item::

      :octicon:`code-square;1.0em;` View the code used in this tutorial on `GitHub <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu_torchrun.py>`__

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
      :margin: 0

      * High-level `overview <ddp_series_theory.html>`__ of DDP
      * Familiarity with `DDP code <ddp_series_multigpu.html>`__
      * A machine with multiple GPUs (this tutorial uses an AWS p3.8xlarge instance)
      * PyTorch `installed <https://pytorch.org/get-started/locally/>`__ with CUDA

Follow along with the video below or on `youtube <https://www.youtube.com/watch/9kIvQOiwYzg>`__.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
     <iframe width="560" height="315" src="https://www.youtube.com/embed/9kIvQOiwYzg" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

In distributed training, a single process failure can
disrupt the entire training job. Since the susceptibility to failure is higher in this setting, making your
training script robust is particularly important. You might also prefer your training job to be *elastic*,
i.e., resilient to compute resources joining and leaving the job dynamically over its lifetime.

PyTorch offers a utility called ``torchrun`` that provides fault tolerance and
elastic training. When a failure occurs, ``torchrun`` logs the errors and
attempts to automatically restart all the processes from the last saved
“snapshot” of the training job.

The snapshot saves more than just the model state; it can include
details about the number of epochs run, optimizer states, or any other
stateful attribute of the training job necessary for its continuity.
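
The diffs later in this tutorial save only the model state and the epoch count; for illustration, a snapshot that also carries optimizer state might look like the sketch below (the ``save_snapshot`` helper and the ``OPTIMIZER_STATE`` key are illustrative, not part of the tutorial code):

.. code:: python

   import torch

   def save_snapshot(model, optimizer, epoch, path="snapshot.pt"):
       # One dictionary holds every stateful attribute needed to resume.
       snapshot = {
           "MODEL_STATE": model.state_dict(),
           "OPTIMIZER_STATE": optimizer.state_dict(),  # illustrative extra key
           "EPOCHS_RUN": epoch,
       }
       torch.save(snapshot, path)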

Why use ``torchrun``
~~~~~~~~~~~~~~~~~~~~

``torchrun`` handles the minutiae of distributed training so that you
don't need to. For instance,

- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; ``torchrun`` assigns these along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
- No need to call ``mp.spawn`` in your script; you only need a generic ``main()`` entrypoint, and launch the script with ``torchrun``. This way, the same script can run in non-distributed as well as single-node and multi-node setups.
- Gracefully restarting training from the last saved training snapshot.

Graceful restarts
~~~~~~~~~~~~~~~~~

For graceful restarts, you should structure your training script like this:

.. code:: python

   def main():
       load_snapshot(snapshot_path)
       initialize()
       train()

   def train():
       for batch in iter(dataset):
           train_step(batch)

           if should_checkpoint:
               save_snapshot(snapshot_path)
If a failure occurs, ``torchrun`` will terminate all the processes and restart them.
Each process entrypoint first loads and initializes from the last saved snapshot, and continues training from there.
So at any failure, you only lose the training progress made since the last saved snapshot.

In elastic training, whenever there is a membership change (nodes being added or removed), ``torchrun`` terminates and respawns processes
on the available devices. Structuring your script this way ensures your training job can continue without manual intervention.
Diff for `multigpu.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py>`__ vs. `multigpu_torchrun.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu_torchrun.py>`__
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Process group initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``torchrun`` assigns ``RANK`` and ``WORLD_SIZE`` automatically,
  amongst `other environment
  variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.

.. code:: diff

   - def ddp_setup(rank, world_size):
   + def ddp_setup():
   -     """
   -     Args:
   -         rank: Unique identifier of each process
   -         world_size: Total number of processes
   -     """
   -     os.environ["MASTER_ADDR"] = "localhost"
   -     os.environ["MASTER_PORT"] = "12355"
   -     init_process_group(backend="nccl", rank=rank, world_size=world_size)
   +     init_process_group(backend="nccl")
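Putting this together, a minimal sketch of the new setup function could look as follows; the ``torch.cuda.set_device`` call is a common companion step on multi-GPU nodes and is an addition of this sketch, not part of the diff above:

.. code:: python

   import os
   import torch
   from torch.distributed import init_process_group

   def ddp_setup():
       # torchrun exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and
       # LOCAL_RANK, so init_process_group can read everything it needs
       # from the environment.
       init_process_group(backend="nccl")
       # Bind this process to its own GPU on the node (this sketch
       # assumes the usual one-process-per-GPU layout).
       torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))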
Use torchrun-provided environment variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: diff

   - self.gpu_id = gpu_id
   + self.gpu_id = int(os.environ["LOCAL_RANK"])
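Note that ``LOCAL_RANK`` is the process's index on its own node, while ``RANK`` is its global index across all nodes; on a single machine the two coincide.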
Saving and loading snapshots
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Regularly storing all the relevant information in snapshots allows our
training job to seamlessly resume after an interruption.

.. code:: diff

   + def _save_snapshot(self, epoch):
   +     snapshot = {}
   +     snapshot["MODEL_STATE"] = self.model.module.state_dict()
   +     snapshot["EPOCHS_RUN"] = epoch
   +     torch.save(snapshot, "snapshot.pt")
   +     print(f"Epoch {epoch} | Training snapshot saved at snapshot.pt")

   + def _load_snapshot(self, snapshot_path):
   +     snapshot = torch.load(snapshot_path)
   +     self.model.load_state_dict(snapshot["MODEL_STATE"])
   +     self.epochs_run = snapshot["EPOCHS_RUN"]
   +     print(f"Resuming training from snapshot at Epoch {self.epochs_run}")
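One variant worth considering (an addition of this sketch, not part of the diff above): passing ``map_location`` to ``torch.load`` maps the snapshot's tensors onto this process's own GPU, so that every rank doesn't deserialize onto the device the snapshot was saved from:

.. code:: python

   def _load_snapshot(self, snapshot_path):
       # Hypothetical variant of the method above, assuming the Trainer
       # stores its device index in self.gpu_id.
       loc = f"cuda:{self.gpu_id}"
       snapshot = torch.load(snapshot_path, map_location=loc)
       self.model.load_state_dict(snapshot["MODEL_STATE"])
       self.epochs_run = snapshot["EPOCHS_RUN"]
       print(f"Resuming training from snapshot at Epoch {self.epochs_run}")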
Loading a snapshot in the Trainer constructor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When restarting an interrupted training job, your script will first try
to load a snapshot to resume training from.

.. code:: diff

   class Trainer:
       def __init__(self, snapshot_path, ...):
           ...
   +       if os.path.exists(snapshot_path):
   +           self._load_snapshot(snapshot_path)
           ...
Resuming training
~~~~~~~~~~~~~~~~~

Training can resume from the last epoch run, instead of starting all
over from scratch.

.. code:: diff

   def train(self, max_epochs: int):
   -     for epoch in range(max_epochs):
   +     for epoch in range(self.epochs_run, max_epochs):
             self._run_epoch(epoch)
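For a fresh run with no snapshot on disk, this assumes ``self.epochs_run`` was initialized to ``0`` in the constructor, so the loop starts from the first epoch.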
Running the script
~~~~~~~~~~~~~~~~~~

Simply call your entrypoint function as you would for a non-multiprocessing script; ``torchrun`` automatically
spawns the processes.

.. code:: diff

   if __name__ == "__main__":
       import sys
       total_epochs = int(sys.argv[1])
       save_every = int(sys.argv[2])
   -   world_size = torch.cuda.device_count()
   -   mp.spawn(main, args=(world_size, total_epochs, save_every,), nprocs=world_size)
   +   main(save_every, total_epochs)

.. code:: diff

   - python multigpu.py 50 10
   + torchrun --standalone --nproc_per_node=4 multigpu_torchrun.py 50 10
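Here ``--standalone`` runs a self-contained single-node job without an external rendezvous endpoint, and ``--nproc_per_node=4`` launches four worker processes, one per GPU. The positional arguments ``50 10`` remain the script's own inputs: ``total_epochs`` and ``save_every``.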
Further Reading
---------------

- `Multi-Node training with DDP <../intermediate/ddp_series_multinode.html>`__ (next tutorial in this series)
- `Multi-GPU Training with DDP <ddp_series_multigpu.html>`__ (previous tutorial in this series)
- `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__
- `Torchrun launch options <https://github.com/pytorch/pytorch/blob/bbe803cb35948df77b46a2d38372910c96693dcd/torch/distributed/run.py#L401>`__
- `Migrating from torch.distributed.launch to torchrun <https://pytorch.org/docs/stable/elastic/train_script.html#elastic-train-script>`__

beginner_source/ddp_series_intro.rst

Lines changed: 55 additions & 0 deletions

**Introduction** \|\| `What is DDP <ddp_series_theory.html>`__ \|\| `Single-Node
Multi-GPU Training <ddp_series_multigpu.html>`__ \|\| `Fault
Tolerance <ddp_series_fault_tolerance.html>`__ \|\| `Multi-Node
training <../intermediate/ddp_series_multinode.html>`__ \|\| `minGPT Training <../intermediate/ddp_series_minGPT.html>`__

Distributed Data Parallel in PyTorch - Video Tutorials
======================================================

Authors: `Suraj Subramanian <https://github.com/suraj813>`__

Follow along with the video below or on `youtube <https://www.youtube.com/watch/-K3bZYHYHEA>`__.

.. raw:: html

   <div style="margin-top:10px; margin-bottom:10px;">
     <iframe width="560" height="315" src="https://www.youtube.com/embed/-K3bZYHYHEA" frameborder="0" allow="accelerometer; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
   </div>

This series of video tutorials walks you through distributed training in
PyTorch via DDP.

The series starts with a simple non-distributed training job and ends
with deploying a training job across several machines in a cluster.
Along the way, you will also learn about
`torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__ for
fault-tolerant distributed training.

The tutorial assumes a basic familiarity with model training in PyTorch.

Running the code
----------------

You will need multiple CUDA GPUs to run the tutorial code. Typically,
this can be done on a cloud instance with multiple GPUs (the tutorials
use an Amazon EC2 P3 instance with 4 GPUs).

The tutorial code is hosted in this `GitHub
repo <https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series>`__. Clone the repository and
follow along!

Tutorial sections
-----------------

0. Introduction (this page)
1. `What is DDP? <ddp_series_theory.html>`__ Gently introduces what DDP is doing
   under the hood
2. `Single-Node Multi-GPU Training <ddp_series_multigpu.html>`__ Training models
   using multiple GPUs on a single machine
3. `Fault-tolerant distributed training <ddp_series_fault_tolerance.html>`__
   Making your distributed training job robust with ``torchrun``
4. `Multi-Node training <../intermediate/ddp_series_multinode.html>`__ Training models using
   multiple GPUs on multiple machines
5. `Training a GPT model with DDP <../intermediate/ddp_series_minGPT.html>`__ A “real-world”
   example of training a `minGPT <https://github.com/karpathy/minGPT>`__
   model with DDP
0 commit comments

Comments
 (0)