
Commit 057e4d1

[doc] minor fixups to DDP tutorial
Summary: Add "set_device" call to keep things consistent between all DDP tutorials. This was inspired by the following change in the PyTorch repo: pytorch/examples#1285 (review)

Test Plan: Ran tutorial with the applied changes and we see:

"""
Running basic DDP example on rank 3.
Running basic DDP example on rank 1.
Running basic DDP example on rank 2.
Running basic DDP example on rank 0.
Finished running basic DDP example on rank 0.
Finished running basic DDP example on rank 1.
Finished running basic DDP example on rank 3.
Finished running basic DDP example on rank 2.
Running DDP checkpoint example on rank 2.
Running DDP checkpoint example on rank 1.
Running DDP checkpoint example on rank 0.
Running DDP checkpoint example on rank 3.
Finished DDP checkpoint example on rank 0.
Finished DDP checkpoint example on rank 3.
Finished DDP checkpoint example on rank 1.
Finished DDP checkpoint example on rank 2.
Running DDP with model parallel example on rank 0.
Running DDP with model parallel example on rank 1.
Finished running DDP with model parallel example on rank 0.
Finished running DDP with model parallel example on rank 1.
"""
1 parent 904ca90 commit 057e4d1
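
For reference, this is roughly how the process-group setup in the tutorial reads after this change (a minimal sketch assembled from the diff below; the setup(rank, world_size) helper name and the imports are assumed from the surrounding tutorial, not shown in this diff):

import os
import torch
import torch.distributed as dist

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # set the default CUDA device for this process so that tensors and
    # collectives land on the GPU matching this rank
    torch.cuda.set_device(rank)

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)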

1 file changed: 21 additions, 11 deletions


intermediate_source/ddp_tutorial.rst

Lines changed: 21 additions & 11 deletions
@@ -99,6 +99,9 @@ be found in
     os.environ['MASTER_ADDR'] = 'localhost'
     os.environ['MASTER_PORT'] = '12355'

+    # set the device id for this process
+    torch.cuda.set_device(rank)
+
     # initialize the process group
     dist.init_process_group("gloo", rank=rank, world_size=world_size)

@@ -141,6 +144,7 @@ different DDP processes starting from different initial model parameter values.
     optimizer.step()

     cleanup()
+    print(f"Finished running basic DDP example on rank {rank}.")


 def run_demo(demo_fn, world_size):

@@ -182,7 +186,7 @@ for more details. When using DDP, one optimization is to save the model in
 only one process and then load it to all processes, reducing write overhead.
 This is correct because all processes start from the same parameters and
 gradients are synchronized in backward passes, and hence optimizers should keep
-setting parameters to the same values. If you use this optimization, make sure no process starts
+setting parameters to the same values. If you use this optimization, make sure no process starts
 loading before the saving is finished. Additionally, when
 loading the module, you need to provide an appropriate ``map_location``
 argument to prevent a process from stepping into others' devices. If ``map_location``

@@ -218,7 +222,7 @@ and elasticity support, please refer to `TorchElastic <https://pytorch.org/elast

     loss_fn = nn.MSELoss()
     optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
-
+
     optimizer.zero_grad()
     outputs = ddp_model(torch.randn(20, 10))
     labels = torch.randn(20, 5).to(rank)

@@ -234,6 +238,7 @@ and elasticity support, please refer to `TorchElastic <https://pytorch.org/elast
         os.remove(CHECKPOINT_PATH)

     cleanup()
+    print(f"Finished running DDP checkpoint example on rank {rank}.")

 Combining DDP with Model Parallelism
 ------------------------------------

@@ -285,6 +290,7 @@ either the application or the model ``forward()`` method.
     optimizer.step()

     cleanup()
+    print(f"Finished running DDP with model parallel example on rank {rank}.")


 if __name__ == "__main__":

@@ -323,10 +329,13 @@ Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.


 def demo_basic():
-    dist.init_process_group("nccl")
     rank = dist.get_rank()
+    torch.cuda.set_device(rank)
+
+    dist.init_process_group("nccl")
+
     print(f"Start running basic DDP example on rank {rank}.")
-
+
     # create model and move it to GPU with id rank
     device_id = rank % torch.cuda.device_count()
     model = ToyModel().to(device_id)

@@ -340,23 +349,24 @@ Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.
     labels = torch.randn(20, 5).to(device_id)
     loss_fn(outputs, labels).backward()
     optimizer.step()
-    dist.destroy_process_group()
-
+    cleanup()
+    print(f"Finished running basic DDP example on rank {rank}.")
+
 if __name__ == "__main__":
     demo_basic()

-One can then run a `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command
+One can then run a `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command
 on all nodes to initialize the DDP job created above:

 .. code:: bash

     torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py

-We are running the DDP script on two hosts, and each host we run with 8 processes, aka, we
+We are running the DDP script on two hosts, and each host we run with 8 processes, aka, we
 are running it on 16 GPUs. Note that ``$MASTER_ADDR`` must be the same across all nodes.

-Here torchrun will launch 8 process and invoke ``elastic_ddp.py``
-on each process on the node it is launched on, but user also needs to apply cluster
+Here torchrun will launch 8 process and invoke ``elastic_ddp.py``
+on each process on the node it is launched on, but user also needs to apply cluster
 management tools like slurm to actually run this command on 2 nodes.

 For example, on a SLURM enabled cluster, we can write a script to run the command above

@@ -371,5 +381,5 @@ Then we can just run this script using the SLURM command: ``srun --nodes=2 ./tor
 Of course, this is just an example; you can choose your own cluster scheduling tools
 to initiate the torchrun job.

-For more information about Elastic run, one can check this
+For more information about Elastic run, one can check this
 `quick start document <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ to learn more.
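
For context, the cleanup() and run_demo() helpers that the demos above call look roughly like this in the tutorial (a sketch; the exact bodies of these helpers are assumed, since they are not shown in this diff):

import torch.distributed as dist
import torch.multiprocessing as mp

def cleanup():
    # the elastic example now calls this helper instead of invoking
    # dist.destroy_process_group() directly
    dist.destroy_process_group()

def run_demo(demo_fn, world_size):
    # spawn one process per rank; each process runs demo_fn(rank, world_size)
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)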
