
Commit 07a7ae2

Call out using set_device when initing pg
1 parent 8c0785e

2 files changed (+4, -1 lines)

beginner_source/ddp_series_fault_tolerance.rst

Lines changed: 1 addition & 1 deletion
@@ -117,7 +117,7 @@ Process group initialization
      - os.environ["MASTER_PORT"] = "12355"
      - init_process_group(backend="nccl", rank=rank, world_size=world_size)
      + init_process_group(backend="nccl")
-
+     torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
 
 Use Torchrun-provided env variables
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
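
Taken together, the fault-tolerant setup now reads roughly as below — a minimal sketch assuming the torchrun launcher, which exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK to every worker. The ddp_setup name follows the tutorial series; the main-guard scaffolding is illustrative:

    import os

    import torch
    from torch.distributed import destroy_process_group, init_process_group


    def ddp_setup():
        # torchrun exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and
        # LOCAL_RANK, so init_process_group needs no explicit arguments.
        init_process_group(backend="nccl")
        # Bind this process to its own GPU before any CUDA work; otherwise
        # collectives can implicitly target GPU:0, causing hangs or extra
        # memory use on that device.
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))


    if __name__ == "__main__":
        ddp_setup()
        # ... build the model, wrap it in DistributedDataParallel, train ...
        destroy_process_group()

A launch command would look something like: torchrun --standalone --nproc_per_node=4 train.py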

beginner_source/ddp_series_multigpu.rst

Lines changed: 3 additions & 0 deletions
@@ -83,6 +83,8 @@ Constructing the process group
   initializes the distributed process group.
 - Read more about `choosing a DDP
   backend <https://pytorch.org/docs/stable/distributed.html#which-backend-to-use>`__
+- `set_device <https://pytorch.org/docs/stable/generated/torch.cuda.set_device.html?highlight=set_device#torch.cuda.set_device>`__
+  sets the default GPU for each process. This is important to prevent hangs or excessive memory utilization on `GPU:0`
 
 .. code:: diff
 
@@ -95,6 +97,7 @@ Constructing the process group
      + os.environ["MASTER_ADDR"] = "localhost"
      + os.environ["MASTER_PORT"] = "12355"
      + init_process_group(backend="nccl", rank=rank, world_size=world_size)
+     + torch.cuda.set_device(rank)
 
 
 Constructing the DDP model
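
And the multi-GPU, single-node version after this change, sketched under the assumption that workers are spawned with torch.multiprocessing.spawn as elsewhere in this series — the ddp_setup body (including port 12355) mirrors the diff above, while the main wrapper is illustrative:

    import os

    import torch
    import torch.multiprocessing as mp
    from torch.distributed import destroy_process_group, init_process_group


    def ddp_setup(rank: int, world_size: int):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        init_process_group(backend="nccl", rank=rank, world_size=world_size)
        # Pin each spawned process to its own GPU so nothing silently
        # allocates on (or hangs waiting for) GPU:0.
        torch.cuda.set_device(rank)


    def main(rank: int, world_size: int):
        ddp_setup(rank, world_size)
        # ... build the model, wrap it in DistributedDataParallel, train ...
        destroy_process_group()


    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        # mp.spawn passes the process index (used here as the rank) as the
        # first argument to main.
        mp.spawn(main, args=(world_size,), nprocs=world_size)

Calling set_device right after init_process_group means that later device placement and NCCL collectives operate on the intended GPU rather than defaulting to GPU:0 — the rationale this commit adds to the tutorial text.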
