📚 The doc issue
In the docs tutorial on how to set up multi-GPU training, the following is suggested as the proper way to set up each process: first initialize the (e.g., NCCL) process group, then call torch.cuda.set_device(rank):
import os

import torch
from torch.distributed import init_process_group


def ddp_setup(rank: int, world_size: int):
    """
    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
    # The tutorial sets the CUDA device *after* creating the process group.
    torch.cuda.set_device(rank)
However, the issues below suggest that the proper way is to call set_device before initializing the process group (a sketch of that ordering follows the list):
- Call to CUDA function failed. with DDP using 4 GPUs pytorch#54550 (comment)
- distributed.all_gather function stuck when using NCCL backend pytorch#18689 (comment)
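For concreteness, here is a minimal sketch of the ordering described in those issues, with set_device called before init_process_group. The function name ddp_setup_alternative is made up for this example, and the MASTER_ADDR/MASTER_PORT values are just the placeholders from the tutorial snippet above:

```python
import os

import torch
from torch.distributed import init_process_group


def ddp_setup_alternative(rank: int, world_size: int):
    """Same setup as above, but the device is bound before the process group is created.

    Args:
        rank: Unique identifier of each process
        world_size: Total number of processes
    """
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    # Bind this process to its GPU first, so any CUDA context created during
    # NCCL initialization lands on the intended device (the concern raised in
    # the linked issues).
    torch.cuda.set_device(rank)
    init_process_group(backend="nccl", rank=rank, world_size=world_size)
```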
Which is the correct order, and are there pauses or slowdowns if the order is changed?
Suggest a potential alternative/fix
No response
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225