Update ddp_tutorial.rst #2516
Conversation
1. Add `dist.destroy_process_group()` to the example code block.
2. Fix the link syntax error for `torchrun`.
@H-Huang can you take a look?
@@ -340,11 +340,12 @@ Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.
         labels = torch.randn(20, 5).to(device_id)
         loss_fn(outputs, labels).backward()
         optimizer.step()
+    dist.destroy_process_group()
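For context, the patched example ends up looking roughly like this (a sketch reconstructed from the tutorial page; `ToyModel` is the small two-layer module defined earlier in the tutorial, and the script is meant to be launched with `torchrun`):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    # Stand-in for the ToyModel defined earlier in the tutorial.
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic():
    # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT as env vars.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    dist.destroy_process_group()  # the line this PR adds

if __name__ == "__main__":
    demo_basic()
```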
This is actually not necessary. After the script finishes, the process group will be garbage collected by Python. `destroy_process_group` clears up the process group state so that any references to it are freed and it can be garbage collected. Since this is the last line of the script, all references will be cleaned up anyway.
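As a quick illustration of that teardown behavior (a sketch, assuming `torchrun` provides the env:// rendezvous variables):

```python
import torch.distributed as dist

dist.init_process_group("gloo")   # env:// rendezvous set up by torchrun
assert dist.is_initialized()      # the default process group is live
dist.destroy_process_group()      # drops the module-level references to it
assert not dist.is_initialized()  # state cleared; the object can now be GCed
```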
Since the other examples on the same tutorial page call `destroy_process_group`, it felt natural to add this call to this example to keep them consistent.

On the other hand, after I send a SIGINT signal to terminate the DDP training, the GPUs in use don't release their memory immediately. So I think it's also necessary to call `destroy_process_group` to let the NVIDIA driver reclaim the GPU memory.
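One way to get an explicit teardown on interruption would be a signal handler along these lines (a hypothetical sketch, not part of the PR or the tutorial):

```python
import signal
import sys
import torch.distributed as dist

def _handle_sigint(signum, frame):
    # Tear down the process group before exiting so distributed resources
    # are released promptly instead of waiting for interpreter shutdown.
    if dist.is_initialized():
        dist.destroy_process_group()
    sys.exit(0)

# Install the handler before training starts.
signal.signal(signal.SIGINT, _handle_sigint)
```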
If you look at the implementation of `destroy_process_group`, you can see it's pretty simple: we are just clearing some maps and deleting dictionaries that reference the process group, which allows it to be GCed.

I checked, and you are right that the rest of the examples use a `cleanup()` function to do this, so I'm okay with adding it for consistency.

> After I send a SIGINT signal to terminate the DDP training, the GPUs in use don't release their memory immediately. So I think it's also necessary to call `destroy_process_group` to let the NVIDIA driver reclaim the GPU memory.

For this, can you clarify? If there is memory leakage after the script has finished executing, then this is definitely an issue; we should file a bug and fix it.
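For reference, the setup/teardown helpers used by the other examples on the page look roughly like this (a sketch following the tutorial's conventions; the exact code on the page may differ):

```python
import os
import torch.distributed as dist

def setup(rank, world_size):
    # Minimal single-machine rendezvous configuration.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    # The helper the other examples call at the end of training.
    dist.destroy_process_group()
```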
I have run into GPU memory leakage recently. I should do some experiments to reproduce it and verify my claim before filing an issue.
Description

1. Add `dist.destroy_process_group()` to the example code block to make sure the resources are released.
2. Fix the link syntax error for `torchrun`.

Checklist

cc @osalpekar @H-Huang @kwen2501