
Update ddp_tutorial.rst #2516


Merged: 1 commit into pytorch:main on Aug 3, 2023
Conversation

@HUGHNew (Contributor) commented on Aug 2, 2023

Description

  1. Add `dist.destroy_process_group()` to the example code block to ensure the process group's resources are released (see the diff further down).
  2. Fix the broken link markup for torchrun (an illustrative example of the link syntax follows).
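
The exact markup fix is not shown in this excerpt; for illustration only, a well-formed external link to the torchrun docs in reStructuredText looks like the line below, and a missing space before the URL or a missing trailing underscore is the kind of syntax slip that breaks such a link.

    `torchrun <https://pytorch.org/docs/stable/elastic/run.html>`__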

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request

cc @osalpekar @H-Huang @kwen2501

Commit ea8f16c (HUGHNew):
1. add `dist.destroy_process_group()` in example code block
2. modify the link syntax error about torchrun
pytorch-bot (bot) commented on Aug 2, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2516

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ea8f16c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

netlify (bot) commented on Aug 2, 2023

Deploy Preview for pytorch-tutorials-preview failed.

🔨 Latest commit: ea8f16c
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/64c9fe67a6c69d00088b663b

@svekars (Contributor) commented on Aug 2, 2023

@H-Huang can you take a look?

@@ -340,11 +340,12 @@ Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.
     labels = torch.randn(20, 5).to(device_id)
     loss_fn(outputs, labels).backward()
     optimizer.step()
+    dist.destroy_process_group()
H-Huang (Member) commented on the diff:
This is actually not necessary. After the script finishes, the process group will be garbage collected by Python.

destroy_process_group clears up the process group state so that any references to it are freed and it can be garbage collected. Since this is the last line of the script, all references will be cleaned up anyway.

HUGHNew (Contributor, Author) replied:

Since the other examples on the same tutorial page call destroy_process_group, it felt natural to add this call to this example as well, to keep them consistent.
On the other hand, after I send a SIGINT signal to terminate the DDP training, the GPUs don't release their memory immediately. Therefore I think it's also necessary to call destroy_process_group so that the NVIDIA driver can reclaim the GPU memory.
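
A rough sketch of the kind of teardown being discussed (an illustration based on the tutorial's elastic_ddp.py example, not the PR's actual change): wrapping the training body in try/finally makes destroy_process_group run both on normal completion and when a SIGINT surfaces as a KeyboardInterrupt. Whether that is enough to release GPU memory promptly is exactly the question raised below.

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim
    from torch.nn.parallel import DistributedDataParallel as DDP

    class ToyModel(nn.Module):
        # Same toy model as in ddp_tutorial.rst.
        def __init__(self):
            super().__init__()
            self.net1 = nn.Linear(10, 10)
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(10, 5)

        def forward(self, x):
            return self.net2(self.relu(self.net1(x)))

    def demo_basic():
        # torchrun supplies MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
        # through environment variables, so no arguments are needed here.
        dist.init_process_group("nccl")
        rank = dist.get_rank()
        device_id = rank % torch.cuda.device_count()
        try:
            model = ToyModel().to(device_id)
            ddp_model = DDP(model, device_ids=[device_id])
            loss_fn = nn.MSELoss()
            optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
            optimizer.zero_grad()
            outputs = ddp_model(torch.randn(20, 10))
            labels = torch.randn(20, 5).to(device_id)
            loss_fn(outputs, labels).backward()
            optimizer.step()
        finally:
            # Runs on normal exit and on KeyboardInterrupt (SIGINT),
            # so the process group is torn down either way.
            dist.destroy_process_group()

    if __name__ == "__main__":
        demo_basic()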

H-Huang (Member) replied:

https://github.com/pytorch/pytorch/blob/0dc7f6df9d00428cd175018e2bf9b45a8ec39b9c/torch/distributed/distributed_c10d.py#L1359-L1425

If you look at the implementation of destroy_process_group, you can see it's pretty simple: we are just clearing some maps and deleting dictionary entries that reference the process group, which allows it to be garbage collected.

I checked, and you are right that the rest of the examples use a cleanup() function to do this. I'm okay with adding this in to be consistent, then.

> On the other hand, after I send a SIGINT signal to terminate the DDP training, the GPUs don't release their memory immediately. Therefore I think it's also necessary to call destroy_process_group so that the NVIDIA driver can reclaim the GPU memory.

For this, can you clarify? If there is memory leakage after the script has finished executing, then this is definitely an issue and we should file a bug and fix it.
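
For reference, the cleanup() pattern used by the other examples in ddp_tutorial.rst looks roughly like this (a sketch; the exact helper bodies in the tutorial may differ slightly):

    import os
    import torch.distributed as dist

    def setup(rank, world_size):
        # Rendezvous information for the default process group.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

    def cleanup():
        # Drops the references to the default process group so it can be
        # garbage collected and its resources released.
        dist.destroy_process_group()

Each demo function in the tutorial calls setup(rank, world_size) at the start and cleanup() at the end; the elastic example launched with torchrun skips the explicit setup() because init_process_group reads its configuration from the environment variables that torchrun sets, which is why only the final destroy_process_group call was missing.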

HUGHNew (Contributor, Author) replied:

I have run into this GPU memory leakage recently. I should do some experiments to reproduce it and verify my claim before filing an issue.

@H-Huang (Member) left a comment:

Thanks, new doc snippet looks like this:

[screenshot of the rendered tutorial snippet]

LGTM

@H-Huang merged commit b4e6207 into pytorch:main on Aug 3, 2023