Update ddp_tutorial.rst #2516
Conversation
1. Add `dist.destroy_process_group()` to the example code block.
2. Fix the link syntax error for `torchrun`.
@H-Huang can you take a look?
@@ -340,11 +340,12 @@ Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.
         labels = torch.randn(20, 5).to(device_id)
         loss_fn(outputs, labels).backward()
         optimizer.step()
+    dist.destroy_process_group()
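For context, the patched example ends up looking roughly like this (a sketch reconstructed from the tutorial page; `ToyModel` is the small two-layer module defined earlier in the tutorial, and the script is meant to be launched with `torchrun`):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

class ToyModel(nn.Module):
    # Stand-in for the ToyModel defined earlier in the tutorial.
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic():
    # torchrun supplies RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT as env vars.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device_id = rank % torch.cuda.device_count()
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    dist.destroy_process_group()  # the line this PR adds

if __name__ == "__main__":
    demo_basic()
```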
This is actually not necessary. After the script finishes, the process group will be garbage collected by Python. `destroy_process_group` clears up the process group state so that any references to it are freed and it can be garbage collected. Since this is the last line of the script, all references will be cleaned up anyway.
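As a quick illustration of that teardown behavior (a sketch, assuming `torchrun` provides the env:// rendezvous variables):

```python
import torch.distributed as dist

dist.init_process_group("gloo")   # env:// rendezvous set up by torchrun
assert dist.is_initialized()      # the default process group is live
dist.destroy_process_group()      # drops the module-level references to it
assert not dist.is_initialized()  # state cleared; the object can now be GCed
```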
Since the other examples on the same tutorial page call `destroy_process_group`, it felt natural to add this call to this example to keep them consistent.

On the other hand, after I send a SIGINT signal to terminate the DDP training, the GPUs in use don't release their memory immediately. So I think it's also necessary to call `destroy_process_group` to let the NVIDIA driver reclaim the GPU memory.
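One way to get an explicit teardown on interruption would be a signal handler along these lines (a hypothetical sketch, not part of the PR or the tutorial):

```python
import signal
import sys
import torch.distributed as dist

def _handle_sigint(signum, frame):
    # Tear down the process group before exiting so distributed resources
    # are released promptly instead of waiting for interpreter shutdown.
    if dist.is_initialized():
        dist.destroy_process_group()
    sys.exit(0)

# Install the handler before training starts.
signal.signal(signal.SIGINT, _handle_sigint)
```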
If you look at the implementation of `destroy_process_group`, you can see it's pretty simple: we are just clearing some maps and deleting dictionaries that reference the process group, which allows it to be GCed.

I checked, and you are right that the rest of the examples use a `cleanup()` function to do this, so I'm okay with adding it for consistency.

> After I send a SIGINT signal to terminate the DDP training, the GPUs in use don't release their memory immediately. So I think it's also necessary to call `destroy_process_group` to let the NVIDIA driver reclaim the GPU memory.

For this, can you clarify? If there is memory leakage after the script has finished executing, then this is definitely an issue; we should file a bug and fix it.
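For reference, the setup/teardown helpers used by the other examples on the page look roughly like this (a sketch following the tutorial's conventions; the exact code on the page may differ):

```python
import os
import torch.distributed as dist

def setup(rank, world_size):
    # Minimal single-machine rendezvous configuration.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    # The helper the other examples call at the end of training.
    dist.destroy_process_group()
```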
I have run into GPU memory leakage recently. I should do some experiments to reproduce it and verify my claim before filing an issue.
Description

1. Add `dist.destroy_process_group()` to the example code block to make sure the resources are released.
2. Fix the link syntax error for `torchrun`.

Checklist

cc @osalpekar @H-Huang @kwen2501