Update ddp_tutorial.rst #2516

Merged 1 commit on Aug 3, 2023
3 changes: 2 additions & 1 deletion intermediate_source/ddp_tutorial.rst
@@ -340,11 +340,12 @@ Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()
    dist.destroy_process_group()
Member

This is actually not necessary. After the script finishes, the process group will be garbage collected by Python.

destroy_process_group clears up the process group state so that any references to it are freed and it can be garbage collected. Since this is the last line of the script, all references will be cleaned up anyway.
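
For illustration, here is a minimal sketch of the lifecycle being discussed. The backend, model, and tensor shapes are placeholders rather than the tutorial's exact ToyModel code:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def demo_basic():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so
    # init_process_group can read everything from the environment.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device_id = rank % torch.cuda.device_count()

    model = nn.Linear(10, 5).to(device_id)  # stand-in for the tutorial's ToyModel
    ddp_model = DDP(model, device_ids=[device_id])

    outputs = ddp_model(torch.randn(20, 10).to(device_id))
    nn.MSELoss()(outputs, torch.randn(20, 5).to(device_id)).backward()

    # Tears down the default process group; when this is the last line of the
    # script, interpreter shutdown would reclaim most of this state anyway.
    dist.destroy_process_group()
```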

Contributor Author

Since the other examples on the same tutorial page call destroy_process_group, it felt natural to add the call in this example as well to keep them consistent.
On the other hand, after I send a SIGINT signal to terminate the DDP training, the GPUs in use don't release their memory immediately. Therefore I think it's also necessary to call destroy_process_group so that the NVIDIA driver can reclaim the GPU memory.
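
For illustration only (this is not part of the PR): one hypothetical way to make sure destroy_process_group runs when training is interrupted with Ctrl+C is to register a signal handler. The handler below is an assumption, not code from the tutorial, and interrupting mid-collective may still need extra care:

```python
import signal
import sys

import torch.distributed as dist

def _handle_sigint(signum, frame):
    # Hypothetical helper: tear down the process group before exiting so
    # process-group resources are released promptly on interruption.
    if dist.is_initialized():
        dist.destroy_process_group()
    sys.exit(0)

signal.signal(signal.SIGINT, _handle_sigint)
```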

Member

https://github.com/pytorch/pytorch/blob/0dc7f6df9d00428cd175018e2bf9b45a8ec39b9c/torch/distributed/distributed_c10d.py#L1359-L1425

If you look at the implementation of destroy_process_group, you can see it's pretty simple: we are just clearing some maps and deleting dictionaries that reference the process group, which allows it to be GCed.

I checked, and you are right that the rest of the examples use a cleanup() function to do this. I'm okay with adding it here to be consistent, then.
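
As a rough sketch of that pattern (assuming cleanup() is just a thin wrapper around destroy_process_group, which is how the other examples on the tutorial page read):

```python
import torch.distributed as dist

def cleanup():
    # Helper the other examples end their demo functions with: it simply
    # tears down the default process group.
    dist.destroy_process_group()
```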

> On the other hand, after I send a SIGINT signal to terminate the DDP training, the GPUs in use don't release their memory immediately. Therefore I think it's also necessary to call destroy_process_group so that the NVIDIA driver can reclaim the GPU memory.

For this, can you clarify? If there is a memory leak after the script has finished executing, then this is definitely an issue and we should file a bug and fix it.

Contributor Author

I have run into this GPU memory leak recently. I think I should do some experiments to reproduce it and verify my statement before filing an issue.


if __name__ == "__main__":
    demo_basic()

One can then run a `torch elastic/torchrun<https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command
One can then run a `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command
on all nodes to initialize the DDP job created above:

.. code:: bash