Update DDP Tutorial to remove Single-Process Multi-Device Use Case #973


Merged: 1 commit merged into pytorch:master on Apr 29, 2020

Conversation


@mrshenli (Contributor) commented on Apr 28, 2020

As we do not recommend using Single-Process Multi-Device mode with DDP due to
its reduced throughput, and as this mode is broken in v1.5, this PR updates the
DDP tutorial to use Multi-Process Single-Device mode (a minimal sketch of that
pattern follows the list below).

This PR also includes the following minor fixes:

  1. Highlight that this tutorial requires at least 8 GPUs to run.
  2. Add a pointer to RPC.
  3. Add a pointer to torchelastic.
  4. As DDP now broadcasts initial parameter values from rank 0 to all other
     ranks, it no longer requires setting an RNG seed.
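
For readers skimming this thread, here is a minimal sketch of the Multi-Process
Single-Device pattern described above: one process per GPU, each process owning
exactly one device. The backend, master address/port, model, and helper names
are illustrative placeholders, not taken from the tutorial diff.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def demo_basic(rank, world_size):
    # One process per GPU: this process drives device `rank` only.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(10, 10).to(rank)
    # DDP broadcasts rank 0's initial parameter values to every other rank
    # at construction time, so no shared RNG seed is needed to keep replicas identical.
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)

    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    loss_fn(outputs, labels).backward()  # gradients are all-reduced across ranks here
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per available GPU
    mp.spawn(demo_basic, args=(world_size,), nprocs=world_size, join=True)
```

Because DDP broadcasts rank 0's initial parameter values in its constructor
(point 4 above), all replicas start from the same weights without a shared RNG seed.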


netlify bot commented Apr 28, 2020

Deploy preview for pytorch-tutorials-preview ready!

Built with commit 5ca59d7

https://deploy-preview-973--pytorch-tutorials-preview.netlify.app

@mrshenli (Contributor, Author)

The corresponding example PR is pytorch/examples#763

@mrshenli changed the title from "[WIP] Update DDP Tutorial to remove Single-Process Multi-Device Use Case" to "Update DDP Tutorial to remove Single-Process Multi-Device Use Case" on Apr 28, 2020
@mrshenli (Contributor, Author)

Hey @kiukchung, this currently links to https://github.com/pytorch/elastic. Let me know if you prefer https://pytorch.org/elastic/0.2.0rc1/index.html or a different link. Thanks!

@mrshenli (Contributor, Author)

create a DDP instance in each process. DDP uses collective communications in the
`torch.distributed <https://pytorch.org/tutorials/intermediate/dist_tuto.html>`__
package to synchronize gradients and buffers. More specifically, DDP inserts
an autograd hook for each model parameter which will fire when the
Member

Is the autograd hook indeed per-parameter, as given by model.parameters()? If a user takes a model and then manually adds an nn.Parameter() to it, will DDP sync gradients for this parameter as well? If so, it might be worth mentioning this (though maybe the docs are a better place).

@mrshenli (Contributor, Author)

If the user manually adds an nn.Parameter(), Module.__setattr__ should detect that and call register_parameter for it, so this manually added parameter would also be included in model.parameters(). Let me modify this to "each model parameter in model.parameters()".
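
As a quick standalone illustration of that registration behavior (not part of the tutorial diff; the module and attribute name are made up):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Assigning an nn.Parameter as an attribute goes through nn.Module.__setattr__,
# which registers it the same way register_parameter() would.
model.extra_weight = nn.Parameter(torch.zeros(3))

# The manually added parameter now appears in model.parameters(),
# so DDP would install a gradient hook for it as well.
print([name for name, _ in model.named_parameters()])
# ['weight', 'bias', 'extra_weight']
```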

are inevitable due to, e.g., network delays, resource contentions,
unpredictable workload spikes. To avoid timeouts in these situations, make
sure that you pass a sufficiently large ``timeout`` value when calling

In DDP, the constructor and the backward pass are distributed synchronization
Member

Noticed we removed the forward method; was it inaccurate that it was a point of synchronization, or did something change?


kiuk

@mrshenli that would be awesome, the correct link is https://pytorch.org/elastic (it redirects to the latest release)

@mrshenli (Contributor, Author)

Noticed we removed the forward method; was it inaccurate that it was a point of synchronization, or did something change?

There is a sync in forward to broadcast buffers from rank 0 to all other ranks. I removed this because 1) there is no intra-node sync in MPSD mode, and 2) buffers are not used in this tutorial. But giving it a second thought, I think it might be better to keep this to stay consistent with the previous version, which is correct for general cases. Thanks for spotting this!
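
For context on the buffer sync being discussed, a minimal sketch: it assumes the process group has already been initialized (as in the tutorial's setup()) and that rank is this process's GPU index; demo_buffer_sync and the BatchNorm module are illustrative, not from the tutorial.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def demo_buffer_sync(rank):
    # Assumes dist.init_process_group() has already run in this process.
    model = nn.BatchNorm1d(10).to(rank)        # running_mean / running_var are buffers
    ddp_model = DDP(model, device_ids=[rank])  # broadcast_buffers defaults to True

    # With broadcast_buffers=True, each forward() call first broadcasts rank 0's
    # buffers to the other ranks, then runs the local forward computation.
    return ddp_model(torch.randn(20, 10).to(rank))
```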

@mrshenli (Contributor, Author)

links are updated, thx!

@rohan-varma (Member) left a comment


Looks good!

@jlin27 merged commit f557ee0 into pytorch:master on Apr 29, 2020
rodrigo-techera pushed a commit to Experience-Monks/tutorials that referenced this pull request on Nov 29, 2021:
Update DDP Tutorial to remove Single-Process Multi-Device Use Case