[DeviceMesh] Add device mesh recipe #2718
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2718
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit b4cb387 with merge base eec8d56.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks, Iris! Some editorial suggestions. Also, not sure why you add `..` at the beginning of each line. Also, do you think this could be an interactive file instead?
- `Distributed Communication Package - torch.distributed <https://pytorch.org/docs/stable/distributed.html>`__

.. Setting up nccl communicators for distributed communication during distributed training could be challenging. For workloads where users need to compose different parallelisms,
I'm not sure why each line has a `..` prefix... I think this will render as a directive. If this was meant to be a comment, the `..` needs to be on a separate line.
How to use DeviceMesh with HSDP
-------------------------------

Hybrid Sharding (HSDP)
Would like more info about HSDP: what it is, the benefits/use cases of using it with DeviceMesh, etc.
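For context on the request above: HSDP (Hybrid Sharded Data Parallel) shards a model FSDP-style within each node and replicates it DDP-style across nodes, which keeps the expensive all-gather/reduce-scatter traffic inside a node. A minimal sketch of how it pairs with a 2D DeviceMesh (an illustration only, not the recipe's actual hsdp.py; it assumes 2 nodes with 4 GPUs each and a torchrun launch):

.. code-block:: python

    # Sketch only: HSDP via FSDP's HYBRID_SHARD strategy on a 2D mesh.
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

    # Outer dim replicates across nodes; inner dim shards within a node.
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

    model = FSDP(
        nn.Linear(10, 10),
        device_mesh=mesh_2d,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    )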
Force-pushed from 33d06c3 to fd43ef4.
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Force-pushed from fd43ef4 to 572fdd9.
Force-pushed from af662e5 to 1102397.
LGTM, have a few suggestions inline. Thanks for working on this!
.. code-block:: python

    torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 2d_setup.py

Note
Would this "Note" require some special formatting? I.e., something like `.. note::`, and then put the paragraph below after it?
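The thread shows only the launch command for 2d_setup.py, not the file itself. A hypothetical minimal version consistent with the 8-process launch (assuming a 2x4 mesh, as in the recipe's title) might look like:

.. code-block:: python

    # Hypothetical sketch of 2d_setup.py; the diff above shows only the launch command.
    from torch.distributed.device_mesh import init_device_mesh

    # 8 ranks arranged as a 2x4 mesh; each dimension gets its own process group.
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

    # Slicing by name returns a 1D submesh whose process group spans that dimension.
    dp_mesh = mesh_2d["dp"]
    tp_mesh = mesh_2d["tp"]
    print(f"global rank {mesh_2d.get_rank()}: dp size {dp_mesh.size()}, tp size {tp_mesh.size()}")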
Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.

.. code-block:: python

    torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 hsdp.py
ditto: not sure if we want to simplify the run command here
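If the goal is a simpler command: on a single node the rendezvous flags are not required, so a shorter equivalent (a suggestion, not what the diff currently uses) would be:

.. code-block:: python

    torchrun --standalone --nproc_per_node=8 hsdp.py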
recipes_source/recipes_index.rst (Outdated)
:card_description: Learn how to use DeviceMesh
:image: ../_static/img/thumbnails/cropped/profiler.png
:link: ../recipes/distributed_device_mesh.html
:tags: Distributed-Training
Shall we put this card first in the distributed section? IIRC all the other recipes are not quite useful anymore (e.g., RPC).
> Shall we put this card first in the distributed section? IIRC all the other recipes are not quite useful anymore (e.g., RPC).

We can also add the tutorial on the distributed page as @svekars suggested: https://pytorch.org/tutorials/distributed/home.html
Force-pushed from 5d434d9 to 42592fb.
distributed/home.rst (Outdated)
:link: https://pytorch.org/tutorials/recipes/distributed_device_mesh.html?highlight=devicemesh
:link-type: url

In this tutorial you will learn to implement about `DeviceMesh`
"implement about"? seems we should say: you will learn about how to use
Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
This PR LGTM! We'll merge the week of Jan 2. Thank you!
* add DeviceMesh recipe

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
Description
Add DeviceMesh tutorial
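At its core, the recipe shows how `init_device_mesh` replaces the manual subgroup bookkeeping mentioned in the diff above. Roughly the contrast it draws (a sketch, assuming an 8-GPU torchrun launch):

.. code-block:: python

    # Without DeviceMesh: init the default group, then hand-build subgroups
    # per parallelism dimension with dist.new_group(ranks) and track them per rank.
    import torch.distributed as dist
    dist.init_process_group("nccl")

    # With DeviceMesh: one call derives the same per-dimension process groups
    # from a mesh shape.
    from torch.distributed.device_mesh import init_device_mesh
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))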