[DeviceMesh] Add device mesh recipe #2718

Merged
13 commits merged on Jan 24, 2024

@wz337 (Contributor) commented Dec 19, 2023

Description

Add DeviceMesh tutorial

  • Update recipe_index.rst to include DeviceMesh tutorial
  • Add image
  • Add distributed_device_mesh.rst

pytorch-bot bot commented Dec 19, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2718

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit b4cb387 with merge base eec8d56 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@svekars (Contributor) left a comment

Thanks, Iris! Some editorial suggestions. Also, I'm not sure why you added ".." at the beginning of each line. Do you think this can be an interactive file instead?


- `Distributed Communication Package - torch.distributed <https://pytorch.org/docs/stable/distributed.html>`__

.. Setting up nccl communicators for distributed communication during distributed training could be challenging. For workloads where users need to compose different parallelisms,

I'm not sure why each line has a ".." prefix. I think this will render as a directive. If this was meant to be a comment, the ".." needs to be on a separate line.

How to use DeviceMesh with HSDP
-------------------------------

Hybrid Sharding(HSDP)

Would like more info about HSDP: what it is, and what the benefits and use cases are of using it with DeviceMesh.
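
For context on this request, here is a minimal sketch of the rank layout that hybrid sharding relies on: parameters are sharded within a group of ranks and replicated across groups. It assumes 8 ranks arranged row-major on a 2 x 4 mesh, with the first dimension used for replication and the second for sharding. The helper name `mesh_groups` is hypothetical and is not part of PyTorch. In a real job the layout would presumably come from something like `init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))`, but the sketch below deliberately avoids any distributed setup.

```python
def mesh_groups(world_size, mesh_shape):
    """Return (replicate_groups, shard_groups) for a row-major 2D rank layout.

    Hypothetical helper for illustration only; it mirrors the layout by hand
    and does not call into torch.distributed.
    """
    rows, cols = mesh_shape
    if rows * cols != world_size:
        raise ValueError("mesh_shape must cover exactly world_size ranks")

    # Row-major layout: the rank at mesh position (r, c) is r * cols + c.
    mesh = [[r * cols + c for c in range(cols)] for r in range(rows)]

    # Each row is one shard group: parameters are sharded across its ranks.
    shard_groups = [list(row) for row in mesh]

    # Each column is one replicate group: its ranks hold the same shard.
    replicate_groups = [[mesh[r][c] for r in range(rows)] for c in range(cols)]

    return replicate_groups, shard_groups


replicate_groups, shard_groups = mesh_groups(8, (2, 4))
print("shard groups:    ", shard_groups)
print("replicate groups:", replicate_groups)
```

Under these assumptions, ranks 0 to 3 and ranks 4 to 7 each form a shard group, and ranks that share a column (for example 0 and 4) form a replicate group. Setting up these groups by hand is the bookkeeping that DeviceMesh is meant to take over.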

@wz337 wz337 force-pushed the add_device_mesh_recipe branch from 33d06c3 to fd43ef4 Compare December 19, 2023 23:24
@wz337 wz337 marked this pull request as ready for review December 19, 2023 23:37
wz337 and others added 3 commits December 19, 2023 15:42
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
@wz337 wz337 force-pushed the add_device_mesh_recipe branch from fd43ef4 to 572fdd9 Compare December 19, 2023 23:42
@wz337 wz337 force-pushed the add_device_mesh_recipe branch from af662e5 to 1102397 Compare December 20, 2023 22:29
@wanchaol (Contributor) left a comment

lgtm, have a few suggestions inlined. Thanks for working on this!

.. code-block:: python
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 2d_setup.py

Note

Would this "Note" require some special formatting? For example, something like

.. note::

and then put the paragraph below after it?

Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.

.. code-block:: python
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 hsdp.py

ditto: not sure if we want to simplify the run command here

:card_description: Learn how to use DeviceMesh
:image: ../_static/img/thumbnails/cropped/profiler.png
:link: ../recipes/distributed_device_mesh.html
:tags: Distributed-Training

Shall we put this card first in the distributed section? IIRC the other recipes are not very useful anymore (e.g. RPC).

@wz337 (Contributor Author) commented Dec 21, 2023

We can also add the tutorial on the distributed page as @svekars suggested: https://pytorch.org/tutorials/distributed/home.html

@wz337 wz337 force-pushed the add_device_mesh_recipe branch from 5d434d9 to 42592fb Compare December 21, 2023 00:36
:link: https://pytorch.org/tutorials/recipes/distributed_device_mesh.html?highlight=devicemesh
:link-type: url

In this tutorial you will learn to implement about `DeviceMesh`

"implement about"? seems we should say: you will learn about how to use

wz337 and others added 2 commits December 20, 2023 19:40
Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
@svekars svekars added the 2.2 label Dec 21, 2023
@svekars (Contributor) left a comment

This PR LGTM! We'll merge the week of Jan 2. Thank you!

@svekars svekars merged commit bcaa9f6 into pytorch:main Jan 24, 2024
HDCharles pushed a commit that referenced this pull request Jan 26, 2024
* add DeviceMesh recipe
---------

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
svekars added a commit that referenced this pull request Feb 2, 2024
* add DeviceMesh recipe
---------

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
4 participants