
Commit 42592fb

address wanchao's comments
1 parent 1102397 commit 42592fb

3 files changed: +35 -17 lines changed


distributed/home.rst

Lines changed: 19 additions & 1 deletion
@@ -13,6 +13,7 @@ PyTorch with each method having their advantages in certain use cases:
 
 * `DistributedDataParallel (DDP) <#learn-ddp>`__
 * `Fully Sharded Data Parallel (FSDP) <#learn-fsdp>`__
+* `Device Mesh <#device-mesh>`__
 * `Remote Procedure Call (RPC) distributed training <#learn-rpc>`__
 * `Custom Extensions <#custom-extensions>`__
 

@@ -51,7 +52,7 @@ Learn DDP
 :link: https://pytorch.org/tutorials/advanced/generic_join.html?utm_source=distr_landing&utm_medium=generic_join
 :link-type: url
 
-This tutorial describes the Join context manager and
+This tutorial describes the Join context manager and
 demonstrates it's use with DistributedData Parallel.
 +++
 :octicon:`code;1em` Code

@@ -83,6 +84,23 @@ Learn FSDP
 +++
 :octicon:`code;1em` Code
 
+.. _device-mesh:
+
+Learn DeviceMesh
+----------------
+
+.. grid:: 3
+
+.. grid-item-card:: :octicon:`file-code;1em`
+Getting Started with DeviceMesh
+:link: https://pytorch.org/tutorials/recipes/distributed_device_mesh.html?highlight=devicemesh
+:link-type: url
+
+In this tutorial you will learn to implement about `DeviceMesh`
+and how it can help with distributed training.
++++
+:octicon:`code;1em` Code
+
 .. _learn-rpc:
 
 Learn RPC

recipes_source/distributed_device_mesh.rst

Lines changed: 9 additions & 9 deletions
@@ -30,10 +30,10 @@ Users can also easily manage the underlying process_groups/devices for multi-dim
 
 Why DeviceMesh is Useful
 ------------------------
-DeviceMesh is useful, when composability is requried. That is when your parallelism solutions require both communication across hosts and within each host.
+DeviceMesh is useful when working with multi-dimensional parallelism (i.e. 3-D parallel) where parallelism composability is requried. For example, when your parallelism solutions require both communication across hosts and within each host.
 The image above shows that we can create a 2D mesh that connects the devices within each host, and connects each device with its counterpart on the other hosts in a homogenous setup.
 
-Without DeviceMesh, users would need to manually set up NCCL communicators before applying any parallelism.
+Without DeviceMesh, users would need to manually set up NCCL communicators, cuda devices on each process before applying any parallelism, which could be quite complicated.
 The following code snippet illustrates a hybrid sharding 2-D Parallel pattern setup without :class:`DeviceMesh`.
 First, we need to manually calculate the shard group and replicate group. Then, we need to assign the correct shard and
 replicate group to each rank.

@@ -51,6 +51,7 @@ replicate group to each rank.
 
 # Create process groups to manage 2-D like parallel pattern
 dist.init_process_group("nccl")
+torch.cuda.set_device(rank)
 
 # Create shard groups (e.g. (0, 1, 2, 3), (4, 5, 6, 7))
 # and assign the correct shard group to each rank
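
The manual setup that this hunk extends with ``torch.cuda.set_device(rank)`` looks roughly like the sketch below: a hypothetical reconstruction for 8 GPUs on a single host, split into two shard groups of four ranks and four replicate groups of two. The variable names and the fixed world size are illustrative assumptions, not copied from ``2d_setup.py``.

    import os

    import torch
    import torch.distributed as dist

    # Each process learns its global rank from torchrun's environment.
    rank = int(os.environ["RANK"])

    dist.init_process_group("nccl")
    torch.cuda.set_device(rank)

    num_devices = 8  # assumed world size for this sketch

    # Shard groups: (0, 1, 2, 3) and (4, 5, 6, 7).
    shard_rank_lists = (
        list(range(0, num_devices // 2)),
        list(range(num_devices // 2, num_devices)),
    )
    # Every rank must create every group, even the ones it does not belong to.
    shard_groups = tuple(dist.new_group(ranks) for ranks in shard_rank_lists)
    current_shard_group = (
        shard_groups[0] if rank in shard_rank_lists[0] else shard_groups[1]
    )

    # Replicate groups: (0, 4), (1, 5), (2, 6), (3, 7).
    shard_size = len(shard_rank_lists[0])
    current_replicate_group = None
    for i in range(shard_size):
        replicate_ranks = list(range(i, num_devices, shard_size))
        group = dist.new_group(replicate_ranks)
        if rank in replicate_ranks:
            current_replicate_group = group
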
@@ -78,11 +79,10 @@ To run the above code snippet, we can leverage PyTorch Elastic. Let's create a f
 Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
 
 .. code-block:: python
-torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 2d_setup.py
+torchrun --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 2d_setup.py
 
-Note
-
-For simplicity of demonstration, we are simulating 2D parallel using only one node. Note that this code snippet can also be used when running on multi hosts setup.
+.. note::
+For simplicity of demonstration, we are simulating 2D parallel using only one node. Note that this code snippet can also be used when running on multi hosts setup.
 
 With the help of :func:`init_device_mesh`, we can accomplish the above 2D setup in just two lines, and we can still
 access the underlying :class:`ProcessGroup` if needed.
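
The "two lines" the hunk refers to amount to something like the following sketch; the 2x4 mesh shape and the dimension names are assumptions for an 8-GPU run, not necessarily the recipe's exact ``2d_setup_with_device_mesh.py``.

    from torch.distributed.device_mesh import init_device_mesh

    # One call builds the process groups for both dimensions: the outer
    # dimension replicates across hosts, the inner one shards within a host.
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

    # The underlying ProcessGroup for each dimension remains accessible.
    replicate_group = mesh_2d.get_group(mesh_dim="replicate")
    shard_group = mesh_2d.get_group(mesh_dim="shard")
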
@@ -100,15 +100,15 @@ Let's create a file named ``2d_setup_with_device_mesh.py``.
 Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
 
 .. code-block:: python
-torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 2d_setup_with_device_mesh.py
+torchrun --nproc_per_node=8 2d_setup_with_device_mesh.py
 
 
 How to use DeviceMesh with HSDP
 -------------------------------
 
 Hybrid Sharding Data Parallel(HSDP) is 2D strategy to perform FSDP within a host and DDP across hosts.
 
-Let's see an example of how DeviceMesh can assist with applying HSDP to your model. With DeviceMesh,
+Let's see an example of how DeviceMesh can assist with applying HSDP to your model with a simple setup. With DeviceMesh,
 users would not need to manually create and manage shard group and replicate group.
 
 .. code-block:: python
@@ -140,7 +140,7 @@ Let's create a file named ``hsdp.py``.
 Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable/elastic/quickstart.html>`__ command.
 
 .. code-block:: python
-torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=100 --rdzv_endpoint=localhost:29400 hsdp.py
+torchrun --nproc_per_node=8 hsdp.py
 
 Conclusion
 ----------
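
As a rough picture of what the HSDP setup described in the hunks above looks like with a DeviceMesh, here is a minimal sketch; the toy model, the 2x4 mesh shape, and the device handling are illustrative assumptions rather than the recipe's exact ``hsdp.py``.

    import os

    import torch
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import ShardingStrategy


    class ToyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.net1 = nn.Linear(10, 10)
            self.relu = nn.ReLU()
            self.net2 = nn.Linear(10, 5)

        def forward(self, x):
            return self.net2(self.relu(self.net1(x)))


    # Bind each process to its GPU before building the mesh.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # 2 (replicate) x 4 (shard) mesh over 8 GPUs.
    mesh_2d = init_device_mesh("cuda", (2, 4))

    # HYBRID_SHARD shards parameters FSDP-style within each group of 4
    # and replicates them DDP-style across the 2 groups; no manual
    # shard/replicate process groups are needed.
    model = FSDP(
        ToyModel().cuda(),
        device_mesh=mesh_2d,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    )
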

recipes_source/recipes_index.rst

Lines changed: 7 additions & 7 deletions
@@ -296,6 +296,13 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
 
 .. Distributed Training
 
+.. customcarditem::
+:header: Getting Started with DeviceMesh
+:card_description: Learn how to use DeviceMesh
+:image: ../_static/img/thumbnails/cropped/profiler.png
+:link: ../recipes/distributed_device_mesh.html
+:tags: Distributed-Training
+
 .. customcarditem::
 :header: Shard Optimizer States with ZeroRedundancyOptimizer
 :card_description: How to use ZeroRedundancyOptimizer to reduce memory consumption.

@@ -324,13 +331,6 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
 :link: ../recipes/DCP_tutorial.html
 :tags: Distributed-Training
 
-.. customcarditem::
-:header: Getting Started with DeviceMesh
-:card_description: Learn how to use DeviceMesh
-:image: ../_static/img/thumbnails/cropped/profiler.png
-:link: ../recipes/distributed_device_mesh.html
-:tags: Distributed-Training
-
 .. TorchServe
 
 .. customcarditem::