recipes_source/distributed_device_mesh.rst
15 additions & 3 deletions
@@ -30,8 +30,12 @@ Users can also easily manage the underlying process_groups/devices for multi-dim
 Why DeviceMesh is Useful
 ------------------------
+DeviceMesh is useful when composability is required, that is, when your parallelism solution requires both communication across hosts and within each host.
+The image above shows that we can create a 2D mesh that connects the devices within each host and connects each device with its counterpart on the other hosts in a homogeneous setup.
 
-The following code snippet illustrates a 2D setup without :class:`DeviceMesh`. First, we need to manually calculate the shard group and replicate group. Then, we need to assign the correct shard and
+Without DeviceMesh, users would need to manually set up NCCL communicators before applying any parallelism.
+The following code snippet illustrates a hybrid sharding 2D parallel pattern setup without :class:`DeviceMesh`.
+First, we need to manually calculate the shard group and replicate group. Then, we need to assign the correct shard and
 replicate group to each rank.
 
 .. code-block:: python
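The body of that manual snippet falls outside this hunk's context. As a rough illustration of the bookkeeping the added text describes, here is a minimal sketch of such a manual 2D setup, assuming a single host with 8 GPUs split into two shard groups of four and four replicate groups of two (the layout, variable names, and launch environment variables are illustrative assumptions, not taken from the diff):

.. code-block:: python

    import os

    import torch
    import torch.distributed as dist

    # Illustrative single-host layout: 8 GPUs arranged as
    # two shard groups of 4 ranks and four replicate groups of 2 ranks.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])

    dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)

    num_devices = torch.cuda.device_count()

    # Shard groups, e.g. (0, 1, 2, 3) and (4, 5, 6, 7).
    shard_rank_lists = (
        list(range(0, num_devices // 2)),
        list(range(num_devices // 2, num_devices)),
    )
    # Every rank creates every group, in the same order.
    shard_groups = tuple(dist.new_group(ranks) for ranks in shard_rank_lists)
    current_shard_group = (
        shard_groups[0] if rank in shard_rank_lists[0] else shard_groups[1]
    )

    # Replicate groups, e.g. (0, 4), (1, 5), (2, 6), (3, 7).
    current_replicate_group = None
    shard_size = len(shard_rank_lists[0])
    for i in range(shard_size):
        replicate_ranks = list(range(i, num_devices, shard_size))
        replicate_group = dist.new_group(replicate_ranks)
        if rank in replicate_ranks:
            current_replicate_group = replicate_group

Note that every rank typically has to enter every ``dist.new_group`` call in the same order, even for groups it does not belong to; this per-dimension bookkeeping is exactly what :class:`DeviceMesh` manages for you.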
@@ -76,13 +80,21 @@ Then, run the following `torch elastic/torchrun <https://pytorch.org/docs/stable
-With the help of :func:`init_device_mesh`, we can accomplish the above 2D setup in just two lines.
+For simplicity of demonstration, we are simulating 2D parallelism using only one node. Note that this code snippet can also be used when running on a multi-host setup.
+
+With the help of :func:`init_device_mesh`, we can accomplish the above 2D setup in just two lines, and we can still
+access the underlying :class:`ProcessGroup` if needed.
 
 
 .. code-block:: python
     from torch.distributed.device_mesh import init_device_mesh
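The rest of the updated snippet also sits outside the diff context. A minimal sketch of the two-line setup and of retrieving the per-dimension :class:`ProcessGroup` might look as follows (the 2 x 4 mesh shape and the dimension names are illustrative assumptions):

.. code-block:: python

    from torch.distributed.device_mesh import init_device_mesh

    # Illustrative 2 x 4 layout: 2 replicate groups x 4 shard ranks.
    mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

    # The underlying ProcessGroup for each mesh dimension is still reachable.
    replicate_group = mesh_2d.get_group(mesh_dim="replicate")
    shard_group = mesh_2d.get_group(mesh_dim="shard")

Such a script would be launched the same way as the manual version, for example with ``torchrun --nproc_per_node=8 your_script.py`` on a single 8-GPU host; the flag value and script name here are placeholders.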