Commit 4cc0a2d

committed
[dtensor][debug] tutorial showing users how to use CommDebugMode and giving access to the visual browser
1 parent 6164c77 commit 4cc0a2d

File tree

1 file changed (+30 -25 lines)

recipes_source/distributed_comm_debug_mode.rst

Lines changed: 30 additions & 25 deletions
@@ -3,11 +3,10 @@ Using CommDebugMode
 
 **Author**: `Anshul Sinha <https://github.com/sinhaanshul>`__
 
-Prerequisites:
+Prerequisites
 
-- `Distributed Communication Package - torch.distributed <https://pytorch.org/docs/stable/distributed.html>`__
 - Python 3.8 - 3.11
-- PyTorch 2.2
+- PyTorch 2.2 or later
 
 
 What is CommDebugMode and why is it useful
@@ -16,17 +15,20 @@ As the size of models continues to increase, users are seeking to leverage vario
 of parallel strategies to scale up distributed training. However, the lack of interoperability
 between existing solutions poses a significant challenge, primarily due to the absence of a
 unified abstraction that can bridge these different parallelism strategies. To address this
-issue, PyTorch has proposed DistributedTensor(DTensor) which abstracts away the complexities of
-tensor communication in distributed training, providing a seamless user experience. However,
-this abstraction creates a lack of transparency that can make it challenging for users to
-identify and resolve issues. To address this challenge, CommDebugMode, a Python context manager
-will serve as one of the primary debugging tools for DTensors, enabling users to view when and
-why collective operations are happening when using DTensors, addressing this problem.
+issue, PyTorch has proposed `DistributedTensor(DTensor)
+<https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/examples/comm_mode_features_example.py>`_
+which abstracts away the complexities of tensor communication in distributed training,
+providing a seamless user experience. However, this abstraction creates a lack of transparency
+that can make it challenging for users to identify and resolve issues. To address this challenge,
+``CommDebugMode``, a Python context manager, will serve as one of the primary debugging tools for
+DTensors, enabling users to view when and why collective operations are happening when using DTensors,
+effectively addressing this issue.
 
 
 How to use CommDebugMode
 ------------------------
-Using CommDebugMode and getting its output is very simple.
+
+Here is how you can use ``CommDebugMode``:
 
 .. code-block:: python
 
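As background for the paragraph above, here is a minimal DTensor sketch (not part of this commit) showing the abstraction in action. The import paths follow the public DTensor API of recent PyTorch releases (older releases expose the same functions under ``torch.distributed._tensor``), and the two-GPU mesh is an assumption:

.. code-block:: python

    # Minimal DTensor sketch: shard a tensor across ranks and let DTensor
    # manage the communication. Assumes 2 GPUs launched with
    # torchrun --nproc-per-node=2.
    import torch
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor import distribute_tensor, Shard

    mesh = init_device_mesh("cuda", (2,))  # 1-D mesh over 2 ranks
    global_tensor = torch.randn(8, 16)
    # Shard dim 0: each rank holds a 4 x 16 slice of the global tensor
    dtensor = distribute_tensor(global_tensor, mesh, placements=[Shard(0)])
    # Operations on DTensors issue collectives under the hood; CommDebugMode
    # is the tool this recipe introduces for observing them
    print(dtensor.full_tensor().shape)  # torch.Size([8, 16])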
@@ -46,6 +48,8 @@ Using CommDebugMode and getting its output is very simple.
     # used in the visual browser below
     comm_mode.generate_json_dump(noise_level=2)
 
+.. code-block:: python
+
     """
     This is what the output looks like for a MLPModule at noise level 0
     Expected Output:
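For readers of this diff, the surrounding code block in the recipe (elided here) follows the pattern below. This is a sketch based on the recipe's API calls; ``model`` and ``inp`` stand in for the tensor-parallel ``MLPModule`` and its input, and the import path is the private namespace used at the time of this commit:

.. code-block:: python

    from torch.distributed._tensor.debug import CommDebugMode

    # wrap the code running the model in CommDebugMode
    comm_mode = CommDebugMode()
    with comm_mode:
        output = model(inp)

    # print module-level collective counts to stdout
    print(comm_mode.generate_comm_debug_tracing_table(noise_level=0))

    # dump the tracing information to a JSON file,
    # used in the visual browser below
    comm_mode.generate_json_dump(noise_level=2)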
@@ -62,19 +66,18 @@ Using CommDebugMode and getting its output is very simple.
     *c10d_functional.all_reduce: 1
     """
 
-All users have to do is wrap the code running the model in CommDebugMode and call the API that
-they want to use to display the data. One important thing to note is that the users can use a noise_level
-arguement to control how much information is displayed to the user. The information below shows what each
-noise level displays
+To use ``CommDebugMode``, you must wrap the code running the model in ``CommDebugMode`` and call the API that
+you want to use to display the data. You can also use a ``noise_level`` argument to control the verbosity
+level of displayed information. Here is what each noise level displays:
 
-| 0. prints module-level collective counts
-| 1. prints dTensor operations not included in trivial operations, module information
-| 2. prints operations not included in trivial operations
-| 3. prints all operations
+| 0. Prints module-level collective counts
+| 1. Prints DTensor operations not included in trivial operations, module information
+| 2. Prints operations not included in trivial operations
+| 3. Prints all operations
 
-In the example above, users can see in the first picture that the collective operation, all_reduce, occurs
-once in the forward pass of the MLPModule. The second picture provides a greater level of detail, allowing
-users to pinpoint that the all-reduce operation happens in the second linear layer of the MLPModule.
+In the example above, you can see that the collective operation, all_reduce, occurs once in the forward pass
+of the ``MLPModule``. Furthermore, you can use ``CommDebugMode`` to pinpoint that the all-reduce operation happens
+in the second linear layer of the ``MLPModule``.
 
 
 Below is the interactive module tree visualization that users can upload their JSON dump to:
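To make the noise levels concrete, here is a short sketch reusing the ``comm_mode`` object from the usage example above; the loop and the log file name are illustrative, not part of the recipe:

.. code-block:: python

    # Compare verbosity levels: 0 is the most compact, 3 the most verbose
    for level in range(4):
        print(f"--- noise_level={level} ---")
        print(comm_mode.generate_comm_debug_tracing_table(noise_level=level))

    # a tracing table can also be logged to a file instead of stdout
    comm_mode.log_comm_debug_tracing_table_to_file(
        noise_level=1, file_name="mlp_operation_log.txt"
    )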
@@ -190,8 +193,10 @@ Below is the interactive module tree visualization that users can upload their J
 
 Conclusion
 ------------------------------------------
-In conclusion, we have learned how to use CommDebugMode in order to debug Distributed Tensors
-and can use future json dumps in the embedded visual browser.
 
-For more detailed information about CommDebugMode, please see
-https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/examples/comm_mode_features_example.py
+In this recipe, we have learned how to use ``CommDebugMode`` to debug Distributed Tensors. You can use your
+own JSON outputs in the embedded visual browser.
+
+For more detailed information about ``CommDebugMode``, see
+`comm_mode_features_example.py
+<https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/examples/comm_mode_features_example.py>`_
