@@ -3,11 +3,10 @@ Using CommDebugMode

**Author**: `Anshul Sinha <https://github.com/sinhaanshul>`__

- Prerequisites:
+ Prerequisites

- - `Distributed Communication Package - torch.distributed <https://pytorch.org/docs/stable/distributed.html>`__
- Python 3.8 - 3.11
- - PyTorch 2.2
+ - PyTorch 2.2 or later


What is CommDebugMode and why is it useful
@@ -16,17 +15,20 @@ As the size of models continues to increase, users are seeking to leverage various combinations
of parallel strategies to scale up distributed training. However, the lack of interoperability
between existing solutions poses a significant challenge, primarily due to the absence of a
unified abstraction that can bridge these different parallelism strategies. To address this
- issue, PyTorch has proposed DistributedTensor(DTensor) which abstracts away the complexities of
- tensor communication in distributed training, providing a seamless user experience. However,
- this abstraction creates a lack of transparency that can make it challenging for users to
- identify and resolve issues. To address this challenge, CommDebugMode, a Python context manager
- will serve as one of the primary debugging tools for DTensors, enabling users to view when and
- why collective operations are happening when using DTensors, addressing this problem.
+ issue, PyTorch has proposed `DistributedTensor(DTensor)
+ <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/examples/comm_mode_features_example.py>`_
+ which abstracts away the complexities of tensor communication in distributed training,
+ providing a seamless user experience. However, this abstraction creates a lack of transparency
+ that can make it challenging for users to identify and resolve issues. To address this challenge,
+ ``CommDebugMode``, a Python context manager, will serve as one of the primary debugging tools for
+ DTensors, enabling users to view when and why collective operations are happening, effectively
+ addressing this issue.
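+
+ For instance, here is a minimal sketch of the idea; the 4-GPU mesh, tensor shapes, and
+ sharding choices are illustrative assumptions, and the ``_tensor`` import paths reflect
+ PyTorch around 2.2–2.4 and may differ in newer releases. An operation on sharded DTensors
+ silently triggers communication, and ``CommDebugMode`` makes that communication visible.
+
+ .. code-block:: python
+
+     import torch
+     from torch.distributed._tensor import Shard, distribute_tensor
+     from torch.distributed._tensor.debug import CommDebugMode
+     from torch.distributed.device_mesh import init_device_mesh
+
+     # illustrative 1-D mesh over 4 GPUs; launch with torchrun --nproc_per_node=4
+     mesh = init_device_mesh("cuda", (4,))
+     a = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)])  # sharded along rows
+     b = distribute_tensor(torch.randn(8, 8), mesh, [Shard(1)])  # sharded along columns
+
+     comm_mode = CommDebugMode()
+     with comm_mode:
+         # materializing the sharded matmul result forces collective communication
+         c = torch.mm(a, b).full_tensor()
+
+     print(comm_mode.get_total_counts())  # total number of collectives observed
+     print(comm_mode.get_comm_counts())   # per-collective breakdown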


How to use CommDebugMode
------------------------
- Using CommDebugMode and getting its output is very simple.
+
+ Here is how you can use ``CommDebugMode``:

.. code-block:: python
@@ -46,6 +48,8 @@ Using CommDebugMode and getting its output is very simple.
# used in the visual browser below
comm_mode.generate_json_dump(noise_level=2)

+ .. code-block:: python
+
"""
This is what the output looks like for an MLPModule at noise level 0
Expected Output:
@@ -62,19 +66,18 @@ Using CommDebugMode and getting its output is very simple.
*c10d_functional.all_reduce: 1
"""
- All users have to do is wrap the code running the model in CommDebugMode and call the API that
- they want to use to display the data. One important thing to note is that the users can use a noise_level
- arguement to control how much information is displayed to the user. The information below shows what each
- noise level displays
+ To use ``CommDebugMode``, you must wrap the code running the model in ``CommDebugMode`` and call the API that
+ you want to use to display the data. You can also use a ``noise_level`` argument to control the verbosity
+ level of displayed information. Here is what each noise level displays:
- | 0. prints module-level collective counts
- | 1. prints dTensor operations not included in trivial operations, module information
- | 2. prints operations not included in trivial operations
- | 3. prints all operations
+ | 0. Prints module-level collective counts
+ | 1. Prints DTensor operations not included in trivial operations, module information
+ | 2. Prints operations not included in trivial operations
+ | 3. Prints all operations
- In the example above, users can see in the first picture that the collective operation, all_reduce, occurs
- once in the forward pass of the MLPModule. The second picture provides a greater level of detail, allowing
- users to pinpoint that the all-reduce operation happens in the second linear layer of the MLPModule.
+ In the example above, you can see that the collective operation, ``all_reduce``, occurs once in the forward pass
+ of the ``MLPModule``. Furthermore, you can use ``CommDebugMode`` to pinpoint that the all-reduce operation happens
+ in the second linear layer of the ``MLPModule``, as the sketch below demonstrates.
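+
+ For instance, with the ``comm_mode`` object from the earlier sketch still in scope, you can
+ re-render the same trace at higher noise levels (this reuses the hypothetical setup above):
+
+ .. code-block:: python
+
+     # noise level 0: module-level collective counts only
+     print(comm_mode.generate_comm_debug_tracing_table(noise_level=0))
+
+     # noise level 1: adds non-trivial DTensor operations and module information,
+     # which is what attributes the all_reduce to the second linear layer
+     print(comm_mode.generate_comm_debug_tracing_table(noise_level=1))
+
+     # noise level 3: every operation recorded under the context manager
+     print(comm_mode.generate_comm_debug_tracing_table(noise_level=3))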


Below is the interactive module tree visualization that users can upload their JSON dump to:
@@ -190,8 +193,10 @@ Below is the interactive module tree visualization that users can upload their JSON dump to:
Conclusion
------------------------------------------
- In conclusion, we have learned how to use CommDebugMode in order to debug Distributed Tensors
- and can use future json dumps in the embedded visual browser.
- For more detailed information about CommDebugMode, please see
- https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/examples/comm_mode_features_example.py
+ In this recipe, we have learned how to use ``CommDebugMode`` to debug Distributed Tensors. You can use your
+ own JSON outputs in the embedded visual browser.
+
+ For more detailed information about ``CommDebugMode``, see
+ `comm_mode_features_example.py
+ <https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/examples/comm_mode_features_example.py>`_