@@ -48,6 +48,8 @@ Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working.

+ - ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Sets the path prefix for the Flight Recorder dump files. One file is
+   written per rank. The default value is ``/tmp/nccl_trace_rank_``.
- ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
  ``N`` represents the number of entries that will be kept internally in a circular buffer.
  We recommend setting this value to *2000*.
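
As a rough illustration, the two variables above can also be set from Python before the process group is created. This is only a sketch: the helper name ``flight_recorder_env`` is hypothetical (it is not part of PyTorch), and it assumes the variables are read from the process environment when the NCCL process group is initialized, so they have to be set before that point.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_prefix="/tmp/nccl_trace_rank_"):
    """Return the Flight Recorder settings described above as a dict.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    if buffer_size <= 0:
        # Only a positive N enables collection.
        raise ValueError("buffer_size must be positive to enable collection")
    return {
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
    }


# Apply the settings before any torch.distributed initialization
# (assumption: the variables are read at that point).
os.environ.update(flight_recorder_env())
```

Exporting the same two variables in the launching shell should be equivalent; the helper merely keeps the recommended values in one place.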
@@ -71,6 +73,9 @@ Additional Settings
  ``fast`` is a new experimental mode that has been shown to be much faster than the traditional ``addr2line``.
  Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
+ - To write the Flight Recorder dump to your own storage instead of the local disk, define a custom writer class
+   that inherits from ``::c10d::DebugInfoWriter`` and register it with ``::c10d::DebugInfoWriter::registerWriter``
+   before initializing c10d distributed.

Retrieving Flight Recorder Data via an API
------------------------------------------
@@ -169,9 +174,28 @@ To run the convenience script, follow these steps:
2. To run the script, use this command:

- .. code:: python
+ .. code:: shell
+
+    python fr_trace.py <dump dir containing trace files> [-o <output file>]
+
+ If you installed a PyTorch nightly build or built PyTorch from source (with ``USE_DISTRIBUTED=1``), you can run the following command directly:
+
+ .. code:: shell
+
+    torchfrtrace <dump dir containing trace files> [-o <output file>]
+
+
+ The analyzer script currently supports two modes. In the first, the script applies heuristics to the parsed Flight Recorder
+ dumps and generates a report on the likely culprit for the timeout or hang. In the second, it prints the raw dumps. By
+ default, the second mode prints all ranks and all ProcessGroups (PGs); this can be narrowed down to specific ranks and PGs.
+
+ Caveat: the ``tabulate`` module is required, so you may need to ``pip install`` it first.
+
+ Example command:
+
+ .. code:: shell

-    python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
+    python fr_trace.py <dump dir containing trace files> -j [--selected-ranks i j k ...]
+    torchfrtrace <dump dir containing trace files> -j [--selected-ranks i j k ...]
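
The invocations above can also be assembled programmatically, for example from a job-monitoring script. The following is a minimal sketch under stated assumptions: ``build_fr_trace_command`` is a hypothetical helper, ``/tmp/fr_dumps`` is a placeholder directory, and only the flags shown above (``-o``, ``-j``, ``--selected-ranks``) are used. Whether ``-o`` and ``-j`` can be combined is not covered here, so the sketch does not rely on it.

```python
import shlex


def build_fr_trace_command(dump_dir, output_file=None, raw=False,
                           selected_ranks=None, executable="torchfrtrace"):
    """Build an argument list for the analyzer script (hypothetical helper)."""
    cmd = [executable, dump_dir]
    if output_file is not None:
        cmd += ["-o", output_file]
    if raw:
        # -j prints the raw dumps instead of the heuristic report.
        cmd.append("-j")
        if selected_ranks:
            cmd.append("--selected-ranks")
            cmd += [str(rank) for rank in selected_ranks]
    elif selected_ranks:
        # The text above only shows rank selection together with -j.
        raise ValueError("selected_ranks requires raw=True")
    return cmd


# Placeholder dump directory; replace with the real location of the trace files.
cmd = build_fr_trace_command("/tmp/fr_dumps", raw=True, selected_ranks=[0, 1])
print(shlex.join(cmd))
```

The resulting list can be passed to ``subprocess.run(cmd, check=True)`` once the dump directory exists and the script is installed.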

Conclusion
----------