Commit 4126761

Minor fixes (#3059)
* Minor fixes to the flight recorder tutorial

Summary:
1. Fix a sentence to say that there are 3 required parameters to enable this feature. Clarify default values.
2. Add a link to the relevant code that users need to modify to write out Flight Recorder dumps to a location of their choosing.
3. Clarify a sentence by adding the switch that customers need to use to filter their data by ranks.

Test Plan:
Looked at renderings locally and also in the tests.

1 parent cd7f684 commit 4126761

1 file changed, +8 -7 lines changed

prototype_source/flight_recorder_tutorial.rst

@@ -46,15 +46,15 @@ Flight Recorder consists of two core parts:

 Enabling Flight Recorder
 ------------------------
-There are two required environment variables to get the initial version of Flight Recorder working.
+There are three required environment variables to get the initial version of Flight Recorder working.

-- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Setting the path where the flight recorder will be dumped with file prefix. One file per
-rank. The default value is ``/tmp/nccl_trace_rank_``.
 - ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
 ``N`` represents the number of entries that will be kept internally in a circular buffer.
-We recommended to set this value at *2000*.
+We recommended to set this value at *2000*. The default value is ``2000``.
 - ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout.
-If enabled, there will be one file per rank output in the job's running directory.
+If enabled, there will be one file per rank output in the job's running directory. The default value is ``false``.
+- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Setting the path where the flight recorder will be dumped with file prefix. One file per
+rank. The default value is ``/tmp/nccl_trace_rank_``.

 **Optional settings:**


@@ -74,7 +74,8 @@ Additional Settings
 ``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
 Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
 - If you prefer not to have the flight recorder data dumped into the local disk but rather onto your own storage, you can define your own writer class.
-This class should inherit from class ``::c10d::DebugInfoWriter`` and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter``
+This class should inherit from class ``::c10d::DebugInfoWriter`` `(code) <https://github.com/pytorch/pytorch/blob/release/2.5/torch/csrc/distributed/c10d/NCCLUtils.hpp#L237>`__
+and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter`` `(code) <https://github.com/pytorch/pytorch/blob/release/2.5/torch/csrc/distributed/c10d/NCCLUtils.hpp#L242>`__
 before we initiate PyTorch distributed.

 Retrieving Flight Recorder Data via an API
@@ -189,7 +190,7 @@ command directly:
 Currently, we support two modes for the analyzer script. The first mode allows the script to apply some heuristics to the parsed flight
 recorder dumps to generate a report identifying potential culprits for the timeout. The second mode is simply outputs the raw dumps.
 By default, the script prints flight recoder dumps for all ranks and all ``ProcessGroups``(PGs). This can be narrowed down to certain
-ranks and PGs. An example command is:
+ranks and PGs using the *--selected-ranks* argument. An example command is:

 Caveat: tabulate module is needed, so you might need pip install it first.
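
For reference, the three required settings documented in the first hunk can be sketched in Python as below. This is a minimal illustration and not part of the commit: the helper names `flight_recorder_env` and `expected_dump_path` are hypothetical, and the per-rank file name (the prefix followed directly by the rank number) is an assumption based on the "file prefix, one file per rank" wording. The variables would presumably need to be set before PyTorch distributed is initialized.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_on_timeout=True,
                        dump_prefix="/tmp/nccl_trace_rank_"):
    """Build the three Flight Recorder environment variables as strings.

    Hypothetical helper: the variable names and default values come from the
    tutorial text in the diff; the function itself is illustrative only.
    """
    if buffer_size < 0:
        raise ValueError("buffer_size must be 0 (disabled) or a positive entry count")
    return {
        # A positive N enables collection into a circular buffer of N entries.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        # The documented default is false; true writes dumps on job timeout.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "true" if dump_on_timeout else "false",
        # File prefix for the dumps, one file per rank.
        "TORCH_NCCL_DEBUG_INFO_TEMP_FILE": dump_prefix,
    }


def expected_dump_path(rank, dump_prefix="/tmp/nccl_trace_rank_"):
    # Assumption: the rank number is appended directly to the prefix.
    return f"{dump_prefix}{rank}"


# Apply the settings to the current process environment.
env = flight_recorder_env()
os.environ.update(env)
```

With these defaults, `expected_dump_path(3)` would point at `/tmp/nccl_trace_rank_3` under the naming assumption stated above.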
