Commit 477bdba

committed
address code review comments
Summary: Add missing sections from the template and clarify some notes further in the tutorial.
1 parent bd1f808 commit 477bdba

1 file changed: +40 -13 lines changed

prototype_source/flight_recorder_tutorial.rst

Lines changed: 40 additions & 13 deletions
@@ -4,8 +4,8 @@

 This tutorial introduces a new tool for debugging stuck jobs during distributed training.

-Background and Motivation
---------------------------
+Overview, Background and Motivation
+-----------------------------------
 An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
 as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
 that require significant computational resources.
@@ -31,24 +31,48 @@ to be synchronized at certain points. If this synchronization fails, the job can
 occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
 indefinite wait for the job to progress.

-Flight Recorder captures diagnostics information as collectives run. The diagnostic information can be used to help root
-cause the underlying issue. There are 2 core parts to flight recorder.
+Flight Recorder, as the name suggests, captures diagnostic information as collectives run. The captured diagnostic
+information can be used to help root cause the underlying issue when jobs get stuck.
+There are 2 core parts to flight recorder.
 - The collection portion. When enabled, information about collectives is recorded in an in-memory circular buffer.
   Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
 - An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).
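The circular-buffer behavior described above can be illustrated with a bounded `collections.deque`. This is only a conceptual sketch, not PyTorch's implementation; the function name and the entry fields (`seq`, `op`) are hypothetical.

```python
from collections import deque


def record_collectives(buffer_size, events):
    """Keep only the most recent `buffer_size` events, like a circular buffer."""
    buffer = deque(maxlen=buffer_size)  # oldest entries are dropped once full
    for event in events:
        buffer.append(event)
    return list(buffer)


# Hypothetical entries: five collectives recorded into a buffer of size 3.
events = [{"seq": i, "op": "all_reduce"} for i in range(5)]
retained = record_collectives(3, events)
print([e["seq"] for e in retained])  # only the three newest entries remain
```

A larger buffer keeps more history before a hang at the cost of more memory, which is the trade-off behind choosing a buffer size.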

+Prerequisites
+-------------
+None. This is a new debugging tool that is available in PyTorch version 2.5.
+
 Enabling Flight Recorder
 ------------------------
 There are two required environment variables to get the initial version of flight recorder working.
-- TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a postitive number) N = collection enabled. Recommended to set this
-  to 2000)
-- TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout.
+- TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a positive number) N = collection enabled. N represents the number of
+  entries that will be kept internally in a circular buffer. Recommended to set this at 2000.
+- TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout. If set,
+  there will be one file per rank output in the job's running directory.
 Optional settings:
 - TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow
   addr2line - see additional settings)
 - TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and
   record the ‘duration’ of each collective. May incur some CPU overhead.
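One way to apply the settings listed above from Python is sketched here. The variable names come from the list; the helper `flight_recorder_env` is hypothetical, and the sketch assumes the variables must be set before the NCCL process group is created and that the literal strings `true` and `false` are accepted values.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_on_timeout=True,
                        cpp_stack=False, enable_timing=False):
    """Build the flight recorder environment settings as strings."""
    def flag(value):
        return "true" if value else "false"

    return {
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),     # 0 disables collection
        "TORCH_NCCL_DUMP_ON_TIMEOUT": flag(dump_on_timeout),  # one file per rank on timeout
        "TORCH_NCCL_TRACE_CPP_STACK": flag(cpp_stack),        # optional
        "TORCH_NCCL_ENABLE_TIMING": flag(enable_timing),      # optional, some CPU overhead
    }


env = flight_recorder_env()
# Assumption: this runs before the distributed process group is initialized.
os.environ.update(env)
```

Exporting the same variables in the job's launch script is an equivalent alternative.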

+Additional settings
+-------------------
+TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces
+from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much
+faster than the traditional `addr2line`.
+
+Retrieving Flight Recorder Data via an API
+------------------------------------------
+Flight recorder data can also be retrieved via an API call.
+The API is shown below with the default arguments.
+.. code:: python
+  torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False)
+
+To view the data, you can unpickle it:
+.. code:: python
+  t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
+  print(t)
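Printing the whole unpickled object can be noisy, so a short summary of its top levels may be easier to read. The sketch below makes no assumption about the real layout of the trace: the `describe` helper is hypothetical, and the pickled payload is a stand-in so the example runs without a distributed job.

```python
import pickle


def describe(obj, depth=0, max_depth=2):
    """Return indented lines describing the top levels of an unpickled object."""
    pad = "  " * depth
    if isinstance(obj, dict) and depth < max_depth:
        lines = []
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            lines.extend(describe(value, depth + 1, max_depth))
        return lines
    if isinstance(obj, (list, tuple)):
        return [f"{pad}<{len(obj)} items>"]
    return []


# Stand-in payload with hypothetical keys; not the real flight recorder format.
payload = pickle.dumps({"version": "hypothetical", "entries": [{"seq": 0}, {"seq": 1}]})
summary = describe(pickle.loads(payload))
print("\n".join(summary))
```

In a running job, `payload` would instead be the bytes returned by `torch._C._distributed_c10d._dump_nccl_trace()`.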
+
 Flight Recorder File Formats
 ----------------------------
 Flight recorder files are dumped out in `pickle` format. Files are written out to local disks or mounted shared NFS
@@ -118,13 +142,16 @@ Analyzing Flight Recorder Dumps
 We have convenient scripts available in `pytorch/tools/flight_recorder` directory that can be used to analyze captured
 data.

-To run it, one can use command line:
+1. In order to run the convenience script, all files from a rank must first be copied over into a single directory.
+
+2. To run it, one can use command line:
 .. code:: python
   python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
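The copy step that precedes the analyzer run could look like the sketch below. It assumes the per-rank dump files sit in separate directories reachable from one machine; the `gather_dumps` helper, the directory layout, and the file names are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def gather_dumps(source_dirs, dest_dir, pattern="*"):
    """Copy dump files from several directories into one directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in source_dirs:
        for path in sorted(Path(src).glob(pattern)):
            if not path.is_file():
                continue
            target = dest / path.name
            if target.exists():
                # Refuse to overwrite: two ranks should not share a file name.
                raise FileExistsError(f"duplicate dump file name: {path.name}")
            shutil.copy2(path, target)
            copied.append(target.name)
    return sorted(copied)


# Demo with stand-in files in two hypothetical per-host directories.
root = Path(tempfile.mkdtemp())
for rank in range(2):
    host_dir = root / f"host{rank}"
    host_dir.mkdir()
    (host_dir / f"trace_rank{rank}").write_bytes(b"stand-in")

gathered = gather_dumps([root / "host0", root / "host1"], root / "all_ranks")
print(gathered)
```

The destination directory is then what gets passed to `fr_trace.py` with `-d`.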


-Additional settings
--------------------
-TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces
-from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much
-faster than the traditional `addr2line`.
+Conclusion
+----------
+This tutorial introduces a new PyTorch diagnostic tool called `flight recorder`. The tutorial talks about how flight
+recorder can be enabled to collect diagnostic data from a machine.
+Data captured from flight recorder can be analyzed using a convenience script in the `tools/flight_recorder` directory
+in the PyTorch repository.
