This tutorial introduces a new tool for debugging stuck jobs during distributed training.

- Background and Motivation
- --------------------------
+ Overview, Background and Motivation
+ -----------------------------------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.
@@ -31,24 +31,48 @@ to be synchronized at certain points. If this synchronization fails, the job can
occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
indefinite wait for the job to progress.

- Flight Recorder captures diagnostics information as collectives run. The diagnostic information can be used to help root
- cause the underlying issue. There are 2 core parts to flight recorder.
+ Flight Recorder, as the name suggests, captures diagnostic information as collectives run. The captured diagnostic
+ information can be used to help root-cause the underlying issue when jobs get stuck.
+ There are two core parts to Flight Recorder.
- The collection portion. When enabled, information about collectives is recorded in an in-memory circular buffer.
Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).

+ Prerequisites
+ -------------
+ None. This is a new debugging tool that is available in PyTorch version 2.5.
+
Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of flight recorder working.
- - TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a postitive number) N = collection enabled. Recommended to set this
- to 2000)
- - TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout.
+ - TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N), where N is a positive number. N = collection enabled. N is the number of
+ entries that will be kept internally in a circular buffer. The recommended value is 2000.
+ - TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false). true = write out diagnostic files to disk on job timeout. If set,
+ there will be one file per rank, written to the job's running directory.
Optional settings:
- TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable C++ stack trace capture in flight recorder (for slow
addr2line, see Additional settings)
- TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional CUDA events at the start of each collective and
record the ‘duration’ of each collective. May incur some CPU overhead.
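The settings listed above can also be applied from Python at the top of a training script. The following is a minimal sketch rather than an official API: the helper name `flight_recorder_env` and its parameters are hypothetical, and only the variable names and value ranges come from this tutorial. It also assumes the variables need to be in place before the process group is initialized.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_on_timeout=True,
                        cpp_stack=False, enable_timing=False):
    """Build the Flight Recorder environment variables described above.

    Hypothetical helper for illustration; only the variable names and
    value ranges are taken from the tutorial text.
    """
    if buffer_size < 0:
        raise ValueError("buffer_size must be 0 or a positive number")

    def flag(value):
        # The tutorial lists the boolean settings as (true, false).
        return "true" if value else "false"

    return {
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        "TORCH_NCCL_DUMP_ON_TIMEOUT": flag(dump_on_timeout),
        "TORCH_NCCL_TRACE_CPP_STACK": flag(cpp_stack),
        "TORCH_NCCL_ENABLE_TIMING": flag(enable_timing),
    }


# Assumption: the variables must be set before the process group is
# created, so apply them before any torch.distributed setup code runs.
os.environ.update(flight_recorder_env())
```

Exporting the same variables in the shell or in the job launcher configuration is equivalent; the helper only keeps the names and defaults in one place.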

+ Additional settings
+ -------------------
+ TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces
+ from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that has been shown to be
+ much faster than the traditional `addr2line`.
+
+ Retrieving Flight Recorder Data via an API
+ ------------------------------------------
+ Flight recorder data can also be retrieved via an API call.
+ The API is shown below with the default arguments.
+
+ .. code:: python
+
+   torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False)
+
+ To view the data, you can unpickle it:
+
+ .. code:: python
+
+   import pickle
+   import torch
+   t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
+   print(t)
+
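Printing the whole unpickled object can be overwhelming, so it may help to look at its top-level shape first. The sketch below is illustrative only: `summarize_trace` is a hypothetical helper, and the payload used to exercise it (the keys `entries` and `version`) is made up for the demonstration and is not the real dump schema. As with any pickle, only load data from a source you trust.

```python
import pickle


def summarize_trace(raw):
    """Unpickle a dump and describe its top-level shape.

    Makes no assumption about the schema: it reports the type of the
    unpickled object and, for dicts, the type and length of each value.
    """
    data = pickle.loads(raw)
    if not isinstance(data, dict):
        return {"type": type(data).__name__}
    summary = {}
    for key, value in data.items():
        size = len(value) if hasattr(value, "__len__") else None
        summary[key] = (type(value).__name__, size)
    return summary


# Stand-in payload for illustration only; real bytes would come from
# torch._C._distributed_c10d._dump_nccl_trace() or from a dumped file.
fake_dump = pickle.dumps({"entries": [{"seq": 1}, {"seq": 2}], "version": "demo"})
for key, (kind, size) in summarize_trace(fake_dump).items():
    print(key, kind, size)
```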
Flight Recorder File Formats
----------------------------
Flight recorder files are dumped out in `pickle` format. Files are written out to local disks or mounted shared NFS
@@ -118,13 +142,16 @@ Analyzing Flight Recorder Dumps
We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze captured
data.

- To run it, one can use command line:
+ 1. In order to run the convenience script, the dump files from all ranks must first be copied into a single directory.
+
+ 2. To run it, use the following command:

.. code:: shell

  python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
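The copy step that precedes the analyzer run can be scripted. This is a minimal sketch under stated assumptions: `gather_dumps` is a hypothetical helper, not part of PyTorch, and it assumes that each rank's dump files have distinct file names, so it refuses to overwrite a file that is already in the destination.

```python
import shutil
from pathlib import Path


def gather_dumps(source_dirs, dest_dir):
    """Copy every regular file from each source directory into dest_dir.

    Hypothetical helper: assumes file names are unique across ranks and
    raises instead of overwriting if two files share a name.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_dirs:
        for path in sorted(Path(source).iterdir()):
            if not path.is_file():
                continue  # skip subdirectories
            target = dest / path.name
            if target.exists():
                raise FileExistsError(f"{target} already exists")
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

The directory passed as `dest_dir` is then the one to give to `fr_trace.py` with `-d`.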


- Additional settings
- -------------------
- TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces
- from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much
- faster than the traditional `addr2line`.
+ Conclusion
+ ----------
+ This tutorial introduced a new PyTorch diagnostic tool called `flight recorder`. It described how flight
+ recorder can be enabled to collect diagnostic data from a machine.
+ Data captured by flight recorder can be analyzed using a convenience script in the `tools/flight_recorder` directory
+ in the PyTorch repository.