Commit f8012a9

More HTML and formatting fixes
Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
1 parent 2ec0fae commit f8012a9

1 file changed

+69
-64
lines changed

@@ -1,14 +1,19 @@
11
(prototype) Flight Recorder for Debugging
22
=========================================
3-
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`, `Junjie Wang <https://github.com/fduwjj>`
3+
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_, `Junjie Wang <https://github.com/fduwjj>`_
44

55
What you will learn
66
-------------------
77
* Learn about a new tool for debugging stuck jobs during distributed training.
88
* Learn how you can enable the tool and use the collected data for analyzing stuck jobs.
99

10+
Prerequisites
11+
-------------
12+
- PyTorch version 2.5 or later.
13+
14+
1015
Overview
11-
-----------------------------------
16+
--------
1217
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
1318
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
1419
that require significant computational resources.
@@ -41,25 +46,19 @@ There are two core parts to Flight Recorder.
4146
Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
4247
- An analyzer script is available in the ``pytorch/tools/flight_recorder`` directory (details below).
4348

44-
Prerequisites
45-
-------------
46-
- PyTorch version 2.5 or later.
47-
4849
Enabling Flight Recorder
4950
------------------------
50-
There are two required environment variables to get the initial version of flight recorder working.
51-
- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. N represents the number of
52-
entries that will be kept internally in a circular buffer. We recommended to set this value at 2000.
53-
- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled,
54-
there will be one file per rank output in the jobs running directory.
55-
51+
There are two required environment variables to get the initial version of Flight Recorder working.
52+
- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. ``N`` represents the number of entries that will be kept internally in a circular buffer. We recommend setting this value to 2000.
53+
- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled, there will be one file per rank output in the job's running directory.
54+
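As a minimal sketch, the two required variables can also be set from Python. The helper name ``flight_recorder_env`` is hypothetical and not part of PyTorch; only the variable names and the recommended buffer size of 2000 come from the text above. The sketch assumes the variables need to be in the environment before the distributed process group is initialized.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_on_timeout=True):
    """Return the environment settings described above as strings.

    Hypothetical helper for illustration; not a PyTorch API.
    """
    if buffer_size < 0:
        raise ValueError("buffer_size must be 0 (disabled) or a positive number")
    return {
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "true" if dump_on_timeout else "false",
    }


# Assumption: apply these before the distributed process group is initialized.
os.environ.update(flight_recorder_env())
```

Keeping the settings in one function makes it easy to pass the same values to every rank's launcher.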
5655
**Optional settings:**
5756

5857
- ``TORCH_NCCL_TRACE_CPP_STACK (true, false)``: Setting this to ``true`` enables C++ stack trace captures in Flight Recorder. This is useful for slow
5958
``addr2line`` - for more information, see additional settings.
60-
- TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and
59+
- ``TORCH_NCCL_ENABLE_TIMING (true, false)``: Setting this to ``true`` enables additional CUDA events at the start of each collective and
6160
records the ``duration`` of each collective. This may incur some CPU overhead. In the collected data, the
62-
``duration`` filed indicates how long each collective took to execute..
61+
``duration`` field indicates how long each collective took to execute.
6362

6463
Additional Settings
6564
-------------------
@@ -76,11 +75,14 @@ You can also retrieve Flight Recorder data with an API call.
7675
Below is the API with the default arguments:
7776

7877
.. code:: python
78+
7979
torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False)
8080
81+
8182
To view the data, you can unpickle it as shown below:
8283

8384
.. code:: python
85+
8486
t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
8587
print(t)
8688
@@ -96,61 +98,63 @@ The contents of a Flight Recorder ``unpickled`` file are shown below:
9698
{
9799
"version": "2.3",
98100
"pg_config": {
99-
"0": {
100-
"name": "0",
101-
"desc": "default_pg",
102-
"ranks": "[0, 1]"
103-
}
101+
"0": {
102+
"name": "0",
103+
"desc": "default_pg",
104+
"ranks": "[0, 1]"
105+
}
104106
},
105107
"pg_status": {
106-
"0": {
107-
"last_enqueued_collective": 2,
108-
"last_started_collective": -1,
109-
"last_completed_collective": 2
110-
}
108+
"0": {
109+
"last_enqueued_collective": 2,
110+
"last_started_collective": -1,
111+
"last_completed_collective": 2
112+
}
111113
},
112114
"entries": [
113-
{
114-
"frames": [
115-
{
116-
"name": "test_short_pickle",
117-
"filename": "pytorch/test/distributed/test_c10d_nccl.py",
118-
"line": 3647
119-
},
120-
...
121-
{
122-
"name": "spawn_main",
123-
"filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
124-
"line": 116
125-
},
126-
{
127-
"name": "<module>",
128-
"filename": "<string>",
129-
"line": 1
130-
}
131-
],
132-
"record_id": 0,
133-
"pg_id": 0,
134-
"process_group": ("0", "default_pg"),
135-
"collective_seq_id": 1,
136-
"p2p_seq_id": 0,
137-
"op_id": 1,
138-
"profiling_name": "nccl:all_reduce",
139-
"time_created_ns": 1724779239936775119,
140-
"input_sizes": [[3, 4]],
141-
"input_dtypes": ["Float"],
142-
"output_sizes": [[3, 4]],
143-
"output_dtypes": ["Float"],
144-
"state": "completed",
145-
"time_discovered_started_ns": null,
146-
"time_discovered_completed_ns": 1724779239975811724,
147-
"retired": true,
148-
"timeout_ms": 600000,
149-
"is_p2p": false
150-
},
115+
{
116+
"frames": [
117+
{
118+
"name": "test_short_pickle",
119+
"filename": "pytorch/test/distributed/test_c10d_nccl.py",
120+
"line": 3647
121+
},
122+
...
123+
{
124+
"name": "spawn_main",
125+
"filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
126+
"line": 116
127+
},
128+
{
129+
"name": "<module>",
130+
"filename": "<string>",
131+
"line": 1
132+
}
133+
],
134+
"record_id": 0,
135+
"pg_id": 0,
136+
"process_group": ("0", "default_pg"),
137+
"collective_seq_id": 1,
138+
"p2p_seq_id": 0,
139+
"op_id": 1,
140+
"profiling_name": "nccl:all_reduce",
141+
"time_created_ns": 1724779239936775119,
142+
"input_sizes": [[3, 4]],
143+
"input_dtypes": ["Float"],
144+
"output_sizes": [[3, 4]],
145+
"output_dtypes": ["Float"],
146+
"state": "completed",
147+
"time_discovered_started_ns": null,
148+
"time_discovered_completed_ns": 1724779239975811724,
149+
"retired": true,
150+
"timeout_ms": 600000,
151+
"is_p2p": false
152+
},
151153
...]
154+
152155
}
153156
157+
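To make the structure above concrete, here is a small sketch that scans an unpickled trace for collectives that have not completed. It assumes the trace is a plain dictionary with the field names shown in the listing. The helper name ``find_incomplete`` is hypothetical, and the sample data is hand-written for illustration (the ``scheduled`` state in particular is a placeholder for any non-completed state).

```python
def find_incomplete(trace):
    """Return (collective_seq_id, profiling_name, state) for every entry
    whose state is not "completed"."""
    return [
        (e["collective_seq_id"], e["profiling_name"], e["state"])
        for e in trace.get("entries", [])
        if e.get("state") != "completed"
    ]


# Hand-written sample mirroring the field names in the listing above.
sample_trace = {
    "version": "2.3",
    "entries": [
        {"collective_seq_id": 1, "profiling_name": "nccl:all_reduce", "state": "completed"},
        # "scheduled" is a placeholder value for a non-completed state.
        {"collective_seq_id": 2, "profiling_name": "nccl:all_reduce", "state": "scheduled"},
    ],
}

for seq_id, name, state in find_incomplete(sample_trace):
    print(f"collective {seq_id} ({name}) is in state {state!r}")
```

Running the same check on each rank's trace gives a quick first look at which collective a rank stopped on, before turning to the analyzer script.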
154158
Analyzing Flight Recorder Dumps
155159
-------------------------------
156160

@@ -163,13 +167,14 @@ To run the convenience script, follow these steps:
163167

164168
2. To run the script, use this command:
165169
.. code:: bash
170+
166171
python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
167172
168173
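If you prefer to launch the script from Python, the command above can be assembled as an argument list. This is only a sketch: the helper ``build_fr_trace_command`` is hypothetical, the paths are placeholders, and only the ``-d`` and ``-o`` options shown above are used.

```python
import sys


def build_fr_trace_command(dump_dir, output_file=None, script="fr_trace.py"):
    """Build the argument list for the analyzer command shown above."""
    cmd = [sys.executable, script, "-d", dump_dir]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


# Placeholder paths; pass the list to subprocess.run(cmd, check=True) to execute it.
cmd = build_fr_trace_command("/tmp/fr_dumps", "/tmp/fr_report.txt")
print(" ".join(cmd))
```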
169174
Conclusion
170175
----------
171-
In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
176+
In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
172177
We have discussed how to enable Flight Recorder to collect diagnostic data from a machine.
173-
Additionally, we explored how to analyze the data captured from the flight recorder using a
178+
Additionally, we explored how to analyze the data captured from the Flight Recorder using a
174179
convenience script located in the `tools/flight_recorder <https://github.com/pytorch/pytorch/tree/main/tools/flight_recorder>`__
175180
directory of the PyTorch repository.
