Commit 4223ce7

More HTML formatting changes
Test Plan: Ran rst2html5 and viewed HTML on browser.
1 parent f8012a9 commit 4223ce7

prototype_source/flight_recorder_tutorial.rst

Lines changed: 86 additions & 80 deletions
@@ -25,61 +25,66 @@ A distributed AI training job is considered `stuck` when it stops making meaning
 time.

 A job can get stuck for various reasons:
-- **Data Starvation:** This occurs when the training job is not receiving data at the expected rate, possibly due to
-  issues with the data pipeline or the data source.
-- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU, GPU, or
-  memory), the job might not be able to proceed.
-- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on different
-  devices. If there are network issues, communication between these devices may be disrupted, causing the job to get
-  stuck.
-- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a job to
-  get stuck.
-- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need
-  to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can
-  occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
-  indefinite wait for the job to progress.
+
+- **Data Starvation:** This occurs when the training job is not receiving data at the expected rate, possibly due to issues with the data pipeline or the data source.
+
+- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU, GPU, or memory), the job might not be able to proceed.
+
+- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on different devices. If there are network issues, communication between these devices may be disrupted, causing the job to get stuck.
+
+- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a job to get stuck.
+
+- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an indefinite wait for the job to progress.

 Flight Recorder, as the name suggests, captures diagnostics information as collectives run. The captured diagnostic
-information can be used to help identify the root cause of issues when jobs get stuck.
+information is used to help root cause issues when jobs get stuck.
 There are two core parts to Flight Recorder.
-- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer.
-  Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
-- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).
+
+- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
+
+- An analyzer script is available in the `tools/flight_recorder <https://github.com/pytorch/pytorch/tree/main/tools/flight_recorder>`__ directory (details below).
+The analyzer script runs known heuristics using the collected data and attempts to automatically identify the underlying issue that caused the job to stall.

 Enabling Flight Recorder
 ------------------------
 There are two required environment variables to get the initial version of Flight Recorder working.
-- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. N represents the number of entries that will be kept internally in a circular buffer. We recommended to set this value at 2000.
-- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled, there will be one file per rank output in the jobs running directory.
+
+- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection.
+  ``N`` represents the number of entries that will be kept internally in a circular buffer.
+  We recommend setting this value to 2000.
+- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout.
+  If enabled, there will be one file per rank output in the job's running directory.
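A minimal sketch of how these two required variables might be set in practice, assuming the job is launched in the usual way (for example with ``torchrun``) so that rendezvous variables such as ``RANK`` and ``WORLD_SIZE`` are already provided:

.. code:: python

  import os

  import torch.distributed as dist

  # Flight Recorder reads these variables when the NCCL process group is created,
  # so set them before the process group is initialized (or export them in the
  # launcher environment instead).
  os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # keep the last 2000 entries
  os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # one dump file per rank on timeout

  dist.init_process_group(backend="nccl")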

 **Optional settings:**

-- ```TORCH_NCCL_TRACE_CPP_STACK (true, false)``: Setting this to true enables C++ stack stack trace captures in Flight Recorder. This is useful for slow
-  ``addr2line`` - for more information, see additional settings.
-- TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and
+- ``TORCH_NCCL_TRACE_CPP_STACK (true, false)``: Setting this to ``true`` enables C++ stack trace captures in Flight Recorder.
+  C++ stack traces can be useful in providing the exact code path from a PyTorch Python call down to the primitive
+  C++ implementations. Also see ``TORCH_SYMBOLIZE_MODE`` in additional settings.
+- ``TORCH_NCCL_ENABLE_TIMING (true, false)``: Setting this to ``true`` enables additional CUDA events at the start of each collective and
   records the `duration` of each collective. This may incur some CPU overhead. In the collected data, the
   ``duration`` field indicates how long each collective took to execute.

 Additional Settings
 -------------------

-``TORCH_SYMBOLIZE_MODE {dladdr, addr2line, fast}:`` This setting determines the program used to retrieve C++ traces
-from a running program. The default setting is ``addr2line``. ``fast`` is a new experimental mode that is shown to be much
-faster than the traditional ``addr2line``. Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect
-C++ traces in the Flight Recorder` data.
+- ``TORCH_SYMBOLIZE_MODE {dladdr, addr2line, fast}``: This setting determines the program used to retrieve C++ traces from a running program.
+  The default setting is ``addr2line``.
+
+  ``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
+  Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
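A short sketch of how the optional C++ trace capture described above might be combined with the ``fast`` symbolizer; the values shown are illustrative:

.. code:: python

  import os

  # Enable collection plus C++ stack trace capture for each recorded collective.
  os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"
  os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"

  # Resolve C++ frames with the experimental (faster) symbolizer instead of addr2line.
  os.environ["TORCH_SYMBOLIZE_MODE"] = "fast"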

 Retrieving Flight Recorder Data via an API
 ------------------------------------------

 You can also retrieve Flight Recorder data with an API call.
-Below is the API with the default arguments:
+The API with the default arguments is shown below:

 .. code:: python

   torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False)


-To view the data, you can unpickle it as shown below:
+To view the data, you can ``unpickle`` it as shown below:

 .. code:: python

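As a rough sketch of the retrieve-and-unpickle flow described in this hunk, assuming ``_dump_nccl_trace`` returns the in-memory buffer as pickled bytes and an NCCL process group with Flight Recorder enabled is already running:

.. code:: python

  import pickle

  import torch

  raw = torch._C._distributed_c10d._dump_nccl_trace(
      includeCollectives=True, includeStackTraces=True, onlyActive=False
  )
  trace = pickle.loads(raw)
  # Top-level keys mirror the dump shown below: "version", "pg_config",
  # "pg_status", "entries".
  print(trace["version"], len(trace["entries"]))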
@@ -93,65 +98,65 @@ Flight Recorder files are dumped in ``pickle`` format. Files are written to loca
 folders.

 The contents of a Flight Recorder ``unpickled`` file are shown below:
-.. code-block: json
+
+.. code-block:: json

   {
-    "version": "2.3",
+    "version": "2.5",
     "pg_config": {
-      "0": {
-        "name": "0",
-        "desc": "default_pg",
-        "ranks": "[0, 1]"
-      }
+      "0": {
+        "name": "0",
+        "desc": "default_pg",
+        "ranks": "[0, 1]"
+      }
     },
     "pg_status": {
-      "0": {
-        "last_enqueued_collective": 2,
-        "last_started_collective": -1,
-        "last_completed_collective": 2
-      }
+      "0": {
+        "last_enqueued_collective": 2,
+        "last_started_collective": -1,
+        "last_completed_collective": 2
+      }
     },
     "entries": [
       {
-        "frames": [
-          {
-            "name": "test_short_pickle",
-            "filename": "pytorch/test/distributed/test_c10d_nccl.py",
-            "line": 3647
-          },
-          ...
-          {
-            "name": "spawn_main",
-            "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
-            "line": 116
-          },
-          {
-            "name": "<module>",
-            "filename": "<string>",
-            "line": 1
-          }
-        ],
-        "record_id": 0,
-        "pg_id": 0,
-        "process_group": ("0", "default_pg"),
-        "collective_seq_id": 1,
-        "p2p_seq_id": 0,
-        "op_id": 1,
-        "profiling_name": "nccl:all_reduce",
-        "time_created_ns": 1724779239936775119,
-        "input_sizes": [[3, 4]],
-        "input_dtypes": ["Float"],
-        "output_sizes": [[3, 4]],
-        "output_dtypes": ["Float"],
-        "state": "completed",
-        "time_discovered_started_ns": null,
-        "time_discovered_completed_ns": 1724779239975811724,
-        "retired": true,
-        "timeout_ms": 600000,
-        "is_p2p": false
-      },
-      ...]
-
+        "frames": [
+          {
+            "name": "test_short_pickle",
+            "filename": "pytorch/test/distributed/test_c10d_nccl.py",
+            "line": 3647
+          },
+          {
+            "name": "spawn_main",
+            "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
+            "line": 116
+          },
+          {
+            "name": "<module>",
+            "filename": "<string>",
+            "line": 1
+          }
+        ],
+        "record_id": 0,
+        "pg_id": 0,
+        "process_group": ("0", "default_pg"),
+        "collective_seq_id": 1,
+        "p2p_seq_id": 0,
+        "op_id": 1,
+        "profiling_name": "nccl:all_reduce",
+        "time_created_ns": 1724779239936775119,
+        "input_sizes": [[3, 4]],
+        "input_dtypes": ["Float"],
+        "output_sizes": [[3, 4]],
+        "output_dtypes": ["Float"],
+        "state": "completed",
+        "time_discovered_started_ns": null,
+        "time_discovered_completed_ns": 1724779239975811724,
+        "retired": true,
+        "timeout_ms": 600000,
+        "is_p2p": false
+      },
+      ...
+      ]
   }

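Since the on-disk dumps are plain pickles with the structure shown above, a single rank's file can be inspected the same way; the path below is a hypothetical placeholder, as the actual dump location and file name depend on the job's configuration:

.. code:: python

  import pickle

  # Hypothetical path for one rank's dump file.
  with open("/tmp/nccl_trace_rank_0", "rb") as f:
      dump = pickle.load(f)

  for entry in dump["entries"]:
      print(entry["profiling_name"], entry["state"], entry["collective_seq_id"])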
@@ -166,14 +171,15 @@ To run the convenience script, follow these steps:
 1. Copy all files from a rank into a single directory.

 2. To run the script, use this command:
+
 .. code:: python

   python fr_trace.py -d <dump dir containing trace files> [-o <output file>]


 Conclusion
 ----------
-In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
+In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
 We have discussed how to enable Flight Recorder to collect diagnostic data from a machine.
 Additionally, we explored how to analyze the data captured from the Flight Recorder using a
 convenience script located in the `tools/flight_recorder <https://github.com/pytorch/pytorch/tree/main/tools/flight_recorder>`__
