Commit bd1f808

Add additional settings section
Summary: Update tutorial
1 parent c3a0475 commit bd1f808

1 file changed

+72
-64
lines changed

prototype_source/flight_recorder_tutorial.rst

Lines changed: 72 additions & 64 deletions
@@ -35,7 +35,7 @@ Flight Recorder captures diagnostics information as collectives run. The diagnos
3535
cause the underlying issue. There are 2 core parts to flight recorder.
3636
- The collection portion. When enabled, information about collectives is recorded in an in-memory circular buffer.
3737
Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
38-
- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). T
38+
- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).
3939

4040
Enabling Flight Recorder
4141
------------------------
@@ -51,72 +51,80 @@ Optional settings:
5151

5252
Flight Recorder File Formats
5353
----------------------------
54-
Flight recorder files are dumped out in `pickle` format.
55-
```
56-
{
57-
"version": "2.3",
58-
"pg_config": {
59-
"0": {
60-
"name": "0",
61-
"desc": "default_pg",
62-
"ranks": "[0, 1]"
63-
}
64-
},
65-
"pg_status": {
66-
"0": {
67-
"last_enqueued_collective": 2,
68-
"last_started_collective": -1,
69-
"last_completed_collective": 2
70-
}
71-
},
72-
"entries": [
73-
{
74-
"frames": [
75-
{
76-
"name": "test_short_pickle",
77-
"filename": "pytorch/test/distributed/test_c10d_nccl.py",
78-
"line": 3647
79-
},
80-
...
81-
{
82-
"name": "spawn_main",
83-
"filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
84-
"line": 116
85-
},
86-
{
87-
"name": "<module>",
88-
"filename": "<string>",
89-
"line": 1
90-
}
91-
],
92-
"record_id": 0,
93-
"pg_id": 0,
94-
"process_group": ("0", "default_pg"),
95-
"collective_seq_id": 1,
96-
"p2p_seq_id": 0,
97-
"op_id": 1,
98-
"profiling_name": "nccl:all_reduce",
99-
"time_created_ns": 1724779239936775119,
100-
"input_sizes": [[3, 4]],
101-
"input_dtypes": ["Float"],
102-
"output_sizes": [[3, 4]],
103-
"output_dtypes": ["Float"],
104-
"state": "completed",
105-
"time_discovered_started_ns": null,
106-
"time_discovered_completed_ns": 1724779239975811724,
107-
"retired": true,
108-
"timeout_ms": 600000,
109-
"is_p2p": false
54+
Flight recorder files are dumped out in `pickle` format. Files are written out to local disks or mounted shared NFS
55+
folders.
56+
The contents of an unpickled flight recorder file are shown below.
57+
.. code-block:: json
58+
{
59+
"version": "2.3",
60+
"pg_config": {
61+
"0": {
62+
"name": "0",
63+
"desc": "default_pg",
64+
"ranks": "[0, 1]"
65+
}
11066
},
111-
...]
112-
}
113-
```
67+
"pg_status": {
68+
"0": {
69+
"last_enqueued_collective": 2,
70+
"last_started_collective": -1,
71+
"last_completed_collective": 2
72+
}
73+
},
74+
"entries": [
75+
{
76+
"frames": [
77+
{
78+
"name": "test_short_pickle",
79+
"filename": "pytorch/test/distributed/test_c10d_nccl.py",
80+
"line": 3647
81+
},
82+
...
83+
{
84+
"name": "spawn_main",
85+
"filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
86+
"line": 116
87+
},
88+
{
89+
"name": "<module>",
90+
"filename": "<string>",
91+
"line": 1
92+
}
93+
],
94+
"record_id": 0,
95+
"pg_id": 0,
96+
"process_group": ("0", "default_pg"),
97+
"collective_seq_id": 1,
98+
"p2p_seq_id": 0,
99+
"op_id": 1,
100+
"profiling_name": "nccl:all_reduce",
101+
"time_created_ns": 1724779239936775119,
102+
"input_sizes": [[3, 4]],
103+
"input_dtypes": ["Float"],
104+
"output_sizes": [[3, 4]],
105+
"output_dtypes": ["Float"],
106+
"state": "completed",
107+
"time_discovered_started_ns": null,
108+
"time_discovered_completed_ns": 1724779239975811724,
109+
"retired": true,
110+
"timeout_ms": 600000,
111+
"is_p2p": false
112+
},
113+
...]
114+
}
115+
114116
Analyzing Flight Recorder Dumps
115117
-------------------------------
116118
We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze captured
117119
data.
118120

119-
To run it, one can use command line:
120-
```
121-
python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
122-
```
121+
To run it, use the command line:
122+
.. code:: shell
123+
python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
124+
125+
126+
Additional settings
127+
-------------------
128+
`TORCH_SYMBOLIZE_MODE` (one of `dladdr`, `addr2line`, `fast`): This setting controls the program used to retrieve C++ traces
129+
from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that has been shown to be much
130+
faster than the traditional `addr2line`.
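
Reviewer note: since the tutorial says dumps are written in `pickle` format, a short sketch of inspecting one by hand may help readers. The example below is a minimal illustration, not the analyzer script from `pytorch/tools/flight_recorder`. The helper names `load_dump` and `summarize_dump` are hypothetical, and the key names (`version`, `entries`, `state`, `collective_seq_id`, `profiling_name`) are taken only from the sample contents shown in the diff above, so they may differ in other versions of the format.

```python
import pickle
from collections import Counter


def load_dump(path):
    """Unpickle one flight recorder dump file into a Python object.

    Hypothetical helper. pickle can execute arbitrary code while loading,
    so only open dump files produced by jobs you trust.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_dump(dump):
    """Count entries per state and list entries not marked "completed".

    Assumes the layout of the sample in the tutorial: a dict with a
    "version" string and an "entries" list of per-collective dicts.
    """
    entries = dump.get("entries", [])
    states = Counter(e.get("state", "unknown") for e in entries)
    pending = [
        (e.get("collective_seq_id"), e.get("profiling_name"))
        for e in entries
        if e.get("state") != "completed"
    ]
    return {
        "version": dump.get("version"),
        "states": dict(states),
        "pending": pending,
    }
```

Calling `summarize_dump(load_dump(path))` on each rank's file gives a quick view of which collectives had not completed when the dump was taken. For cross-rank analysis, the `fr_trace.py` script described in the tutorial remains the intended tool.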
