@@ -35,7 +35,7 @@ Flight Recorder captures diagnostics information as collectives run. The diagnos
cause the underlying issue. There are 2 core parts to flight recorder.
- The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer.
Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
- - An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). T
+ - An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).

Enabling Flight Recorder
------------------------
@@ -51,72 +51,80 @@ Optional settings:

Flight Recorder File Formats
----------------------------
- Flight recorder files are dumped out in `pickle` format.
- ```
- {
-   "version": "2.3",
-   "pg_config": {
-     "0": {
-       "name": "0",
-       "desc": "default_pg",
-       "ranks": "[0, 1]"
-     }
-   },
-   "pg_status": {
-     "0": {
-       "last_enqueued_collective": 2,
-       "last_started_collective": -1,
-       "last_completed_collective": 2
-     }
-   },
-   "entries": [
-     {
-       "frames": [
-         {
-           "name": "test_short_pickle",
-           "filename": "pytorch/test/distributed/test_c10d_nccl.py",
-           "line": 3647
-         },
-         ...
-         {
-           "name": "spawn_main",
-           "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
-           "line": 116
-         },
-         {
-           "name": "<module>",
-           "filename": "<string>",
-           "line": 1
-         }
-       ],
-       "record_id": 0,
-       "pg_id": 0,
-       "process_group": ("0", "default_pg"),
-       "collective_seq_id": 1,
-       "p2p_seq_id": 0,
-       "op_id": 1,
-       "profiling_name": "nccl:all_reduce",
-       "time_created_ns": 1724779239936775119,
-       "input_sizes": [[3, 4]],
-       "input_dtypes": ["Float"],
-       "output_sizes": [[3, 4]],
-       "output_dtypes": ["Float"],
-       "state": "completed",
-       "time_discovered_started_ns": null,
-       "time_discovered_completed_ns": 1724779239975811724,
-       "retired": true,
-       "timeout_ms": 600000,
-       "is_p2p": false
+ Flight recorder files are dumped out in `pickle` format. Files are written out to local disks or mounted shared NFS
+ folders.
+ The contents of a flight recorder `unpickled` file are shown below.
+ .. code-block:: json
+   {
+     "version": "2.3",
+     "pg_config": {
+       "0": {
+         "name": "0",
+         "desc": "default_pg",
+         "ranks": "[0, 1]"
+       }
    },
-   ...]
- }
- ```
+     "pg_status": {
+       "0": {
+         "last_enqueued_collective": 2,
+         "last_started_collective": -1,
+         "last_completed_collective": 2
+       }
+     },
+     "entries": [
+       {
+         "frames": [
+           {
+             "name": "test_short_pickle",
+             "filename": "pytorch/test/distributed/test_c10d_nccl.py",
+             "line": 3647
+           },
+           ...
+           {
+             "name": "spawn_main",
+             "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
+             "line": 116
+           },
+           {
+             "name": "<module>",
+             "filename": "<string>",
+             "line": 1
+           }
+         ],
+         "record_id": 0,
+         "pg_id": 0,
+         "process_group": ("0", "default_pg"),
+         "collective_seq_id": 1,
+         "p2p_seq_id": 0,
+         "op_id": 1,
+         "profiling_name": "nccl:all_reduce",
+         "time_created_ns": 1724779239936775119,
+         "input_sizes": [[3, 4]],
+         "input_dtypes": ["Float"],
+         "output_sizes": [[3, 4]],
+         "output_dtypes": ["Float"],
+         "state": "completed",
+         "time_discovered_started_ns": null,
+         "time_discovered_completed_ns": 1724779239975811724,
+         "retired": true,
+         "timeout_ms": 600000,
+         "is_p2p": false
+       },
+     ...]
+   }
+
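Since the added text says the dump is a pickle file whose unpickled contents look like the sample above, a reader could inspect one with a few lines of Python. This is a minimal sketch, not part of the change under review: the helper names `load_dump` and `summarize_dump` are hypothetical, the dump is assumed to unpickle to a plain dictionary with the `version` and `entries` keys shown in the sample, and any `state` value other than `"completed"` is an assumption.

```python
import pickle
from collections import Counter


def load_dump(path):
    """Unpickle one Flight Recorder dump file (hypothetical helper).

    Only unpickle files you trust: pickle can run arbitrary code on load.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_dump(dump):
    """Summarize a dump, assuming the dictionary layout of the sample above."""
    entries = dump.get("entries", [])
    # Count entries per recorded state, e.g. "completed".
    states = Counter(entry.get("state") for entry in entries)
    # Collect entries whose state is anything other than "completed".
    pending = [
        (entry.get("record_id"), entry.get("profiling_name"))
        for entry in entries
        if entry.get("state") != "completed"
    ]
    return {
        "version": dump.get("version"),
        "num_entries": len(entries),
        "states": dict(states),
        "pending": pending,
    }
```

Calling `summarize_dump(load_dump(path))` on a single rank's file gives a quick overview before running the full analyzer; entries not yet marked `"completed"` may be worth a closer look.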
Analyzing Flight Recorder Dumps
-------------------------------
We have convenient scripts available in `pytorch/tools/flight_recorder` directory that can be used to analyze captured
data.

- To run it, one can use command line:
- ```
- python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
- ```
+ To run it, one can use command line:
+ .. code:: shell
+   python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
+
+
+ Additional settings
+ -------------------
+ TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces
+ from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much
+ faster than the traditional `addr2line`.
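The `TORCH_SYMBOLIZE_MODE` setting described above is an environment variable, so one way to select a mode is to set it in the process environment before the job starts. The sketch below is illustrative only: `set_symbolize_mode` is a hypothetical helper, and it assumes the variable has to be in place before PyTorch is imported or worker processes are launched, which the text above does not state.

```python
import os

# Mode names as listed in the setting's description above.
VALID_SYMBOLIZE_MODES = ("dladdr", "addr2line", "fast")


def set_symbolize_mode(mode, env=None):
    """Validate `mode` and store it under TORCH_SYMBOLIZE_MODE (hypothetical helper).

    `env` defaults to os.environ; pass a plain dict to build an environment
    for a child process instead of changing the current one.
    """
    if env is None:
        env = os.environ
    if mode not in VALID_SYMBOLIZE_MODES:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {VALID_SYMBOLIZE_MODES}"
        )
    env["TORCH_SYMBOLIZE_MODE"] = mode
    return env
```

Validating against the three listed names catches typos early, which seems safer than silently passing an unknown value through.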