(prototype) Flight Recorder for Debugging
=========================================
- **Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_
+ **Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_, `Junjie Wang <https://github.com/fduwjj>`_

This tutorial introduces a new tool for debugging stuck jobs during distributed training.
@@ -9,7 +9,7 @@ Background and Motivation
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.
- An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that
+ An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that
subsequent training can be done faster. A trained usable model is the final desired outcome.
One of the biggest impediments to completing training is the concept of a "stuck job".
@@ -37,9 +37,9 @@ cause the underlying issue. There are 2 core parts to flight recorder.
Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).

- Enabling Flight Recorder
- ------------------------
- There are 2 required environment variables to get the initial version of flight recorder working.
+ Enabling Flight Recorder
+ ------------------------
+ There are two required environment variables to get the initial version of flight recorder working.
- TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N), where N is a positive number. N = collection enabled. Recommended to set this
to 2000.
- TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false). true = write out diagnostic files to disk on job timeout.
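
As a minimal sketch, the two variables can also be set from Python rather than from the launching shell. The helper name `enable_flight_recorder` is hypothetical, and the sketch assumes the variables need to be in place before the NCCL process group is initialized.

```python
import os


def enable_flight_recorder(env=None, buffer_size=2000):
    """Hypothetical helper: set the two required variables in an environment mapping.

    The variable names and the suggested size of 2000 come from the list above.
    """
    if buffer_size <= 0:
        # A positive N is what enables collection.
        raise ValueError("buffer_size must be a positive number to enable collection")
    target = os.environ if env is None else env
    target["TORCH_NCCL_TRACE_BUFFER_SIZE"] = str(buffer_size)
    target["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"
    return target
```

Calling `enable_flight_recorder()` with no arguments writes to `os.environ`. Passing a plain dict instead is convenient when building the environment for a subprocess.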
@@ -52,10 +52,71 @@ Optional settings:
Flight Recorder File Formats
----------------------------
Flight recorder files are dumped out in `pickle` format.
-
-
-
+ ```
+ {
+   "version": "2.3",
+   "pg_config": {
+     "0": {
+       "name": "0",
+       "desc": "default_pg",
+       "ranks": "[0, 1]"
+     }
+   },
+   "pg_status": {
+     "0": {
+       "last_enqueued_collective": 2,
+       "last_started_collective": -1,
+       "last_completed_collective": 2
+     }
+   },
+   "entries": [
+     {
+       "frames": [
+         {
+           "name": "test_short_pickle",
+           "filename": "pytorch/test/distributed/test_c10d_nccl.py",
+           "line": 3647
+         },
+         ...
+         {
+           "name": "spawn_main",
+           "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
+           "line": 116
+         },
+         {
+           "name": "<module>",
+           "filename": "<string>",
+           "line": 1
+         }
+       ],
+       "record_id": 0,
+       "pg_id": 0,
+       "process_group": ("0", "default_pg"),
+       "collective_seq_id": 1,
+       "p2p_seq_id": 0,
+       "op_id": 1,
+       "profiling_name": "nccl:all_reduce",
+       "time_created_ns": 1724779239936775119,
+       "input_sizes": [[3, 4]],
+       "input_dtypes": ["Float"],
+       "output_sizes": [[3, 4]],
+       "output_dtypes": ["Float"],
+       "state": "completed",
+       "time_discovered_started_ns": null,
+       "time_discovered_completed_ns": 1724779239975811724,
+       "retired": true,
+       "timeout_ms": 600000,
+       "is_p2p": false
+     },
+     ...]
+ }
+ ```
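
A dump with the layout shown above can be inspected by hand before reaching for the analyzer. The following is a rough sketch, not the analyzer's logic: `load_dump` and `summarize_dump` are hypothetical helpers, the field names are taken from the sample, and treating a process group as "lagging" when its last completed collective is behind its last enqueued one is an assumption made for illustration.

```python
import pickle
from collections import Counter


def load_dump(path):
    """Load one flight recorder dump file (pickle format).

    Only unpickle files from a trusted source: pickle can execute code on load.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_dump(dump):
    """Count entries per state and list process groups with unfinished collectives."""
    states = Counter(entry["state"] for entry in dump.get("entries", []))
    lagging = {
        pg_id: status
        for pg_id, status in dump.get("pg_status", {}).items()
        # Assumption: completed < enqueued means work is still outstanding.
        if status["last_completed_collective"] < status["last_enqueued_collective"]
    }
    return {"states": dict(states), "lagging_pgs": lagging}
```

Running `summarize_dump(load_dump(path))` on each rank's file gives a quick view of which ranks still have collectives outstanding.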
Analyzing Flight Recorder Dumps
-------------------------------
We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze captured
data.
+
+ To run it, use the command line:
+ ```
+ python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
+ ```
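
When the analyzer needs to run from a job wrapper rather than a shell, the same command line can be assembled in Python. This is a sketch under stated assumptions: `build_fr_trace_cmd` and `run_fr_trace` are hypothetical helpers, only the `-d` and `-o` flags shown above are used, and the script is assumed to be reachable at the given path.

```python
import subprocess
import sys


def build_fr_trace_cmd(dump_dir, output_file=None, script="fr_trace.py"):
    """Build the argv for the analyzer using only the -d and -o flags."""
    cmd = [sys.executable, script, "-d", dump_dir]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


def run_fr_trace(dump_dir, output_file=None, script="fr_trace.py"):
    """Run the analyzer and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_fr_trace_cmd(dump_dir, output_file, script), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the dump directory contains spaces.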