Commit c3a0475

fduwjj authored and c-p-i-o committed
Update flight_recorder_tutorial.rst
Add dump example and command line.
1 parent 107a9bf commit c3a0475

1 file changed: +69 -8 lines changed


prototype_source/flight_recorder_tutorial.rst

Lines changed: 69 additions & 8 deletions
@@ -1,6 +1,6 @@
 (prototype) Flight Recorder for Debugging
 =========================================
-**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_
+**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_, `Junjie Wang <https://github.com/fduwjj>`_
 
 This tutorial introduces a new tool for debugging stuck jobs during distributed training.
 
@@ -9,7 +9,7 @@ Background and Motivation
 An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
 as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
 that require significant computational resources.
-An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that
+An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that
 subsequent training can be done faster. A trained usable model is the final desired outcome.
 One of the biggest impediments to completing training is the concept of a "stuck job".
 
@@ -37,9 +37,9 @@ cause the underlying issue. There are 2 core parts to flight recorder.
 Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
 - An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). T
 
-Enabling Flight Recorder
-------------------------
-There are 2 required environment variables to get the initial version of flight recorder working.
+Enabling Flight Recorder
+------------------------
+There are two required environment variables to get the initial version of flight recorder working.
 - TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a positive number) N = collection enabled. Recommended to set this
 to 2000)
 - TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout.
@@ -52,10 +52,71 @@ Optional settings:
 Flight Recorder File Formats
 ----------------------------
 Flight recorder files are dumped out in `pickle` format.
-
-
-
+```
+{
+  "version": "2.3",
+  "pg_config": {
+    "0": {
+      "name": "0",
+      "desc": "default_pg",
+      "ranks": "[0, 1]"
+    }
+  },
+  "pg_status": {
+    "0": {
+      "last_enqueued_collective": 2,
+      "last_started_collective": -1,
+      "last_completed_collective": 2
+    }
+  },
+  "entries": [
+    {
+      "frames": [
+        {
+          "name": "test_short_pickle",
+          "filename": "pytorch/test/distributed/test_c10d_nccl.py",
+          "line": 3647
+        },
+        ...
+        {
+          "name": "spawn_main",
+          "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
+          "line": 116
+        },
+        {
+          "name": "<module>",
+          "filename": "<string>",
+          "line": 1
+        }
+      ],
+      "record_id": 0,
+      "pg_id": 0,
+      "process_group": ("0", "default_pg"),
+      "collective_seq_id": 1,
+      "p2p_seq_id": 0,
+      "op_id": 1,
+      "profiling_name": "nccl:all_reduce",
+      "time_created_ns": 1724779239936775119,
+      "input_sizes": [[3, 4]],
+      "input_dtypes": ["Float"],
+      "output_sizes": [[3, 4]],
+      "output_dtypes": ["Float"],
+      "state": "completed",
+      "time_discovered_started_ns": null,
+      "time_discovered_completed_ns": 1724779239975811724,
+      "retired": true,
+      "timeout_ms": 600000,
+      "is_p2p": false
+    },
+    ...]
+}
+```
 Analyzing Flight Recorder Dumps
 -------------------------------
 We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze captured
 data.
+
+To run the analyzer, use the following command line:
+```
+python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
+```
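
For a quick look at a single dump without the analyzer script, a minimal sketch along the following lines may help. It assumes each dump file holds one pickled dict with the keys shown in the example above (`version`, `entries`, and per-entry `state`, `collective_seq_id`, `profiling_name`). The helper names `load_dump` and `summarize_dump` are hypothetical and not part of PyTorch, and the set of possible `state` values other than `"completed"` is not listed in the tutorial, so the code only distinguishes completed from everything else.

```python
import pickle
from collections import Counter


def load_dump(path):
    """Load one dump file; assumes it holds a single pickled dict."""
    # pickle can execute arbitrary code on load, so only open trusted files.
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize_dump(dump):
    """Count entries per state and list collectives not marked completed."""
    entries = dump.get("entries", [])
    state_counts = Counter(entry.get("state", "unknown") for entry in entries)
    pending = [
        (entry.get("collective_seq_id"), entry.get("profiling_name"))
        for entry in entries
        if entry.get("state") != "completed"
    ]
    return {
        "version": dump.get("version"),
        "state_counts": dict(state_counts),
        "pending": pending,
    }
```

Calling `summarize_dump(load_dump(path))` on each rank's file and comparing the `pending` lists is one rough way to spot a rank that stopped at a different collective. The `fr_trace.py` script remains the intended tool for real analysis.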

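The two required environment variables from the "Enabling Flight Recorder" section can also be set from Python, as in the sketch below. The helper `flight_recorder_env` is hypothetical. Two things here are assumptions rather than statements from the tutorial: that a value of 0 disables collection, and that the variables need to be in place before the NCCL process group is created.

```python
import os


def flight_recorder_env(buffer_size=2000, dump_on_timeout=True):
    """Build the two required settings as strings (hypothetical helper)."""
    if buffer_size < 0:
        raise ValueError("buffer_size must be 0 or a positive number")
    return {
        # N > 0 enables collection; the tutorial recommends 2000.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(buffer_size),
        # "true" writes diagnostic files to disk on job timeout.
        "TORCH_NCCL_DUMP_ON_TIMEOUT": "true" if dump_on_timeout else "false",
    }


# Assumption: set these before the process group is initialized.
os.environ.update(flight_recorder_env())
```

Exporting the same two variables in the launch script has the same effect. The Python form is mainly convenient in tests or notebooks.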