(prototype) Flight Recorder for Debugging
=========================================
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_

This tutorial introduces a new tool for debugging stuck jobs during distributed training.

Background and Motivation
-------------------------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.

A distributed AI training job is considered "stuck" when it stops making meaningful progress for an extended period of
time.

A job can get stuck for various reasons:

- Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to
  issues with the data pipeline or the data source.
- Resource Constraints: If the system running the job does not have enough computational resources (like CPU, GPU, or
  memory), the job might not be able to proceed.
- Network Issues: In a distributed training setup, different parts of the model or data may be processed on different
  devices. If there are network issues, communication between these devices may be disrupted, causing the job to get
  stuck.
- Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to
  get stuck.
- Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need
  to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can
  occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
  indefinite wait for the job to progress.

Flight Recorder captures diagnostic information as collectives run. The diagnostic information can be used to help root
cause the underlying issue. There are two core parts to Flight Recorder:

- The collection portion. When enabled, information about collectives is recorded in an in-memory circular buffer.
  Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to a file (a hedged retrieval sketch
  follows this list).
- An analyzer script, available in the `pytorch/tools/flight_recorder` directory (details below).
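
The buffer can also be retrieved on demand from inside the training process. The sketch below is only an illustration
under stated assumptions: `_dump_nccl_trace` in `torch._C._distributed_c10d` is a private PyTorch binding that returns
the buffer as pickled bytes, and its availability and signature are internal details that may change between releases
or be absent in some builds.

.. code:: python

    import torch.distributed as dist
    # Private API (assumption): location and behavior may change across releases.
    from torch._C._distributed_c10d import _dump_nccl_trace


    def dump_flight_recorder(path: str) -> None:
        """Write the current in-memory Flight Recorder buffer to a file."""
        if not dist.is_initialized():
            return  # nothing to dump if no process group has been created
        trace_bytes = _dump_nccl_trace()  # pickled bytes of the circular buffer
        with open(path, "wb") as f:
            f.write(trace_bytes)


    # Example: each rank writes its own dump file.
    # dump_flight_recorder(f"/tmp/fr_dump_rank{dist.get_rank()}")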

Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working (a minimal setup
sketch follows the optional settings below):

- TORCH_NCCL_TRACE_BUFFER_SIZE (0, or a positive number N): N enables collection. The recommended setting is 2000.
- TORCH_NCCL_DUMP_ON_TIMEOUT (true, false): true writes diagnostic files to disk on job timeout.

Optional settings:

- TORCH_NCCL_TRACE_CPP_STACK (true, false): true enables C++ stack trace captures in Flight Recorder (addr2line can be
  slow; see the additional settings).
- TORCH_NCCL_ENABLE_TIMING (true, false): true enables additional CUDA events at the start of each collective and
  records the duration of each collective. This may incur some CPU overhead.
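
A minimal setup sketch follows, assuming the training script is launched with `torchrun` (or another launcher that
provides the usual distributed environment variables). The Flight Recorder variables are consumed by the NCCL backend,
so they should be in place before the process group is created, either exported in the launcher environment or set at
the top of the script as shown here.

.. code:: python

    import os

    # Required: enable collection (keep the last 2000 entries) and dump on timeout.
    os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"
    os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"

    # Optional extras described above (commented out by default).
    # os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"
    # os.environ["TORCH_NCCL_ENABLE_TIMING"] = "true"

    import torch.distributed as dist

    # The environment variables above should be in place before the NCCL
    # process group is created.
    dist.init_process_group(backend="nccl")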

Flight Recorder File Formats
----------------------------
Flight Recorder files are dumped out in `pickle` format.
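
Because the dumps are plain pickle files, they can be opened directly for a quick inspection. In the sketch below the
dump path is a placeholder, and since the exact layout of the unpickled object is not described in this section, the
code only prints whatever top-level structure it finds.

.. code:: python

    import pickle

    # Placeholder path: substitute the dump file written for one of your ranks.
    with open("/tmp/fr_dump_rank0", "rb") as f:
        dump = pickle.load(f)

    print(type(dump))
    if isinstance(dump, dict):
        print("top-level keys:", list(dump.keys()))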

Analyzing Flight Recorder Dumps
-------------------------------
We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze captured
data.
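
As a rough illustration of the workflow only: this section does not spell out the analyzer's entry point or flags, so
the script name and arguments below are assumptions; check the contents of `pytorch/tools/flight_recorder` for the
actual interface.

.. code:: python

    import subprocess

    # Assumed entry point and flag; verify against the scripts in
    # pytorch/tools/flight_recorder before use.
    subprocess.run(
        [
            "python",
            "tools/flight_recorder/fr_trace.py",  # assumed analyzer script
            "-d", "/tmp/fr_dumps",                # assumed: directory containing dump files
        ],
        check=True,
    )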