
Commit 107a9bf

More cleanup.
1 parent d00655d commit 107a9bf

1 file changed
Lines changed: 29 additions & 19 deletions
@@ -1,11 +1,11 @@
 (prototype) Flight Recorder for Debugging
 =========================================
-**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`
+**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_

 This tutorial introduces a new tool for debugging stuck jobs during distributed training.

-1. Background and Motivation
-----------------------------
+Background and Motivation
+-------------------------
 An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
 as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
 that require significant computational resources.
@@ -17,35 +17,45 @@ A distributed AI training job is considered "stuck" when it stops making meaning
 time.

 A job can get stuck for various reasons:
-Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to
+- Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to
 issues with the data pipeline or the data source.
-Resource Constraints: If the system running the job does not have enough computational resources (like CPU, GPU, or
+- Resource Constraints: If the system running the job does not have enough computational resources (like CPU, GPU, or
 memory), the job might not be able to proceed.
-Network Issues: In a distributed training setup, different parts of the model or data may be processed on different
+- Network Issues: In a distributed training setup, different parts of the model or data may be processed on different
 devices. If there are network issues, communication between these devices may be disrupted, causing the job to get
 stuck.
-Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to
+- Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to
 get stuck.
-Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need
+- Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need
 to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can
 occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
 indefinite wait for the job to progress.

 Flight Recorder captures diagnostic information as collectives run. The diagnostic information can be used to help root
 cause the underlying issue. There are 2 core parts to flight recorder.
-a) The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer.
+- The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer.
 Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
-b) An analyzer script is available in the `pytorch/tools/flight_recorder` directory. The analyzer script is capable of
-reading flight recorder records and performing an automated analysis on the collected data.
+- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).

-2. Enabling Flight Recorder
+Enabling Flight Recorder
+------------------------
 There are 2 required environment variables to get the initial version of flight recorder working.
-TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a postitive number) N = collection enabled. Recommended to set this to 2000)
-TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout.
+- TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a positive number) N = collection enabled. Recommended to set this
+  to 2000.
+- TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout.
 Optional settings:
-TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow addr2line -
-see advanced settings)
-TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and record
-‘duration’ of each collective. May incur some CPU overhead.
+- TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow
+  addr2line - see additional settings)
+- TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and
+  record the ‘duration’ of each collective. May incur some CPU overhead.
+
+Flight Recorder File Formats
+----------------------------
+Flight recorder files are dumped out in `pickle` format.
+
+

-3. Flight Recorder File Formats
+Analyzing Flight Recorder Dumps
+-------------------------------
+We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze captured
+data.
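The required and optional variables added in the diff above can be exercised end to end. Below is a minimal sketch, assuming the settings are read from the process environment when the NCCL process group is initialized; setting them in the job launcher's environment works equally well, and the training-script skeleton around them is purely illustrative:

    import os

    # Required: enable collection and dump-on-timeout before the NCCL process
    # group is created (values are read from the environment at initialization).
    os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # N > 0 enables collection; 2000 is the recommended size
    os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # write diagnostic files to disk on job timeout

    # Optional extras described above (left disabled in this sketch):
    # os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"  # capture C++ stack traces (addr2line can be slow)
    # os.environ["TORCH_NCCL_ENABLE_TIMING"] = "true"    # record per-collective duration (some CPU overhead)

    import torch.distributed as dist

    # Rank, world size, and rendezvous information are expected to come from the
    # launcher (for example torchrun); this training skeleton is hypothetical.
    dist.init_process_group(backend="nccl")
    # ... training loop with collectives ...
    dist.destroy_process_group()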

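Since the new "Flight Recorder File Formats" section states that dumps are written in `pickle` format, a captured file can be inspected with the standard library alone. A minimal sketch follows; the file name is hypothetical and the internal record layout is not specified in this commit:

    import pickle
    import pprint

    # Hypothetical path: the actual name and location of the dump depend on the job setup.
    dump_path = "nccl_flight_recorder_rank0.pickle"

    with open(dump_path, "rb") as f:
        records = pickle.load(f)  # deserialize the flight recorder dump

    # Print whatever structure was captured; the record schema is not documented here.
    pprint.pprint(records)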
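The "Analyzing Flight Recorder Dumps" section points at scripts under `pytorch/tools/flight_recorder` without naming an entry point. The invocation below is a guess for illustration only: the script name, the `-d` flag, and the dump directory are assumptions, so check the scripts in that directory for the actual interface:

    import subprocess
    import sys

    # Assumed entry point and flag; verify against pytorch/tools/flight_recorder.
    analyzer = "tools/flight_recorder/fr_trace.py"
    dump_dir = "/tmp/flight_recorder_dumps"  # hypothetical directory holding the per-rank dumps

    # Run the analyzer from inside a PyTorch checkout and fail loudly on error.
    subprocess.run([sys.executable, analyzer, "-d", dump_dir], check=True)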