Commit d00655d

[Doc] Flight recorder tutorial
Summary: Add a tutorial file for flight recorder. Test Plan: N/A Reviewers: Subscribers: Tasks: Tags:
1 parent f8717d3 commit d00655d
1 file changed: 51 additions, 0 deletions

(prototype) Flight Recorder for Debugging
=========================================
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_

This tutorial introduces a new tool for debugging stuck jobs during distributed training.

1. Background and Motivation
----------------------------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.
An engineer's goal is to complete an AI training job as fast as possible and make continuous improvements such that
subsequent training can be done faster. A trained, usable model is the final desired outcome.
One of the biggest impediments to completing training is the concept of a "stuck job".

A distributed AI training job is considered "stuck" when it stops making meaningful progress for an extended period of
time.

A job can get stuck for various reasons:

- **Data Starvation:** This happens when the training job is not receiving data at the expected rate. This could be due
  to issues with the data pipeline or the data source.
- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU,
  GPU, or memory), the job might not be able to proceed.
- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on
  different devices. If there are network issues, communication between these devices may be disrupted, causing the job
  to get stuck.
- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a
  job to get stuck.
- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and
  need to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a
  deadlock can occur if one or more ranks fail to join a collective while the remaining ranks have joined, resulting in
  an indefinite wait for the job to progress.
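
The deadlock scenario above can be made concrete with a small analogy in Python threads (this is purely illustrative, not actual distributed code): a barrier stands in for a collective such as ``all_reduce``, and one "rank" never joins it. A timeout is added so the sketch terminates instead of hanging the way a real stuck job would.

```python
import threading

# Analogy only: a barrier stands in for a collective such as all_reduce.
# Rank 1 never calls the collective, so rank 0 would wait forever; a
# timeout is used here so the example terminates instead of hanging.
barrier = threading.Barrier(parties=2)
result = []

def rank0():
    try:
        barrier.wait(timeout=0.5)  # join the collective and wait for peers
        result.append("completed")
    except threading.BrokenBarrierError:
        result.append("stuck: a peer never joined the collective")

def rank1():
    pass  # simulates a rank that fails to join the collective

threads = [threading.Thread(target=rank0), threading.Thread(target=rank1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result[0])  # the job ends up "stuck" rather than completed
```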

Flight Recorder captures diagnostic information as collectives run. This diagnostic information can be used to help
root cause the underlying issue. There are two core parts to Flight Recorder:

a) The collection portion. When enabled, information about collectives is recorded in an in-memory circular buffer.
   Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to a file.
b) An analyzer script, available in the `pytorch/tools/flight_recorder` directory, that reads Flight Recorder records
   and performs an automated analysis of the collected data.
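
The circular buffer in part (a) can be pictured as a bounded ring: once full, recording a new collective evicts the oldest entry. The sketch below is purely conceptual; Flight Recorder's real collection lives inside PyTorch's C++ NCCL process group code and records far richer state, and the class and field names here are invented for illustration only.

```python
from collections import deque

# Conceptual sketch only -- NOT Flight Recorder's actual data structure.
# A bounded deque behaves like an in-memory circular buffer: once full,
# appending a new entry evicts the oldest one.
class CollectiveTraceBuffer:
    def __init__(self, size):
        # size plays the role of TORCH_NCCL_TRACE_BUFFER_SIZE
        self.entries = deque(maxlen=size)

    def record(self, seq_id, op, status):
        self.entries.append({"seq_id": seq_id, "op": op, "status": status})

    def dump(self):
        # On timeout (or on demand) the buffer contents are written out.
        return list(self.entries)

buf = CollectiveTraceBuffer(size=3)
for i in range(5):
    buf.record(seq_id=i, op="all_reduce", status="completed")
print([e["seq_id"] for e in buf.dump()])  # only the 3 newest entries remain
```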

2. Enabling Flight Recorder
---------------------------
There are two required environment variables to get the initial version of Flight Recorder working:

- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0`` or ``N``, where ``N`` is a positive number): ``N`` enables collection. It is
  recommended to set this to ``2000``.
- ``TORCH_NCCL_DUMP_ON_TIMEOUT`` (``true``, ``false``): ``true`` writes diagnostic files to disk on job timeout.

Optional settings:

- ``TORCH_NCCL_TRACE_CPP_STACK`` (``true``, ``false``): ``true`` enables C++ stack trace captures in Flight Recorder
  (useful when ``addr2line`` is slow - see advanced settings).
- ``TORCH_NCCL_ENABLE_TIMING`` (``true``, ``false``): ``true`` enables additional CUDA events at the start of each
  collective and records the "duration" of each collective. May incur some CPU overhead.
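
Putting these settings together, a launch might look like the following shell sketch. The ``torchrun`` invocation and ``train.py`` script name are illustrative placeholders for your own entry point, not part of Flight Recorder itself.

```shell
# Required: enable collection (2000-entry buffer) and dump-on-timeout.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
export TORCH_NCCL_DUMP_ON_TIMEOUT=true

# Optional extras described above.
export TORCH_NCCL_TRACE_CPP_STACK=false
export TORCH_NCCL_ENABLE_TIMING=true

# Illustrative launch command -- substitute your own entry point.
# torchrun --nproc-per-node=8 train.py
echo "buffer size: $TORCH_NCCL_TRACE_BUFFER_SIZE"
```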

3. Flight Recorder File Formats
-------------------------------
