(prototype) Flight Recorder for Debugging
=========================================
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`__

This tutorial introduces a new tool for debugging stuck jobs during distributed training.

1. Background and Motivation
----------------------------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected over a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.
An engineer's goal is to complete an AI training job as fast as possible and to make continuous improvements so that
subsequent training runs can be completed faster. A trained, usable model is the final desired outcome.
One of the biggest impediments to completing training is the concept of a "stuck job".

A distributed AI training job is considered "stuck" when it stops making meaningful progress for an extended period of
time.

A job can get stuck for various reasons:

- **Data Starvation:** The training job does not receive data at the expected rate, for example because of issues with
  the data pipeline or the data source.
- **Resource Constraints:** The system running the job does not have enough computational resources (such as CPU, GPU,
  or memory) for the job to proceed.
- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on
  different devices. If there are network issues, communication between these devices may be disrupted, causing the
  job to get stuck.
- **Software Bugs or Errors:** Errors in the training code or in the underlying libraries and frameworks can also
  cause a job to get stuck.
- **Synchronization Issues:** In distributed training, different parts of the computation often run in parallel and
  need to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a
  deadlock can occur if one or more ranks fail to join a collective while the remaining ranks have joined, resulting
  in an indefinite wait for the job to progress.

Flight Recorder captures diagnostic information as collectives run. This diagnostic information can be used to help
root-cause the underlying issue. There are two core parts to Flight Recorder:

a) The collection portion. When enabled, information about collectives is recorded in an in-memory circular buffer.
   Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to a file (see the sketch below).
b) An analyzer script, available in the ``pytorch/tools/flight_recorder`` directory. The script reads the Flight
   Recorder records and performs an automated analysis of the collected data.
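
As a rough illustration of the on-demand retrieval path in part (a), the sketch below pulls the in-memory buffer from
the current process and unpickles it. It assumes a PyTorch build with NCCL support that exposes the internal helper
``torch._C._distributed_c10d._dump_nccl_trace()``; this is a private, unstable API, so its name and availability may
differ across versions.

.. code-block:: python

    # Sketch only: retrieve the Flight Recorder buffer from the current process.
    # Assumes a NCCL process group has already been initialized and that
    # TORCH_NCCL_TRACE_BUFFER_SIZE was set to a positive value beforehand.
    import pickle

    import torch

    # The (private) helper returns a pickled blob describing the recorded collectives.
    blob = torch._C._distributed_c10d._dump_nccl_trace()
    trace = pickle.loads(blob)

    # Print a rough summary of what was captured on this rank.
    print(type(trace))
    if isinstance(trace, dict):
        print(sorted(trace.keys()))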

2. Enabling Flight Recorder
---------------------------
There are two required environment variables to get the initial version of Flight Recorder working:

- ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: setting ``N`` to a positive number enables collection and sets the number
  of entries kept in the buffer; ``0`` disables collection. Setting this to ``2000`` is recommended.
- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: ``true`` writes the diagnostic files to disk on job timeout.

Optional settings:

- ``TORCH_NCCL_TRACE_CPP_STACK = (true, false)``: ``true`` enables C++ stack trace capture in Flight Recorder
  (``addr2line`` can be slow; see the advanced settings).
- ``TORCH_NCCL_ENABLE_TIMING = (true, false)``: ``true`` enables additional CUDA events at the start of each collective
  and records the duration of each collective. This may incur some CPU overhead.
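
Because Flight Recorder is configured through environment variables, they are typically set before the process group
is initialized. The snippet below is a minimal sketch of one way to do this from Python; the values simply mirror the
recommendations above, and the variables can equally be exported by your job launcher instead.

.. code-block:: python

    # Minimal sketch: configure Flight Recorder before initializing the NCCL
    # process group. The values mirror the recommendations in this section.
    import os

    os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # enable collection, keep 2000 entries
    os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # dump diagnostics to disk on timeout

    # Optional extras:
    # os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"  # capture C++ stack traces
    # os.environ["TORCH_NCCL_ENABLE_TIMING"] = "true"    # record per-collective durations

    import torch.distributed as dist

    # Assumes the script is launched with torchrun (or an equivalent launcher)
    # so that rank, world size, and rendezvous information are already provided.
    dist.init_process_group(backend="nccl")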

3. Flight Recorder File Formats
-------------------------------