What you will learn
-------------------

* Learn about a new tool for debugging stuck jobs during distributed training.
* Learn how you can enable the tool and use the collected data for analyzing stuck jobs.

Overview
--------

An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.

An engineer’s goal is to complete an AI training job as quickly as possible and make continuous improvements so that
subsequent training can be done faster. A trained, usable model is the final desired outcome.
One of the biggest impediments to completing training is the concept of a *stuck job*.

A distributed AI training job is considered ``stuck`` when it stops making meaningful progress for an extended period of
time.

A job can get stuck for various reasons:

- **Data Starvation:** This occurs when the training job is not receiving data at the expected rate, possibly due to
  issues with the data pipeline or the data source.

- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU, GPU, or
  memory), the job might not be able to proceed.

- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on different
  devices. If there are network issues, communication between these devices may be disrupted, causing the job to get
  stuck.

- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a job to
  get stuck.

- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need
  to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can
  occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
  indefinite wait for the job to progress.

Flight Recorder, as the name suggests, captures diagnostic information as collectives run. The captured diagnostic
information can be used to help identify the root cause of issues when jobs get stuck.
There are two core parts to Flight Recorder:

- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer.
  Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
- An analyzer script is available in the ``pytorch/tools/flight_recorder`` directory (details below).

Prerequisites
-------------

- PyTorch version 2.5 or later.

Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working.

- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. ``N`` represents the number of
  entries that will be kept internally in a circular buffer. We recommend setting this value to 2000.
- ``TORCH_NCCL_DUMP_ON_TIMEOUT`` (``true``, ``false``): Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled,
  there will be one file per rank output in the job's running directory.

**Optional settings:**

- ``TORCH_NCCL_TRACE_CPP_STACK`` (``true``, ``false``): Setting this to ``true`` enables C++ stack trace captures in Flight Recorder. Symbolizing
  C++ traces with the default ``addr2line`` can be slow - see the additional settings below for a faster alternative.
- ``TORCH_NCCL_ENABLE_TIMING`` (``true``, ``false``): Setting this to ``true`` enables additional CUDA events at the start of each collective and
  records the ``duration`` of each collective. This may incur some CPU overhead. In the collected data, the
  ``duration`` field indicates how long each collective took to execute.
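
For example, here is a minimal sketch of enabling these settings from inside a training script. The environment
variable names come from the list above; the process-group setup is illustrative only and assumes a standard
``torchrun``-style launch (the same variables can equally be exported in the launch environment).

.. code:: python

  import os

  # Set the Flight Recorder variables before the NCCL process group is created,
  # since they are read when the process group is constructed.
  os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # keep the last 2000 collective entries
  os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # write one diagnostic file per rank on timeout
  os.environ["TORCH_NCCL_ENABLE_TIMING"] = "true"      # optional: record per-collective durations

  import torch.distributed as dist

  # Assumes rank and world-size variables were set by the launcher (for example, torchrun).
  dist.init_process_group(backend="nccl")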

Additional Settings
-------------------

``TORCH_SYMBOLIZE_MODE`` (``dladdr``, ``addr2line``, ``fast``): This setting determines the program used to retrieve C++ traces
from a running program. The default setting is ``addr2line``. ``fast`` is a new experimental mode that is shown to be much
faster than the traditional ``addr2line``. Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect
C++ traces in the Flight Recorder data.
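
As a minimal sketch, the two settings can be combined at the top of the training script (or exported in the launch
environment) like this:

.. code:: python

  import os

  # Collect C++ stack traces in Flight Recorder and symbolize them with the
  # experimental "fast" mode instead of the default addr2line.
  os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"
  os.environ["TORCH_SYMBOLIZE_MODE"] = "fast"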

Retrieving Flight Recorder Data via an API
------------------------------------------

You can also retrieve Flight Recorder data with an API call.
Below is the API with the default arguments:

.. code:: python

  torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False)

To view the data, you can unpickle it as shown below:

.. code:: python

  import pickle

  t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
  print(t)
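
If you collect the data through this API rather than relying on the automatic dump on timeout, a simple pattern is to
write the pickled bytes to one file per rank (this assumes the process group is already initialized). The directory and
file names below are only examples, not a required layout.

.. code:: python

  import torch
  import torch.distributed as dist

  # Retrieve the Flight Recorder buffer on demand and save the pickled bytes
  # for this rank. The paths and names here are placeholders.
  rank = dist.get_rank()
  dump = torch._C._distributed_c10d._dump_nccl_trace(
      includeCollectives=True, includeStackTraces=True, onlyActive=False
  )
  with open(f"/tmp/flight_recorder_rank{rank}", "wb") as f:
      f.write(dump)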

Flight Recorder File Formats
----------------------------

Flight Recorder files are dumped in ``pickle`` format. Files are written to local disks or mounted shared NFS
folders.
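
Because these files are regular pickles, you can load a single rank's dump for a quick look before running the full
analyzer script. The path below is a placeholder for wherever your job wrote its dump files.

.. code:: python

  import pickle

  # Load one rank's Flight Recorder dump; replace the path with a real dump file.
  with open("/path/to/flight_recorder_dump_rank0", "rb") as f:
      trace = pickle.load(f)

  # Top-level keys include entries such as "version" and "pg_config" (see below).
  print(trace.keys())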

The contents of a Flight Recorder ``unpickled`` file are shown below:

.. code-block:: json

  {
    "version": "2.3",
    "pg_config": {

Analyzing Flight Recorder Dumps
-------------------------------

We have convenient scripts available in the ``pytorch/tools/flight_recorder`` directory for analyzing captured
data.

To run the convenience script, follow these steps:

1. Copy all files from all ranks into a single directory.

2. To run the script, use this command:

   .. code:: shell

      python fr_trace.py -d <dump dir containing trace files> [-o <output file>]

Conclusion
----------

In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
We have discussed how to enable Flight Recorder to collect diagnostic data from a machine.
Additionally, we explored how to analyze the data captured from Flight Recorder using a
convenience script located in the `tools/flight_recorder <https://github.com/pytorch/pytorch/tree/main/tools/flight_recorder>`__
directory of the PyTorch repository.