Commit 2ec0fae

Browse files
c-p-i-o and svekars committed
Apply suggestions from code review
Address code formatting changes

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
1 parent 95432e0 commit 2ec0fae

File tree

1 file changed: +56, -43 lines changed


prototype_source/flight_recorder_tutorial.rst

Lines changed: 56 additions & 43 deletions
@@ -4,86 +4,95 @@

What you will learn
-------------------
* Learn about a new tool for debugging stuck jobs during distributed training.
* Learn how you can enable the tool and use the collected data for analyzing stuck jobs.

Overview
-----------------------------------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.
An engineer’s goal is to complete an AI training job as quickly as possible and make continuous improvements so that
subsequent training can be done faster. A trained, usable model is the final desired outcome.
One of the biggest impediments to completing training is the concept of a *stuck job*.

A distributed AI training job is considered `stuck` when it stops making meaningful progress for an extended period of
time.

A job can get stuck for various reasons:

- **Data Starvation:** This occurs when the training job is not receiving data at the expected rate, possibly due to
  issues with the data pipeline or the data source.
- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU, GPU, or
  memory), the job might not be able to proceed.
- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on different
  devices. If there are network issues, communication between these devices may be disrupted, causing the job to get
  stuck.
- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a job to
  get stuck.
- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need
  to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can
  occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
  indefinite wait for the job to progress.

Flight Recorder, as the name suggests, captures diagnostic information as collectives run. The captured diagnostic
information can be used to help identify the root cause of issues when jobs get stuck.
There are two core parts to Flight Recorder:

- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer.
  Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).

Prerequisites
-------------
- PyTorch version 2.5 or later.
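
As a quick check (a small illustrative snippet, not part of the original steps), you can confirm the installed PyTorch
version:

.. code:: python

   import torch

   # Flight Recorder requires PyTorch 2.5 or later.
   print(torch.__version__)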

Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working. An example of
setting these variables is shown after the settings list below.

- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. ``N`` represents the number of
  entries that will be kept internally in a circular buffer. We recommend setting this value to ``2000``.
- ``TORCH_NCCL_DUMP_ON_TIMEOUT`` (``true``, ``false``): Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled,
  there will be one file per rank output in the job's running directory.

**Optional settings:**

- ``TORCH_NCCL_TRACE_CPP_STACK`` (``true``, ``false``): Setting this to ``true`` enables C++ stack trace captures in Flight Recorder. This is useful
  when ``addr2line`` is slow - for more information, see the Additional Settings section below.
- ``TORCH_NCCL_ENABLE_TIMING`` (``true``, ``false``): Setting this to ``true`` enables additional CUDA events at the start of each collective and
  records the ``duration`` of each collective. This may incur some CPU overhead. In the collected data, the
  ``duration`` field indicates how long each collective took to execute.
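
The following is a minimal sketch of how these variables might be set in Python before the NCCL process group is
created. The exact placement is an assumption for illustration - in practice, the variables are typically exported in
the job's launch environment (for example, when launching with ``torchrun``).

.. code:: python

   import os

   import torch.distributed as dist

   # Required settings: enable collection and dump-on-timeout.
   os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # keep 2000 entries in the circular buffer
   os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # write one file per rank on job timeout

   # Optional settings: C++ stack traces and per-collective timing.
   os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"
   os.environ["TORCH_NCCL_ENABLE_TIMING"] = "true"

   # The settings take effect for process groups created after this point.
   dist.init_process_group(backend="nccl")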

Additional Settings
-------------------

``TORCH_SYMBOLIZE_MODE`` (``dladdr``, ``addr2line``, ``fast``): This setting determines the program used to retrieve C++ traces
from a running program. The default setting is ``addr2line``. ``fast`` is a new experimental mode that is shown to be much
faster than the traditional ``addr2line``. Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect
C++ traces in the Flight Recorder data.
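
For example, a minimal sketch of combining these two settings (again assuming they are set before the process group is
created):

.. code:: python

   import os

   # Capture C++ stack traces and symbolize them with the experimental fast mode.
   os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"
   os.environ["TORCH_SYMBOLIZE_MODE"] = "fast"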

Retrieving Flight Recorder Data via an API
------------------------------------------

You can also retrieve Flight Recorder data with an API call.
Below is the API with the default arguments:

.. code:: python

   torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False)

To view the data, you can unpickle it as shown below:

.. code:: python

   import pickle
   import torch

   t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
   print(t)
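
As a hypothetical convenience for on-demand dumps, the raw bytes returned by the API could also be written to one file
per rank. The helper name, directory, and file naming below are illustrative assumptions, not part of the API:

.. code:: python

   import os

   import torch
   import torch.distributed as dist

   def dump_flight_recorder(dump_dir="/tmp/fr_dumps"):
       """Write this rank's Flight Recorder buffer to a file and return its path.

       Assumes the default process group has already been initialized.
       """
       os.makedirs(dump_dir, exist_ok=True)
       path = os.path.join(dump_dir, f"nccl_trace_rank_{dist.get_rank()}")
       with open(path, "wb") as f:
           f.write(torch._C._distributed_c10d._dump_nccl_trace())
       return path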

Flight Recorder File Formats
----------------------------

Flight Recorder files are dumped in ``pickle`` format. Files are written to local disks or mounted shared NFS
folders.

The contents of a Flight Recorder ``unpickled`` file are shown below:

.. code-block:: json

   {
     "version": "2.3",
     "pg_config": {
@@ -144,19 +153,23 @@ Contents of a flight recorder `unpickled` file is shown below.

Analyzing Flight Recorder Dumps
-------------------------------

We have convenient scripts available in the `pytorch/tools/flight_recorder` directory for analyzing captured
data.

To run the convenience script, follow these steps:

1. Copy all files from a rank into a single directory.

2. To run the script, use this command:

   .. code:: shell

      python fr_trace.py -d <dump dir containing trace files> [-o <output file>]

Conclusion
----------

In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
We have discussed how to enable Flight Recorder to collect diagnostic data from a machine.
Additionally, we explored how to analyze the data captured from Flight Recorder using a
convenience script located in the `tools/flight_recorder <https://github.com/pytorch/pytorch/tree/main/tools/flight_recorder>`__
directory of the PyTorch repository.
