Commit efc56f1

minor formatting changes
Summary: minor formatting changes.
1 parent 5d4b704 commit efc56f1

1 file changed: +3 -7 lines changed

prototype_source/flight_recorder_tutorial.rst

Lines changed: 3 additions & 7 deletions
@@ -1,5 +1,5 @@
-(prototype) Flight Recorder for Debugging
-=========================================
+(prototype) Flight Recorder for Debugging Stuck Jobs
+====================================================
 **Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_, `Junjie Wang <https://github.com/fduwjj>`_

 What you will learn
@@ -11,7 +11,6 @@ Prerequisites
 -------------
 - PyTorch version 2.5 or later.

-
 Overview
 --------
 An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
@@ -38,7 +37,7 @@ A job can get stuck for various reasons:

 Flight Recorder, as the name suggests, captures diagnostics information as collectives run. The captured diagnostic
 information is used to help identify the root causes of issues when jobs become stuck.
-Flight Recorder consists of two core parts:
+Flight Recorder consists of two core parts:

 - The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.

@@ -83,7 +82,6 @@ The API with the default arguments is shown below:

 torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False)

-
 To view the data, you can ``unpickle`` it as shown below:

 .. code:: python
@@ -159,7 +157,6 @@ The contents of a Flight Recorder ``unpickled`` file are shown below:
 ]
 }

-
 Analyzing Flight Recorder Dumps
 -------------------------------

@@ -176,7 +173,6 @@ To run the convenience script, follow these steps:

 python fr_trace.py -d <dump dir containing trace files> [-o <output file>]

-
 Conclusion
 ----------
 In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
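
For readers skimming this diff, the "collection portion" described in the tutorial text (records kept in an in-memory circular buffer that can be dumped to file and later unpickled) can be illustrated with a minimal sketch. This is not Flight Recorder's implementation and does not call any PyTorch API: the ``RingRecorder`` class, the ``load_dump`` helper, the entry fields, and the file name are all hypothetical, chosen only to show the buffer-then-dump idea.

```python
import pickle
import tempfile
from collections import deque
from pathlib import Path


class RingRecorder:
    """Hypothetical stand-in for an in-memory circular buffer of collective records."""

    def __init__(self, capacity: int) -> None:
        # deque(maxlen=...) drops the oldest entry once the buffer is full,
        # which gives circular-buffer behavior.
        self._entries = deque(maxlen=capacity)
        self._next_id = 0

    def record(self, op_name: str, state: str) -> None:
        # Field names here are illustrative, not the real dump schema.
        self._entries.append(
            {"record_id": self._next_id, "op": op_name, "state": state}
        )
        self._next_id += 1

    def dump(self, path: Path) -> None:
        # Persist whatever is currently in the buffer as a pickled list.
        with open(path, "wb") as f:
            pickle.dump(list(self._entries), f)


def load_dump(path: Path) -> list:
    # Read a dump back for inspection.
    with open(path, "rb") as f:
        return pickle.load(f)


recorder = RingRecorder(capacity=3)
for name in ["all_reduce", "broadcast", "all_gather", "reduce_scatter"]:
    recorder.record(name, "completed")

with tempfile.TemporaryDirectory() as tmp:
    dump_path = Path(tmp) / "rank0_dump.pkl"  # hypothetical file name
    recorder.dump(dump_path)
    entries = load_dump(dump_path)

# Only the three most recent records survive in a buffer of capacity 3.
print([e["op"] for e in entries])
```

With a capacity of 3 and four recorded operations, the oldest record is overwritten, so the dump holds only the most recent three. This is the trade-off a circular buffer makes: bounded memory in exchange for losing older history.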
