**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_, `Junjie Wang <https://github.com/fduwjj>`_
What you will learn
@@ -11,7 +11,6 @@ Prerequisites
-------------
- PyTorch version 2.5 or later.
-
Overview
--------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
@@ -38,7 +37,7 @@ A job can get stuck for various reasons:
Flight Recorder, as the name suggests, captures diagnostics information as collectives run. The captured diagnostic
information is used to help identify the root causes of issues when jobs become stuck.
-Flight Recorder consists of two core parts:
+Flight Recorder consists of two core parts:

- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
@@ -83,7 +82,6 @@ The API with the default arguments is shown below: