**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_, `Junjie Wang <https://github.com/fduwjj>`_

What you will learn
-------------------

* Learn about a new tool for debugging stuck jobs during distributed training.
* Learn how you can enable the tool and use the collected data for analyzing stuck jobs.

Prerequisites
-------------

- PyTorch version 2.5 or later.
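
A quick runtime check of this prerequisite (the version parsing here is a rough, illustrative sketch):

.. code:: python

    import torch

    # Flight Recorder requires PyTorch 2.5 or later.
    major, minor = (int(x) for x in torch.__version__.split(".")[:2])
    assert (major, minor) >= (2, 5), f"PyTorch 2.5+ required, found {torch.__version__}"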

Overview
--------

An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.

There are two core parts to Flight Recorder.

- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to a file.
- An analyzer script is available in the ``pytorch/tools/flight_recorder`` directory (details below).

Enabling Flight Recorder
------------------------

There are two required environment variables to get the initial version of Flight Recorder working; a sketch of setting them is shown after the list below.

- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. ``N`` represents the number of entries that will be kept internally in a circular buffer. We recommend setting this value to ``2000``.
- ``TORCH_NCCL_DUMP_ON_TIMEOUT`` (``true``, ``false``): Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled, there will be one file per rank output in the job's running directory.
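
For example, a minimal sketch of enabling collection from Python (the variables can equally be exported in the job launcher or shell; the process group setup here is illustrative):

.. code:: python

    import os

    # Flight Recorder reads these when the NCCL process group is created,
    # so they must be set before init_process_group.
    os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # keep the last 2000 entries
    os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # one dump file per rank on timeout

    import torch.distributed as dist

    # Illustrative setup; rank and world size normally come from the job launcher.
    dist.init_process_group(backend="nccl")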
**Optional settings:**

- ``TORCH_NCCL_TRACE_CPP_STACK`` (``true``, ``false``): Setting this to ``true`` enables C++ stack trace captures in Flight Recorder. Symbolizing C++ traces with ``addr2line`` can be slow - for more information, see Additional Settings.
- ``TORCH_NCCL_ENABLE_TIMING`` (``true``, ``false``): Setting this to ``true`` enables additional CUDA events at the start of each collective and records the ``duration`` of each collective. This may incur some CPU overhead. In the collected data, the ``duration`` field indicates how long each collective took to execute.
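
Continuing the sketch above, the optional variables can be set the same way, before the process group is initialized:

.. code:: python

    import os

    # Optional: capture C++ stack traces (symbolizing them can be slow).
    os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"
    # Optional: record CUDA events so each entry carries a `duration` field.
    os.environ["TORCH_NCCL_ENABLE_TIMING"] = "true"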

Additional Settings
-------------------

You can also retrieve Flight Recorder data with an API call.
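
For instance, a minimal sketch using the ``_dump_nccl_trace`` binding (a private API as of PyTorch 2.5, so its signature may change; the default arguments are shown):

.. code:: python

    import pickle

    import torch

    # Returns pickled Flight Recorder data for the calling rank.
    raw = torch._C._distributed_c10d._dump_nccl_trace(
        includeCollectives=True, includeStackTraces=True, onlyActive=False
    )
    trace = pickle.loads(raw)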