
Add a tutorial for PyTorch Flight recorder #3024

Closed

wants to merge 31 commits

31 commits
b2c013e
[Doc] Flight recorder tutorial
c-p-i-o Aug 20, 2024
2b69bcd
More cleanup.
c-p-i-o Aug 21, 2024
7df80d1
Update flight_recorder_tutorial.rst
fduwjj Aug 27, 2024
d4fd0d3
Add additional settings section
c-p-i-o Sep 3, 2024
23eef97
address code review comments
c-p-i-o Sep 3, 2024
f4cf1ff
add missing section
c-p-i-o Sep 3, 2024
f7db1b9
fix typo
c-p-i-o Sep 3, 2024
d86d623
Fixes
c-p-i-o Sep 3, 2024
c071784
Add flight recorder to prototype index
c-p-i-o Sep 3, 2024
aa65c84
Update prototype_index
c-p-i-o Sep 3, 2024
fb13629
Apply suggestions from code review
c-p-i-o Sep 5, 2024
2358201
More HTML and formatting fixes
c-p-i-o Sep 6, 2024
de5654a
More HTML formatting changes
c-p-i-o Sep 6, 2024
3b0c4cc
[dtensor][debug] CommDebugMode recipe (#3001)
XilunWu Aug 19, 2024
f952921
fix: rm `use_cuda` param (#3002)
shaoyuyoung Aug 19, 2024
fa03879
Add programmable Google Search to pytorch tutorials site (#2820)
svekars Aug 22, 2024
ab36383
Tutorial for AOTI Python runtime (#2997)
agunapal Aug 23, 2024
fc27f08
Create tutorial_submission_policy.md (#2995)
svekars Aug 24, 2024
9d97d8f
Removed upper-case letter/made 'download' the link text instead of 'h…
tstatler Aug 27, 2024
acdc91b
Add weights_only=True to torch.load (#3012)
svekars Aug 27, 2024
3dacf89
Fix typos in dynamic_quantization_bert_tutorial.rst (#3019)
hadh93 Aug 28, 2024
fc5a612
Improve custom ops tutorials (#3020)
zou3519 Aug 29, 2024
c1e792a
Removed outdated steps in README about running about setup.py (#3014)
tstatler Aug 29, 2024
202693f
Fix hovering over the GCS search button (#3005)
svekars Aug 29, 2024
5465f9b
Added warnings to select Pytorch mobile tutorials directing users to …
tstatler Aug 30, 2024
f45ddc2
Patched docs for torch_compile_tutorial (#2936)
ignaciobartol Aug 30, 2024
c7a99c5
Upgrade pygame version to 2.6.0 (#3025)
svekars Sep 4, 2024
deb89ba
Update intro_onnx.py (#3033)
labintsev Sep 5, 2024
a5d85ed
Fixed Rst formatting, minor text changes (#3029)
tstatler Sep 5, 2024
200c4e5
Add meta tag to torch_export_aoti_python (#3036)
svekars Sep 5, 2024
44493fb
Fix reference to dcp in loading example (#2972)
angerrp Sep 5, 2024
130 changes: 130 additions & 0 deletions prototype_source/flight_recorder_tutorial.rst
@@ -0,0 +1,130 @@
(prototype) Flight Recorder for Debugging
=========================================
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`__, `Junjie Wang <https://github.com/fduwjj>`__

This tutorial introduces a new tool for debugging stuck jobs during distributed training.

Background and Motivation
--------------------------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected over a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.

An engineer's goal is to complete an AI training job as quickly as possible and to make continuous improvements so that
subsequent training runs finish faster. A trained, usable model is the final desired outcome.

One of the biggest impediments to completing training is the "stuck job": a distributed training job is considered
"stuck" when it stops making meaningful progress for an extended period of time.

A job can get stuck for various reasons:

- Data Starvation: This happens when the training job is not receiving data at the expected rate, possibly due to
  issues with the data pipeline or the data source.
- Resource Constraints: If the system running the job does not have enough computational resources (such as CPU, GPU, or
  memory), the job might not be able to proceed.
- Network Issues: In a distributed training setup, different parts of the model or data may be processed on different
  devices. If there are network issues, communication between these devices may be disrupted, causing the job to get
  stuck.
- Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to
  get stuck.
- Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need
  to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can
  occur if one or more ranks fail to join a collective while the remaining ranks have joined, resulting in an
  indefinite wait for the job to progress.

Flight Recorder captures diagnostic information as collectives run. The diagnostic information can be used to help root
cause the underlying issue. There are two core parts to Flight Recorder:

- The collection portion. When enabled, information about collectives is recorded in an in-memory circular buffer.
  Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to a file (see the sketch below).
- An analyzer script, available in the `pytorch/tools/flight_recorder` directory (details below).
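
For on-demand retrieval, PyTorch exposes an internal helper that returns the current contents of the in-memory buffer
as pickled bytes. The sketch below relies on a private API (`torch._C._distributed_c10d._dump_nccl_trace`) that may
change between releases, and it must run in a process that already has an active NCCL process group with collection
enabled:

.. code-block:: python

    import pickle

    import torch

    # Internal API: returns the Flight Recorder buffer as pickled bytes.
    trace = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
    print(trace["version"], len(trace["entries"]))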

Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working:

- `TORCH_NCCL_TRACE_BUFFER_SIZE` (0 or N, where N is a positive number): setting N enables collection with a buffer of
  N entries. The recommended value is 2000.
- `TORCH_NCCL_DUMP_ON_TIMEOUT` (true, false): when true, diagnostic files are written to disk on job timeout.

Optional settings:

- `TORCH_NCCL_TRACE_CPP_STACK` (true, false): when true, C++ stack traces are captured in Flight Recorder (if
  `addr2line` is slow, see the Additional Settings section below).
- `TORCH_NCCL_ENABLE_TIMING` (true, false): when true, additional CUDA events are enabled at the start of each
  collective and the duration of each collective is recorded. This may incur some CPU overhead.
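
For example, these variables can be set from Python before the process group is created (they can equally be exported
in the launcher environment). The snippet below is a minimal sketch, not a full training script; the NCCL backend and
a `torchrun`-style rendezvous are assumed:

.. code-block:: python

    import os

    # Set Flight Recorder options before the NCCL process group is created
    # so that they take effect when the communicators are constructed.
    os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # enable collection, keep 2000 entries
    os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # dump diagnostics to disk on timeout

    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # assumes launch via torchrun (env:// rendezvous)
    # ... run collectives as usual; entries accumulate in the in-memory ring buffer ...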

Flight Recorder File Formats
----------------------------
Flight Recorder files are dumped in `pickle` format. Files are written to local disks or mounted shared NFS folders.

The contents of an unpickled Flight Recorder file are shown below:

.. code-block:: json
   {
     "version": "2.3",
     "pg_config": {
       "0": {
         "name": "0",
         "desc": "default_pg",
         "ranks": "[0, 1]"
       }
     },
     "pg_status": {
       "0": {
         "last_enqueued_collective": 2,
         "last_started_collective": -1,
         "last_completed_collective": 2
       }
     },
     "entries": [
       {
         "frames": [
           {
             "name": "test_short_pickle",
             "filename": "pytorch/test/distributed/test_c10d_nccl.py",
             "line": 3647
           },
           ...
           {
             "name": "spawn_main",
             "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
             "line": 116
           },
           {
             "name": "<module>",
             "filename": "<string>",
             "line": 1
           }
         ],
         "record_id": 0,
         "pg_id": 0,
         "process_group": ("0", "default_pg"),
         "collective_seq_id": 1,
         "p2p_seq_id": 0,
         "op_id": 1,
         "profiling_name": "nccl:all_reduce",
         "time_created_ns": 1724779239936775119,
         "input_sizes": [[3, 4]],
         "input_dtypes": ["Float"],
         "output_sizes": [[3, 4]],
         "output_dtypes": ["Float"],
         "state": "completed",
         "time_discovered_started_ns": null,
         "time_discovered_completed_ns": 1724779239975811724,
         "retired": true,
         "timeout_ms": 600000,
         "is_p2p": false
       },
       ...
     ]
   }
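
Because the dump is a plain pickle, it can also be inspected directly in Python. Below is a minimal sketch, assuming a
dump file named `nccl_trace_rank_0` in the current directory (the actual file name and location depend on your setup):

.. code-block:: python

    import pickle

    # Load one rank's Flight Recorder dump (the file name here is illustrative).
    with open("nccl_trace_rank_0", "rb") as f:
        trace = pickle.load(f)

    # The top-level keys mirror the structure shown above.
    print(trace["version"])
    for entry in trace["entries"]:
        print(entry["record_id"], entry["profiling_name"], entry["state"])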

Analyzing Flight Recorder Dumps
-------------------------------
We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze
captured data.

To run the analyzer from the command line:

.. code:: shell

    python fr_trace.py -d <dump dir containing trace files> [-o <output file>]


Additional Settings
-------------------
`TORCH_SYMBOLIZE_MODE` (dladdr, addr2line, fast): This setting controls the program that is used to retrieve C++ traces
from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that has been shown to be
much faster than the traditional `addr2line`.
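
For example, to combine C++ stack trace capture with the experimental fast symbolizer (a sketch; set these in the
training process's environment, alongside the required variables described above):

.. code-block:: python

    import os

    os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"  # capture C++ frames in each entry
    os.environ["TORCH_SYMBOLIZE_MODE"] = "fast"        # experimental symbolizer, faster than addr2line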