Add a tutorial for PyTorch Flight Recorder #3024
(prototype) Flight Recorder for Debugging
=========================================
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_, `Junjie Wang <https://github.com/fduwjj>`_

This tutorial introduces a new tool for debugging stuck jobs during distributed training.

Background and Motivation
-------------------------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.
An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that
subsequent training runs can be done faster. A trained, usable model is the final desired outcome.
One of the biggest impediments to completing training is the concept of a "stuck job".

A distributed AI training job is considered "stuck" when it stops making meaningful progress for an extended period of
time.

A job can get stuck for various reasons:

- Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to
  issues with the data pipeline or the data source.
- Resource Constraints: If the system running the job does not have enough computational resources (such as CPU, GPU, or
  memory), the job might not be able to proceed.
- Network Issues: In a distributed training setup, different parts of the model or data may be processed on different
  devices. If there are network issues, communication between these devices may be disrupted, causing the job to get
  stuck.
- Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to
  get stuck.
- Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need
  to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can
  occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
  indefinite wait for the job to progress.

Flight Recorder captures diagnostic information as collectives run. The diagnostic information can be used to help root
cause the underlying issue. There are two core parts to Flight Recorder:

- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer.
  Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to a file.
- The analyzer script, available in the `pytorch/tools/flight_recorder` directory (details below).

Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working.

- TORCH_NCCL_TRACE_BUFFER_SIZE (0 or N, where N is a positive number): setting N enables collection. We recommend
  setting this value to 2000.
- TORCH_NCCL_DUMP_ON_TIMEOUT (true, false): true = write out diagnostic files to disk on job timeout.

Optional settings:

- TORCH_NCCL_TRACE_CPP_STACK (true, false): true = enable C++ stack trace captures in Flight Recorder (if addr2line is
  slow, see Additional settings below).
- TORCH_NCCL_ENABLE_TIMING (true, false): true = enable additional CUDA events at the start of each collective and
  record the ‘duration’ of each collective. This may incur some CPU overhead.

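These variables are typically exported in the environment that launches the training job. As a minimal sketch, the
snippet below sets them programmatically before the NCCL process group is created; the chosen values and the
torchrun-style launch are illustrative assumptions.

.. code:: python

    import os

    import torch.distributed as dist

    # Flight Recorder settings should be in place before the NCCL process
    # group is created (the values below are illustrative).
    os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # enable collection
    os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"    # dump files on timeout
    # Optional:
    # os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"
    # os.environ["TORCH_NCCL_ENABLE_TIMING"] = "true"

    # Assumes the script is launched with torchrun (or a similar launcher) so
    # that rank, world size, and rendezvous variables are already set.
    dist.init_process_group(backend="nccl")

Exporting the variables in the launcher or job scheduler environment works just as well and keeps the settings
consistent across all ranks.
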
Flight Recorder File Formats
----------------------------
Flight Recorder files are dumped out in `pickle` format. Files are written out to local disks or mounted shared NFS
folders.
The contents of an unpickled Flight Recorder file are shown below.

.. code-block:: json

    {
        "version": "2.3",
        "pg_config": {
            "0": {
                "name": "0",
                "desc": "default_pg",
                "ranks": "[0, 1]"
            }
        },
        "pg_status": {
            "0": {
                "last_enqueued_collective": 2,
                "last_started_collective": -1,
                "last_completed_collective": 2
            }
        },
        "entries": [
            {
                "frames": [
                    {
                        "name": "test_short_pickle",
                        "filename": "pytorch/test/distributed/test_c10d_nccl.py",
                        "line": 3647
                    },
                    ...
                    {
                        "name": "spawn_main",
                        "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
                        "line": 116
                    },
                    {
                        "name": "<module>",
                        "filename": "<string>",
                        "line": 1
                    }
                ],
                "record_id": 0,
                "pg_id": 0,
                "process_group": ("0", "default_pg"),
                "collective_seq_id": 1,
                "p2p_seq_id": 0,
                "op_id": 1,
                "profiling_name": "nccl:all_reduce",
                "time_created_ns": 1724779239936775119,
                "input_sizes": [[3, 4]],
                "input_dtypes": ["Float"],
                "output_sizes": [[3, 4]],
                "output_dtypes": ["Float"],
                "state": "completed",
                "time_discovered_started_ns": null,
                "time_discovered_completed_ns": 1724779239975811724,
                "retired": true,
                "timeout_ms": 600000,
                "is_p2p": false
            },
        ...]
    }

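Since each dump is an ordinary pickle file, it can also be inspected directly from Python before running the analyzer
script. Below is a minimal sketch of doing so; the file path is a placeholder and depends on where your job writes its
Flight Recorder dumps.

.. code:: python

    import pickle

    # Placeholder path; the actual location and file name depend on how the
    # job is configured to write Flight Recorder dumps.
    with open("/some/dump/dir/nccl_trace_rank_0", "rb") as f:
        trace = pickle.load(f)

    print(trace["version"])
    # Print a one-line summary of each recorded collective entry.
    for entry in trace["entries"]:
        print(entry["record_id"], entry["profiling_name"], entry["state"])
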
Analyzing Flight Recorder Dumps
-------------------------------
We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze
captured data.

To run the analyzer, use the following command line:

.. code:: shell

    python fr_trace.py -d <dump dir containing trace files> [-o <output file>]

Additional settings
-------------------
TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ stack
traces from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to
be much faster than the traditional `addr2line`.