Update FR tutorial to include file path, writer and print usage #3058
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -48,6 +48,8 @@ Enabling Flight Recorder | |
------------------------ | ||
There are two required environment variables to get the initial version of Flight Recorder working. | ||
|
||
- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Sets the file path prefix where the Flight Recorder data is dumped. One file is written per | |
rank. The default value is ``/tmp/nccl_trace_rank_``. | |
- ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection. | ||
``N`` represents the number of entries that will be kept internally in a circular buffer. | ||
We recommend setting this value to *2000*. | |
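
As a minimal sketch of the settings above, the two variables could be exported in the launching shell before the job starts. The 4-rank job and the ``<prefix><rank>`` file naming printed by the loop are assumptions for illustration, inferred from the "file prefix, one file per rank" description rather than confirmed by this change.

.. code:: shell

   # Illustrative values: the prefix is the documented default and 2000 is the recommended size.
   export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/tmp/nccl_trace_rank_
   export TORCH_NCCL_TRACE_BUFFER_SIZE=2000

   # Assumption: each rank appends its rank number to the prefix.
   # Print the dump files a hypothetical 4-rank job would be expected to write.
   for rank in 0 1 2 3; do
       echo "${TORCH_NCCL_DEBUG_INFO_TEMP_FILE}${rank}"
   done

Exporting the variables in the shell that launches the job means every rank inherits the same prefix and buffer size.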
|
@@ -71,6 +73,9 @@ Additional Settings | |
|
||
``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``. | ||
Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data. | ||
- If you prefer to write the Flight Recorder data to your own storage rather than to local disk, you can define your own writer class. | |
This class should inherit from class ``::c10d::DebugInfoWriter`` and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter`` | ||
||
before initializing PyTorch distributed. | |
|
||
Retrieving Flight Recorder Data via an API | ||
------------------------------------------ | ||
|
@@ -169,9 +174,29 @@ To run the convenience script, follow these steps: | |
|
||
2. To run the script, use this command: | ||
|
||
.. code:: python | ||
.. code:: shell | ||
|
||
python fr_trace.py <dump dir containing trace files> [-o <output file>] | ||
|
||
If you install the PyTorch nightly build or build from scratch with ``USE_DISTRIBUTED=1``, you can use the following | |
command directly: | |
|
||
.. code:: shell | ||
|
||
torchfrtrace <dump dir containing trace files> [-o <output file>] | ||
|
||
|
||
Currently, we support two modes for the analyzer script. The first mode allows the script to apply some heuristics to the parsed flight | ||
recorder dumps to generate a report identifying potential culprits for the timeout. The second mode simply outputs the raw dumps. | |
By default, the script prints Flight Recorder dumps for all ranks and all ``ProcessGroups`` (PGs). This can be narrowed down to certain | |
ranks and PGs. An example command is: | ||
Suggested clarification to sentence: can be narrowed down to certain ranks and PGs using the |
||
|
||
Caveat: the ``tabulate`` module is required, so you might need to ``pip install`` it first. | |
|
||
.. code:: shell | ||
|
||
python fr_trace.py -d <dump dir containing trace files> [-o <output file>] | ||
python fr_trace.py <dump dir containing trace files> -j [--selected-ranks i j k ...] | ||
torchfrtrace <dump dir containing trace files> -j [--selected-ranks i j k ...] | ||
|
||
Conclusion | ||
---------- | ||
|
Nit: Can you move this to the Optional settings section?
Otherwise the comment on Line 49 can be confusing to the customer (there are two required environment variables).
Or you can fix the comment.