Commit a4a7647

Merge branch 'main' into triton_kernel

2 parents b9e2918 + 4126761

2 files changed: +36 -10 lines changed

2 files changed

+36
-10
lines changed

advanced_source/extend_dispatcher.rst

Lines changed: 5 additions & 5 deletions
@@ -17,7 +17,7 @@ to `register a dispatched operator in C++ <dispatcher>`_ and how to write a
 What's a new backend?
 ---------------------

-Adding a new backend to PyTorch requires a lot of developement and maintainence from backend extenders.
+Adding a new backend to PyTorch requires a lot of development and maintenance from backend extenders.
 Before adding a new backend, let's first consider a few common use cases and recommended solutions for them:

 * If you have new algorithms for an existing PyTorch operator, send a PR to PyTorch.

@@ -30,7 +30,7 @@ Before adding a new backend, let's first consider a few common use cases and rec

 In this tutorial we'll mainly focus on adding a new out-of-tree device below. Adding out-of-tree support
 for a different tensor layout might share many common steps with devices, but we haven't seen an example of
-such integrations yet so it might require addtional work from PyTorch to support it.
+such integrations yet so it might require additional work from PyTorch to support it.

 Get a dispatch key for your backend
 -----------------------------------

@@ -67,12 +67,12 @@ To create a Tensor on ``PrivateUse1`` backend, you need to set dispatch key in `
 Note that ``TensorImpl`` class above assumes your Tensor is backed by a storage like CPU/CUDA. We also
 provide ``OpaqueTensorImpl`` for backends without a storage. And you might need to tweak/override certain
 methods to fit your customized hardware.
-One example in pytorch repo is `Vulkan TensorImpl <https://github.com/pytorch/pytorch/blob/1.7/aten/src/ATen/native/vulkan/VulkanOpaqueTensorImpl.h>`_.
+One example in pytorch repo is `Vulkan TensorImpl <https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/vulkan/VulkanOpaqueTensorImpl.h>`_.


 .. note::
   Once the prototype is done and you plan to do regular releases for your backend extension, please feel free to
-  submit a PR to ``pytorch/pytorch`` to reserve a dedicated dispath key for your backend.
+  submit a PR to ``pytorch/pytorch`` to reserve a dedicated dispatch key for your backend.


 Get the full list of PyTorch operators

@@ -361,7 +361,7 @@ actively working on might improve the experience in the future:

 * Improve test coverage of generic testing framework.
 * Improve ``Math`` kernel coverage and more comprehensive tests to make sure ``Math``
-  kernel bahavior matches other backends like ``CPU/CUDA``.
+  kernel behavior matches other backends like ``CPU/CUDA``.
 * Refactor ``RegistrationDeclarations.h`` to carry the minimal information and reuse
   PyTorch's codegen as much as possible.
 * Support a backend fallback kernel to automatic convert inputs to CPU and convert the
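For readers following the dispatch-key discussion in this file, here is a minimal sketch (not part of this commit) of how an out-of-tree backend binds a kernel to the reserved ``PrivateUse1`` key via ``TORCH_LIBRARY_IMPL``; the operator choice and kernel body are illustrative placeholders:

.. code:: cpp

   #include <ATen/ATen.h>
   #include <torch/library.h>

   // Hypothetical out-of-tree kernel. A real backend would launch its
   // device kernel here and return a tensor backed by its own TensorImpl.
   at::Tensor custom_add(
       const at::Tensor& self,
       const at::Tensor& other,
       const at::Scalar& alpha) {
     TORCH_CHECK(false, "custom_add is a placeholder, not a real kernel");
     return at::Tensor();  // unreachable; satisfies the return type
   }

   // Route aten::add.Tensor to the custom kernel whenever inputs carry
   // the PrivateUse1 dispatch key.
   TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
     m.impl("add.Tensor", &custom_add);
   }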

prototype_source/flight_recorder_tutorial.rst

Lines changed: 31 additions & 5 deletions
@@ -46,13 +46,15 @@ Flight Recorder consists of two core parts:

 Enabling Flight Recorder
 ------------------------
-There are two required environment variables to get the initial version of Flight Recorder working.
+There are three required environment variables to get the initial version of Flight Recorder working.

 - ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
   ``N`` represents the number of entries that will be kept internally in a circular buffer.
-  We recommended to set this value at *2000*.
+  We recommend setting this value to *2000*, which is also the default.
 - ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout.
-  If enabled, there will be one file per rank output in the job's running directory.
+  If enabled, there will be one file per rank output in the job's running directory. The default value is ``false``.
+- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Sets the path (with file prefix) where the flight recorder dump is written, one file
+  per rank. The default value is ``/tmp/nccl_trace_rank_``.

 **Optional settings:**

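For concreteness, a sketch showing the three required variables with the values discussed above. They are normally exported in the job's launch script; setting them from C++ with POSIX ``setenv`` is just one option, and it must happen before the NCCL process group is created:

.. code:: cpp

   #include <cstdlib>

   // Enable Flight Recorder collection before initializing the NCCL
   // process group. Values mirror the recommendations/defaults above.
   void enableFlightRecorder() {
     setenv("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000", /*overwrite=*/1);
     setenv("TORCH_NCCL_DUMP_ON_TIMEOUT", "true", /*overwrite=*/1);
     setenv("TORCH_NCCL_DEBUG_INFO_TEMP_FILE", "/tmp/nccl_trace_rank_", /*overwrite=*/1);
   }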
@@ -71,6 +73,10 @@ Additional Settings

   ``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
   Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
+- If you prefer not to have the flight recorder data dumped to local disk but rather to your own storage, you can define your own writer class.
+  This class should inherit from ``::c10d::DebugInfoWriter`` `(code) <https://github.com/pytorch/pytorch/blob/release/2.5/torch/csrc/distributed/c10d/NCCLUtils.hpp#L237>`__
+  and then be registered with ``::c10d::DebugInfoWriter::registerWriter`` `(code) <https://github.com/pytorch/pytorch/blob/release/2.5/torch/csrc/distributed/c10d/NCCLUtils.hpp#L242>`__
+  before PyTorch distributed is initialized; see the sketch after this hunk.

 Retrieving Flight Recorder Data via an API
 ------------------------------------------
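A minimal sketch of such a custom writer, assuming the ``c10d::DebugInfoWriter`` constructor takes a file-name prefix and a rank as in the linked ``NCCLUtils.hpp`` header; the upload step is a hypothetical placeholder:

.. code:: cpp

   #include <memory>
   #include <string>
   #include <torch/csrc/distributed/c10d/NCCLUtils.hpp>

   // Hypothetical writer that forwards the flight recorder dump to your
   // own storage instead of writing it to local disk.
   class RemoteStorageWriter : public c10d::DebugInfoWriter {
    public:
     RemoteStorageWriter(std::string namePrefix, int rank)
         : c10d::DebugInfoWriter(std::move(namePrefix), rank) {}

     void write(const std::string& ncclTrace) override {
       // Placeholder: upload `ncclTrace` to your storage service here.
     }
   };

   // Register the writer before initializing PyTorch distributed.
   void installFlightRecorderWriter(int rank) {
     c10d::DebugInfoWriter::registerWriter(
         std::make_unique<RemoteStorageWriter>("my_job/nccl_trace_rank_", rank));
   }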
@@ -169,9 +175,29 @@ To run the convenience script, follow these steps:

 2. To run the script, use this command:

-.. code:: python
+.. code:: shell
+
+   python fr_trace.py <dump dir containing trace files> [-o <output file>]
+
+If you install the PyTorch nightly build or build from scratch with ``USE_DISTRIBUTED=1``, you can use the following
+command directly:
+
+.. code:: shell
+
+   torchfrtrace <dump dir containing trace files> [-o <output file>]
+
+
+Currently, we support two modes for the analyzer script. The first mode allows the script to apply some heuristics to the parsed flight
+recorder dumps to generate a report identifying potential culprits for the timeout. The second mode simply outputs the raw dumps.
+By default, the script prints flight recorder dumps for all ranks and all ``ProcessGroups`` (PGs). This can be narrowed down to certain
+ranks and PGs using the *--selected-ranks* argument. An example command is:
+
+Caveat: the ``tabulate`` module is needed, so you might need to ``pip install`` it first.
+
+.. code:: shell

-   python fr_trace.py -d <dump dir containing trace files> [-o <output file>]
+   python fr_trace.py <dump dir containing trace files> -j [--selected-ranks i j k ...]
+   torchfrtrace <dump dir containing trace files> -j [--selected-ranks i j k ...]

 Conclusion
 ----------
