advanced_source/extend_dispatcher.rst: 5 additions, 5 deletions
@@ -17,7 +17,7 @@ to `register a dispatched operator in C++ <dispatcher>`_ and how to write a
 What's a new backend?
 ---------------------
 
-Adding a new backend to PyTorch requires a lot of developement and maintainence from backend extenders.
+Adding a new backend to PyTorch requires a lot of development and maintenance from backend extenders.
 Before adding a new backend, let's first consider a few common use cases and recommended solutions for them:
 
 * If you have new algorithms for an existing PyTorch operator, send a PR to PyTorch.
@@ -30,7 +30,7 @@ Before adding a new backend, let's first consider a few common use cases and rec
 
 In this tutorial we'll mainly focus on adding a new out-of-tree device below. Adding out-of-tree support
 for a different tensor layout might share many common steps with devices, but we haven't seen an example of
-such integrations yet so it might require addtional work from PyTorch to support it.
+such integrations yet so it might require additional work from PyTorch to support it.
 
 Get a dispatch key for your backend
 -----------------------------------
@@ -67,12 +67,12 @@ To create a Tensor on ``PrivateUse1`` backend, you need to set dispatch key in `
 Note that ``TensorImpl`` class above assumes your Tensor is backed by a storage like CPU/CUDA. We also
 provide ``OpaqueTensorImpl`` for backends without a storage. And you might need to tweak/override certain
 methods to fit your customized hardware.
-One example in pytorch repo is `Vulkan TensorImpl <https://github.com/pytorch/pytorch/blob/1.7/aten/src/ATen/native/vulkan/VulkanOpaqueTensorImpl.h>`_.
+One example in pytorch repo is `Vulkan TensorImpl <https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/vulkan/VulkanOpaqueTensorImpl.h>`_.
 
 
 .. note::
   Once the prototype is done and you plan to do regular releases for your backend extension, please feel free to
-  submit a PR to ``pytorch/pytorch`` to reserve a dedicated dispath key for your backend.
+  submit a PR to ``pytorch/pytorch`` to reserve a dedicated dispatch key for your backend.
 
 
 Get the full list of PyTorch operators
@@ -361,7 +361,7 @@ actively working on might improve the experience in the future:
 
 * Improve test coverage of generic testing framework.
 * Improve ``Math`` kernel coverage and more comprehensive tests to make sure ``Math``
-  kernel bahavior matches other backends like ``CPU/CUDA``.
+  kernel behavior matches other backends like ``CPU/CUDA``.
 * Refactor ``RegistrationDeclarations.h`` to carry the minimal information and reuse
   PyTorch's codegen as much as possible.
 * Support a backend fallback kernel to automatic convert inputs to CPU and convert the
prototype_source/flight_recorder_tutorial.rst: 31 additions, 5 deletions
@@ -46,13 +46,15 @@ Flight Recorder consists of two core parts:
 
 Enabling Flight Recorder
 ------------------------
-There are two required environment variables to get the initial version of Flight Recorder working.
+There are three required environment variables to get the initial version of Flight Recorder working.
 
 - ``TORCH_NCCL_TRACE_BUFFER_SIZE = (0, N)``: Setting ``N`` to a positive number enables collection.
   ``N`` represents the number of entries that will be kept internally in a circular buffer.
-  We recommended to set this value at *2000*.
+  We recommend setting this value to *2000*. The default value is ``2000``.
 - ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout.
-  If enabled, there will be one file per rank output in the job's running directory.
+  If enabled, there will be one file per rank output in the job's running directory. The default value is ``false``.
+- ``TORCH_NCCL_DEBUG_INFO_TEMP_FILE``: Sets the path (with file prefix) to which Flight Recorder data is dumped. One file per
+  rank. The default value is ``/tmp/nccl_trace_rank_``.
 
 
 **Optional settings:**
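Taken together, a minimal sketch of a launch that enables Flight Recorder with the three variables added above might look like the following. The ``torchrun`` invocation and the ``train.py`` script name are illustrative assumptions, not part of the tutorial:

.. code:: shell

   # Keep up to 2000 entries in the in-memory circular buffer (also the default).
   export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
   # Write diagnostic files to disk when a collective times out.
   export TORCH_NCCL_DUMP_ON_TIMEOUT=true
   # One dump file per rank: /tmp/nccl_trace_rank_0, /tmp/nccl_trace_rank_1, ...
   export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/tmp/nccl_trace_rank_
   # Hypothetical distributed launch; any NCCL-based job picks these up.
   torchrun --nproc-per-node=8 train.py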
@@ -71,6 +73,10 @@ Additional Settings
 
 ``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``.
 Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data.
+- If you prefer not to dump Flight Recorder data to local disk but rather to your own storage, you can define your own writer class.
+  This class should inherit from class ``::c10d::DebugInfoWriter`` `(code) <https://github.com/pytorch/pytorch/blob/release/2.5/torch/csrc/distributed/c10d/NCCLUtils.hpp#L237>`__
+  and then register the new writer using ``::c10d::DebugInfoWriter::registerWriter`` `(code) <https://github.com/pytorch/pytorch/blob/release/2.5/torch/csrc/distributed/c10d/NCCLUtils.hpp#L242>`__
+  before PyTorch distributed is initialized.
 
 Retrieving Flight Recorder Data via an API
 ------------------------------------------
@@ -169,9 +175,29 @@ To run the convenience script, follow these steps:
 
 2. To run the script, use this command:
 
-.. code:: python
+.. code:: shell
+
+   python fr_trace.py <dump dir containing trace files> [-o <output file>]
+
+If you install the PyTorch nightly build or build from scratch with ``USE_DISTRIBUTED=1``, you can use the following
+command directly:
+
+.. code:: shell
+
+   torchfrtrace <dump dir containing trace files> [-o <output file>]
+
+
+Currently, we support two modes for the analyzer script. The first mode allows the script to apply some heuristics to the parsed flight
+recorder dumps to generate a report identifying potential culprits for the timeout. The second mode simply outputs the raw dumps.
+By default, the script prints flight recorder dumps for all ranks and all ``ProcessGroups`` (PGs). This can be narrowed down to certain
+ranks and PGs using the *--selected-ranks* argument. An example command is:
+
+Caveat: the ``tabulate`` module is needed, so you might need to ``pip install`` it first.
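The example command referenced above is not shown in this diff excerpt. A plausible invocation, assuming ``--selected-ranks`` accepts a space-separated list of rank IDs (check ``torchfrtrace --help`` for the exact syntax), would be:

.. code:: shell

   # Limit the report to ranks 0, 1, and 2; the flag syntax is an assumption.
   torchfrtrace <dump dir containing trace files> --selected-ranks 0 1 2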