From 8faf499511dfdec74ccf609166448dd09dfee2fd Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Wed, 15 Jul 2020 10:13:18 -0700 Subject: [PATCH 1/4] Dispatcher tutorial Signed-off-by: Edward Z. Yang --- advanced_source/dispatcher.rst | 286 ++++++++++++++++++++++ advanced_source/dispatcher/CMakeLists.txt | 8 + advanced_source/dispatcher/op.cpp | 105 ++++++++ advanced_source/dispatcher/test.py | 11 + index.rst | 1 + 5 files changed, 411 insertions(+) create mode 100644 advanced_source/dispatcher.rst create mode 100644 advanced_source/dispatcher/CMakeLists.txt create mode 100644 advanced_source/dispatcher/op.cpp create mode 100644 advanced_source/dispatcher/test.py diff --git a/advanced_source/dispatcher.rst b/advanced_source/dispatcher.rst new file mode 100644 index 00000000000..91b1653c0a0 --- /dev/null +++ b/advanced_source/dispatcher.rst @@ -0,0 +1,286 @@ +Dispatcher in C++ +================= + +The dispatcher is an internal component of PyTorch which is responsible for +figuring out what code should actually get run when you call a function like +``torch::add``. This can be nontrivial, because PyTorch operations need +to handle a lot of cross-cutting concerns that are "layered" on top of one +another. Here is a sampling of some of the things it handles: + +* Switching between the CPU and CUDA implementations of an operator, depending + on the devices of the input tensors. +* Switching between the autograd and backend implementations of an operator, + depending on whether or not autograd handling is necessary. +* Applying autocasting when necessary for automatic mixed precision. +* Applying batching rules when an operator is run under a ``vmap`` call. +* Tracing execution of operations, if you are tracing a model for export. + +If in your `custom operator code `_ you find yourself +manually writing if statements to handle these cases, the dispatcher APIs can +help organize your code. (Conversely, if your custom operator is very simple +and is only for CPU inference, you probably don't need to use the dispatcher; +just use the basic API.) + +In this tutorial, we will describe how to structure a custom operator +registration to use the dispatcher to organize various components. We'll +assume that you are familiar with how to +`register an operator `_ and how to write +a `custom autograd function `_. + +Defining schema and backend implementations +------------------------------------------- + +The general principle behind the dispatcher is that it divides the +implementation of an operator into multiple kernels, each of which +implements functionality for a specific *dispatch key*; for example, +`CPU`, `CUDA` or `Autograd`. The end effect is that when you call +an operator, we first execute the `Autograd` kernel, and then we +redispatch to the `CPU` or `CUDA` kernel depending on the device +types of the passed in tensors. + +Let's take a look at the various parts involved in making this +happen. First, we must define the schema for the operator in question. +Unlike simple pybind11-style operator registration, we don't actually +provide an implementation of our operator at this point; we just +provide a schema string specifying the type signature of the operator +that all of our other kernels will abide by: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY + :end-before: END TORCH_LIBRARY + +Next, we need to actually provide some implementations of this operator.
+For concreteness, here is a really simple implementation of addition on CPU: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN myadd_cpu + :end-before: END myadd_cpu + +We'd like to register this function as an implementation of ``myops::myadd``, but we +don't want to register it as a catch-all kernel to be run in all cases; we +only want it to be run when we call ``myops::myadd`` at the backend on CPU tensors. +To do this, we can use the ``TORCH_LIBRARY_IMPL`` macro: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY_IMPL CPU + :end-before: END TORCH_LIBRARY_IMPL CPU + +The ``TORCH_LIBRARY_IMPL`` lets us register implementations for operators on +a specific dispatch key (in this case, ``CPU``). Each call to ``impl`` +associates a CPU kernel with the corresponding operator (which we previously +defined in the ``TORCH_LIBRARY`` block). You can have as many +``TORCH_LIBRARY_IMPL`` blocks for a namespace as you like; so for example, +if we also have a CUDA implementation ``myadd_cuda``, we can register it +with: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY_IMPL CUDA + :end-before: END TORCH_LIBRARY_IMPL CUDA + +These registrations can be split across files or even across library boundaries; so +for example, you could have these two ``TORCH_LIBRARY_IMPL`` blocks compiled +into a separate ``myops_cpu`` and ``myops_cuda`` dynamic library. + +.. note:: + + Did you know that you can also write ``TORCH_LIBRARY_IMPL`` blocks for existing + core operators in PyTorch? This is how XLA support for PyTorch is + implemented: the ``torch_xla`` library contains a ``TORCH_LIBRARY_IMPL`` + that provides implementations for all basic operators on the XLA dispatch + key. + +Adding autograd support +----------------------- + +At this point, we have an operator with both CPU and CUDA implementations. How +can we add autograd support to it? As you might guess, we will register an +autograd kernel (similar to what's described in the `custom autograd function `_ tutorial)! +However, there is a twist: unlike the CPU and CUDA kernels, the autograd kernel +needs to *redispatch*: it needs to call back into the dispatcher to get to +the final CPU and CUDA implementations. + +Thus, before we write the autograd kernel, let's write a *dispatching function* +which calls into the dispatcher to find the right kernel for your operator. +This function constitutes the public C++ API for your operators--in fact, all of +the tensor functions in PyTorch's C++ API all call the dispatcher in the same +way under the hood. Here's what the dispatching function looks like: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: myadd + :end-before: myadd + +Let's break it down: + +* In the first line, we look up a typed operator handle from the dispatcher + corresponding to the operator that we are going to dispatch to. + ``findSchemaOrThrow`` takes two arguments: the (namespace qualified) name + of the operator, and the overload name of the operator (typically just + the empty string). ``typed`` casts the dynamically typed handle into + a statically typed handle (doing a runtime test to make sure you've given + the correct C++ type), so that we can do a normal C++ call on it. 
We + pass it ``decltype(myadd)`` since the type of the dispatching function is + the same as the type of the underlying kernels registered to the dispatcher. + + For performance, this computation is done in a static variable, so that + we only need to do the (slow) lookup once. If you typoed the name of the + operator you want to call, this lookup will error the first time you call this + function. + +* In the second line, we simply ``call`` the operator handle with all of the + arguments passed into the dispatching function. This will actually invoke + the dispatcher and in the end control will be transferred to whatever kernel + is appropriate for this call. + +With the dispatch function in hand, we can now write the autograd kernel: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: myadd_autograd + :end-before: myadd_autograd + +The autograd function is written as normal using ``torch::autograd::Function``, +except that instead of directly writing the implementation in ``forward()``, +we: + +1. Turn off autograd handling with the `at::AutoNonVariableTypeMode`` RAII + guard, and then +2. Call the dispatch function ``myadd`` to call back into the dispatcher. + +Without (1), your calls will infinite loop (and stack overflow), because +``myadd`` will send you back to the autograd implementation! With (1), +the redispatch will skip over autograd and go to the next handlers, +which will either be CPU and CUDA. + +We can now register this function in the same way we registered the CPU/CUDA +functions: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY_IMPL Autograd + :end-before: END TORCH_LIBRARY_IMPL Autograd + +Going beyond autograd +--------------------- + +In some sense, the dispatcher isn't doing all that much: all it does is +implement a glorified if-statement, along the lines of this: + +.. code-block:: cpp + + class MyAddFunction : ... { + public: + static Tensor forward( + AutogradContext *ctx, torch::Tensor self, torch::Tensor other) { + + if (self.device().type() == DeviceType::CPU) { + return add_cpu(self, other); + } else if (self.device().type() == DeviceType::CUDA) { + return add_cuda(self, other); + } else { + TORCH_CHECK(0, "Unsupported device ", self.device().type()); + } + } + ... + } + +So why use the dispatcher? There are a few reasons: + +1. It is decentralized. You can assemble all of the pieces of an operator + (CPU, CUDA, Autograd) without having to write a single, centralized + if statement that refers to all of them. Importantly, third parties can + register extra implementations for other aspects without having to patch the + original definition of an operator. + +2. It supports more dispatch keys than CPU, CUDA and Autograd. You can + see a full list of dispatch keys that are currently implemented + in PyTorch in ``c10/core/DispatchKey.h``. These dispatch keys + implement a variety of optional functionality for operators, and if you + decide you want your custom operator to support this functionality, + all you have to do is register a kernel for the appropriate key. + +3. The dispatcher implements support for boxed fallback functions, which + are functions that can be implemented once and apply to all operators + in the system. Boxed fallbacks can be used to provide default behavior + for a dispatch key; if you use the dispatcher to implement your operator, + you also opt into the fallbacks for all of these operations. A short + sketch of what such a fallback registration looks like follows this list.
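+
+To give a flavor of what (3) looks like, here is a minimal sketch of a boxed
+fallback registration. It is not part of our ``myops`` example and you do not
+need to write it yourself; it is roughly how PyTorch itself provides the
+default "just fall through" behavior for keys like Autocast:
+
+.. code-block:: cpp
+
+   // Register a fallback for the Autocast dispatch key that applies to every
+   // operator in every namespace (that is what the wildcard "_" means): if an
+   // operator has no Autocast-specific kernel, simply fall through to the
+   // next dispatch key.
+   TORCH_LIBRARY_IMPL(_, Autocast, m) {
+     m.fallback(torch::CppFunction::makeFallthrough());
+   }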
+ +Here are some particular dispatch keys which you may need to define an operator +for. + +Autocast +^^^^^^^^ + +The Autocast dispatch key implements support for +`automatic mixed precision `_ +(AMP). An autocast kernel typically modifies the operation of an operator by casting the +input arguments to some precision before carrying out the operation. For some +operations, it is numerically safe to cast to lower precision, which is how AMP +can achieve speedups and reduced memory usage without sacrificing much +accuracy. A nontrivial autocast kernel looks something like this: + +.. code-block:: cpp + + Tensor mymatmul_autocast(const Tensor& self, const Tensor& other) { + c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast); + return mymatmul(autocast::_cast(at::kHalf, self), autocast::_cast(at::kHalf, other)); + } + +Notice that, like our autograd kernels, we exclude the ``Autocast`` key from +dispatch before redispatching. By default, if no autocast kernel is provided, +we simply fall through directly to the regular operator implementation (no +autocasting occurs). (We didn't use ``myadd`` for this example, since pointwise +addition doesn't do autocasting and should just fall through.) A sketch of how an autocast kernel is registered appears at the end of this tutorial. + +When should an autocast kernel be registered? Unfortunately, there aren't +cut-and-dried rules for when you should cast to a lower precision. You can +get a sense for what operators have autocasting behavior by looking at +the `AMP documentation +`_. Some other +general rules: + +* Operations that do reductions should be carried out in float32, +* Any operation with multiple float tensor inputs has to standardize them + to a common precision, and +* Any operation that does a convolution or gemm under the hood should + probably be float16 + +.. + + NB: This doesn't work because torch.ops doesn't support names. + + Named + ^^^^^ + + `Named tensors `_ allow + users to associate explicit names with tensor dimensions, and then have those + dimensions be propagated when you run operations on those tensors. If you + define a new operator, you have to also define rules for how names should + be checked and propagated. The Named kernel handles implementing these rules. + + .. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY_IMPL Named + :end-before: END TORCH_LIBRARY_IMPL Named + +Batched +^^^^^^^ + +Batched tensors allow you to write your code in a per-example manner, and then +have them be automatically batched when run under a ``vmap`` invocation. The +API for writing batching rules is currently under development, but once it is +stabilized, you can add support for ``vmap`` for your operators by registering +a kernel at the Batched dispatch key. + +Tracer +^^^^^^ + +The Tracer dispatch key implements support for recording invocations of operators +into a trace when you run ``torch.jit.trace``. We intend to provide a +boxed fallback that will implement tracing for arbitrary operations, +see `issue #41478 ` to track +progress.
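+
+Finally, returning to the Autocast example above: an autocast kernel is
+registered with the same ``TORCH_LIBRARY_IMPL`` mechanism as every other
+kernel in this tutorial, just at the ``Autocast`` key. A sketch, assuming a
+``mymatmul`` operator was declared in the ``TORCH_LIBRARY`` block the same way
+``myadd`` was, and that ``mymatmul_autocast`` is the kernel shown earlier:
+
+.. code-block:: cpp
+
+   // Run mymatmul_autocast when mymatmul is called with autocasting enabled.
+   TORCH_LIBRARY_IMPL(myops, Autocast, m) {
+     m.impl("mymatmul", mymatmul_autocast);
+   }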
diff --git a/advanced_source/dispatcher/CMakeLists.txt b/advanced_source/dispatcher/CMakeLists.txt new file mode 100644 index 00000000000..0ef448a9644 --- /dev/null +++ b/advanced_source/dispatcher/CMakeLists.txt @@ -0,0 +1,8 @@ +cmake_minimum_required(VERSION 3.1 FATAL_ERROR) +project(dispatcher) + +find_package(Torch REQUIRED) + +add_library(dispatcher SHARED op.cpp) +target_compile_features(dispatcher PRIVATE cxx_std_14) +target_link_libraries(dispatcher "${TORCH_LIBRARIES}") diff --git a/advanced_source/dispatcher/op.cpp b/advanced_source/dispatcher/op.cpp new file mode 100644 index 00000000000..c3a90aed448 --- /dev/null +++ b/advanced_source/dispatcher/op.cpp @@ -0,0 +1,105 @@ +#include <torch/torch.h> +#include <torch/script.h> + +#include <ATen/NamedTensorUtils.h> + +using torch::Tensor; +using torch::DeviceType; +using torch::autograd::tensor_list; +using torch::autograd::AutogradContext; + +// BEGIN myadd +Tensor myadd(const Tensor& self, const Tensor& other) { + static auto op = torch::Dispatcher::singleton() + .findSchemaOrThrow("myops::myadd", "") + .typed<decltype(myadd)>(); + return op.call(self, other); +} +// END myadd + +// BEGIN TORCH_LIBRARY +TORCH_LIBRARY(myops, m) { + m.def("myadd(Tensor self, Tensor other) -> Tensor"); +} +// END TORCH_LIBRARY + +// BEGIN myadd_cpu +Tensor myadd_cpu(const Tensor& self_, const Tensor& other_) { + TORCH_CHECK(self_.sizes() == other_.sizes()); + TORCH_INTERNAL_ASSERT(self_.device().type() == DeviceType::CPU); + TORCH_INTERNAL_ASSERT(other_.device().type() == DeviceType::CPU); + Tensor self = self_.contiguous(); + Tensor other = other_.contiguous(); + Tensor result = torch::empty(self.sizes(), self.options()); + const float* self_ptr = self.data_ptr<float>(); + const float* other_ptr = other.data_ptr<float>(); + float* result_ptr = result.data_ptr<float>(); + for (int64_t i = 0; i < result.numel(); i++) { + result_ptr[i] = self_ptr[i] + other_ptr[i]; + } + return result; +} +// END myadd_cpu + +// BEGIN TORCH_LIBRARY_IMPL CPU +TORCH_LIBRARY_IMPL(myops, CPU, m) { + m.impl("myadd", myadd_cpu); +} +// END TORCH_LIBRARY_IMPL CPU + +Tensor myadd_cuda(const Tensor& self, const Tensor& other) { + // Insert your CUDA implementation here + TORCH_CHECK(0, "CUDA not yet implemented"); } + +// BEGIN TORCH_LIBRARY_IMPL CUDA +TORCH_LIBRARY_IMPL(myops, CUDA, m) { + m.impl("myadd", myadd_cuda); +} +// END TORCH_LIBRARY_IMPL CUDA + +// BEGIN myadd_autograd +class MyAddFunction : public torch::autograd::Function<MyAddFunction> { + public: + static Tensor forward( + AutogradContext *ctx, torch::Tensor self, torch::Tensor other) { + at::AutoNonVariableTypeMode g; + return myadd(self, other); + } + + static tensor_list backward(AutogradContext *ctx, tensor_list grad_outputs) { + auto grad_output = grad_outputs[0]; + return {grad_output, grad_output}; + } +}; + +Tensor myadd_autograd(const Tensor& self, const Tensor& other) { + return MyAddFunction::apply(self, other)[0]; +} +// END myadd_autograd + +// BEGIN TORCH_LIBRARY_IMPL Autograd +TORCH_LIBRARY_IMPL(myops, Autograd, m) { + m.impl("myadd", myadd_autograd); +} +// END TORCH_LIBRARY_IMPL Autograd + +#if 0 +// BEGIN TORCH_LIBRARY_IMPL Named +Tensor myadd_named(const Tensor& self, const Tensor& other) { + // TODO: shouldn't need to do size check here + TORCH_CHECK(self.sizes() == other.sizes()); + auto maybe_outnames = at::unify_from_right(self.names(), other.names()); + auto result = ([&]() { + at::NoNamesGuard guard; + return myadd(self, other); + })(); + at::namedinference::propagate_names_if_nonempty(result, maybe_outnames); + return result; +} + +TORCH_LIBRARY_IMPL(myops, Named, m) { + m.impl("myadd",
myadd_named); +} +// END TORCH_LIBRARY_IMPL Named +#endif diff --git a/advanced_source/dispatcher/test.py b/advanced_source/dispatcher/test.py new file mode 100644 index 00000000000..cd35b05a47a --- /dev/null +++ b/advanced_source/dispatcher/test.py @@ -0,0 +1,11 @@ +import torch + +torch.ops.load_library("build/libdispatcher.so") +print(torch.ops.myops.myadd(torch.randn(32, 32), torch.rand(32, 32))) +""" +# Doesn't currently work, because Python frontend on torch.ops doesn't +# support names (for not a good reason?) +x = torch.randn(32, 32, names=('A', 'B')) +y = torch.rand(32, 32, names=('A', 'B')) +print(torch.ops.myops.myadd(x, y)) +""" diff --git a/index.rst b/index.rst index 9a8762e8026..2b0f38b9cdb 100644 --- a/index.rst +++ b/index.rst @@ -472,6 +472,7 @@ Additional Resources advanced/torch_script_custom_ops advanced/torch_script_custom_classes advanced/cpp_autograd + advanced/dispatcher .. toctree:: :maxdepth: 2 From b478ae97cb0b47b676737f4083ff57b00cca82f1 Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Wed, 15 Jul 2020 13:44:29 -0700 Subject: [PATCH 2/4] typofix Signed-off-by: Edward Z. Yang --- advanced_source/dispatcher.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/advanced_source/dispatcher.rst b/advanced_source/dispatcher.rst index 91b1653c0a0..3867333d6b5 100644 --- a/advanced_source/dispatcher.rst +++ b/advanced_source/dispatcher.rst @@ -111,8 +111,8 @@ way under the hood. Here's what the dispatching function looks like: .. literalinclude:: ../advanced_source/dispatcher/op.cpp :language: cpp - :start-after: myadd - :end-before: myadd + :start-after: BEGIN myadd + :end-before: END myadd Let's break it down: @@ -140,8 +140,8 @@ With the dispatch function in hand, we can now write the autograd kernel: .. literalinclude:: ../advanced_source/dispatcher/op.cpp :language: cpp - :start-after: myadd_autograd - :end-before: myadd_autograd + :start-after: BEGIN myadd_autograd + :end-before: END myadd_autograd The autograd function is written as normal using ``torch::autograd::Function``, except that instead of directly writing the implementation in ``forward()``, From 898dede14ef91bf3b46a83575d4fc3a019c74aaa Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Thu, 16 Jul 2020 11:33:11 -0700 Subject: [PATCH 3/4] morefix Signed-off-by: Edward Z. Yang --- advanced_source/dispatcher.rst | 30 ++++++------------------------ 1 file changed, 6 insertions(+), 24 deletions(-) diff --git a/advanced_source/dispatcher.rst b/advanced_source/dispatcher.rst index 3867333d6b5..888744c90ec 100644 --- a/advanced_source/dispatcher.rst +++ b/advanced_source/dispatcher.rst @@ -33,9 +33,9 @@ Defining schema and backend implementations The general principle behind the dispatcher is that it divides the implementation of an operator into multiple kernels, each of which implements functionality for a specific *dispatch key*; for example, -`CPU`, `CUDA` or `Autograd`. The end effect is that when you call -an operator, we first execute the `Autograd` kernel, and then we -redispatch to the `CPU` or `CUDA` kernel depending on the device +CPU, CUDA or Autograd. The end effect is that when you call +an operator, we first execute the Autograd kernel, and then we +redispatch to the CPU or CUDA kernel depending on the device types of the passed in tensors. 
Let's take a look at the various parts involved in making this @@ -69,7 +69,7 @@ To do this, we can use the ``TORCH_LIBRARY_IMPL`` macro: :end-before: END TORCH_LIBRARY_IMPL CPU The ``TORCH_LIBRARY_IMPL`` lets us register implementations for operators on -a specific dispatch key (in this case, ``CPU``). Each call to ``impl`` +a specific dispatch key (in this case, CPU). Each call to ``impl`` associates a CPU kernel with the corresponding operator (which we previously defined in the ``TORCH_LIBRARY`` block). You can have as many ``TORCH_LIBRARY_IMPL`` blocks for a namespace as you like; so for example, @@ -147,7 +147,7 @@ The autograd function is written as normal using ``torch::autograd::Function``, except that instead of directly writing the implementation in ``forward()``, we: -1. Turn off autograd handling with the `at::AutoNonVariableTypeMode`` RAII +1. Turn off autograd handling with the ``at::AutoNonVariableTypeMode`` RAII guard, and then 2. Call the dispatch function ``myadd`` to call back into the dispatcher. @@ -249,24 +249,6 @@ general rules: * Any operation that does a convolution or gemm under the hood should probably be float16 -.. - - NB: This doesn't work because torch.ops doesn't support names. - - Named - ^^^^^ - - `Named tensors `_ allow - users to associate explicit names with tensor dimensions, and then have those - dimensions be propagated when you run operations on those tensors. If you - define a new operator, you have to also define rules for how names should - be checked and propagated. The Named kernel handles implementing these rules. - - .. literalinclude:: ../advanced_source/dispatcher/op.cpp - :language: cpp - :start-after: BEGIN TORCH_LIBRARY_IMPL Named - :end-before: END TORCH_LIBRARY_IMPL Named - Batched ^^^^^^^ @@ -282,5 +264,5 @@ Tracer The Tracer dispatch key implements support for recording invocations of operators into a trace when you run ``torch.jit.trace``. We intend to provide a boxed fallback that will implement tracing for arbitrary operations, -see `issue #41478 ` to track +see `issue #41478 `_ to track progress. From 2a96e0d5c42ad8a96f46ada92e8e35b8a5daa9b4 Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Wed, 22 Jul 2020 08:01:00 -0700 Subject: [PATCH 4/4] copyedit based on comments Signed-off-by: Edward Z. Yang --- advanced_source/dispatcher.rst | 49 +++++++++++++++++++++------------- 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/advanced_source/dispatcher.rst b/advanced_source/dispatcher.rst index 888744c90ec..7a7d806c328 100644 --- a/advanced_source/dispatcher.rst +++ b/advanced_source/dispatcher.rst @@ -31,12 +31,14 @@ Defining schema and backend implementations ------------------------------------------- The general principle behind the dispatcher is that it divides the -implementation of an operator into multiple kernels, each of which -implements functionality for a specific *dispatch key*; for example, -CPU, CUDA or Autograd. The end effect is that when you call -an operator, we first execute the Autograd kernel, and then we -redispatch to the CPU or CUDA kernel depending on the device -types of the passed in tensors. +implementation of an operator into multiple kernels, each of which implements +functionality for a specific *dispatch key*; for example, CPU, CUDA or Autograd. 
+The dispatcher determines what the highest priority dispatch key is at the time +you call an operator (this is done by looking at both the tensor arguments as +well as some thread local state), and transfers control to the kernel for that +dispatch key. The end effect is that when you call an operator, we first +execute the Autograd kernel, and then we redispatch to the CPU or CUDA kernel +depending on the device types of the passed in tensors. Let's take a look at the various parts involved in making this happen. First, we must define the schema for the operator in question. @@ -58,10 +60,12 @@ For concreteness, here is a really simple implementation of addition on CPU: :start-after: BEGIN myadd_cpu :end-before: END myadd_cpu -We'd like to register this function as an implementation of ``myops::myadd``, but we -don't want to register it as a catch-all kernel to be run in all cases; we -only want it to be run when we call ``myops::myadd`` at the backend on CPU tensors. -To do this, we can use the ``TORCH_LIBRARY_IMPL`` macro: +We'd like to register this function as an implementation of ``myops::myadd``. +However, the simple way of registering it (``def("myadd", myadd_cpu)``) would +register the kernel to run in all cases, even if the tensor is not a CPU +tensor! (Internally, we refer to these as "catch-all" kernels, since they +catch all cases.) To ensure that ``myadd_cpu`` is only run for +CPU tensors, we can use the ``TORCH_LIBRARY_IMPL`` macro: .. literalinclude:: ../advanced_source/dispatcher/op.cpp :language: cpp @@ -71,10 +75,8 @@ To do this, we can use the ``TORCH_LIBRARY_IMPL`` macro: The ``TORCH_LIBRARY_IMPL`` lets us register implementations for operators on a specific dispatch key (in this case, CPU). Each call to ``impl`` associates a CPU kernel with the corresponding operator (which we previously -defined in the ``TORCH_LIBRARY`` block). You can have as many -``TORCH_LIBRARY_IMPL`` blocks for a namespace as you like; so for example, -if we also have a CUDA implementation ``myadd_cuda``, we can register it -with: +defined in the ``TORCH_LIBRARY`` block). If we also have a CUDA implementation ``myadd_cuda``, +we can register it in a separate ``TORCH_LIBRARY_IMPL`` block: .. literalinclude:: ../advanced_source/dispatcher/op.cpp :language: cpp @@ -83,7 +85,17 @@ with: These registrations can be split across files or even across library boundaries; so for example, you could have these two ``TORCH_LIBRARY_IMPL`` blocks compiled -into a separate ``myops_cpu`` and ``myops_cuda`` dynamic library. +into separate ``myops_cpu`` and ``myops_cuda`` dynamic libraries. Generally +speaking, the structure of your registrations will look like this: + +1. A single ``TORCH_LIBRARY`` that lists every custom operator in your namespace + in a centralized place. +2. A ``TORCH_LIBRARY_IMPL`` per dispatch key that registers implementations for + that key (e.g., CPU or CUDA). If you like, you can further subdivide + ``TORCH_LIBRARY_IMPL`` blocks into a block per operator. This is convenient + if you have a separate file per operator implementation, but don't want to + expose the operators in a header; you can just put the registration in the + cpp file that defines your operator. .. note:: @@ -152,9 +164,10 @@ we: 2. Call the dispatch function ``myadd`` to call back into the dispatcher. Without (1), your calls will infinite loop (and stack overflow), because -``myadd`` will send you back to the autograd implementation!
With (1), -the redispatch will skip over autograd and go to the next handlers, -which will either be CPU and CUDA. +``myadd`` will send you back to this function (as the highest priority dispatch +key would still be autograd). With (1), +autograd is excluded from the set of dispatch keys under consideration, and +we will go to the next handlers, which will be either CPU or CUDA. We can now register this function in the same way we registered the CPU/CUDA functions: