From 8faf499511dfdec74ccf609166448dd09dfee2fd Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Wed, 15 Jul 2020 10:13:18 -0700 Subject: [PATCH 1/4] Dispatcher tutorial Signed-off-by: Edward Z. Yang --- advanced_source/dispatcher.rst | 286 ++++++++++++++++++++++ advanced_source/dispatcher/CMakeLists.txt | 8 + advanced_source/dispatcher/op.cpp | 105 ++++++++ advanced_source/dispatcher/test.py | 11 + index.rst | 1 + 5 files changed, 411 insertions(+) create mode 100644 advanced_source/dispatcher.rst create mode 100644 advanced_source/dispatcher/CMakeLists.txt create mode 100644 advanced_source/dispatcher/op.cpp create mode 100644 advanced_source/dispatcher/test.py diff --git a/advanced_source/dispatcher.rst b/advanced_source/dispatcher.rst new file mode 100644 index 00000000000..91b1653c0a0 --- /dev/null +++ b/advanced_source/dispatcher.rst @@ -0,0 +1,286 @@ +Dispatcher in C++ +================= + +The dispatcher is an internal component of PyTorch which is responsible for +figuring out what code should actually get run when you call a function like +``torch::add``. This can be nontrivial, because PyTorch operations need +to handle a lot of cross-cutting concerns that are "layered" on top of one +another. Here is a sampling of some of the things it handles: + +* Switching between the CPU and CUDA implementations of an operator, depending + on the devices of the input tensors. +* Switching between the autograd and backend implementations of an operator, + depending on whether or not autograd handling is necessary. +* Applying autocasting when necessary for automatic mixed precision. +* Applying batching rules when an operator is run under a ``vmap`` call. +* Tracing execution of operations, if you are tracing a model for export. + +If in your `custom operator code `_ you find yourself +manually writing if statements to handle these cases, the dispatcher APIs can +help organize your code. (Conversely, if your custom operator is very simple +and is only for CPU inference, you probably don't need to use the dispatcher; +just use the basic API.) + +In this tutorial, we will describe how to structure a custom operator +registration to use the dispatcher to organize various components. We'll +assume that you are familiar with how to +`register an operator `_ and how to write +a `custom autograd function `_. + +Defining schema and backend implementations +------------------------------------------- + +The general principle behind the dispatcher is that it divides the +implementation of an operator into multiple kernels, each of which +implements functionality for a specific *dispatch key*; for example, +`CPU`, `CUDA` or `Autograd`. The end effect is that when you call +an operator, we first execute the `Autograd` kernel, and then we +redispatch to the `CPU` or `CUDA` kernel depending on the device +types of the passed in tensors. + +Let's take a look at the various parts involved in making this +happen. First, we must define the schema for the operator in question. +Unlike simple pybind11-style operator registration, we don't actually +provide an implementation of our operator at this point; we just +provide a schema string specifying the type signature of the operator +that all of our other kernels will abide by: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY + :end-before: END TORCH_LIBRARY + +Next, we need to actually provide some implementations of this operator.
+For concreteness, here is a really simple implementation of addition on CPU: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN myadd_cpu + :end-before: END myadd_cpu + +We'd like to register this function as an implementation of ``myops::myadd``, but we +don't want to register it as a catch-all kernel to be run in all cases; we +only want it to be run when we call ``myops::myadd`` at the backend on CPU tensors. +To do this, we can use the ``TORCH_LIBRARY_IMPL`` macro: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY_IMPL CPU + :end-before: END TORCH_LIBRARY_IMPL CPU + +The ``TORCH_LIBRARY_IMPL`` lets us register implementations for operators on +a specific dispatch key (in this case, ``CPU``). Each call to ``impl`` +associates a CPU kernel with the corresponding operator (which we previously +defined in the ``TORCH_LIBRARY`` block). You can have as many +``TORCH_LIBRARY_IMPL`` blocks for a namespace as you like; so for example, +if we also have a CUDA implementation ``myadd_cuda``, we can register it +with: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY_IMPL CUDA + :end-before: END TORCH_LIBRARY_IMPL CUDA + +These registrations can be split across files or even across library boundaries; so +for example, you could have these two ``TORCH_LIBRARY_IMPL`` blocks compiled +into a separate ``myops_cpu`` and ``myops_cuda`` dynamic library. + +.. note:: + + Did you know that you can also write ``TORCH_LIBRARY_IMPL`` blocks for existing + core operators in PyTorch? This is how XLA support for PyTorch is + implemented: the ``torch_xla`` library contains a ``TORCH_LIBRARY_IMPL`` + that provides implementations for all basic operators on the XLA dispatch + key. + +Adding autograd support +----------------------- + +At this point, we have an operator with both CPU and CUDA implementations. How +can we add autograd support to it? As you might guess, we will register an +autograd kernel (similar to what's described in the `custom autograd function `_ tutorial)! +However, there is a twist: unlike the CPU and CUDA kernels, the autograd kernel +needs to *redispatch*: it needs to call back into the dispatcher to get to +the final CPU and CUDA implementations. + +Thus, before we write the autograd kernel, let's write a *dispatching function* +which calls into the dispatcher to find the right kernel for your operator. +This function constitutes the public C++ API for your operators--in fact, all of +the tensor functions in PyTorch's C++ API all call the dispatcher in the same +way under the hood. Here's what the dispatching function looks like: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: myadd + :end-before: myadd + +Let's break it down: + +* In the first line, we look up a typed operator handle from the dispatcher + corresponding to the operator that we are going to dispatch to. + ``findSchemaOrThrow`` takes two arguments: the (namespace qualified) name + of the operator, and the overload name of the operator (typically just + the empty string). ``typed`` casts the dynamically typed handle into + a statically typed handle (doing a runtime test to make sure you've given + the correct C++ type), so that we can do a normal C++ call on it. 
We + pass it ``decltype(myadd)`` since the type of the dispatching function is + the same as the type of the underlying kernels registered to the dispatcher. + + For performance, this computation is done in a static variable, so that + we only need to do the (slow) lookup once. If you typoed the name of the + operator you want to call, this lookup will error the first time you call this + function. + +* In the second line, we simply ``call`` the operator handle with all of the + arguments passed into the dispatching function. This will actually invoke + the dispatcher and in the end control will be transferred to whatever kernel + is appropriate for this call. + +With the dispatch function in hand, we can now write the autograd kernel: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: myadd_autograd + :end-before: myadd_autograd + +The autograd function is written as normal using ``torch::autograd::Function``, +except that instead of directly writing the implementation in ``forward()``, +we: + +1. Turn off autograd handling with the `at::AutoNonVariableTypeMode`` RAII + guard, and then +2. Call the dispatch function ``myadd`` to call back into the dispatcher. + +Without (1), your calls will infinite loop (and stack overflow), because +``myadd`` will send you back to the autograd implementation! With (1), +the redispatch will skip over autograd and go to the next handlers, +which will either be CPU and CUDA. + +We can now register this function in the same way we registered the CPU/CUDA +functions: + +.. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY_IMPL Autograd + :end-before: END TORCH_LIBRARY_IMPL Autograd + +Going beyond autograd +--------------------- + +In some sense, the dispatcher isn't doing all that much: all it does is +implement a glorified if-statement, along the lines of this: + +.. code-block:: cpp + + class MyAddFunction : ... { + public: + static Tensor forward( + AutogradContext *ctx, torch::Tensor self, torch::Tensor other) { + + if (self.device().type() == DeviceType::CPU) { + return add_cpu(self, other); + } else if (self.device().type() == DeviceType::CUDA) { + return add_cuda(self, other); + } else { + TORCH_CHECK(0, "Unsupported device ", self.device().type()); + } + } + ... + } + +So why use the dispatcher? There are a few reasons: + +1. It is decentralized. You can assemble all of the pieces of an operator + (CPU, CUDA, Autograd) without having to write a single, centralized + if statement that refers to all of them. Importantly, third parties can + register extra implementations for other aspects without having to patch the + original definition of an operator. + +2. It supports more dispatch keys than CPU, CUDA and Autograd. You can + see a full list of dispatch keys that are currently implemented + in PyTorch in ``c10/core/DispatchKey.h``. These dispatch keys + implement a variety of optional functionality for operators, and if you + decide you want your custom operator to support this functionality, + all you have to do is register a kernel for the appropriate key. + +3. The dispatcher implements support for boxed fallback functions, which + are functions that can be implemented once and apply to all operators + in the system. Boxed fallbacks can be used to provide default behavior + for a dispatch key; if you use the dispatcher to implement your operator, + you also opt into the fallbacks for all of these operations. A short + sketch of what such a fallback registration looks like follows this list.
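+
+To give a flavor of what (3) looks like, here is a minimal sketch of a boxed
+fallback registration. It is not part of our ``myops`` example and you do not
+need to write it yourself; it is roughly how PyTorch itself provides the
+default "just fall through" behavior for keys like Autocast:
+
+.. code-block:: cpp
+
+   // Register a fallback for the Autocast dispatch key that applies to every
+   // operator in every namespace (that is what the wildcard "_" means): if an
+   // operator has no Autocast-specific kernel, simply fall through to the
+   // next dispatch key.
+   TORCH_LIBRARY_IMPL(_, Autocast, m) {
+     m.fallback(torch::CppFunction::makeFallthrough());
+   }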
+ +Here are some particular dispatch keys which you may need to define an operator +for. + +Autocast +^^^^^^^^ + +The Autocast dispatch key implements support for +`automatic mixed precision `_ +(AMP). An autocast kernel typically modifies the operation of an operator by casting the +input arguments to some precision before carrying out the operation. For some +operations, it is numerically safe to cast to lower precision, which is how AMP +can achieve speedups and reduced memory usage without sacrificing much +accuracy. A nontrivial autocast kernel looks something like this: + +.. code-block:: cpp + + Tensor mymatmul_autocast(const Tensor& self, const Tensor& other) { + c10::impl::ExcludeDispatchKeyGuard no_autocast(c10::DispatchKey::Autocast); + return mymatmul(autocast::_cast(at::kHalf, self), autocast::_cast(at::kHalf, other)); + } + +Notice that, like our autograd kernels, we exclude the ``Autocast`` key from +dispatch before redispatching. By default, if no autocast kernel is provided, +we simply fall through directly to the regular operator implementation (no +autocasting occurs). (We didn't use ``myadd`` for this example, since pointwise +addition doesn't do autocasting and should just fall through.) A sketch of how an autocast kernel is registered appears at the end of this tutorial. + +When should an autocast kernel be registered? Unfortunately, there aren't +cut-and-dried rules for when you should cast to a lower precision. You can +get a sense for what operators have autocasting behavior by looking at +the `AMP documentation +`_. Some other +general rules: + +* Operations that do reductions should be carried out in float32, +* Any operation with multiple float tensor inputs has to standardize them + to a common precision, and +* Any operation that does a convolution or gemm under the hood should + probably be float16 + +.. + + NB: This doesn't work because torch.ops doesn't support names. + + Named + ^^^^^ + + `Named tensors `_ allow + users to associate explicit names with tensor dimensions, and then have those + dimensions be propagated when you run operations on those tensors. If you + define a new operator, you have to also define rules for how names should + be checked and propagated. The Named kernel handles implementing these rules. + + .. literalinclude:: ../advanced_source/dispatcher/op.cpp + :language: cpp + :start-after: BEGIN TORCH_LIBRARY_IMPL Named + :end-before: END TORCH_LIBRARY_IMPL Named + +Batched +^^^^^^^ + +Batched tensors allow you to write your code in a per-example manner, and then +have them be automatically batched when run under a ``vmap`` invocation. The +API for writing batching rules is currently under development, but once it is +stabilized, you can add support for ``vmap`` for your operators by registering +a kernel at the Batched dispatch key. + +Tracer +^^^^^^ + +The Tracer dispatch key implements support for recording invocations of operators +into a trace when you run ``torch.jit.trace``. We intend to provide a +boxed fallback that will implement tracing for arbitrary operations, +see `issue #41478 ` to track +progress.
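+
+Finally, returning to the Autocast example above: an autocast kernel is
+registered with the same ``TORCH_LIBRARY_IMPL`` mechanism as every other
+kernel in this tutorial, just at the ``Autocast`` key. A sketch, assuming a
+``mymatmul`` operator was declared in the ``TORCH_LIBRARY`` block the same way
+``myadd`` was, and that ``mymatmul_autocast`` is the kernel shown earlier:
+
+.. code-block:: cpp
+
+   // Run mymatmul_autocast when mymatmul is called with autocasting enabled.
+   TORCH_LIBRARY_IMPL(myops, Autocast, m) {
+     m.impl("mymatmul", mymatmul_autocast);
+   }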
diff --git a/advanced_source/dispatcher/CMakeLists.txt b/advanced_source/dispatcher/CMakeLists.txt new file mode 100644 index 00000000000..0ef448a9644 --- /dev/null +++ b/advanced_source/dispatcher/CMakeLists.txt @@ -0,0 +1,8 @@ +cmake_minimum_required(VERSION 3.1 FATAL_ERROR) +project(dispatcher) + +find_package(Torch REQUIRED) + +add_library(dispatcher SHARED op.cpp) +target_compile_features(dispatcher PRIVATE cxx_std_14) +target_link_libraries(dispatcher "${TORCH_LIBRARIES}") diff --git a/advanced_source/dispatcher/op.cpp b/advanced_source/dispatcher/op.cpp new file mode 100644 index 00000000000..c3a90aed448 --- /dev/null +++ b/advanced_source/dispatcher/op.cpp @@ -0,0 +1,105 @@ +#include <torch/torch.h> +#include <torch/script.h> + +#include <ATen/NamedTensorUtils.h> + +using torch::Tensor; +using torch::DeviceType; +using torch::autograd::tensor_list; +using torch::autograd::AutogradContext; + +// BEGIN myadd +Tensor myadd(const Tensor& self, const Tensor& other) { + static auto op = torch::Dispatcher::singleton() + .findSchemaOrThrow("myops::myadd", "") + .typed<decltype(myadd)>(); + return op.call(self, other); +} +// END myadd + +// BEGIN TORCH_LIBRARY +TORCH_LIBRARY(myops, m) { + m.def("myadd(Tensor self, Tensor other) -> Tensor"); +} +// END TORCH_LIBRARY + +// BEGIN myadd_cpu +Tensor myadd_cpu(const Tensor& self_, const Tensor& other_) { + TORCH_CHECK(self_.sizes() == other_.sizes()); + TORCH_INTERNAL_ASSERT(self_.device().type() == DeviceType::CPU); + TORCH_INTERNAL_ASSERT(other_.device().type() == DeviceType::CPU); + Tensor self = self_.contiguous(); + Tensor other = other_.contiguous(); + Tensor result = torch::empty(self.sizes(), self.options()); + const float* self_ptr = self.data_ptr<float>(); + const float* other_ptr = other.data_ptr<float>(); + float* result_ptr = result.data_ptr<float>(); + for (int64_t i = 0; i < result.numel(); i++) { + result_ptr[i] = self_ptr[i] + other_ptr[i]; + } + return result; +} +// END myadd_cpu + +// BEGIN TORCH_LIBRARY_IMPL CPU +TORCH_LIBRARY_IMPL(myops, CPU, m) { + m.impl("myadd", myadd_cpu); +} +// END TORCH_LIBRARY_IMPL CPU + +Tensor myadd_cuda(const Tensor& self, const Tensor& other) { + // Insert your CUDA implementation here + TORCH_CHECK(0, "CUDA not yet implemented"); } + +// BEGIN TORCH_LIBRARY_IMPL CUDA +TORCH_LIBRARY_IMPL(myops, CUDA, m) { + m.impl("myadd", myadd_cuda); +} +// END TORCH_LIBRARY_IMPL CUDA + +// BEGIN myadd_autograd +class MyAddFunction : public torch::autograd::Function<MyAddFunction> { + public: + static Tensor forward( + AutogradContext *ctx, torch::Tensor self, torch::Tensor other) { + at::AutoNonVariableTypeMode g; + return myadd(self, other); + } + + static tensor_list backward(AutogradContext *ctx, tensor_list grad_outputs) { + auto grad_output = grad_outputs[0]; + return {grad_output, grad_output}; + } +}; + +Tensor myadd_autograd(const Tensor& self, const Tensor& other) { + return MyAddFunction::apply(self, other)[0]; +} +// END myadd_autograd + +// BEGIN TORCH_LIBRARY_IMPL Autograd +TORCH_LIBRARY_IMPL(myops, Autograd, m) { + m.impl("myadd", myadd_autograd); +} +// END TORCH_LIBRARY_IMPL Autograd + +#if 0 +// BEGIN TORCH_LIBRARY_IMPL Named +Tensor myadd_named(const Tensor& self, const Tensor& other) { + // TODO: shouldn't need to do size check here + TORCH_CHECK(self.sizes() == other.sizes()); + auto maybe_outnames = at::unify_from_right(self.names(), other.names()); + auto result = ([&]() { + at::NoNamesGuard guard; + return myadd(self, other); + })(); + at::namedinference::propagate_names_if_nonempty(result, maybe_outnames); + return result; +} + +TORCH_LIBRARY_IMPL(myops, Named, m) { + m.impl("myadd",
myadd_named); +} +// END TORCH_LIBRARY_IMPL Named +#endif diff --git a/advanced_source/dispatcher/test.py b/advanced_source/dispatcher/test.py new file mode 100644 index 00000000000..cd35b05a47a --- /dev/null +++ b/advanced_source/dispatcher/test.py @@ -0,0 +1,11 @@ +import torch + +torch.ops.load_library("build/libdispatcher.so") +print(torch.ops.myops.myadd(torch.randn(32, 32), torch.rand(32, 32))) +""" +# Doesn't currently work, because Python frontend on torch.ops doesn't +# support names (for not a good reason?) +x = torch.randn(32, 32, names=('A', 'B')) +y = torch.rand(32, 32, names=('A', 'B')) +print(torch.ops.myops.myadd(x, y)) +""" diff --git a/index.rst b/index.rst index 9a8762e8026..2b0f38b9cdb 100644 --- a/index.rst +++ b/index.rst @@ -472,6 +472,7 @@ Additional Resources advanced/torch_script_custom_ops advanced/torch_script_custom_classes advanced/cpp_autograd + advanced/dispatcher .. toctree:: :maxdepth: 2 From b478ae97cb0b47b676737f4083ff57b00cca82f1 Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Wed, 15 Jul 2020 13:44:29 -0700 Subject: [PATCH 2/4] typofix Signed-off-by: Edward Z. Yang --- advanced_source/dispatcher.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/advanced_source/dispatcher.rst b/advanced_source/dispatcher.rst index 91b1653c0a0..3867333d6b5 100644 --- a/advanced_source/dispatcher.rst +++ b/advanced_source/dispatcher.rst @@ -111,8 +111,8 @@ way under the hood. Here's what the dispatching function looks like: .. literalinclude:: ../advanced_source/dispatcher/op.cpp :language: cpp - :start-after: myadd - :end-before: myadd + :start-after: BEGIN myadd + :end-before: END myadd Let's break it down: @@ -140,8 +140,8 @@ With the dispatch function in hand, we can now write the autograd kernel: .. literalinclude:: ../advanced_source/dispatcher/op.cpp :language: cpp - :start-after: myadd_autograd - :end-before: myadd_autograd + :start-after: BEGIN myadd_autograd + :end-before: END myadd_autograd The autograd function is written as normal using ``torch::autograd::Function``, except that instead of directly writing the implementation in ``forward()``, From 898dede14ef91bf3b46a83575d4fc3a019c74aaa Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Thu, 16 Jul 2020 11:33:11 -0700 Subject: [PATCH 3/4] morefix Signed-off-by: Edward Z. Yang --- advanced_source/dispatcher.rst | 30 ++++++------------------------ 1 file changed, 6 insertions(+), 24 deletions(-) diff --git a/advanced_source/dispatcher.rst b/advanced_source/dispatcher.rst index 3867333d6b5..888744c90ec 100644 --- a/advanced_source/dispatcher.rst +++ b/advanced_source/dispatcher.rst @@ -33,9 +33,9 @@ Defining schema and backend implementations The general principle behind the dispatcher is that it divides the implementation of an operator into multiple kernels, each of which implements functionality for a specific *dispatch key*; for example, -`CPU`, `CUDA` or `Autograd`. The end effect is that when you call -an operator, we first execute the `Autograd` kernel, and then we -redispatch to the `CPU` or `CUDA` kernel depending on the device +CPU, CUDA or Autograd. The end effect is that when you call +an operator, we first execute the Autograd kernel, and then we +redispatch to the CPU or CUDA kernel depending on the device types of the passed in tensors. 
Let's take a look at the various parts involved in making this @@ -69,7 +69,7 @@ To do this, we can use the ``TORCH_LIBRARY_IMPL`` macro: :end-before: END TORCH_LIBRARY_IMPL CPU The ``TORCH_LIBRARY_IMPL`` lets us register implementations for operators on -a specific dispatch key (in this case, ``CPU``). Each call to ``impl`` +a specific dispatch key (in this case, CPU). Each call to ``impl`` associates a CPU kernel with the corresponding operator (which we previously defined in the ``TORCH_LIBRARY`` block). You can have as many ``TORCH_LIBRARY_IMPL`` blocks for a namespace as you like; so for example, @@ -147,7 +147,7 @@ The autograd function is written as normal using ``torch::autograd::Function``, except that instead of directly writing the implementation in ``forward()``, we: -1. Turn off autograd handling with the `at::AutoNonVariableTypeMode`` RAII +1. Turn off autograd handling with the ``at::AutoNonVariableTypeMode`` RAII guard, and then 2. Call the dispatch function ``myadd`` to call back into the dispatcher. @@ -249,24 +249,6 @@ general rules: * Any operation that does a convolution or gemm under the hood should probably be float16 -.. - - NB: This doesn't work because torch.ops doesn't support names. - - Named - ^^^^^ - - `Named tensors `_ allow - users to associate explicit names with tensor dimensions, and then have those - dimensions be propagated when you run operations on those tensors. If you - define a new operator, you have to also define rules for how names should - be checked and propagated. The Named kernel handles implementing these rules. - - .. literalinclude:: ../advanced_source/dispatcher/op.cpp - :language: cpp - :start-after: BEGIN TORCH_LIBRARY_IMPL Named - :end-before: END TORCH_LIBRARY_IMPL Named - Batched ^^^^^^^ @@ -282,5 +264,5 @@ Tracer The Tracer dispatch key implements support for recording invocations of operators into a trace when you run ``torch.jit.trace``. We intend to provide a boxed fallback that will implement tracing for arbitrary operations, -see `issue #41478 ` to track +see `issue #41478 `_ to track progress. From 2a96e0d5c42ad8a96f46ada92e8e35b8a5daa9b4 Mon Sep 17 00:00:00 2001 From: "Edward Z. Yang" Date: Wed, 22 Jul 2020 08:01:00 -0700 Subject: [PATCH 4/4] copyedit based on comments Signed-off-by: Edward Z. Yang --- advanced_source/dispatcher.rst | 49 +++++++++++++++++++++------------- 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/advanced_source/dispatcher.rst b/advanced_source/dispatcher.rst index 888744c90ec..7a7d806c328 100644 --- a/advanced_source/dispatcher.rst +++ b/advanced_source/dispatcher.rst @@ -31,12 +31,14 @@ Defining schema and backend implementations ------------------------------------------- The general principle behind the dispatcher is that it divides the -implementation of an operator into multiple kernels, each of which -implements functionality for a specific *dispatch key*; for example, -CPU, CUDA or Autograd. The end effect is that when you call -an operator, we first execute the Autograd kernel, and then we -redispatch to the CPU or CUDA kernel depending on the device -types of the passed in tensors. +implementation of an operator into multiple kernels, each of which implements +functionality for a specific *dispatch key*; for example, CPU, CUDA or Autograd. 
+The dispatcher determines what the highest priority dispatch key is at the time +you call an operator (this is done by looking at both the tensor arguments as +well as some thread local state), and transfers control to the kernel for that +dispatch key. The end effect is that when you call an operator, we first +execute the Autograd kernel, and then we redispatch to the CPU or CUDA kernel +depending on the device types of the passed in tensors. Let's take a look at the various parts involved in making this happen. First, we must define the schema for the operator in question. @@ -58,10 +60,12 @@ For concreteness, here is a really simple implementation of addition on CPU: :start-after: BEGIN myadd_cpu :end-before: END myadd_cpu -We'd like to register this function as an implementation of ``myops::myadd``, but we -don't want to register it as a catch-all kernel to be run in all cases; we -only want it to be run when we call ``myops::myadd`` at the backend on CPU tensors. -To do this, we can use the ``TORCH_LIBRARY_IMPL`` macro: +We'd like to register this function as an implementation of ``myops::myadd``. +However, the simple way of registering it (``def("myadd", myadd_cpu)``) would +register the kernel to run in all cases, even if the tensor is not a CPU +tensor! (Internally, we refer to these as "catch-all" kernels, since they +catch all cases.) To ensure that ``myadd_cpu`` is only run for +CPU tensors, we can use the ``TORCH_LIBRARY_IMPL`` macro: .. literalinclude:: ../advanced_source/dispatcher/op.cpp :language: cpp @@ -71,10 +75,8 @@ To do this, we can use the ``TORCH_LIBRARY_IMPL`` macro: The ``TORCH_LIBRARY_IMPL`` lets us register implementations for operators on a specific dispatch key (in this case, CPU). Each call to ``impl`` associates a CPU kernel with the corresponding operator (which we previously -defined in the ``TORCH_LIBRARY`` block). You can have as many -``TORCH_LIBRARY_IMPL`` blocks for a namespace as you like; so for example, -if we also have a CUDA implementation ``myadd_cuda``, we can register it -with: +defined in the ``TORCH_LIBRARY`` block). If we also have a CUDA implementation ``myadd_cuda``, +we can register it in a separate ``TORCH_LIBRARY_IMPL`` block: .. literalinclude:: ../advanced_source/dispatcher/op.cpp :language: cpp @@ -83,7 +85,17 @@ with: These registrations can be split across files or even across library boundaries; so for example, you could have these two ``TORCH_LIBRARY_IMPL`` blocks compiled -into a separate ``myops_cpu`` and ``myops_cuda`` dynamic library. +into separate ``myops_cpu`` and ``myops_cuda`` dynamic libraries. Generally +speaking, the structure of your registrations will look like this: + +1. A single ``TORCH_LIBRARY`` that lists every custom operator in your namespace + in a centralized place. +2. A ``TORCH_LIBRARY_IMPL`` per dispatch key that registers implementations for + that key (e.g., CPU or CUDA). If you like, you can further subdivide + ``TORCH_LIBRARY_IMPL`` blocks into a block per operator. This is convenient + if you have a separate file per operator implementation, but don't want to + expose the operators in a header; you can just put the registration in the + cpp file that defines your operator. .. note:: @@ -152,9 +164,10 @@ we: 2. Call the dispatch function ``myadd`` to call back into the dispatcher. Without (1), your calls will infinite loop (and stack overflow), because -``myadd`` will send you back to the autograd implementation!
With (1), -the redispatch will skip over autograd and go to the next handlers, -which will either be CPU and CUDA. +``myadd`` will send you back to this function (as the highest priority dispatch +key would still be autograd). With (1), +autograd is excluded from the set of dispatch keys under consideration, and +we will go to the next handlers, which will be either CPU or CUDA. We can now register this function in the same way we registered the CPU/CUDA functions: