[Quant] Move parts of BackendConfig tutorial #2169

Merged 1 commit on Jan 18, 2023

298 changes: 85 additions & 213 deletions prototype_source/backend_config_tutorial.rst

For more information on the motivation and implementation details behind
BackendConfig, please refer to this
`README <https://github.com/pytorch/pytorch/tree/master/torch/ao/quantization/backend_config>`__.

BackendConfig API Specification
-------------------------------

At a high level, BackendConfig specifies the quantization behavior for
each supported operator pattern (e.g. linear, conv-bn-relu). The API is
broken down into the following class hierarchy:

- `BackendConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.BackendConfig.html>`__:
The main class to be passed to prepare and convert functions.
- `BackendPatternConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.BackendPatternConfig.html>`__:
Config object that specifies quantization behavior for a given
operator pattern. Each BackendConfig consists of many of these.
- `DTypeConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeConfig.html>`__:
Config object that specifies the supported data types passed as
arguments to quantize ops in the reference model spec, for input
and output activations, weights, and biases. This object also
optionally specifies constraints associated with the data types.
Each BackendPatternConfig consists of one or more of these.
- `DTypeWithConstraints <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeWithConstraints.html>`__:
Constraints imposed by the backend on the quantization parameters
(scale and zero point) and ranges when quantizing to a given data
type. Each DTypeConfig consists of many of these.

The pattern specified in BackendPatternConfig follows the format
described `here <https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/backend_config/README.md#pattern-specification>`__.
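
To make the hierarchy concrete, here is a minimal sketch of how these
classes compose (the pattern, dtypes, and backend name below are purely
illustrative):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import (
        BackendConfig,
        BackendPatternConfig,
        DTypeConfig,
        DTypeWithConstraints,
    )

    # A quint8 activation dtype with no additional constraints
    act_quint8 = DTypeWithConstraints(dtype=torch.quint8)

    # One supported combination of input/output/weight/bias dtypes
    int8_dtype_config = DTypeConfig(
        input_dtype=act_quint8,
        output_dtype=act_quint8,
        weight_dtype=torch.qint8,
        bias_dtype=torch.float)

    # One pattern config per supported operator pattern
    linear_pattern_config = BackendPatternConfig(torch.nn.Linear) \
        .add_dtype_config(int8_dtype_config)

    # The top-level object passed to prepare and convert
    example_backend_config = BackendConfig("example_backend") \
        .set_backend_pattern_config(linear_pattern_config)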

BackendPatternConfig Specification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

set_observation_type
^^^^^^^^^^^^^^^^^^^^

Observation type here refers to how observers (or quant-dequant ops)
will be placed in the graph. This is used to produce the desired
reference patterns understood by the backend. Weighted ops such as
linear and conv require different observers (or quantization parameters
passed to quantize ops in the reference model) for the input and the
output (see `ObservationType <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.ObservationType.html>`__).

Note: This will be renamed in the near future, since we will soon insert
QuantDeQuantStubs with observers (and fake quantizes) attached instead
of observers themselves.
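
For example, a weighted pattern like linear is typically configured to use
a different observer for its output than for its input (a minimal sketch):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig, ObservationType

    linear_pattern_config = BackendPatternConfig(torch.nn.Linear) \
        .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT)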

set_dtype_configs / add_dtype_config
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each operator pattern may support one or more sets of
input/output/weight/bias data types, and each set may have its own
constraints. These requirements are captured in DTypeConfigs, which will
be described in more detail in the next section.
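
For example, a linear pattern that supports both a static int8
configuration and an fp16 configuration could be declared as follows
(a sketch; which combinations a backend advertises is up to the backend):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig, DTypeConfig

    static_int8_dtype_config = DTypeConfig(
        input_dtype=torch.quint8,
        output_dtype=torch.quint8,
        weight_dtype=torch.qint8,
        bias_dtype=torch.float)

    fp16_dtype_config = DTypeConfig(
        input_dtype=torch.float16,
        output_dtype=torch.float16,
        weight_dtype=torch.float16,
        bias_dtype=torch.float16)

    linear_pattern_config = BackendPatternConfig(torch.nn.Linear) \
        .set_dtype_configs([static_int8_dtype_config, fp16_dtype_config])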

set_root_module / set_reference_quantized_module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When we construct the reference quantized model during the convert
phase, the root modules (e.g. ``torch.nn.Linear`` for
``torch.ao.nn.intrinsic.LinearReLU``) will be swapped to the
corresponding reference quantized modules (e.g.
``torch.ao.nn.quantized.reference.Linear``). This allows custom backends
to specify custom reference quantized module implementations to match
the numerics of their lowered operators. Since this is a one-to-one
mapping, both the root module and the reference quantized module must be
specified in the same BackendPatternConfig in order for the conversion
to take place.
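
For example, the fused linear-relu pattern mentioned above maps its root
``torch.nn.Linear`` to the reference quantized linear (a sketch):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig

    linear_relu_config = BackendPatternConfig(torch.ao.nn.intrinsic.LinearReLU) \
        .set_root_module(torch.nn.Linear) \
        .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)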

set_fuser_method
^^^^^^^^^^^^^^^^

As an optimization, operator patterns such as (``torch.nn.Linear``,
``torch.nn.ReLU``) may be fused into ``torch.ao.nn.intrinsic.LinearReLU``.
``set_fuser_method`` specifies the function through which this is
performed. The first argument of this function is ``is_qat``, and the
rest of the arguments are the items in the tuple pattern, e.g. the fuser
method for the above pattern will have three arguments, ``is_qat``,
``linear``, and ``relu``. See `this
example <https://gist.github.com/jerryzh168/8bea7180a8ba3c279f2c9b050f2a69a6>`__
for a slightly more complicated usage.
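
A minimal fuser method for the (``torch.nn.Linear``, ``torch.nn.ReLU``)
pattern might look like this (a sketch):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig

    def fuse_linear_relu(is_qat, linear, relu):
        """Fuse separate linear and relu modules into a single LinearReLU module."""
        return torch.ao.nn.intrinsic.LinearReLU(linear, relu)

    linear_relu_fusion_config = BackendPatternConfig((torch.nn.Linear, torch.nn.ReLU)) \
        .set_fuser_method(fuse_linear_relu)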

set_fused_module
^^^^^^^^^^^^^^^^

This is used to identify fused weighted modules (e.g.
``torch.ao.nn.intrinsic.LinearReLU``) that need to be converted to
reference quantized modules.
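
Continuing the linear-relu sketch above, the fusion pattern config would
also declare the fused module it produces:

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig

    linear_relu_fusion_config = BackendPatternConfig((torch.nn.Linear, torch.nn.ReLU)) \
        .set_fuser_method(lambda is_qat, linear, relu: torch.ao.nn.intrinsic.LinearReLU(linear, relu)) \
        .set_fused_module(torch.ao.nn.intrinsic.LinearReLU)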

Data Type Restrictions
~~~~~~~~~~~~~~~~~~~~~~

Each DTypeConfig attached to a BackendPatternConfig represents a set of
supported data types passed as arguments to quantize ops in the reference
model spec. For example, consider the following reference model::

quant1 - [dequant1 - fp32_linear - quant2] - dequant2

The pattern in the square brackets refers to the reference pattern of
statically quantized linear. Setting the input dtype as `torch.quint8`
in the DTypeConfig means we pass in `torch.quint8` as the dtype argument
to the first quantize op (quant1). Similarly, setting the output dtype as
`torch.quint8` means we pass in `torch.quint8` as the dtype argument to
the second quantize op (quant2).

Note that the dtype here does not refer to the interface dtypes of the
op. For example, the "input dtype" here is not the dtype of the input
tensor passed to the quantized linear op. Though it can still be the
same as the interface dtype, this is not always the case, e.g. the
interface dtype is fp32 in dynamic quantization but the "input dtype"
specified in the DTypeConfig would still be quint8. The semantics of
dtypes here are the same as the semantics of the dtypes specified in
the observers.
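
For instance, a dynamically quantized linear could be described roughly as
follows; the op's interface dtype is fp32, but the DTypeConfig still records
quint8 as the "input dtype" of the quantize op in the reference pattern
(a sketch):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import DTypeConfig

    dynamic_int8_dtype_config = DTypeConfig(
        input_dtype=torch.quint8,   # dtype of the dynamic quantize op, not of the fp32 input tensor
        output_dtype=torch.float,   # output stays fp32, so no output quant-dequant pair is inserted
        weight_dtype=torch.qint8,
        bias_dtype=torch.float,
        is_dynamic=True)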

These dtypes are matched against the ones specified in the user’s
QConfig. If there is a match, and the QConfig satisfies the constraints
specified in the DTypeConfig (if any), then we will quantize the given
pattern using this DTypeConfig. Otherwise, the QConfig is ignored and
the pattern will not be quantized.
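
For example (a sketch), a QConfig with a quint8 activation observer and the
default qint8 weight observer would match a static int8 DTypeConfig like the
one above, so patterns carrying that DTypeConfig would be quantized under it:

.. code:: ipython3

    import torch
    from torch.ao.quantization import MinMaxObserver, QConfig, default_weight_observer

    qconfig = QConfig(
        activation=MinMaxObserver.with_args(dtype=torch.quint8),
        weight=default_weight_observer)  # qint8 weights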

There are two ways of specifying ``input_dtype``, ``output_dtype``, and
``weight_dtype``: as a plain ``torch.dtype`` or as a
``DTypeWithConstraints``. The constraints currently supported are:

- **quant_min_lower_bound** and **quant_max_upper_bound**: Lower and upper
bounds for the minimum and maximum quantized values respectively. If the
QConfig’s ``quant_min`` and ``quant_max`` fall outside this range, then
the QConfig will be ignored.
- **scale_min_lower_bound** and **scale_max_upper_bound**: Lower and
upper bounds for the minimum and maximum scale values respectively. If
the QConfig’s minimum scale value (currently exposed as ``eps``) falls
below the lower bound, then the QConfig will be ignored. Note that the
upper bound is currently not enforced.
- **scale_exact_match** and **zero_point_exact_match**: Exact match
requirements for scale and zero point, to be used for operators with
fixed quantization parameters such as sigmoid and tanh. If the observer
specified in the QConfig is neither ``FixedQParamsObserver`` nor
``FixedQParamsFakeQuantize``, or if the quantization parameters don't
match, then the QConfig will be ignored.
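
As an illustration of the exact-match constraints, a backend that implements
sigmoid with fixed quantization parameters could express that requirement
roughly as follows (the scale and zero point shown are the usual fixed
qparams for a [0, 1) output range):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import DTypeWithConstraints

    fixed_qparams_quint8 = DTypeWithConstraints(
        dtype=torch.quint8,
        scale_exact_match=1.0 / 256.0,
        zero_point_exact_match=0)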

End-to-End Example
------------------

Suppose we are a backend developer and we wish to integrate our backend
with PyTorch's quantization APIs. Our backend consists of two ops only:
quantized linear and quantized conv-relu. In this section, we will walk
through an end-to-end example of how to quantize a model with a custom
BackendConfig through `prepare_fx` and `convert_fx`.

.. code:: ipython3

    # Imports used throughout this example (these cover the APIs used below)
    import torch
    from torch.ao.quantization import (
        default_weight_observer,
        get_default_qconfig_mapping,
        MinMaxObserver,
        QConfig,
        QConfigMapping,
    )
    from torch.ao.quantization.backend_config import (
        BackendConfig,
        BackendPatternConfig,
        DTypeConfig,
        DTypeWithConstraints,
        ObservationType,
    )
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

1. Derive reference pattern for each quantized operator
--------------------------------------------------------

For quantized linear, suppose our backend expects the reference pattern
`[dequant - fp32_linear - quant]` and lowers it into a single quantized
linear op. The way to achieve this is to first insert quant-dequant ops
before and after the float linear op in the graph, producing the following
reference model::

quant1 - [dequant1 - fp32_linear - quant2] - dequant2

Similarly, for quantized conv-relu, we wish to produce the following
reference model, where the reference pattern in the square brackets will
be lowered into a single quantized conv-relu op::

quant1 - [dequant1 - fp32_conv_relu - quant2] - dequant2

2. Set DTypeConfigs with backend constraints
---------------------------------------------

In the reference patterns above, the input dtype specified in the
DTypeConfig will be passed as the dtype argument to quant1, while the
output dtype will be passed as the dtype argument to quant2. If the output
dtype is fp32, as in the case of dynamic quantization, then the output
quant-dequant pair will not be inserted. This example also shows how to
specify restrictions on quantization and scale ranges on a particular dtype.
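
The constrained quint8 activation dtype used below can be defined roughly as
follows (a sketch; the exact bounds are assumptions consistent with the
observer settings chosen later):

.. code:: ipython3

    quint8_with_constraints = DTypeWithConstraints(
        dtype=torch.quint8,
        quant_min_lower_bound=0,
        quant_max_upper_bound=255,
        scale_min_lower_bound=2 ** -12)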

.. code:: ipython3

    # Dtype config for weighted ops (variable name assumed): constrained quint8
    # activations, int8 weights, fp32 bias
    weighted_int8_dtype_config = DTypeConfig(
        input_dtype=quint8_with_constraints,
        output_dtype=quint8_with_constraints,
        weight_dtype=torch.qint8,
        bias_dtype=torch.float)

3. Set up fusion for conv-relu
-------------------------------

Note that the original user model contains separate conv and relu ops,
so we need to first fuse the conv and relu ops into a single conv-relu
op (`fp32_conv_relu`), and then quantize this op similar to how the linear
op is quantized. We can set up fusion by defining a function that accepts
3 arguments, where the first is whether or not this is for QAT, and the
remaining arguments refer to the individual items of the fused pattern.

.. code:: ipython3

def fuse_conv2d_relu(is_qat, conv, relu):
"""Return a fused ConvReLU2d from individual conv and relu modules."""
return torch.ao.nn.intrinsic.ConvReLU2d(conv, relu)

4. Define the BackendConfig
----------------------------

Now we have all the necessary pieces, so we go ahead and define our
BackendConfig. Here we use different observers (will be renamed) for
the input and output for the linear op, so the quantization params
passed to the two quantize ops (quant1 and quant2) will be different.
This is commonly the case for weighted ops like linear and conv.

For the conv-relu op, the observation type is the same. However, we
need two BackendPatternConfigs to support this op, one for fusion
and one for quantization. For both conv-relu and linear, we use the
DTypeConfig defined above.

.. code:: ipython3

linear_config = BackendPatternConfig() \
.set_pattern(torch.nn.Linear) \
.set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
        .add_dtype_config(weighted_int8_dtype_config) \
        .set_root_module(torch.nn.Linear) \
.set_qat_module(torch.nn.qat.Linear) \
.set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)

# For fusing Conv2d + ReLU into ConvReLU2d
# No need to set observation type and dtype config here, since we are not
# inserting quant-dequant ops in this step yet
    fused_conv_relu_config = BackendPatternConfig() \
        .set_pattern((torch.nn.Conv2d, torch.nn.ReLU)) \
        .set_fuser_method(fuse_conv2d_relu) \
        .set_fused_module(torch.ao.nn.intrinsic.ConvReLU2d)

    # For quantizing the fused ConvReLU2d (settings assumed, mirroring linear_config)
    conv_relu_config = BackendPatternConfig() \
        .set_pattern(torch.ao.nn.intrinsic.ConvReLU2d) \
        .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
        .add_dtype_config(weighted_int8_dtype_config) \
        .set_root_module(torch.nn.Conv2d) \
.set_qat_module(torch.ao.nn.intrinsic.qat.ConvReLU2d) \
.set_reference_quantized_module(torch.ao.nn.quantized.reference.Conv2d)

.. code:: ipython3

backend_config = BackendConfig("my_backend") \
.set_backend_pattern_config(linear_config) \
.set_backend_pattern_config(conv_relu_config) \
.set_backend_pattern_config(fused_conv_relu_config)

5. Set up QConfigMapping that satisfies the backend constraints
----------------------------------------------------------------

In order to use the ops defined above, the user must define a QConfig
that satisfies the constraints specified in the DTypeConfig. For more
detail, see the documentation for `DTypeConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeConfig.html>`__.
We will then use this QConfig for all the modules used in the patterns
we wish to quantize.

.. code:: ipython3

# Note: Here we use a quant_max of 127, but this could be up to 255 (see `quint8_with_constraints`)
activation_observer = MinMaxObserver.with_args(quant_min=0, quant_max=127, eps=2 ** -12)
qconfig = QConfig(activation=activation_observer, weight=default_weight_observer)

# Note: All individual items of a fused pattern, e.g. Conv2d and ReLU in
# (Conv2d, ReLU), must have the same QConfig
qconfig_mapping = QConfigMapping() \
.set_object_type(torch.nn.Linear, qconfig) \
.set_object_type(torch.nn.Conv2d, qconfig) \
.set_object_type(torch.nn.BatchNorm2d, qconfig) \
.set_object_type(torch.nn.ReLU, qconfig)

6. Quantize the model through prepare and convert
--------------------------------------------------

Finally, we quantize the model by passing the BackendConfig we defined
into prepare and convert. This produces a quantized linear module and
a fused quantized conv-relu module.

.. code:: ipython3

# ====================
# Example user model
# ====================

class MyModel(torch.nn.Module):
def __init__(self, use_bn: bool):
super().__init__()
            # Layer sizes below are assumed; chosen to work with the example
            # inputs of shape (1, 3, 10, 10) used later
            self.linear = torch.nn.Linear(10, 10)
            self.conv = torch.nn.Conv2d(3, 3, 3)
            self.bn = torch.nn.BatchNorm2d(3)
self.relu = torch.nn.ReLU()
self.sigmoid = torch.nn.Sigmoid()
self.use_bn = use_bn

def forward(self, x):
x = self.linear(x)
x = self.conv(x)
            if self.use_bn:
                x = self.bn(x)
            x = self.relu(x)
x = self.sigmoid(x)
return x

example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
model = MyModel(use_bn=False)
prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
    converted = convert_fx(prepared)
    print(converted)

Here we see that both linear and conv-relu are quantized. The tail of the
printed reference model looks like::

    sigmoid = self.sigmoid(dequantize_2); dequantize_2 = None
    return sigmoid

(7. Experiment with faulty BackendConfig setups)
-------------------------------------------------

As an experiment, here we modify the model to use conv-bn-relu
instead of conv-relu, but use the same BackendConfig, which doesn't
know how to quantize conv-bn-relu. As a result, only linear is
quantized, but conv-bn-relu is neither fused nor quantized.

.. code:: ipython3

# Only linear is quantized, since there's no rule for fusing conv-bn-relu
example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
model = MyModel(use_bn=True)
prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
    converted = convert_fx(prepared)
    print(converted)

As another experiment, here we use the default QConfigMapping, which
doesn't satisfy the dtype constraints specified in the backend. As
a result, nothing is quantized since the QConfigs are simply ignored.

.. code:: ipython3

# Nothing is quantized or fused, since backend constraints are not satisfied
example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
model = MyModel(use_bn=True)
prepared = prepare_fx(model, get_default_qconfig_mapping(), example_inputs, backend_config=backend_config)
    converted = convert_fx(prepared)
    print(converted)