[Quant] Move parts of BackendConfig tutorial #2169

Merged 1 commit on Jan 18, 2023

298 changes: 85 additions & 213 deletions prototype_source/backend_config_tutorial.rst

For more information on the motivation and implementation details behind
BackendConfig, please refer to this
`README <https://github.com/pytorch/pytorch/tree/master/torch/ao/quantization/backend_config>`__.

BackendConfig API Specification
-------------------------------

At a high level, BackendConfig specifies the quantization behavior for
each supported operator pattern (e.g. linear, conv-bn-relu). The API is
broken down into the following class hierarchy:

- `BackendConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.BackendConfig.html>`__:
The main class to be passed to prepare and convert functions.
- `BackendPatternConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.BackendPatternConfig.html>`__:
Config object that specifies quantization behavior for a given
operator pattern. Each BackendConfig consists of many of these.
- `DTypeConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeConfig.html>`__:
Config object that specifies the supported data types passed as
arguments to quantize ops in the reference model spec, for input
and output activations, weights, and biases. This object also
optionally specifies constraints associated with the data types.
Each BackendPatternConfig consists of one or more of these.
- `DTypeWithConstraints <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeWithConstraints.html>`__:
Constraints imposed by the backend on the quantization parameters
(scale and zero point) and ranges when quantizing to a given data
type. Each DTypeConfig consists of many of these.

The pattern specified in BackendPatternConfig follows the format
described `here <https://github.com/pytorch/pytorch/blob/master/torch/ao/quantization/backend_config/README.md#pattern-specification>`__.
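
To make the hierarchy concrete, here is a minimal sketch of how these
classes compose (the pattern, dtypes, and backend name below are purely
illustrative):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import (
        BackendConfig,
        BackendPatternConfig,
        DTypeConfig,
        DTypeWithConstraints,
    )

    # A quint8 activation dtype with no additional constraints
    act_quint8 = DTypeWithConstraints(dtype=torch.quint8)

    # One supported combination of input/output/weight/bias dtypes
    int8_dtype_config = DTypeConfig(
        input_dtype=act_quint8,
        output_dtype=act_quint8,
        weight_dtype=torch.qint8,
        bias_dtype=torch.float)

    # One pattern config per supported operator pattern
    linear_pattern_config = BackendPatternConfig(torch.nn.Linear) \
        .add_dtype_config(int8_dtype_config)

    # The top-level object passed to prepare and convert
    example_backend_config = BackendConfig("example_backend") \
        .set_backend_pattern_config(linear_pattern_config)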

BackendPatternConfig Specification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

set_observation_type
^^^^^^^^^^^^^^^^^^^^

Observation type here refers to how observers (or quant-dequant ops)
will be placed in the graph. This is used to produce the desired
reference patterns understood by the backend. Weighted ops such as
linear and conv require different observers (or quantization parameters
passed to quantize ops in the reference model) for the input and the
output (see `ObservationType <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.ObservationType.html>`__).

Note: This will be renamed in the near future, since we will soon insert
QuantDeQuantStubs with observers (and fake quantizes) attached instead
of observers themselves.
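
For example, a weighted pattern like linear is typically configured to use
a different observer for its output than for its input (a minimal sketch):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig, ObservationType

    linear_pattern_config = BackendPatternConfig(torch.nn.Linear) \
        .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT)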

set_dtype_configs / add_dtype_config
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each operator pattern may support one or more sets of
input/output/weight/bias data types, and each set may have its own
constraints. These requirements are captured in DTypeConfigs, which will
be described in more detail in the next section.
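
For example, a linear pattern that supports both a static int8
configuration and an fp16 configuration could be declared as follows
(a sketch; which combinations a backend advertises is up to the backend):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig, DTypeConfig

    static_int8_dtype_config = DTypeConfig(
        input_dtype=torch.quint8,
        output_dtype=torch.quint8,
        weight_dtype=torch.qint8,
        bias_dtype=torch.float)

    fp16_dtype_config = DTypeConfig(
        input_dtype=torch.float16,
        output_dtype=torch.float16,
        weight_dtype=torch.float16,
        bias_dtype=torch.float16)

    linear_pattern_config = BackendPatternConfig(torch.nn.Linear) \
        .set_dtype_configs([static_int8_dtype_config, fp16_dtype_config])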

set_root_module / set_reference_quantized_module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When we construct the reference quantized model during the convert
phase, the root modules (e.g. ``torch.nn.Linear`` for
``torch.ao.nn.intrinsic.LinearReLU``) will be swapped to the
corresponding reference quantized modules (e.g.
``torch.ao.nn.quantized.reference.Linear``). This allows custom backends
to specify custom reference quantized module implementations to match
the numerics of their lowered operators. Since this is a one-to-one
mapping, both the root module and the reference quantized module must be
specified in the same BackendPatternConfig in order for the conversion
to take place.
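
For example, the fused linear-relu pattern mentioned above maps its root
``torch.nn.Linear`` to the reference quantized linear (a sketch):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig

    linear_relu_config = BackendPatternConfig(torch.ao.nn.intrinsic.LinearReLU) \
        .set_root_module(torch.nn.Linear) \
        .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)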

set_fuser_method
^^^^^^^^^^^^^^^^

As an optimization, operator patterns such as (``torch.nn.Linear``,
``torch.nn.ReLU``) may be fused into ``torch.ao.nn.intrinsic.LinearReLU``.
``set_fuser_method`` specifies the function through which this is
performed. The first argument of this function is ``is_qat``, and the
rest of the arguments are the items in the tuple pattern, e.g. the fuser
method for the above pattern will have three arguments, ``is_qat``,
``linear``, and ``relu``. See `this
example <https://gist.github.com/jerryzh168/8bea7180a8ba3c279f2c9b050f2a69a6>`__
for a slightly more complicated usage.
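
A minimal fuser method for the (``torch.nn.Linear``, ``torch.nn.ReLU``)
pattern might look like this (a sketch):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig

    def fuse_linear_relu(is_qat, linear, relu):
        """Fuse separate linear and relu modules into a single LinearReLU module."""
        return torch.ao.nn.intrinsic.LinearReLU(linear, relu)

    linear_relu_fusion_config = BackendPatternConfig((torch.nn.Linear, torch.nn.ReLU)) \
        .set_fuser_method(fuse_linear_relu)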

set_fused_module
^^^^^^^^^^^^^^^^

This is used to identify fused weighted modules (e.g.
``torch.ao.nn.intrinsic.LinearReLU``) that need to be converted to
reference quantized modules.
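
Continuing the linear-relu sketch above, the fusion pattern config would
also declare the fused module it produces:

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import BackendPatternConfig

    linear_relu_fusion_config = BackendPatternConfig((torch.nn.Linear, torch.nn.ReLU)) \
        .set_fuser_method(lambda is_qat, linear, relu: torch.ao.nn.intrinsic.LinearReLU(linear, relu)) \
        .set_fused_module(torch.ao.nn.intrinsic.LinearReLU)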

Data Type Restrictions
~~~~~~~~~~~~~~~~~~~~~~

Each DTypeConfig attached to a BackendPatternConfig represents a set of
supported data types passed as arguments to quantize ops in the reference
model spec. For example, consider the following reference model::

quant1 - [dequant1 - fp32_linear - quant2] - dequant2

The pattern in the square brackets refers to the reference pattern of
statically quantized linear. Setting the input dtype as `torch.quint8`
in the DTypeConfig means we pass in `torch.quint8` as the dtype argument
to the first quantize op (quant1). Similarly, setting the output dtype as
`torch.quint8` means we pass in `torch.quint8` as the dtype argument to
the second quantize op (quant2).

Note that the dtype here does not refer to the interface dtypes of the
op. For example, the "input dtype" here is not the dtype of the input
tensor passed to the quantized linear op. Though it can still be the
same as the interface dtype, this is not always the case, e.g. the
interface dtype is fp32 in dynamic quantization but the "input dtype"
specified in the DTypeConfig would still be quint8. The semantics of
dtypes here are the same as the semantics of the dtypes specified in
the observers.
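
For instance, a dynamically quantized linear could be described roughly as
follows; the op's interface dtype is fp32, but the DTypeConfig still records
quint8 as the "input dtype" of the quantize op in the reference pattern
(a sketch):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import DTypeConfig

    dynamic_int8_dtype_config = DTypeConfig(
        input_dtype=torch.quint8,   # dtype of the dynamic quantize op, not of the fp32 input tensor
        output_dtype=torch.float,   # output stays fp32, so no output quant-dequant pair is inserted
        weight_dtype=torch.qint8,
        bias_dtype=torch.float,
        is_dynamic=True)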

These dtypes are matched against the ones specified in the user’s
QConfig. If there is a match, and the QConfig satisfies the constraints
specified in the DTypeConfig (if any), then we will quantize the given
pattern using this DTypeConfig. Otherwise, the QConfig is ignored and
the pattern will not be quantized.
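
For example (a sketch), a QConfig with a quint8 activation observer and the
default qint8 weight observer would match a static int8 DTypeConfig like the
one above, so patterns carrying that DTypeConfig would be quantized under it:

.. code:: ipython3

    import torch
    from torch.ao.quantization import MinMaxObserver, QConfig, default_weight_observer

    qconfig = QConfig(
        activation=MinMaxObserver.with_args(dtype=torch.quint8),
        weight=default_weight_observer)  # qint8 weights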

There are two ways of specifying ``input_dtype``, ``output_dtype``, and
``weight_dtype``: as a plain ``torch.dtype`` or as a
``DTypeWithConstraints``. The constraints currently supported are:

- **quant_min_lower_bound** and **quant_max_upper_bound**: Lower and upper
bounds for the minimum and maximum quantized values respectively. If the
QConfig’s ``quant_min`` and ``quant_max`` fall outside this range, then
the QConfig will be ignored.
- **scale_min_lower_bound** and **scale_max_upper_bound**: Lower and
upper bounds for the minimum and maximum scale values respectively. If
the QConfig’s minimum scale value (currently exposed as ``eps``) falls
below the lower bound, then the QConfig will be ignored. Note that the
upper bound is currently not enforced.
- **scale_exact_match** and **zero_point_exact_match**: Exact match
requirements for scale and zero point, to be used for operators with
fixed quantization parameters such as sigmoid and tanh. If the observer
specified in the QConfig is neither ``FixedQParamsObserver`` nor
``FixedQParamsFakeQuantize``, or if the quantization parameters don't
match, then the QConfig will be ignored.
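
As an illustration of the exact-match constraints, a backend that implements
sigmoid with fixed quantization parameters could express that requirement
roughly as follows (the scale and zero point shown are the usual fixed
qparams for a [0, 1) output range):

.. code:: ipython3

    import torch
    from torch.ao.quantization.backend_config import DTypeWithConstraints

    fixed_qparams_quint8 = DTypeWithConstraints(
        dtype=torch.quint8,
        scale_exact_match=1.0 / 256.0,
        zero_point_exact_match=0)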

End-to-End Example
------------------

Suppose we are a backend developer and we wish to integrate our backend
with PyTorch's quantization APIs. Our backend consists of two ops only:
quantized linear and quantized conv-relu. In this section, we will walk
through an end-to-end example of how to quantize a model with a custom
BackendConfig through `prepare_fx` and `convert_fx`.

.. code:: ipython3

    # Imports used throughout this example (these cover the APIs used below)
    import torch
    from torch.ao.quantization import (
        default_weight_observer,
        get_default_qconfig_mapping,
        MinMaxObserver,
        QConfig,
        QConfigMapping,
    )
    from torch.ao.quantization.backend_config import (
        BackendConfig,
        BackendPatternConfig,
        DTypeConfig,
        DTypeWithConstraints,
        ObservationType,
    )
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

1. Derive reference pattern for each quantized operator
--------------------------------------------------------

For quantized linear, suppose our backend expects the reference pattern
`[dequant - fp32_linear - quant]` and lowers it into a single quantized
linear op. The way to achieve this is to first insert quant-dequant ops
before and after the float linear op in the graph, producing the following
reference model::

quant1 - [dequant1 - fp32_linear - quant2] - dequant2

Similarly, for quantized conv-relu, we wish to produce the following
reference model, where the reference pattern in the square brackets will
be lowered into a single quantized conv-relu op::

quant1 - [dequant1 - fp32_conv_relu - quant2] - dequant2

2. Set DTypeConfigs with backend constraints
---------------------------------------------

In the reference patterns above, the input dtype specified in the
DTypeConfig will be passed as the dtype argument to quant1, while the
output dtype will be passed as the dtype argument to quant2. If the output
dtype is fp32, as in the case of dynamic quantization, then the output
quant-dequant pair will not be inserted. This example also shows how to
specify restrictions on quantization and scale ranges on a particular dtype.
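
The constrained quint8 activation dtype used below can be defined roughly as
follows (a sketch; the exact bounds are assumptions consistent with the
observer settings chosen later):

.. code:: ipython3

    quint8_with_constraints = DTypeWithConstraints(
        dtype=torch.quint8,
        quant_min_lower_bound=0,
        quant_max_upper_bound=255,
        scale_min_lower_bound=2 ** -12)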

.. code:: ipython3

    # Dtype config for weighted ops (variable name assumed): constrained quint8
    # activations, int8 weights, fp32 bias
    weighted_int8_dtype_config = DTypeConfig(
        input_dtype=quint8_with_constraints,
        output_dtype=quint8_with_constraints,
        weight_dtype=torch.qint8,
        bias_dtype=torch.float)

3. Set up fusion for conv-relu
-------------------------------

Note that the original user model contains separate conv and relu ops,
so we need to first fuse the conv and relu ops into a single conv-relu
op (`fp32_conv_relu`), and then quantize this op similar to how the linear
op is quantized. We can set up fusion by defining a function that accepts
3 arguments, where the first is whether or not this is for QAT, and the
remaining arguments refer to the individual items of the fused pattern.

.. code:: ipython3

def fuse_conv2d_relu(is_qat, conv, relu):
"""Return a fused ConvReLU2d from individual conv and relu modules."""
return torch.ao.nn.intrinsic.ConvReLU2d(conv, relu)

4. Define the BackendConfig
----------------------------

Now we have all the necessary pieces, so we go ahead and define our
BackendConfig. Here we use different observers (will be renamed) for
the input and output for the linear op, so the quantization params
passed to the two quantize ops (quant1 and quant2) will be different.
This is commonly the case for weighted ops like linear and conv.

For the conv-relu op, the observation type is the same. However, we
need two BackendPatternConfigs to support this op, one for fusion
and one for quantization. For both conv-relu and linear, we use the
DTypeConfig defined above.

.. code:: ipython3

linear_config = BackendPatternConfig() \
.set_pattern(torch.nn.Linear) \
.set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
        .add_dtype_config(weighted_int8_dtype_config) \
        .set_root_module(torch.nn.Linear) \
.set_qat_module(torch.nn.qat.Linear) \
.set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)

# For fusing Conv2d + ReLU into ConvReLU2d
# No need to set observation type and dtype config here, since we are not
# inserting quant-dequant ops in this step yet
    fused_conv_relu_config = BackendPatternConfig() \
        .set_pattern((torch.nn.Conv2d, torch.nn.ReLU)) \
        .set_fuser_method(fuse_conv2d_relu) \
        .set_fused_module(torch.ao.nn.intrinsic.ConvReLU2d)

    # For quantizing the fused ConvReLU2d (settings assumed, mirroring linear_config)
    conv_relu_config = BackendPatternConfig() \
        .set_pattern(torch.ao.nn.intrinsic.ConvReLU2d) \
        .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
        .add_dtype_config(weighted_int8_dtype_config) \
        .set_root_module(torch.nn.Conv2d) \
.set_qat_module(torch.ao.nn.intrinsic.qat.ConvReLU2d) \
.set_reference_quantized_module(torch.ao.nn.quantized.reference.Conv2d)

.. code:: ipython3

backend_config = BackendConfig("my_backend") \
.set_backend_pattern_config(linear_config) \
.set_backend_pattern_config(conv_relu_config) \
.set_backend_pattern_config(fused_conv_relu_config)

5. Set up QConfigMapping that satisfies the backend constraints
----------------------------------------------------------------

In order to use the ops defined above, the user must define a QConfig
that satisfies the constraints specified in the DTypeConfig. For more
detail, see the documentation for `DTypeConfig <https://pytorch.org/docs/stable/generated/torch.ao.quantization.backend_config.DTypeConfig.html>`__.
We will then use this QConfig for all the modules used in the patterns
we wish to quantize.

.. code:: ipython3

# Note: Here we use a quant_max of 127, but this could be up to 255 (see `quint8_with_constraints`)
activation_observer = MinMaxObserver.with_args(quant_min=0, quant_max=127, eps=2 ** -12)
qconfig = QConfig(activation=activation_observer, weight=default_weight_observer)

# Note: All individual items of a fused pattern, e.g. Conv2d and ReLU in
# (Conv2d, ReLU), must have the same QConfig
qconfig_mapping = QConfigMapping() \
.set_object_type(torch.nn.Linear, qconfig) \
.set_object_type(torch.nn.Conv2d, qconfig) \
.set_object_type(torch.nn.BatchNorm2d, qconfig) \
.set_object_type(torch.nn.ReLU, qconfig)

6. Quantize the model through prepare and convert
--------------------------------------------------

Finally, we quantize the model by passing the BackendConfig we defined
into prepare and convert. This produces a quantized linear module and
a fused quantized conv-relu module.

.. code:: ipython3

# ====================
# Example user model
# ====================

class MyModel(torch.nn.Module):
def __init__(self, use_bn: bool):
super().__init__()
            # Layer sizes below are assumed; chosen to work with the example
            # inputs of shape (1, 3, 10, 10) used later
            self.linear = torch.nn.Linear(10, 10)
            self.conv = torch.nn.Conv2d(3, 3, 3)
            self.bn = torch.nn.BatchNorm2d(3)
self.relu = torch.nn.ReLU()
self.sigmoid = torch.nn.Sigmoid()
self.use_bn = use_bn

def forward(self, x):
x = self.linear(x)
x = self.conv(x)
            if self.use_bn:
                x = self.bn(x)
            x = self.relu(x)
x = self.sigmoid(x)
return x

example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
model = MyModel(use_bn=False)
prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
    converted = convert_fx(prepared)
    print(converted)

Here we see that both linear and conv-relu are quantized. The tail of the
printed reference model looks like::

    sigmoid = self.sigmoid(dequantize_2); dequantize_2 = None
    return sigmoid

(7. Experiment with faulty BackendConfig setups)
-------------------------------------------------

As an experiment, here we modify the model to use conv-bn-relu
instead of conv-relu, but use the same BackendConfig, which doesn't
know how to quantize conv-bn-relu. As a result, only linear is
quantized, but conv-bn-relu is neither fused nor quantized.

.. code:: ipython3

# Only linear is quantized, since there's no rule for fusing conv-bn-relu
example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
model = MyModel(use_bn=True)
prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
    converted = convert_fx(prepared)
    print(converted)

As another experiment, here we use the default QConfigMapping, which
doesn't satisfy the dtype constraints specified in the backend. As
a result, nothing is quantized since the QConfigs are simply ignored.

.. code:: ipython3

# Nothing is quantized or fused, since backend constraints are not satisfied
example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
model = MyModel(use_bn=True)
prepared = prepare_fx(model, get_default_qconfig_mapping(), example_inputs, backend_config=backend_config)
    converted = convert_fx(prepared)
    print(converted)