diff --git a/prototype_source/backend_config_tutorial.rst b/prototype_source/backend_config_tutorial.rst
index 037e2ab3073..ba6729285e5 100644
--- a/prototype_source/backend_config_tutorial.rst
+++ b/prototype_source/backend_config_tutorial.rst
@@ -11,145 +11,6 @@ For more information on the motivation and implementation details behind
 BackendConfig, please refer to this `README `__.
 
-BackendConfig API Specification
--------------------------------
-
-On a high level, BackendConfig specifies the quantization behavior for
-each supported operator pattern (e.g. linear, conv-bn-relu). The API is
-broken down into the following class hierarchy:
-
-- `BackendConfig `__:
-  The main class to be passed to prepare and convert functions.
-- `BackendPatternConfig `__:
-  Config object that specifies quantization behavior for a given
-  operator pattern. Each BackendConfig consists of many of these.
-- `DTypeConfig `__:
-  Config object that specifies the supported data types passed as
-  arguments to quantize ops in the reference model spec, for input
-  and output activations, weights, and biases. This object also
-  optionally specifies constraints associated with the data types.
-  Each BackendPatternConfig consists of one or more of these.
-- `DTypeWithConstraints `__:
-  Constraints imposed by the backend on the quantization parameters
-  (scale and zero point) and ranges when quantizing to a given data
-  type. Each DTypeConfig consists of many of these.
-
-The pattern specified in BackendPatternConfig follows the format
-described `here `__.
-
-BackendPatternConfig Specification
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-set_observation_type
-^^^^^^^^^^^^^^^^^^^^
-
-Observation type here refers to how observers (or quant-dequant ops)
-will be placed in the graph. This is used to produce the desired
-reference patterns understood by the backend. Weighted ops such as
-linear and conv require different observers (or quantization parameters
-passed to quantize ops in the reference model) for the input and the
-output (see `ObservationType `__).
-
-Note: This will be renamed in the near future, since we will soon insert
-QuantDeQuantStubs with observers (and fake quantizes) attached instead
-of observers themselves.
-
-set_dtype_configs / add_dtype_config
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Each operator pattern may support one or more sets of
-input/output/weight/bias data types, and each set may have its own
-constraints. These requirements are captured in DTypeConfigs, which will
-be described in more detail in the next section.
-
-set_root_module / set_reference_quantized_module
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-When we construct the reference quantized model during the convert
-phase, the root modules (e.g. ``torch.nn.Linear`` for
-``torch.ao.nn.intrinsic.LinearReLU``) will be swapped to the
-corresponding reference quantized modules (e.g.
-``torch.ao.nn.quantized.reference.Linear``). This allows custom backends
-to specify custom reference quantized module implementations to match
-the numerics of their lowered operators. Since this is a one-to-one
-mapping, both the root module and the reference quantized module must be
-specified in the same BackendPatternConfig in order for the conversion
-to take place.
-
-set_fuser_method
-^^^^^^^^^^^^^^^^
-
-As an optimization, operator patterns such as (``torch.nn.Linear``,
-``torch.nn.ReLU``) may be fused into ``nni.LinearReLU``.
-``set_fuser_method`` specifies the function through which this is
-performed.
-The first argument of this function is ``is_qat``, and the rest of the
-arguments are the items in the tuple pattern, e.g. the fuser method for
-the above pattern will have three arguments, ``is_qat``, ``linear``,
-and ``relu``. See `this example `__ for a slightly more
-complicated usage.
-
-set_fused_module
-^^^^^^^^^^^^^^^^
-
-This is used to identify fused weighted modules (e.g.
-``torch.ao.nn.intrinsic.LinearReLU``) that need to be converted to
-reference quantized modules.
-
-Data Type Restrictions
-~~~~~~~~~~~~~~~~~~~~~~
-
-Each DTypeConfig attached to a BackendPatternConfig represents a set of
-supported data types passed as arguments to quantize ops in the reference
-model spec. For example, consider the following reference model::
-
-  quant1 - [dequant1 - fp32_linear - quant2] - dequant2
-
-The pattern in the square brackets refers to the reference pattern of
-statically quantized linear. Setting the input dtype as `torch.quint8`
-in the DTypeConfig means we pass in `torch.quint8` as the dtype argument
-to the first quantize op (quant1). Similarly, setting the output dtype as
-`torch.quint8` means we pass in `torch.quint8` as the dtype argument to
-the second quantize op (quant2).
-
-Note that the dtype here does not refer to the interface dtypes of the
-op. For example, the "input dtype" here is not the dtype of the input
-tensor passed to the quantized linear op. Though it can still be the
-same as the interface dtype, this is not always the case, e.g. the
-interface dtype is fp32 in dynamic quantization but the "input dtype"
-specified in the DTypeConfig would still be quint8. The semantics of
-dtypes here are the same as the semantics of the dtypes specified in
-the observers.
-
-These dtypes are matched against the ones specified in the user's
-QConfig. If there is a match, and the QConfig satisfies the constraints
-specified in the DTypeConfig (if any), then we will quantize the given
-pattern using this DTypeConfig. Otherwise, the QConfig is ignored and
-the pattern will not be quantized.
-
-There are two ways of specifying ``input_dtype``, ``output_dtype``, and
-``weight_dtype``, as simple ``torch.dtype`` or as
-``DTypeWithConstraints``. The constraints currently supported are:
-
-- **quant_min_lower_bound** and **quant_max_upper_bound**: Lower and upper
-  bounds for the minimum and maximum quantized values respectively. If the
-  QConfig's ``quant_min`` and ``quant_max`` fall outside this range, then
-  the QConfig will be ignored.
-- **scale_min_lower_bound** and **scale_max_upper_bound**: Lower and
-  upper bounds for the minimum and maximum scale values respectively. If
-  the QConfig's minimum scale value (currently exposed as ``eps``) falls
-  below the lower bound, then the QConfig will be ignored. Note that the
-  upper bound is currently not enforced.
-- **scale_exact_match** and **zero_point_exact_match**: Exact match
-  requirements for scale and zero point, to be used for operators with
-  fixed quantization parameters such as sigmoid and tanh. If the observer
-  specified in the QConfig is neither ``FixedQParamsObserver`` nor
-  ``FixedQParamsFakeQuantize``, or if the quantization parameters don't
-  match, then the QConfig will be ignored.
-
-End-to-End Example
-------------------
-
 Suppose we are a backend developer and we wish to integrate our backend
 with PyTorch's quantization APIs. Our backend consists of two ops only:
 quantized linear and quantized conv-relu. In this section, we will walk
@@ -175,6 +36,9 @@ BackendConfig through `prepare_fx` and `convert_fx`.
 )
 from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
 
+1. Derive reference pattern for each quantized operator
+--------------------------------------------------------
+
 For quantized linear, suppose our backend expects the reference pattern
 `[dequant - fp32_linear - quant]` and lowers it into a single quantized
 linear op. The way to achieve this is to first insert quant-dequant ops
@@ -183,17 +47,21 @@ reference model::
 
   quant1 - [dequant1 - fp32_linear - quant2] - dequant2
 
-Here we specify using different observers (will be renamed) for the input
-and output for the linear op, so the quantization params passed to the two
-quantize ops (quant1 and quant2) will be different. This is commonly the
-case for weighted ops like linear and conv.
+Similarly, for quantized conv-relu, we wish to produce the following
+reference model, where the reference pattern in the square brackets will
+be lowered into a single quantized conv-relu op::
+
+  quant1 - [dequant1 - fp32_conv_relu - quant2] - dequant2
 
-The input dtype specified in the DTypeConfig will be passed as the dtype
-argument to quant1, while the output dtype will be passed as the dtype
-argument to quant2. If the output dtype is fp32, as in the case of dynamic
-quantization, then the output quant-dequant pair will not be inserted.
-This example also shows how to specify restrictions on quantization and
-scale ranges on a particular dtype.
+2. Set DTypeConfigs with backend constraints
+---------------------------------------------
+
+In the reference patterns above, the input dtype specified in the
+DTypeConfig will be passed as the dtype argument to quant1, while the
+output dtype will be passed as the dtype argument to quant2. If the output
+dtype is fp32, as in the case of dynamic quantization, then the output
+quant-dequant pair will not be inserted. This example also shows how to
+specify restrictions on quantization and scale ranges on a particular dtype.
 
 .. code:: ipython3
 
@@ -211,6 +79,38 @@ scale ranges on a particular dtype.
         weight_dtype=torch.qint8,
         bias_dtype=torch.float)
 
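+For illustration, the note above about dynamic quantization can be made
+concrete. A minimal sketch of a dynamic variant of this DTypeConfig
+(illustrative only, not used in the rest of this tutorial):
+
+.. code:: ipython3
+
+    # Illustrative sketch: a DTypeConfig for dynamic quantization, where
+    # the fp32 output dtype means no output quant-dequant pair is inserted
+    dynamic_int8_dtype_config = DTypeConfig(
+        input_dtype=torch.quint8,
+        output_dtype=torch.float,
+        weight_dtype=torch.qint8,
+        bias_dtype=torch.float,
+        is_dynamic=True)
+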
+3. Set up fusion for conv-relu
+-------------------------------
+
+Note that the original user model contains separate conv and relu ops,
+so we need to first fuse the conv and relu ops into a single conv-relu
+op (`fp32_conv_relu`), and then quantize this op the same way the linear
+op is quantized. We can set up fusion by defining a function that accepts
+three arguments: the first, ``is_qat``, indicates whether the fusion is
+for quantization-aware training, and the remaining arguments are the
+individual items of the fused pattern.
+
+.. code:: ipython3
+
+    def fuse_conv2d_relu(is_qat, conv, relu):
+        """Return a fused ConvReLU2d from individual conv and relu modules."""
+        return torch.ao.nn.intrinsic.ConvReLU2d(conv, relu)
+
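+As a quick sanity check (an illustrative aside; during actual fusion the
+workflow calls this function for us), the fuser method can be exercised
+directly on standalone modules:
+
+.. code:: ipython3
+
+    # Calling the fuser method by hand on freshly constructed modules
+    fused = fuse_conv2d_relu(False, torch.nn.Conv2d(3, 3, 3), torch.nn.ReLU())
+    print(type(fused))  # <class 'torch.ao.nn.intrinsic.modules.fused.ConvReLU2d'>
+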
+4. Define the BackendConfig
+----------------------------
+
+Now we have all the necessary pieces, so we go ahead and define our
+BackendConfig. Here we use different observers (these will be renamed in
+the future) for the input and output of the linear op, so the quantization
+params passed to the two quantize ops (quant1 and quant2) will be
+different. This is commonly the case for weighted ops like linear and conv.
+
+For the conv-relu op, the observation type is the same. However, we
+need two BackendPatternConfigs to support this op, one for fusion
+and one for quantization. For both conv-relu and linear, we use the
+DTypeConfig defined above.
+
+.. code:: ipython3
+
     linear_config = BackendPatternConfig() \
         .set_pattern(torch.nn.Linear) \
         .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT) \
@@ -219,24 +119,6 @@ scale ranges on a particular dtype.
         .set_qat_module(torch.nn.qat.Linear) \
         .set_reference_quantized_module(torch.ao.nn.quantized.reference.Linear)
 
-For quantized conv-relu, the observation type and DTypeConfig settings
-are the same, since we wish to produce the following reference model,
-where the reference pattern in the square brackets will be lowered into
-a single quantized conv-relu op::
-
-  quant1 - [dequant1 - fp32_conv_relu - quant2] - dequant2
-
-However, first we need to fuse the conv and relu ops into a single
-conv-relu op (`fp32_conv_relu`), and then quantize this op similar to
-how the linear op is quantized. Thus, we need two BackendPatternConfigs
-to support this op, one for fusion and one for quantization:
-
-.. code:: ipython3
-
-    def fuse_conv2d_relu(is_qat, conv, relu):
-        """Return a fused ConvReLU2d from individual conv and relu modules."""
-        return torch.ao.nn.intrinsic.ConvReLU2d(conv, relu)
-
     # For fusing Conv2d + ReLU into ConvReLU2d
     # No need to set observation type and dtype config here, since we are not
     # inserting quant-dequant ops in this step yet
@@ -254,23 +136,43 @@ to support this op, one for fusion and one for quantization:
         .set_qat_module(torch.ao.nn.intrinsic.qat.ConvReLU2d) \
         .set_reference_quantized_module(torch.ao.nn.quantized.reference.Conv2d)
 
-Now we have all the necessary pieces, so we go ahead and define our
-BackendConfig and test it out on an example model. Here we see that
-both linear and conv-relu are quantized.
-
-.. code:: ipython3
-
     backend_config = BackendConfig("my_backend") \
         .set_backend_pattern_config(linear_config) \
         .set_backend_pattern_config(conv_relu_config) \
        .set_backend_pattern_config(fused_conv_relu_config)
 
+5. Set up QConfigMapping that satisfies the backend constraints
+----------------------------------------------------------------
+
+In order to use the ops defined above, the user must define a QConfig
+that satisfies the constraints specified in the DTypeConfig. For more
+detail, see the documentation for `DTypeConfig `__.
+We will then use this QConfig for all the modules used in the patterns
+we wish to quantize.
+
+.. code:: ipython3
+
+    # Note: Here we use a quant_max of 127, but this could be up to 255 (see `quint8_with_constraints`)
+    activation_observer = MinMaxObserver.with_args(quant_min=0, quant_max=127, eps=2 ** -12)
+    qconfig = QConfig(activation=activation_observer, weight=default_weight_observer)
+
+    # Note: All individual items of a fused pattern, e.g. Conv2d and ReLU in
+    # (Conv2d, ReLU), must have the same QConfig
+    qconfig_mapping = QConfigMapping() \
+        .set_object_type(torch.nn.Linear, qconfig) \
+        .set_object_type(torch.nn.Conv2d, qconfig) \
+        .set_object_type(torch.nn.BatchNorm2d, qconfig) \
+        .set_object_type(torch.nn.ReLU, qconfig)
+
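+For contrast, a QConfig that violates these constraints would simply be
+ignored during prepare, and the corresponding patterns would not be
+quantized. A hypothetical counter-example (not used in this tutorial):
+
+.. code:: ipython3
+
+    # Hypothetical counter-example: without an explicit eps, MinMaxObserver
+    # keeps the default eps of torch.finfo(torch.float32).eps, which falls
+    # below this backend's scale_min_lower_bound of 2 ** -12, so this
+    # QConfig would be ignored
+    bad_observer = MinMaxObserver.with_args(quant_min=0, quant_max=127)
+    bad_qconfig = QConfig(activation=bad_observer, weight=default_weight_observer)
+
+This is also why the default QConfigMapping fails to quantize anything
+in the final experiment below.
+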
+6. Quantize the model through prepare and convert
+--------------------------------------------------
+
+Finally, we quantize the model by passing the BackendConfig we defined
+into prepare and convert. This produces a quantized linear module and
+a fused quantized conv-relu module.
+
 .. code:: ipython3
 
-    # ====================
-    # Example user model
-    # ====================
-
     class MyModel(torch.nn.Module):
         def __init__(self, use_bn: bool):
             super().__init__()
@@ -280,7 +182,7 @@ both linear and conv-relu are quantized.
             self.relu = torch.nn.ReLU()
             self.sigmoid = torch.nn.Sigmoid()
             self.use_bn = use_bn
-        
+
         def forward(self, x):
             x = self.linear(x)
             x = self.conv(x)
@@ -290,31 +192,6 @@ both linear and conv-relu are quantized.
             x = self.sigmoid(x)
             return x
 
-.. code:: ipython3
-
-    # =======================
-    # Custom QConfigMapping
-    # =======================
-
-    # Define a QConfig that satisfies the constraints specified in DTypeConfig
-    # Note: Here we use a quant_max of 127, but this could be up to 255 (see `quint8_with_constraints`)
-    activation_observer = MinMaxObserver.with_args(quant_min=0, quant_max=127, eps=2 ** -12)
-    qconfig = QConfig(activation=activation_observer, weight=default_weight_observer)
-
-    # Note: All individual items of a fused pattern, e.g. Conv2d and ReLU in
-    # (Conv2d, ReLU), must have the same QConfig
-    qconfig_mapping = QConfigMapping() \
-        .set_object_type(torch.nn.Linear, qconfig) \
-        .set_object_type(torch.nn.Conv2d, qconfig) \
-        .set_object_type(torch.nn.BatchNorm2d, qconfig) \
-        .set_object_type(torch.nn.ReLU, qconfig)
-
-.. code:: ipython3
-
-    # =====================
-    # Prepare and Convert
-    # =====================
-
     example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
     model = MyModel(use_bn=False)
     prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
@@ -341,17 +218,16 @@ both linear and conv-relu are quantized.
         sigmoid = self.sigmoid(dequantize_2); dequantize_2 = None
         return sigmoid
 
+7. Experiment with faulty BackendConfig setups
+-----------------------------------------------
+
 As an experiment, here we modify the model to use conv-bn-relu instead
 of conv-relu, but use the same BackendConfig, which doesn't know how to
 quantize conv-bn-relu. As a result, only linear is quantized, but
 conv-bn-relu is neither fused nor quantized.
 
 .. code:: ipython3
 
-    # ================================================
-    # Prepare and Convert (only linear is quantized)
-    # ================================================
-
+    # Only linear is quantized, since there's no rule for fusing conv-bn-relu
     example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
     model = MyModel(use_bn=True)
     prepared = prepare_fx(model, qconfig_mapping, example_inputs, backend_config=backend_config)
@@ -387,11 +263,7 @@ doesn't satisfy the dtype constraints specified in the backend. As a
 result, nothing is quantized since the QConfigs are simply ignored.
 
 .. code:: ipython3
 
-    # ============================================
-    # Prepare and Convert (nothing is quantized)
-    # ============================================
-
+    # Nothing is quantized or fused, since backend constraints are not satisfied
     example_inputs = (torch.rand(1, 3, 10, 10, dtype=torch.float),)
     model = MyModel(use_bn=True)
     prepared = prepare_fx(model, get_default_qconfig_mapping(), example_inputs, backend_config=backend_config)