Commit a28d763 (parent: e7c86fd)

[pt2e][quant] Update some docs for pt2 export quantization

Summary: .
Test Plan: CI generated docs
Reviewers:
Subscribers:
Tasks:
Tags:

2 files changed (+63, -16 lines)
prototype_source/pt2e_quant_ptq_static.rst

Lines changed: 34 additions & 15 deletions
@@ -430,23 +430,42 @@ Convert the Calibrated Model to a Quantized Model
     print(quantized_model)

 .. note::
-   The model produced here also has some improvements over the previous
-   `representations <https://github.com/pytorch/rfcs/blob/master/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md>`_ in FX graph mode quantization. Previously, all quantized operators were represented as ``dequantize -> fp32_op -> quantize``; in the new flow, we choose to represent some of the operators with integer computation so that it is closer to the computation that happens in hardware.
-   For example, here is how we plan to represent a quantized linear operator:
+   At this step, we currently have two representations that you can choose from, but the exact representation
+   we offer in the long term might change based on feedback from users.

-   .. code-block:: python
+   * Q/DQ Representation (default)
+     As in the previous `representations <https://github.com/pytorch/rfcs/blob/master/RFC-0019-Extending-PyTorch-Quantization-to-Custom-Backends.md>`_ documentation, all quantized operators are represented as ``dequantize -> fp32_op -> quantize``.

-      def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point, bias_int32, bias_scale, bias_zero_point, output_scale, output_zero_point):
-          x_int16 = x_int8.to(torch.int16)
-          weight_int16 = weight_int8.to(torch.int16)
-          acc_int32 = torch.ops.out_dtype(torch.mm, torch.int32, (x_int16 - x_zero_point), (weight_int16 - weight_zero_point))
-          acc_rescaled_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Scalar, torch.int32, acc_int32, x_scale * weight_scale / output_scale)
-          bias_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Scalar, torch.int32, bias_int32 - bias_zero_point, bias_scale / output_scale)
-          out_int8 = torch.ops.aten.clamp(acc_rescaled_int32 + bias_int32 + output_zero_point, qmin, qmax).to(torch.int8)
-          return out_int8
-
-   For more details, please see:
-   `Quantized Model Representation <https://docs.google.com/document/d/17h-OEtD4o_hoVuPqUFsdm5uo7psiNMY8ThN03F9ZZwg/edit>`_.
+   .. code-block:: python
+
+      def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point, bias_fp32, output_scale, output_zero_point):
+          x_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor(
+              x_int8, x_scale, x_zero_point, x_quant_min, x_quant_max, torch.int8)
+          weight_fp32 = torch.ops.quantized_decomposed.dequantize_per_tensor(
+              weight_int8, weight_scale, weight_zero_point, weight_quant_min, weight_quant_max, torch.int8)
+          weight_permuted = torch.ops.aten.permute_copy.default(weight_fp32, [1, 0])
+          out_fp32 = torch.ops.aten.addmm.default(bias_fp32, x_fp32, weight_permuted)
+          out_int8 = torch.ops.quantized_decomposed.quantize_per_tensor(
+              out_fp32, output_scale, output_zero_point, out_quant_min, out_quant_max, torch.int8)
+          return out_int8
+
+   * Reference Quantized Model Representation (WIP, expected to be ready at the end of August): we have a special representation for selected ops (e.g. quantized linear); other ops are represented as ``dq -> float32_op -> q``, and q/dq are decomposed into more primitive operators.
+
+     You can get this representation with ``convert_pt2e(..., use_reference_representation=True)``.
+
+   .. code-block:: python
+
+      # Reference quantized pattern for quantized linear
+      def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point, bias_fp32, output_scale, output_zero_point):
+          x_int16 = x_int8.to(torch.int16)
+          weight_int16 = weight_int8.to(torch.int16)
+          acc_int32 = torch.ops.out_dtype(torch.mm, torch.int32, (x_int16 - x_zero_point), (weight_int16 - weight_zero_point))
+          acc_rescaled_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Scalar, torch.int32, acc_int32, x_scale * weight_scale / output_scale)
+          bias_scale = x_scale * weight_scale
+          bias_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Tensor, torch.int32, bias_fp32, bias_scale / output_scale)
+          out_int8 = torch.ops.aten.clamp(acc_rescaled_int32 + bias_int32 + output_zero_point, qmin, qmax).to(torch.int8)
+          return out_int8
+
+   Please see `here <https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/pt2e/representation/rewrite.py>`_ for the most up-to-date reference representations.


 Checking Model Size and Accuracy Evaluation
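
For context on the note changed above, here is a minimal sketch of where ``convert_pt2e`` sits in the PT2 export quantization flow and how the reference-representation flag mentioned in this diff would be selected. The toy module, the ``XNNPACKQuantizer`` choice, and the ``capture_pre_autograd_graph`` capture step are illustrative assumptions, not part of this commit; the capture entry point in particular has moved across PyTorch releases.

.. code-block:: python

   import torch
   # capture API assumed for illustration; substitute your version's export entry point
   from torch._export import capture_pre_autograd_graph
   from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
   from torch.ao.quantization.quantizer.xnnpack_quantizer import (
       XNNPACKQuantizer,
       get_symmetric_quantization_config,
   )

   class ToyLinear(torch.nn.Module):
       def __init__(self):
           super().__init__()
           self.linear = torch.nn.Linear(16, 8)

       def forward(self, x):
           return self.linear(x)

   example_inputs = (torch.randn(2, 16),)
   m = ToyLinear().eval()

   # Capture an ATen-level graph of the model.
   m = capture_pre_autograd_graph(m, example_inputs)

   # Annotate with a quantizer and insert observers.
   quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
   m = prepare_pt2e(m, quantizer)

   # Calibrate with representative data.
   m(*example_inputs)

   # Default Q/DQ representation.
   quantized = convert_pt2e(m)

   # Reference representation (WIP flag named in the diff above).
   # quantized = convert_pt2e(m, use_reference_representation=True)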

prototype_source/pt2e_quantizer.rst

Lines changed: 29 additions & 1 deletion
@@ -9,13 +9,15 @@ Prerequisites:
 ^^^^^^^^^^^^^^^^

 Required:
+
 - `Torchdynamo concepts in PyTorch <https://pytorch.org/docs/stable/dynamo/index.html>`__

 - `Quantization concepts in PyTorch <https://pytorch.org/docs/master/quantization.html#quantization-api-summary>`__

 - `(prototype) PyTorch 2.0 Export Post Training Static Quantization <https://pytorch.org/tutorials/prototype/pt2e_quant_ptq_static.html>`__

 Optional:
+
 - `FX Graph Mode post training static quantization <https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_static.html>`__

 - `BackendConfig in PyTorch Quantization FX Graph Mode <https://pytorch.org/tutorials/prototype/backend_config_tutorial.html?highlight=backend>`__
@@ -141,7 +143,33 @@ parameters can be shared among some tensors explicitly. Two typical use cases ar

 ``SharedQuantizationSpec`` is designed for this use case to annotate tensors whose quantization
 parameters are shared with other tensors. Input of ``SharedQuantizationSpec`` is an ``EdgeOrNode`` object which
-can be an input edge or an output value.
+can be an input edge or an output value.
+
+.. note::
+
+   * Sharing is transitive
+
+     Some tensors might effectively end up using a shared quantization spec because (1) two nodes/edges are
+     configured to use ``SharedQuantizationSpec`` and (2) there is existing sharing for some of the nodes.
+
+     For example, let's say we have two conv nodes, conv1 and conv2, and both of them feed into a cat
+     node: ``cat([conv1_out, conv2_out], ...)``. Let's say the outputs of conv1 and conv2 and the first input of cat are configured
+     with the same ``QuantizationSpec``, and the second input of cat is configured to use ``SharedQuantizationSpec``
+     with the first input::
+
+       conv1_out: qspec1(dtype=torch.int8, ...)
+       conv2_out: qspec1(dtype=torch.int8, ...)
+       cat_input0: qspec1(dtype=torch.int8, ...)
+       cat_input1: SharedQuantizationSpec((conv1, cat))  # conv1 node is the first input of cat
+
+     First of all, the output of conv1 implicitly shares quantization parameters (and the observer object)
+     with the first input of cat, and the same holds for the output of conv2 and the second input of cat.
+     So since the user configures the two inputs of cat to share quantization parameters, by transitivity,
+     conv2_out and conv1_out will also share quantization parameters. In the observed graph, you
+     will see::
+
+       conv1 -> obs -> cat
+       conv2 -> obs /
+
+     and both ``obs`` will be the same observer instance.
+

 - Input edge is the connection between input node and the node consuming the input,
   so it's a ``Tuple[Node, Node]``.
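
To make the transitive-sharing note added above concrete, here is a minimal sketch of how the cat example could be annotated inside a custom quantizer's ``annotate`` step. The helper name ``annotate_cat_example``, the pattern matching that would find the nodes, and the ``HistogramObserver`` choice are illustrative assumptions, not part of this commit; the ``QuantizationSpec``, ``SharedQuantizationSpec``, and ``QuantizationAnnotation`` classes are the ones the tutorial describes.

.. code-block:: python

   import torch
   from torch.ao.quantization.observer import HistogramObserver
   from torch.ao.quantization.quantizer import (
       QuantizationAnnotation,
       QuantizationSpec,
       SharedQuantizationSpec,
   )

   def annotate_cat_example(conv1: torch.fx.Node, conv2: torch.fx.Node, cat: torch.fx.Node) -> None:
       # conv1, conv2, cat are fx Nodes located by pattern matching (elided here).
       qspec1 = QuantizationSpec(
           dtype=torch.int8,
           quant_min=-128,
           quant_max=127,
           qscheme=torch.per_tensor_affine,
           observer_or_fake_quant_ctr=HistogramObserver,
       )

       # conv1_out and conv2_out each get their own qspec1-based observer for now.
       conv1.meta["quantization_annotation"] = QuantizationAnnotation(
           output_qspec=qspec1, _annotated=True
       )
       conv2.meta["quantization_annotation"] = QuantizationAnnotation(
           output_qspec=qspec1, _annotated=True
       )

       # cat: first input uses qspec1, second input shares with the (conv1, cat) edge,
       # so by transitivity conv1_out and conv2_out end up sharing one observer.
       cat.meta["quantization_annotation"] = QuantizationAnnotation(
           input_qspec_map={
               conv1: qspec1,
               conv2: SharedQuantizationSpec((conv1, cat)),
           },
           _annotated=True,
       )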
