- `PyTorch 2 Export Post Training Quantization <https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html>`_
- `PyTorch 2 Export Quantization-Aware Training tutorial <https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html>`_
- `TorchInductor and torch.compile concepts in PyTorch <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
- `Inductor C++ Wrapper concepts <https://pytorch.org/tutorials/prototype/inductor_cpp_wrapper_tutorial.html>`_

Introduction
------------

This tutorial introduces the steps for utilizing the PyTorch 2 Export Quantization flow to generate a quantized model customized for the x86 inductor backend and explains how to lower the quantized model into the inductor.

The new quantization 2 flow uses the PT2 Export to capture the model into a graph and perform quantization transformations on top of the ATen graph. This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.

TorchInductor is the new compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.

This flow of quantization 2 with Inductor mainly includes three steps (a compact end-to-end sketch follows the list):

- Step 1: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the Quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers, performing the prepared model's calibration or quantization-aware training, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into inductor with the API ``torch.compile``.

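Putting the steps together, a compact end-to-end sketch could look like the following. It assumes a torchvision ResNet-18 with dummy calibration data, and a PyTorch build that provides ``capture_pre_autograd_graph`` (newer releases expose ``torch.export.export_for_training`` for the same capture step)::

    import torch
    import torchvision.models as models
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
    from torch._export import capture_pre_autograd_graph
    from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

    model = models.resnet18(pretrained=True).eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    # Step 1: capture the FX Graph from the eager model
    exported_model = capture_pre_autograd_graph(model, example_inputs)

    # Step 2: define the backend-specific quantizer, then prepare, calibrate, convert
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)  # calibration pass over representative data
    quantized_model = convert_pt2e(prepared_model)

    # Step 3: lower the quantized model into Inductor
    with torch.no_grad():
        optimized_model = torch.compile(quantized_model)
        optimized_model(*example_inputs)
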
The high-level architecture of this flow could look like this:

We will start by performing the necessary imports, capturing the FX Graph from the eager model::

    import torch
    import torchvision.models as models

    model_name = "resnet18"  # the model used in this tutorial
    model = models.__dict__[model_name](pretrained=True)

    # Set the model to eval mode
    # Only apply it for post-training static quantization
    # Skip this step for quantization-aware training
    model = model.eval()

    # Create the data, using the dummy data here as an example
    example_inputs = (torch.randn(1, 3, 224, 224),)

Next, we will capture the FX Module to be quantized.
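
A minimal sketch of this capture step, under the same assumption about the capture API as in the end-to-end sketch above::

    from torch._export import capture_pre_autograd_graph

    # Capture the FX Graph (an ATen-level graph module) from the eager model
    exported_model = capture_pre_autograd_graph(model, example_inputs)
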
After we capture the FX Module to be quantized, we will import the Backend Quantizer for X86 CPU and configure how to quantize the model.
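
A minimal sketch of this configuration, applying the default recipe globally (for quantization-aware training, recent versions also accept ``is_qat=True`` in the config helper)::

    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
    from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

    # Create the X86 CPU quantizer and set the default quantization
    # configuration for all quantizable operators in the model
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
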
The default quantization configuration in ``X86InductorQuantizer`` uses 8 bits for both activations and weights. When the Vector Neural Network Instruction (VNNI) is not available, the oneDNN backend silently chooses kernels that assume `multiplications are 7-bit x 8-bit <https://oneapi-src.github.io/oneDNN/dev_guide_int8_computations.html#inputs-of-mixed-type-u8-and-s8>`_. In other words, potential numeric saturation and accuracy issues may happen when running on a CPU without Vector Neural Network Instruction.

After we import the backend-specific Quantizer, we will prepare the model for post-training quantization or quantization-aware training.

For post-training static quantization, ``prepare_pt2e`` folds BatchNorm operators into preceding Conv2d operators and inserts observers in appropriate places in the model.
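
A sketch of these steps for the post-training case (quantization-aware training would instead use ``prepare_qat_pt2e`` and a training loop in place of calibration)::

    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

    # Fold BatchNorm into preceding Conv2d and insert observers
    prepared_model = prepare_pt2e(exported_model, quantizer)

    # Calibration: run representative data through the prepared model so the
    # observers can record activation statistics (dummy data here for brevity)
    with torch.no_grad():
        prepared_model(*example_inputs)

    # Convert the calibrated model into the quantized model
    quantized_model = convert_pt2e(prepared_model)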