PyTorch 2 Export Quantization with X86 Backend through Inductor
==================================================================

**Author**: `Leslie Fang <https://github.com/leslie-fang-intel>`_, `Weiwen Xia <https://github.com/Xia-Weiwen>`_, `Jiong Gong <https://github.com/jgong5>`_, `Jerry Zhang <https://github.com/jerryzh168>`_

Prerequisites
---------------

- `PyTorch 2 Export Post Training Quantization <https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html>`_
- `PyTorch 2 Export Quantization-Aware Training <https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html>`_
- `TorchInductor and torch.compile concepts in PyTorch <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
- `Inductor C++ Wrapper concepts <https://pytorch.org/tutorials/prototype/inductor_cpp_wrapper_tutorial.html>`_

Introduction
--------------

This tutorial introduces the steps for utilizing the PyTorch 2 Export Quantization flow to generate a quantized model customized
for the x86 inductor backend and explains how to lower the quantized model into Inductor.

The PyTorch 2 Export Quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
TorchInductor is the new compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.

This PyTorch 2 Export Quantization flow with Inductor mainly includes three steps:

- Step 1: Capture the FX Graph from the eager model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
- Step 2: Apply the quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
  performing the prepared model's calibration or quantization-aware training, and converting the prepared model into the quantized model.
- Step 3: Lower the quantized model into Inductor with the ``torch.compile`` API.

At a high level, the architecture of this flow is: the eager model and example inputs are captured into an FX Graph of ATen ops by ``torch.export``; the graph is prepared, calibrated or trained, and converted with ``X86InductorQuantizer``; and the resulting quantized model is lowered into Inductor with ``torch.compile``.

This combination brings the flexibility and productivity of the new quantization frontend together with outstanding out-of-box performance from the compiler backend. Especially on x86 CPUs that support the
`advanced-matrix-extensions <https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html>`_ feature, quantization can further boost the models' performance.

Post Training Quantization
----------------------------

Now, we will walk you through a step-by-step tutorial of how to use this flow with the `torchvision resnet18 model <https://download.pytorch.org/models/resnet18-f37072fd.pth>`_
for post training quantization.

1. Capture FX Graph
^^^^^^^^^^^^^^^^^^^^^

We will start by performing the necessary imports and capturing the FX Graph from the eager module.

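A minimal sketch of this step, assuming torchvision is installed; it mirrors the ``capture_pre_autograd_graph`` call used in the QAT example later in this tutorial, and the batch size and input shape are illustrative only:

.. code:: python

    import torch
    import torchvision.models as models
    from torch._export import capture_pre_autograd_graph

    # Load the eager-mode FP32 model and build example inputs for export.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    with torch.no_grad():
        # Capture the model into an FX Graph of ATen ops.
        exported_model = capture_pre_autograd_graph(model, example_inputs)
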
Next, we will have the FX Module to be quantized.

2. Apply Quantization
^^^^^^^^^^^^^^^^^^^^^^^

After we capture the FX Module to be quantized, we will import the Backend Quantizer for X86 CPU and configure how to
quantize the model.

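A sketch of this step, reusing ``exported_model`` and the input shape from the capture sketch above; the calibration loop with random tensors is illustrative only, and real calibration should use representative data:

.. code:: python

    import torch
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
    from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e

    # Configure the X86 CPU backend quantizer with its default static quantization config.
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

    # Insert observers into the captured FX Graph.
    prepared_model = prepare_pt2e(exported_model, quantizer)

    # Calibrate the observers; replace the random tensors with real samples.
    with torch.no_grad():
        for _ in range(10):
            prepared_model(torch.randn(1, 3, 224, 224))

    # Convert the calibrated model into a quantized model.
    converted_model = convert_pt2e(prepared_model)
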
After these steps, we finished running the quantization flow, and we will get the quantized model.

3. Lower into Inductor
^^^^^^^^^^^^^^^^^^^^^^^^

After we get the quantized model, we will further lower it to the Inductor backend. The default Inductor wrapper
generates Python code to invoke both generated kernels and external kernels. Additionally, Inductor supports
C++ wrapper code generation, which invokes the kernels from pure C++ and further reduces Python overhead (see the Inductor C++ Wrapper tutorial linked in the prerequisites).

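Continuing the sketch above, the lowering step could look roughly like this; it reuses ``converted_model`` and ``example_inputs`` from the earlier sketches and, as noted for the QAT example below, assumes the script is run with ``TORCHINDUCTOR_FREEZING=1``:

.. code:: python

    import torch

    # Move the quantized model to eval mode before inference.
    torch.ao.quantization.move_exported_model_to_eval(converted_model)

    with torch.no_grad():
        # Lower the quantized model into Inductor; kernels are generated and
        # compiled on the first call with real inputs.
        optimized_model = torch.compile(converted_model)
        _ = optimized_model(*example_inputs)
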
With the PyTorch 2.1 release, all CNN models from the TorchBench test suite have been measured with this quantization flow. Please refer
to `this document <https://dev-discuss.pytorch.org/t/torchinductor-update-6-cpu-backend-performance-update-and-new-features-in-pytorch-2-1/1514#int8-inference-with-post-training-static-quantization-3>`_
for detailed benchmark numbers.

Quantization Aware Training
-----------------------------

PyTorch 2 Export Quantization-Aware Training (QAT) is now supported on X86 CPU using ``X86InductorQuantizer``,
followed by lowering the quantized model into Inductor.
For a more in-depth understanding of PT2 Export Quantization-Aware Training,
we recommend referring to the dedicated `PyTorch 2 Export Quantization-Aware Training <https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html>`_ tutorial.

The PyTorch 2 Export QAT flow is largely similar to the PTQ flow:

.. code:: python

    import torch
    from torch._export import capture_pre_autograd_graph
    from torch.ao.quantization.quantize_pt2e import (
        prepare_qat_pt2e,
        convert_pt2e,
    )
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
    from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(1024, 1000)

        def forward(self, x):
            return self.linear(x)

    example_inputs = (torch.randn(1, 1024),)
    m = M()

    # Step 1. program capture
    # NOTE: this API will be updated to the torch.export API in the future, but the captured
    # result should mostly stay the same
    exported_model = capture_pre_autograd_graph(m, example_inputs)
    # we get a model with aten ops

    # Step 2. quantization-aware training
    # Use the Backend Quantizer for X86 CPU
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_qat=True))
    prepared_model = prepare_qat_pt2e(exported_model, quantizer)

    # training loop omitted

    converted_model = convert_pt2e(prepared_model)
    # we have a model with aten ops doing integer computations when possible

    # move the quantized model to eval mode, equivalent to `m.eval()`
    torch.ao.quantization.move_exported_model_to_eval(converted_model)

    # Lower the model into Inductor
    with torch.no_grad():
        optimized_model = torch.compile(converted_model)
        _ = optimized_model(*example_inputs)

Please note that the Inductor ``freeze`` feature is not enabled by default.
To use this feature, you need to run the example code with ``TORCHINDUCTOR_FREEZING=1``.

For example:

::

    TORCHINDUCTOR_FREEZING=1 python example_x86inductorquantizer_qat.py

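If you prefer enabling freezing from Python rather than through the environment, recent PyTorch builds expose a corresponding Inductor config flag; treat this as an assumption and verify it against your installed version:

.. code:: python

    # Assumption: ``torch._inductor.config.freezing`` mirrors TORCHINDUCTOR_FREEZING=1;
    # check that the flag exists in your PyTorch version before relying on it.
    import torch._inductor.config as inductor_config

    inductor_config.freezing = True
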
Conclusion
------------

With this tutorial, we introduced how to use Inductor with an X86 CPU in PyTorch 2 Export Quantization. Users can learn
how to use ``X86InductorQuantizer`` to quantize a model and lower it into Inductor on X86 CPU devices.