From dcbc93c56e9591c6b2f1f1ff8963571894523bd5 Mon Sep 17 00:00:00 2001 From: Svetlana Karslioglu Date: Mon, 31 Jul 2023 09:55:52 -0700 Subject: [PATCH 1/3] Editorial changes to the pt2e tutorial --- prototype_source/pt2e_quant_ptq_static.rst | 257 +++++++++++++++------ 1 file changed, 181 insertions(+), 76 deletions(-) diff --git a/prototype_source/pt2e_quant_ptq_static.rst b/prototype_source/pt2e_quant_ptq_static.rst index e2ee62ec0d9..56829857461 100644 --- a/prototype_source/pt2e_quant_ptq_static.rst +++ b/prototype_source/pt2e_quant_ptq_static.rst @@ -2,12 +2,19 @@ ================================================================ **Author**: `Jerry Zhang `_ -This tutorial introduces the steps to do post training static quantization in graph mode based on -`torch._export.export `_. Compared to `FX Graph Mode Quantization `_, this flow is expected to have significantly higher model coverage (`88% on 14K models `_), better programmability, and a simplified UX. +This tutorial introduces the steps to do post training static quantization in +graph mode based on +`torch._export.export `_. Compared +to `FX Graph Mode Quantization `_, +this flow is expected to have significantly higher model coverage +(`88% on 14K models `_), +better programmability, and a simplified UX. -Exportable by `torch._export.export` is a prerequisite to use the flow, you can find what are the constructs that's supported in `Export DB `_. +Exportable by `torch._export.export` is a prerequisite to use the flow, you can +find what are the constructs that's supported in `Export DB `_. -The high level architecture of quantization 2.0 with quantizer could look like this: +The high level architecture of quantization 2.0 with quantizer could look like +this: :: @@ -38,7 +45,7 @@ The high level architecture of quantization 2.0 with quantizer could look like t | Lowering | —-------------------------------------------------------- | - Executorch, or Inductor, or + Executorch, or Inductor, or The PyTorch 2.0 export quantization API looks like this: @@ -73,7 +80,8 @@ The PyTorch 2.0 export quantization API looks like this: XNNPACKQuantizer, get_symmetric_quantization_config, ) - # backend developer will write their own Quantizer and expose methods to allow users to express how they + # backend developer will write their own Quantizer and expose methods to allow + # users to express how they # want the model to be quantized quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config()) m = prepare_pt2e(m, quantizer) @@ -83,36 +91,92 @@ The PyTorch 2.0 export quantization API looks like this: m = convert_pt2e(m) # we have a model with aten ops doing integer computations when possible - -1. Motivation of PyTorch 2.0 Export Quantization ------------------------------------------------- -In PyTorch versions prior to 2.0, we have FX Graph Mode Quantization that uses `QConfigMapping `_ and `BackendConfig `_ for customizations. ``QConfigMapping`` allows modeling users to specify how they want their model to be quantized, ``BackendConfig`` allows backend developers to specify the supported ways of quantization in their backend. While that API covers most use cases relatively well, it is not fully extensible. There are two main limitations for current API: +Motivation of PyTorch 2.0 Export Quantization +--------------------------------------------- -1. 
Limitation around expressing quantization intentions for complicated operator patterns (how an operator pattern should be observed/quantized) using existing objects: ``QConfig`` and ``QConfigMapping``. -2. Limited support on how user can express their intention of how they want their model to be quantized. For example, if users want to quantize the every other linear in the model, or the quantization behavior has some dependency on the actual shape of the Tensor (for example, only observe/quantize inputs and outputs when the linear has a 3D input), backend developer or modeling users need to change the core quantization api/flow. +In PyTorch versions prior to 2.0, we have FX Graph Mode Quantization that uses +`QConfigMapping `_ +and `BackendConfig `_ +for customizations. ``QConfigMapping`` allows modeling users to specify how +they want their model to be quantized, ``BackendConfig`` allows backend +developers to specify the supported ways of quantization in their backend. While +that API covers most use cases relatively well, it is not fully extensible. +There are two main limitations for the current API: -A few improvements could make the existing flow better: -3. We use ``QConfigMapping`` and ``BackendConfig`` as separate objects, ``QConfigMapping`` describes user’s intention of how they want their model to be quantized, ``BackendConfig`` describes what kind of quantization a backend support. ``BackendConfig`` is backend specific, but ``QConfigMapping`` is not, and user can provide a ``QConfigMapping`` that is incompatible with a specific ``BackendConfig``, this is not a great UX. Ideally we can structure this better by making both configuration (``QConfigMapping``) and quantization capability (``BackendConfig``) backend specific, so there will be less confusion about incompatibilities. - -4. In ``QConfig`` we are exposing observer/fake_quant observer classes as an object for user to configure quantization, this increases the things that user may need to care about, e.g. not only the dtype but also how the observation should happen, these could potentially be hidden from user so that the user interface is simpler. - -Here is a summary of the benefits of the new API: +* Limitation around expressing quantization intentions for complicated operator + patterns (how an operator pattern should be observed/quantized) using existing + objects: ``QConfig`` and ``QConfigMapping``. -- Programmability (addressing 1. and 2.): When a user’s quantization needs are not covered by available quantizers, users can build their own quantizer and compose it with other quantizers as mentioned above. -- Simplified UX (addressing 3.): Provides a single instance with which both backend and users interact. Thus you no longer have 1) user facing quantization config mapping to map users intent and 2) a separate quantization config that backends interact with to configure what backend support. We will still have a method for users to query what is supported in a quantizer. With a single instance, composing different quantization capabilities also becomes more natural than previously. For example XNNPACK does not support embedding_byte and we have native support for this in ExecuTorch. Thus if we had ExecuTorchQuantizer that only quantized embedding_byte, then it can be composed with XNNPACKQuantizer. 
(Previously this will be concatenating the two ``BackendConfig`` together and since options in ``QConfigMapping`` are not backend specific, user also need to figure out how to specify the configurations by themselves that matches the quantization capabilities of the combined backend. with a single quantizer instance, we can compose two quantizers and query the composed quantizer for capabilities, which makes it less error prone and cleaner, e.g. composed_quantizer.quantization_capabilities()) -- Separation of Concerns (addressing 4.): As we design the quantizer API, we also decouple specification of quantization, as expressed in terms of ``dtype``, min/max (# of bits), symmetric, and so on, from the observer concept. Currently, the observer captures both quantization specification and how to observe (Histogram vs MinMax observer). Modeling users are freed from interacting with observer and fake quant objects with this change. +* Limited support on how user can express their intention of how they want + their model to be quantized. For example, if users want to quantize the every + other linear in the model, or the quantization behavior has some dependency on + the actual shape of the Tensor (for example, only observe/quantize inputs + and outputs when the linear has a 3D input), backend developer or modeling + users need to change the core quantization API/flow. -2. Define Helper Functions and Prepare Dataset ----------------------------------------------- +A few improvements could make the existing flow better: -We’ll start by doing the necessary imports, defining some helper functions and prepare the data. -These steps are identitcal to `Static Quantization with Eager Mode in PyTorch `_. +* We use ``QConfigMapping`` and ``BackendConfig`` as separate objects, + ``QConfigMapping`` describes user’s intention of how they want their model to + be quantized, ``BackendConfig`` describes what kind of quantization a backend + supports. ``BackendConfig`` is backend-specific, but ``QConfigMapping`` is not, + and the user can provide a ``QConfigMapping`` that is incompatible with a specific + ``BackendConfig``, this is not a great UX. Ideally, we can structure this better + by making both configuration (``QConfigMapping``) and quantization capability + (``BackendConfig``) backend-specific, so there will be less confusion about + incompatibilities. +* In ``QConfig`` we are exposing observer/ ``fake_quant`` observer classes as an + object for the user to configure quantization, this increases the things that + the user may need to care about. For example, not only the ``dtype`` but also + how the observation should happen, these could potentially be hidden from the + user so that the user flow is simpler. -To run the code in this tutorial using the entire ImageNet dataset, first download Imagenet by following the instructions at here `ImageNet Data `_. Unzip the downloaded file into the ``data_path`` folder. +Here is a summary of the benefits of the new API: -Download the `torchvision resnet18 model `_ and rename it to -``data/resnet18_pretrained_float.pth``. +- **Programmability** (addressing 1. and 2.): When a user’s quantization needs + are not covered by available quantizers, users can build their own quantizer and + compose it with other quantizers as mentioned above. +- **Simplified UX** (addressing 3.): Provides a single instance with which both + backend and users interact. 
Thus you no longer have the user facing quantization + config mapping to map users intent and a separate quantization config that + backends interact with to configure what backend support. We will still have a + method for users to query what is supported in a quantizer. With a single + instance, composing different quantization capabilities also becomes more + natural than previously. + + For example XNNPACK does not support ``embedding_byte`` + and we have natively support for this in ExecuTorch. Thus, if we had + ``ExecuTorchQuantizer`` that only quantized ``embedding_byte``, then it can be + composed with ``XNNPACKQuantizer``. (Previously, this used to be concatenating the + two ``BackendConfig`` together and since options in ``QConfigMapping`` are not + backend specific, user also need to figure out how to specify the configurations + by themselves that matches the quantization capabilities of the combined + backend. With a single quantizer instance, we can compose two quantizers and + query the composed quantizer for capabilities, which makes it less error prone + and cleaner, for example, ``composed_quantizer.quantization_capabilities())``. + +- **Separation of concerns** (addressing 4.): As we design the quantizer API, we + also decouple specification of quantization, as expressed in terms of ``dtype``, + min/max (# of bits), symmetric, and so on, from the observer concept. + Currently, the observer captures both quantization specification and how to + observe (Histogram vs MinMax observer). Modeling users are freed from + interacting with observer and fake quant objects with this change. + + Define Helper Functions and Prepare Dataset +------------------------------------------- + +We’ll start by doing the necessary imports, defining some helper functions and +prepare the data. These steps are identitcal to +`Static Quantization with Eager Mode in PyTorch `_. + +To run the code in this tutorial using the entire ImageNet dataset, first +download Imagenet by following the instructions at here +`ImageNet Data `_. Unzip the downloaded file +into the ``data_path`` folder. + +Download the `torchvision resnet18 model `_ +and rename it to ``data/resnet18_pretrained_float.pth``. .. code:: python @@ -173,7 +237,10 @@ Download the `torchvision resnet18 model `_ that talks about how to write a new ``Quantizer``. +.. note:: + + Check out our + `tutorial `_ + that describes how to write a new ``Quantizer``. -6. Prepare the Model for Post Training Static Quantization +Prepare the Model for Post Training Static Quantization ---------------------------------------------------------- -``prepare_pt2e`` folds ``BatchNorm`` operators into preceding ``Conv2d`` operators, and inserts observers -in appropriate places in the model. +``prepare_pt2e`` folds ``BatchNorm`` operators into preceding ``Conv2d`` +operators, and inserts observers in appropriate places in the model. -.. code:: python +.. code-block:: python prepared_model = prepare_pt2e(exported_model, quantizer) print(prepared_model.graph) -7. Calibration +Calibration -------------- + The calibration function is run after the observers are inserted in the model. -The purpose for calibration is to run through some sample examples that is representative of the workload -(for example a sample of the training data set) so that the observers in the model are able to observe -the statistics of the Tensors and we can later use this information to calculate quantization parameters. 
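As a small illustrative addition (not part of the original patch): instead of sweeping the whole evaluation set, calibration can be capped at a fixed number of batches, which is often enough for the observers to settle. The sketch below assumes the ``prepared_model`` and ``data_loader_test`` objects defined elsewhere in this tutorial.

.. code-block:: python

    import itertools

    import torch


    def calibrate_n_batches(model, data_loader, num_batches=32):
        # Feed a limited number of representative batches through the prepared
        # model so the inserted observers can record activation statistics.
        model.eval()
        with torch.no_grad():
            for image, _ in itertools.islice(data_loader, num_batches):
                model(image)

    # Example usage (assumes the objects defined in this tutorial):
    # calibrate_n_batches(prepared_model, data_loader_test, num_batches=32)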
+The purpose of calibration is to run through some sample examples that are
+representative of the workload (for example, a sample of the training data set)
+so that the observers in the model are able to observe the statistics of the
+Tensors, and we can later use this information to calculate quantization
+parameters.

-.. code:: python
+.. code-block:: python

    def calibrate(model, data_loader):
        model.eval()
@@ -337,21 +418,23 @@ the statistics of the Tensors and we can later use this information to calculate
                model(image)

    calibrate(prepared_model, data_loader_test)  # run calibration on sample data

-8. Convert the Calibrated Model to a Quantized Model
-----------------------------------------------------
+Convert the Calibrated Model to a Quantized Model
+-------------------------------------------------
+
``convert_pt2e`` takes a calibrated model and produces a quantized model.

-.. code:: python
+.. code-block:: python

    quantized_model = convert_pt2e(prepared_model)
    print(quantized_model)

.. note::
-   the model produced here also had some improvement upon the previous `representations `_ in the FX graph mode quantizaiton, previously all quantized operators are represented as ``dequantize -> fp32_op -> qauntize``, in the new flow, we choose to represent some of the operators with integer computation so that it's closer to the computation happens in hardwares.
+   The model produced here also has some improvements over the previous
+   `representations `_ in FX graph mode quantization: previously, all quantized
+   operators were represented as ``dequantize -> fp32_op -> quantize``; in the
+   new flow, we choose to represent some of the operators with integer
+   computation so that it is closer to the computation that happens in hardware.
   For example, here is how we plan to represent a quantized linear operator:
-
+
   .. code-block:: python
-
+
      def quantized_linear(x_int8, x_scale, x_zero_point, weight_int8, weight_scale, weight_zero_point, bias_int32, bias_scale, bias_zero_point, output_scale, output_zero_point):
          x_int16 = x_int8.to(torch.int16)
          weight_int16 = weight_int8.to(torch.int16)
@@ -360,15 +443,17 @@ the statistics of the Tensors and we can later use this information to calculate
          bias_int32 = torch.ops.out_dtype(torch.ops.aten.mul.Scalar, bias_int32 - bias_zero_point, bias_scale / output_scale))
          out_int8 = torch.ops.aten.clamp(acc_rescaled_int32 + bias_int32 + output_zero_point, qmin, qmax).to(torch.int8)
          return out_int8
-
-   For more details, please see: `Quantized Model Representation `_ (TODO: make this a public API doc/issue).
-
-9. Checking Model Size and Accuracy Evaluation
+   For more details, please see:
+   `Quantized Model Representation `_.
+
+
+Checking Model Size and Accuracy Evaluation
----------------------------------------------
+
Now we can compare the size and model accuracy with baseline model.

-.. code:: python
+.. code-block:: python

    # Baseline model size and accuracy
    scripted_float_model_file = "resnet18_scripted.pth"
@@ -382,21 +467,28 @@ Now we can compare the size and model accuracy with baseline model.
    # Quantized model size and accuracy
    print("Size of model after quantization")
    print_size_of_model(quantized_model)
-
+
    top1, top5 = evaluate(quantized_model, criterion, data_loader_test)
    print("[before serilaization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

-Note: we can't do performance evaluation now since the model is not lowered to target device, it's just a representation of quantized computation in aten operators. Each backend should have their tutorial about how to lower to their backend, for example, we'll have separate tutorials on how to do lowering in executorch for models that target edge devices.
+.. note::
+   We can't do performance evaluation now, since the model is not lowered to a
+   target device; it is just a representation of the quantized computation in
+   ATen operators.

-If you want to get better accuracy or performance, try configuring ``quantizer`` in different ways, and each ``quantizer`` will have its own way of configuration, so please consult the documentation for the quantization you are using to learn more about how you can have more control over how to quantize a model.
+If you want to get better accuracy or performance, try configuring
+``quantizer`` in different ways. Each ``quantizer`` has its own way of
+configuration, so please consult the documentation for the quantizer you are
+using to learn more about how you can control how a model is quantized.

-10. Save and Load Quantized Model
+Save and Load Quantized Model
---------------------------------
+
We'll show how to save and load the quantized model.

-.. code:: python
+.. code-block:: python

    # 1. Save state_dict
    pt2e_quantized_model_file_path = saved_model_dir + "resnet18_pt2e_quantized.pth"
@@ -434,16 +526,29 @@ We'll show how to save and load the quantized model.
    top1, top5 = evaluate(loaded_quantized_model, criterion, data_loader_test)
    print("[after serialization/deserialization] Evaluation accuracy on test dataset: %2.2f, %2.2f"%(top1.avg, top5.avg))

-11. Debugging Quantized Model
+Debugging the Quantized Model
-----------------------------
-We have `Numeric Suite `_ that can help with debugging in eager mode and FX graph mode. The new version of Numeric Suite working with PyTorch 2.0 Export models is still in development.
+You can use `Numeric Suite `_
+to help with debugging in eager mode and FX graph mode. The new version of
+Numeric Suite that works with PyTorch 2.0 Export models is still in development.

-12. Lowering and Performance Evaluation
----------------------------------------
+Lowering and Performance Evaluation
+------------------------------------

-The model produced at this point is not the final model that runs on device, it is a reference quantized model that captures the intended quantized computation from user, expressed as aten operators, to get a model that runs in real devices, we'll need to lower the model. For example for models that runs on edge devices, we can lower to executorch.
+The model produced at this point is not the final model that runs on the device;
+it is a reference quantized model that captures the intended quantized
+computation from the user, expressed as ATen operators. To get a model that
+runs on real devices, we'll need to lower the model. For example, for the
+models that run on edge devices, we can lower to ExecuTorch.

-13. Conclusion
+Conclusion
--------------
-In this tutorial, we went through the overall quantization flow in PyTorch 2.0 Export Quantization using ``XNNPACKQuantizer`` and get a quantized model that could be further lowered to a backend that supports inference with XNNPACK backend. To use this for your own backend, please first follow the `tutorial `__ and implement a ``Quantizer`` for your backend, and then quantize the model with that ``Quantizer``.
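As a rough, hedged illustration of the lowering step discussed above, the sketch below assumes the reference quantized model can be handed to ``torch.compile`` for a server-side lowering through TorchInductor; whether this yields speedups depends on the backend support in your PyTorch build, and edge lowering (for example, to ExecuTorch) follows its own flow. It reuses ``quantized_model`` and ``example_inputs`` from the earlier steps.

.. code-block:: python

    import time

    import torch

    # Assumption: `quantized_model` comes from convert_pt2e and `example_inputs`
    # is the tuple of example inputs used during export.
    compiled_model = torch.compile(quantized_model)

    with torch.no_grad():
        # Warm-up runs so compilation cost is not included in the measurement.
        for _ in range(3):
            compiled_model(*example_inputs)
        start = time.perf_counter()
        for _ in range(10):
            compiled_model(*example_inputs)
        avg_ms = (time.perf_counter() - start) / 10 * 1000
        print(f"average latency: {avg_ms:.2f} ms")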
+
+In this tutorial, we went through the overall quantization flow in PyTorch 2.0
+Export Quantization using ``XNNPACKQuantizer`` and got a quantized model that
+could be further lowered to a backend that supports inference with XNNPACK
+backend. To use this for your own backend, please first follow the
+`tutorial `__ and
+implement a ``Quantizer`` for your backend, and then quantize the model with
+that ``Quantizer``.

From 203919b714d658f17cb6bb8529b7f4f12fc915b6 Mon Sep 17 00:00:00 2001
From: Svetlana Karslioglu
Date: Mon, 31 Jul 2023 12:22:39 -0700
Subject: [PATCH 2/3] Fix

---
 prototype_source/pt2e_quant_ptq_static.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/prototype_source/pt2e_quant_ptq_static.rst b/prototype_source/pt2e_quant_ptq_static.rst
index 56829857461..72dd8004c34 100644
--- a/prototype_source/pt2e_quant_ptq_static.rst
+++ b/prototype_source/pt2e_quant_ptq_static.rst
@@ -351,6 +351,7 @@ Export the model with torch.export
 Here is how you can use ``torch.export`` to export the model:

 .. code-block:: python
+
     import torch._dynamo as torchdynamo

     example_inputs = (torch.rand(2, 3, 224, 224),)

From adb01aa1e37dd9e9e1b801239ccf3ff7e42d7621 Mon Sep 17 00:00:00 2001
From: Svetlana Karslioglu
Date: Mon, 31 Jul 2023 13:10:53 -0700
Subject: [PATCH 3/3] Fix

---
 prototype_source/pt2e_quant_ptq_static.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/prototype_source/pt2e_quant_ptq_static.rst b/prototype_source/pt2e_quant_ptq_static.rst
index 72dd8004c34..4e7a7ea82fa 100644
--- a/prototype_source/pt2e_quant_ptq_static.rst
+++ b/prototype_source/pt2e_quant_ptq_static.rst
@@ -163,7 +163,7 @@ Here is a summary of the benefits of the new API:
   observe (Histogram vs MinMax observer). Modeling users are freed from
   interacting with observer and fake quant objects with this change.

- Define Helper Functions and Prepare Dataset
+Define Helper Functions and Prepare Dataset
 -------------------------------------------

 We’ll start by doing the necessary imports, defining some helper functions and
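For reference, here is a compact, self-contained sketch of the whole flow on a toy model, based on the APIs used in this tutorial (``torch._dynamo.export`` with ``aten_graph=True``, ``XNNPACKQuantizer``, ``prepare_pt2e``, and ``convert_pt2e``). The capture entry point and import paths have been moving between PyTorch releases, so treat this as an assumption-laden illustration rather than a canonical recipe.

.. code-block:: python

    import torch
    import torch._dynamo as torchdynamo
    from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
    from torch.ao.quantization.quantizer.xnnpack_quantizer import (
        XNNPACKQuantizer,
        get_symmetric_quantization_config,
    )


    class ToyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(8, 4)

        def forward(self, x):
            return torch.nn.functional.relu(self.linear(x))


    example_inputs = (torch.randn(2, 8),)
    m = ToyModel().eval()

    # 1. Export to an ATen-level graph (the capture API used at the time of this PR).
    m, _ = torchdynamo.export(m, *example_inputs, aten_graph=True)

    # 2. Attach a quantizer and insert observers.
    quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
    m = prepare_pt2e(m, quantizer)

    # 3. Calibrate with a few representative batches.
    with torch.no_grad():
        for _ in range(4):
            m(torch.randn(2, 8))

    # 4. Convert to the reference quantized model.
    m = convert_pt2e(m)
    print(m)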