
Commit b4cb387

Merge branch 'main' into add_device_mesh_recipe
2 parents b85d479 + eec8d56 commit b4cb387

File tree

5 files changed: +297 -12 lines

prototype_source/prototype_index.rst

Lines changed: 6 additions & 0 deletions
@@ -89,6 +89,12 @@ Prototype features are not available as part of binary distributions like PyPI o
    :link: ../prototype/pt2e_quant_qat.html
    :tags: Quantization

+.. customcarditem::
+   :header: PyTorch 2 Export Quantization with X86 Backend through Inductor
+   :card_description: Learn how to use PT2 Export Quantization with X86 Backend through Inductor.
+   :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: ../prototype/pt2e_quant_x86_inductor.html
+   :tags: Quantization

 .. Sparsity

prototype_source/pt2e_quant_ptq_x86_inductor.rst renamed to prototype_source/pt2e_quant_x86_inductor.rst

Lines changed: 84 additions & 12 deletions
@@ -1,29 +1,31 @@
-PyTorch 2 Export Post Training Quantization with X86 Backend through Inductor
-========================================================================================
+PyTorch 2 Export Quantization with X86 Backend through Inductor
+==================================================================

 **Author**: `Leslie Fang <https://github.com/leslie-fang-intel>`_, `Weiwen Xia <https://github.com/Xia-Weiwen>`_, `Jiong Gong <https://github.com/jgong5>`_, `Jerry Zhang <https://github.com/jerryzh168>`_

 Prerequisites
-^^^^^^^^^^^^^^^
+---------------

 - `PyTorch 2 Export Post Training Quantization <https://pytorch.org/tutorials/prototype/pt2e_quant_ptq.html>`_
+- `PyTorch 2 Export Quantization-Aware Training <https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html>`_
 - `TorchInductor and torch.compile concepts in PyTorch <https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html>`_
 - `Inductor C++ Wrapper concepts <https://pytorch.org/tutorials/prototype/inductor_cpp_wrapper_tutorial.html>`_

 Introduction
-^^^^^^^^^^^^^^
+--------------

 This tutorial introduces the steps for utilizing the PyTorch 2 Export Quantization flow to generate a quantized model customized
 for the x86 inductor backend and explains how to lower the quantized model into the inductor.

-The new quantization 2 flow uses the PT2 Export to capture the model into a graph and perform quantization transformations on top of the ATen graph. This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
+The PyTorch 2 export quantization flow uses ``torch.export`` to capture the model into a graph and performs quantization transformations on top of the ATen graph.
+This approach is expected to have significantly higher model coverage, better programmability, and a simplified UX.
 TorchInductor is the new compiler backend that compiles the FX Graphs generated by TorchDynamo into optimized C++/Triton kernels.

 This flow of quantization 2 with Inductor mainly includes three steps:

 - Step 1: Capture the FX Graph from the eager Model based on the `torch export mechanism <https://pytorch.org/docs/main/export.html>`_.
 - Step 2: Apply the Quantization flow based on the captured FX Graph, including defining the backend-specific quantizer, generating the prepared model with observers,
-  performing the prepared model's calibration, and converting the prepared model into the quantized model.
+  performing the prepared model's calibration or quantization-aware training, and converting the prepared model into the quantized model.
 - Step 3: Lower the quantized model into inductor with the API ``torch.compile``.

 The high-level architecture of this flow could look like this:
@@ -61,10 +63,14 @@ and outstanding out-of-box performance with the compiler backend. Especially on
 further boost the models' performance by leveraging the
 `advanced-matrix-extensions <https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html>`_ feature.

-Now, we will walk you through a step-by-step tutorial for how to use it with `torchvision resnet18 model <https://download.pytorch.org/models/resnet18-f37072fd.pth>`_.
+Post Training Quantization
+----------------------------
+
+Now, we will walk you through a step-by-step tutorial for how to use it with `torchvision resnet18 model <https://download.pytorch.org/models/resnet18-f37072fd.pth>`_
+for post training quantization.

 1. Capture FX Graph
----------------------
+^^^^^^^^^^^^^^^^^^^^^

 We will start by performing the necessary imports, capturing the FX Graph from the eager module.
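The body of this section is unchanged by the commit and therefore not shown in the diff. For orientation only, a minimal sketch of the capture step, assuming the same ``capture_pre_autograd_graph`` API used in the QAT example later in this file (weight loading and preprocessing are omitted):

.. code:: python

    import torch
    import torchvision.models as models
    from torch._export import capture_pre_autograd_graph

    # Eager-mode torchvision resnet18 in inference mode
    model = models.resnet18().eval()
    example_inputs = (torch.randn(1, 3, 224, 224),)

    # Capture the FX Graph that will be quantized
    with torch.no_grad():
        exported_model = capture_pre_autograd_graph(model, example_inputs)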

@@ -111,7 +117,7 @@ We will start by performing the necessary imports, capturing the FX Graph from t
 Next, we will have the FX Module to be quantized.

 2. Apply Quantization
-----------------------------
+^^^^^^^^^^^^^^^^^^^^^^^

 After we capture the FX Module to be quantized, we will import the Backend Quantizer for X86 CPU and configure how to
 quantize the model.
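The body of this section is also outside the hunk. As a rough sketch only, continuing from the capture sketch above and reusing the ``X86InductorQuantizer`` APIs that appear in the QAT example below (the random batches stand in for a real calibration set):

.. code:: python

    import torch
    from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
    from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

    # Configure the backend-specific quantizer for X86 CPU
    quantizer = X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())

    # Insert observers into the captured FX Graph
    prepared_model = prepare_pt2e(exported_model, quantizer)

    # Calibrate the prepared model on representative data
    with torch.no_grad():
        for _ in range(10):
            prepared_model(torch.randn(1, 3, 224, 224))

    # Convert the calibrated model into the quantized model
    quantized_model = convert_pt2e(prepared_model)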
@@ -160,7 +166,7 @@ After these steps, we finished running the quantization flow and we will get the


 3. Lower into Inductor
-------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^

 After we get the quantized model, we will further lower it to the inductor backend. The default Inductor wrapper
 generates Python code to invoke both generated kernels and external kernels. Additionally, Inductor supports
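Continuing the sketch, the lowering step itself is a single ``torch.compile`` call; the ``TORCHINDUCTOR_FREEZING=1`` environment variable mentioned later in this diff applies to this step as well:

.. code:: python

    import torch

    # Lower the quantized model into Inductor; the first call triggers compilation
    with torch.no_grad():
        optimized_model = torch.compile(quantized_model)
        _ = optimized_model(*example_inputs)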
@@ -222,8 +228,74 @@ With PyTorch 2.1 release, all CNN models from TorchBench test suite have been me
 to `this document <https://dev-discuss.pytorch.org/t/torchinductor-update-6-cpu-backend-performance-update-and-new-features-in-pytorch-2-1/1514#int8-inference-with-post-training-static-quantization-3>`_
 for detail benchmark number.

-4. Conclusion
----------------
+Quantization Aware Training
+-----------------------------
+
+The PyTorch 2 Export Quantization-Aware Training (QAT) is now supported on X86 CPU using ``X86InductorQuantizer``,
+followed by the subsequent lowering of the quantized model into Inductor.
+For a more in-depth understanding of PT2 Export Quantization-Aware Training,
+we recommend referring to the dedicated `PyTorch 2 Export Quantization-Aware Training <https://pytorch.org/tutorials/prototype/pt2e_quant_qat.html>`_ tutorial.
+
+The PyTorch 2 Export QAT flow is largely similar to the PTQ flow:
+
+.. code:: python
+
+    import torch
+    from torch._export import capture_pre_autograd_graph
+    from torch.ao.quantization.quantize_pt2e import (
+        prepare_qat_pt2e,
+        convert_pt2e,
+    )
+    import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
+    from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer
+
+    class M(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.linear = torch.nn.Linear(1024, 1000)
+
+        def forward(self, x):
+            return self.linear(x)
+
+    example_inputs = (torch.randn(1, 1024),)
+    m = M()
+
+    # Step 1. program capture
+    # NOTE: this API will be updated to the torch.export API in the future, but the captured
+    # result should mostly stay the same
+    exported_model = capture_pre_autograd_graph(m, example_inputs)
+    # we get a model with aten ops
+
+    # Step 2. quantization-aware training
+    # Use the Backend Quantizer for X86 CPU
+    quantizer = X86InductorQuantizer()
+    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config(is_qat=True))
+    prepared_model = prepare_qat_pt2e(exported_model, quantizer)
+
+    # train omitted
+
+    converted_model = convert_pt2e(prepared_model)
+    # we have a model with aten ops doing integer computations when possible
+
+    # move the quantized model to eval mode, equivalent to `m.eval()`
+    torch.ao.quantization.move_exported_model_to_eval(converted_model)
+
+    # Lower the model into Inductor
+    with torch.no_grad():
+        optimized_model = torch.compile(converted_model)
+        _ = optimized_model(*example_inputs)
+
+Please note that the Inductor ``freeze`` feature is not enabled by default.
+To use this feature, you need to run example code with ``TORCHINDUCTOR_FREEZING=1``.
+
+For example:
+
+::
+
+    TORCHINDUCTOR_FREEZING=1 python example_x86inductorquantizer_qat.py
+
+Conclusion
+------------

 With this tutorial, we introduce how to use Inductor with X86 CPU in PyTorch 2 Quantization. Users can learn about
 how to use ``X86InductorQuantizer`` to quantize a model and lower it into the inductor with X86 CPU devices.
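The QAT example above marks its training loop as ``# train omitted``. A minimal sketch of what could go in that gap; the optimizer, loss, random data, and step count below are purely illustrative placeholders for a real fine-tuning setup:

.. code:: python

    import torch

    optimizer = torch.optim.SGD(prepared_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Short fake-quantized fine-tuning loop on random data (input and target shapes
    # match the Linear(1024, 1000) toy model used in the QAT example)
    for _ in range(10):
        inputs = torch.randn(8, 1024)
        targets = torch.randint(0, 1000, (8,))
        optimizer.zero_grad()
        loss = loss_fn(prepared_model(inputs), targets)
        loss.backward()
        optimizer.step()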
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
+(beta) Compiling the optimizer with torch.compile
+==========================================================================================
+
+**Author:** `Michael Lazos <https://github.com/mlazos>`_
+
+The optimizer is a key algorithm for training any deep learning model.
+Since it is responsible for updating every model parameter, it can often
+become the bottleneck in training performance for large models. In this recipe,
+we will apply ``torch.compile`` to the optimizer to observe the GPU performance
+improvement.
+
+.. note::
+
+   This tutorial requires PyTorch 2.2.0 or later.
+
+Model Setup
+~~~~~~~~~~~~~~~~~~~~~
+For this example, we'll use a simple sequence of linear layers.
+Since we are only benchmarking the optimizer, the choice of model doesn't matter
+because optimizer performance is a function of the number of parameters.
+
+Depending on what machine you are using, your exact results may vary.
+
+.. code-block:: python
+
+   import torch
+
+   model = torch.nn.Sequential(
+       *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(10)]
+   )
+   input = torch.rand(1024, device="cuda")
+   output = model(input)
+   output.sum().backward()
+
+Setting up and running the optimizer benchmark
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In this example, we'll use the Adam optimizer
+and create a helper function to wrap the step()
+in ``torch.compile()``.
+
+.. note::
+
+   ``torch.compile`` is only supported on cuda devices with compute capability >= 7.0
+
+.. code-block:: python
+
+   # exit cleanly if we are on a device that doesn't support torch.compile
+   if torch.cuda.get_device_capability() < (7, 0):
+       print("Exiting because torch.compile is not supported on this device.")
+       import sys
+       sys.exit(0)
+
+
+   opt = torch.optim.Adam(model.parameters(), lr=0.01)
+
+
+   @torch.compile(fullgraph=False)
+   def fn():
+       opt.step()
+
+
+   # Let's define a helpful benchmarking function:
+   import torch.utils.benchmark as benchmark
+
+
+   def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
+       t0 = benchmark.Timer(
+           stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
+       )
+       return t0.blocked_autorange().mean * 1e6
+
+
+   # Warmup runs to compile the function
+   for _ in range(5):
+       fn()
+
+   eager_runtime = benchmark_torch_function_in_microseconds(opt.step)
+   compiled_runtime = benchmark_torch_function_in_microseconds(fn)
+
+   assert eager_runtime > compiled_runtime
+
+   print(f"eager runtime: {eager_runtime}us")
+   print(f"compiled runtime: {compiled_runtime}us")
+
+Sample Results:
+
+* Eager runtime: 747.2437149845064us
+* Compiled runtime: 392.07384741178us
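Since the recipe attributes optimizer cost to the number of parameters, a small sketch (not part of the recipe) that reuses its benchmark helper to compare eager and compiled ``Adam.step()`` at a few illustrative model sizes:

.. code-block:: python

    import torch
    import torch.utils.benchmark as benchmark


    def benchmark_torch_function_in_microseconds(f, *args, **kwargs):
        t0 = benchmark.Timer(
            stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f}
        )
        return t0.blocked_autorange().mean * 1e6


    for num_layers in (2, 10, 20):
        model = torch.nn.Sequential(
            *[torch.nn.Linear(1024, 1024, False, device="cuda") for _ in range(num_layers)]
        )
        model(torch.rand(1024, device="cuda")).sum().backward()
        opt = torch.optim.Adam(model.parameters(), lr=0.01)

        @torch.compile(fullgraph=False)
        def step_fn():
            opt.step()

        for _ in range(5):  # warmup runs compile the function
            step_fn()

        eager_us = benchmark_torch_function_in_microseconds(opt.step)
        compiled_us = benchmark_torch_function_in_microseconds(step_fn)
        print(f"{num_layers} layers: eager {eager_us:.1f}us, compiled {compiled_us:.1f}us")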

recipes_source/recipes_index.rst

Lines changed: 19 additions & 0 deletions
@@ -144,6 +144,14 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
    :link: ../recipes/recipes/module_load_state_dict_tips.html
    :tags: Basics

+.. customcarditem::
+   :header: (beta) Using TORCH_LOGS to observe torch.compile
+   :card_description: Learn how to use the torch logging APIs to observe the compilation process.
+   :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: ../recipes/torch_logs.html
+   :tags: Basics
+
+
 .. Interpretability

 .. customcarditem::

@@ -276,6 +284,15 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
    :link: ../recipes/amx.html
    :tags: Model-Optimization

+.. (beta) Compiling the Optimizer with torch.compile
+
+.. customcarditem::
+   :header: (beta) Compiling the Optimizer with torch.compile
+   :card_description: Speed up the optimizer using torch.compile
+   :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: ../recipes/compiling_optimizer.html
+   :tags: Model-Optimization
+
 .. Intel(R) Extension for PyTorch*

 .. customcarditem::

@@ -360,6 +377,7 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu

    /recipes/recipes/loading_data_recipe
    /recipes/recipes/defining_a_neural_network
+   /recipes/torch_logs
    /recipes/recipes/what_is_state_dict
    /recipes/recipes/saving_and_loading_models_for_inference
    /recipes/recipes/saving_and_loading_a_general_checkpoint

@@ -375,6 +393,7 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
    /recipes/recipes/amp_recipe
    /recipes/recipes/tuning_guide
    /recipes/recipes/intel_extension_for_pytorch
+   /recipes/compiling_optimizer
    /recipes/torch_compile_backend_ipex
    /recipes/torchscript_inference
    /recipes/deployment_with_flask

recipes_source/torch_logs.py

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+"""
+(beta) Using TORCH_LOGS python API with torch.compile
+==========================================================================================
+**Author:** `Michael Lazos <https://github.com/mlazos>`_
+"""
+
+import logging
+
+######################################################################
+#
+# This tutorial introduces the ``TORCH_LOGS`` environment variable, as well as the Python API, and
+# demonstrates how to apply it to observe the phases of ``torch.compile``.
+#
+# .. note::
+#
+#   This tutorial requires PyTorch 2.2.0 or later.
+#
+#
+
+
+######################################################################
+# Setup
+# ~~~~~~~~~~~~~~~~~~~~~
+# In this example, we'll set up a simple Python function which performs an elementwise
+# add and observe the compilation process with ``TORCH_LOGS`` Python API.
+#
+# .. note::
+#
+#   There is also an environment variable ``TORCH_LOGS``, which can be used to
+#   change logging settings at the command line. The equivalent environment
+#   variable setting is shown for each example.
+
+import torch
+
+# exit cleanly if we are on a device that doesn't support torch.compile
+if torch.cuda.get_device_capability() < (7, 0):
+    print("Exiting because torch.compile is not supported on this device.")
+    import sys
+
+    sys.exit(0)
+
+
+@torch.compile()
+def fn(x, y):
+    z = x + y
+    return z + 2
+
+
+inputs = (torch.ones(2, 2, device="cuda"), torch.zeros(2, 2, device="cuda"))
+
+
+# print separator and reset dynamo
+# between each example
+def separator(name):
+    print(f"==================={name}=========================")
+    torch._dynamo.reset()
+
+
+separator("Dynamo Tracing")
+# View dynamo tracing
+# TORCH_LOGS="+dynamo"
+torch._logging.set_logs(dynamo=logging.DEBUG)
+fn(*inputs)
+
+separator("Traced Graph")
+# View traced graph
+# TORCH_LOGS="graph"
+torch._logging.set_logs(graph=True)
+fn(*inputs)
+
+separator("Fusion Decisions")
+# View fusion decisions
+# TORCH_LOGS="fusion"
+torch._logging.set_logs(fusion=True)
+fn(*inputs)
+
+separator("Output Code")
+# View output code generated by inductor
+# TORCH_LOGS="output_code"
+torch._logging.set_logs(output_code=True)
+fn(*inputs)
+
+separator("")
+
+######################################################################
+# Conclusion
+# ~~~~~~~~~~
+#
+# In this tutorial we introduced the TORCH_LOGS environment variable and python API
+# by experimenting with a small number of the available logging options.
+# To view descriptions of all available options, run any python script
+# which imports torch and set TORCH_LOGS to "help".
+#
+# Alternatively, you can view the `torch._logging documentation`_ to see
+# descriptions of all available logging options.
+#
+# For more information on torch.compile, see the `torch.compile tutorial`_.
+#
+# .. _torch._logging documentation: https://pytorch.org/docs/main/logging.html
+# .. _torch.compile tutorial: https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html
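Several of these logging options can also be enabled together; a short sketch assuming, as with the calls above, that ``torch._logging.set_logs`` accepts multiple options in one call (the comma-separated form being the ``TORCH_LOGS`` equivalent):

.. code-block:: python

    import logging

    import torch

    # TORCH_LOGS="+dynamo,graph,output_code"
    torch._logging.set_logs(dynamo=logging.DEBUG, graph=True, output_code=True)


    @torch.compile()
    def add_fn(x, y):
        return x + y + 2


    add_fn(torch.ones(2, 2, device="cuda"), torch.zeros(2, 2, device="cuda"))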
