From bbbb580f654afa174ef647443990679128a6cf75 Mon Sep 17 00:00:00 2001
From: ecao
Date: Tue, 6 Jun 2023 05:33:45 -0700
Subject: [PATCH 01/10] [Doc] Add AMX document for oneDNN backend

---
 recipes_source/amx.rst           | 127 +++++++++++++++++++++++++++++++
 recipes_source/recipes_index.rst |   9 +++
 2 files changed, 136 insertions(+)
 create mode 100644 recipes_source/amx.rst

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
new file mode 100644
index 00000000000..68f2234a000
--- /dev/null
+++ b/recipes_source/amx.rst
@@ -0,0 +1,127 @@
+Leverage Advanced Matrix Extensions
+==============================================
+
+Introduction
+--------------
+
+Advanced Matrix Extensions (AMX), also known as Intel Advanced Matrix Extensions (Intel AMX), is an extensions to the x86 instruction set architecture (ISA).
+AMX is designed to improve performance of deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems and image recognition.
+AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively.
+For more detailed information of AMX, see `here `_ and `here `_.
+
+Note: AMX will have FP16 support on the next generation of Xeon.
+
+AMX in PyTorch
+--------------
+
+PyTorch leverages AMX for computing intensive operators with BFloat16 and quantization with INT8 by its backend oneDNN
+to get higher performance out-of-box on x86 CPUs with AMX support.
+The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
+No manual operations are required to enable this feature.
+
+BF16 CPU ops that can leverage AMX:
+"""""""""""""""""""""""""""""""""""
+
+``conv1d``,
+``conv2d``,
+``conv3d``,
+``conv_transpose1d``,
+``conv_transpose2d``,
+``conv_transpose3d``,
+``bmm``,
+``mm``,
+``baddbmm``,
+``addmm``,
+``addbmm``,
+``linear``,
+``matmul``,
+``_convolution``
+
+Quantization CPU ops that can leverage AMX:
+"""""""""""""""""""""""""""""""""""
+
+``conv1d``,
+``conv2d``,
+``conv3d``,
+``conv1d``,
+``conv2d``,
+``conv3d``,
+``conv_transpose1d``,
+``conv_transpose2d``,
+``conv_transpose3d``,
+``linear``
+
+Preliminary requirements to activate AMX support for PyTorch:
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+All of the following Instruction sets onboard the hardware platform
+"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
++---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
+| AVX512F | AVX512BW | AVX512VL | AVX512DQ | AVX512_VNNI | AVX512_BF16 | AMX_TILE | AMX_INT8 | AMX_BF16 |
++---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
+
+Software
+""""""""
+
+For linux:
+
++----------------------+---------------+
+| linux kernel >= 5.16 | Python >= 3.8 |
++----------------------+---------------+
+
+
+Guidelines of leveraging AMX with workloads
+------------------------------------------
+
+For BFloat16 data type:
+''''''''''''''''''''
+
+Using `torch.cpu.amp` or `torch.autocast("cpu")` would utilize AMX acceleration.
+
+
+```
+model = model.to(memory_format=torch.channels_last)
+with torch.cpu.amp.autocast():
+    output = model(input)
+```
+
+
+For quantization:
+'''''''''''''''''
+
+Applying quantization would utilize AMX acceleration.
+
+Note: Use channels last format to get better performance.
+
+For torch.compile:
+'''''''''''''''''
+
+When the generated graph model runs into oneDNN implementations with the supported operators mentioned in lists above, AMX accelerations will be activated.
+
+
+Confirm AMX is being utilized
+''''''''''''''''''''''
+
+Set environment variable `export ONEDNN_VERBOSE=1` to get oneDNN verbose at runtime.
+For more detailed information of oneDNN, see `here `_.
+
+For example:
+
+Get oneDNN verbose:
+
+```
+onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
+onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
+onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
+onednn_verbose,info,gpu,runtime:none
+onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
+onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
+...
+onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
+...
+onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
+...
+```
+
+If we get the verbose of `avx512_core_amx_bf16` for BFloat16 or `avx512_core_amx_int8` for quantization with INT8, it indicates that AMX is activated.
diff --git a/recipes_source/recipes_index.rst b/recipes_source/recipes_index.rst
index f4e7466ddae..e145f12ede7 100644
--- a/recipes_source/recipes_index.rst
+++ b/recipes_source/recipes_index.rst
@@ -253,6 +253,15 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
    :link: ../recipes/recipes/tuning_guide.html
    :tags: Model-Optimization
 
+.. Leverage Advanced Matrix Extensions
+
+.. customcarditem::
+   :header: Leverage Advanced Matrix Extensions
+   :card_description: Learn to leverage Advanced Matrix Extensions.
+   :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: ../recipes/amx.html
+   :tags: Model-Optimization
+
 .. Intel(R) Extension for PyTorch*
 
 .. customcarditem::

From 7945bdb9aafeca221983571e90424f9a078a5677 Mon Sep 17 00:00:00 2001
From: ecao
Date: Tue, 6 Jun 2023 22:14:39 -0700
Subject: [PATCH 02/10] Update AMX document

---
 recipes_source/amx.rst | 105 +++++++++++++++++------------------
 1 file changed, 43 insertions(+), 62 deletions(-)

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
index 68f2234a000..46386b8031b 100644
--- a/recipes_source/amx.rst
+++ b/recipes_source/amx.rst
@@ -1,26 +1,27 @@
+==============================================
 Leverage Advanced Matrix Extensions
 ==============================================
 
 Introduction
---------------
+============
 
-Advanced Matrix Extensions (AMX), also known as Intel Advanced Matrix Extensions (Intel AMX), is an extensions to the x86 instruction set architecture (ISA).
-AMX is designed to improve performance of deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems and image recognition.
-AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively.
-For more detailed information of AMX, see `here `_ and `here `_.
+Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
+Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation, see `Accelerate AI Workloads with Intel® AMX`_.
+AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively, see figure 6 of `Accelerate AI Workloads with Intel® AMX`_.
+For more detailed information of AMX, see `Intel® AMX Overview`_.
+
-Note: AMX will have FP16 support on the next generation of Xeon.
 
 AMX in PyTorch
---------------
+==============
 
 PyTorch leverages AMX for computing intensive operators with BFloat16 and quantization with INT8 by its backend oneDNN
 to get higher performance out-of-box on x86 CPUs with AMX support.
+For more detailed information of oneDNN, see `oneDNN`_.
+
 The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
 No manual operations are required to enable this feature.
 
-BF16 CPU ops that can leverage AMX:
-"""""""""""""""""""""""""""""""""""
+- BF16 CPU ops that can leverage AMX:
 
 ``conv1d``,
 ``conv2d``,
@@ -37,8 +38,7 @@ BF16 CPU ops that can leverage AMX:
 ``matmul``,
 ``_convolution``
 
-Quantization CPU ops that can leverage AMX:
-"""""""""""""""""""""""""""""""""""
+- Quantization CPU ops that can leverage AMX:
 
 ``conv1d``,
 ``conv2d``,
@@ -51,77 +51,58 @@ Quantization CPU ops that can leverage AMX:
 ``conv_transpose3d``,
 ``linear``
 
-Preliminary requirements to activate AMX support for PyTorch:
-'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-All of the following Instruction sets onboard the hardware platform
-"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-+---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
-| AVX512F | AVX512BW | AVX512VL | AVX512DQ | AVX512_VNNI | AVX512_BF16 | AMX_TILE | AMX_INT8 | AMX_BF16 |
-+---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
-
-Software
-""""""""
-
-For linux:
-
-+----------------------+---------------+
-| linux kernel >= 5.16 | Python >= 3.8 |
-+----------------------+---------------+
-
 Guidelines of leveraging AMX with workloads
-------------------------------------------
+--------------------------------------------------
 
-For BFloat16 data type:
-''''''''''''''''''''
+- BFloat16 data type:
 
-Using `torch.cpu.amp` or `torch.autocast("cpu")` would utilize AMX acceleration.
+Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration.
+
+::
-
-```
-model = model.to(memory_format=torch.channels_last)
-with torch.cpu.amp.autocast():
-    output = model(input)
-```
+
+    model = model.to(memory_format=torch.channels_last)
+    with torch.cpu.amp.autocast():
+        output = model(input)
 
+Note: Use channels last format to get better performance.
 
-For quantization:
-'''''''''''''''''
+- quantization:
 
 Applying quantization would utilize AMX acceleration.
 
-Note: Use channels last format to get better performance.
-
-For torch.compile:
-'''''''''''''''''
+- torch.compile:
 
 When the generated graph model runs into oneDNN implementations with the supported operators mentioned in lists above, AMX accelerations will be activated.
 
 
 Confirm AMX is being utilized
-''''''''''''''''''''''
+------------------------------
 
 Set environment variable `export ONEDNN_VERBOSE=1` to get oneDNN verbose at runtime.
-For more detailed information of oneDNN, see `here `_.
 
 For example:
 
 Get oneDNN verbose:
 
-```
-onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
-onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
-onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
-onednn_verbose,info,gpu,runtime:none
-onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
-onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
-...
-onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
-...
-onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
-...
-```
-
-If we get the verbose of `avx512_core_amx_bf16` for BFloat16 or `avx512_core_amx_int8` for quantization with INT8, it indicates that AMX is activated.
+::
+
+    onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
+    onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
+    onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
+    onednn_verbose,info,gpu,runtime:none
+    onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
+    onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
+    ...
+    onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
+    ...
+    onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
+    ...
+
+If we get the verbose of ``avx512_core_amx_bf16`` for BFloat16 or ``avx512_core_amx_int8`` for quantization with INT8, it indicates that AMX is activated.
+
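+For example, a small helper sketch for checking a captured log for AMX kernels (the log file name
+``onednn.log`` is an illustrative assumption for output redirected from a run with ``ONEDNN_VERBOSE=1``):
+
+::
+
+    # read a previously captured oneDNN verbose log and look for AMX kernel names
+    with open("onednn.log") as f:
+        log = f.read()
+    print("AMX BF16 kernels used:", "avx512_core_amx_bf16" in log)
+    print("AMX INT8 kernels used:", "avx512_core_amx_int8" in log)
+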
+.. _Accelerate AI Workloads with Intel® AMX: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html
+
+.. _Intel® AMX Overview: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html
+
+.. _oneDNN: https://oneapi-src.github.io/oneDNN/index.html

From a2b148c633fb6a1bbfe3d6a5b1d49cd7d3d76f0f Mon Sep 17 00:00:00 2001
From: ecao
Date: Tue, 6 Jun 2023 22:42:34 -0700
Subject: [PATCH 03/10] Update AMX document

---
 recipes_source/amx.rst | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
index 46386b8031b..ab9079149f4 100644
--- a/recipes_source/amx.rst
+++ b/recipes_source/amx.rst
@@ -79,11 +79,9 @@ When the generated graph model runs into oneDNN implementations with the support
 Confirm AMX is being utilized
 ------------------------------
 
-Set environment variable `export ONEDNN_VERBOSE=1` to get oneDNN verbose at runtime.
+Set environment variable ``export ONEDNN_VERBOSE=1`` to get oneDNN verbose at runtime.
 
-For example:
-
-Get oneDNN verbose:
+For example, get oneDNN verbose:
 
 ::

From 1ae03c76acd41c041db11c635f1cc384d5fe549a Mon Sep 17 00:00:00 2001
From: ecao
Date: Wed, 7 Jun 2023 02:46:03 -0700
Subject: [PATCH 04/10] Update AMX document

---
 recipes_source/amx.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
index ab9079149f4..2500e2281c2 100644
--- a/recipes_source/amx.rst
+++ b/recipes_source/amx.rst
@@ -19,7 +19,7 @@ to get higher performance out-of-box on x86 CPUs with AMX support.
 For more detailed information of oneDNN, see `oneDNN`_.
 
 The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
-No manual operations are required to enable this feature.
+Since oneDNN is the default acceleration library for CPU, no manual operations are required to enable the AMX support.
 
 - BF16 CPU ops that can leverage AMX:
 
 ``conv1d``,
@@ -51,6 +51,8 @@ Since oneDNN is the default acceleration library for CPU, no manual operations a
 ``conv_transpose3d``,
 ``linear``
 
+Note: For quantized linear, whether to leverage AMX depends on which quantization backend to choose.
+At present, x86 quantization backend is used by default for quantized linear, using fbgemm, while users can specify onednn backend to turn on AMX for quantized linear.
 
 Guidelines of leveraging AMX with workloads
 --------------------------------------------------

From 377610153e3b32b9c1bfe59e1a13abfffba31dae Mon Sep 17 00:00:00 2001
From: ecao
Date: Wed, 7 Jun 2023 18:18:27 -0700
Subject: [PATCH 05/10] Update AMX document

---
 recipes_source/amx.rst | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
index 2500e2281c2..0a1d8e0dbda 100644
--- a/recipes_source/amx.rst
+++ b/recipes_source/amx.rst
@@ -51,8 +51,7 @@ Since oneDNN is the default acceleration library for CPU, no manual operations a
 ``conv_transpose3d``,
 ``linear``
 
-Note: For quantized linear, whether to leverage AMX depends on which quantization backend to choose.
-At present, x86 quantization backend is used by default for quantized linear, using fbgemm, while users can specify onednn backend to turn on AMX for quantized linear.
+Note: For quantized linear, whether to leverage AMX depends on the policy of the quantization backend.
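+
+For example, a minimal sketch of checking and selecting the quantization engine explicitly (choosing
+the ``onednn`` engine here is an illustrative assumption; which engines dispatch quantized operators
+to AMX kernels is decided by the corresponding backend and may change across PyTorch releases):
+
+::
+
+    import torch
+
+    # engines available in this build, e.g. including 'x86', 'fbgemm' and 'onednn'
+    print(torch.backends.quantized.supported_engines)
+
+    # select the engine that will run quantized operators such as quantized linear
+    torch.backends.quantized.engine = "onednn"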
 
 Guidelines of leveraging AMX with workloads
 --------------------------------------------------

From d3d0aaee3504fd918efdbd233bd1cf8646fb94f5 Mon Sep 17 00:00:00 2001
From: ecao
Date: Wed, 7 Jun 2023 19:56:30 -0700
Subject: [PATCH 06/10] Update AMX document

---
 recipes_source/amx.rst | 69 ++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 29 deletions(-)

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
index 0a1d8e0dbda..2f19bb66254 100644
--- a/recipes_source/amx.rst
+++ b/recipes_source/amx.rst
@@ -5,9 +5,13 @@ Leverage Advanced Matrix Extensions
 Introduction
 ============
 
-Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
+Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an x86 extension,
+which introduces two new components: a 2-dimensional register file called 'tiles' and an accelerator of Tile Matrix Multiplication (TMUL) that are able to operate on those tiles.
+AMX is designed to work on matrices to accelerate deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems and image recognition.
+
 Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation, see `Accelerate AI Workloads with Intel® AMX`_.
-AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively, see figure 6 of `Accelerate AI Workloads with Intel® AMX`_.
+Compared to 3rd Gen Intel Xeon Scalable processors running Intel® Advanced Vector Extensions 512 Neural Network Instructions (Intel® AVX-512 VNNI),
+4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle, rather than 256 INT8 operations per cycle. They can also perform 1,024 BF16 operations per cycle, as compared to 64 FP32 operations per cycle, see page 4 of `Accelerate AI Workloads with Intel® AMX`_.
 For more detailed information of AMX, see `Intel® AMX Overview`_.
 
 
@@ -19,7 +23,34 @@ to get higher performance out-of-box on x86 CPUs with AMX support.
 For more detailed information of oneDNN, see `oneDNN`_.
 
 The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
-Since oneDNN is the default acceleration library for CPU, no manual operations are required to enable the AMX support.
+Since oneDNN is the default acceleration library for PyTorch CPU, no manual operations are required to enable the AMX support.
+
+Guidelines of leveraging AMX with workloads
+-------------------------------------------
+
+- BFloat16 data type:
+
+Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.
+
+::
+
+    model = model.to(memory_format=torch.channels_last)
+    with torch.cpu.amp.autocast():
+        output = model(input)
+
+Note: Use channels last format to get better performance.
+
+- quantization:
+
+Applying quantization would utilize AMX acceleration for supported operators.
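+
+For example, a minimal post-training dynamic quantization sketch (``MyModel`` and ``input`` are
+illustrative placeholders for a real FP32 model and its input; whether the resulting INT8 operators
+run on AMX is decided by the quantization backend and oneDNN):
+
+::
+
+    import torch
+
+    model = MyModel().eval()
+
+    # quantize Linear modules to INT8 using post-training dynamic quantization
+    quantized_model = torch.ao.quantization.quantize_dynamic(
+        model, {torch.nn.Linear}, dtype=torch.qint8
+    )
+
+    output = quantized_model(input)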
+
+- torch.compile:
+
+When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.
+
+
+CPU operators that can leverage AMX:
+------------------------------------
 
 - BF16 CPU ops that can leverage AMX:
 
 ``conv1d``,
 ``conv2d``,
@@ -36,13 +67,9 @@ Since oneDNN is the default acceleration library for CPU, no manual operations a
 ``addbmm``,
 ``linear``,
 ``matmul``,
-``_convolution``
 
 - Quantization CPU ops that can leverage AMX:
 
-``conv1d``,
-``conv2d``,
-``conv3d``,
 ``conv1d``,
 ``conv2d``,
 ``conv3d``,
@@ -53,34 +80,18 @@ Since oneDNN is the default acceleration library for CPU, no manual operations a
 ``conv_transpose3d``,
 ``linear``
 
 Note: For quantized linear, whether to leverage AMX depends on the policy of the quantization backend.
 
-Guidelines of leveraging AMX with workloads
---------------------------------------------------
-
-- BFloat16 data type:
-
-Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration.
-
-::
-
-    model = model.to(memory_format=torch.channels_last)
-    with torch.cpu.amp.autocast():
-        output = model(input)
-
-Note: Use channels last format to get better performance.
-
-- quantization:
-
-Applying quantization would utilize AMX acceleration.
-
-- torch.compile:
-
-When the generated graph model runs into oneDNN implementations with the supported operators mentioned in lists above, AMX accelerations will be activated.
 
 Confirm AMX is being utilized
 ------------------------------
 
-Set environment variable ``export ONEDNN_VERBOSE=1`` to get oneDNN verbose at runtime.
+Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.
+
+::
+
+    with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
+        model(input)
 
 For example, get oneDNN verbose:

From 47ae25d96a24716a4077620191a681ba0f4beeda Mon Sep 17 00:00:00 2001
From: ecao
Date: Wed, 7 Jun 2023 23:06:44 -0700
Subject: [PATCH 07/10] Update document

---
 recipes_source/amx.rst | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
index 2f19bb66254..6746511d70f 100644
--- a/recipes_source/amx.rst
+++ b/recipes_source/amx.rst
@@ -6,7 +6,7 @@ Introduction
 ============
 
 Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an x86 extension,
-which introduces two new components: a 2-dimensional register file called 'tiles' and an accelerator of Tile Matrix Multiplication (TMUL) that are able to operate on those tiles.
+which introduces two new components: a 2-dimensional register file called 'tiles' and an accelerator of Tile Matrix Multiplication (TMUL) that is able to operate on those tiles.
 AMX is designed to work on matrices to accelerate deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems and image recognition.
 
@@ -40,7 +40,7 @@ Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX accelerat
 
 Note: Use channels last format to get better performance.
 
-- quantization:
+- Quantization:
 
 Applying quantization would utilize AMX acceleration for supported operators.
When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated. +Note: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default. +This means that PyTorch will attempt to leverage the AMX feature whenever possible to speed up matrix multiplication operations. +However, it's important to note that the decision to dispatch to the AMX kernel ultimately depends on +the internal optimization strategy of the oneDNN library and the quantization backend, which PyTorch relies on for performance enhancements. +The specific details of how AMX utilization is handled internally by PyTorch and the oneDNN library may be subject to change with updates and improvements to the framework. + CPU operators that can leverage AMX: ------------------------------------ @@ -78,9 +84,6 @@ CPU operators that can leverage AMX: ``conv_transpose3d``, ``linear`` -Note: For quantized linear, whether to leverage AMX depends on the policy of the quantization backend. - - Confirm AMX is being utilized @@ -91,7 +94,8 @@ Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mk :: with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON): - model(input) + with torch.cpu.amp.autocast(): + model(input) For example, get oneDNN verbose: From a608d00fa7320427cd6327d8b9e2317f1319ad04 Mon Sep 17 00:00:00 2001 From: ecao Date: Thu, 8 Jun 2023 02:20:49 -0700 Subject: [PATCH 08/10] add conclusion --- recipes_source/amx.rst | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst index 6746511d70f..d073cd91076 100644 --- a/recipes_source/amx.rst +++ b/recipes_source/amx.rst @@ -113,7 +113,21 @@ For example, get oneDNN verbose: onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382 ... -If we get the verbose of ``avx512_core_amx_bf16`` for BFloat16 or ``avx512_core_amx_int8`` for quantization with INT8, it indicates that AMX is activated. +If you get the verbose of ``avx512_core_amx_bf16`` for BFloat16 or ``avx512_core_amx_int8`` for quantization with INT8, it indicates that AMX is activated. + + +Conclusion +---------- + + +In this tutorial, we briefly introduced AMX, how to utilize AMX in PyTorch to accelerate workloads, and how to confirm that AMX is being utilized. + +With the improvements and updates of PyTorch and oneDNN, the utilization of AMX may be subject to change accordingly. + +As always, if you run into any problems or have any questions, you can use +`forum `_ or `GitHub issues +`_ to get in touch. + .. 
 .. _Accelerate AI Workloads with Intel® AMX: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html

From 2bbcabf7696c1e9be5270664ae38bd1056058408 Mon Sep 17 00:00:00 2001
From: ecao
Date: Mon, 12 Jun 2023 18:38:23 -0700
Subject: [PATCH 09/10] Editorial fixes for proper HTML rendering

---
 recipes_source/amx.rst | 72 ++++++++++++++++++++----------------------
 1 file changed, 35 insertions(+), 37 deletions(-)

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
index d073cd91076..27d408b336f 100644
--- a/recipes_source/amx.rst
+++ b/recipes_source/amx.rst
@@ -22,74 +22,72 @@ PyTorch leverages AMX for computing intensive operators with BFloat16 and quanti
 to get higher performance out-of-box on x86 CPUs with AMX support.
 For more detailed information of oneDNN, see `oneDNN`_.
 
-The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
+The operation is fully handled by oneDNN according to the execution code path generated. For example, when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
 Since oneDNN is the default acceleration library for PyTorch CPU, no manual operations are required to enable the AMX support.
 
 Guidelines of leveraging AMX with workloads
 -------------------------------------------
 
+This section provides guidelines on how to leverage AMX with various workloads.
+
 - BFloat16 data type:
 
-Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.
+  - Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.
 
-::
+  ::
 
-    model = model.to(memory_format=torch.channels_last)
-    with torch.cpu.amp.autocast():
-        output = model(input)
+    model = model.to(memory_format=torch.channels_last)
+    with torch.cpu.amp.autocast():
+        output = model(input)
 
-Note: Use channels last format to get better performance.
+.. note:: Use ``torch.channels_last`` memory format to get better performance.
 
 - Quantization:
 
-Applying quantization would utilize AMX acceleration for supported operators.
+  - Applying quantization would utilize AMX acceleration for supported operators.
 
 - torch.compile:
 
-When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.
+  - When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.
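+
+  For example, a minimal sketch combining ``torch.compile`` with BFloat16 autocast (``MyModel`` is
+  an illustrative placeholder for a real model; operator coverage depends on the generated graph):
+
+  ::
+
+    model = MyModel().eval()
+    compiled_model = torch.compile(model)
+    with torch.cpu.amp.autocast():
+        output = compiled_model(input)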
 
-Note: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default.
-This means that PyTorch will attempt to leverage the AMX feature whenever possible to speed up matrix multiplication operations.
-However, it's important to note that the decision to dispatch to the AMX kernel ultimately depends on
-the internal optimization strategy of the oneDNN library and the quantization backend, which PyTorch relies on for performance enhancements.
-The specific details of how AMX utilization is handled internally by PyTorch and the oneDNN library may be subject to change with updates and improvements to the framework.
+.. note:: When using PyTorch on CPUs that support AMX, the framework will automatically enable AMX usage by default. This means that PyTorch will attempt to leverage the AMX feature whenever possible to speed up matrix multiplication operations. However, it's important to note that the decision to dispatch to the AMX kernel ultimately depends on the internal optimization strategy of the oneDNN library and the quantization backend, which PyTorch relies on for performance enhancements. The specific details of how AMX utilization is handled internally by PyTorch and the oneDNN library may be subject to change with updates and improvements to the framework.
 
 
 CPU operators that can leverage AMX:
 ------------------------------------
 
 BF16 CPU ops that can leverage AMX:
 
-``conv1d``,
-``conv2d``,
-``conv3d``,
-``conv_transpose1d``,
-``conv_transpose2d``,
-``conv_transpose3d``,
-``bmm``,
-``mm``,
-``baddbmm``,
-``addmm``,
-``addbmm``,
-``linear``,
-``matmul``,
+- ``conv1d``
+- ``conv2d``
+- ``conv3d``
+- ``conv_transpose1d``
+- ``conv_transpose2d``
+- ``conv_transpose3d``
+- ``bmm``
+- ``mm``
+- ``baddbmm``
+- ``addmm``
+- ``addbmm``
+- ``linear``
+- ``matmul``
 
 Quantization CPU ops that can leverage AMX:
 
-``conv1d``,
-``conv2d``,
-``conv3d``,
-``conv_transpose1d``,
-``conv_transpose2d``,
-``conv_transpose3d``,
-``linear``
+- ``conv1d``
+- ``conv2d``
+- ``conv3d``
+- ``conv_transpose1d``
+- ``conv_transpose2d``
+- ``conv_transpose3d``
+- ``linear``
 
 Confirm AMX is being utilized
 ------------------------------
 
-Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.
+Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to enable oneDNN to dump verbose messages.
 
 ::

From 6d02a18050a2226a2292463b99f91cdb1010807f Mon Sep 17 00:00:00 2001
From: ecao
Date: Mon, 12 Jun 2023 19:08:57 -0700
Subject: [PATCH 10/10] Update AMX document

---
 recipes_source/amx.rst           | 2 +-
 recipes_source/recipes_index.rst | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/recipes_source/amx.rst b/recipes_source/amx.rst
index 27d408b336f..459e7c5541b 100644
--- a/recipes_source/amx.rst
+++ b/recipes_source/amx.rst
@@ -1,5 +1,5 @@
 ==============================================
-Leverage Advanced Matrix Extensions
+Leverage Intel® Advanced Matrix Extensions
 ==============================================
 
 Introduction
diff --git a/recipes_source/recipes_index.rst b/recipes_source/recipes_index.rst
index e145f12ede7..0e82434190d 100644
--- a/recipes_source/recipes_index.rst
+++ b/recipes_source/recipes_index.rst
@@ -256,8 +256,8 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
 .. customcarditem::
-   :header: Leverage Advanced Matrix Extensions
-   :card_description: Learn to leverage Advanced Matrix Extensions.
+   :header: Leverage Intel® Advanced Matrix Extensions
+   :card_description: Learn to leverage Intel® Advanced Matrix Extensions.
    :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
    :link: ../recipes/amx.html
    :tags: Model-Optimization