Commit 7945bdb

Update AMX document
1 parent bbbb580 commit 7945bdb


recipes_source/amx.rst

Lines changed: 43 additions & 62 deletions
@@ -1,26 +1,27 @@
+==============================================
 Leverage Advanced Matrix Extensions
 ==============================================

 Introduction
---------------
+============

-Advanced Matrix Extensions (AMX), also known as Intel Advanced Matrix Extensions (Intel AMX), is an extensions to the x86 instruction set architecture (ISA).
-AMX is designed to improve performance of deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems and image recognition.
-AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively.
-For more detailed information of AMX, see `here <https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html>`_ and `here <https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html>`_.
+Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
+Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation; see `Accelerate AI Workloads with Intel® AMX`_.
+AMX supports two data types, INT8 and BFloat16; compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively (see figure 6 of `Accelerate AI Workloads with Intel® AMX`_).
+For more detailed information about AMX, see `Intel® AMX Overview`_.

-Note: AMX will have FP16 support on the next generation of Xeon.

 AMX in PyTorch
---------------
+==============

 PyTorch leverages AMX for compute-intensive operators with BFloat16 and quantization with INT8 via its backend oneDNN
 to get higher performance out of the box on x86 CPUs with AMX support.
+For more detailed information about oneDNN, see `oneDNN`_.
+
 The operation is fully handled by oneDNN according to the execution code path generated. That is, when a supported operation is executed by the oneDNN implementation on a hardware platform with AMX support, AMX instructions are invoked automatically inside oneDNN.
 No manual operations are required to enable this feature.
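Whether the current machine actually exposes AMX can be checked up front; a minimal Linux-only sketch (the ``amx_tile``/``amx_bf16``/``amx_int8`` flag names come from ``/proc/cpuinfo`` and require kernel 5.16 or newer):

::

  # Linux-only: the kernel reports AMX as CPU flags in /proc/cpuinfo (kernel >= 5.16)
  with open("/proc/cpuinfo") as f:
      flags = f.read()

  print({name: name in flags for name in ("amx_tile", "amx_bf16", "amx_int8")})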

-BF16 CPU ops that can leverage AMX:
-"""""""""""""""""""""""""""""""""""
+- BF16 CPU ops that can leverage AMX:

 ``conv1d``,
 ``conv2d``,
@@ -37,8 +38,7 @@ BF16 CPU ops that can leverage AMX:
 ``matmul``,
 ``_convolution``

-Quantization CPU ops that can leverage AMX:
-"""""""""""""""""""""""""""""""""""
+- Quantization CPU ops that can leverage AMX:

 ``conv1d``,
 ``conv2d``,
@@ -51,77 +51,58 @@ Quantization CPU ops that can leverage AMX:
 ``conv_transpose3d``,
 ``linear``

-Preliminary requirements to activate AMX support for PyTorch:
-'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-All of the following Instruction sets onboard the hardware platform
-"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-+---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
-| AVX512F | AVX512BW | AVX512VL | AVX512DQ | AVX512_VNNI | AVX512_BF16 | AMX_TILE | AMX_INT8 | AMX_BF16 |
-+---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
-
-Software
-""""""""
-
-For linux:
-
-+----------------------+---------------+
-| linux kernel >= 5.16 | Python >= 3.8 |
-+----------------------+---------------+
-

 Guidelines of leveraging AMX with workloads
-------------------------------------------
+--------------------------------------------------

-For BFloat16 data type:
-''''''''''''''''''''
+- BFloat16 data type:

-Using `torch.cpu.amp` or `torch.autocast("cpu")` would utilize AMX acceleration.
+Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration.

+::

-```
-model = model.to(memory_format=torch.channels_last)
-with torch.cpu.amp.autocast():
-output = model(input)
-```
+  model = model.to(memory_format=torch.channels_last)
+  with torch.cpu.amp.autocast():
+      output = model(input)

+Note: Use channels last format to get better performance.
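For reference, a self-contained version of the snippet above; the ``torchvision`` ResNet-50 is only an assumed stand-in for any model built from the BF16 ops listed earlier:

::

  import torch
  import torchvision.models as models  # assumption: any conv/linear model works; resnet50 is just an example

  model = models.resnet50().eval().to(memory_format=torch.channels_last)
  x = torch.randn(1, 3, 224, 224).contiguous(memory_format=torch.channels_last)

  # torch.cpu.amp.autocast() defaults to BFloat16 on CPU
  with torch.no_grad(), torch.cpu.amp.autocast():
      y = model(x)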

-For quantization:
-'''''''''''''''''
+- quantization:

 Applying quantization would utilize AMX acceleration.
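As an illustration, a minimal eager-mode post-training static quantization sketch; the tiny module and calibration tensor are placeholders, and the ``"x86"`` qconfig assumes a recent PyTorch release (older releases use ``"fbgemm"``):

::

  import torch
  from torch.ao import quantization as tq

  class TinyModel(torch.nn.Module):
      def __init__(self):
          super().__init__()
          self.quant = tq.QuantStub()      # float -> int8 at the input boundary
          self.conv = torch.nn.Conv2d(3, 16, 3)
          self.dequant = tq.DeQuantStub()  # int8 -> float at the output boundary

      def forward(self, x):
          return self.dequant(self.conv(self.quant(x)))

  m = TinyModel().eval()
  m.qconfig = tq.get_default_qconfig("x86")  # "fbgemm" on older PyTorch releases
  prepared = tq.prepare(m)                   # insert observers
  prepared(torch.randn(8, 3, 224, 224))      # calibrate with representative data
  int8_model = tq.convert(prepared)          # swap in INT8 quantized modules

  with torch.no_grad():
      out = int8_model(torch.randn(1, 3, 224, 224))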

-Note: Use channels last format to get better performance.
-
-For torch.compile:
-'''''''''''''''''
+- torch.compile:

 When the generated graph hits oneDNN implementations of the supported operators listed above, AMX acceleration will be activated.
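A minimal sketch of that path, assuming a toy convolutional model; ``torch.compile`` plus CPU autocast is enough for the compiled graph to reach the oneDNN kernels where AMX can be used:

::

  import torch

  model = torch.nn.Sequential(
      torch.nn.Conv2d(3, 64, kernel_size=3, padding=1),
      torch.nn.ReLU(),
  ).eval().to(memory_format=torch.channels_last)

  compiled = torch.compile(model)

  x = torch.randn(1, 3, 224, 224).contiguous(memory_format=torch.channels_last)
  with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
      y = compiled(x)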


 Confirm AMX is being utilized
-''''''''''''''''''''''
+------------------------------

 Set the environment variable `export ONEDNN_VERBOSE=1` to get oneDNN verbose output at runtime.
-For more detailed information of oneDNN, see `here <https://oneapi-src.github.io/oneDNN/index.html>`_.

 For example:

 Get oneDNN verbose:

-```
-onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
-onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
-onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
-onednn_verbose,info,gpu,runtime:none
-onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
-onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
-...
-onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
-...
-onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
-...
-```
-
-If we get the verbose of `avx512_core_amx_bf16` for BFloat16 or `avx512_core_amx_int8` for quantization with INT8, it indicates that AMX is activated.
+::
+
+  onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
+  onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
+  onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
+  onednn_verbose,info,gpu,runtime:none
+  onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
+  onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
+  ...
+  onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
+  ...
+  onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
+  ...
+
+If the verbose log contains ``avx512_core_amx_bf16`` for BFloat16 or ``avx512_core_amx_int8`` for quantization with INT8, it indicates that AMX is activated.
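For instance, a small script that should print such verbose lines when run on AMX-capable hardware; the toy convolution is only an assumed example, and the variable is set before ``torch`` is imported so oneDNN picks it up:

::

  import os
  os.environ.setdefault("ONEDNN_VERBOSE", "1")  # set before importing torch / running any oneDNN kernel

  import torch

  model = torch.nn.Conv2d(3, 64, 3).eval().to(memory_format=torch.channels_last)
  x = torch.randn(1, 3, 224, 224).contiguous(memory_format=torch.channels_last)

  # BFloat16 convolution is dispatched to oneDNN, which prints one verbose line per primitive
  with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
      model(x)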
+
+.. _Accelerate AI Workloads with Intel® AMX: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html
+
+.. _Intel® AMX Overview: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html
+
+.. _oneDNN: https://oneapi-src.github.io/oneDNN/index.html
