Leverage Advanced Matrix Extensions
===================================

Introduction
------------

Advanced Matrix Extensions (AMX), also known as Intel Advanced Matrix Extensions (Intel AMX), is an extension to the x86 instruction set architecture (ISA).
AMX is designed to improve the performance of deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems, and image recognition.
AMX supports two data types, INT8 and BFloat16; compared with AVX-512 FP32, they can achieve up to 32x and 16x acceleration, respectively.
For more detailed information on AMX, see the `AMX overview <https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html>`_ and the `AMX AI solution brief <https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html>`_.

Note: AMX will have FP16 support on the next generation of Xeon.

AMX in PyTorch
--------------

PyTorch leverages AMX for compute-intensive operators with BFloat16 and for quantization with INT8 through its backend oneDNN,
to get higher performance out of the box on x86 CPUs with AMX support.
The operation is fully handled by oneDNN according to the execution code path generated, i.e., when a supported operation is executed by the oneDNN implementation on a hardware platform with AMX support, the AMX instructions are invoked automatically inside oneDNN.
No manual operations are required to enable this feature.
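
As an illustration, a plain BFloat16 matrix multiplication (``matmul`` is one of the BF16 ops listed below) can already benefit with no extra code. This is a minimal sketch; the tensor sizes are arbitrary:

::

  import torch

  # BFloat16 matmul on CPU; on an AMX-capable platform the underlying
  # oneDNN kernel uses AMX tiles automatically.
  a = torch.randn(1024, 1024, dtype=torch.bfloat16)
  b = torch.randn(1024, 1024, dtype=torch.bfloat16)
  c = a @ b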

BF16 CPU ops that can leverage AMX:
"""""""""""""""""""""""""""""""""""

``conv1d``,
``conv2d``,
``conv3d``,
``conv_transpose1d``,
``conv_transpose2d``,
``conv_transpose3d``,
``bmm``,
``mm``,
``baddbmm``,
``addmm``,
``addbmm``,
``linear``,
``matmul``,
``_convolution``

Quantization CPU ops that can leverage AMX:
"""""""""""""""""""""""""""""""""""""""""""

``conv1d``,
``conv2d``,
``conv3d``,
``conv_transpose1d``,
``conv_transpose2d``,
``conv_transpose3d``,
``linear``

Preliminary requirements to activate AMX support for PyTorch:
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

All of the following instruction sets must be supported by the hardware platform:
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

+---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
| AVX512F | AVX512BW | AVX512VL | AVX512DQ | AVX512_VNNI | AVX512_BF16 | AMX_TILE | AMX_INT8 | AMX_BF16 |
+---------+----------+----------+----------+-------------+-------------+----------+----------+----------+

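If you are unsure whether your machine qualifies, a quick Linux-only check is to look for the corresponding CPU feature flags in ``/proc/cpuinfo``. This snippet is an illustrative sketch, not part of the original recipe; the flag names follow the kernel's naming:

::

  # Check /proc/cpuinfo for the AMX-related CPU feature flags (Linux only).
  required = {
      "avx512f", "avx512bw", "avx512vl", "avx512dq",
      "avx512_vnni", "avx512_bf16",
      "amx_tile", "amx_int8", "amx_bf16",
  }
  with open("/proc/cpuinfo") as f:
      flags = set(f.read().split())
  missing = required - flags
  print("AMX supported" if not missing else f"Missing CPU flags: {sorted(missing)}")
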
Software
''''''''

For Linux:

+----------------------+---------------+
| Linux kernel >= 5.16 | Python >= 3.8 |
+----------------------+---------------+


Guidelines for leveraging AMX with workloads
--------------------------------------------

For the BFloat16 data type:
"""""""""""""""""""""""""""

Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` will utilize AMX acceleration for the supported operators, as in the snippet below.

::

  model = model.to(memory_format=torch.channels_last)
  with torch.cpu.amp.autocast():
      output = model(input)

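A more complete, runnable sketch follows; the toy convolutional model, input shape, and inference-only setting are illustrative assumptions, not part of the original recipe:

::

  import torch
  import torch.nn as nn

  # Toy model standing in for a real workload.
  model = nn.Sequential(
      nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
      nn.ReLU(),
      nn.AdaptiveAvgPool2d(1),
      nn.Flatten(),
      nn.Linear(64, 10),
  ).eval()

  # Channels-last memory format generally performs better for convolutions.
  model = model.to(memory_format=torch.channels_last)
  x = torch.randn(8, 3, 224, 224).to(memory_format=torch.channels_last)

  # Autocast runs supported ops in BFloat16; on CPUs with AMX_BF16 the
  # underlying oneDNN kernels use AMX tiles automatically.
  with torch.no_grad(), torch.cpu.amp.autocast():
      out = model(x)
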

For quantization:
"""""""""""""""""

Applying quantization would utilize AMX acceleration. A minimal sketch is shown below.

Note: Use the channels-last memory format to get better performance.
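
Here is a minimal dynamic-quantization sketch; the toy model, shapes, and the choice of the ``onednn`` quantized engine are assumptions for illustration, and any INT8 path that reaches oneDNN on AMX-capable hardware is accelerated the same way:

::

  import torch
  import torch.nn as nn

  # Toy model standing in for a real workload.
  model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

  # Select the oneDNN quantized engine so INT8 kernels can dispatch to
  # AMX_INT8 where the hardware supports it (requires a PyTorch build
  # with the onednn quantized engine available).
  torch.backends.quantized.engine = "onednn"

  # Dynamic quantization converts the Linear weights to INT8.
  quantized_model = torch.ao.quantization.quantize_dynamic(
      model, {nn.Linear}, dtype=torch.qint8
  )

  x = torch.randn(32, 64)
  with torch.no_grad():
      out = quantized_model(x)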

For torch.compile:
""""""""""""""""""

When the compiled graph runs the supported operators listed above through their oneDNN implementations, AMX acceleration will be activated.
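
A minimal sketch; the toy model and the use of BFloat16 autocast around the compiled model are illustrative assumptions:

::

  import torch
  import torch.nn as nn

  model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

  # torch.compile generates an optimized graph; supported ops still execute
  # through oneDNN, so AMX is picked up the same way as in eager mode.
  compiled_model = torch.compile(model)

  x = torch.randn(8, 256)
  with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
      out = compiled_model(x)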


Confirm AMX is being utilized
"""""""""""""""""""""""""""""

Set the environment variable ``ONEDNN_VERBOSE=1`` to get oneDNN verbose output at runtime.
For more detailed information on oneDNN, see `the oneDNN documentation <https://oneapi-src.github.io/oneDNN/index.html>`_.

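Alternatively, verbose mode can be toggled from Python with the ``torch.backends.mkldnn.verbose`` context manager; this is a sketch, and the tiny model and input are placeholders:

::

  import torch
  import torch.nn as nn

  model = nn.Linear(64, 64).eval()
  x = torch.randn(1, 64)

  # VERBOSE_ON prints one line per executed oneDNN primitive, including the
  # implementation name; look for "amx" in that field.
  with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
      with torch.no_grad(), torch.cpu.amp.autocast():
          model(x)
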
For example, the oneDNN verbose output looks like this:

::

  onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
  onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
  onednn_verbose,info,gpu,runtime:none
  onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
  onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
  ...
  onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
  ...
  onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
  ...

If the verbose output shows ``avx512_core_amx_bf16`` for BFloat16 or ``avx512_core_amx_int8`` for quantization with INT8, it indicates that AMX is activated.