Commit 7945bdb

Update AMX document
1 parent bbbb580 commit 7945bdb


recipes_source/amx.rst

Lines changed: 43 additions & 62 deletions
@@ -1,26 +1,27 @@
+==============================================
 Leverage Advanced Matrix Extensions
 ==============================================

 Introduction
---------------
+============

-Advanced Matrix Extensions (AMX), also known as Intel Advanced Matrix Extensions (Intel AMX), is an extensions to the x86 instruction set architecture (ISA).
-AMX is designed to improve performance of deep-learning training and inference on the CPU and is ideal for workloads like natural-language processing, recommendation systems and image recognition.
-AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively.
-For more detailed information of AMX, see `here <https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html>`_ and `here <https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html>`_.
+Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
+Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation; see `Accelerate AI Workloads with Intel® AMX`_.
+AMX supports two data types, INT8 and BFloat16; compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively (see figure 6 of `Accelerate AI Workloads with Intel® AMX`_).
+For more detailed information about AMX, see `Intel® AMX Overview`_.

-Note: AMX will have FP16 support on the next generation of Xeon.

 AMX in PyTorch
---------------
+==============

 PyTorch leverages AMX for compute-intensive operators with BFloat16 and quantization with INT8 via its backend oneDNN
 to get higher performance out of the box on x86 CPUs with AMX support.
+For more detailed information about oneDNN, see `oneDNN`_.
+
 The operation is fully handled by oneDNN according to the execution code path generated. That is, when a supported operation is executed by the oneDNN implementation on a hardware platform with AMX support, AMX instructions are invoked automatically inside oneDNN.
 No manual operations are required to enable this feature.
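Whether the current machine actually exposes AMX can be checked up front; a minimal Linux-only sketch (the ``amx_tile``/``amx_bf16``/``amx_int8`` flag names come from ``/proc/cpuinfo`` and require kernel 5.16 or newer):

::

  # Linux-only: the kernel reports AMX as CPU flags in /proc/cpuinfo (kernel >= 5.16)
  with open("/proc/cpuinfo") as f:
      flags = f.read()

  print({name: name in flags for name in ("amx_tile", "amx_bf16", "amx_int8")})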

-BF16 CPU ops that can leverage AMX:
-"""""""""""""""""""""""""""""""""""
+- BF16 CPU ops that can leverage AMX:

 ``conv1d``,
 ``conv2d``,
@@ -37,8 +38,7 @@ BF16 CPU ops that can leverage AMX:
 ``matmul``,
 ``_convolution``

-Quantization CPU ops that can leverage AMX:
-"""""""""""""""""""""""""""""""""""
+- Quantization CPU ops that can leverage AMX:

 ``conv1d``,
 ``conv2d``,
@@ -51,77 +51,58 @@ Quantization CPU ops that can leverage AMX:
 ``conv_transpose3d``,
 ``linear``

-Preliminary requirements to activate AMX support for PyTorch:
-'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-All of the following Instruction sets onboard the hardware platform
-"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-+---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
-| AVX512F | AVX512BW | AVX512VL | AVX512DQ | AVX512_VNNI | AVX512_BF16 | AMX_TILE | AMX_INT8 | AMX_BF16 |
-+---------+----------+----------+----------+-------------+-------------+----------+----------+----------+
-
-Software
-""""""""
-
-For linux:
-
-+----------------------+---------------+
-| linux kernel >= 5.16 | Python >= 3.8 |
-+----------------------+---------------+
-

 Guidelines of leveraging AMX with workloads
-------------------------------------------
+--------------------------------------------------

-For BFloat16 data type:
-''''''''''''''''''''
+- BFloat16 data type:

-Using `torch.cpu.amp` or `torch.autocast("cpu")` would utilize AMX acceleration.
+Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration.

+::

-```
-model = model.to(memory_format=torch.channels_last)
-with torch.cpu.amp.autocast():
-output = model(input)
-```
+  model = model.to(memory_format=torch.channels_last)
+  with torch.cpu.amp.autocast():
+      output = model(input)

+Note: Use channels last format to get better performance.
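For reference, a self-contained version of the snippet above; the ``torchvision`` ResNet-50 is only an assumed stand-in for any model built from the BF16 ops listed earlier:

::

  import torch
  import torchvision.models as models  # assumption: any conv/linear model works; resnet50 is just an example

  model = models.resnet50().eval().to(memory_format=torch.channels_last)
  x = torch.randn(1, 3, 224, 224).contiguous(memory_format=torch.channels_last)

  # torch.cpu.amp.autocast() defaults to BFloat16 on CPU
  with torch.no_grad(), torch.cpu.amp.autocast():
      y = model(x)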

-For quantization:
-'''''''''''''''''
+- quantization:

 Applying quantization would utilize AMX acceleration.
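As an illustration, a minimal eager-mode post-training static quantization sketch; the tiny module and calibration tensor are placeholders, and the ``"x86"`` qconfig assumes a recent PyTorch release (older releases use ``"fbgemm"``):

::

  import torch
  from torch.ao import quantization as tq

  class TinyModel(torch.nn.Module):
      def __init__(self):
          super().__init__()
          self.quant = tq.QuantStub()      # float -> int8 at the input boundary
          self.conv = torch.nn.Conv2d(3, 16, 3)
          self.dequant = tq.DeQuantStub()  # int8 -> float at the output boundary

      def forward(self, x):
          return self.dequant(self.conv(self.quant(x)))

  m = TinyModel().eval()
  m.qconfig = tq.get_default_qconfig("x86")  # "fbgemm" on older PyTorch releases
  prepared = tq.prepare(m)                   # insert observers
  prepared(torch.randn(8, 3, 224, 224))      # calibrate with representative data
  int8_model = tq.convert(prepared)          # swap in INT8 quantized modules

  with torch.no_grad():
      out = int8_model(torch.randn(1, 3, 224, 224))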

-Note: Use channels last format to get better performance.
-
-For torch.compile:
-'''''''''''''''''
+- torch.compile:

 When the generated graph hits oneDNN implementations of the supported operators listed above, AMX acceleration will be activated.
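A minimal sketch of that path, assuming a toy convolutional model; ``torch.compile`` plus CPU autocast is enough for the compiled graph to reach the oneDNN kernels where AMX can be used:

::

  import torch

  model = torch.nn.Sequential(
      torch.nn.Conv2d(3, 64, kernel_size=3, padding=1),
      torch.nn.ReLU(),
  ).eval().to(memory_format=torch.channels_last)

  compiled = torch.compile(model)

  x = torch.randn(1, 3, 224, 224).contiguous(memory_format=torch.channels_last)
  with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
      y = compiled(x)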


 Confirm AMX is being utilized
-''''''''''''''''''''''
+------------------------------

 Set the environment variable `export ONEDNN_VERBOSE=1` to get oneDNN verbose output at runtime.
-For more detailed information of oneDNN, see `here <https://oneapi-src.github.io/oneDNN/index.html>`_.

 For example:

 Get oneDNN verbose:

-```
-onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
-onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
-onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
-onednn_verbose,info,gpu,runtime:none
-onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
-onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
-...
-onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
-...
-onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
-...
-```
-
-If we get the verbose of `avx512_core_amx_bf16` for BFloat16 or `avx512_core_amx_int8` for quantization with INT8, it indicates that AMX is activated.
+::
+
+  onednn_verbose,info,oneDNN v2.7.3 (commit 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
+  onednn_verbose,info,cpu,runtime:OpenMP,nthr:128
+  onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
+  onednn_verbose,info,gpu,runtime:none
+  onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
+  onednn_verbose,exec,cpu,reorder,simple:any,undef,src_f32::blocked:a:f0 dst_f32::blocked:a:f0,attr-scratchpad:user ,,2,5.2561
+  ...
+  onednn_verbose,exec,cpu,convolution,jit:avx512_core_amx_bf16,forward_training,src_bf16::blocked:acdb:f0 wei_bf16:p:blocked:ABcd16b16a2b:f0 bia_f32::blocked:a:f0 dst_bf16::blocked:acdb:f0,attr-scratchpad:user ,alg:convolution_direct,mb7_ic2oc1_ih224oh111kh3sh2dh1ph1_iw224ow111kw3sw2dw1pw1,0.628906
+  ...
+  onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_int8,undef,src_s8::blocked:ab:f0 wei_s8:p:blocked:BA16a64b4a:f0 dst_s8::blocked:ab:f0,attr-scratchpad:user ,,1x30522:30522x768:1x768,7.66382
+  ...
+
+If the verbose log contains ``avx512_core_amx_bf16`` for BFloat16 or ``avx512_core_amx_int8`` for quantization with INT8, it indicates that AMX is activated.
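For instance, a small script that should print such verbose lines when run on AMX-capable hardware; the toy convolution is only an assumed example, and the variable is set before ``torch`` is imported so oneDNN picks it up:

::

  import os
  os.environ.setdefault("ONEDNN_VERBOSE", "1")  # set before importing torch / running any oneDNN kernel

  import torch

  model = torch.nn.Conv2d(3, 64, 3).eval().to(memory_format=torch.channels_last)
  x = torch.randn(1, 3, 224, 224).contiguous(memory_format=torch.channels_last)

  # BFloat16 convolution is dispatched to oneDNN, which prints one verbose line per primitive
  with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
      model(x)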
+
+.. _Accelerate AI Workloads with Intel® AMX: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/ai-solution-brief.html
+
+.. _Intel® AMX Overview: https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html
+
+.. _oneDNN: https://oneapi-src.github.io/oneDNN/index.html
