Commit d3d0aae

Update AMX document
1 parent 3776101 commit d3d0aae


recipes_source/amx.rst

Lines changed: 40 additions & 29 deletions
@@ -5,9 +5,13 @@ Leverage Advanced Matrix Extensions
 Introduction
 ============

-Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an extension to the x86 instruction set architecture (ISA).
+Advanced Matrix Extensions (AMX), also known as Intel® Advanced Matrix Extensions (Intel® AMX), is an x86 extension
+that introduces two new components: a 2-dimensional register file called 'tiles' and a Tile Matrix Multiplication (TMUL) accelerator that operates on those tiles.
+AMX is designed to work on matrices to accelerate deep-learning training and inference on the CPU, and is ideal for workloads like natural-language processing, recommendation systems, and image recognition.
+
 Intel advances AI capabilities with 4th Gen Intel® Xeon® Scalable processors and Intel® AMX, delivering 3x to 10x higher inference and training performance versus the previous generation, see `Accelerate AI Workloads with Intel® AMX`_.
-AMX supports two data types, INT8 and BFloat16, compared to AVX512 FP32, it can achieve up to 32x and 16x acceleration, respectively, see figure 6 of `Accelerate AI Workloads with Intel® AMX`_.
+Compared to 3rd Gen Intel Xeon Scalable processors running Intel® Advanced Vector Extensions 512 Neural Network Instructions (Intel® AVX-512 VNNI),
+4th Gen Intel Xeon Scalable processors running Intel AMX can perform 2,048 INT8 operations per cycle, rather than 256 INT8 operations per cycle. They can also perform 1,024 BF16 operations per cycle, as compared to 64 FP32 operations per cycle, see page 4 of `Accelerate AI Workloads with Intel® AMX`_.
 For more detailed information of AMX, see `Intel® AMX Overview`_.


@@ -19,7 +23,34 @@ to get higher performance out-of-box on x86 CPUs with AMX support.
 For more detailed information of oneDNN, see `oneDNN`_.

 The operation is fully handled by oneDNN according to the execution code path generated. I.e. when a supported operation gets executed into oneDNN implementation on a hardware platform with AMX support, AMX instructions will be invoked automatically inside oneDNN.
-Since oneDNN is the default acceleration library for CPU, no manual operations are required to enable the AMX support.
+Since oneDNN is the default acceleration library for PyTorch CPU, no manual operations are required to enable the AMX support.
+
+Guidelines of leveraging AMX with workloads
+-------------------------------------------
+
+- BFloat16 data type:
+
+  Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration for supported operators.
+
+  ::
+
+    model = model.to(memory_format=torch.channels_last)
+    with torch.cpu.amp.autocast():
+        output = model(input)
+
+  Note: Use channels last format to get better performance.
+
+- quantization:
+
+  Applying quantization would utilize AMX acceleration for supported operators.
+
+- torch.compile:
+
+  When the generated graph model runs into oneDNN implementations with the supported operators, AMX accelerations will be activated.
+
+
+CPU operators that can leverage AMX:
+------------------------------------

 - BF16 CPU ops that can leverage AMX:

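Putting the added guidelines together, a minimal end-to-end sketch could look like the following. It is an illustration only, not part of this commit: the torchvision ResNet-50 model and the input shape are arbitrary choices, and any model whose supported operators reach oneDNN behaves the same way.

    import torch
    import torchvision.models as models  # torchvision assumed only for a convenient example model

    # Illustrative model and input.
    model = models.resnet50(weights=None).eval()
    x = torch.randn(1, 3, 224, 224)

    # Channels-last memory format, as recommended above, usually performs better on CPU.
    model = model.to(memory_format=torch.channels_last)
    x = x.to(memory_format=torch.channels_last)

    # Optional: compile the model; graph parts dispatched to oneDNN kernels can still use AMX.
    model = torch.compile(model)

    # BFloat16 autocast on CPU routes supported operators to AMX-capable oneDNN kernels.
    with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
        output = model(x)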
@@ -36,13 +67,9 @@ Since oneDNN is the default acceleration library for CPU, no manual operations a
 ``addbmm``,
 ``linear``,
 ``matmul``,
-``_convolution``

 - Quantization CPU ops that can leverage AMX:

-``conv1d``,
-``conv2d``,
-``conv3d``,
 ``conv1d``,
 ``conv2d``,
 ``conv3d``,
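The quantization path listed above can be exercised, for example, with eager-mode dynamic quantization. The sketch below is illustrative only (the toy model and the choice of dynamic INT8 quantization are assumptions, not part of this commit), and, as the note in the next hunk says, whether a quantized linear actually uses AMX depends on the policy of the quantization backend.

    import torch

    # Illustrative toy model; dynamic quantization replaces its Linear modules with INT8 versions.
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 10),
    ).eval()

    qmodel = torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    out = qmodel(torch.randn(32, 64))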
@@ -53,34 +80,18 @@ Since oneDNN is the default acceleration library for CPU, no manual operations a

 Note: For quantized linear, whether to leverage AMX depends on the policy of the quantization backend.

-Guidelines of leveraging AMX with workloads
---------------------------------------------------
-
-- BFloat16 data type:
-
-  Using ``torch.cpu.amp`` or ``torch.autocast("cpu")`` would utilize AMX acceleration.
-
-  ::
-
-    model = model.to(memory_format=torch.channels_last)
-    with torch.cpu.amp.autocast():
-        output = model(input)
-
-  Note: Use channels last format to get better performance.
-
-- quantization:

-  Applying quantization would utilize AMX acceleration.
-
-- torch.compile:
-
-  When the generated graph model runs into oneDNN implementations with the supported operators mentioned in lists above, AMX accelerations will be activated.


 Confirm AMX is being utilized
 ------------------------------

-Set environment variable ``export ONEDNN_VERBOSE=1`` to get oneDNN verbose at runtime.
+Set environment variable ``export ONEDNN_VERBOSE=1``, or use ``torch.backends.mkldnn.verbose`` to flexibly enable oneDNN to dump verbose messages.
+
+::
+
+  with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
+      model(input)

 For example, get oneDNN verbose:
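To check this end to end, a small sketch along the following lines can be used; the ``torch.nn.Linear`` workload and the bfloat16 autocast region are illustrative choices, not part of this commit. On an AMX-capable CPU, the emitted ``onednn_verbose`` lines are expected to name an AMX implementation (for example an ISA string containing ``amx``); otherwise the kernels fall back to AVX-512 or lower.

    import torch

    # Illustrative workload: a single linear layer run under BF16 autocast.
    model = torch.nn.Linear(64, 64).eval()
    x = torch.randn(32, 64)

    # Scope oneDNN verbose output to just this region instead of the whole process.
    with torch.backends.mkldnn.verbose(torch.backends.mkldnn.VERBOSE_ON):
        with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
            model(x)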
