_posts/2024-07-09-accelerated-pytorch-inference.md (11 additions, 10 deletions)
@@ -130,14 +130,13 @@ On successful completion of the inference runs, the script stores the results in
The Google T5 Small Text Translation model is one of around 30 Hugging Face models we benchmarked. We’re using it as a sample model to demonstrate how to run inference in eager and compile modes. The additional configurations and APIs required to run it in compile mode are highlighted in **BOLD**. Save the following script as `google_t5_small_text_translation.py`.
-```
-import argparse
+<pre><code>import argparse
from transformers import T5Tokenizer, T5Model
import torch
from torch.profiler import profile, record_function, ProfilerActivity
...
decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
-if (mode == 'compile'):
-    model = torch.compile(model)
+<b>if (mode == 'compile'):
+    model = torch.compile(model)</b>
with torch.no_grad():
    for _ in range(50):
@@ -184,7 +183,7 @@ def main() -> None:
if __name__ == "__main__":
    main()
-```
+</code></pre>
Run the script with the following steps:
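For reference, the complete script might look roughly like the following sketch, reconstructed from the fragments visible in the hunks above; the tokenizer/model setup, the profiler usage, and the CLI flags (`-m`, `-n`) are assumptions here rather than the post's exact code.

```python
import argparse
from transformers import T5Tokenizer, T5Model
import torch
from torch.profiler import profile, record_function, ProfilerActivity


def test_inference(mode: str, num_iter: int) -> None:
    # Assumed setup: the post benchmarks the Google T5 Small model.
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5Model.from_pretrained("t5-small")

    input_ids = tokenizer(
        "Studies have been shown that owning a dog is good for you", return_tensors="pt"
    ).input_ids  # Batch size 1
    decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

    if (mode == 'compile'):
        model = torch.compile(model)

    # Warm-up iterations (the first compiled call triggers graph compilation).
    with torch.no_grad():
        for _ in range(50):
            model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

    # Measured iterations under the PyTorch profiler.
    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU]) as prof:
        with record_function("model_inference"):
            for _ in range(num_iter):
                model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)

    print(prof.key_averages().table(sort_by="self_cpu_time_total"))


def main() -> None:
    parser = argparse.ArgumentParser(description="Google T5 Small text translation benchmark")
    parser.add_argument("-m", "--mode", choices=["eager", "compile"], default="eager")
    parser.add_argument("-n", "--num-iter", type=int, default=100)
    args = parser.parse_args()
    test_inference(args.mode, args.num_iter)


if __name__ == "__main__":
    main()
```

With flags like these it would be invoked as `python google_t5_small_text_translation.py -m compile` (or `-m eager` for the baseline).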
@@ -285,15 +284,15 @@ This completes the torch inductor changes required to compile the graph into opt
There were mainly three areas where oneDNN ACL primitives lacked support for torch.compile mode. The following sections discuss them in detail.
-**1 ACL primitives didn’t have support for weights in blocked layout**
+**1. ACL primitives didn’t have support for weights in blocked layout**
ACL primitives originally designed for eager mode supported weights only in the standard channels last ([NHWC](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html#nhwc)) format, without any pre-packing. Pre-packing weights into a blocked layout, however, is one of the main optimizations in the inductor compilation passes: the weights are reordered once into blocks specific to the runtime platform, which avoids the redundant on-the-fly reorders when running General Matrix Multiplication (GEMM) that would otherwise bottleneck inference performance. Because the ACL primitives didn’t support the blocked layout, these operators fell back to the oneDNN C++ reference kernels instead.
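To make "pre-packing into a blocked layout" concrete, here is a small, purely illustrative sketch (not oneDNN's or ACL's actual packing routine); the block sizes `bk` and `bn` are hypothetical, and in practice the compiler and oneDNN choose them per platform.

```python
import torch

# A [K, N] fp32 weight is reordered once, ahead of time, into contiguous
# [bk x bn] tiles so the GEMM kernel can stream whole blocks at runtime
# instead of re-laying out the weight on every call.
K, N, bk, bn = 128, 64, 8, 4            # hypothetical block sizes
w = torch.randn(K, N)                   # plain layout, as eager mode keeps it
w_blocked = (
    w.reshape(K // bk, bk, N // bn, bn)
     .permute(0, 2, 1, 3)               # -> [K/bk, N/bn, bk, bn]
     .contiguous()                      # materialize the blocked copy once
)
print(w_blocked.shape)                  # torch.Size([16, 16, 8, 4])
```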
-**2 Mixed precision primitives weren’t supported in oneDNN**
+**2. Mixed precision primitives weren’t supported in oneDNN**
AWS Graviton3 processors support [bfloat16 MMLA instructions](https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/BFMMLA--BFloat16-floating-point-matrix-multiply-accumulate-), which can be used to accelerate fp32 inference by running the GEMMs in bfloat16 as a mixed precision compute. ACL provides bfloat16 mixed precision GEMM kernels, which are integrated into oneDNN as a fast math compute option for the existing fp32 operators. However, the fast math approach didn’t work for compile mode because of the weights pre-packing optimization. Compile mode requires an explicit mixed precision primitive implementation in oneDNN in order to use bfloat16 acceleration.
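As a rough sketch of the fast math path discussed here: on Graviton, bfloat16 fast math for fp32 models is typically switched on through oneDNN's `DNNL_DEFAULT_FPMATH_MODE` environment variable (per the AWS Graviton getting-started guidance); treat the exact variable name and behavior as an assumption to verify against that guide.

```python
import os

# Assumption: DNNL_DEFAULT_FPMATH_MODE=BF16 asks oneDNN to run fp32 GEMMs with
# bfloat16 fast math. It must be set before oneDNN is initialized, so set it
# before importing torch (or export it in the shell before launching Python).
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

import torch

model = torch.nn.Linear(1024, 1024).eval()    # weights stay fp32
compiled = torch.compile(model)
with torch.no_grad():
    out = compiled(torch.randn(8, 1024))      # GEMM may use bf16 MMLA on Graviton3
```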
**3 ACL primitives didn’t support fused kernels for some of the activation functions**
295
+
**3. ACL primitives didn’t support fused kernels for some of the activation functions**
In eager mode, operators are dispatched and executed individually as soon as they are reached while the model runs. In compile mode, by contrast, operator fusion is another important optimization, where operators are fused together for runtime efficiency. For example, Gaussian Error Linear Unit ([GELU](https://arxiv.org/pdf/1606.08415.pdf#%3A~%3Atext%3DWe%20propose%20the%20Gaussian%20Error%2Cstandard%20Gaussian%20cumulative%20distribution%20function)) is one of the most widely used activation functions in transformer-based neural network architectures, so it’s typical to have a linear layer (with matrix multiplications) followed by a GELU activation. As part of compiling the model into efficient operators, the torch inductor fuses matmul and GELU into a single linearpointwise+gelu operator. However, oneDNN ACL primitives didn’t support fused kernels with GELU.
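A minimal sketch of the linear-plus-GELU pattern described above; whether the fused kernel actually comes from ACL depends on the oneDNN/ACL versions and the platform, so treat the fusion outcome as something to verify (for example by inspecting the generated code with `TORCH_COMPILE_DEBUG=1`).

```python
import torch
import torch.nn as nn

# Linear layer followed by GELU: in eager mode the matmul and the activation are
# dispatched as separate ops; under torch.compile the inductor can fuse them into
# a single linear-pointwise (matmul + gelu) operator.
class LinearGelu(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(512, 2048)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.fc(x))

model = LinearGelu().eval()
compiled = torch.compile(model)
with torch.no_grad():
    y = compiled(torch.randn(4, 512))   # first call triggers compilation (and fusion)
```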
@@ -431,6 +430,8 @@ In this tutorial, we covered how we optimized torch.compile performance on AWS G
We would like to thank the PyTorch community for the baseline torch.compile framework and their continued efforts to optimize it further.