
Commit ed7b2df

Tweaks to blog post
Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
1 parent 0847db4

2 files changed (+12, -11 lines)

_posts/2024-07-09-accelerated-pytorch-inference.md

Lines changed: 11 additions & 10 deletions
@@ -130,14 +130,13 @@ On successful completion of the inference runs, the script stores the results in
 
 Google T5 Small Text Translation model is one of the around 30 Hugging Face models we benchmarked. We’re using it as a sample model to demonstrate how to run inference in eager and compile modes. The additional configurations and APIs required to run it in compile mode are highlighted in **BOLD**. Save the following script as `google_t5_small_text_translation.py`.
 
-```
-import argparse
+<pre><code>import argparse
 from transformers import T5Tokenizer, T5Model
 import torch
 from torch.profiler import profile, record_function, ProfilerActivity
-import torch._inductor.config as config
+<b>import torch._inductor.config as config
 config.cpp.weight_prepack=True
-config.freezing=True
+config.freezing=True</b>
 
 def test_inference(mode, num_iter):
     tokenizer = T5Tokenizer.from_pretrained("t5-small")
@@ -148,8 +147,8 @@ def test_inference(mode, num_iter):
     ).input_ids # Batch size 1
     decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids # Batch size 1
 
-    if (mode == 'compile'):
-        model = torch.compile(model)
+    <b>if (mode == 'compile'):
+        model = torch.compile(model)</b>
 
     with torch.no_grad():
         for _ in range(50):
@@ -184,7 +183,7 @@ def main() -> None:
 
 if __name__ == "__main__":
     main()
-```
+</code></pre>
 
 Run the script with the following steps:
 
@@ -285,15 +284,15 @@ This completes the torch inductor changes required to compile the graph into opt
 
 There were mainly three areas where oneDNN ACL primitives lack support for torch.compile mode. The following section talks about them in detail.
 
-**1 ACL primitives didn’t have support for weights in blocked layout**
+**1. ACL primitives didn’t have support for weights in blocked layout**
 
 ACL primitives originally designed for eager mode supported weights only in the standard channels last ([NHWC](https://oneapi-src.github.io/oneDNN/dev_guide_understanding_memory_formats.html#nhwc)) format, without any pre-packing. Whereas weights pre-packing into blocked layout is one of the main optimizations in the inductor compilation passes where the weights are reordered into blocks specific to the runtime platform. This avoids the redundant and on-the-fly reorders when running the General Matrix Multiplication (GEMM), which otherwise would be the bottleneck for inference performance. But the ACL primitives didn’t have support for blocked layout and hence the operators were run with oneDNN C++ reference kernels instead.
 
-**2 Mixed precision primitives weren’t supported in oneDNN**
+**2. Mixed precision primitives weren’t supported in oneDNN**
 
 AWS Graviton3 processors support [bfloat16 MMLA instructions](https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/BFMMLA--BFloat16-floating-point-matrix-multiply-accumulate-) which can be used to accelerate fp32 inference with bfloat16 GEMM as a mixed precision compute. ACL supports bfloat16 mixed precision GEMM kernels, and are integrated into oneDNN as a fast math compute option for the existing fp32 operators. However, the fast math approach didn’t work for compile mode because of weights pre-packing optimization. The compile mode requires explicit mixed precision primitive implementation in oneDNN in order to use bfloat16 acceleration.
 
-**3 ACL primitives didn’t support fused kernels for some of the activation functions**
+**3. ACL primitives didn’t support fused kernels for some of the activation functions**
 
 In eager mode, operators are dispatched individually because the model is run independently as soon as it’s reached. Whereas in compile mode, operator fusion is another important optimization where the operators are fused for runtime efficiency. For example, Gaussian Error Linear Unit ([GELU](https://arxiv.org/pdf/1606.08415.pdf#%3A~%3Atext%3DWe%20propose%20the%20Gaussian%20Error%2Cstandard%20Gaussian%20cumulative%20distribution%20function)) is one of the most widely used activation functions in transformers-based neural network architectures. So, it’s typical to have a linear layer (with matrix multiplications) followed by GELU activation. As part of compiling the model into efficient operators, the torch inductor fuses matmul and GELU into a single linearpointwise+gelu operator. However, oneDNN ACL primitives didn’t have the support for fused kernels with GELU.
 
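To make the fusion target in the last gap concrete, here is a toy linear → GELU module (not taken from the post) compiled with the same inductor settings the blog script enables; whether the single fused linear-pointwise+gelu kernel is actually selected at runtime depends on the oneDNN/ACL support described above, so treat this as a minimal sketch rather than a guarantee of the fused path.

```python
import torch
import torch.nn as nn
import torch._inductor.config as config

# Same knobs as in the blog script: freeze parameters as compile-time
# constants so the linear weights can be pre-packed into a blocked layout.
config.freezing = True
config.cpp.weight_prepack = True


class LinearGelu(nn.Module):
    """Linear layer followed by GELU, the pattern inductor tries to fuse."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.fc(x))


model = LinearGelu().eval()
compiled = torch.compile(model)

with torch.no_grad():
    y = compiled(torch.randn(8, 128, 768))
print(y.shape)  # torch.Size([8, 128, 768])
```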

@@ -431,6 +430,8 @@ In this tutorial, we covered how we optimized torch.compile performance on AWS G
 
 We would like to thank the PyTorch community for the baseline torch.compile framework and their continued efforts to optimize it further.
 
+References: [https://pytorch.org/assets/pytorch2-2.pdf](https://pytorch.org/assets/pytorch2-2.pdf)
+
 
 ## Author
 
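Read together, the bolded additions above boil down to a handful of lines around an otherwise standard Hugging Face inference loop. The following is a condensed sketch of the compile path only, assuming the stock t5-small checkpoint and the standard T5 example prompts from the Hugging Face docs (the post's full script additionally parses a mode argument and collects profiler traces):

```python
import torch
import torch._inductor.config as config
from transformers import T5Tokenizer, T5Model

# Inductor settings highlighted in the diff: freeze parameters as
# compile-time constants and pre-pack weights into a blocked layout.
config.freezing = True
config.cpp.weight_prepack = True

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5Model.from_pretrained("t5-small")
model = torch.compile(model)  # omit this line for eager mode

input_ids = tokenizer(
    "Studies have been shown that owning a dog is good for you",
    return_tensors="pt",
).input_ids  # Batch size 1
decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1

with torch.no_grad():
    out = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
print(out.last_hidden_state.shape)
```

The first compiled call triggers graph capture and code generation, so warm-up iterations (the post's script runs 50 of them) should be excluded when measuring latency.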

_sass/code.scss

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-code, kbd, pre, samp {
+code, kbd, pre, samp, code b {
   @include code_font_family;
   span {
     @include code_font_family;
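One loose end from the mixed-precision gap discussed in the blog-post diff above: on Graviton, the eager-mode bfloat16 fast-math path is normally switched on through oneDNN's default fp-math mode rather than anything in PyTorch itself. The sketch below shows the commonly documented environment-variable toggle; the variable name comes from the oneDNN / AWS Graviton documentation, not from this commit, and whether the same switch is sufficient for the compile-mode path described in the post is an assumption.

```python
import os

# Assumption: oneDNN's fast-math toggle as documented for eager-mode bf16
# on Graviton; it must be set before torch (and thus oneDNN) is loaded.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

import torch

x = torch.randn(256, 256)
w = torch.randn(256, 256)
# Outputs stay fp32; bf16 is only used internally for the GEMM compute.
print((x @ w).dtype)
```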
