add tutorial for PyTorch inference on AWS Graviton CPUs #2719
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2719
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 16d0f8f with merge base bcaa9f6.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thank you so much for this submission! I have a few editorial suggestions - please let me know if you have any questions.
@@ -0,0 +1,343 @@
PyTorch inference performance tuning on AWS Graviton Processors
PyTorch Inference Performance Tuning on AWS Graviton Processors
**Author**: `Sunita Nadampalli <https://github.com/snadampal>`_

`AWS Graviton <https://aws.amazon.com/ec2/graviton/>`_ is a series of ARM-based processors designed by AWS. AWS Graviton3 processors are optimized for ML workloads, including support for bfloat16, SVE and twice the Single Instruction Multiple Data (SIMD) bandwidth compared to Graviton2.
`AWS Graviton <https://aws.amazon.com/ec2/graviton/>`_ is a series of ARM-based processors designed by AWS. AWS Graviton3 processors are optimized for Machine Learning (ML) workloads, including support for ``bfloat16``, Scalable Vector Extension (SVE), and twice the Single Instruction Multiple Data (SIMD) bandwidth compared to Graviton2.
In this tutorial we will cover how to achieve the best inference performance for linear layer neural network with bfloa16 kernels and with the right backend on AWS Graviton3 processors (`AWS c7g instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_).
In this tutorial we will cover how to achieve the best inference performance for linear layer neural network with ``bfloat16`` kernels and with the right backend on AWS Graviton3 processors (`AWS c7g instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_).
4. Optimize memory allocation overhead with Linux Transparent huge pages
5. Conclusion

NOTE: An instance from Graviton3 family (``c7g/r7g/m7g``) is required for this tutorial in order to reproduce the speedup numbers shown below and documented elsewhere. We have used `c7g.xl (4vcpu) instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_ for this tutorial.
.. note::
   To successfully run this tutorial and reproduce the speedup numbers shown below, you need an instance from the Graviton3 family (``c7g/r7g/m7g``) of hardware. For this tutorial, we used the `c7g.xl (4vcpu) instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_.
1. Basic Usage
---------------

PyTorch natively supports AWS Graviton3 optimizations starting PyTorch 2.0 version. Please refer to this `blog <https://pytorch.org/blog/optimized-pytorch-w-graviton/>`_ for more details on the optimizations.
PyTorch natively supports AWS Graviton3 optimizations starting with PyTorch 2.0 version.
Please refer to this `blog <https://pytorch.org/blog/optimized-pytorch-w-graviton/>`_ for more details on the optimizations.
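As a quick, minimal sketch of the basic usage (the layer sizes and tensor shapes here are illustrative assumptions, not the tutorial's exact code), a linear-layer inference run on the Graviton CPU looks like this:

.. code-block:: python

   import torch
   from torch import nn

   print(torch.__version__)  # Graviton optimizations ship with the default wheels from 2.0 onward

   # A small linear layer as a smoke test; sizes are illustrative.
   layer = nn.Linear(64, 32).eval()
   with torch.inference_mode():
       out = layer(torch.randn(8, 64))
   print(out.shape)  # torch.Size([8, 32])

On a Graviton3 instance with PyTorch 2.0 or later, this path should pick up the Graviton-optimized kernels without any code changes.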
4. Optimize memory allocation overhead with Linux Transparent huge pages
-------------------------------------------------------------------------

We also observed that for these larger networks, tensor memory allocations take significant portion of the inference latency. This can be optimized by enabling Linux transparent huge pages allocations from PyTorch C10 memory allocator. Set the following environment variable to enable it.
We also observed that for these larger networks, tensor memory allocations take a significant portion of the inference latency. This can be optimized by enabling Linux transparent huge pages allocations from the PyTorch C10 memory allocator. Set the following environment variable to enable it:
``$ export THP_MEM_ALLOC_ENABLE=1``

For the batch dimension of 256 and with fast math mode
For the batch dimension of 256 and with fast math mode:
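The tutorial's benchmarking loop is not shown in this excerpt; as a minimal sketch (the stand-in model, input shape, and iteration count are assumptions), ``prof`` could be produced with ``torch.profiler`` along these lines:

.. code-block:: python

   import torch
   from torch import nn
   from torch.profiler import ProfilerActivity, profile, record_function

   # Stand-in linear-layer network; the tutorial's actual model differs.
   model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
   x = torch.randn(256, 256)  # batch dimension of 256

   with torch.inference_mode():
       with profile(activities=[ProfilerActivity.CPU]) as prof:
           with record_function("mymodel_inference"):
               for _ in range(100):
                   model(x)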
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

Following is the profiler output with THP memory allocations enabled
The following is the profiler output with THP memory allocations enabled:
aten::relu 0.04% 2.547ms 4.85% 325.115ms 1.626ms 200
====================== ============ ============ ============ ============ ============== ============

``Self CPU time total: 6.697s``
**Self CPU time total:** 6.697s
This is an additional ``1.08x or 8% (6.697s vs 7.262s)`` improvement on top of the already optimized fast math mode measured above.
This is an additional **1.08x or 8% (6.697s vs 7.262s)** improvement on top of the already optimized fast math mode measured above.
Also, can you please add to index.rst
d8c1437 to 363ed4a
Hi @svekars, thanks for the review, I have incorporated all your feedback on the doc.
Looks like I need to create these two, right? Looking into them.
@snadampal I'm thinking this probably fits better under recipes. Also, please fix the spellcheck. Words like OpenBLAS, Graviton, MKLDNN can be added to the en-wordlist.txt. Words like ...
logits = self.linear_relu_stack(x)
return logits

4. Let's create an instance of MyNeuralNetwork, and move it to the device:
4. Let's create an instance of ``MyNeuralNetwork``, and move it to the device:
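For context, that step might look like the following minimal sketch; the class body is an assumption reconstructed from the ``linear_relu_stack`` attribute quoted above, and the layer sizes are illustrative rather than the tutorial's exact values:

.. code-block:: python

   import torch
   from torch import nn

   device = "cpu"  # inference runs on the Graviton CPU

   class MyNeuralNetwork(nn.Module):
       def __init__(self):
           super().__init__()
           self.flatten = nn.Flatten()
           self.linear_relu_stack = nn.Sequential(
               nn.Linear(28 * 28, 512),
               nn.ReLU(),
               nn.Linear(512, 512),
               nn.ReLU(),
               nn.Linear(512, 10),
           )

       def forward(self, x):
           x = self.flatten(x)
           logits = self.linear_relu_stack(x)
           return logits

   model = MyNeuralNetwork().to(device)
   print(model)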
Hi @svekars, sure, I will fix the acronyms part. Can you please elaborate on what needs to be added to recipes_index.rst?
I'm planning to add this customcard item under the performance section in recipes_index.rst, but I'm wondering whether I need to provide the .html file or it gets built in the repo. Please clarify.
e408696 to a4a8a38
Hi @svekars, I have updated the PR for the recipes_index.rst and the en-wordlist update.
Speed up Inference with ``bfloat16`` Fast Math Kernels
----------------------------------------------------------

AWS Graviton3 processors support `bfloat16 MMLA instructions <https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/BFMMLA--BFloat16-floating-point-matrix-multiply-accumulate->`_. Arm Compute Library (`ACL <https://github.com/ARM-software/ComputeLibrary>`_) provides optimized ``bfloat16`` General Matrix Multiplication (GEMM) kernels for AWS Graviton processors, which are integrated into PyTorch via the MKLDNN backend starting with PyTorch 2.0. The inference performance can be optimized with the fast math GEMM kernels. To enable the fast math GEMM kernels, set the following environment variable:
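The excerpt is cut off before the variable itself. On the Graviton-optimized oneDNN/ACL builds this is the oneDNN floating-point math mode; treat the exact name below as an assumption if your build differs:

.. code-block:: bash

   # Enable the bfloat16 fast math GEMM kernels (oneDNN fp-math mode)
   $ export DNNL_DEFAULT_FPMATH_MODE=BF16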
Please add a line here for why this is not enabled by default
.. code-block:: bash

   $ export TORCH_MKLDNN_MATMUL_MIN_DIM=64
Please add a line or two for why this is not enabled by default
.. code-block:: bash

   $ export THP_MEM_ALLOC_ENABLE=1
Please add a line for why this is not enabled by default
This is an additional **1.08x or 8% (6.697s vs 7.262s)** improvement on top of the already optimized MKLDNN fast math mode measured above.

Conclusion
Please mention when each of the 3 optimizations should be used
a4a8a38 to 715cc6e
Addressed the feedback from @agunapal. Please let me know if this looks good now.
715cc6e to 899b9c4
Thanks for addressing the changes. LGTM
====================== ============ =========== ============= =========== ============ ============
Name                   Self CPU %   Self CPU    CPU total %   CPU total   CPU time avg # of Calls
====================== ============ =========== ============= =========== ============ ============
aten::addmm            97.61%       15.813s     98.61%        15.977s     53.255ms     300
aten::clamp_min        1.09%        177.032ms   1.09%         177.032ms   885.160us    200
aten::copy_            1.00%        162.054ms   1.00%         162.054ms   540.180us    300
mymodel_inference      0.22%        35.738ms    100.00%       16.201s     16.201s      1
aten::linear           0.02%        2.955ms     98.66%        15.985s     53.282ms     300
aten::t                0.01%        2.421ms     0.03%         5.043ms     16.810us     300
aten::relu             0.01%        2.356ms     1.11%         179.388ms   896.940us    200
====================== ============ =========== ============= =========== ============ ============
Can you indent lines 117-127 under the table directive for correct recognition? More info here.
Same comment for all table directives in the PR.
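For illustration, an indented table directive would look roughly like this (the caption and the abbreviated columns are only for demonstration):

.. code-block:: rst

   .. table:: output with the fast math mode

      =================  ==========  =========
      Name               Self CPU %  Self CPU
      =================  ==========  =========
      aten::addmm        97.61%      15.813s
      aten::relu         0.01%       2.356ms
      =================  ==========  =========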
The following is the profiler output with THP memory allocations enabled:

.. table:: output with the fast math and thp memory allocations
.. table:: output with the fast math and THP memory allocations
en-wordlist.txt (Outdated)

fastmath
latencies
openBLAS
thp
thp
en-wordlist.txt (Outdated)

addmm
aten
I think these two should get resolved after the indentation is fixed. Also, can you please sort alphabetically and then save? In vim, you should be able to just use ``:sort``.

addmm
aten
6f23e62 to aa79f4b
@svekars, even the indentation didn't fix the ...
@snadampal I think this is now in pretty good shape - can you resolve the merge conflict?
aa79f4b to 54d64b3
@svekars, the PR is rebased, could you please check and merge if it looks good.
@pytorchbot merge
54d64b3 to 7081f4b
I have rebased to the main branch.
Thanks @snadampal - this looks good to me - we will merge a few days before the release.
@snadampal Thanks for contributing the Graviton tutorial. It would be good to mention that it works with PT 2.0 and higher. Have you also run any tests with torch.compile? It would be good to include an update as a follow-up PR showcasing speedups with torch.compile and Graviton.
Hi @chauhang, in the tutorial I mentioned the following. Do you suggest adding PyTorch 2.0+ in the title itself?
Regarding torch.compile(), yes, that's the next thing I'm working on :) Currently the PyTorch changes are merged, but the oneDNN changes are still under review. Once they are merged, I will raise a PR for a torch.compile() tutorial.
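For readers curious about that follow-up, compiling the model is a one-line change; this is only a minimal sketch (``model`` and the input shape are assumptions carried over from the earlier snippets), not a measured Graviton result:

.. code-block:: python

   import torch

   compiled_model = torch.compile(model)  # assumes `model` is the eager-mode network above

   x = torch.randn(32, 28 * 28)
   with torch.inference_mode():
       out = compiled_model(x)  # the first call triggers compilation; later calls reuse it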
* Add recipe for PyTorch inference on AWS Graviton CPUs --------- Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Fixes #ISSUE_NUMBER
Description
This is a new tutorial for AWS Graviton CPU inference
Checklist