
add tutorial for PyTorch inference on AWS Graviton CPUs #2719


Merged: 4 commits merged into pytorch:main on Jan 24, 2024

Conversation

snadampal (Contributor)

Fixes #ISSUE_NUMBER

Description

This is a new tutorial for AWS Graviton CPU inference

Checklist

  • [NA] The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • [x] Only one issue is addressed in this pull request
  • [NA] Labels from the issue that this PR is fixing are added to this pull request
  • [x] No unnecessary issues are included in this pull request.


pytorch-bot bot commented Dec 20, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2719

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 16d0f8f with merge base bcaa9f6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@svekars (Contributor) left a comment:

Thank you so much for this submission! I have a few editorial suggestions - please let me know if you have any questions.

@@ -0,0 +1,343 @@
PyTorch inference performance tuning on AWS Graviton Processors

Suggested change
PyTorch inference performance tuning on AWS Graviton Processors
PyTorch Inference Performance Tuning on AWS Graviton Processors


**Author**: `Sunita Nadampalli <https://github.com/snadampal>`_

`AWS Graviton <https://aws.amazon.com/ec2/graviton/>`_ is a series of ARM-based processors designed by AWS. AWS Graviton3 processors are optimized for ML workloads, including support for bfloat16, SVE and twice the Single Instruction Multiple Data (SIMD) bandwidth compared to Graviton2.

Suggested change
`AWS Graviton <https://aws.amazon.com/ec2/graviton/>`_ is a series of ARM-based processors designed by AWS. AWS Graviton3 processors are optimized for ML workloads, including support for bfloat16, SVE and twice the Single Instruction Multiple Data (SIMD) bandwidth compared to Graviton2.
`AWS Graviton <https://aws.amazon.com/ec2/graviton/>`_ is a series of ARM-based processors designed by AWS. AWS Graviton3 processors are optimized for Machine Learning (ML) workloads, including support for ``bfloat16``, Scalable Vector Extension (SVE), and twice the Single Instruction Multiple Data (SIMD) bandwidth compared to Graviton2.


`AWS Graviton <https://aws.amazon.com/ec2/graviton/>`_ is a series of ARM-based processors designed by AWS. AWS Graviton3 processors are optimized for ML workloads, including support for bfloat16, SVE and twice the Single Instruction Multiple Data (SIMD) bandwidth compared to Graviton2.

In this tutorial we will cover how to achieve the best inference performance for linear layer neural network with bfloa16 kernels and with the right backend on AWS Graviton3 processors (`AWS c7g instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_).

Suggested change
In this tutorial we will cover how to achieve the best inference performance for linear layer neural network with bfloa16 kernels and with the right backend on AWS Graviton3 processors (`AWS c7g instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_).
In this tutorial we will cover how to achieve the best inference performance for linear layer neural network with ``bfloat16`` kernels and with the right backend on AWS Graviton3 processors (`AWS c7g instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_).

4. Optimize memory allocation overhead with Linux Transparent huge pages
5. Conclusion

NOTE: An instance from Graviton3 family (``c7g/r7g/m7g``) is required for this tutorial in order to reproduce the speedup numbers shown below and documented elsewhere. We have used `c7g.xl (4vcpu) instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_ for this tutorial.

Suggested change
NOTE: An instance from Graviton3 family (``c7g/r7g/m7g``) is required for this tutorial in order to reproduce the speedup numbers shown below and documented elsewhere. We have used `c7g.xl (4vcpu) instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_ for this tutorial.
.. note::
To successfully run this tutorial and reproduce the speedup numbers shown below, you need an instance from the Graviton3 family (``c7g/r7g/m7g``) of hardware. For this tutorial, we used the `c7g.xl (4vcpu) instance <https://aws.amazon.com/ec2/instance-types/c7g/>`_ .

1. Basic Usage
---------------

PyTorch natively supports AWS Graviton3 optimizations starting PyTorch 2.0 version. Please refer to this `blog <https://pytorch.org/blog/optimized-pytorch-w-graviton/>`_ for more details on the optimizations.

Suggested change
PyTorch natively supports AWS Graviton3 optimizations starting PyTorch 2.0 version. Please refer to this `blog <https://pytorch.org/blog/optimized-pytorch-w-graviton/>`_ for more details on the optimizations.
PyTorch natively supports AWS Graviton3 optimizations starting with PyTorch 2.0 version.
Please refer to this `blog <https://pytorch.org/blog/optimized-pytorch-w-graviton/>`_ for more details on the optimizations.
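
As a side note (not part of the suggested change above), a quick way to confirm that an instance is running a PyTorch 2.x build with the MKLDNN backend available is a check along these lines; this is only an illustrative sketch, not text from the recipe:

   import torch

   # Graviton3 optimizations ship with PyTorch 2.0+ and are routed through the MKLDNN/ACL backend.
   print(torch.__version__)                     # expect 2.0 or later
   print(torch.backends.mkldnn.is_available())  # expect True on a supported build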

4. Optimize memory allocation overhead with Linux Transparent huge pages
-------------------------------------------------------------------------

We also observed that for these larger networks, tensor memory allocations take significant portion of the inference latency. This can be optimized by enabling Linux transparent huge pages allocations from PyTorch C10 memory allocator. Set the following environment variable to enable it.

Suggested change
We also observed that for these larger networks, tensor memory allocations take significant portion of the inference latency. This can be optimized by enabling Linux transparent huge pages allocations from PyTorch C10 memory allocator. Set the following environment variable to enable it.
We also observed that for these larger networks, tensor memory allocations take significant portion of the inference latency. This can be optimized by enabling Linux transparent huge pages allocations from PyTorch C10 memory allocator. Set the following environment variable to enable it:

``$ export THP_MEM_ALLOC_ENABLE=1``


For the batch dimension of 256 and with fast math mode

Suggested change
For the batch dimension of 256 and with fast math mode
For the batch dimension of 256 and with fast math mode:

print(prof.key_averages().table(sort_by="self_cpu_time_total"))


Following is the profiler output with THP memory allocations enabled

Suggested change
Following is the profiler output with THP memory allocations enabled
The following is the profiler output with THP memory allocations enabled:

aten::relu 0.04% 2.547ms 4.85% 325.115ms 1.626ms 200
====================== ============ ============ ============ ============ ============== ============

``Self CPU time total: 6.697s``

Suggested change
``Self CPU time total: 6.697s``
**Self CPU time total:** 6.697s


``Self CPU time total: 6.697s``

This is an additional ``1.08x or 8% (6.697s vs 7.262s)`` improvement on top of the already optimized fast math mode measured above.

Suggested change
This is an additional ``1.08x or 8% (6.697s vs 7.262s)`` improvement on top of the already optimized fast math mode measured above.
This is an additional **1.08x or 8% (6.697s vs 7.262s)** improvement on top of the already optimized fast math mode measured above.
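
For readers following the numbers, profiler tables like the ones quoted throughout this review come from a loop along these lines. This is only a self-contained sketch: the stand-in model, its layer sizes, the iteration count, and the ``mymodel_inference`` label are assumptions inferred from the row names and the batch size mentioned above.

   import torch
   from torch import nn
   from torch.profiler import ProfilerActivity, profile, record_function

   # Stand-in linear-layer network and input batch (sizes are illustrative only).
   model = nn.Sequential(
       nn.Linear(4096, 4096), nn.ReLU(),
       nn.Linear(4096, 4096), nn.ReLU(),
       nn.Linear(4096, 64),
   )
   model.eval()
   x = torch.randn(256, 4096)   # batch dimension of 256, as in the run quoted above

   # Run inference repeatedly under the profiler and print the CPU-time summary
   # that produces tables like the ones quoted in this review.
   with torch.inference_mode():
       with profile(activities=[ProfilerActivity.CPU]) as prof:
           with record_function("mymodel_inference"):
               for _ in range(100):
                   model(x)

   print(prof.key_averages().table(sort_by="self_cpu_time_total"))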

@svekars added the 2.2 label on Dec 20, 2023
@svekars (Contributor) left a comment:

Also, can you please add this to index.rst?

@snadampal force-pushed the aws_graviton branch 3 times, most recently from d8c1437 to 363ed4a on December 21, 2023 at 03:44
@snadampal (Contributor, Author)

Hi @svekars, thanks for the review; I have incorporated all your feedback on the doc.
Next, I'm checking how to add it to index.rst. If I understand correctly, I need to add it to the "what's new in PyTorch tutorials?" section, right? What link should I provide there?

@snadampal (Contributor, Author)

Looks like I need to create these two, right? Looking into them.

For Tutorials (except if it is a prototype feature), include it in the toctree directive and create a customcarditem in index.rst.
For Tutorials (except if it is a prototype feature), create a thumbnail in the index.rst file using a command like .. customcarditem:: beginner/your_tutorial.html. For Recipes, create a thumbnail in the recipes_index.rst


svekars commented Dec 21, 2023

@snadampal I'm thinking this probably fits better under recipe_source. Can you move it under recipe_source and update recipe_source/recipes_index.rst? I also understand that this is a beta feature, so please prepend the title with (Beta).

Also, please fix the spellcheck. Words like OpenBLAS, Graviton, and MKLDNN can be added to en-wordlist.txt. Words like MyNeuralNetwork should be enclosed in double ticks for the spellcheck to skip them.

logits = self.linear_relu_stack(x)
return logits

4. Let's create an instance of MyNeuralNetwork, and move it to the device:

Suggested change
4. Let's create an instance of MyNeuralNetwork, and move it to the device:
4. Let's create an instance of ``MyNeuralNetwork``, and move it to the device:
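
For context, the step being edited above builds and places the model roughly like this; the layer sizes here are illustrative assumptions rather than the recipe's exact values:

   import torch
   from torch import nn

   # Illustrative sketch of the ``MyNeuralNetwork`` referenced in the suggestion:
   # a stack of linear + ReLU layers whose forward pass returns logits.
   class MyNeuralNetwork(nn.Module):
       def __init__(self):
           super().__init__()
           self.flatten = nn.Flatten()
           self.linear_relu_stack = nn.Sequential(
               nn.Linear(28 * 28, 512),
               nn.ReLU(),
               nn.Linear(512, 512),
               nn.ReLU(),
               nn.Linear(512, 10),
           )

       def forward(self, x):
           x = self.flatten(x)
           logits = self.linear_relu_stack(x)
           return logits

   # Step 4 from the recipe: create an instance and move it to the device (CPU here).
   device = "cpu"
   model = MyNeuralNetwork().to(device)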

@snadampal (Contributor, Author)

Hi @svekars, sure, I will fix the acronyms part. Can you please elaborate on what needs to be added to recipes_index.rst?

update recipe_source/recipes_index.rst

I'm planning to add this customcarditem under the performance section in recipes_index.rst, but I'm wondering whether I need to provide the .html file or whether it gets built in the repo. Please clarify.

.. customcarditem::
   :header: PyTorch Inference Performance Tuning on AWS Graviton Processors
   :card_description: how to achieve the best inference performance for linear layer neural network on AWS Graviton3 CPUs
   :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
   :link: ../recipes/inference_tuning_on_aws_graviton.html
   :tags: Model-Optimization

@snadampal (Contributor, Author)

Hi @svekars, I have updated the PR with the recipes_index.rst and en-wordlist.txt updates.

Speed up Inference with ``bfloat16`` Fast Math Kernels
----------------------------------------------------------

AWS Graviton3 processors support `bfloat16 MMLA instructions <https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/BFMMLA--BFloat16-floating-point-matrix-multiply-accumulate->`_. Arm Compute Library (`ACL <https://github.com/ARM-software/ComputeLibrary>`_) provides optimized ``bfloat16`` General Matrix Multiplication (GEMM) kernels for AWS Graviton processors, and are integrated into PyTorch via MKLDNN backend starting with PyTorch 2.0. The inference performance can be optimized with the fast math GEMM kernels. To enable the fast math GEMM kernels, set the following environment variable:

Please add a line here for why this is not enabled by default


.. code-block:: bash

$ export TORCH_MKLDNN_MATMUL_MIN_DIM=64

Please add a line or two for why this is not enabled by default


.. code-block:: bash

$ export THP_MEM_ALLOC_ENABLE=1

Please add a line for why this is not enabled by default

This is an additional **1.08x or 8% (6.697s vs 7.262s)** improvement on top of the already optimized MKLDNN fast math mode measured above.


Conclusion

Please mention when each of the 3 optimizations should be used
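
For reference, the three optimizations the reviewer is asking about boil down to three environment variables. This is only a sketch: the fast math variable name is not visible in the quoted diff and is assumed from the PyTorch Graviton optimization blog, and the when-to-use notes paraphrase the discussion rather than the recipe's final wording.

   # Enable the ACL bfloat16 fast math GEMM kernels through the MKLDNN backend.
   # Assumed variable name; off by default because bfloat16 fast math trades some
   # numerical precision for speed.
   $ export DNNL_DEFAULT_FPMATH_MODE=BF16

   # Route sufficiently large matmuls (dimensions of at least 64) to the MKLDNN/ACL
   # backend; smaller shapes tend to do better on the default kernels.
   $ export TORCH_MKLDNN_MATMUL_MIN_DIM=64

   # Back PyTorch C10 tensor allocations with Linux transparent huge pages to cut
   # allocation overhead for larger networks.
   $ export THP_MEM_ALLOC_ENABLE=1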

@snadampal (Contributor, Author)

Addressed the feedback from @agunapal. Please let me know if this looks good now.

@agunapal (Contributor) left a comment:

Thanks for addressing the changes. LGTM

Comment on lines 117 to 127
====================== ============ =========== ============= =========== ============ ============
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
====================== ============ =========== ============= =========== ============ ============
aten::addmm 97.61% 15.813s 98.61% 15.977s 53.255ms 300
aten::clamp_min 1.09% 177.032ms 1.09% 177.032ms 885.160us 200
aten::copy_ 1.00% 162.054ms 1.00% 162.054ms 540.180us 300
mymodel_inference 0.22% 35.738ms 100.00% 16.201s 16.201s 1
aten::linear 0.02% 2.955ms 98.66% 15.985s 53.282ms 300
aten::t 0.01% 2.421ms 0.03% 5.043ms 16.810us 300
aten::relu 0.01% 2.356ms 1.11% 179.388ms 896.940us 200
====================== ============ =========== ============= =========== ============ ============

Can you indent lines 117-127 under the table directive for correct recognition?

Same comment for all table directives in the PR.
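
Concretely, nesting the profiler output under the directive looks roughly like this (a shortened sketch reusing values from the table above, not the recipe's full output):

   .. table:: sample profiler output

      ====================== ============ ============
      Name                   Self CPU %   Self CPU
      ====================== ============ ============
      aten::addmm            97.61%       15.813s
      mymodel_inference      0.22%        35.738ms
      ====================== ============ ============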

====================== ============ =========== ============= =========== ============ ============
aten::addmm 97.61% 15.813s 98.61% 15.977s 53.255ms 300
aten::clamp_min 1.09% 177.032ms 1.09% 177.032ms 885.160us 200
aten::copy_ 1.00% 162.054ms 1.00% 162.054ms 540.180us 300

aten::copy_ for some reason creates a link in the HTML:
[screenshot: rendered HTML where aten::copy_ appears as a hyperlink]
I wonder if the correct indentation will help with that.


The following is the profiler output with THP memory allocations enabled:

.. table:: output with the fast math and thp memory allocations

Suggested change
.. table:: output with the fast math and thp memory allocations
.. table:: output with the fast math and THP memory allocations

en-wordlist.txt Outdated
fastmath
latencies
openBLAS
thp

Suggested change
thp

en-wordlist.txt Outdated
Comment on lines 566 to 567
addmm
aten

I think these two should get resolved after the indentation is fixed. Also, can you please sort alphabetically and then save? In vim, you should be able to just use :sort.

Suggested change
addmm
aten

@snadampal force-pushed the aws_graviton branch 3 times, most recently from 6f23e62 to aa79f4b on January 2, 2024 at 19:52
@snadampal (Contributor, Author)

@svekars, even the indentation didn't fix the copy_ hyperlink issue, so I have removed the _.


svekars commented Jan 8, 2024

@snadampal I think this is now in pretty good shape - can you resolve the merge conflict?

@snadampal (Contributor, Author)

@svekars, the PR is rebased; could you please check and merge if it looks good?

@snadampal (Contributor, Author)

@pytorchbot merge

@snadampal (Contributor, Author)

I have rebased onto main.


svekars commented Jan 12, 2024

Thanks @snadampal - this looks good to me. We will merge a few days before the release.

@chauhang

@snadampal Thanks for contributing the Graviton tutorial. It will be good to mention that it works with PT 2.0 and higher. Have you also run any tests with torch.compile? It would be good to include an update as a follow-up PR showcasing speedups with torch.compile and Graviton.

@snadampal (Contributor, Author)

Hi @chauhang, in the tutorial I mentioned the following. Do you suggest adding PyTorch 2.0+ in the title itself?

PyTorch natively supports AWS Graviton3 optimizations starting with PyTorch 2.0 version.

Regarding torch.compile(), yes, that's the next thing I'm working on :) Currently the PyTorch changes are merged, but the oneDNN changes are still under review. Once they are merged, I will raise a PR for a torch.compile() tutorial.
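
For that follow-up, a minimal torch.compile sketch would look roughly like the following; the network and input are placeholders, and actual speedups on Graviton depend on the pending oneDNN changes mentioned above:

   import torch
   from torch import nn

   # Placeholder model and input; the real recipe would reuse its own network.
   model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 64)).eval()
   x = torch.randn(256, 4096)

   # Compile the eager-mode model once, then run inference with the compiled module.
   compiled_model = torch.compile(model)
   with torch.inference_mode():
       output = compiled_model(x)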

@svekars merged commit d9a0d6b into pytorch:main on Jan 24, 2024
HDCharles pushed a commit that referenced this pull request Jan 26, 2024
* Add recipe for PyTorch inference on AWS Graviton CPUs
---------

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
svekars added a commit that referenced this pull request Feb 2, 2024
* Add recipe for PyTorch inference on AWS Graviton CPUs
---------

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>