Commit e9244af

Updating Superscript
1 parent 8bfb3e8 commit e9244af

1 file changed: +4 -4 lines changed

_posts/2022-8-16-empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.md

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
---
layout: blog_detail
-title: "Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16"
+title: "Empowering PyTorch on Intel<sup>®</sup> Xeon<sup>®</sup> Scalable processors with Bfloat16"
author: Mingfei Ma (Intel), Vitaly Fedyunin (Meta), Wei Wei (Meta)
featured-img: '\assets\images\empowering-pytorch-on-intel-xeon-scalable-processors-with-bfloat16.png'
---
@@ -9,7 +9,7 @@ featured-img: '\assets\images\empowering-pytorch-on-intel-xeon-scalable-processo

In recent years, the growing complexity of AI models has placed ever-increasing compute demands on hardware. Reduced-precision numeric formats have been proposed to address this problem. Bfloat16 is a custom 16-bit floating point format for AI that consists of one sign bit, eight exponent bits, and seven mantissa bits. With the same dynamic range as float32, bfloat16 does not require special handling such as loss scaling. Therefore, bfloat16 is a drop-in replacement for float32 when running deep neural networks for both inference and training.

-The 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake) is the first general-purpose x86 CPU with native bfloat16 support. Three new bfloat16 instructions were introduced in Intel® Advanced Vector Extensions-512 (Intel® AVX-512): VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions perform conversion from float32 to bfloat16, and the last one performs a dot product of bfloat16 pairs. Theoretical bfloat16 compute throughput is doubled over float32 on Cooper Lake. On the next generation of Intel® Xeon® Scalable Processors, bfloat16 compute throughput will be further enhanced through the Advanced Matrix Extensions (Intel® AMX) instruction set extension.
+The 3rd Gen Intel<sup>®</sup> Xeon<sup>®</sup> Scalable processor (codenamed Cooper Lake) is the first general-purpose x86 CPU with native bfloat16 support. Three new bfloat16 instructions were introduced in Intel<sup>®</sup> Advanced Vector Extensions-512 (Intel<sup>®</sup> AVX-512): VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions perform conversion from float32 to bfloat16, and the last one performs a dot product of bfloat16 pairs. Theoretical bfloat16 compute throughput is doubled over float32 on Cooper Lake. On the next generation of Intel<sup>®</sup> Xeon<sup>®</sup> Scalable Processors, bfloat16 compute throughput will be further enhanced through the Advanced Matrix Extensions (Intel<sup>®</sup> AMX) instruction set extension.

Intel and Meta previously collaborated to enable bfloat16 on PyTorch, and the related work was published in an earlier [blog](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659) during the launch of Cooper Lake. In that blog, we introduced the hardware advancement for native bfloat16 support and showcased a 1.4x to 1.6x performance boost of bfloat16 over float32 on DLRM, ResNet-50, and ResNeXt-101-32x4d.

@@ -63,7 +63,7 @@ We benchmarked inference performance of TorchVision models on Intel® Xeon® Pla

## Conclusion & Future Work

-In this blog, we introduced the recent software optimizations for bfloat16 added in PyTorch 1.12. Results on the 3rd Gen Intel® Xeon® Scalable processor show that bfloat16 delivers a 1.4x to 2.2x performance gain over float32 on the TorchVision models. Further improvement is expected on the next generation of Intel® Xeon® Scalable Processors with AMX instruction support. Though the performance numbers in this blog were collected with TorchVision models, the benefit is broad across all topologies, and we will continue to extend the bfloat16 optimization effort to a broader scope in the future!
+In this blog, we introduced the recent software optimizations for bfloat16 added in PyTorch 1.12. Results on the 3rd Gen Intel<sup>®</sup> Xeon<sup>®</sup> Scalable processor show that bfloat16 delivers a 1.4x to 2.2x performance gain over float32 on the TorchVision models. Further improvement is expected on the next generation of Intel<sup>®</sup> Xeon<sup>®</sup> Scalable Processors with AMX instruction support. Though the performance numbers in this blog were collected with TorchVision models, the benefit is broad across all topologies, and we will continue to extend the bfloat16 optimization effort to a broader scope in the future!

## Acknowledgement

@@ -73,4 +73,4 @@ The results presented in this blog is a joint effort of Meta and Intel PyTorch t

- [The bfloat16 numerical format](https://cloud.google.com/tpu/docs/bfloat16?hl=en)
- [https://pytorch.org/docs/master/amp.html#torch.autocast](https://pytorch.org/docs/master/amp.html#torch.autocast)
-- [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel® Deep Learning Boost’s new BFloat16 capability](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659)
+- [Intel and Facebook Accelerate PyTorch Performance with 3rd Gen Intel® Xeon® Processors and Intel<sup>®</sup> Deep Learning Boost’s new BFloat16 capability](https://community.intel.com/t5/Blogs/Tech-Innovation/Artificial-Intelligence-AI/Intel-and-Facebook-Accelerate-PyTorch-Performance-with-3rd-Gen/post/1335659)
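
The changed paragraphs in the diff above describe bfloat16 as a drop-in replacement for float32, and the reference list links to torch.autocast. As a quick illustration of that usage, here is a minimal sketch (not part of this commit or of the post's source) that runs a TorchVision model under CPU bfloat16 autocast; the choice of ResNet-50 and the input shape are assumptions for demonstration only, and PyTorch >= 1.12 with torchvision is assumed.

```python
# Minimal sketch (not part of this commit): bfloat16 autocast inference on CPU
# with a TorchVision model. ResNet-50 and the input shape are illustrative
# assumptions, not taken from the commit.
import torch
import torchvision.models as models

model = models.resnet50()        # float32 weights
model.eval()

x = torch.randn(1, 3, 224, 224)  # dummy NCHW input batch

# torch.autocast runs eligible ops (convolutions, matmuls) in bfloat16 while
# keeping numerically sensitive ops in float32; no loss scaling is needed
# because bfloat16 has the same dynamic range as float32.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # typically torch.bfloat16 when the final layer ran under autocast
```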
