Commit 5eb9fa2

Update 2024-05-21-perfboost-windows-cpu.md
1 parent 839aae9 commit 5eb9fa2

1 file changed: +20 -26 lines changed

_posts/2024-05-21-perfboost-windows-cpu.md

Lines changed: 20 additions & 26 deletions
Original file line number | Diff line number | Diff line change
@@ -1,64 +1,49 @@
11
---
22
layout: blog_detail
3-
title: "The Path to Achieve PyTorch Windows Performance boost on CPU"
4-
author: Zhaoqiong Zheng from Intel
3+
title: "The Path to Achieve PyTorch Performance Boost on Windows CPU"
4+
author: Intel Corporation
55
---
66

7-
The challenge of PyTorch's lower CPU performance on Windows compared to Linux has been a significant issue. There are multiple factors leading to this performance disparity. Through meticulous investigation, we've identified one of the primary reasons for poor CPU performance on Windows, which is linked to the Windows malloc memory allocator.
7+
The challenge of PyTorch's lower CPU performance on Windows compared to Linux has been a significant issue. There are multiple factors leading to this performance disparity. Through our investigation, we've identified one of the primary reasons for poor CPU performance on Windows, which is linked to the Windows default malloc memory allocator.
88

9-
In version 2.0, PyTorch on Windows with CPU directly utilizes the default malloc mechanism of Windows, which, compared to the malloc used in PyTorch Linux version 2.0, significantly increases the time for memory allocation, resulting in decreased performance. Intel engineer [Xu Han](https://github.com/xuhancn) took the initiative to replace the original Windows malloc mechanism, which PyTorch automatically calls, with another well-known malloc library developed by Microsoft, known as mimalloc. This replacement of malloc has already been released with PyTorch v2.1 and can significantly improve PyTorch's performance on Windows CPUs (See the following graph).
9+
In version 2.0, PyTorch on Windows with CPU directly utilizes the default malloc mechanism of Windows, which, when compared to the malloc used in PyTorch Linux version 2.0, significantly increases the time for memory allocation, resulting in decreased performance. We replaced the original Windows malloc mechanism, which PyTorch automatically calls, with another well-known malloc library developed by Microsoft, known as mimalloc. This replacement of malloc has already been released with PyTorch v2.1 and can significantly improve PyTorch's performance on Windows CPUs, as shown below in Figure 1.
1010

1111
![Windows PC Performance Improvement](/assets/images/2024-05-21-perfboost-windows-cpu/windows_compare.png){:style="width:100%;"}
1212

1313
_Figure 1: Relative throughput improvement achieved by upgrading from Windows PyTorch version 2.0 to 2.1 (higher is better)._
1414

15-
**Note**: The performance is measured on Intel Core 13th Gen i7-13700H with 32G Memory.
16-
17-
18-
From this graph, it's evident that PyTorch on Windows CPU showcases significant performance improvements. The variations in performance enhancements across different workloads mainly stem from varying proportions of different operations within distinct models, consequently affecting the frequency of memory access operations. It shows a comparatively smaller enhancement in BERT model performance, while there is a more substantial improvement in ResNet50 and MobileNetv3 Large model performances.
15+
From this graph, we see that PyTorch 2.1 on Windows CPU shows significant performance improvements. The variations in performance enhancements across different workloads mainly stem from varying proportions of different operations within distinct models, consequently affecting the frequency of memory access operations. It shows a comparatively smaller enhancement in BERT model performance, while there is a more substantial improvement in ResNet50 and MobileNet-v3 Large model performances.
1916

2017
On a high-performance CPU, memory allocation becomes a performance bottleneck. This is also why addressing this issue has led to such significant performance improvements.
2118

2219
![Windows vs Linux Performance on PyTorch 2.0](/assets/images/2024-05-21-perfboost-windows-cpu/pytorch_20_win_linux.png){:style="width:100%;"}
2320

2421
_Figure 2.1: Relative performance of Windows vs Linux with PyTorch version 2.0 (higher is better)._
2522

26-
**Note**: The performance is measured on Intel Core 13th Gen i7-13700H with 32G Memory.
27-
2823
![Windows vs Linux Performance on PyTorch 2.1](/assets/images/2024-05-21-perfboost-windows-cpu/pytorch_21_win_linux.png){:style="width:100%;"}
2924

3025
_Figure 2.2: Relative performance of Windows vs Linux with PyTorch version 2.1 (higher is better)._
3126

32-
**Note**: The performance is measured on Intel Core 13th Gen i7-13700H with 32G Memory.
33-
34-
As shown in the graphs, it is evident that PyTorch's performance on Windows CPUs can significantly improved. However, there is still a noticeable gap when compared to its performance on Linux. This can be attributed to several factors, including the fact that malloc has not yet fully reached the performance level of Linux, among other reasons. Intel engineers will continue to delve into this issue, collaborating with Meta engineers, to reduce the performance gap of PyTorch between Windows and Linux.
27+
As shown in the graphs, PyTorch's performance on Windows CPUs can be significantly improved. However, there is still a noticeable gap when compared to its performance on Linux. This can be attributed to several factors, including the fact that malloc has not yet fully reached the performance level of Linux, among other reasons. Intel engineers will continue to collaborate with Meta engineers to reduce the performance gap of PyTorch between Windows and Linux.
3528

3629

3730
## HOW TO TAKE ADVANTAGE OF THE OPTIMIZATIONS
3831

39-
Install PyTorch version 2.1 or higher using the Windows CPU wheel from the official repository, and you will automatically experience a performance boost.
32+
Install PyTorch version 2.1 or higher using the Windows CPU wheel from the official repository, and you may automatically experience a performance boost.
4033

4134

4235
## CONCLUSION
4336

44-
Comparing PyTorch 2.0 and PyTorch 2.1, we can observe varying degrees of performance improvement on Windows CPU. The extent of performance improvement becomes more pronounced as the number of memory allocation operations called within an op in a workload increases. A more powerful CPU computing capability will also make this performance enhancement more pronounced, as the proportion of operations outside of computation increases.
37+
When comparing PyTorch 2.0 and PyTorch 2.1, we observed varying degrees of performance improvement on Windows CPU. The extent of performance improvement becomes more pronounced as the number of memory allocation operations called in a workload increases. A more powerful CPU computing capability will also make this performance enhancement more pronounced, as the proportion of operations outside of computation increases.
4538

46-
This performance enhancement to a certain extent helps to bridge the PyTorch CPU performance gap between Windows and Linux. Intel will continue to collaborate with Meta, dedicated to enhancing the performance of PyTorch on CPUs!
39+
To a certain extent, this performance enhancement helps to bridge the PyTorch CPU performance gap between Windows and Linux. Intel will continue to collaborate with Meta to enhance the performance of PyTorch on CPUs.
4740

4841
## ACKNOWLEDGMENTS
4942

50-
The results presented in this blog post was achieved through the collaborative effort of the Intel PyTorch team and Meta. We would like to express our sincere gratitude to [Xu Han](https://github.com/xuhancn), [Jiong Gong](https://github.com/jgong5), [Mingfei Ma](https://github.com/mingfeima), [Haozhe Zhu](https://github.com/zhuhaozhe), [Chuanqi Wang](https://github.com/chuanqi129), [Guobing Chen](https://github.com/Guobing-Chen), [Eikan Wang](https://github.com/EikanWang). Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here. Thanks to [Jiachen Pu](https://github.com/peterjc123) for his participation in the issue discussion and suggesting the use of [mimalloc](https://github.com/microsoft/mimalloc). We'd also like to express our gratitude to Microsoft for providing such an easily integrated and performant mallocation library. Finally we want to thank [Jing Xu](https://github.com/jingxu10) and [Weizhuo Zhang](https://github.com/WeizhuoZhang-intel) for their contributions to this blog.
51-
43+
The results presented in this blog post were achieved through the collaborative effort of the Intel PyTorch team and Meta. We would like to express our sincere gratitude to [Xu Han](https://github.com/xuhancn), [Jiong Gong](https://github.com/jgong5), [Mingfei Ma](https://github.com/mingfeima), [Haozhe Zhu](https://github.com/zhuhaozhe), [Chuanqi Wang](https://github.com/chuanqi129), [Guobing Chen](https://github.com/Guobing-Chen) and [Eikan Wang](https://github.com/EikanWang). Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here. Thanks to [Jiachen Pu](https://github.com/peterjc123) for his participation in the issue discussion and suggesting the use of [mimalloc](https://github.com/microsoft/mimalloc). We'd also like to express our gratitude to Microsoft for providing such an easily integrated and performant memory allocation library. Finally, we want to thank [Jing Xu](https://github.com/jingxu10), [Weizhuo Zhang](https://github.com/WeizhuoZhang-intel) and [Zhaoqiong Zheng](https://github.com/ZhaoqiongZ) for their contributions to this blog.
5244

53-
## Notices and Disclaimers
54-
55-
Performance varies by use, configuration and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/).
5645

57-
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
58-
59-
Your costs and results may vary.
60-
61-
### Configurations
46+
### Product and Performance Information
6247

6348
The configurations in the table are collected with [svr-info](https://github.com/intel/svr-info)
6449

@@ -99,3 +84,12 @@ The configurations in the table are collected with [svr-info](https://github.com
9984
| Frequency Governor | powersave | powersave |
10085
| Frequency Driver | intel_pstate | intel_pstate |
10186
| Max C-State | 9 | 9 |
87+
88+
89+
## Notices and Disclaimers
90+
91+
Performance varies by use, configuration and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/).
92+
93+
Performance results are based on testing as of dates shown in [configurations](#product-and-performance-information) and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
94+
95+
Your costs and results may vary.
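
As a companion to the install step under "HOW TO TAKE ADVANTAGE OF THE OPTIMIZATIONS", the following is a minimal sketch of how one might confirm that an environment has PyTorch 2.1 or higher installed. The helper names (`parse_major_minor`, `meets_minimum`, `installed_torch_version`) are illustrative and not part of PyTorch. The check only compares version numbers; it does not verify which allocator is in use or measure any speedup.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.1.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version: str, minimum: Tuple[int, int] = (2, 1)) -> bool:
    """Return True if the version is at least the release named in the post (2.1)."""
    # Tuples compare element by element, so (2, 10) correctly ranks above (2, 1).
    return parse_major_minor(version) >= minimum


def installed_torch_version() -> Optional[str]:
    """Return the installed torch version, or None if torch is not installed."""
    try:
        return metadata.version("torch")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_torch_version()
    if found is None:
        print("torch is not installed")
    elif meets_minimum(found):
        print(f"torch {found}: 2.1 or higher")
    else:
        print(f"torch {found}: older than 2.1")
```

Comparing parsed integers rather than raw strings avoids ranking a release such as 2.10 below 2.2.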
