---
layout: blog_detail
title: "The Path to Achieve PyTorch Performance Boost on Windows CPU"
author: Intel Corporation
---

The challenge of PyTorch's lower CPU performance on Windows compared to Linux has been a significant issue. There are multiple factors leading to this performance disparity. Through our investigation, we've identified one of the primary reasons for poor CPU performance on Windows: the default Windows malloc memory allocator.

In version 2.0, PyTorch on Windows with CPU directly utilizes the default malloc mechanism of Windows, which, compared to the malloc used by the PyTorch 2.0 Linux version, significantly increases the time for memory allocation and thereby decreases performance. We replaced the original Windows malloc mechanism, which PyTorch calls automatically, with mimalloc, a well-known malloc library developed by Microsoft. This replacement was released with PyTorch v2.1 and can significantly improve PyTorch's performance on Windows CPUs, as shown below in Figure 1.

{:style="width:100%;"}

_Figure 1: Relative throughput improvement achieved by upgrading from Windows PyTorch version 2.0 to 2.1 (higher is better)._

From this graph, we see that PyTorch 2.1 on Windows CPU shows significant performance improvements. The variation in gains across workloads mainly stems from the differing proportions of operations within each model, which in turn affects the frequency of memory allocation. The BERT model shows a comparatively smaller improvement, while ResNet50 and MobileNet-v3 Large show more substantial ones.

On a high-performance CPU, memory allocation becomes a performance bottleneck. This is also why addressing this issue has led to such significant performance improvements.
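To build an intuition for why allocator overhead matters, here is a toy pure-Python sketch. It is an analogy only: it exercises CPython's allocator rather than PyTorch's C++ tensor allocation path, and the buffer size and iteration count are arbitrary choices.

```python
import timeit

N = 10_000

# Allocate a fresh 1 MiB buffer on every iteration: each call pays the
# allocator (and zero-fill) cost inside the hot loop.
alloc_time = timeit.timeit("bytearray(1 << 20)", number=N)

# Reuse a single preallocated buffer: the allocator is out of the hot loop,
# and each iteration only touches memory that already exists.
buf = bytearray(1 << 20)
reuse_time = timeit.timeit("buf[0] = 1", globals={"buf": buf}, number=N)

print(f"fresh allocation: {alloc_time:.3f}s  buffer reuse: {reuse_time:.3f}s")
```

The same effect, at a much larger scale and in native code, is what a faster system allocator such as mimalloc mitigates for allocation-heavy PyTorch workloads.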

{:style="width:100%;"}

_Figure 2.1: Relative performance of Windows vs Linux with PyTorch version 2.0 (higher is better)._

{:style="width:100%;"}

_Figure 2.2: Relative performance of Windows vs Linux with PyTorch version 2.1 (higher is better)._

As shown in the graphs, PyTorch's performance on Windows CPUs can be significantly improved. However, there is still a noticeable gap compared to its performance on Linux. This can be attributed to several factors, including the fact that malloc has not yet fully reached the performance level of its Linux counterpart. Intel engineers will continue to collaborate with Meta engineers to reduce the performance gap of PyTorch between Windows and Linux.


## HOW TO TAKE ADVANTAGE OF THE OPTIMIZATIONS

Install PyTorch version 2.1 or higher using the Windows CPU wheel from the official repository, and you may automatically experience a performance boost.
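For example, with pip (the CPU wheel index URL below is the one published in the official PyTorch installation instructions; verify it against the current "Get Started" page):

```shell
# Install a CPU-only PyTorch build, version 2.1 or newer, on Windows.
pip install "torch>=2.1" --index-url https://download.pytorch.org/whl/cpu
```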

## CONCLUSION

When comparing PyTorch 2.0 and PyTorch 2.1, we observed varying degrees of performance improvement on Windows CPU. The improvement becomes more pronounced as the number of memory allocation operations in a workload increases. A more powerful CPU also makes the enhancement more pronounced, as the proportion of time spent outside of computation increases.

To a certain extent, this performance enhancement helps to bridge the PyTorch CPU performance gap between Windows and Linux. Intel will continue to collaborate with Meta to enhance the performance of PyTorch on CPUs.

## ACKNOWLEDGMENTS

The results presented in this blog post were achieved through the collaborative effort of the Intel PyTorch team and Meta. We would like to express our sincere gratitude to [Xu Han](https://github.com/xuhancn), [Jiong Gong](https://github.com/jgong5), [Mingfei Ma](https://github.com/mingfeima), [Haozhe Zhu](https://github.com/zhuhaozhe), [Chuanqi Wang](https://github.com/chuanqi129), [Guobing Chen](https://github.com/Guobing-Chen) and [Eikan Wang](https://github.com/EikanWang). Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here. Thanks to [Jiachen Pu](https://github.com/peterjc123) for his participation in the issue discussion and for suggesting the use of [mimalloc](https://github.com/microsoft/mimalloc). We'd also like to express our gratitude to Microsoft for providing such an easily integrated and performant memory allocation library. Finally, we want to thank [Jing Xu](https://github.com/jingxu10), [Weizhuo Zhang](https://github.com/WeizhuoZhang-intel) and [Zhaoqiong Zheng](https://github.com/ZhaoqiongZ) for their contributions to this blog.

### Product and Performance Information

The configurations in the table are collected with [svr-info](https://github.com/intel/svr-info).

| Frequency Governor | powersave | powersave |
| Frequency Driver | intel_pstate | intel_pstate |
| Max C-State | 9 | 9 |

## Notices and Disclaimers

Performance varies by use, configuration and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/).

Performance results are based on testing as of dates shown in [configurations](#product-and-performance-information) and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.