You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: _posts/2024-05-21-perfboost-windows-cpu.md
+17-9Lines changed: 17 additions & 9 deletions
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@ author: Intel Corporation
6
6
7
7
The challenge of PyTorch's lower CPU performance on Windows compared to Linux has been a significant issue. There are multiple factors leading to this performance disparity. Through our investigation, we've identified one of the primary reasons for poor CPU performance on Windows, which is linked to the Windows default malloc memory allocator.
8
8
9
-
In version 2.0, PyTorch on Windows with CPU directly utilizes the default malloc mechanism of Windows, when compared to the malloc used in PyTorch Linux version 2.0, significantly increases the time for memory allocation, resulting in decreased performance. We replaced the original Windows malloc mechanism, which PyTorch automatically calls, with another well-known malloc library developed by Microsoft, known as mimalloc. This replacement of malloc has already been released with PyTorch v2.1 and can significantly improve PyTorch's performance on Windows CPUs as shown below in Figure 1.
9
+
In version 2.0, PyTorch on Windows with CPU directly utilizes the default malloc mechanism of Windows, when it is compared to the malloc used in PyTorch Linux version 2.0, it significantly increases the time for memory allocation, which results in decreased performance. We replaced the original Windows malloc mechanism, which PyTorch automatically calls, with another well-known malloc library developed by Microsoft, known as mimalloc. This replacement of malloc has already been released with PyTorch v2.1 and can significantly improve PyTorch's performance on Windows CPUs as shown below in Figure 1.
10
10
11
11
{:style="width:100%;"}
12
12
@@ -16,6 +16,9 @@ From this graph, we see that PyTorch 2.1 on Windows CPU shows significant perfor
16
16
17
17
On a high-performance CPU, memory allocation becomes a performance bottleneck. This is also why addressing this issue has led to such significant performance improvements.
18
18
19
+
As shown in the graphs below, we see that PyTorch's performance on Windows CPUs can significantly be improved. However, there is still a noticeable gap when compared to its performance on Linux. This can be attributed to several factors, including the fact that malloc has not yet fully reached the performance level of Linux, among other reasons. Intel engineers will continue to collaborate with Meta engineers, to reduce the performance gap of PyTorch between Windows and Linux.
20
+
21
+
19
22
{:style="width:100%;"}
20
23
21
24
_Figure 2.1: Relative performance of Windows vs Linux with PyTorch version 2.0 (higher is better)._
@@ -24,8 +27,6 @@ _Figure 2.1: Relative performance of Windows vs Linux with PyTorch version 2.0 (
24
27
25
28
_Figure 2.2: Relative performance of Windows vs Linux with PyTorch version 2.1 (higher is better)._
26
29
27
-
As shown in the graphs, we see that PyTorch's performance on Windows CPUs can significantly be improved. However, there is still a noticeable gap when compared to its performance on Linux. This can be attributed to several factors, including the fact that malloc has not yet fully reached the performance level of Linux, among other reasons. Intel engineers will continue to collaborate with Meta engineers, to reduce the performance gap of PyTorch between Windows and Linux.
28
-
29
30
30
31
## HOW TO TAKE ADVANTAGE OF THE OPTIMIZATIONS
31
32
@@ -34,18 +35,27 @@ Install PyTorch version 2.1 or higher on Windows CPU from the [official reposito
34
35
35
36
## CONCLUSION
36
37
37
-
When comparing PyTorch 2.0 and PyTorch 2.1, we observed varying degrees of performance improvement on Windows CPU. The extent of performance improvement becomes more pronounced as the number of memory allocation operations called in a workload increases. A more powerful CPU computing capability will also make this performance enhancement more pronounced, as the proportion of operations outside of computation increases.
38
+
When comparing PyTorch 2.0 and PyTorch 2.1, we observed varying degrees of performance improvement on Windows CPU. The extent of performance improvement becomes more pronounced as the number of memory allocation operations called within a workload increases. A more powerful CPU computing capability will also make this performance enhancement more pronounced, as the proportion of operations outside of computation increases.
38
39
39
40
To a certain extent, this performance enhancement helps to bridge the PyTorch CPU performance gap between Windows and Linux. Intel will continue to collaborate with Meta, enhance the performance of PyTorch on CPUs.
40
41
41
42
## ACKNOWLEDGMENTS
42
43
43
-
The results presented in this blog post was achieved through the collaborative effort of the Intel PyTorch team and Meta. We would like to express our sincere gratitude to [Xu Han](https://github.com/xuhancn), [Jiong Gong](https://github.com/jgong5), [Mingfei Ma](https://github.com/mingfeima), [Haozhe Zhu](https://github.com/zhuhaozhe), [Chuanqi Wang](https://github.com/chuanqi129), [Guobing Chen](https://github.com/Guobing-Chen) and [Eikan Wang](https://github.com/EikanWang). Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here. Thanks to [Jiachen Pu](https://github.com/peterjc123) for his participation in the issue discussion and suggesting the use of [mimalloc](https://github.com/microsoft/mimalloc). We'd also like to express our gratitude to Microsoft for providing such an easily integrated and performant mallocation library. Finally we want to thank [Jing Xu](https://github.com/jingxu10), [Weizhuo Zhang](https://github.com/WeizhuoZhang-intel) and [Zhaoqiong Zheng](https://github.com/ZhaoqiongZ) for their contributions to this blog.
44
+
The results presented in this blog post was achieved through the collaborative effort of the Intel PyTorch team and Meta. We would like to express our sincere gratitude to [Xu Han](https://github.com/xuhancn), [Jiong Gong](https://github.com/jgong5), [Mingfei Ma](https://github.com/mingfeima), [Haozhe Zhu](https://github.com/zhuhaozhe), [Chuanqi Wang](https://github.com/chuanqi129), [Guobing Chen](https://github.com/Guobing-Chen) and [Eikan Wang](https://github.com/EikanWang). Their expertise and dedication have been instrumental in achieving the optimizations and performance improvements discussed here. Thanks to [Jiachen Pu](https://github.com/peterjc123) from community for his participation in the issue discussion and suggesting the use of [mimalloc](https://github.com/microsoft/mimalloc). We'd also like to express our gratitude to Microsoft for providing such an easily integrated and performant mallocation library. Finally we want to thank [Jing Xu](https://github.com/jingxu10), [Weizhuo Zhang](https://github.com/WeizhuoZhang-intel) and [Zhaoqiong Zheng](https://github.com/ZhaoqiongZ) for their contributions to this blog.
45
+
46
+
47
+
## Notices and Disclaimers
48
+
49
+
Performance varies by use, configuration and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/).
50
+
51
+
Performance results are based on testing as of dates shown in [configurations](#product-and-performance-information) and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.
52
+
53
+
Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
44
54
45
55
46
56
### Product and Performance Information
47
57
48
-
The configurations in the table are collected with [svr-info](https://github.com/intel/svr-info)
58
+
The configurations in the table are collected with [svr-info](https://github.com/intel/svr-info). Test by Intel on April 15, 2024.
@@ -86,8 +96,6 @@ The configurations in the table are collected with [svr-info](https://github.com
86
96
| Max C-State | 9 | 9 |
87
97
88
98
89
-
## Notices and Disclaimers
90
99
91
-
Performance varies by use, configuration and other factors. Learn more on the [Performance Index site](https://edc.intel.com/content/www/us/en/products/performance/benchmarks/overview/).
92
100
93
-
Performance results are based on testing as of dates shown in [configurations](#product-and-performance-information) and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation.
0 commit comments