For memory management, it configures NUMA binding and preloads optimized memory allocators.
In addition, the script provides tunable parameters for compute resource allocation in both single instance and multiple instance scenarios,
helping users find an optimal coordination of resource utilization for their specific workloads.
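For a first taste of those tunable parameters, an invocation can look like the sketch below; the instance counts are placeholders to tune for your machine, ``inference.py`` stands in for your own script, and the available flags can be confirmed with ``python -m torch.backends.xeon.run_cpu --help``:

```shell
# Hypothetical invocation: 2 instances with 28 physical cores each.
# "inference.py" is a placeholder for your own inference script.
python -m torch.backends.xeon.run_cpu --ninstances 2 --ncores-per-instance 28 inference.py
```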

What You Will Learn
-------------------

* How to utilize tools like ``numactl``, ``taskset``, Intel(R) OpenMP Runtime Library and optimized memory allocators such as ``TCMalloc`` and ``JeMalloc`` for enhanced performance.
* How to configure CPU resources and memory management to maximize PyTorch inference performance on Intel(R) Xeon(R) processors.
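As a preview of the thread-management tools listed above, Intel(R) OpenMP behavior is commonly steered through environment variables; the values below are illustrative starting points for one instance, not universal recommendations:

```shell
# Illustrative Intel(R) OpenMP settings (tune for your machine/workload):
export OMP_NUM_THREADS=28                         # threads per instance
export KMP_AFFINITY=granularity=fine,compact,1,0  # pin threads to cores
export KMP_BLOCKTIME=1                            # ms a thread waits before sleeping
```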

Introduction to the Optimizations
---------------------------------

Applying NUMA Access Control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
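As an illustration of what NUMA binding does outside the launcher, ``numactl`` can pin both the computation and the memory allocation of a process to a single NUMA node; the node id and the script name ``inference.py`` below are placeholders:

```shell
# Illustrative: bind CPU execution and memory allocation to NUMA node 0,
# so memory accesses stay local to that socket.
numactl --cpunodebind=0 --membind=0 python inference.py
```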

Choosing an Optimized Memory Allocator
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The memory allocator also plays an important role in performance. More efficient memory usage reduces the overhead of unnecessary memory allocations and deallocations, and thus results in faster execution. In practice, for deep learning workloads, ``TCMalloc`` or ``JeMalloc`` can achieve better performance than the default ``malloc`` by reusing memory as much as possible.

You can install ``TCMalloc`` by running the following command on Ubuntu:

.. code-block:: console

   $ sudo apt-get install google-perftools

In a conda environment, it can also be installed by running:

.. code-block:: console

   $ conda install conda-forge::gperftools

On Ubuntu, ``JeMalloc`` can be installed by this command:

.. code-block:: console

   $ sudo apt-get install libjemalloc2
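Outside of the launcher, either allocator can be enabled manually for a single run by preloading its shared library. The ``.so`` path below is a typical Ubuntu location but is an assumption; locate the actual file on your system first:

```shell
# Illustrative: preload TCMalloc (or JeMalloc) so it replaces the default
# malloc for this run only. Adjust the library path for your system, e.g.
# find it with: ldconfig -p | grep -E 'tcmalloc|jemalloc'
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 python inference.py
```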

Conclusion
----------

In this tutorial, we explored a variety of advanced configurations and tools designed to optimize PyTorch inference performance on Intel(R) Xeon(R) Scalable Processors.
By leveraging the ``torch.backends.xeon.run_cpu`` script, we demonstrated how to fine-tune thread and memory management to achieve peak performance.
We covered essential concepts such as NUMA access control, optimized memory allocators like ``TCMalloc`` and ``JeMalloc``, and the use of Intel(R) OpenMP for efficient multithreading.

Additionally, we provided practical command-line examples to guide you through setting up single and multiple instance scenarios, ensuring optimal resource utilization tailored to specific workloads.
By understanding and applying these techniques, users can significantly enhance the efficiency and speed of their PyTorch applications on Intel(R) Xeon(R) platforms.