There are several configuration options that can impact the performance of PyTorch inference when executed on Intel® Xeon® Scalable Processors.
To get peak performance, the ``torch.backends.xeon.run_cpu`` script is provided; it optimizes the configuration of thread and memory management.
For thread management, the script configures thread affinity and the preloading of the Intel® OpenMP library.
For memory management, it configures NUMA binding and preloads optimized memory allocation libraries, such as TCMalloc and JeMalloc.
In addition, the script provides tunable parameters for compute resource allocation in both single-instance and multiple-instance scenarios,
helping users find an optimal coordination of resource utilization for their specific workloads.

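A typical invocation wraps an ordinary inference script with the launcher. The sketch below assumes a placeholder script name (``inference.py``); the full set of knobs supported by your PyTorch version is printed by ``--help``.

.. code-block:: console

   # List every tunable knob exposed by the launcher
   $ python -m torch.backends.xeon.run_cpu --help

   # Minimal sketch: run a single instance with the launcher's default
   # thread and memory configuration
   $ python -m torch.backends.xeon.run_cpu inference.py
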
What You Will Learn
-------------------

* How to utilize tools like ``numactl``, ``taskset``, the Intel® OpenMP Runtime Library, and
  optimized memory allocators such as ``TCMalloc`` and ``JeMalloc`` for enhanced performance.
* How to configure CPU resources and memory management to maximize PyTorch
  inference performance on Intel® Xeon® processors.

Introduction of the Optimizations
---------------------------------

Local memory access is much faster than remote memory access.

Users can get CPU information with the ``lscpu`` command on Linux to learn how many cores and sockets there are on the machine.
Additionally, this command provides NUMA information, such as the distribution of CPU cores.
Below is an example of executing ``lscpu`` on a machine equipped with an Intel® Xeon® CPU Max 9480:

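Because the full ``lscpu`` output is long, the socket, core, and NUMA fields can also be filtered out directly, or the node layout can be queried with ``numactl``; for example:

.. code-block:: console

   # Keep only the socket, core and NUMA related lines of lscpu
   $ lscpu | grep -i -E "socket|core|numa"

   # Print the NUMA node layout and per-node memory sizes
   $ numactl --hardware
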
On CentOS you can run the following command:

.. code-block:: console

   $ yum install util-linux

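Once these utilities are available, core and memory binding can be applied by hand, which is what the ``run_cpu`` script automates. A minimal sketch, where the core range, node id, and script name are placeholders:

.. code-block:: console

   # Bind both compute and memory allocations to NUMA node 0
   $ numactl --cpunodebind=0 --membind=0 python inference.py

   # Alternatively, pin the process to an explicit list of CPU cores
   $ taskset -c 0-55 python inference.py
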
Using Intel® OpenMP Runtime Library
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

OpenMP is an implementation of multithreading, a method of parallelizing in which a primary thread (a series of instructions executed consecutively) forks a specified number of sub-threads and the system divides a task among them. The threads then run concurrently, with the runtime environment allocating threads to different processors.
Users can control OpenMP behavior with environment variable settings that fit their workloads; these settings are read and acted upon by the OpenMP library. By default, PyTorch uses the GNU OpenMP Library (GNU libgomp) for parallel computation. On Intel® platforms, the Intel® OpenMP Runtime Library (libiomp) provides OpenMP API specification support and usually brings more performance benefits compared to libgomp.

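For example, the thread count and thread-affinity policy are usually set through environment variables before launching the workload. The values below are common starting points rather than universal recommendations, and the core count is a placeholder:

.. code-block:: console

   # One OpenMP thread per physical core (56 is a placeholder count)
   $ export OMP_NUM_THREADS=56

   # libiomp-specific knobs: pin threads to cores and let idle threads
   # sleep quickly instead of spinning
   $ export KMP_AFFINITY=granularity=fine,compact,1,0
   $ export KMP_BLOCKTIME=1
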
The Intel® OpenMP Runtime Library can be installed using one of these commands:

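One common route is the ``intel-openmp`` package, which ships ``libiomp5.so`` and can then be preloaded so that it is picked up instead of the default GNU ``libgomp``. The preload path below is a placeholder that depends on where the library lands in your environment:

.. code-block:: console

   # Install libiomp from PyPI (a conda package of the same name also exists)
   $ pip install intel-openmp

   # Preload libiomp so that PyTorch's OpenMP calls are served by it
   $ export LD_PRELOAD=<path-to>/libiomp5.so:$LD_PRELOAD
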
Knobs for applying or disabling optimizations are:

.. list-table::

   * - ``--disable-iomp``
     - bool
     - False
     - By default, the Intel® OpenMP library will be used if installed. Setting this flag disables the usage of Intel® OpenMP.

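For instance, to fall back to GNU ``libgomp`` even when ``libiomp`` is installed, the flag is passed straight to the launcher (``inference.py`` is again a placeholder):

.. code-block:: console

   $ python -m torch.backends.xeon.run_cpu --disable-iomp inference.py
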
Knobs for controlling instance number and compute resource allocation are also provided.

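As an illustration of how those knobs are typically combined, a multiple-instance throughput run might look like the sketch below; the instance and core counts are placeholders, and the exact flag spellings supported by your PyTorch version are listed by ``--help``:

.. code-block:: console

   # Sketch: 4 independent instances, each pinned to 14 physical cores
   $ python -m torch.backends.xeon.run_cpu --ninstances 4 --ncores-per-instance 14 inference.py
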
Conclusion
----------

In this tutorial, we explored a variety of advanced configurations and tools designed to optimize PyTorch inference performance on Intel® Xeon® Scalable Processors.
By leveraging the ``torch.backends.xeon.run_cpu`` script, we demonstrated how to fine-tune thread and memory management to achieve peak performance.
We covered essential concepts such as NUMA access control, optimized memory allocators like ``TCMalloc`` and ``JeMalloc``, and the use of Intel® OpenMP for efficient multithreading.

Additionally, we provided practical command-line examples to guide you through setting up single and multiple instance scenarios, ensuring optimal resource utilization tailored to specific workloads.
By understanding and applying these techniques, users can significantly enhance the efficiency and speed of their PyTorch applications on Intel® Xeon® platforms.