@@ -51,9 +51,9 @@ time for a specific job. Are they primarily engaged in computation, communication,
memory events, or are they idle? The temporal breakdown feature provides a detailed
analysis of the time spent in these three categories.

- 1) Idle time - GPU is idle.
- 2) Compute time - GPU is being used for matrix multiplications or vector operations.
- 3) Non-compute time - GPU is being used for communication or memory events.
+ * Idle time - GPU is idle.
+ * Compute time - GPU is being used for matrix multiplications or vector operations.
+ * Non-compute time - GPU is being used for communication or memory events.

To achieve high training efficiency, the code should maximize compute time and
minimize idle time and non-compute time. The following function generates a
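For reference, a minimal sketch of how the temporal breakdown is typically obtained through HTA's `TraceAnalysis` entry point; the trace directory below is an illustrative assumption, not a path from this tutorial:

.. code-block:: python

    # Hedged sketch: "traces/" is a hypothetical directory containing
    # PyTorch profiler traces collected for the job being analyzed.
    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/")

    # Returns a dataframe with the idle, compute, and non-compute time
    # per rank; visualize=True also plots the percentage breakdown.
    time_spent_df = analyzer.get_temporal_breakdown(visualize=True)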
@@ -82,20 +82,20 @@ reasons behind it can help guide optimization strategies. A GPU is
considered idle when no kernel is running on it. We have developed an
algorithm to categorize the `Idle` time into three distinct categories:

- #. **Host wait:** refers to the idle time on the GPU that is caused by
- the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized.
- These types of inefficiencies can be addressed by examining the CPU
- operators that are contributing to the slowdown, increasing the batch
- size and applying operator fusion.
+ * **Host wait:** refers to the idle time on the GPU that is caused by
+ the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized.
+ These types of inefficiencies can be addressed by examining the CPU
+ operators that are contributing to the slowdown, increasing the batch
+ size, and applying operator fusion.

- #. **Kernel wait:** This refers to brief overhead associated with launching
- consecutive kernels on the GPU. The idle time attributed to this category
- can be minimized by using CUDA Graph optimizations.
+ * **Kernel wait:** This refers to brief overhead associated with launching
+ consecutive kernels on the GPU. The idle time attributed to this category
+ can be minimized by using CUDA Graph optimizations.

- #. **Other wait:** This category includes idle time that cannot currently
- be attributed due to insufficient information. The likely causes include
- synchronization among CUDA streams using CUDA events and delays in launching
- kernels.
+ * **Other wait:** This category includes idle time that cannot currently
+ be attributed due to insufficient information. The likely causes include
+ synchronization among CUDA streams using CUDA events and delays in launching
+ kernels.

The host wait time can be interpreted as the time when the GPU is stalling due
to the CPU. To attribute the idle time as kernel wait, we use the following
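For orientation, a minimal sketch of requesting this idle-time breakdown via `TraceAnalysis`; the rank and trace directory are illustrative assumptions:

.. code-block:: python

    # Hedged sketch reusing the hypothetical "traces/" directory.
    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/")

    # Splits each GPU stream's idle time into host wait, kernel wait,
    # and other wait, optionally rendering the breakdown as a chart.
    idle_time_df = analyzer.get_idle_time_breakdown(ranks=[0], visualize=True)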
@@ -286,17 +286,13 @@ HTA also provides a summary of the memory copy bandwidth and queue length
counters as well as the time series of the counters for the profiled portion of
the code using the following API:

- * `get_memory_bw_summary
- <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_
+ * `get_memory_bw_summary <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_

- * `get_queue_length_summary
- <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_
+ * `get_queue_length_summary <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_

- * `get_memory_bw_time_series
- <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_
+ * `get_memory_bw_time_series <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_

- * `get_queue_length_time_series
- <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_
+ * `get_queue_length_time_series <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_

To view the summary and time series, use:
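A plausible sketch of invoking the four APIs listed above (the trace directory is again an illustrative assumption):

.. code-block:: python

    # Hedged sketch of the four counter APIs named in the list above;
    # "traces/" is an illustrative trace directory.
    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/")

    # Per-rank summary statistics for each counter.
    mem_bw_summary = analyzer.get_memory_bw_summary()
    queue_len_summary = analyzer.get_queue_length_summary()

    # Counter values over time for the profiled region.
    mem_bw_series = analyzer.get_memory_bw_time_series()
    queue_len_series = analyzer.get_queue_length_time_series()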
@@ -349,8 +345,8 @@ A screenshot of the generated dataframe is given below.
The duration of the CPU op, GPU kernel, and the launch delay allows us to find
the following:

- * **Short GPU kernels** - GPU kernels with duration less than the
- corresponding CPU runtime event.
+ * **Short GPU kernels** - GPU kernels with duration less than the corresponding
+ CPU runtime event.

* **Runtime event outliers** - CPU runtime events with excessive duration.
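As a rough sketch, this kernel-launch analysis is exposed through `get_cuda_kernel_launch_stats` on `TraceAnalysis`; the rank and trace directory shown are illustrative assumptions:

.. code-block:: python

    # Hedged sketch; "traces/" and ranks=[0] are illustrative choices.
    from hta.trace_analysis import TraceAnalysis

    analyzer = TraceAnalysis(trace_dir="traces/")

    # Correlates each GPU kernel with the CPU runtime event that launched
    # it and computes the launch delay, from which short GPU kernels and
    # runtime event outliers can be flagged.
    kernel_stats = analyzer.get_cuda_kernel_launch_stats(ranks=[0])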