Introduction to Holistic Trace Analysis
========================================

**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_

In this tutorial, we demonstrate how to use Holistic Trace Analysis (HTA) to
analyze traces from a distributed training job. To get started, follow the
steps below.

Installing HTA
~~~~~~~~~~~~~~

We recommend using a Conda environment to install HTA. To install Anaconda, see
`the official Anaconda documentation <https://docs.anaconda.com/anaconda/install/index.html>`_.

1. Install HTA using pip:

   .. code-block:: bash

      pip install HolisticTraceAnalysis

2. (Optional and recommended) Set up a Conda environment:

   .. code-block:: bash

      # create the environment env_name
      conda create -n env_name

      # activate the environment
      conda activate env_name

      # when you are done, deactivate the environment by running: conda deactivate
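
To confirm that the installation worked, you can run a quick import check from
the target environment. This is an optional sanity check, not part of the HTA
setup steps:

.. code-block:: python

   # optional sanity check: this import should succeed from the environment
   # in which HolisticTraceAnalysis was installed
   from hta.trace_analysis import TraceAnalysis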

Getting Started
~~~~~~~~~~~~~~~

Launch a Jupyter notebook and set the ``trace_dir`` variable to the location of the traces.

.. code-block:: python

   from hta.trace_analysis import TraceAnalysis
   trace_dir = "/path/to/folder/with/traces"
   analyzer = TraceAnalysis(trace_dir=trace_dir)

Temporal Breakdown
------------------

To effectively utilize the GPUs, it is crucial to understand how they are spending
time for a specific job. Are they primarily engaged in computation, communication,
memory events, or are they idle? The temporal breakdown feature provides a detailed
analysis of the time spent in these three categories:

1) Idle time - GPU is idle.
2) Compute time - GPU is being used for matrix multiplications or vector operations.
3) Non-compute time - GPU is being used for communication or memory events.

To achieve high training efficiency, the code should maximize compute time and
minimize idle time and non-compute time. The following function generates a
dataframe that provides a detailed breakdown of the temporal usage for each rank.

.. code-block:: python

   analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
   time_spent_df = analyzer.get_temporal_breakdown()

.. image:: ../_static/img/hta/temporal_breakdown_df.png

When the ``visualize`` argument is set to ``True`` in the `get_temporal_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_temporal_breakdown>`_
function, it also generates a bar graph representing the breakdown by rank.
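
The returned dataframe contains one row per rank, so a quick way to spot
stragglers is to sort by the idle-time column. The snippet below is a hedged
sketch; the exact column names may differ across HTA versions, so inspect
``time_spent_df.columns`` first:

.. code-block:: python

   # check which columns are available before relying on specific names
   print(time_spent_df.columns)

   # "idle_time_pctg" is an assumed column name used for illustration only;
   # substitute the idle-time column reported by the line above
   idle_col = "idle_time_pctg"
   print(time_spent_df.sort_values(idle_col, ascending=False).head())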

Idle Time Breakdown
-------------------

Gaining insight into the amount of time the GPU spends idle and the
reasons behind it can help guide optimization strategies. A GPU is
considered idle when no kernel is running on it. We have developed an
algorithm to categorize the idle time into three distinct categories:

#. **Host wait:** refers to the idle time on the GPU that is caused by
   the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized.
   These types of inefficiencies can be addressed by examining the CPU
   operators that are contributing to the slowdown, increasing the batch
   size, and applying operator fusion.

#. **Kernel wait:** refers to the brief overhead associated with launching
   consecutive kernels on the GPU. The idle time attributed to this category
   can be minimized by using CUDA Graph optimizations.

#. **Other wait:** includes idle time that cannot currently be attributed
   due to insufficient information. The likely causes include synchronization
   among CUDA streams using CUDA events and delays in launching kernels.
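
The idle time breakdown itself can be computed per rank with the
``get_idle_time_breakdown`` API from HTA's ``TraceAnalysis`` interface. The call
below is a minimal sketch and assumes the ``analyzer`` object created in the
Getting Started section:

.. code-block:: python

   # sketch: compute the idle time breakdown for each rank; the
   # ``visualize_pctg`` argument described in the tip below controls whether
   # the generated graph uses percentages or absolute time on the y-axis
   idle_time_df = analyzer.get_idle_time_breakdown()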

.. tip::

   By default, the idle time breakdown presents the percentage of each of the
   idle time categories. When the ``visualize_pctg`` argument is set to
   ``False``, the function renders the graph with absolute time on the y-axis.

Kernel Breakdown
----------------

The kernel breakdown feature breaks down the time spent for each kernel type,
namely communication (COMM), computation (COMP), and memory (MEM), across all
ranks and presents the proportion of time spent in each category. Here is the
percentage of time spent in each category shown as a pie chart:

.. image:: ../_static/img/hta/kernel_type_breakdown.png
   :align: center

The kernel breakdown can be calculated as follows:

.. code-block:: python

   kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown()

The first dataframe returned by the function contains the raw values used to
generate the pie chart.

Kernel Duration Distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The second dataframe returned by `get_gpu_kernel_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_gpu_kernel_breakdown>`_
contains duration summary statistics for each kernel. In particular, this
includes the count, min, max, average, standard deviation, sum, and kernel type
for each kernel on each rank.

.. image:: ../_static/img/hta/kernel_metrics_df.png

.. image:: ../_static/img/hta/pie_charts.png

.. tip::

   All images are generated using plotly. Hovering on the graph shows the
   mode bar on the top right, which allows the user to zoom, pan, select, and
   download the graph.

The pie charts above show the top 5 computation, communication, and memory
kernels. Similar pie charts are generated for each rank. The pie charts can be
configured to show the top k kernels using the ``num_kernels`` argument passed
to the `get_gpu_kernel_breakdown` function. Additionally, the

Communication Computation Overlap
---------------------------------

In distributed training, a significant amount of time is spent in communication
and synchronization events between GPUs. To achieve high GPU efficiency (that is,
high TFLOPS/GPU), it is crucial to keep the GPU oversubscribed with computation
kernels. In other words, the GPU should not be blocked due to unresolved data
dependencies. One way to measure the extent to which computation is blocked by
data dependencies is to calculate the communication computation overlap. Higher
GPU efficiency is observed if communication events overlap computation events.
Lack of communication and computation overlap will lead to the GPU being idle,
resulting in low efficiency. To sum up, a higher communication computation
overlap is desirable. To calculate the overlap percentage for each rank, we
measure the following ratio:

| **(time spent in computation while communicating) / (time spent in communication)**

The communication computation overlap can be calculated as follows:

.. code-block:: python
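
   # a minimal sketch, assuming HTA's ``get_comm_comp_overlap`` API and the
   # ``analyzer`` object created in the Getting Started section; the call
   # returns a dataframe with the overlap percentage for each rank
   overlap_df = analyzer.get_comm_comp_overlap()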

Memory Bandwidth & Queue Length Counters
----------------------------------------

The ``generate_trace_with_counters`` API outputs a new trace file with the
memory bandwidth and queue length counters. The new trace file contains tracks
which indicate the memory bandwidth used by memcpy/memset operations and tracks
for the queue length on each stream. By default, these counters are generated
using the rank 0 trace file, and the new file contains the suffix
``_with_counters`` in its name. Users have the option to generate the counters
for multiple ranks by using the ``ranks`` argument in the
``generate_trace_with_counters`` API.

.. code-block:: python
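
   # a minimal sketch, assuming default arguments: write a new trace file for
   # rank 0 with the memory bandwidth and queue length counters added
   # (the output file name carries the ``_with_counters`` suffix)
   analyzer.generate_trace_with_counters()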

HTA also provides a summary of the memory copy bandwidth and queue length
counters, as well as the time series of the counters for the profiled portion
of the code, using the following APIs:

* `get_memory_bw_summary
  <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_

* `get_queue_length_summary
  <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_

* `get_memory_bw_time_series
  <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_

* `get_queue_length_time_series
  <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_

To view the summary and time series, use:

.. code-block:: python
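
   # a minimal sketch of the four calls listed above, with default arguments
   mem_bw_summary = analyzer.get_memory_bw_summary()
   queue_len_summary = analyzer.get_queue_length_summary()
   mem_bw_series = analyzer.get_memory_bw_time_series()
   queue_len_series = analyzer.get_queue_length_time_series()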

The memory bandwidth and queue length time series functions return a dictionary
whose key is the rank and the value is the time series for that rank. By
default, the time series is computed for rank 0 only.

CUDA Kernel Launch Statistics
-----------------------------

.. image:: ../_static/img/hta/cuda_kernel_launch.png

For each event launched on the GPU, there is a corresponding scheduling event on
the CPU, such as ``CudaLaunchKernel``, ``CudaMemcpyAsync``, and ``CudaMemsetAsync``.
These events are linked by a common correlation ID in the trace; see the figure
above. This feature computes the duration of the CPU runtime event, its
corresponding GPU kernel, and the launch delay, that is, the difference between
the GPU kernel starting and the CPU operator ending. The kernel launch info can
be generated as follows:

.. code-block:: python
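
   # a minimal sketch, assuming default arguments: compute the CPU runtime
   # event duration, the corresponding GPU kernel duration, and the launch
   # delay for each kernel
   kernel_launch_stats = analyzer.get_cuda_kernel_launch_stats()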

A screenshot of the generated dataframe is given below.

The duration of the CPU op, GPU kernel, and the launch delay allow us to find
the following:

* **Short GPU kernels** - GPU kernels whose duration is less than that of the
  corresponding CPU runtime event.

* **Runtime event outliers** - CPU runtime events with excessive duration.

* **Launch delay outliers** - GPU kernels which take too long to be scheduled.

HTA generates distribution plots for each of the aforementioned three categories.

**Short GPU kernels**

Typically, the launch time on the CPU side ranges from 5-20 microseconds. In some
cases, the GPU execution time is lower than the launch time itself. The graph
below helps us find how frequently such instances occur in the code.

.. image:: ../_static/img/hta/short_gpu_kernels.png

For launch delay outliers, the `get_cuda_kernel_launch_stats` API provides the
``launch_delay_cutoff`` argument to configure the cutoff value.

.. image:: ../_static/img/hta/launch_delay_outliers.png


Conclusion
~~~~~~~~~~

In this tutorial, you have learned how to install and use HTA,
a performance tool that enables you to analyze bottlenecks in your distributed
training workflows. To learn how you can use the HTA tool to perform trace
diff analysis, see `Trace Diff using Holistic Trace Analysis <https://pytorch.org/tutorials/beginner/hta_trace_diff_tutorial.html>`__.