
Commit 50b58db

Editorial pass on the HTA tutorials
1 parent 765d428 commit 50b58db


2 files changed: +105, -93 lines changed


beginner_source/hta_intro_tutorial.rst

Lines changed: 92 additions & 82 deletions
@@ -1,76 +1,73 @@
 Introduction to Holistic Trace Analysis
 =======================================
-**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_

-Setup
------
+**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_

-In this tutorial we demonstrate how to use Holistic Trace Analysis (HTA) to
+In this tutorial, we demonstrate how to use Holistic Trace Analysis (HTA) to
 analyze traces from a distributed training job. To get started follow the steps
-below:
+below.

 Installing HTA
-^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~

 We recommend using a Conda environment to install HTA. To install Anaconda, see
-`here <https://docs.anaconda.com/anaconda/install/index.html>`_.
+`the official Anaconda documentation <https://docs.anaconda.com/anaconda/install/index.html>`_.

-1) Install HTA using pip
+1. Install HTA using pip:

-.. code-block:: python
+.. code-block:: python

-pip install HolisticTraceAnalysis
+pip install HolisticTraceAnalysis

-2) [Optional and recommended] Setup a conda environment
+2. (Optional and recommended) Set up a Conda environment:

-.. code-block:: python
+.. code-block:: python

-# create the environment env_name
-conda create -n env_name
+# create the environment env_name
+conda create -n env_name

-# activate the environment
-conda activate env_name
+# activate the environment
+conda activate env_name

-# deactivate the environment
-conda deactivate
+# When you are done, deactivate the environment by running ``conda deactivate``

-Getting started
-^^^^^^^^^^^^^^^
+Getting Started
+~~~~~~~~~~~~~~~

-Launch a jupyter notebook and set the ``trace_dir`` variable to the location of the traces.
+Launch a Jupyter notebook and set the ``trace_dir`` variable to the location of the traces.

 .. code-block:: python

-from hta.trace_analysis import TraceAnalysis
-trace_dir = "/path/to/folder/with/traces"
-analyzer = TraceAnalysis(trace_dir=trace_dir)
+from hta.trace_analysis import TraceAnalysis
+trace_dir = "/path/to/folder/with/traces"
+analyzer = TraceAnalysis(trace_dir=trace_dir)


 Temporal Breakdown
 ------------------

-To best utilize the GPUs it is vital to understand where the GPU is spending
-time for a given job. Is the GPU spending time on computation, communication,
-memory events, or is it idle? The temporal breakdown feature breaks down the
-time spent in three categories
+To effectively utilize the GPUs, it is crucial to understand how they are spending
+time for a specific job. Are they primarily engaged in computation, communication,
+memory events, or are they idle? The temporal breakdown feature provides a detailed
+analysis of the time spent in these three categories.

 1) Idle time - GPU is idle.
 2) Compute time - GPU is being used for matrix multiplications or vector operations.
 3) Non-compute time - GPU is being used for communication or memory events.

-To achieve high training efficiency the code should maximize compute time and
-minimize idle time and non-compute time. The function below returns
-a dataframe containing the temporal breakdown for each rank.
+To achieve high training efficiency, the code should maximize compute time and
+minimize idle time and non-compute time. The following function generates a
+dataframe that provides a detailed breakdown of the temporal usage for each rank.

 .. code-block:: python

-analyzer = TraceAnalysis(trace_dir = "/path/to/trace/folder")
-time_spent_df = analyzer.get_temporal_breakdown()
+analyzer = TraceAnalysis(trace_dir = "/path/to/trace/folder")
+time_spent_df = analyzer.get_temporal_breakdown()


 .. image:: ../_static/img/hta/temporal_breakdown_df.png

-When the ``visualize`` argument is set to True in the `get_temporal_breakdown
+When the ``visualize`` argument is set to ``True`` in the `get_temporal_breakdown
 <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_temporal_breakdown>`_
 function it also generates a bar graph representing the breakdown by rank.

@@ -79,21 +76,24 @@ function it also generates a bar graph representing the breakdown by rank.

 Idle Time Breakdown
 -------------------
-Understanding how much time the GPU is idle and its causes can help direct
-optimization strategies. A GPU is considered idle when no kernel is running on
-it. We developed an algorithm to categorize the Idle time into 3 categories:

-#. Host wait: is the idle duration on the GPU due to the CPU not enqueuing
-kernels fast enough to keep the GPU busy. These kinds of inefficiencies can
-be resolved by examining the CPU operators that are contributing to the slow
-down, increasing the batch size and applying operator fusion.
+Gaining insight into the amount of time the GPU spends idle and the
+reasons behind it can help guide optimization strategies. A GPU is
+considered idle when no kernel is running on it. We have developed an
+algorithm to categorize the `Idle` time into three distinct categories:
+
+#. **Host wait:** refers to the idle time on the GPU that is caused by
+the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized.
+These types of inefficiencies can be addressed by examining the CPU
+operators that are contributing to the slowdown, increasing the batch
+size and applying operator fusion.

-#. Kernel wait: constitutes the short overhead to launch consecutive kernels on
-the GPU. The idle time attributed to this category can be minimized by using
-CUDA Graph optimizations.
+#. **Kernel wait:** This refers to brief overhead associated with launching
+consecutive kernels on the GPU. The idle time attributed to this category
+can be minimized by using CUDA Graph optimizations.

-#. Other wait: Lastly, this category includes idle we could not currently
-attribute due to insufficient information. The likely causes include
+#. **Other wait:** This category includes idle time that cannot currently
+be attributed due to insufficient information. The likely causes include
 synchronization among CUDA streams using CUDA events and delays in launching
 kernels.

@@ -132,6 +132,7 @@ on each rank.
 :scale: 100%

 .. tip::
+
 By default, the idle time breakdown presents the percentage of each of the
 idle time categories. Setting the ``visualize_pctg`` argument to ``False``,
 the function renders with absolute time on the y-axis.
@@ -140,10 +141,10 @@ on each rank.
 Kernel Breakdown
 ----------------

-The kernel breakdown feature breaks down the time spent for each kernel type
-i.e. communication (COMM), computation (COMP), and memory (MEM) across all
-ranks and presents the proportion of time spent in each category. The
-percentage of time spent in each category as a pie chart.
+The kernel breakdown feature breaks down the time spent for each kernel type,
+such as communication (COMM), computation (COMP), and memory (MEM), across all
+ranks and presents the proportion of time spent in each category. Here is the
+percentage of time spent in each category as a pie chart:

 .. image:: ../_static/img/hta/kernel_type_breakdown.png
 :align: center
@@ -156,15 +157,15 @@ The kernel breakdown can be calculated as follows:
 kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown()

 The first dataframe returned by the function contains the raw values used to
-generate the Pie chart.
+generate the pie chart.

 Kernel Duration Distribution
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 The second dataframe returned by `get_gpu_kernel_breakdown
 <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_gpu_kernel_breakdown>`_
 contains duration summary statistics for each kernel. In particular, this
-includes the count, min, max, average, standard deviation, sum and kernel type
+includes the count, min, max, average, standard deviation, sum, and kernel type
 for each kernel on each rank.

 .. image:: ../_static/img/hta/kernel_metrics_df.png
@@ -181,11 +182,12 @@ bottlenecks.
 .. image:: ../_static/img/hta/pie_charts.png

 .. tip::
+
 All images are generated using plotly. Hovering on the graph shows the
-mode bar on the top right which allows the user to zoom, pan, select and
+mode bar on the top right which allows the user to zoom, pan, select, and
 download the graph.

-The pie charts above shows the top 5 computation, communication and memory
+The pie charts above show the top 5 computation, communication, and memory
 kernels. Similar pie charts are generated for each rank. The pie charts can be
 configured to show the top k kernels using the ``num_kernels`` argument passed
 to the `get_gpu_kernel_breakdown` function. Additionally, the
@@ -212,21 +214,21 @@ in the examples folder of the repo.
 Communication Computation Overlap
 ---------------------------------

-In distributed training a significant amount of time is spent in communication
-and synchronization events between GPUs. To achieve high GPU efficiency (i.e.
-TFLOPS/GPU) it is vital to keep the GPU oversubscribed with computation
+In distributed training, a significant amount of time is spent in communication
+and synchronization events between GPUs. To achieve high GPU efficiency (such as
+TFLOPS/GPU), it is crucial to keep the GPU oversubscribed with computation
 kernels. In other words, the GPU should not be blocked due to unresolved data
 dependencies. One way to measure the extent to which computation is blocked by
 data dependencies is to calculate the communication computation overlap. Higher
 GPU efficiency is observed if communication events overlap computation events.
 Lack of communication and computation overlap will lead to the GPU being idle,
-thus the efficiency would be low. To sum up, higher communication computation
-overlap is desirable. To calculate the overlap percentage for each rank we
-measure the following ratio:
+resulting in low efficiency.
+To sum up, a higher communication computation overlap is desirable. To calculate
+the overlap percentage for each rank, we measure the following ratio:

 | **(time spent in computation while communicating) / (time spent in communication)**

-Communication computation overlap can be calculated as follows:
+The communication computation overlap can be calculated as follows:

 .. code-block:: python

@@ -266,9 +268,9 @@ API outputs a new trace file with the memory bandwidth and queue length
 counters. The new trace file contains tracks which indicate the memory
 bandwidth used by memcpy/memset operations and tracks for the queue length on
 each stream. By default, these counters are generated using the rank 0
-trace file and the new file contains the suffix ``_with_counters`` in its name.
+trace file, and the new file contains the suffix ``_with_counters`` in its name.
 Users have the option to generate the counters for multiple ranks by using the
-``ranks`` argument in the `generate_trace_with_counters` API.
+``ranks`` argument in the ``generate_trace_with_counters`` API.

 .. code-block:: python

@@ -284,19 +286,19 @@ HTA also provides a summary of the memory copy bandwidth and queue length
 counters as well as the time series of the counters for the profiled portion of
 the code using the following API:

-#. `get_memory_bw_summary
+* `get_memory_bw_summary
 <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_

-#. `get_queue_length_summary
+* `get_queue_length_summary
 <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_

-#. `get_memory_bw_time_series
+* `get_memory_bw_time_series
 <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_

-#. `get_queue_length_time_series
+* `get_queue_length_time_series
 <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_

-To view the summary and time series use:
+To view the summary and time series, use:

 .. code-block:: python

@@ -321,17 +323,16 @@ bandwidth and queue length time series functions return a dictionary whose key
 is the rank and the value is the time series for that rank. By default, the
 time series is computed for rank 0 only.

-
 CUDA Kernel Launch Statistics
 -----------------------------

 .. image:: ../_static/img/hta/cuda_kernel_launch.png

-For each event launched on the GPU there is a corresponding scheduling event on
-the CPU e.g. CudaLaunchKernel, CudaMemcpyAsync, CudaMemsetAsync. These events
-are linked by a common correlation id in the trace. See figure above. This
-feature computes the duration of the CPU runtime event, its corresponding GPU
-kernel and the launch delay i.e. the difference between GPU kernel starting and
+For each event launched on the GPU, there is a corresponding scheduling event on
+the CPU, such as ``CudaLaunchKernel``, ``CudaMemcpyAsync``, ``CudaMemsetAsync``.
+These events are linked by a common correlation ID in the trace - see the figure
+above. This feature computes the duration of the CPU runtime event, its corresponding GPU
+kernel and the launch delay, for example, the difference between GPU kernel starting and
 CPU operator ending. The kernel launch info can be generated as follows:

 .. code-block:: python
@@ -345,23 +346,23 @@ A screenshot of the generated dataframe is given below.
 :scale: 100%
 :align: center

-The duration of the CPU op, GPU kernel and the launch delay allows us to find:
+The duration of the CPU op, GPU kernel, and the launch delay allow us to find
+the following:

-#. **Short GPU kernels** - GPU kernels with duration less than the
+* **Short GPU kernels** - GPU kernels with duration less than the
 corresponding CPU runtime event.

-#. **Runtime event outliers** - CPU runtime events with excessive duration.
+* **Runtime event outliers** - CPU runtime events with excessive duration.

-#. **Launch delay outliers** - GPU kernels which take too long to be scheduled.
+* **Launch delay outliers** - GPU kernels which take too long to be scheduled.

 HTA generates distribution plots for each of the aforementioned three categories.

-
 **Short GPU kernels**

-Usually, the launch time on the CPU side is between 5-20 microseconds. In some
-cases the GPU execution time is lower than the launch time itself. The graph
-below allows us to find how frequently such instances appear in the code.
+Typically, the launch time on the CPU side ranges from 5-20 microseconds. In some
+cases, the GPU execution time is lower than the launch time itself. The graph
+below helps us to find how frequently such instances occur in the code.

 .. image:: ../_static/img/hta/short_gpu_kernels.png

@@ -382,3 +383,12 @@ hence the `get_cuda_kernel_launch_stats` API provides the
 ``launch_delay_cutoff`` argument to configure the value.

 .. image:: ../_static/img/hta/launch_delay_outliers.png
+
+
+Conclusion
+~~~~~~~~~~
+
+In this tutorial, you have learned how to install and use HTA,
+a performance tool that enables you analyze bottlenecks in your distributed
+training workflows. To learn how you can use the HTA tool to perform trace
+diff analysis, see `Trace Diff using Holistic Trace Analysis <https://pytorch.org/tutorials/beginner/hta_trace_diff_tutorial.html>`__.
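
For orientation, here is a minimal sketch of how the ``TraceAnalysis`` calls named in this tutorial fit together. The trace directory is a placeholder, the variable names are illustrative, and the argument values shown (``visualize``, ``num_kernels``, ``ranks``, ``launch_delay_cutoff``) simply exercise the arguments the tutorial text mentions rather than prescribing defaults.

.. code-block:: python

    from hta.trace_analysis import TraceAnalysis

    # Placeholder path: a folder containing one collected trace file per rank.
    analyzer = TraceAnalysis(trace_dir="/path/to/folder/with/traces")

    # Temporal breakdown (idle, compute, non-compute time) per rank;
    # visualize=True also renders the bar graph described in the tutorial.
    time_spent_df = analyzer.get_temporal_breakdown(visualize=True)

    # Kernel type breakdown plus per-kernel duration statistics;
    # num_kernels controls the top-k kernels shown in the pie charts.
    kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown(num_kernels=5)

    # Write a new trace file with memory bandwidth and queue length counters
    # (rank 0 by default; the ranks argument selects other ranks).
    analyzer.generate_trace_with_counters(ranks=[0])

    # Counter summaries and time series for the profiled portion of the code.
    mem_bw_summary = analyzer.get_memory_bw_summary()
    queue_len_summary = analyzer.get_queue_length_summary()
    mem_bw_series = analyzer.get_memory_bw_time_series()
    queue_len_series = analyzer.get_queue_length_time_series()

    # CPU launch duration, GPU kernel duration, and launch delay per kernel;
    # launch_delay_cutoff tunes the launch delay outlier threshold.
    kernel_launch_stats = analyzer.get_cuda_kernel_launch_stats(launch_delay_cutoff=100)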

beginner_source/hta_trace_diff_tutorial.rst

Lines changed: 13 additions & 11 deletions
@@ -1,25 +1,27 @@
 Trace Diff using Holistic Trace Analysis
 ========================================
-**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_

+**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_

 Occasionally, users need to identify the changes in PyTorch operators and CUDA
-kernels resulting from a code change. To support such a requirement, HTA
+kernels resulting from a code change. To support this requirement, HTA
 provides a trace comparison feature. This feature allows the user to input two
 sets of trace files where the first can be thought of as the *control group*
-and the second as the *test group* as in an A/B test. The ``TraceDiff`` class
+and the second as the *test group*, similar to an A/B test. The ``TraceDiff`` class
 provides functions to compare the differences between traces and functionality
 to visualize these differences. In particular, users can find operators and
-kernels which were added and removed from each group along with the frequency
+kernels that were added and removed from each group, along with the frequency
 of each operator/kernel and the cumulative time taken by the operator/kernel.
-The `TraceDiff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html>`_ class has 4 methods:

-#. `compare_traces
+The `TraceDiff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html>`_ class
+has the following methods:
+
+* `compare_traces
 <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.compare_traces>`_ -
 Compare the frequency and total duration of CPU operators and GPU kernels from
 two sets of traces.

-#. `ops_diff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.ops_diff>`_ -
+* `ops_diff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.ops_diff>`_ -
 Get the operators and kernels which have been:

 #. **added** to the test trace and are absent in the control trace
@@ -28,17 +30,17 @@ The `TraceDiff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.h
 #. **decreased** in frequency in the test trace and exist in the control trace
 #. **unchanged** between the two sets of traces

-#. `visualize_counts_diff
+* `visualize_counts_diff
 <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.visualize_counts_diff>`_

-#. `visualize_duration_diff
+* `visualize_duration_diff
 <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.visualize_duration_diff>`_

 The last two methods can be used to visualize various changes in frequency and
 duration of CPU operators and GPU kernels, using the output of the
 ``compare_traces`` method.

-E.g. The top 10 operators with increase in frequency can be computed as
+For example, the top ten operators with increase in frequency can be computed as
 follows:

 .. code-block:: python
@@ -48,7 +50,7 @@ follows:

 .. image:: ../_static/img/hta/counts_diff.png

-Similarly, the top 10 ops with the largest change in duration can be computed as
+Similarly, the top ten operators with the largest change in duration can be computed as
 follows:

 .. code-block:: python

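As a rough, hypothetical sketch of the ``TraceDiff`` workflow this tutorial describes: the placeholder paths and variable names below are illustrative, and the call style (the four methods invoked on ``TraceDiff`` with the control and test traces as inputs) is an assumption based on the linked API documentation rather than something shown in the diff above.

.. code-block:: python

    from hta.trace_analysis import TraceAnalysis
    from hta.trace_diff import TraceDiff

    # Placeholder paths: one folder of traces per experiment (A/B style).
    control = TraceAnalysis(trace_dir="/path/to/control/traces")
    test = TraceAnalysis(trace_dir="/path/to/test/traces")

    # Frequency and total duration of CPU operators and GPU kernels in each group.
    compare_df = TraceDiff.compare_traces(control, test)

    # Operators/kernels that were added, deleted, increased, decreased, or unchanged.
    ops_delta = TraceDiff.ops_diff(control, test)

    # Visualize the frequency and duration changes computed by compare_traces.
    TraceDiff.visualize_counts_diff(compare_df)
    TraceDiff.visualize_duration_diff(compare_df)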