Commit 765d428

anupambhatnagar and anupamb authored

Added tutorials for Holistic Trace Analysis (#2723)

* HTA tutorial
* rename file
* added trace diff tutorial
* fix ci errors

Co-authored-by: anupamb <anupamb@devgpu020.odn1.facebook.com>

1 parent 0fe2a10 commit 765d428

24 files changed: +490 -0 lines changed
_static/img/hta/: new image assets (plots and dataframe screenshots referenced by the tutorial below), including comm_across_ranks.png, counts_diff.png, duration_diff.png, idle_time.png, idle_time_summary.png, kernel_metrics_df.png, overlap_df.png, overlap_plot.png, pie_charts.png, runtime_outliers.png, short_gpu_kernels.png, and others.
Lines changed: 384 additions & 0 deletions
Introduction to Holistic Trace Analysis
=======================================
**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_

Setup
-----

In this tutorial, we demonstrate how to use Holistic Trace Analysis (HTA) to
analyze traces from a distributed training job. To get started, follow the
steps below:

Installing HTA
^^^^^^^^^^^^^^

We recommend using a Conda environment to install HTA. To install Anaconda, see
`here <https://docs.anaconda.com/anaconda/install/index.html>`_.

1) Install HTA using pip

.. code-block:: bash

    pip install HolisticTraceAnalysis

2) [Optional and recommended] Set up a Conda environment

.. code-block:: bash

    # create the environment env_name
    conda create -n env_name

    # activate the environment
    conda activate env_name

    # deactivate the environment
    conda deactivate

Getting started
^^^^^^^^^^^^^^^

Launch a Jupyter notebook and set the ``trace_dir`` variable to the location of the traces.

.. code-block:: python

    from hta.trace_analysis import TraceAnalysis
    trace_dir = "/path/to/folder/with/traces"
    analyzer = TraceAnalysis(trace_dir=trace_dir)


Temporal Breakdown
------------------

To best utilize the GPUs, it is vital to understand where the GPU is spending
time for a given job. Is the GPU spending time on computation, communication,
memory events, or is it idle? The temporal breakdown feature breaks down the
time spent into three categories:

1) Idle time - GPU is idle.
2) Compute time - GPU is being used for matrix multiplications or vector operations.
3) Non-compute time - GPU is being used for communication or memory events.

To achieve high training efficiency, the code should maximize compute time and
minimize idle time and non-compute time. The function below returns
a dataframe containing the temporal breakdown for each rank.

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    time_spent_df = analyzer.get_temporal_breakdown()

.. image:: ../_static/img/hta/temporal_breakdown_df.png

When the ``visualize`` argument is set to ``True`` in the `get_temporal_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_temporal_breakdown>`_
function, it also generates a bar graph representing the breakdown by rank.

.. image:: ../_static/img/hta/temporal_breakdown_plot.png

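As a quick illustration of the argument described above, the call below explicitly requests the per-rank bar graph alongside the dataframe (a minimal sketch; the trace path is a placeholder):

.. code-block:: python

    # request the per-rank bar graph in addition to the dataframe
    time_spent_df = analyzer.get_temporal_breakdown(visualize=True)
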
Idle Time Breakdown
-------------------

Understanding how much time the GPU is idle and its causes can help direct
optimization strategies. A GPU is considered idle when no kernel is running on
it. We developed an algorithm to categorize the idle time into three categories:

#. Host wait: the idle time on the GPU caused by the CPU not enqueuing kernels
   fast enough to keep the GPU busy. These kinds of inefficiencies can be
   resolved by examining the CPU operators that are contributing to the
   slowdown, increasing the batch size, and applying operator fusion.

#. Kernel wait: the short overhead of launching consecutive kernels on the GPU.
   The idle time attributed to this category can be minimized by using CUDA
   Graph optimizations.

#. Other wait: idle time that cannot currently be attributed due to
   insufficient information. The likely causes include synchronization among
   CUDA streams using CUDA events and delays in launching kernels.

The host wait time can be interpreted as the time when the GPU is stalling due
to the CPU. To attribute the idle time as kernel wait we use the following
heuristic:

| **gap between consecutive kernels < threshold**

The default threshold value is 30 nanoseconds and can be configured using the
``consecutive_kernel_delay`` argument. By default, the idle time breakdown is
computed for rank 0 only. In order to calculate the breakdown for other ranks,
use the ``ranks`` argument in the `get_idle_time_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_idle_time_breakdown>`_
function. The idle time breakdown can be generated as follows:

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    idle_time_df = analyzer.get_idle_time_breakdown()

.. image:: ../_static/img/hta/idle_time_breakdown_percentage.png

The function returns a tuple of dataframes. The first dataframe contains the
idle time by category on each stream for each rank.

.. image:: ../_static/img/hta/idle_time.png
   :scale: 100%
   :align: center

The second dataframe is generated when ``show_idle_interval_stats`` is set to
``True``. It contains the summary statistics of the idle time for each stream
on each rank.

.. image:: ../_static/img/hta/idle_time_summary.png
   :scale: 100%

.. tip::
   By default, the idle time breakdown presents the percentage of each of the
   idle time categories. When the ``visualize_pctg`` argument is set to
   ``False``, the function renders the graph with absolute time on the y-axis.

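Putting the arguments mentioned above together, here is a minimal sketch (the rank list and threshold are illustrative values, and the two-dataframe unpacking assumes ``show_idle_interval_stats=True`` as described above):

.. code-block:: python

    # idle time breakdown for ranks 0 and 1, with interval statistics
    # and absolute times on the y-axis instead of percentages
    idle_time_df, idle_interval_stats_df = analyzer.get_idle_time_breakdown(
        ranks=[0, 1],
        consecutive_kernel_delay=30,    # kernel wait threshold (nanoseconds)
        show_idle_interval_stats=True,
        visualize_pctg=False,
    )
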
Kernel Breakdown
----------------

The kernel breakdown feature breaks down the time spent for each kernel type,
i.e. communication (COMM), computation (COMP), and memory (MEM), across all
ranks and presents the proportion of time spent in each category. The
percentage of time spent in each category is presented as a pie chart.

.. image:: ../_static/img/hta/kernel_type_breakdown.png
   :align: center

The kernel breakdown can be calculated as follows:

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown()

The first dataframe returned by the function contains the raw values used to
generate the pie chart.

Kernel Duration Distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The second dataframe returned by `get_gpu_kernel_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_gpu_kernel_breakdown>`_
contains duration summary statistics for each kernel. In particular, this
includes the count, min, max, average, standard deviation, sum, and kernel type
for each kernel on each rank.

.. image:: ../_static/img/hta/kernel_metrics_df.png
   :align: center

Using this data, HTA creates many visualizations to identify performance
bottlenecks.

#. Pie charts of the top kernels for each kernel type for each rank.

#. Bar graphs of the average duration across all ranks for each of the top
   kernels and for each kernel type.

.. image:: ../_static/img/hta/pie_charts.png

.. tip::
   All images are generated using plotly. Hovering on the graph shows the
   mode bar on the top right which allows the user to zoom, pan, select, and
   download the graph.

The pie charts above show the top 5 computation, communication, and memory
kernels. Similar pie charts are generated for each rank. The pie charts can be
configured to show the top k kernels using the ``num_kernels`` argument passed
to the `get_gpu_kernel_breakdown` function. Additionally, the
``duration_ratio`` argument can be used to tune the percentage of time that
needs to be analyzed. If both ``num_kernels`` and ``duration_ratio`` are
specified, then ``num_kernels`` takes precedence.

.. image:: ../_static/img/hta/comm_across_ranks.png

The bar graph above shows the average duration of the NCCL AllReduce kernel
across all the ranks. The black lines indicate the minimum and maximum time
taken on each rank.

.. warning::
   When using jupyter-lab, set the "image_renderer" argument value to
   "jupyterlab"; otherwise the graphs will not render in the notebook.

For a detailed walkthrough of this feature see the `gpu_kernel_breakdown
notebook
<https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/examples/kernel_breakdown_demo.ipynb>`_
in the examples folder of the repo.

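For instance, a minimal sketch combining the arguments mentioned above (the values are illustrative):

.. code-block:: python

    # show the top 10 kernels per kernel type and render the plots in jupyter-lab
    kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown(
        num_kernels=10,               # takes precedence over duration_ratio
        image_renderer="jupyterlab",  # required for graphs to render in jupyter-lab
    )
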
Communication Computation Overlap
---------------------------------

In distributed training a significant amount of time is spent in communication
and synchronization events between GPUs. To achieve high GPU efficiency (i.e.
TFLOPS/GPU) it is vital to keep the GPU oversubscribed with computation
kernels. In other words, the GPU should not be blocked due to unresolved data
dependencies. One way to measure the extent to which computation is blocked by
data dependencies is to calculate the communication computation overlap. Higher
GPU efficiency is observed if communication events overlap computation events.
A lack of overlap between communication and computation leaves the GPU idle,
resulting in low efficiency. To sum up, a higher communication computation
overlap is desirable. To calculate the overlap percentage for each rank, we
measure the following ratio:

| **(time spent in computation while communicating) / (time spent in communication)**

The communication computation overlap can be calculated as follows:

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    overlap_df = analyzer.get_comm_comp_overlap()

The function returns a dataframe containing the overlap percentage
for each rank.

.. image:: ../_static/img/hta/overlap_df.png
   :align: center
   :scale: 50%

When the ``visualize`` argument is set to ``True``, the `get_comm_comp_overlap
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_comm_comp_overlap>`_
function also generates a bar graph representing the overlap by rank.

.. image:: ../_static/img/hta/overlap_plot.png

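Once the dataframe is available, it can be filtered to flag stragglers. Below is a hypothetical post-processing sketch; the column name ``comp_comm_overlap_pctg`` and the 50% threshold are assumptions for illustration, not part of the documented API:

.. code-block:: python

    # hypothetical column name; inspect overlap_df.columns for the actual one
    low_overlap_ranks = overlap_df[overlap_df["comp_comm_overlap_pctg"] < 50.0]
    print(low_overlap_ranks)
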
Augmented Counters
------------------

Memory Bandwidth & Queue Length Counters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Memory bandwidth counters measure the memory copy bandwidth used by memory copy
(memcpy) and memory set (memset) events while copying data host to device
(H2D), device to host (D2H), and device to device (D2D). HTA also computes the
number of outstanding operations on each CUDA stream. We refer to this as
**queue length**. When the queue length on a stream is 1024 or larger, new
events cannot be scheduled on that stream and the CPU will stall until the
events on the GPU stream have been processed.

The `generate_trace_with_counters
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.generate_trace_with_counters>`_
API outputs a new trace file with the memory bandwidth and queue length
counters. The new trace file contains tracks which indicate the memory
bandwidth used by memcpy/memset operations and tracks for the queue length on
each stream. By default, these counters are generated using the rank 0
trace file and the new file contains the suffix ``_with_counters`` in its name.
Users have the option to generate the counters for multiple ranks by using the
``ranks`` argument in the `generate_trace_with_counters` API.

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    analyzer.generate_trace_with_counters()

A screenshot of the generated trace file with augmented counters is shown below.

.. image:: ../_static/img/hta/mem_bandwidth_queue_length.png
   :scale: 100%

HTA also provides a summary of the memory copy bandwidth and queue length
counters as well as the time series of the counters for the profiled portion of
the code using the following API:

#. `get_memory_bw_summary
   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_

#. `get_queue_length_summary
   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_

#. `get_memory_bw_time_series
   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_

#. `get_queue_length_time_series
   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_

To view the summary and time series use:

.. code-block:: python

    # generate summary
    mem_bw_summary = analyzer.get_memory_bw_summary()
    queue_len_summary = analyzer.get_queue_length_summary()

    # get time series
    mem_bw_series = analyzer.get_memory_bw_time_series()
    queue_len_series = analyzer.get_queue_length_time_series()

The summary contains the count, min, max, mean, standard deviation, and the
25th, 50th, and 75th percentiles.

.. image:: ../_static/img/hta/queue_length_summary.png
   :scale: 100%
   :align: center

The time series only contains the points when a value changes. Once a value is
observed the time series stays constant until the next update. The memory
bandwidth and queue length time series functions return a dictionary whose key
is the rank and whose value is the time series for that rank. By default, the
time series is computed for rank 0 only.

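A minimal sketch of generating counters for more than one rank and reading a single rank's series from the returned dictionary (the rank list is illustrative):

.. code-block:: python

    # write counter-augmented trace files for ranks 0 and 1
    analyzer.generate_trace_with_counters(ranks=[0, 1])

    # the time series functions return a dict keyed by rank
    mem_bw_series = analyzer.get_memory_bw_time_series()
    rank0_mem_bw = mem_bw_series[0]
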
CUDA Kernel Launch Statistics
-----------------------------

.. image:: ../_static/img/hta/cuda_kernel_launch.png

For each event launched on the GPU there is a corresponding scheduling event on
the CPU, e.g. CudaLaunchKernel, CudaMemcpyAsync, CudaMemsetAsync. These events
are linked by a common correlation id in the trace; see the figure above. This
feature computes the duration of the CPU runtime event, its corresponding GPU
kernel, and the launch delay, i.e. the difference between the GPU kernel start
and the end of the CPU operator. The kernel launch info can be generated as
follows:

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/dir")
    kernel_info_df = analyzer.get_cuda_kernel_launch_stats()

A screenshot of the generated dataframe is given below.

.. image:: ../_static/img/hta/cuda_kernel_launch_stats.png
   :scale: 100%
   :align: center

The durations of the CPU op, the GPU kernel, and the launch delay allow us to find:

#. **Short GPU kernels** - GPU kernels with duration less than the
   corresponding CPU runtime event.

#. **Runtime event outliers** - CPU runtime events with excessive duration.

#. **Launch delay outliers** - GPU kernels which take too long to be scheduled.

HTA generates distribution plots for each of the aforementioned three categories.


**Short GPU kernels**

Usually, the launch time on the CPU side is between 5 and 20 microseconds. In
some cases the GPU execution time is lower than the launch time itself. The
graph below allows us to find how frequently such instances appear in the code.

.. image:: ../_static/img/hta/short_gpu_kernels.png


**Runtime event outliers**

The runtime outliers depend on the cutoff used to classify the outliers, hence
the `get_cuda_kernel_launch_stats
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_cuda_kernel_launch_stats>`_
API provides the ``runtime_cutoff`` argument to configure the value.

.. image:: ../_static/img/hta/runtime_outliers.png

**Launch delay outliers**

The launch delay outliers depend on the cutoff used to classify the outliers,
hence the `get_cuda_kernel_launch_stats` API provides the
``launch_delay_cutoff`` argument to configure the value.

.. image:: ../_static/img/hta/launch_delay_outliers.png
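
Both cutoffs can be passed in the same call; a minimal sketch (the values are illustrative; consult the API reference for the default values and units):

.. code-block:: python

    # tighten the outlier definitions used by the distribution plots
    kernel_info_df = analyzer.get_cuda_kernel_launch_stats(
        runtime_cutoff=50,        # cutoff for runtime event outliers
        launch_delay_cutoff=100,  # cutoff for launch delay outliers
    )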
