Commit ede7870

Formatting fixes
1 parent 50b58db commit ede7870

2 files changed, 28 additions and 35 deletions

beginner_source/hta_intro_tutorial.rst

Lines changed: 21 additions & 25 deletions
@@ -51,9 +51,9 @@ time for a specific job. Are they primarily engaged in computation, communicatio
 memory events, or are they idle? The temporal breakdown feature provides a detailed
 analysis of the time spent in these three categories.

-1) Idle time - GPU is idle.
-2) Compute time - GPU is being used for matrix multiplications or vector operations.
-3) Non-compute time - GPU is being used for communication or memory events.
+* Idle time - GPU is idle.
+* Compute time - GPU is being used for matrix multiplications or vector operations.
+* Non-compute time - GPU is being used for communication or memory events.

 To achieve high training efficiency, the code should maximize compute time and
 minimize idle time and non-compute time. The following function generates a
@@ -82,20 +82,20 @@ reasons behind it can help guide optimization strategies. A GPU is
 considered idle when no kernel is running on it. We have developed an
 algorithm to categorize the `Idle` time into three distinct categories:

-#. **Host wait:** refers to the idle time on the GPU that is caused by
-the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized.
-These types of inefficiencies can be addressed by examining the CPU
-operators that are contributing to the slowdown, increasing the batch
-size and applying operator fusion.
+* **Host wait:** refers to the idle time on the GPU that is caused by
+the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized.
+These types of inefficiencies can be addressed by examining the CPU
+operators that are contributing to the slowdown, increasing the batch
+size and applying operator fusion.

-#. **Kernel wait:** This refers to brief overhead associated with launching
-consecutive kernels on the GPU. The idle time attributed to this category
-can be minimized by using CUDA Graph optimizations.
+* **Kernel wait:** This refers to brief overhead associated with launching
+consecutive kernels on the GPU. The idle time attributed to this category
+can be minimized by using CUDA Graph optimizations.

-#. **Other wait:** This category includes idle time that cannot currently
-be attributed due to insufficient information. The likely causes include
-synchronization among CUDA streams using CUDA events and delays in launching
-kernels.
+* **Other wait:** This category includes idle time that cannot currently
+be attributed due to insufficient information. The likely causes include
+synchronization among CUDA streams using CUDA events and delays in launching
+kernels.

 The host wait time can be interpreted as the time when the GPU is stalling due
 to the CPU. To attribute the idle time as kernel wait we use the following
@@ -286,17 +286,13 @@ HTA also provides a summary of the memory copy bandwidth and queue length
 counters as well as the time series of the counters for the profiled portion of
 the code using the following API:

-* `get_memory_bw_summary
-<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_
+* `get_memory_bw_summary <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_

-* `get_queue_length_summary
-<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_
+* `get_queue_length_summary <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_

-* `get_memory_bw_time_series
-<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_
+* `get_memory_bw_time_series <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_

-* `get_queue_length_time_series
-<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_
+* `get_queue_length_time_series <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_

 To view the summary and time series, use:

@@ -349,8 +345,8 @@ A screenshot of the generated dataframe is given below.
 The duration of the CPU op, GPU kernel, and the launch delay allow us to find
 the following:

-* **Short GPU kernels** - GPU kernels with duration less than the
-corresponding CPU runtime event.
+* **Short GPU kernels** - GPU kernels with duration less than the corresponding
+CPU runtime event.

 * **Runtime event outliers** - CPU runtime events with excessive duration.

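
For readers of this diff, the two categories in the last hunk can be illustrated with a small sketch. It does not call HTA. The `LaunchRecord` type, its field names, and the `cutoff_us` threshold are hypothetical and chosen only for illustration, and the tutorial text shown here does not say how "excessive duration" is defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaunchRecord:
    """One CPU runtime event paired with the GPU kernel it launched (hypothetical type)."""
    name: str
    cpu_duration_us: float   # duration of the CPU runtime event
    gpu_duration_us: float   # duration of the launched GPU kernel
    launch_delay_us: float   # launch delay; kept for completeness, unused below


def short_gpu_kernels(records: List[LaunchRecord]) -> List[LaunchRecord]:
    """Kernels whose GPU duration is less than the corresponding CPU runtime event."""
    return [r for r in records if r.gpu_duration_us < r.cpu_duration_us]


def runtime_event_outliers(records: List[LaunchRecord], cutoff_us: float) -> List[LaunchRecord]:
    """CPU runtime events longer than an assumed, caller-supplied cutoff."""
    return [r for r in records if r.cpu_duration_us > cutoff_us]
```

Passing the cutoff explicitly keeps the assumption visible. A real analysis would take the durations from the trace rather than from hand-built records.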

beginner_source/hta_trace_diff_tutorial.rst

Lines changed: 7 additions & 10 deletions
@@ -16,25 +16,22 @@ of each operator/kernel and the cumulative time taken by the operator/kernel.
 The `TraceDiff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html>`_ class
 has the following methods:

-* `compare_traces
-<https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.compare_traces>`_ -
-Compare the frequency and total duration of CPU operators and GPU kernels from
-two sets of traces.
+* `compare_traces <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.compare_traces>`_:
+Compare the frequency and total duration of CPU operators and GPU kernels from
+two sets of traces.

-* `ops_diff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.ops_diff>`_ -
-Get the operators and kernels which have been:
+* `ops_diff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.ops_diff>`_:
+Get the operators and kernels which have been:

 #. **added** to the test trace and are absent in the control trace
 #. **deleted** from the test trace and are present in the control trace
 #. **increased** in frequency in the test trace and exist in the control trace
 #. **decreased** in frequency in the test trace and exist in the control trace
 #. **unchanged** between the two sets of traces

-* `visualize_counts_diff
-<https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.visualize_counts_diff>`_
+* `visualize_counts_diff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.visualize_counts_diff>`_

-* `visualize_duration_diff
-<https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.visualize_duration_diff>`_
+* `visualize_duration_diff <https://hta.readthedocs.io/en/latest/source/api/trace_diff_api.html#hta.trace_diff.TraceDiff.visualize_duration_diff>`_

 The last two methods can be used to visualize various changes in frequency and
 duration of CPU operators and GPU kernels, using the output of the
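
The five `ops_diff` categories listed in the hunk above can be illustrated with a minimal sketch over plain dictionaries of operator counts. It does not use HTA or reproduce its implementation. The function name `classify_ops` and the dictionary-based input are assumptions made for illustration only.

```python
from typing import Dict, List


def classify_ops(control_counts: Dict[str, int],
                 test_counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Group operator/kernel names by how their frequency changed (illustrative only)."""
    result: Dict[str, List[str]] = {
        "added": [], "deleted": [], "increased": [], "decreased": [], "unchanged": [],
    }
    # Visit every name that appears in either trace, in a stable order.
    for name in sorted(set(control_counts) | set(test_counts)):
        control = control_counts.get(name, 0)
        test = test_counts.get(name, 0)
        if control == 0 and test > 0:
            result["added"].append(name)       # absent in control, present in test
        elif test == 0 and control > 0:
            result["deleted"].append(name)     # present in control, absent in test
        elif test > control:
            result["increased"].append(name)   # in both, more frequent in test
        elif test < control:
            result["decreased"].append(name)   # in both, less frequent in test
        else:
            result["unchanged"].append(name)   # same frequency in both
    return result
```

For example, an operator counted 4 times in the control trace and 6 times in the test trace lands in `increased`, while one that appears only in the test trace lands in `added`.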
