diff --git a/beginner_source/hta_intro_tutorial.rst b/beginner_source/hta_intro_tutorial.rst index 6d9dd9bfbd8..dc7c8cedf9e 100644 --- a/beginner_source/hta_intro_tutorial.rst +++ b/beginner_source/hta_intro_tutorial.rst @@ -1,76 +1,73 @@ Introduction to Holistic Trace Analysis ======================================= -**Author:** `Anupam Bhatnagar `_ -Setup ------ +**Author:** `Anupam Bhatnagar `_ -In this tutorial we demonstrate how to use Holistic Trace Analysis (HTA) to +In this tutorial, we demonstrate how to use Holistic Trace Analysis (HTA) to analyze traces from a distributed training job. To get started follow the steps -below: +below. Installing HTA -^^^^^^^^^^^^^^ +~~~~~~~~~~~~~~ We recommend using a Conda environment to install HTA. To install Anaconda, see -`here `_. +`the official Anaconda documentation `_. -1) Install HTA using pip +1. Install HTA using pip: -.. code-block:: python + .. code-block:: python - pip install HolisticTraceAnalysis + pip install HolisticTraceAnalysis -2) [Optional and recommended] Setup a conda environment +2. (Optional and recommended) Set up a Conda environment: -.. code-block:: python + .. code-block:: python - # create the environment env_name - conda create -n env_name + # create the environment env_name + conda create -n env_name - # activate the environment - conda activate env_name + # activate the environment + conda activate env_name - # deactivate the environment - conda deactivate + # When you are done, deactivate the environment by running ``conda deactivate`` -Getting started -^^^^^^^^^^^^^^^ +Getting Started +~~~~~~~~~~~~~~~ -Launch a jupyter notebook and set the ``trace_dir`` variable to the location of the traces. +Launch a Jupyter notebook and set the ``trace_dir`` variable to the location of the traces. .. code-block:: python - from hta.trace_analysis import TraceAnalysis - trace_dir = "/path/to/folder/with/traces" - analyzer = TraceAnalysis(trace_dir=trace_dir) + from hta.trace_analysis import TraceAnalysis + trace_dir = "/path/to/folder/with/traces" + analyzer = TraceAnalysis(trace_dir=trace_dir) Temporal Breakdown ------------------ -To best utilize the GPUs it is vital to understand where the GPU is spending -time for a given job. Is the GPU spending time on computation, communication, -memory events, or is it idle? The temporal breakdown feature breaks down the -time spent in three categories +To effectively utilize the GPUs, it is crucial to understand how they are spending +time for a specific job. Are they primarily engaged in computation, communication, +memory events, or are they idle? The temporal breakdown feature provides a detailed +analysis of the time spent in these three categories. -1) Idle time - GPU is idle. -2) Compute time - GPU is being used for matrix multiplications or vector operations. -3) Non-compute time - GPU is being used for communication or memory events. +* Idle time - GPU is idle. +* Compute time - GPU is being used for matrix multiplications or vector operations. +* Non-compute time - GPU is being used for communication or memory events. -To achieve high training efficiency the code should maximize compute time and -minimize idle time and non-compute time. The function below returns -a dataframe containing the temporal breakdown for each rank. +To achieve high training efficiency, the code should maximize compute time and +minimize idle time and non-compute time. The following function generates a +dataframe that provides a detailed breakdown of the temporal usage for each rank. .. 
code-block:: python - analyzer = TraceAnalysis(trace_dir = "/path/to/trace/folder") - time_spent_df = analyzer.get_temporal_breakdown() + analyzer = TraceAnalysis(trace_dir = "/path/to/trace/folder") + time_spent_df = analyzer.get_temporal_breakdown() .. image:: ../_static/img/hta/temporal_breakdown_df.png -When the ``visualize`` argument is set to True in the `get_temporal_breakdown +When the ``visualize`` argument is set to ``True`` in the `get_temporal_breakdown `_ function it also generates a bar graph representing the breakdown by rank. @@ -79,23 +76,26 @@ function it also generates a bar graph representing the breakdown by rank. Idle Time Breakdown ------------------- -Understanding how much time the GPU is idle and its causes can help direct -optimization strategies. A GPU is considered idle when no kernel is running on -it. We developed an algorithm to categorize the Idle time into 3 categories: -#. Host wait: is the idle duration on the GPU due to the CPU not enqueuing - kernels fast enough to keep the GPU busy. These kinds of inefficiencies can - be resolved by examining the CPU operators that are contributing to the slow - down, increasing the batch size and applying operator fusion. +Gaining insight into the amount of time the GPU spends idle and the +reasons behind it can help guide optimization strategies. A GPU is +considered idle when no kernel is running on it. We have developed an +algorithm to categorize the `Idle` time into three distinct categories: + +* **Host wait:** refers to the idle time on the GPU that is caused by + the CPU not enqueuing kernels quickly enough to keep the GPU fully utilized. + These types of inefficiencies can be addressed by examining the CPU + operators that are contributing to the slowdown, increasing the batch + size and applying operator fusion. -#. Kernel wait: constitutes the short overhead to launch consecutive kernels on - the GPU. The idle time attributed to this category can be minimized by using - CUDA Graph optimizations. +* **Kernel wait:** This refers to brief overhead associated with launching + consecutive kernels on the GPU. The idle time attributed to this category + can be minimized by using CUDA Graph optimizations. -#. Other wait: Lastly, this category includes idle we could not currently - attribute due to insufficient information. The likely causes include - synchronization among CUDA streams using CUDA events and delays in launching - kernels. +* **Other wait:** This category includes idle time that cannot currently + be attributed due to insufficient information. The likely causes include + synchronization among CUDA streams using CUDA events and delays in launching + kernels. The host wait time can be interpreted as the time when the GPU is stalling due to the CPU. To attribute the idle time as kernel wait we use the following @@ -132,6 +132,7 @@ on each rank. :scale: 100% .. tip:: + By default, the idle time breakdown presents the percentage of each of the idle time categories. Setting the ``visualize_pctg`` argument to ``False``, the function renders with absolute time on the y-axis. @@ -140,10 +141,10 @@ on each rank. Kernel Breakdown ---------------- -The kernel breakdown feature breaks down the time spent for each kernel type -i.e. communication (COMM), computation (COMP), and memory (MEM) across all -ranks and presents the proportion of time spent in each category. The -percentage of time spent in each category as a pie chart. 
+The kernel breakdown feature breaks down the time spent for each kernel type, +namely communication (COMM), computation (COMP), and memory (MEM), across all +ranks and presents the proportion of time spent in each category. The +percentage of time spent in each category is shown as a pie chart: .. image:: ../_static/img/hta/kernel_type_breakdown.png :align: center :scale: 100% The kernel breakdown can be calculated as follows: @@ -156,7 +157,7 @@ The kernel breakdown can be calculated as follows: kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown() The first dataframe returned by the function contains the raw values used to -generate the Pie chart. +generate the pie chart. Kernel Duration Distribution ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -164,7 +165,7 @@ Kernel Duration Distribution The second dataframe returned by `get_gpu_kernel_breakdown `_ contains duration summary statistics for each kernel. In particular, this -includes the count, min, max, average, standard deviation, sum and kernel type +includes the count, min, max, average, standard deviation, sum, and kernel type for each kernel on each rank. .. image:: ../_static/img/hta/kernel_metrics_df.png @@ -181,11 +182,12 @@ bottlenecks. .. image:: ../_static/img/hta/pie_charts.png .. tip:: + All images are generated using plotly. Hovering on the graph shows the - mode bar on the top right which allows the user to zoom, pan, select and + mode bar on the top right which allows the user to zoom, pan, select, and download the graph. -The pie charts above shows the top 5 computation, communication and memory +The pie charts above show the top 5 computation, communication, and memory kernels. Similar pie charts are generated for each rank. The pie charts can be configured to show the top k kernels using the ``num_kernels`` argument passed to the `get_gpu_kernel_breakdown` function. Additionally, the @@ -212,21 +214,21 @@ in the examples folder of the repo. Communication Computation Overlap --------------------------------- -In distributed training a significant amount of time is spent in communication -and synchronization events between GPUs. To achieve high GPU efficiency (i.e. -TFLOPS/GPU) it is vital to keep the GPU oversubscribed with computation +In distributed training, a significant amount of time is spent in communication +and synchronization events between GPUs. To achieve high GPU efficiency (that is, +TFLOPS/GPU), it is crucial to keep the GPU oversubscribed with computation kernels. In other words, the GPU should not be blocked due to unresolved data dependencies. One way to measure the extent to which computation is blocked by data dependencies is to calculate the communication computation overlap. Higher GPU efficiency is observed if communication events overlap computation events. Lack of communication and computation overlap will lead to the GPU being idle, -thus the efficiency would be low. To sum up, higher communication computation -overlap is desirable. To calculate the overlap percentage for each rank we -measure the following ratio: +resulting in low efficiency. +To sum up, a higher communication computation overlap is desirable. To calculate +the overlap percentage for each rank, we measure the following ratio: | **(time spent in computation while communicating) / (time spent in communication)**
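+
+For example, if a rank spends 80 ms in communication during the profiled window and
+computation kernels are running during 60 of those 80 ms, the overlap for that rank
+is 60 / 80 = 75%. These numbers are purely illustrative.
+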
-Communication computation overlap can be calculated as follows: +The communication computation overlap can be calculated as follows: .. code-block:: python @@ -266,9 +268,9 @@ API outputs a new trace file with the memory bandwidth and queue length counters. The new trace file contains tracks which indicate the memory bandwidth used by memcpy/memset operations and tracks for the queue length on each stream. By default, these counters are generated using the rank 0 -trace file and the new file contains the suffix ``_with_counters`` in its name. +trace file, and the new file contains the suffix ``_with_counters`` in its name. Users have the option to generate the counters for multiple ranks by using the -``ranks`` argument in the `generate_trace_with_counters` API. +``ranks`` argument in the ``generate_trace_with_counters`` API. .. code-block:: python @@ -284,19 +286,15 @@ HTA also provides a summary of the memory copy bandwidth and queue length counters as well as the time series of the counters for the profiled portion of the code using the following API: -#. `get_memory_bw_summary - `_ +* `get_memory_bw_summary `_ -#. `get_queue_length_summary - `_ +* `get_queue_length_summary `_ -#. `get_memory_bw_time_series - `_ +* `get_memory_bw_time_series `_ -#. `get_queue_length_time_series - `_ +* `get_queue_length_time_series `_ -To view the summary and time series use: +To view the summary and time series, use: .. code-block:: python @@ -321,17 +319,16 @@ bandwidth and queue length time series functions return a dictionary whose key is the rank and the value is the time series for that rank. By default, the time series is computed for rank 0 only.
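+
+As a minimal sketch, the counters can also be generated for several ranks at once
+with the ``ranks`` argument described above. The call below assumes the ``analyzer``
+object created earlier, and the rank list is only illustrative:
+
+.. code-block:: python
+
+    # generate counter tracks for ranks 0, 1, and 2 instead of only rank 0
+    analyzer.generate_trace_with_counters(ranks=[0, 1, 2])
+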
- CUDA Kernel Launch Statistics ----------------------------- .. image:: ../_static/img/hta/cuda_kernel_launch.png -For each event launched on the GPU there is a corresponding scheduling event on -the CPU e.g. CudaLaunchKernel, CudaMemcpyAsync, CudaMemsetAsync. These events -are linked by a common correlation id in the trace. See figure above. This -feature computes the duration of the CPU runtime event, its corresponding GPU -kernel and the launch delay i.e. the difference between GPU kernel starting and +For each event launched on the GPU, there is a corresponding scheduling event on +the CPU, such as ``CudaLaunchKernel``, ``CudaMemcpyAsync``, ``CudaMemsetAsync``. +These events are linked by a common correlation ID in the trace; see the figure +above. This feature computes the duration of the CPU runtime event, its corresponding GPU +kernel, and the launch delay, that is, the difference between GPU kernel starting and CPU operator ending. The kernel launch info can be generated as follows: .. code-block:: python @@ -345,23 +342,23 @@ A screenshot of the generated dataframe is given below. :scale: 100% :align: center -The duration of the CPU op, GPU kernel and the launch delay allows us to find: +The durations of the CPU op, the GPU kernel, and the launch delay allow us to find +the following: -#. **Short GPU kernels** - GPU kernels with duration less than the - corresponding CPU runtime event. +* **Short GPU kernels** - GPU kernels with duration less than the corresponding + CPU runtime event. -#. **Runtime event outliers** - CPU runtime events with excessive duration. +* **Runtime event outliers** - CPU runtime events with excessive duration. -#. **Launch delay outliers** - GPU kernels which take too long to be scheduled. +* **Launch delay outliers** - GPU kernels which take too long to be scheduled. HTA generates distribution plots for each of the aforementioned three categories. - **Short GPU kernels** -Usually, the launch time on the CPU side is between 5-20 microseconds. In some -cases the GPU execution time is lower than the launch time itself. The graph -below allows us to find how frequently such instances appear in the code. +Typically, the launch time on the CPU side ranges from 5 to 20 microseconds. In some +cases, the GPU execution time is lower than the launch time itself. The graph +below helps us find how frequently such instances occur in the code. .. image:: ../_static/img/hta/short_gpu_kernels.png @@ -382,3 +379,12 @@ hence the `get_cuda_kernel_launch_stats` API provides the ``launch_delay_cutoff`` argument to configure the value. .. image:: ../_static/img/hta/launch_delay_outliers.png
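+
+As a minimal sketch, the cutoff can be adjusted when computing the launch statistics.
+The call below assumes the ``analyzer`` object created earlier, and the cutoff value
+is only illustrative:
+
+.. code-block:: python
+
+    # use a custom launch delay cutoff instead of the default value
+    kernel_stats_df = analyzer.get_cuda_kernel_launch_stats(launch_delay_cutoff=100)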
+ + +Conclusion ~~~~~~~~~~ + +In this tutorial, you have learned how to install and use HTA, +a performance tool that enables you to analyze bottlenecks in your distributed +training workflows. To learn how you can use the HTA tool to perform trace +diff analysis, see `Trace Diff using Holistic Trace Analysis `__. diff --git a/beginner_source/hta_trace_diff_tutorial.rst b/beginner_source/hta_trace_diff_tutorial.rst index 77ac398d625..608d29ea358 100644 --- a/beginner_source/hta_trace_diff_tutorial.rst +++ b/beginner_source/hta_trace_diff_tutorial.rst @@ -1,26 +1,27 @@ Trace Diff using Holistic Trace Analysis ======================================== -**Author:** `Anupam Bhatnagar `_ +**Author:** `Anupam Bhatnagar `_ Occasionally, users need to identify the changes in PyTorch operators and CUDA -kernels resulting from a code change. To support such a requirement, HTA +kernels resulting from a code change. To support this requirement, HTA provides a trace comparison feature. This feature allows the user to input two sets of trace files where the first can be thought of as the *control group* -and the second as the *test group* as in an A/B test. The ``TraceDiff`` class +and the second as the *test group*, similar to an A/B test. The ``TraceDiff`` class provides functions to compare the differences between traces and functionality to visualize these differences. In particular, users can find operators and -kernels which were added and removed from each group along with the frequency +kernels that were added and removed from each group, along with the frequency of each operator/kernel and the cumulative time taken by the operator/kernel. -The `TraceDiff `_ class has 4 methods: -#. `compare_traces - `_ - - Compare the frequency and total duration of CPU operators and GPU kernels from - two sets of traces. +The `TraceDiff `_ class +has the following methods: + +* `compare_traces `_: + Compare the frequency and total duration of CPU operators and GPU kernels from + two sets of traces. -#. `ops_diff `_ - - Get the operators and kernels which have been: +* `ops_diff `_: + Get the operators and kernels which have been: #. **added** to the test trace and are absent in the control trace #. **deleted** from the test trace and are present in the control trace @@ -28,17 +29,15 @@ The `TraceDiff `_ +* `visualize_counts_diff `_ -#. `visualize_duration_diff - `_ +* `visualize_duration_diff `_ The last two methods can be used to visualize various changes in frequency and duration of CPU operators and GPU kernels, using the output of the ``compare_traces`` method. -E.g. The top 10 operators with increase in frequency can be computed as +For example, the top ten operators with an increase in frequency can be computed as follows: .. code-block:: python @@ -48,7 +47,7 @@ follows: .. image:: ../_static/img/hta/counts_diff.png -Similarly, the top 10 ops with the largest change in duration can be computed as +Similarly, the top ten operators with the largest change in duration can be computed as follows: .. code-block:: python