Introduction to Holistic Trace Analysis
========================================
**Author:** `Anupam Bhatnagar <https://github.com/anupambhatnagar>`_

Setup
-----

In this tutorial we demonstrate how to use Holistic Trace Analysis (HTA) to
analyze traces from a distributed training job. To get started, follow the
steps below.

Installing HTA
^^^^^^^^^^^^^^

We recommend using a Conda environment to install HTA. To install Anaconda, see
`here <https://docs.anaconda.com/anaconda/install/index.html>`_.

1) Install HTA using pip

.. code-block:: bash

    pip install HolisticTraceAnalysis

2) [Optional and recommended] Set up a Conda environment

.. code-block:: bash

    # create the environment env_name
    conda create -n env_name

    # activate the environment
    conda activate env_name

    # deactivate the environment
    conda deactivate

Getting started
^^^^^^^^^^^^^^^

Launch a Jupyter notebook and set the ``trace_dir`` variable to the location of
the traces.

.. code-block:: python

    from hta.trace_analysis import TraceAnalysis
    trace_dir = "/path/to/folder/with/traces"
    analyzer = TraceAnalysis(trace_dir=trace_dir)


Temporal Breakdown
------------------

To best utilize the GPUs, it is vital to understand where the GPU is spending
time for a given job. Is the GPU spending time on computation, communication,
memory events, or is it idle? The temporal breakdown feature breaks down the
time spent into the following three categories:

1) Idle time - GPU is idle.
2) Compute time - GPU is being used for matrix multiplications or vector operations.
3) Non-compute time - GPU is being used for communication or memory events.

To achieve high training efficiency, the code should maximize compute time and
minimize idle time and non-compute time. The function below returns
a dataframe containing the temporal breakdown for each rank.

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    time_spent_df = analyzer.get_temporal_breakdown()


.. image:: ../_static/img/hta/temporal_breakdown_df.png

When the ``visualize`` argument is set to ``True`` in the `get_temporal_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_temporal_breakdown>`_
function, it also generates a bar graph representing the breakdown by rank.
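
For example, the bar graph can be requested explicitly through the ``visualize``
keyword. This is a minimal sketch and assumes only the argument described above;
the default value of ``visualize`` may differ across HTA versions.

.. code-block:: python

    # sketch: return the dataframe and also render the per-rank bar graph
    time_spent_df = analyzer.get_temporal_breakdown(visualize=True)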

.. image:: ../_static/img/hta/temporal_breakdown_plot.png


Idle Time Breakdown
-------------------
Understanding how much time the GPU is idle and its causes can help direct
optimization strategies. A GPU is considered idle when no kernel is running on
it. We developed an algorithm to categorize the idle time into three categories:

#. Host wait: the idle time on the GPU caused by the CPU not enqueuing kernels
   fast enough to keep the GPU busy. These kinds of inefficiencies can be
   resolved by examining the CPU operators that contribute to the slowdown,
   increasing the batch size, and applying operator fusion.

#. Kernel wait: the short overhead incurred when launching consecutive kernels
   on the GPU. The idle time attributed to this category can be minimized by
   using CUDA Graph optimizations.

#. Other wait: this category includes idle time that cannot currently be
   attributed due to insufficient information. The likely causes include
   synchronization among CUDA streams using CUDA events and delays in launching
   kernels.

The host wait time can be interpreted as the time when the GPU is stalling due
to the CPU. To attribute the idle time as kernel wait we use the following
heuristic:

    | **gap between consecutive kernels < threshold**

The default threshold value is 30 nanoseconds and can be configured using the
``consecutive_kernel_delay`` argument. By default, the idle time breakdown is
computed for rank 0 only. In order to calculate the breakdown for other ranks,
use the ``ranks`` argument in the `get_idle_time_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_idle_time_breakdown>`_
function. The idle time breakdown can be generated as follows:

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    idle_time_df = analyzer.get_idle_time_breakdown()

.. image:: ../_static/img/hta/idle_time_breakdown_percentage.png

The function returns a tuple of dataframes. The first dataframe contains the
idle time by category on each stream for each rank.

.. image:: ../_static/img/hta/idle_time.png
   :scale: 100%
   :align: center

The second dataframe is generated when ``show_idle_interval_stats`` is set to
``True``. It contains the summary statistics of the idle time for each stream
on each rank.

.. image:: ../_static/img/hta/idle_time_summary.png
   :scale: 100%

.. tip::
   By default, the idle time breakdown presents the percentage of each of the
   idle time categories. When the ``visualize_pctg`` argument is set to
   ``False``, the function renders the bars with absolute time on the y-axis.
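
Combining the arguments described above, a more customized call might look like
the sketch below. The rank list and the threshold value are illustrative, not
recommendations.

.. code-block:: python

    # sketch combining the arguments documented above (values are illustrative)
    idle_time_df, idle_interval_stats_df = analyzer.get_idle_time_breakdown(
        ranks=[0, 1],                   # compute the breakdown for ranks 0 and 1
        consecutive_kernel_delay=30,    # kernel wait threshold, in nanoseconds
        show_idle_interval_stats=True,  # also return the summary statistics dataframe
        visualize_pctg=False,           # plot absolute time instead of percentages
    )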


Kernel Breakdown
----------------

The kernel breakdown feature breaks down the time spent by each kernel type,
i.e. communication (COMM), computation (COMP), and memory (MEM), across all
ranks and presents the proportion of time spent in each category. The
percentage of time spent in each category is presented as a pie chart.

.. image:: ../_static/img/hta/kernel_type_breakdown.png
   :align: center

The kernel breakdown can be calculated as follows:

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown()

The first dataframe returned by the function contains the raw values used to
generate the pie chart.

Kernel Duration Distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The second dataframe returned by `get_gpu_kernel_breakdown
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_gpu_kernel_breakdown>`_
contains duration summary statistics for each kernel. In particular, this
includes the count, min, max, average, standard deviation, sum and kernel type
for each kernel on each rank.

.. image:: ../_static/img/hta/kernel_metrics_df.png
   :align: center

Using this data HTA creates many visualizations to identify performance
bottlenecks:

#. Pie charts of the top kernels for each kernel type for each rank.

#. Bar graphs of the average duration across all ranks for each of the top
   kernels and for each kernel type.

.. image:: ../_static/img/hta/pie_charts.png

.. tip::
   All images are generated using plotly. Hovering on the graph shows the
   mode bar on the top right which allows the user to zoom, pan, select and
   download the graph.

The pie charts above show the top 5 computation, communication and memory
kernels. Similar pie charts are generated for each rank. The pie charts can be
configured to show the top k kernels using the ``num_kernels`` argument passed
to the ``get_gpu_kernel_breakdown`` function. Additionally, the
``duration_ratio`` argument can be used to tune the percentage of time that
needs to be analyzed. If both ``num_kernels`` and ``duration_ratio`` are
specified, then ``num_kernels`` takes precedence.
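
As a sketch, the two arguments can be combined as follows; the values shown are
illustrative and only the ``num_kernels`` and ``duration_ratio`` keywords
described above are assumed.

.. code-block:: python

    # sketch: restrict the breakdown to the top 10 kernels per type
    # (num_kernels takes precedence if duration_ratio is also given)
    kernel_type_metrics_df, kernel_metrics_df = analyzer.get_gpu_kernel_breakdown(
        num_kernels=10,
        duration_ratio=0.8,
    )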

.. image:: ../_static/img/hta/comm_across_ranks.png

The bar graph above shows the average duration of the NCCL AllReduce kernel
across all the ranks. The black lines indicate the minimum and maximum time
taken on each rank.

.. warning::
   When using jupyter-lab, set the ``image_renderer`` argument value to
   ``jupyterlab``, otherwise the graphs will not render in the notebook.

For a detailed walkthrough of this feature see the `gpu_kernel_breakdown
notebook
<https://github.com/facebookresearch/HolisticTraceAnalysis/blob/main/examples/kernel_breakdown_demo.ipynb>`_
in the examples folder of the repo.


Communication Computation Overlap
---------------------------------

In distributed training a significant amount of time is spent in communication
and synchronization events between GPUs. To achieve high GPU efficiency (i.e.
TFLOPS/GPU) it is vital to keep the GPU oversubscribed with computation
kernels. In other words, the GPU should not be blocked due to unresolved data
dependencies. One way to measure the extent to which computation is blocked by
data dependencies is to calculate the communication computation overlap. Higher
GPU efficiency is observed if communication events overlap computation events.
Lack of communication and computation overlap will lead to the GPU being idle,
and thus the efficiency will be low. In short, a higher communication
computation overlap is desirable. To calculate the overlap percentage for each
rank we measure the following ratio:

    | **(time spent in computation while communicating) / (time spent in communication)**

Communication computation overlap can be calculated as follows:

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    overlap_df = analyzer.get_comm_comp_overlap()

The function returns a dataframe containing the overlap percentage
for each rank.

.. image:: ../_static/img/hta/overlap_df.png
   :align: center
   :scale: 50%

When the ``visualize`` argument is set to ``True``, the `get_comm_comp_overlap
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_comm_comp_overlap>`_
function also generates a bar graph representing the overlap by rank.
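
As a rough sketch, the plot can be requested explicitly and the returned
dataframe can then be sorted to surface the least-overlapped ranks. Only the
``visualize`` keyword described above is assumed; sorting by the first column
assumes the dataframe holds a single overlap column indexed by rank, so check
``overlap_df.columns`` in your version.

.. code-block:: python

    # sketch: render the bar graph and list ranks from lowest to highest overlap
    overlap_df = analyzer.get_comm_comp_overlap(visualize=True)
    print(overlap_df.sort_values(by=overlap_df.columns[0]))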

.. image:: ../_static/img/hta/overlap_plot.png


Augmented Counters
------------------

Memory Bandwidth & Queue Length Counters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Memory bandwidth counters measure the memory copy bandwidth used while copying
data from host to device (H2D), device to host (D2H) and device to device (D2D)
by memory copy (memcpy) and memory set (memset) events. HTA also computes the
number of outstanding operations on each CUDA stream. We refer to this as
**queue length**. When the queue length on a stream is 1024 or larger, new
events cannot be scheduled on that stream and the CPU will stall until the
events on the GPU stream have been processed.

The `generate_trace_with_counters
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.generate_trace_with_counters>`_
API outputs a new trace file with the memory bandwidth and queue length
counters. The new trace file contains tracks which indicate the memory
bandwidth used by memcpy/memset operations and tracks for the queue length on
each stream. By default, these counters are generated using the rank 0
trace file and the new file contains the suffix ``_with_counters`` in its name.
Users have the option to generate the counters for multiple ranks by using the
``ranks`` argument in the ``generate_trace_with_counters`` API.

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/folder")
    analyzer.generate_trace_with_counters()
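
To also produce counter-augmented traces for additional ranks, the ``ranks``
argument described above can be passed in; the rank list below is illustrative.

.. code-block:: python

    # sketch: generate the augmented trace files for ranks 0 and 1
    # (each output file carries the "_with_counters" suffix described above)
    analyzer.generate_trace_with_counters(ranks=[0, 1])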

A screenshot of the generated trace file with augmented counters is shown below.

.. image:: ../_static/img/hta/mem_bandwidth_queue_length.png
   :scale: 100%

HTA also provides a summary of the memory copy bandwidth and queue length
counters as well as the time series of the counters for the profiled portion of
the code using the following APIs:

#. `get_memory_bw_summary
   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_summary>`_

#. `get_queue_length_summary
   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_summary>`_

#. `get_memory_bw_time_series
   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_memory_bw_time_series>`_

#. `get_queue_length_time_series
   <https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_queue_length_time_series>`_

To view the summary and time series use:

.. code-block:: python

    # generate summary
    mem_bw_summary = analyzer.get_memory_bw_summary()
    queue_len_summary = analyzer.get_queue_length_summary()

    # get time series
    mem_bw_series = analyzer.get_memory_bw_time_series()
    queue_len_series = analyzer.get_queue_length_time_series()

The summary contains the count, min, max, mean, standard deviation, 25th, 50th,
and 75th percentiles.

.. image:: ../_static/img/hta/queue_length_summary.png
   :scale: 100%
   :align: center

The time series only contains the points when a value changes. Once a value is
observed the time series stays constant until the next update. The memory
bandwidth and queue length time series functions return a dictionary whose key
is the rank and whose value is the time series for that rank. By default, the
time series is computed for rank 0 only.
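
Because the return value is a dictionary keyed by rank, the series for a given
rank can be read out directly. A minimal sketch, assuming only the behavior
described above; passing ``ranks`` mirrors the other APIs and is shown for
illustration.

.. code-block:: python

    # sketch: fetch the queue length time series for ranks 0 and 1
    queue_len_series = analyzer.get_queue_length_time_series(ranks=[0, 1])
    rank0_series = queue_len_series[0]  # time series for rank 0
    rank1_series = queue_len_series[1]  # time series for rank 1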


CUDA Kernel Launch Statistics
-----------------------------

.. image:: ../_static/img/hta/cuda_kernel_launch.png

For each event launched on the GPU there is a corresponding scheduling event on
the CPU, e.g. ``cudaLaunchKernel``, ``cudaMemcpyAsync``, ``cudaMemsetAsync``.
These events are linked by a common correlation id in the trace, as shown in the
figure above. This feature computes the duration of the CPU runtime event, its
corresponding GPU kernel and the launch delay, i.e. the difference between the
GPU kernel start and the CPU operator end. The kernel launch info can be
generated as follows:

.. code-block:: python

    analyzer = TraceAnalysis(trace_dir="/path/to/trace/dir")
    kernel_info_df = analyzer.get_cuda_kernel_launch_stats()

A screenshot of the generated dataframe is given below.

.. image:: ../_static/img/hta/cuda_kernel_launch_stats.png
   :scale: 100%
   :align: center

The durations of the CPU op, the GPU kernel and the launch delay allow us to find:

#. **Short GPU kernels** - GPU kernels with duration less than the
   corresponding CPU runtime event.

#. **Runtime event outliers** - CPU runtime events with excessive duration.

#. **Launch delay outliers** - GPU kernels which take too long to be scheduled.

HTA generates distribution plots for each of the aforementioned three categories.


**Short GPU kernels**

Usually, the launch time on the CPU side is between 5 and 20 microseconds. In
some cases the GPU execution time is lower than the launch time itself. The
graph below allows us to find how frequently such instances appear in the code.

.. image:: ../_static/img/hta/short_gpu_kernels.png


**Runtime event outliers**

The runtime outliers depend on the cutoff used to classify the outliers, hence
the `get_cuda_kernel_launch_stats
<https://hta.readthedocs.io/en/latest/source/api/trace_analysis_api.html#hta.trace_analysis.TraceAnalysis.get_cuda_kernel_launch_stats>`_
API provides the ``runtime_cutoff`` argument to configure the value.

.. image:: ../_static/img/hta/runtime_outliers.png

**Launch delay outliers**

The launch delay outliers depend on the cutoff used to classify the outliers,
hence the ``get_cuda_kernel_launch_stats`` API provides the
``launch_delay_cutoff`` argument to configure the value.
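
A sketch combining the two cutoff arguments described above; the numeric values
(and their units) are illustrative and should be checked against the API
documentation linked above.

.. code-block:: python

    # sketch: adjust the outlier classification via the two cutoffs above
    kernel_info_df = analyzer.get_cuda_kernel_launch_stats(
        runtime_cutoff=50,
        launch_delay_cutoff=100,
    )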

.. image:: ../_static/img/hta/launch_delay_outliers.png