(prototype) Flight Recorder for Debugging
=========================================
**Author**: `Chirag Pandya <https://github.com/c-p-i-o>`_

This tutorial introduces a new tool for debugging stuck jobs during distributed training.

Background and Motivation
-------------------------
An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
that require significant computational resources.

A distributed AI training job is considered "stuck" when it stops making meaningful progress for an extended period of
time.

A job can get stuck for various reasons:

- Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to
  issues with the data pipeline or the data source.
- Resource Constraints: If the system running the job does not have enough computational resources (like CPU, GPU, or
  memory), the job might not be able to proceed.
- Network Issues: In a distributed training setup, different parts of the model or data may be processed on different
  devices. If there are network issues, communication between these devices may be disrupted, causing the job to get
  stuck.
- Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to
  get stuck.
- Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need
  to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can
  occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an
  indefinite wait for the job to progress.

Flight Recorder captures diagnostic information as collectives run. The diagnostic information can be used to help root
cause the underlying issue. There are two core parts to Flight Recorder:

- The collection portion. When enabled, information about collectives is recorded in an in-memory circular buffer.
  Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to a file (a hedged retrieval sketch
  follows this list).
- An analyzer script, available in the `pytorch/tools/flight_recorder` directory (details below).
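
The buffer can also be retrieved on demand from inside the training process. The sketch below is only an illustration
under stated assumptions: `_dump_nccl_trace` in `torch._C._distributed_c10d` is a private PyTorch binding that returns
the buffer as pickled bytes, and its availability and signature are internal details that may change between releases
or be absent in some builds.

.. code:: python

    import torch.distributed as dist
    # Private API (assumption): location and behavior may change across releases.
    from torch._C._distributed_c10d import _dump_nccl_trace


    def dump_flight_recorder(path: str) -> None:
        """Write the current in-memory Flight Recorder buffer to a file."""
        if not dist.is_initialized():
            return  # nothing to dump if no process group has been created
        trace_bytes = _dump_nccl_trace()  # pickled bytes of the circular buffer
        with open(path, "wb") as f:
            f.write(trace_bytes)


    # Example: each rank writes its own dump file.
    # dump_flight_recorder(f"/tmp/fr_dump_rank{dist.get_rank()}")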

Enabling Flight Recorder
------------------------
There are two required environment variables to get the initial version of Flight Recorder working (a minimal setup
sketch follows the optional settings below):

- TORCH_NCCL_TRACE_BUFFER_SIZE (0, or a positive number N): N enables collection. The recommended setting is 2000.
- TORCH_NCCL_DUMP_ON_TIMEOUT (true, false): true writes diagnostic files to disk on job timeout.

Optional settings:

- TORCH_NCCL_TRACE_CPP_STACK (true, false): true enables C++ stack trace captures in Flight Recorder (addr2line can be
  slow; see the additional settings).
- TORCH_NCCL_ENABLE_TIMING (true, false): true enables additional CUDA events at the start of each collective and
  records the duration of each collective. This may incur some CPU overhead.
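
A minimal setup sketch follows, assuming the training script is launched with `torchrun` (or another launcher that
provides the usual distributed environment variables). The Flight Recorder variables are consumed by the NCCL backend,
so they should be in place before the process group is created, either exported in the launcher environment or set at
the top of the script as shown here.

.. code:: python

    import os

    # Required: enable collection (keep the last 2000 entries) and dump on timeout.
    os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"
    os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "true"

    # Optional extras described above (commented out by default).
    # os.environ["TORCH_NCCL_TRACE_CPP_STACK"] = "true"
    # os.environ["TORCH_NCCL_ENABLE_TIMING"] = "true"

    import torch.distributed as dist

    # The environment variables above should be in place before the NCCL
    # process group is created.
    dist.init_process_group(backend="nccl")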

Flight Recorder File Formats
----------------------------
Flight Recorder files are dumped out in `pickle` format.
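
Because the dumps are plain pickle files, they can be opened directly for a quick inspection. In the sketch below the
dump path is a placeholder, and since the exact layout of the unpickled object is not described in this section, the
code only prints whatever top-level structure it finds.

.. code:: python

    import pickle

    # Placeholder path: substitute the dump file written for one of your ranks.
    with open("/tmp/fr_dump_rank0", "rb") as f:
        dump = pickle.load(f)

    print(type(dump))
    if isinstance(dump, dict):
        print("top-level keys:", list(dump.keys()))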

Analyzing Flight Recorder Dumps
-------------------------------
We have convenient scripts available in the `pytorch/tools/flight_recorder` directory that can be used to analyze captured
data.
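
As a rough illustration of the workflow only: this section does not spell out the analyzer's entry point or flags, so
the script name and arguments below are assumptions; check the contents of `pytorch/tools/flight_recorder` for the
actual interface.

.. code:: python

    import subprocess

    # Assumed entry point and flag; verify against the scripts in
    # pytorch/tools/flight_recorder before use.
    subprocess.run(
        [
            "python",
            "tools/flight_recorder/fr_trace.py",  # assumed analyzer script
            "-d", "/tmp/fr_dumps",                # assumed: directory containing dump files
        ],
        check=True,
    )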