From b2c013e2208d7736d8d0c6f99dd667fb426e5044 Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Tue, 20 Aug 2024 11:53:37 -0700 Subject: [PATCH 01/31] [Doc] Flight recorder tutorial Summary: Add a tutorial file for flight recorder. Test Plan: N/A Reviewers: Subscribers: Tasks: Tags: --- prototype_source/flight_recorder_tutorial.rst | 51 +++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100644 prototype_source/flight_recorder_tutorial.rst diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst new file mode 100644 index 00000000000..92195ebddb0 --- /dev/null +++ b/prototype_source/flight_recorder_tutorial.rst @@ -0,0 +1,51 @@ +(prototype) Flight Recorder for Debugging +========================================= +**Author**: `Chirag Pandya ` + +This tutorial introduces a new tool for debugging stuck jobs during distributed training. + +1. Background and Motivation +---------------------------- +An AI distributed training job refers to the process of training a machine learning model using multiple devices, such +as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models +that require significant computational resources. +An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that +subsequent training can be done faster. A trained usable model is the final desired outcome. +One of the biggest impediment to completing training is the concept of a "stuck job". + +A distributed AI training job is considered "stuck" when it stops making meaningful progress for an extended period of +time. + +A job can get stuck for various reasons: +Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to +issues with the data pipeline or the data source. +Resource Constraints: If the system running the job does not have enough computational resources (like CPU, GPU, or +memory), the job might not be able to proceed. +Network Issues: In a distributed training setup, different parts of the model or data may be processed on different +devices. If there are network issues, communication between these devices may be disrupted, causing the job to get +stuck. +Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to +get stuck. +Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need +to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can +occur if one or ranks fail to join a collective while the remaining ranks have joined. This results in an +indefinite wait for the job to progress. + +Flight Recorder captures diagnostics information as collectives run. The diagnostic information can be used to help root +cause the underlying issue. There are 2 core parts to flight recorder. +a) The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer. +Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file. +b) An analyzer script is available in the `pytorch/tools/flight_recorder` directory. The analyzer script is capable of +reading flight recorder records and performing an automated analysis on the collected data. + +2. 
Enabling Flight Recorder +There are 2 required environment variables to get the initial version of flight recorder working. +TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a postitive number) N = collection enabled. Recommended to set this to 2000) +TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout. +Optional settings: +TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow addr2line - + see advanced settings) +TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and record +‘duration’ of each collective. May incur some CPU overhead. + +3. Flight Recorder File Formats From 2b69bcd2b020dfee75826aaed0508b73eb6cf74c Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Wed, 21 Aug 2024 13:25:46 -0700 Subject: [PATCH 02/31] More cleanup. --- prototype_source/flight_recorder_tutorial.rst | 48 +++++++++++-------- 1 file changed, 29 insertions(+), 19 deletions(-) diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index 92195ebddb0..53af2de15e8 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -1,11 +1,11 @@ (prototype) Flight Recorder for Debugging ========================================= -**Author**: `Chirag Pandya ` +**Author**: `Chirag Pandya `_ This tutorial introduces a new tool for debugging stuck jobs during distributed training. -1. Background and Motivation ----------------------------- +Background and Motivation +-------------------------- An AI distributed training job refers to the process of training a machine learning model using multiple devices, such as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models that require significant computational resources. @@ -17,35 +17,45 @@ A distributed AI training job is considered "stuck" when it stops making meaning time. A job can get stuck for various reasons: -Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to + - Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to issues with the data pipeline or the data source. -Resource Constraints: If the system running the job does not have enough computational resources (like CPU, GPU, or + - Resource Constraints: If the system running the job does not have enough computational resources (like CPU, GPU, or memory), the job might not be able to proceed. -Network Issues: In a distributed training setup, different parts of the model or data may be processed on different + - Network Issues: In a distributed training setup, different parts of the model or data may be processed on different devices. If there are network issues, communication between these devices may be disrupted, causing the job to get stuck. -Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to + - Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to get stuck. -Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need + - Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need to be synchronized at certain points. 
If this synchronization fails, the job can get stuck. For example, a deadlock can occur if one or ranks fail to join a collective while the remaining ranks have joined. This results in an indefinite wait for the job to progress. Flight Recorder captures diagnostics information as collectives run. The diagnostic information can be used to help root cause the underlying issue. There are 2 core parts to flight recorder. -a) The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer. +- The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file. -b) An analyzer script is available in the `pytorch/tools/flight_recorder` directory. The analyzer script is capable of -reading flight recorder records and performing an automated analysis on the collected data. +- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). T -2. Enabling Flight Recorder + Enabling Flight Recorder + ------------------------ There are 2 required environment variables to get the initial version of flight recorder working. -TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a postitive number) N = collection enabled. Recommended to set this to 2000) -TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout. + - TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a postitive number) N = collection enabled. Recommended to set this + to 2000) + - TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout. Optional settings: -TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow addr2line - - see advanced settings) -TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and record -‘duration’ of each collective. May incur some CPU overhead. + - TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow + addr2line - see additinal settings) + - TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and + record the ‘duration’ of each collective. May incur some CPU overhead. + +Flight Recorder File Formats +---------------------------- +Flight recorder files are dumped out in `pickle` format. + + -3. Flight Recorder File Formats +Analyzing Flight Recorder Dumps +------------------------------- +We have convenient scripts available in `pytorch/tools/flight_recorder` directory that can be used to analyze captured +data. From 7df80d14b3ccd2cf049b3c3a6680b24bc805a9ee Mon Sep 17 00:00:00 2001 From: Hugo <6937752+fduwjj@users.noreply.github.com> Date: Tue, 27 Aug 2024 10:28:45 -0700 Subject: [PATCH 03/31] Update flight_recorder_tutorial.rst Add dump example and command line. 
--- prototype_source/flight_recorder_tutorial.rst | 77 +++++++++++++++++-- 1 file changed, 69 insertions(+), 8 deletions(-) diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index 53af2de15e8..0517f9d66ed 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -1,6 +1,6 @@ (prototype) Flight Recorder for Debugging ========================================= -**Author**: `Chirag Pandya `_ +**Author**: `Chirag Pandya `, `Junjie Wang ` This tutorial introduces a new tool for debugging stuck jobs during distributed training. @@ -9,7 +9,7 @@ Background and Motivation An AI distributed training job refers to the process of training a machine learning model using multiple devices, such as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models that require significant computational resources. -An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that +An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that subsequent training can be done faster. A trained usable model is the final desired outcome. One of the biggest impediment to completing training is the concept of a "stuck job". @@ -37,9 +37,9 @@ cause the underlying issue. There are 2 core parts to flight recorder. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file. - An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). T - Enabling Flight Recorder - ------------------------ -There are 2 required environment variables to get the initial version of flight recorder working. +Enabling Flight Recorder +------------------------ +There are two required environment variables to get the initial version of flight recorder working. - TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a postitive number) N = collection enabled. Recommended to set this to 2000) - TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout. @@ -52,10 +52,71 @@ Optional settings: Flight Recorder File Formats ---------------------------- Flight recorder files are dumped out in `pickle` format. - - - +``` +{ + "version": "2.3", + "pg_config": { + "0": { + "name": "0", + "desc": "default_pg", + "ranks": "[0, 1]" + } + }, + "pg_status": { + "0": { + "last_enqueued_collective": 2, + "last_started_collective": -1, + "last_completed_collective": 2 + } + }, + "entries": [ + { + "frames": [ + { + "name": "test_short_pickle", + "filename": "pytorch/test/distributed/test_c10d_nccl.py", + "line": 3647 + }, + ... + { + "name": "spawn_main", + "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py", + "line": 116 + }, + { + "name": "", + "filename": "", + "line": 1 + } + ], + "record_id": 0, + "pg_id": 0, + "process_group": ("0", "default_pg"), + "collective_seq_id": 1, + "p2p_seq_id": 0, + "op_id": 1, + "profiling_name": "nccl:all_reduce", + "time_created_ns": 1724779239936775119, + "input_sizes": [[3, 4]], + "input_dtypes": ["Float"], + "output_sizes": [[3, 4]], + "output_dtypes": ["Float"], + "state": "completed", + "time_discovered_started_ns": null, + "time_discovered_completed_ns": 1724779239975811724, + "retired": true, + "timeout_ms": 600000, + "is_p2p": false + }, + ...] 
+} +``` Analyzing Flight Recorder Dumps ------------------------------- We have convenient scripts available in `pytorch/tools/flight_recorder` directory that can be used to analyze captured data. + +To run it, one can use command line: +``` +python fr_trace.py -d [-o ] +``` From d4fd0d3c7847a109ff8a22508ca854e4f80e6eed Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Mon, 2 Sep 2024 17:33:26 -0700 Subject: [PATCH 04/31] Add additional settings section Summary: Update tutorial Test Plan: Reviewers: Subscribers: Tasks: Tags: --- prototype_source/flight_recorder_tutorial.rst | 136 +++++++++--------- 1 file changed, 72 insertions(+), 64 deletions(-) diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index 0517f9d66ed..02631025cdf 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -35,7 +35,7 @@ Flight Recorder captures diagnostics information as collectives run. The diagnos cause the underlying issue. There are 2 core parts to flight recorder. - The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file. -- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). T +- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). Enabling Flight Recorder ------------------------ @@ -51,72 +51,80 @@ Optional settings: Flight Recorder File Formats ---------------------------- -Flight recorder files are dumped out in `pickle` format. -``` -{ - "version": "2.3", - "pg_config": { - "0": { - "name": "0", - "desc": "default_pg", - "ranks": "[0, 1]" - } - }, - "pg_status": { - "0": { - "last_enqueued_collective": 2, - "last_started_collective": -1, - "last_completed_collective": 2 - } - }, - "entries": [ - { - "frames": [ - { - "name": "test_short_pickle", - "filename": "pytorch/test/distributed/test_c10d_nccl.py", - "line": 3647 - }, - ... - { - "name": "spawn_main", - "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py", - "line": 116 - }, - { - "name": "", - "filename": "", - "line": 1 - } - ], - "record_id": 0, - "pg_id": 0, - "process_group": ("0", "default_pg"), - "collective_seq_id": 1, - "p2p_seq_id": 0, - "op_id": 1, - "profiling_name": "nccl:all_reduce", - "time_created_ns": 1724779239936775119, - "input_sizes": [[3, 4]], - "input_dtypes": ["Float"], - "output_sizes": [[3, 4]], - "output_dtypes": ["Float"], - "state": "completed", - "time_discovered_started_ns": null, - "time_discovered_completed_ns": 1724779239975811724, - "retired": true, - "timeout_ms": 600000, - "is_p2p": false +Flight recorder files are dumped out in `pickle` format. Files are written out to local disks or mounted shared NFS +folders. +Contents of a flight recorder `unpickled` file is shown below. +.. code-block: JSON + { + "version": "2.3", + "pg_config": { + "0": { + "name": "0", + "desc": "default_pg", + "ranks": "[0, 1]" + } }, - ...] -} -``` + "pg_status": { + "0": { + "last_enqueued_collective": 2, + "last_started_collective": -1, + "last_completed_collective": 2 + } + }, + "entries": [ + { + "frames": [ + { + "name": "test_short_pickle", + "filename": "pytorch/test/distributed/test_c10d_nccl.py", + "line": 3647 + }, + ... 
+ { + "name": "spawn_main", + "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py", + "line": 116 + }, + { + "name": "", + "filename": "", + "line": 1 + } + ], + "record_id": 0, + "pg_id": 0, + "process_group": ("0", "default_pg"), + "collective_seq_id": 1, + "p2p_seq_id": 0, + "op_id": 1, + "profiling_name": "nccl:all_reduce", + "time_created_ns": 1724779239936775119, + "input_sizes": [[3, 4]], + "input_dtypes": ["Float"], + "output_sizes": [[3, 4]], + "output_dtypes": ["Float"], + "state": "completed", + "time_discovered_started_ns": null, + "time_discovered_completed_ns": 1724779239975811724, + "retired": true, + "timeout_ms": 600000, + "is_p2p": false + }, + ...] + } + Analyzing Flight Recorder Dumps ------------------------------- We have convenient scripts available in `pytorch/tools/flight_recorder` directory that can be used to analyze captured data. -To run it, one can use command line: -``` -python fr_trace.py -d [-o ] -``` +To run it, one can use command line: +.. code:: python + python fr_trace.py -d [-o ] + + +Additional settings +------------------- +TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces +from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much +faster than the traditional `addr2line`. From 23eef97d19da5bb1ccddf804541d48b9af27b2e0 Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Tue, 3 Sep 2024 13:28:13 -0700 Subject: [PATCH 05/31] address code review comments Summary: Add missing sections from the template and clarify some notes further in the tutorial. --- prototype_source/flight_recorder_tutorial.rst | 53 ++++++++++++++----- 1 file changed, 40 insertions(+), 13 deletions(-) diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index 02631025cdf..6ce152542be 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -4,8 +4,8 @@ This tutorial introduces a new tool for debugging stuck jobs during distributed training. -Background and Motivation --------------------------- +Overview, Background and Motivation +----------------------------------- An AI distributed training job refers to the process of training a machine learning model using multiple devices, such as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models that require significant computational resources. @@ -31,24 +31,48 @@ to be synchronized at certain points. If this synchronization fails, the job can occur if one or ranks fail to join a collective while the remaining ranks have joined. This results in an indefinite wait for the job to progress. -Flight Recorder captures diagnostics information as collectives run. The diagnostic information can be used to help root -cause the underlying issue. There are 2 core parts to flight recorder. +Flight Recorder, as the name suggests, captures diagnostics information as collectives run. The captured diagnostic +information can be used to help root cause the underlying issue when jobs get stuck. +There are 2 core parts to flight recorder. - The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file. - An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). 
+Prerequisites +------------- +None. This is a new debugging tool that is available in PyTorch version 2.5. + Enabling Flight Recorder ------------------------ There are two required environment variables to get the initial version of flight recorder working. - - TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a postitive number) N = collection enabled. Recommended to set this - to 2000) - - TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout. + - TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a positive number) N = collection enabled. N represents the number of + entries that will be kept internally in a circular buffer. Recommended to set this at 2000. + - TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout. If set, + there will be one file per rank output in the jobs running directory. Optional settings: - TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow addr2line - see additinal settings) - TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and record the ‘duration’ of each collective. May incur some CPU overhead. +Additional settings +------------------- +TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces +from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much +faster than the traditional `addr2line`. + +Retrieving Flight Recorder Data via an API +------------------------------------------ +Flight recorder data can also be retrieved via an API call. +The API is shown below with the default arguments. +.. code:: python + torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False) + +To view the data, you can unpickle the data +.. code:: python + t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace()) + print(t) + Flight Recorder File Formats ---------------------------- Flight recorder files are dumped out in `pickle` format. Files are written out to local disks or mounted shared NFS @@ -118,13 +142,16 @@ Analyzing Flight Recorder Dumps We have convenient scripts available in `pytorch/tools/flight_recorder` directory that can be used to analyze captured data. -To run it, one can use command line: +1. In order to run the convenience script, all files from a rank must first be copied over into a single directory. + +2. To run it, one can use command line: .. code:: python python fr_trace.py -d [-o ] -Additional settings -------------------- -TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces -from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much -faster than the traditional `addr2line`. +Conclusion +---------- +This tutorial introduces a new PyTorch diagnostic tool called `flight recorder`. The tutorial talks about how flight +recorder can be enabled to collect diagnostic data from a machine. +Data captured from flight recorder can be analyzed using a convenience script in the `tools/flight_recorder` directory +in the PyTorch repository. From f4cf1ff5312f24d12a21037d0d638ae69dfc0241 Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Tue, 3 Sep 2024 13:32:37 -0700 Subject: [PATCH 06/31] add missing section Summary: add missing "what you will learn" section. 
--- prototype_source/flight_recorder_tutorial.rst | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index 6ce152542be..e33899c26eb 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -2,7 +2,10 @@ ========================================= **Author**: `Chirag Pandya `, `Junjie Wang ` -This tutorial introduces a new tool for debugging stuck jobs during distributed training. +What you will learn +------------------- +This tutorial introduces a new tool for debugging stuck jobs during distributed training. The tutorial explains how this +new tool can be enabled and how to use the collected data for analyzing stuck jobs. Overview, Background and Motivation ----------------------------------- From f7db1b96ef312ff34101f06358169761fafb38ab Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Tue, 3 Sep 2024 13:37:12 -0700 Subject: [PATCH 07/31] fix typo Summary: fix a typo. --- prototype_source/flight_recorder_tutorial.rst | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index e33899c26eb..7dfac453693 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -54,15 +54,17 @@ There are two required environment variables to get the initial version of fligh there will be one file per rank output in the jobs running directory. Optional settings: - TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow - addr2line - see additinal settings) + addr2line - see additional settings) - TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and - record the ‘duration’ of each collective. May incur some CPU overhead. + record the `duration` of each collective. May incur some CPU overhead. In the collected data, we end up with a + `duration` field that indicates how long a collective took to execute. Additional settings ------------------- TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much -faster than the traditional `addr2line`. +faster than the traditional `addr2line`. Use this setting in conjunction with `TORCH_NCCL_TRACE_CPP_STACK` to collect +C++ traces in `flight recorder` data. Retrieving Flight Recorder Data via an API ------------------------------------------ From d86d62388b370fa925c81fba8ae524bc3a692db4 Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Tue, 3 Sep 2024 15:05:22 -0700 Subject: [PATCH 08/31] Fixes Summary: 1. Add flight recorder to README.txt. 2. Add a pre-requisite. --- prototype_source/README.txt | 9 ++++++--- prototype_source/flight_recorder_tutorial.rst | 2 +- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/prototype_source/README.txt b/prototype_source/README.txt index 4ab9ce8f6a9..2dcb5e0cb2d 100644 --- a/prototype_source/README.txt +++ b/prototype_source/README.txt @@ -7,7 +7,7 @@ Prototype Tutorials 2. graph_mode_static_quantization_tutorial.py Graph Mode Post Training Static Quantization in PyTorch https://pytorch.org/tutorials/prototype/graph_mode_static_quantization_tutorial.html - + 3. 
graph_mode_dynamic_bert_tutorial.rst Graph Mode Dynamic Quantization on BERT https://github.com/pytorch/tutorials/blob/main/prototype_source/graph_mode_dynamic_bert_tutorial.rst @@ -30,9 +30,12 @@ Prototype Tutorials 8. fx_graph_mode_ptq_dynamic.py FX Graph Mode Post Training Dynamic Quantization - https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html + https://pytorch.org/tutorials/prototype/fx_graph_mode_ptq_dynamic.html 9. fx_graph_mode_quant_guide.py FX Graph Mode Quantization User Guide - https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html + https://pytorch.org/tutorials/prototype/fx_graph_mode_quant_guide.html +10 flight_recorder_tutorial.rst + Flight Recorder User Guide + https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index 7dfac453693..1cc5b16236e 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -43,7 +43,7 @@ Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped Prerequisites ------------- -None. This is a new debugging tool that is available in PyTorch version 2.5. +PyTorch version 2.5 and later. Enabling Flight Recorder ------------------------ From c07178411f13bcda99fdf74f1755ccde0baae3bd Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Tue, 3 Sep 2024 15:12:09 -0700 Subject: [PATCH 09/31] Add flight recorder to prototype index Summary: Add flight recorder to index file. --- prototype_source/prototype_index.rst | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/prototype_source/prototype_index.rst b/prototype_source/prototype_index.rst index 8d965194f88..6ae8e251394 100644 --- a/prototype_source/prototype_index.rst +++ b/prototype_source/prototype_index.rst @@ -80,8 +80,8 @@ Prototype features are not available as part of binary distributions like PyPI o :card_description: Learn how to use Post Training Quantization in PyTorch 2 Export. :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png :link: ../prototype/pt2e_quant_ptq.html - :tags: Quantization - + :tags: Quantization + .. customcarditem:: :header: PyTorch 2 Export Quantization-Aware Training :card_description: Learn how to use Quantization-Aware-Training in PyTorch 2 Export. @@ -203,11 +203,11 @@ Prototype features are not available as part of binary distributions like PyPI o .. customcarditem:: :header: MaskedTensor: Simplifying Adagrad Sparse Semantics - :card_description: See a showcase on how masked tensors can enable sparse semantics and provide for a cleaner dev experience + :card_description: See a showcase on how masked tensors can enable sparse semantics and provide for a cleaner dev experience :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png :link: ../prototype/maskedtensor_adagrad.html :tags: MaskedTensor - + .. Model-Optimization .. customcarditem:: @@ -217,6 +217,15 @@ Prototype features are not available as part of binary distributions like PyPI o :link: ../prototype/inductor_cpp_wrapper_tutorial.html :tags: Model-Optimization +.. Distributed + +.. customcarditem:: + :header: Flight Recorder Tutorial + :card_description: Debug your stuck jobs with Flight Recorder + :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png + :link: ../prototype/flight_recorder_tutorial.html + :tags: Distributed, Debugging, FlightRecorder + .. End of tutorial card section .. 
raw:: html

From aa65c84a90618ea0da66f9ee053ac89df7169fd8 Mon Sep 17 00:00:00 2001
From: Chirag Pandya
Date: Tue, 3 Sep 2024 15:15:19 -0700
Subject: [PATCH 10/31] Update prototype_index

---
 prototype_source/prototype_index.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/prototype_source/prototype_index.rst b/prototype_source/prototype_index.rst
index 6ae8e251394..af0da6ea56b 100644
--- a/prototype_source/prototype_index.rst
+++ b/prototype_source/prototype_index.rst
@@ -247,6 +247,7 @@ Prototype features are not available as part of binary distributions like PyPI o
     prototype/fx_graph_mode_quant_guide.html
     prototype/fx_graph_mode_ptq_dynamic.html
     prototype/fx_graph_mode_ptq_static.html
+    prototype/flight_recorder_tutorial.html
     prototype/graph_mode_dynamic_bert_tutorial.html
     prototype/inductor_cpp_wrapper_tutorial.html
     prototype/pt2e_quantizer.html

From fb13629131655f92ccfa9e2bb8d278378ffa5cfc Mon Sep 17 00:00:00 2001
From: Chirag Pandya
Date: Thu, 5 Sep 2024 16:51:22 -0700
Subject: [PATCH 11/31] Apply suggestions from code review

Address code formatting changes

Co-authored-by: Svetlana Karslioglu
---
 prototype_source/flight_recorder_tutorial.rst | 99 +++++++++++--------
 1 file changed, 56 insertions(+), 43 deletions(-)

diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst
index 1cc5b16236e..afa35aae0c7 100644
--- a/prototype_source/flight_recorder_tutorial.rst
+++ b/prototype_source/flight_recorder_tutorial.rst
@@ -4,86 +4,95 @@
 What you will learn
 -------------------
-This tutorial introduces a new tool for debugging stuck jobs during distributed training. The tutorial explains how this
-new tool can be enabled and how to use the collected data for analyzing stuck jobs.
+* Learn about a new tool for debugging stuck jobs during distributed training.
+* Learn how you can enable the tool and use the collected data for analyzing stuck jobs.

-Overview, Background and Motivation
------------------------------------
+Overview
+-----------------------------------
 An AI distributed training job refers to the process of training a machine learning model using multiple devices, such
 as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models
 that require significant computational resources.
-An engineer’s goal is to complete an AI training job as fast as possible and make continuous improvements such that
+An engineer’s goal is to complete an AI training job as quickly as possible and make continuous improvements so that
 subsequent training can be done faster. A trained, usable model is the final desired outcome.
-One of the biggest impediment to completing training is the concept of a "stuck job".
+One of the biggest impediments to completing training is the concept of a *stuck job*.

-A distributed AI training job is considered "stuck" when it stops making meaningful progress for an extended period of
+A distributed AI training job is considered `stuck` when it stops making meaningful progress for an extended period of
 time.

 A job can get stuck for various reasons:
- - Data Starvation: This happens when the training job is not receiving data at the expected rate. This could be due to
+- **Data Starvation:** This occurs when the training job is not receiving data at the expected rate, possibly due to
 issues with the data pipeline or the data source.
- - Resource Constraints: If the system running the job does not have enough computational resources (like CPU, GPU, or +- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU, GPU, or memory), the job might not be able to proceed. - - Network Issues: In a distributed training setup, different parts of the model or data may be processed on different +- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on different devices. If there are network issues, communication between these devices may be disrupted, causing the job to get stuck. - - Software Bugs or Errors: Errors in the training code or the underlying libraries and frameworks can also cause a job to +- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a job to get stuck. - - Synchronization Issues: In distributed training, different parts of the computation are often run in parallel and need +- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can -occur if one or ranks fail to join a collective while the remaining ranks have joined. This results in an +occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an indefinite wait for the job to progress. Flight Recorder, as the name suggests, captures diagnostics information as collectives run. The captured diagnostic -information can be used to help root cause the underlying issue when jobs get stuck. -There are 2 core parts to flight recorder. -- The collection portion. When enabled, information about collectives are recorded in an in-memory circular buffer. +information can be used to help identify the root cause of issues when jobs get stuck. +There are two core parts to Flight Recorder. +- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file. - An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). Prerequisites ------------- -PyTorch version 2.5 and later. +- PyTorch version 2.5 or later. Enabling Flight Recorder ------------------------ There are two required environment variables to get the initial version of flight recorder working. - - TORCH_NCCL_TRACE_BUFFER_SIZE (0, N where N is a positive number) N = collection enabled. N represents the number of - entries that will be kept internally in a circular buffer. Recommended to set this at 2000. - - TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false) true = write out diagnostic files to disk on job timeout. If set, +- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. N represents the number of + entries that will be kept internally in a circular buffer. We recommended to set this value at 2000. +- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled, there will be one file per rank output in the jobs running directory. 
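+
+As a concrete illustration, a launch script might export both required variables before starting the
+job. This is only a sketch: the ``torchrun`` arguments and the ``train.py`` script name are
+hypothetical placeholders, not part of Flight Recorder itself.
+
+.. code-block:: bash
+
+   # Keep the most recent 2000 collective entries per rank in the in-memory circular buffer.
+   export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
+
+   # On job timeout, write one diagnostic dump file per rank to the job's running directory.
+   export TORCH_NCCL_DUMP_ON_TIMEOUT=true
+
+   torchrun --nproc_per_node=8 train.py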
-Optional settings: - - TORCH_NCCL_TRACE_CPP_STACK (true, false) true = enable cpp stack trace captures in flight recorder (for slow - addr2line - see additional settings) + +**Optional settings:** + +- ```TORCH_NCCL_TRACE_CPP_STACK (true, false)``: Setting this to true enables C++ stack stack trace captures in Flight Recorder. This is useful for slow + ``addr2line`` - for more information, see additional settings. - TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and - record the `duration` of each collective. May incur some CPU overhead. In the collected data, we end up with a - `duration` field that indicates how long a collective took to execute. + records the `duration` of each collective. This may incur some CPU overhead. In the collected data, the + ``duration`` filed indicates how long each collective took to execute.. -Additional settings +Additional Settings ------------------- -TORCH_SYMBOLIZE_MODE: {dladdr, addr2line, fast}: This setting controls the program that is used to retrieve C++ traces -from a running program. The default setting is `addr2line`. `fast` is a new experimental mode that is shown to be much -faster than the traditional `addr2line`. Use this setting in conjunction with `TORCH_NCCL_TRACE_CPP_STACK` to collect -C++ traces in `flight recorder` data. + +``TORCH_SYMBOLIZE_MODE {dladdr, addr2line, fast}:`` This setting determines the program used to retrieve C++ traces +from a running program. The default setting is ``addr2line``. ``fast`` is a new experimental mode that is shown to be much +faster than the traditional ``addr2line``. Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect +C++ traces in the Flight Recorder` data. Retrieving Flight Recorder Data via an API ------------------------------------------ -Flight recorder data can also be retrieved via an API call. -The API is shown below with the default arguments. + +You can also retrieve Flight Recorder data with an API call. +Below is the API with the default arguments: + .. code:: python torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False) -To view the data, you can unpickle the data +To view the data, you can unpickle it as shown below: + .. code:: python t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace()) print(t) Flight Recorder File Formats ---------------------------- -Flight recorder files are dumped out in `pickle` format. Files are written out to local disks or mounted shared NFS + +Flight Recorder files are dumped in ``pickle`` format. Files are written to local disks or mounted shared NFS folders. -Contents of a flight recorder `unpickled` file is shown below. -.. code-block: JSON + +The contents of a Flight Recorder ``unpickled`` file are shown below: +.. code-block: json + { "version": "2.3", "pg_config": { @@ -144,19 +153,23 @@ Contents of a flight recorder `unpickled` file is shown below. Analyzing Flight Recorder Dumps ------------------------------- -We have convenient scripts available in `pytorch/tools/flight_recorder` directory that can be used to analyze captured + +We have convenient scripts available in `pytorch/tools/flight_recorder` directory for analyzing captured data. -1. In order to run the convenience script, all files from a rank must first be copied over into a single directory. +To run the convenience script, follow these steps: + +1. Copy all files from a rank into a single directory. -2. 
To run it, one can use command line: +2. To run the script, use this command: .. code:: python python fr_trace.py -d [-o ] Conclusion ---------- -This tutorial introduces a new PyTorch diagnostic tool called `flight recorder`. The tutorial talks about how flight -recorder can be enabled to collect diagnostic data from a machine. -Data captured from flight recorder can be analyzed using a convenience script in the `tools/flight_recorder` directory -in the PyTorch repository. +In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder. +We have discussed how to enable Flight Recorder to collect diagnostic data from a machine. +Additionally, we explored how to analyze the data captured from the flight recorder using a +convenience script located in the `tools/flight_recorder `__ +directory of the PyTorch repository. From 235820129dc15f626288b45dfdff764cde807393 Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Thu, 5 Sep 2024 17:42:37 -0700 Subject: [PATCH 12/31] More HTML and formatting fixes Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: --- prototype_source/flight_recorder_tutorial.rst | 133 +++++++++--------- 1 file changed, 69 insertions(+), 64 deletions(-) diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index afa35aae0c7..14d76c39ebc 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -1,14 +1,19 @@ (prototype) Flight Recorder for Debugging ========================================= -**Author**: `Chirag Pandya `, `Junjie Wang ` +**Author**: `Chirag Pandya `_, `Junjie Wang `_ What you will learn ------------------- * Learn about a new tool for debugging stuck jobs during distributed training. * Learn how you can enable the tool and use the collected data for analyzing stuck jobs. +Prerequisites +------------- +- PyTorch version 2.5 or later. + + Overview ------------------------------------ +-------- An AI distributed training job refers to the process of training a machine learning model using multiple devices, such as GPUs or CPUs, connected in a network. This approach allows for faster and more efficient training of large models that require significant computational resources. @@ -41,25 +46,19 @@ There are two core parts to Flight Recorder. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file. - An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below). -Prerequisites -------------- -- PyTorch version 2.5 or later. - Enabling Flight Recorder ------------------------ -There are two required environment variables to get the initial version of flight recorder working. -- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. N represents the number of - entries that will be kept internally in a circular buffer. We recommended to set this value at 2000. -- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled, - there will be one file per rank output in the jobs running directory. - +There are two required environment variables to get the initial version of Flight Recorder working. +- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. N represents the number of entries that will be kept internally in a circular buffer. 
We recommended to set this value at 2000. +- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled, there will be one file per rank output in the jobs running directory. + **Optional settings:** - ```TORCH_NCCL_TRACE_CPP_STACK (true, false)``: Setting this to true enables C++ stack stack trace captures in Flight Recorder. This is useful for slow ``addr2line`` - for more information, see additional settings. - - TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and +- TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and records the `duration` of each collective. This may incur some CPU overhead. In the collected data, the - ``duration`` filed indicates how long each collective took to execute.. + ``duration`` field indicates how long each collective took to execute. Additional Settings ------------------- @@ -76,11 +75,14 @@ You can also retrieve Flight Recorder data with an API call. Below is the API with the default arguments: .. code:: python + torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False) + To view the data, you can unpickle it as shown below: .. code:: python + t = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace()) print(t) @@ -96,61 +98,63 @@ The contents of a Flight Recorder ``unpickled`` file are shown below: { "version": "2.3", "pg_config": { - "0": { - "name": "0", - "desc": "default_pg", - "ranks": "[0, 1]" - } + "0": { + "name": "0", + "desc": "default_pg", + "ranks": "[0, 1]" + } }, "pg_status": { - "0": { - "last_enqueued_collective": 2, - "last_started_collective": -1, - "last_completed_collective": 2 - } + "0": { + "last_enqueued_collective": 2, + "last_started_collective": -1, + "last_completed_collective": 2 + } }, "entries": [ - { - "frames": [ - { - "name": "test_short_pickle", - "filename": "pytorch/test/distributed/test_c10d_nccl.py", - "line": 3647 - }, - ... - { - "name": "spawn_main", - "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py", - "line": 116 - }, - { - "name": "", - "filename": "", - "line": 1 - } - ], - "record_id": 0, - "pg_id": 0, - "process_group": ("0", "default_pg"), - "collective_seq_id": 1, - "p2p_seq_id": 0, - "op_id": 1, - "profiling_name": "nccl:all_reduce", - "time_created_ns": 1724779239936775119, - "input_sizes": [[3, 4]], - "input_dtypes": ["Float"], - "output_sizes": [[3, 4]], - "output_dtypes": ["Float"], - "state": "completed", - "time_discovered_started_ns": null, - "time_discovered_completed_ns": 1724779239975811724, - "retired": true, - "timeout_ms": 600000, - "is_p2p": false - }, + { + "frames": [ + { + "name": "test_short_pickle", + "filename": "pytorch/test/distributed/test_c10d_nccl.py", + "line": 3647 + }, + ... 
+ { + "name": "spawn_main", + "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py", + "line": 116 + }, + { + "name": "", + "filename": "", + "line": 1 + } + ], + "record_id": 0, + "pg_id": 0, + "process_group": ("0", "default_pg"), + "collective_seq_id": 1, + "p2p_seq_id": 0, + "op_id": 1, + "profiling_name": "nccl:all_reduce", + "time_created_ns": 1724779239936775119, + "input_sizes": [[3, 4]], + "input_dtypes": ["Float"], + "output_sizes": [[3, 4]], + "output_dtypes": ["Float"], + "state": "completed", + "time_discovered_started_ns": null, + "time_discovered_completed_ns": 1724779239975811724, + "retired": true, + "timeout_ms": 600000, + "is_p2p": false + }, ...] + } + Analyzing Flight Recorder Dumps ------------------------------- @@ -163,13 +167,14 @@ To run the convenience script, follow these steps: 2. To run the script, use this command: .. code:: python + python fr_trace.py -d [-o ] Conclusion ---------- -In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder. +In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder. We have discussed how to enable Flight Recorder to collect diagnostic data from a machine. -Additionally, we explored how to analyze the data captured from the flight recorder using a +Additionally, we explored how to analyze the data captured from the Flight Recorder using a convenience script located in the `tools/flight_recorder `__ directory of the PyTorch repository. From de5654ac6a94408ce059f373f350d18c7984b23a Mon Sep 17 00:00:00 2001 From: Chirag Pandya Date: Fri, 6 Sep 2024 09:14:02 -0700 Subject: [PATCH 13/31] More HTML formatting changes Test Plan: Ran rst2html5 and viewed HTML on browser. --- prototype_source/flight_recorder_tutorial.rst | 166 +++++++++--------- 1 file changed, 86 insertions(+), 80 deletions(-) diff --git a/prototype_source/flight_recorder_tutorial.rst b/prototype_source/flight_recorder_tutorial.rst index 14d76c39ebc..8130e1537bb 100644 --- a/prototype_source/flight_recorder_tutorial.rst +++ b/prototype_source/flight_recorder_tutorial.rst @@ -25,61 +25,66 @@ A distributed AI training job is considered `stuck` when it stops making meaning time. A job can get stuck for various reasons: -- **Data Starvation:** This occurs when the training job is not receiving data at the expected rate, possibly due to -issues with the data pipeline or the data source. -- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU, GPU, or -memory), the job might not be able to proceed. -- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on different -devices. If there are network issues, communication between these devices may be disrupted, causing the job to get -stuck. -- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a job to -get stuck. -- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need -to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can -occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an -indefinite wait for the job to progress. 
+
+- **Data Starvation:** This occurs when the training job is not receiving data at the expected rate, possibly due to issues with the data pipeline or the data source.
+
+- **Resource Constraints:** If the system running the job does not have enough computational resources (such as CPU, GPU, or memory), the job might not be able to proceed.
+
+- **Network Issues:** In a distributed training setup, different parts of the model or data may be processed on different devices. If there are network issues, communication between these devices may be disrupted, causing the job to get stuck.
+
+- **Software Bugs or Errors:** Errors in the training code or the underlying libraries and frameworks can also cause a job to get stuck.
+
+- **Synchronization Issues:** In distributed training, different parts of the computation are often run in parallel and need to be synchronized at certain points. If this synchronization fails, the job can get stuck. For example, a deadlock can occur if one or more ranks fail to join a collective while the remaining ranks have joined. This results in an indefinite wait for the job to progress.

 Flight Recorder, as the name suggests, captures diagnostic information as collectives run. The captured diagnostic
-information can be used to help identify the root cause of issues when jobs get stuck.
+information is used to help root cause issues when jobs get stuck.
 There are two core parts to Flight Recorder.
-- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer.
-Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
-- An analyzer script is available in the `pytorch/tools/flight_recorder` directory (details below).
+
+- The collection portion: when enabled, information about collectives is recorded in an in-memory circular buffer. Upon job timeout, or on demand, the in-memory buffer can be retrieved or dumped to file.
+
+- An analyzer script is available in the `tools/flight_recorder <https://github.com/pytorch/pytorch/tree/main/tools/flight_recorder>`__ directory (details below).
+  The analyzer script runs known heuristics using the collected data and attempts to automatically identify the underlying issue that caused the job to stall.

 Enabling Flight Recorder
 ------------------------
 There are two required environment variables to get the initial version of Flight Recorder working.
-- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection. N represents the number of
-  entries that will be kept internally in a circular buffer. We recommended to set this value at 2000.
-- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout. If enabled,
-  there will be one file per rank output in the jobs running directory.
+
+- ``TORCH_NCCL_TRACE_BUFFER_SIZE`` (``0``, ``N`` where ``N`` is a positive number): Setting ``N`` enables collection.
+  ``N`` represents the number of entries that will be kept internally in a circular buffer.
+  We recommend setting this value to 2000.
+- ``TORCH_NCCL_DUMP_ON_TIMEOUT = (true, false)``: Setting this to ``true`` will write out diagnostic files to disk on job timeout.
+  If enabled, there will be one file per rank output in the job's running directory.

 **Optional settings:**

-- ```TORCH_NCCL_TRACE_CPP_STACK (true, false)``: Setting this to true enables C++ stack stack trace captures in Flight Recorder. This is useful for slow
-  ``addr2line`` - for more information, see additional settings.
+- ``TORCH_NCCL_TRACE_CPP_STACK (true, false)``: Setting this to ``true`` enables C++ stack trace captures in Flight Recorder.
+  C++ stack traces can be useful in providing the exact code path from a PyTorch Python call down to the primitive
+  C++ implementations. Also see ``TORCH_SYMBOLIZE_MODE`` in additional settings.
-- TORCH_NCCL_ENABLE_TIMING (true, false) true = enable additional cuda events at the start of each collective and +- ``TORCH_NCCL_TRACE_CPP_STACK (true, false)``: Setting this to true enables C++ stack stack trace captures in Flight Recorder. + C++ stack traces can be useful in providing the exact code path from a PyTorch Python call down to the primitive + C++ implementations. Also see ``TORCH_SYMBOLIZE_MODE`` in additional settings. +- ``TORCH_NCCL_ENABLE_TIMING (true, false)``: true = enable additional cuda events at the start of each collective and records the `duration` of each collective. This may incur some CPU overhead. In the collected data, the ``duration`` field indicates how long each collective took to execute. Additional Settings ------------------- -``TORCH_SYMBOLIZE_MODE {dladdr, addr2line, fast}:`` This setting determines the program used to retrieve C++ traces -from a running program. The default setting is ``addr2line``. ``fast`` is a new experimental mode that is shown to be much -faster than the traditional ``addr2line``. Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect -C++ traces in the Flight Recorder` data. +- ``TORCH_SYMBOLIZE_MODE {dladdr, addr2line, fast}``: This setting determines the program used to retrieve C++ traces from a running program. + The default setting is ``addr2line``. + + ``fast`` is a new experimental mode that is shown to be much faster than the traditional ``addr2line``. + Use this setting in conjunction with ``TORCH_NCCL_TRACE_CPP_STACK`` to collect C++ traces in the Flight Recorder data. Retrieving Flight Recorder Data via an API ------------------------------------------ You can also retrieve Flight Recorder data with an API call. -Below is the API with the default arguments: +The API with the default arguments is shown below: .. code:: python torch._C._distributed_c10d._dump_nccl_trace(includeCollectives=True, includeStackTraces=True, onlyActive=False) -To view the data, you can unpickle it as shown below: +To view the data, you can ``unpickle`` it as shown below: .. code:: python @@ -93,65 +98,65 @@ Flight Recorder files are dumped in ``pickle`` format. Files are written to loca folders. The contents of a Flight Recorder ``unpickled`` file are shown below: -.. code-block: json + +.. code-block:: json { - "version": "2.3", + "version": "2.5", "pg_config": { - "0": { - "name": "0", - "desc": "default_pg", - "ranks": "[0, 1]" - } + "0": { + "name": "0", + "desc": "default_pg", + "ranks": "[0, 1]" + } }, "pg_status": { - "0": { - "last_enqueued_collective": 2, - "last_started_collective": -1, - "last_completed_collective": 2 - } + "0": { + "last_enqueued_collective": 2, + "last_started_collective": -1, + "last_completed_collective": 2 + } }, "entries": [ { - "frames": [ - { - "name": "test_short_pickle", - "filename": "pytorch/test/distributed/test_c10d_nccl.py", - "line": 3647 - }, - ... 
-        {
-          "name": "spawn_main",
-          "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
-          "line": 116
-        },
-        {
-          "name": "",
-          "filename": "",
-          "line": 1
-        }
-      ],
-      "record_id": 0,
-      "pg_id": 0,
-      "process_group": ("0", "default_pg"),
-      "collective_seq_id": 1,
-      "p2p_seq_id": 0,
-      "op_id": 1,
-      "profiling_name": "nccl:all_reduce",
-      "time_created_ns": 1724779239936775119,
-      "input_sizes": [[3, 4]],
-      "input_dtypes": ["Float"],
-      "output_sizes": [[3, 4]],
-      "output_dtypes": ["Float"],
-      "state": "completed",
-      "time_discovered_started_ns": null,
-      "time_discovered_completed_ns": 1724779239975811724,
-      "retired": true,
-      "timeout_ms": 600000,
-      "is_p2p": false
-    },
-  ...]
-
+      "frames": [
+        {
+          "name": "test_short_pickle",
+          "filename": "pytorch/test/distributed/test_c10d_nccl.py",
+          "line": 3647
+        },
+        {
+          "name": "spawn_main",
+          "filename": ".conda/envs/pytorch-3.10/lib/python3.10/multiprocessing/spawn.py",
+          "line": 116
+        },
+        {
+          "name": "",
+          "filename": "",
+          "line": 1
+        }
+      ],
+      "record_id": 0,
+      "pg_id": 0,
+      "process_group": ("0", "default_pg"),
+      "collective_seq_id": 1,
+      "p2p_seq_id": 0,
+      "op_id": 1,
+      "profiling_name": "nccl:all_reduce",
+      "time_created_ns": 1724779239936775119,
+      "input_sizes": [[3, 4]],
+      "input_dtypes": ["Float"],
+      "output_sizes": [[3, 4]],
+      "output_dtypes": ["Float"],
+      "state": "completed",
+      "time_discovered_started_ns": null,
+      "time_discovered_completed_ns": 1724779239975811724,
+      "retired": true,
+      "timeout_ms": 600000,
+      "is_p2p": false
+    },
+    ...
+  ]
 }


@@ -166,6 +171,7 @@ To run the convenience script, follow these steps:

 1. Copy all files from a rank into a single directory.

 2. To run the script, use this command:
+
 .. code:: python

     python fr_trace.py -d <dump_dir> [-o <output_file>]

@@ -173,7 +179,7 @@

 Conclusion
 ----------
-In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder. 
+In this tutorial, we have learned about a new PyTorch diagnostic tool called Flight Recorder.
 We have discussed how to enable Flight Recorder to collect diagnostic data from a machine.
 Additionally, we explored how to analyze the data captured from the Flight Recorder using a convenience
 script located in the `tools/flight_recorder `__

From 3b0c4cc667eadb3ed05a940679bf62d4d67b7016 Mon Sep 17 00:00:00 2001
From: Xilun Wu <12968408+XilunWu@users.noreply.github.com>
Date: Mon, 19 Aug 2024 13:26:05 -0700
Subject: [PATCH 14/31] [dtensor][debug] CommDebugMode recipe (#3001)

* Add [dtensor][debug] CommDebugMode recipe

---------

Co-authored-by: Svetlana Karslioglu 
---
 .../distributed_comm_debug_mode.rst           | 210 ++++++++++++++++++
 recipes_source/recipes_index.rst              |   8 +
 2 files changed, 218 insertions(+)
 create mode 100644 recipes_source/distributed_comm_debug_mode.rst

diff --git a/recipes_source/distributed_comm_debug_mode.rst b/recipes_source/distributed_comm_debug_mode.rst
new file mode 100644
index 00000000000..db79cdc8992
--- /dev/null
+++ b/recipes_source/distributed_comm_debug_mode.rst
@@ -0,0 +1,210 @@
+Getting Started with ``CommDebugMode``
+=====================================================
+
+**Author**: `Anshul Sinha `__
+
+
+In this tutorial, we will explore how to use ``CommDebugMode`` with PyTorch's
+DistributedTensor (DTensor) for debugging by tracking collective operations in distributed training environments.
+
+Prerequisites
+---------------------
+
+* Python 3.8 - 3.11
+* PyTorch 2.2 or later
+
+
+What is ``CommDebugMode`` and why is it useful
+----------------------------------------------------
+As the size of models continues to increase, users are seeking to leverage various combinations
+of parallel strategies to scale up distributed training. However, the lack of interoperability
+between existing solutions poses a significant challenge, primarily due to the absence of a
+unified abstraction that can bridge these different parallelism strategies. To address this
+issue, PyTorch has proposed `DistributedTensor (DTensor)
+`_,
+which abstracts away the complexities of tensor communication in distributed training and
+provides a seamless user experience. However, whether you are working with existing parallelism
+solutions or developing new ones with a unified abstraction like DTensor, the lack of
+transparency about what collective communication happens under the hood, and when it happens,
+can make it challenging for advanced users to identify and resolve issues. To address this
+challenge, ``CommDebugMode``, a Python context manager, serves as one of the primary debugging
+tools for DTensors, enabling users to view when and why collective operations are happening
+when using DTensors.
+
+
+Using ``CommDebugMode``
+------------------------
+
+Here is how you can use ``CommDebugMode``:
+
+.. code-block:: python
+
+    # The model used in this example is an MLPModule applying Tensor Parallel
+    comm_mode = CommDebugMode()
+    with comm_mode:
+        output = model(inp)
+
+    # print the operation level collective tracing information
+    print(comm_mode.generate_comm_debug_tracing_table(noise_level=0))
+
+    # log the operation level collective tracing information to a file
+    comm_mode.log_comm_debug_tracing_table_to_file(
+        noise_level=1, file_name="transformer_operation_log.txt"
+    )
+
+    # dump the operation level collective tracing information to json file,
+    # used in the visual browser below
+    comm_mode.generate_json_dump(noise_level=2)
+
+This is what the output looks like for an ``MLPModule`` at noise level 0:
+
+.. code-block:: python
+
+    Expected Output:
+      Global
+        FORWARD PASS
+          *c10d_functional.all_reduce: 1
+        MLPModule
+          FORWARD PASS
+            *c10d_functional.all_reduce: 1
+          MLPModule.net1
+          MLPModule.relu
+          MLPModule.net2
+            FORWARD PASS
+              *c10d_functional.all_reduce: 1
+
+To use ``CommDebugMode``, you must wrap the code running the model in ``CommDebugMode`` and call the API that
+you want to use to display the data. You can also use a ``noise_level`` argument to control the verbosity
+level of displayed information. Here is what each noise level displays:
+
+| 0. Prints module-level collective counts
+| 1. Prints DTensor operations (not including trivial operations), module sharding information
+| 2. Prints tensor operations (not including trivial operations)
+| 3. Prints all operations
+
+In the example above, you can see that the collective operation ``all_reduce`` occurs once in the forward pass
+of the ``MLPModule``. Furthermore, you can use ``CommDebugMode`` to pinpoint that the all-reduce operation happens
+in the second linear layer of the ``MLPModule``.
+
+
+Below is the interactive module tree visualization that you can use to upload your own JSON dump:
+
+.. raw:: html
+
+    <!-- Embedded "CommDebugMode Module Tree" page: an interactive
+         visualization with a drop zone ("Drag file here") for loading
+         a generate_json_dump output file. -->
+
+
+Conclusion
+------------------------------------------
+
+In this recipe, we have learned how to use ``CommDebugMode`` to debug Distributed Tensors and
+parallelism solutions that use communication collectives with PyTorch. You can use your own
+JSON outputs in the embedded visual browser.
+
+For more detailed information about ``CommDebugMode``, see
+`comm_mode_features_example.py
+`_
diff --git a/recipes_source/recipes_index.rst b/recipes_source/recipes_index.rst
index 8959ea98a38..d94d7d5c22e 100644
--- a/recipes_source/recipes_index.rst
+++ b/recipes_source/recipes_index.rst
@@ -395,6 +395,13 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
    :link: ../recipes/distributed_async_checkpoint_recipe.html
    :tags: Distributed-Training
 
+.. customcarditem::
+   :header: Getting Started with CommDebugMode
+   :card_description: Learn how to use CommDebugMode for DTensors
+   :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png
+   :link: ../recipes/distributed_comm_debug_mode.html
+   :tags: Distributed-Training
+
 .. TorchServe
 
 .. customcarditem::
@@ -449,3 +456,4 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu
    /recipes/cuda_rpc
    /recipes/distributed_optim_torchscript
    /recipes/mobile_interpreter
+   /recipes/distributed_comm_debug_mode

From f952921c73d4acdbe866195eb6771482cefdc239 Mon Sep 17 00:00:00 2001
From: Shaoyu Yang <100203773+shaoyuyoung@users.noreply.github.com>
Date: Tue, 20 Aug 2024 05:55:12 +0800
Subject: [PATCH 15/31] fix: rm `use_cuda` param (#3002)

Co-authored-by: Svetlana Karslioglu 
---
 recipes_source/recipes/profiler_recipe.py | 1 -
 1 file changed, 1 deletion(-)

diff --git a/recipes_source/recipes/profiler_recipe.py b/recipes_source/recipes/profiler_recipe.py
index 47d9f86d8a8..f35172159b8 100644
--- a/recipes_source/recipes/profiler_recipe.py
+++ b/recipes_source/recipes/profiler_recipe.py
@@ -73,7 +73,6 @@
 # - ``record_shapes`` - whether to record shapes of the operator inputs;
 # - ``profile_memory`` - whether to report amount of memory consumed by
 #   model's Tensors;
-# - ``use_cuda`` - whether to measure execution time of CUDA kernels.
 #
 # Note: when using CUDA, profiler also shows the runtime CUDA events
 # occurring on the host.

From fa0387929fcf7c24e5079376dc2a0cb340e00bdc Mon Sep 17 00:00:00 2001
From: Svetlana Karslioglu 
Date: Thu, 22 Aug 2024 14:26:41 -0700
Subject: [PATCH 16/31] Add programmable Google Search to pytorch tutorials
 site (#2820)

* Add programmable Google Search to pytorch tutorials site

---
 _static/css/custom.css |  9 +++++++++
 _templates/layout.html | 17 +++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/_static/css/custom.css b/_static/css/custom.css
index a467a088159..09aba28f258 100755
--- a/_static/css/custom.css
+++ b/_static/css/custom.css
@@ -91,3 +91,12 @@
     transition: none;
     transform-origin: none;
 }
+
+.pytorch-left-menu-search input[type=text] {
+  background-image: none;
+}
+
+.gsc-control-cse {
+  padding-left: 0px !important;
+  padding-bottom: 0px !important;
+}

diff --git a/_templates/layout.html b/_templates/layout.html
index 22129040e49..1c632de63f8 100644
--- a/_templates/layout.html
+++ b/_templates/layout.html
@@ -11,6 +11,23 @@
 
 {%- endblock %}
 
+{% block sidebartitle %}
+  {% if theme_display_version %}
+    {%- set nav_version = version %}
+    {% if READTHEDOCS and current_version %}
+      {%- set nav_version = current_version %}
+    {% endif %}
+    {% if nav_version %}
+      
+ {{ nav_version }} +
+ {% endif %} + {% endif %} + +{% endblock %} {% block footer %} {{ super() }} From ab36383d582ecb9a311a35c109d085aa8c7147ef Mon Sep 17 00:00:00 2001 From: Ankith Gunapal Date: Fri, 23 Aug 2024 10:58:18 -0700 Subject: [PATCH 17/31] Tutorial for AOTI Python runtime (#2997) * Tutorial for AOTI Python runtime --------- Co-authored-by: Svetlana Karslioglu Co-authored-by: Angela Yi --- .ci/docker/build.sh | 3 +- .ci/docker/common/common_utils.sh | 2 +- .ci/docker/requirements.txt | 6 +- .jenkins/metadata.json | 3 + en-wordlist.txt | 3 +- recipes_source/recipes_index.rst | 6 + recipes_source/torch_export_aoti_python.py | 220 +++++++++++++++++++++ 7 files changed, 237 insertions(+), 6 deletions(-) create mode 100644 recipes_source/torch_export_aoti_python.py diff --git a/.ci/docker/build.sh b/.ci/docker/build.sh index 31f42fdbd85..c646b8f9a86 100755 --- a/.ci/docker/build.sh +++ b/.ci/docker/build.sh @@ -11,8 +11,9 @@ IMAGE_NAME="$1" shift export UBUNTU_VERSION="20.04" +export CUDA_VERSION="12.4.1" -export BASE_IMAGE="ubuntu:${UBUNTU_VERSION}" +export BASE_IMAGE="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}" echo "Building ${IMAGE_NAME} Docker image" docker build \ diff --git a/.ci/docker/common/common_utils.sh b/.ci/docker/common/common_utils.sh index b20286a4099..c7eabda555d 100644 --- a/.ci/docker/common/common_utils.sh +++ b/.ci/docker/common/common_utils.sh @@ -22,5 +22,5 @@ conda_run() { } pip_install() { - as_ci_user conda run -n py_$ANACONDA_PYTHON_VERSION pip install --progress-bar off $* + as_ci_user conda run -n py_$ANACONDA_PYTHON_VERSION pip3 install --progress-bar off $* } diff --git a/.ci/docker/requirements.txt b/.ci/docker/requirements.txt index 00cf2f21033..9668b17fc3a 100644 --- a/.ci/docker/requirements.txt +++ b/.ci/docker/requirements.txt @@ -30,8 +30,8 @@ pytorch-lightning torchx torchrl==0.5.0 tensordict==0.5.0 -ax-platform>==0.4.0 -nbformat>==5.9.2 +ax-platform>=0.4.0 +nbformat>=5.9.2 datasets transformers torchmultimodal-nightly # needs to be updated to stable as soon as it's avaialable @@ -68,4 +68,4 @@ pygame==2.1.2 pycocotools semilearn==0.3.2 torchao==0.0.3 -segment_anything==1.0 \ No newline at end of file +segment_anything==1.0 diff --git a/.jenkins/metadata.json b/.jenkins/metadata.json index 4814f9a7d2b..2f1a9933aab 100644 --- a/.jenkins/metadata.json +++ b/.jenkins/metadata.json @@ -28,6 +28,9 @@ "intermediate_source/model_parallel_tutorial.py": { "needs": "linux.16xlarge.nvidia.gpu" }, + "recipes_source/torch_export_aoti_python.py": { + "needs": "linux.g5.4xlarge.nvidia.gpu" + }, "advanced_source/pendulum.py": { "needs": "linux.g5.4xlarge.nvidia.gpu", "_comment": "need to be here for the compiling_optimizer_lr_scheduler.py to run." diff --git a/en-wordlist.txt b/en-wordlist.txt index 62762ab69cc..e69cbaa1a5f 100644 --- a/en-wordlist.txt +++ b/en-wordlist.txt @@ -2,6 +2,7 @@ ACL ADI AOT +AOTInductor APIs ATen AVX @@ -617,4 +618,4 @@ warmstarting warmup webp wsi -wsis \ No newline at end of file +wsis diff --git a/recipes_source/recipes_index.rst b/recipes_source/recipes_index.rst index d94d7d5c22e..caccdcc28f7 100644 --- a/recipes_source/recipes_index.rst +++ b/recipes_source/recipes_index.rst @@ -150,6 +150,12 @@ Recipes are bite-sized, actionable examples of how to use specific PyTorch featu :link: ../recipes/recipes/swap_tensors.html :tags: Basics +.. customcarditem:: + :header: torch.export AOTInductor Tutorial for Python runtime + :card_description: Learn an end-to-end example of how to use AOTInductor for python runtime. 
+ :image: ../_static/img/thumbnails/cropped/generic-pytorch-logo.png + :link: ../recipes/torch_export_aoti_python.html + :tags: Basics .. Interpretability diff --git a/recipes_source/torch_export_aoti_python.py b/recipes_source/torch_export_aoti_python.py new file mode 100644 index 00000000000..136862078c1 --- /dev/null +++ b/recipes_source/torch_export_aoti_python.py @@ -0,0 +1,220 @@ +# -*- coding: utf-8 -*- + +""" +(Beta) ``torch.export`` AOTInductor Tutorial for Python runtime +=============================================================== +**Author:** Ankith Gunapal, Bin Bao, Angela Yi +""" + +###################################################################### +# +# .. warning:: +# +# ``torch._inductor.aot_compile`` and ``torch._export.aot_load`` are in Beta status and are subject to backwards compatibility +# breaking changes. This tutorial provides an example of how to use these APIs for model deployment using Python runtime. +# +# It has been shown `previously `__ how AOTInductor can be used +# to do Ahead-of-Time compilation of PyTorch exported models by creating +# a shared library that can be run in a non-Python environment. +# +# +# In this tutorial, you will learn an end-to-end example of how to use AOTInductor for python runtime. +# We will look at how to use :func:`torch._inductor.aot_compile` along with :func:`torch.export.export` to generate a +# shared library. Additionally, we will examine how to execute the shared library in Python runtime using :func:`torch._export.aot_load`. +# You will learn about the speed up seen in the first inference time using AOTInductor, especially when using +# ``max-autotune`` mode which can take some time to execute. +# +# **Contents** +# +# .. contents:: +# :local: + +###################################################################### +# Prerequisites +# ------------- +# * PyTorch 2.4 or later +# * Basic understanding of ``torch.export`` and AOTInductor +# * Complete the `AOTInductor: Ahead-Of-Time Compilation for Torch.Export-ed Models `_ tutorial + +###################################################################### +# What you will learn +# ---------------------- +# * How to use AOTInductor for python runtime. +# * How to use :func:`torch._inductor.aot_compile` along with :func:`torch.export.export` to generate a shared library +# * How to run a shared library in Python runtime using :func:`torch._export.aot_load`. +# * When do you use AOTInductor for python runtime + +###################################################################### +# Model Compilation +# ----------------- +# +# We will use the TorchVision pretrained `ResNet18` model and TorchInductor on the +# exported PyTorch program using :func:`torch._inductor.aot_compile`. +# +# .. note:: +# +# This API also supports :func:`torch.compile` options like ``mode`` +# This means that if used on a CUDA enabled device, you can, for example, set ``"max_autotune": True`` +# which leverages Triton based matrix multiplications & convolutions, and enables CUDA graphs by default. +# +# We also specify ``dynamic_shapes`` for the batch dimension. 
In this example, ``min=2`` is not a bug and is
+# explained in `The 0/1 Specialization Problem `__


+import os
+import torch
+from torchvision.models import ResNet18_Weights, resnet18

+model = resnet18(weights=ResNet18_Weights.DEFAULT)
+model.eval()

+with torch.inference_mode():

+    # Specify the generated shared library path
+    aot_compile_options = {
+            "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"),
+    }
+    if torch.cuda.is_available():
+        device = "cuda"
+        aot_compile_options.update({"max_autotune": True})
+    else:
+        device = "cpu"

+    model = model.to(device=device)
+    example_inputs = (torch.randn(2, 3, 224, 224, device=device),)

+    # min=2 is not a bug and is explained in the 0/1 Specialization Problem
+    batch_dim = torch.export.Dim("batch", min=2, max=32)
+    exported_program = torch.export.export(
+        model,
+        example_inputs,
+        # Specify the first dimension of the input x as dynamic
+        dynamic_shapes={"x": {0: batch_dim}},
+    )
+    so_path = torch._inductor.aot_compile(
+        exported_program.module(),
+        example_inputs,
+        # Specify the generated shared library path
+        options=aot_compile_options
+    )


+######################################################################
+# Model Inference in Python
+# -------------------------
+#
+# Typically, the shared object generated above is used in a non-Python environment. In PyTorch 2.3,
+# we added a new API called :func:`torch._export.aot_load` to load the shared library in the Python runtime.
+# The API follows a structure similar to the :func:`torch.jit.load` API. You need to specify the path
+# of the shared library and the device where it should be loaded.
+#
+# .. note::
+#    In the example above, we specified ``batch_size=1`` for inference and it still functions correctly even though we specified ``min=2`` in
+#    :func:`torch.export.export`.


+import os
+import torch

+device = "cuda" if torch.cuda.is_available() else "cpu"
+model_so_path = os.path.join(os.getcwd(), "resnet18_pt2.so")

+model = torch._export.aot_load(model_so_path, device)
+example_inputs = (torch.randn(1, 3, 224, 224, device=device),)

+with torch.inference_mode():
+    output = model(example_inputs)

+######################################################################
+# When to use AOTInductor for Python Runtime
+# ------------------------------------------
+#
+# One of the requirements for using AOTInductor is that the model shouldn't have any graph breaks.
+# Once this requirement is met, the primary use case for using AOTInductor Python Runtime is for
+# model deployment using Python.
+# There are mainly two reasons why you would use AOTInductor Python Runtime:
+#
+# -  ``torch._inductor.aot_compile`` generates a shared library. This is useful for model
+#    versioning for deployments and tracking model performance over time.
+# -  With :func:`torch.compile` being a JIT compiler, there is a warmup
+#    cost associated with the first compilation. Your deployment needs to account for the
+#    compilation time taken for the first inference. With AOTInductor, the compilation is
+#    done offline using ``torch.export.export`` & ``torch._inductor.aot_compile``. The deployment
+#    would only load the shared library using ``torch._export.aot_load`` and run inference.
+
+#
+#
+# The section below shows the speedup achieved with AOTInductor for first inference.
+#
+# We define a utility function ``timed`` to measure the time taken for inference.
+#

+import time
+def timed(fn):
+    # Returns the result of running `fn()` and the time it took for `fn()` to run,
+    # in milliseconds. We use CUDA events and synchronization for accurate
+    # measurement on CUDA enabled devices.
+    if torch.cuda.is_available():
+        start = torch.cuda.Event(enable_timing=True)
+        end = torch.cuda.Event(enable_timing=True)
+        start.record()
+    else:
+        start = time.time()

+    result = fn()
+    if torch.cuda.is_available():
+        end.record()
+        torch.cuda.synchronize()
+    else:
+        end = time.time()

+    # Measure time taken to execute the function in milliseconds
+    if torch.cuda.is_available():
+        duration = start.elapsed_time(end)
+    else:
+        duration = (end - start) * 1000

+    return result, duration


+######################################################################
+# Let's measure the time for first inference using AOTInductor

+torch._dynamo.reset()

+model = torch._export.aot_load(model_so_path, device)
+example_inputs = (torch.randn(1, 3, 224, 224, device=device),)

+with torch.inference_mode():
+    _, time_taken = timed(lambda: model(example_inputs))
+    print(f"Time taken for first inference for AOTInductor is {time_taken:.2f} ms")


+######################################################################
+# Let's measure the time for first inference using ``torch.compile``

+torch._dynamo.reset()

+model = resnet18(weights=ResNet18_Weights.DEFAULT).to(device)
+model.eval()

+model = torch.compile(model)
+example_inputs = torch.randn(1, 3, 224, 224, device=device)

+with torch.inference_mode():
+    _, time_taken = timed(lambda: model(example_inputs))
+    print(f"Time taken for first inference for torch.compile is {time_taken:.2f} ms")

+######################################################################
+# We see that there is a drastic speedup in first inference time using AOTInductor compared
+# to ``torch.compile``.

+######################################################################
+# Conclusion
+# ----------
+#
+# In this recipe, we have learned how to effectively use AOTInductor for Python runtime by
+# compiling and loading a pretrained ``ResNet18`` model using the ``torch._inductor.aot_compile``
+# and ``torch._export.aot_load`` APIs. This process demonstrates the practical application of
+# generating a shared library and running it within a Python environment, even with dynamic shape
+# considerations and device-specific optimizations. We also looked at the advantage of using
+# AOTInductor in model deployments, with regard to the speedup in first inference time.
From fc27f08f9ed968b17fa9dfe23117904f722feb88 Mon Sep 17 00:00:00 2001 From: Svetlana Karslioglu Date: Sat, 24 Aug 2024 11:14:07 -0700 Subject: [PATCH 18/31] Create tutorial_submission_policy.md (#2995) - Add a policy for new tutorials submissions --------- Co-authored-by: albanD Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> --- README.md | 2 + tutorial_submission_policy.md | 107 ++++++++++++++++++++++++++++++++++ 2 files changed, 109 insertions(+) create mode 100644 tutorial_submission_policy.md diff --git a/README.md b/README.md index 0c961afd262..fe4b4b6edd6 100644 --- a/README.md +++ b/README.md @@ -22,6 +22,8 @@ We use sphinx-gallery's [notebook styled examples](https://sphinx-gallery.github Here is how you can create a new tutorial (for a detailed description, see [CONTRIBUTING.md](./CONTRIBUTING.md)): +NOTE: Before submitting a new tutorial, read [PyTorch Tutorial Submission Policy](./tutorial_submission_policy.md). + 1. Create a Python file. If you want it executed while inserted into documentation, save the file with the suffix `tutorial` so that the file name is `your_tutorial.py`. 2. Put it in one of the `beginner_source`, `intermediate_source`, `advanced_source` directory based on the level of difficulty. If it is a recipe, add it to `recipes_source`. For tutorials demonstrating unstable prototype features, add to the `prototype_source`. 3. For Tutorials (except if it is a prototype feature), include it in the `toctree` directive and create a `customcarditem` in [index.rst](./index.rst). diff --git a/tutorial_submission_policy.md b/tutorial_submission_policy.md new file mode 100644 index 00000000000..c5c3a800876 --- /dev/null +++ b/tutorial_submission_policy.md @@ -0,0 +1,107 @@ +# PyTorch Tutorial Submission Policy + +This policy outlines the criteria and process for submitting new +tutorials to the PyTorch community. +Our goal is to ensure that all tutorials are of high quality, +relevant, and up-to-date, supporting both the growth of the PyTorch +users and the evolution of the PyTorch framework itself. By following +these guidelines, contributors can help us maintain a robust and +informative educational environment. + +## Acceptance Criteria For New Tutorials + +We accept new tutorials that adhere to one of the following use cases: + +* **Demonstrate New PyTorch Features:** Tutorials that support new features + for upcoming PyTorch releases are typically authored by the engineers who + are developing these features. These tutorials are crucial for showcasing + the latest advancements in PyTorch. We typically do not require more than + one tutorial per feature. + +* **Tutorials showcasing PyTorch usage with other tools and libraries:** We + accept community-contributed tutorials that illustrate innovative uses of + PyTorch alongside other open-source projects, models, and tools. Please + ensure that your tutorial remains neutral and does not promote or endorse + proprietary technologies over others. + +The first use case does not require going through the submission +process outlined below. If your tutorial falls under the second category, +please read and follow the instructions in the +**Submission Process For Community-Contributed Tutorials** section. + +## Submission Process For Community-Contributed Tutorials + +To maintain the quality and relevance of tutorials, we request that +community-contributed tutorials undergo a review process. If you are +interested in contributing a tutorial, please follow these steps: + +1. 
**Create an issue:**
+   * Open an issue in the pytorch/tutorials repository proposing the
+     new tutorial. Clearly explain the importance of the tutorial and
+     confirm that there is no existing tutorial covering the same or
+     similar topic. A tutorial should not disproportionately endorse
+     one technology over another. Please consult with Core Maintainers
+     to ensure your content adheres to these guidelines.
+     Use the provided [ISSUE_TEMPLATE](https://github.com/pytorch/tutorials/blob/main/.github/ISSUE_TEMPLATE/feature-request.yml) for the new tutorial request - select **Feature request** when submitting an issue.
+
+   * If there is an existing tutorial on the topic that you would
+     like to significantly refactor, you can submit a PR. In the
+     description of the PR, explain why the changes are needed and
+     how they improve the tutorial.
+
+   * These issues will be triaged by PyTorch maintainers on a case-by-case basis.
+   * Link any supporting materials including discussions in other repositories.
+
+1. **Await Approval:**
+   * Wait for a response from the PyTorch Tutorials maintainers. A PyTorch
+     tutorial maintainer will review your proposal and
+     determine whether a tutorial on the proposed topic is desirable.
+     A comment and an **approved** label will be added to your issue
+     by a maintainer. The review process for new tutorial PRs submitted
+     without the corresponding issue may take longer.
+
+1. **Adhere to writing and styling guidelines:**
+   * Once approved, follow the guidelines outlined in [CONTRIBUTING.md](https://github.com/pytorch/tutorials/blob/main/CONTRIBUTING.md)
+     and use the provided [template](https://github.com/pytorch/tutorials/blob/main/beginner_source/template_tutorial.py) for creating your tutorial.
+   * Link the issue in which you received approval for your tutorial
+     in the PR.
+   * We accept tutorials in both ``.rst`` (ReStructuredText) and ``.py``
+     (Python) formats. However, unless your tutorial involves using
+     multiple GPUs, parallel/distributed training, or requires extended
+     execution time (25 minutes or more), we prefer submissions
+     in Python file format.
+
+## Maintaining Tutorials
+
+When you submit a new tutorial, we encourage you to keep it in sync
+with the latest PyTorch updates and features. Additionally, we may
+contact you to review any PRs, issues, and other related matters to
+ensure the tutorial remains a valuable resource.
+
+Please note the following:
+
+* If a tutorial breaks against the main branch, it will
+  be excluded from the build and an issue will be filed against it,
+  with the author/maintainer notified. If the issue is not resolved
+  within 90 days, the tutorial might be deleted from the repository.
+
+* We recommend that each tutorial is reviewed at least once a year to
+  ensure its relevance.
+
+## Deleting Stale Tutorials
+
+A tutorial might be considered stale when it no longer aligns with
+the latest PyTorch updates, features, or best practices:
+
+* The tutorial is no longer functional due to changes in PyTorch or
+  its dependencies
+* The tutorial has been superseded by a newer, more comprehensive, or
+  more accurate tutorial
+* The tutorial does not run successfully in the CI, indicating
+  potential compatibility or dependency issues.
+
+If a tutorial is deemed stale, we will attempt to contact the code owner,
+or someone from the tutorial maintainers might attempt to update it.
+However, if despite those attempts we fail to fix it, the tutorial
+might be removed from the repository.
From 9d97d8f0fe637bcdf53e450caf4852e38e10adad Mon Sep 17 00:00:00 2001 From: Tim Statler Date: Mon, 26 Aug 2024 19:02:14 -0700 Subject: [PATCH 19/31] Removed upper-case letter/made 'download' the link text instead of 'here'/identified zip file (#3015) --- beginner_source/chatbot_tutorial.py | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/beginner_source/chatbot_tutorial.py b/beginner_source/chatbot_tutorial.py index 44310cc3620..f902f8cd717 100644 --- a/beginner_source/chatbot_tutorial.py +++ b/beginner_source/chatbot_tutorial.py @@ -84,8 +84,7 @@ # Preparations # ------------ # -# To start, Download the data ZIP file -# `here `__ +# To get started, `download `__ the Movie-Dialogs Corpus zip file. # and put in a ``data/`` directory under the current directory. # From acdc91bef8e31ee6f73af6cf5ecb54ff338e2a81 Mon Sep 17 00:00:00 2001 From: Svetlana Karslioglu Date: Tue, 27 Aug 2024 12:57:19 -0700 Subject: [PATCH 20/31] Add weights_only=True to torch.load (#3012) * Add weights_only=True to torch.load --- advanced_source/dynamic_quantization_tutorial.py | 3 ++- advanced_source/static_quantization_tutorial.rst | 2 +- beginner_source/basics/quickstart_tutorial.py | 2 +- beginner_source/basics/saveloadrun_tutorial.py | 16 +++++++++++++--- beginner_source/blitz/cifar10_tutorial.py | 2 +- beginner_source/fgsm_tutorial.py | 2 +- beginner_source/saving_loading_models.py | 16 ++++++++-------- beginner_source/transfer_learning_tutorial.py | 2 +- .../autograd_saved_tensors_hooks_tutorial.py | 6 +++--- intermediate_source/ddp_tutorial.rst | 2 +- intermediate_source/tiatoolbox_tutorial.rst | 2 +- prototype_source/fx_graph_mode_ptq_dynamic.py | 3 ++- prototype_source/fx_graph_mode_ptq_static.rst | 6 +++--- prototype_source/pt2e_quant_ptq.rst | 2 +- prototype_source/pt2e_quant_qat.rst | 2 +- .../intel_neural_compressor_for_pytorch.rst | 2 +- .../recipes/module_load_state_dict_tips.py | 8 ++++---- .../recipes/save_load_across_devices.py | 2 +- .../saving_and_loading_a_general_checkpoint.py | 2 +- .../saving_and_loading_models_for_inference.py | 2 +- .../saving_multiple_models_in_one_file.py | 2 +- ...el_using_parameters_from_a_different_model.py | 2 +- 22 files changed, 50 insertions(+), 38 deletions(-) diff --git a/advanced_source/dynamic_quantization_tutorial.py b/advanced_source/dynamic_quantization_tutorial.py index 9cc07a1d956..c8d94789d5d 100644 --- a/advanced_source/dynamic_quantization_tutorial.py +++ b/advanced_source/dynamic_quantization_tutorial.py @@ -151,7 +151,8 @@ def tokenize(self, path): model.load_state_dict( torch.load( model_data_filepath + 'word_language_model_quantize.pth', - map_location=torch.device('cpu') + map_location=torch.device('cpu'), + weights_only=True ) ) diff --git a/advanced_source/static_quantization_tutorial.rst b/advanced_source/static_quantization_tutorial.rst index 3b818aa03aa..efb171c0dfe 100644 --- a/advanced_source/static_quantization_tutorial.rst +++ b/advanced_source/static_quantization_tutorial.rst @@ -286,7 +286,7 @@ We next define several helper functions to help with model evaluation. 
These mos
 
     def load_model(model_file):
         model = MobileNetV2()
-        state_dict = torch.load(model_file)
+        state_dict = torch.load(model_file, weights_only=True)
         model.load_state_dict(state_dict)
         model.to('cpu')
         return model
diff --git a/beginner_source/basics/quickstart_tutorial.py b/beginner_source/basics/quickstart_tutorial.py
index 07a1be517d1..df7628081ba 100644
--- a/beginner_source/basics/quickstart_tutorial.py
+++ b/beginner_source/basics/quickstart_tutorial.py
@@ -216,7 +216,7 @@ def test(dataloader, model, loss_fn):
 # the state dictionary into it.
 
 model = NeuralNetwork().to(device)
-model.load_state_dict(torch.load("model.pth"))
+model.load_state_dict(torch.load("model.pth", weights_only=True))
 
 #############################################################
 # This model can now be used to make predictions.
diff --git a/beginner_source/basics/saveloadrun_tutorial.py b/beginner_source/basics/saveloadrun_tutorial.py
index 16a9f037417..5b3aef124b0 100644
--- a/beginner_source/basics/saveloadrun_tutorial.py
+++ b/beginner_source/basics/saveloadrun_tutorial.py
@@ -32,9 +32,14 @@
 ##########################
 # To load model weights, you need to create an instance of the same model first, and then load the parameters
 # using ``load_state_dict()`` method.
+#
+# In the code below, we set ``weights_only=True`` to limit the
+# functions executed during unpickling to only those necessary for
+# loading weights. Using ``weights_only=True`` is considered
+# a best practice when loading weights.
 
 model = models.vgg16() # we do not specify ``weights``, i.e. create untrained model
-model.load_state_dict(torch.load('model_weights.pth'))
+model.load_state_dict(torch.load('model_weights.pth', weights_only=True))
 model.eval()
 
 ###########################
@@ -50,9 +55,14 @@
 torch.save(model, 'model.pth')
 
 ########################
-# We can then load the model like this:
+# We can then load the model as demonstrated below.
+#
+# As described in `Saving and loading torch.nn.Modules `__,
+# saving ``state_dict``s is considered the best practice. However,
+# below we use ``weights_only=False`` because this involves loading the
+# model, which is a legacy use case for ``torch.save``.
 
-model = torch.load('model.pth')
+model = torch.load('model.pth', weights_only=False)
 
 ########################
 # .. note:: This approach uses Python `pickle `_ module when serializing the model, thus it relies on the actual class definition to be available when loading the model.
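
As a minimal, self-contained sketch of the loading pattern this patch applies across the tutorials (the file name and the one-layer model here are illustrative, not taken from any tutorial above):

.. code:: python

    import torch
    import torch.nn as nn

    # Any nn.Module works here; a single linear layer keeps the sketch small.
    model = nn.Linear(4, 2)
    torch.save(model.state_dict(), "model_weights.pth")

    # weights_only=True restricts unpickling to tensors and other
    # allow-listed types, so loading an untrusted checkpoint cannot
    # execute arbitrary code during torch.load.
    state_dict = torch.load("model_weights.pth", weights_only=True)
    model.load_state_dict(state_dict)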
diff --git a/beginner_source/blitz/cifar10_tutorial.py b/beginner_source/blitz/cifar10_tutorial.py
index 8e3f3252921..f38abdd5666 100644
--- a/beginner_source/blitz/cifar10_tutorial.py
+++ b/beginner_source/blitz/cifar10_tutorial.py
@@ -221,7 +221,7 @@ def forward(self, x):
 # wasn't necessary here, we only did it to illustrate how to do so):
 
 net = Net()
-net.load_state_dict(torch.load(PATH))
+net.load_state_dict(torch.load(PATH, weights_only=True))
 
 ########################################################################
 # Okay, now let us see what the neural network thinks these examples above are:
diff --git a/beginner_source/fgsm_tutorial.py b/beginner_source/fgsm_tutorial.py
index 007ad3fd956..9bdf52d84b4 100644
--- a/beginner_source/fgsm_tutorial.py
+++ b/beginner_source/fgsm_tutorial.py
@@ -192,7 +192,7 @@ def forward(self, x):
 model = Net().to(device)
 
 # Load the pretrained model
-model.load_state_dict(torch.load(pretrained_model, map_location=device))
+model.load_state_dict(torch.load(pretrained_model, map_location=device, weights_only=True))
 
 # Set the model in evaluation mode. In this case this is for the Dropout layers
 model.eval()
diff --git a/beginner_source/saving_loading_models.py b/beginner_source/saving_loading_models.py
index fcd33be2537..6c9b6b1fd77 100644
--- a/beginner_source/saving_loading_models.py
+++ b/beginner_source/saving_loading_models.py
@@ -153,7 +153,7 @@
 # .. code:: python
 #
 #    model = TheModelClass(*args, **kwargs)
-#    model.load_state_dict(torch.load(PATH))
+#    model.load_state_dict(torch.load(PATH, weights_only=True))
 #    model.eval()
 #
 # .. note::
@@ -206,7 +206,7 @@
 # .. code:: python
 #
 #    # Model class must be defined somewhere
-#    model = torch.load(PATH)
+#    model = torch.load(PATH, weights_only=False)
 #    model.eval()
 #
 # This save/load process uses the most intuitive syntax and involves the
@@ -290,7 +290,7 @@
 #    model = TheModelClass(*args, **kwargs)
 #    optimizer = TheOptimizerClass(*args, **kwargs)
 #
-#    checkpoint = torch.load(PATH)
+#    checkpoint = torch.load(PATH, weights_only=True)
 #    model.load_state_dict(checkpoint['model_state_dict'])
 #    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
 #    epoch = checkpoint['epoch']
@@ -354,7 +354,7 @@
 #    optimizerA = TheOptimizerAClass(*args, **kwargs)
 #    optimizerB = TheOptimizerBClass(*args, **kwargs)
 #
-#    checkpoint = torch.load(PATH)
+#    checkpoint = torch.load(PATH, weights_only=True)
 #    modelA.load_state_dict(checkpoint['modelA_state_dict'])
 #    modelB.load_state_dict(checkpoint['modelB_state_dict'])
 #    optimizerA.load_state_dict(checkpoint['optimizerA_state_dict'])
@@ -407,7 +407,7 @@
 # .. code:: python
 #
 #    modelB = TheModelBClass(*args, **kwargs)
-#    modelB.load_state_dict(torch.load(PATH), strict=False)
+#    modelB.load_state_dict(torch.load(PATH, weights_only=True), strict=False)
 #
 # Partially loading a model or loading a partial model are common
 # scenarios when transfer learning or training a new complex model.
@@ -446,7 +446,7 @@ # # device = torch.device('cpu') # model = TheModelClass(*args, **kwargs) -# model.load_state_dict(torch.load(PATH, map_location=device)) +# model.load_state_dict(torch.load(PATH, map_location=device, weights_only=True)) # # When loading a model on a CPU that was trained with a GPU, pass # ``torch.device('cpu')`` to the ``map_location`` argument in the @@ -469,7 +469,7 @@ # # device = torch.device("cuda") # model = TheModelClass(*args, **kwargs) -# model.load_state_dict(torch.load(PATH)) +# model.load_state_dict(torch.load(PATH, weights_only=True)) # model.to(device) # # Make sure to call input = input.to(device) on any input tensors that you feed to the model # @@ -497,7 +497,7 @@ # # device = torch.device("cuda") # model = TheModelClass(*args, **kwargs) -# model.load_state_dict(torch.load(PATH, map_location="cuda:0")) # Choose whatever GPU device number you want +# model.load_state_dict(torch.load(PATH, weights_only=True, map_location="cuda:0")) # Choose whatever GPU device number you want # model.to(device) # # Make sure to call input = input.to(device) on any input tensors that you feed to the model # diff --git a/beginner_source/transfer_learning_tutorial.py b/beginner_source/transfer_learning_tutorial.py index 7a2b053763a..de7a178bd7d 100644 --- a/beginner_source/transfer_learning_tutorial.py +++ b/beginner_source/transfer_learning_tutorial.py @@ -209,7 +209,7 @@ def train_model(model, criterion, optimizer, scheduler, num_epochs=25): print(f'Best val Acc: {best_acc:4f}') # load best model weights - model.load_state_dict(torch.load(best_model_params_path)) + model.load_state_dict(torch.load(best_model_params_path, weights_only=True)) return model diff --git a/intermediate_source/autograd_saved_tensors_hooks_tutorial.py b/intermediate_source/autograd_saved_tensors_hooks_tutorial.py index f16b170ee6a..ed581426c2e 100644 --- a/intermediate_source/autograd_saved_tensors_hooks_tutorial.py +++ b/intermediate_source/autograd_saved_tensors_hooks_tutorial.py @@ -397,7 +397,7 @@ def pack_hook(tensor): return name def unpack_hook(name): - return torch.load(name) + return torch.load(name, weights_only=True) ###################################################################### @@ -420,7 +420,7 @@ def pack_hook(tensor): return name def unpack_hook(name): - tensor = torch.load(name) + tensor = torch.load(name, weights_only=True) os.remove(name) return tensor @@ -462,7 +462,7 @@ def pack_hook(tensor): return temp_file def unpack_hook(temp_file): - return torch.load(temp_file.name) + return torch.load(temp_file.name, weights_only=True) ###################################################################### diff --git a/intermediate_source/ddp_tutorial.rst b/intermediate_source/ddp_tutorial.rst index 13297fb2a12..cff5105fa54 100644 --- a/intermediate_source/ddp_tutorial.rst +++ b/intermediate_source/ddp_tutorial.rst @@ -214,7 +214,7 @@ and elasticity support, please refer to `TorchElastic `_. diff --git a/recipes_source/recipes/module_load_state_dict_tips.py b/recipes_source/recipes/module_load_state_dict_tips.py index 17c812b016f..70e9830cb3c 100644 --- a/recipes_source/recipes/module_load_state_dict_tips.py +++ b/recipes_source/recipes/module_load_state_dict_tips.py @@ -39,7 +39,7 @@ def forward(self, x): # to ``torch.load``, the ``torch.device()`` context manager and the ``assign`` # keyword argument to ``nn.Module.load_state_dict()``. 
-state_dict = torch.load('checkpoint.pth', mmap=True) +state_dict = torch.load('checkpoint.pth', mmap=True, weights_only=True) with torch.device('meta'): meta_m = SomeModule(1000) meta_m.load_state_dict(state_dict, assign=True) @@ -47,7 +47,7 @@ def forward(self, x): ############################################################################# # Compare the snippet below to the one above: -state_dict = torch.load('checkpoint.pth') +state_dict = torch.load('checkpoint.pth', weights_only=True) m = SomeModule(1000) m.load_state_dict(state_dict) @@ -71,7 +71,7 @@ def forward(self, x): # * Waiting for the entire checkpoint to be loaded into RAM before performing, for example, some per-tensor processing. start_time = time.time() -state_dict = torch.load('checkpoint.pth') +state_dict = torch.load('checkpoint.pth', weights_only=True) end_time = time.time() print(f"loading time without mmap={end_time - start_time}") @@ -84,7 +84,7 @@ def forward(self, x): # storages will be memory-mapped. start_time = time.time() -state_dict = torch.load('checkpoint.pth', mmap=True) +state_dict = torch.load('checkpoint.pth', mmap=True, weights_only=True) end_time = time.time() print(f"loading time with mmap={end_time - start_time}") diff --git a/recipes_source/recipes/save_load_across_devices.py b/recipes_source/recipes/save_load_across_devices.py index be950e15b13..c59af8821e9 100644 --- a/recipes_source/recipes/save_load_across_devices.py +++ b/recipes_source/recipes/save_load_across_devices.py @@ -97,7 +97,7 @@ def forward(self, x): # Load device = torch.device('cpu') model = Net() -model.load_state_dict(torch.load(PATH, map_location=device)) +model.load_state_dict(torch.load(PATH, map_location=device, weights_only=True)) ###################################################################### diff --git a/recipes_source/recipes/saving_and_loading_a_general_checkpoint.py b/recipes_source/recipes/saving_and_loading_a_general_checkpoint.py index 31b14f3a28a..8c773a14909 100644 --- a/recipes_source/recipes/saving_and_loading_a_general_checkpoint.py +++ b/recipes_source/recipes/saving_and_loading_a_general_checkpoint.py @@ -131,7 +131,7 @@ def forward(self, x): model = Net() optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9) -checkpoint = torch.load(PATH) +checkpoint = torch.load(PATH, weights_only=True) model.load_state_dict(checkpoint['model_state_dict']) optimizer.load_state_dict(checkpoint['optimizer_state_dict']) epoch = checkpoint['epoch'] diff --git a/recipes_source/recipes/saving_and_loading_models_for_inference.py b/recipes_source/recipes/saving_and_loading_models_for_inference.py index cd24b77c1de..7adce2a90b5 100644 --- a/recipes_source/recipes/saving_and_loading_models_for_inference.py +++ b/recipes_source/recipes/saving_and_loading_models_for_inference.py @@ -117,7 +117,7 @@ def forward(self, x): # Load model = Net() -model.load_state_dict(torch.load(PATH)) +model.load_state_dict(torch.load(PATH, weights_only=True)) model.eval() diff --git a/recipes_source/recipes/saving_multiple_models_in_one_file.py b/recipes_source/recipes/saving_multiple_models_in_one_file.py index f468d7ac6a1..e938be03b45 100644 --- a/recipes_source/recipes/saving_multiple_models_in_one_file.py +++ b/recipes_source/recipes/saving_multiple_models_in_one_file.py @@ -128,7 +128,7 @@ def forward(self, x): optimModelA = optim.SGD(modelA.parameters(), lr=0.001, momentum=0.9) optimModelB = optim.SGD(modelB.parameters(), lr=0.001, momentum=0.9) -checkpoint = torch.load(PATH) +checkpoint = torch.load(PATH, weights_only=True) 
modelA.load_state_dict(checkpoint['modelA_state_dict']) modelB.load_state_dict(checkpoint['modelB_state_dict']) optimizerA.load_state_dict(checkpoint['optimizerA_state_dict']) diff --git a/recipes_source/recipes/warmstarting_model_using_parameters_from_a_different_model.py b/recipes_source/recipes/warmstarting_model_using_parameters_from_a_different_model.py index 40aeeea9db8..a0752bfc67d 100644 --- a/recipes_source/recipes/warmstarting_model_using_parameters_from_a_different_model.py +++ b/recipes_source/recipes/warmstarting_model_using_parameters_from_a_different_model.py @@ -124,7 +124,7 @@ def forward(self, x): # are loading into. # -netB.load_state_dict(torch.load(PATH), strict=False) +netB.load_state_dict(torch.load(PATH, weights_only=True), strict=False) ###################################################################### From 3dacf894518aa9885ef5e55ab0edcc878913495a Mon Sep 17 00:00:00 2001 From: dev_thomas <36235705+hadh93@users.noreply.github.com> Date: Thu, 29 Aug 2024 01:41:22 +0900 Subject: [PATCH 21/31] Fix typos in dynamic_quantization_bert_tutorial.rst (#3019) --- intermediate_source/dynamic_quantization_bert_tutorial.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/intermediate_source/dynamic_quantization_bert_tutorial.rst b/intermediate_source/dynamic_quantization_bert_tutorial.rst index 1ea6ea46dd0..e515f53a1df 100644 --- a/intermediate_source/dynamic_quantization_bert_tutorial.rst +++ b/intermediate_source/dynamic_quantization_bert_tutorial.rst @@ -79,7 +79,7 @@ Mac: .. code:: shell - yes y | pip uninstall torch tochvision + yes y | pip uninstall torch torchvision yes y | pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html @@ -206,7 +206,7 @@ in `examples `_. +We provide the fine-tuned BERT model for MRPC task `here `_. To save time, you can download the model file (~400 MB) directly into your local folder ``$OUT_DIR``. 2.1 Set global configurations @@ -273,7 +273,7 @@ We load the tokenizer and fine-tuned BERT sequence classifier model 2.3 Define the tokenize and evaluation function ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -We reuse the tokenize and evaluation function from `Huggingface `_. +We reuse the tokenize and evaluation function from `HuggingFace `_. .. code:: python From fc5a61252b2424528801e8fd79054969fe21fc8a Mon Sep 17 00:00:00 2001 From: Richard Zou Date: Thu, 29 Aug 2024 10:05:07 -0400 Subject: [PATCH 22/31] Improve custom ops tutorials (#3020) Co-authored-by: Svetlana Karslioglu --- advanced_source/cpp_custom_ops.rst | 2 ++ advanced_source/python_custom_ops.py | 15 ++++++++++----- 2 files changed, 12 insertions(+), 5 deletions(-) diff --git a/advanced_source/cpp_custom_ops.rst b/advanced_source/cpp_custom_ops.rst index 435ff088bc0..ffabd6eff77 100644 --- a/advanced_source/cpp_custom_ops.rst +++ b/advanced_source/cpp_custom_ops.rst @@ -174,6 +174,8 @@ To add ``torch.compile`` support for an operator, we must add a FakeTensor kerne known as a "meta kernel" or "abstract impl"). FakeTensors are Tensors that have metadata (such as shape, dtype, device) but no data: the FakeTensor kernel for an operator specifies how to compute the metadata of output tensors given the metadata of input tensors. +The FakeTensor kernel should return dummy Tensors of your choice with +the correct Tensor metadata (shape/strides/``dtype``/device). 
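+
+For illustration only (this sketch is not part of the tutorial's running
+example, and the operator name ``mylib::double_it`` is made up), a minimal
+Python custom operator with a FakeTensor kernel can look like this:
+
+.. code-block:: python
+
+   import torch
+
+   @torch.library.custom_op("mylib::double_it", mutates_args=())
+   def double_it(x: torch.Tensor) -> torch.Tensor:
+       return x * 2
+
+   @torch.library.register_fake("mylib::double_it")
+   def _(x):
+       # Return a tensor that carries only metadata (shape/dtype/device);
+       # no real data is computed here.
+       return torch.empty_like(x)
+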
We recommend that this be done from Python via the `torch.library.register_fake` API, though it is possible to do this from C++ as well (see diff --git a/advanced_source/python_custom_ops.py b/advanced_source/python_custom_ops.py index 1e429b76b35..0b3bf6e4748 100644 --- a/advanced_source/python_custom_ops.py +++ b/advanced_source/python_custom_ops.py @@ -66,7 +66,7 @@ def display(img): ###################################################################### # ``crop`` is not handled effectively out-of-the-box by # ``torch.compile``: ``torch.compile`` induces a -# `"graph break" `_ +# `"graph break" `_ # on functions it is unable to handle and graph breaks are bad for performance. # The following code demonstrates this by raising an error # (``torch.compile`` with ``fullgraph=True`` raises an error if a @@ -85,9 +85,9 @@ def f(img): # # 1. wrap the function into a PyTorch custom operator. # 2. add a "``FakeTensor`` kernel" (aka "meta kernel") to the operator. -# Given the metadata (e.g. shapes) -# of the input Tensors, this function says how to compute the metadata -# of the output Tensor(s). +# Given some ``FakeTensors`` inputs (dummy Tensors that don't have storage), +# this function should return dummy Tensors of your choice with the correct +# Tensor metadata (shape/strides/``dtype``/device). from typing import Sequence @@ -130,6 +130,11 @@ def f(img): # ``autograd.Function`` with PyTorch operator registration APIs can lead to (and # has led to) silent incorrectness when composed with ``torch.compile``. # +# If you don't need training support, there is no need to use +# ``torch.library.register_autograd``. +# If you end up training with a ``custom_op`` that doesn't have an autograd +# registration, we'll raise an error message. +# # The gradient formula for ``crop`` is essentially ``PIL.paste`` (we'll leave the # derivation as an exercise to the reader). Let's first wrap ``paste`` into a # custom operator: @@ -203,7 +208,7 @@ def setup_context(ctx, inputs, output): ###################################################################### # Mutable Python Custom operators # ------------------------------- -# You can also wrap a Python function that mutates its inputs into a custom +# You can also wrap a Python function that mutates its inputs into a custom # operator. # Functions that mutate inputs are common because that is how many low-level # kernels are written; for example, a kernel that computes ``sin`` may take in From c1e792a099392628e898194acf6268f1e94d5001 Mon Sep 17 00:00:00 2001 From: Tim Statler Date: Thu, 29 Aug 2024 08:13:14 -0700 Subject: [PATCH 23/31] Removed outdated steps in README about running about setup.py (#3014) Co-authored-by: Svetlana Karslioglu --- README.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/README.md b/README.md index fe4b4b6edd6..af84d9ebe79 100644 --- a/README.md +++ b/README.md @@ -33,7 +33,7 @@ If you are starting off with a Jupyter notebook, you can use [this script](https ## Building locally -The tutorial build is very large and requires a GPU. If your machine does not have a GPU device, you can preview your HTML build without actually downloading the data and running the tutorial code: +The tutorial build is very large and requires a GPU. If your machine does not have a GPU device, you can preview your HTML build without actually downloading the data and running the tutorial code: 1. Install required dependencies by running: `pip install -r requirements.txt`. 
@@ -42,8 +42,6 @@ The tutorial build is very large and requires a GPU. If your machine does not ha - If you have a GPU-powered laptop, you can build using `make docs`. This will download the data, execute the tutorials and build the documentation to `docs/` directory. This might take about 60-120 min for systems with GPUs. If you do not have a GPU installed on your system, then see next step. - You can skip the computationally intensive graph generation by running `make html-noplot` to build basic html documentation to `_build/html`. This way, you can quickly preview your tutorial. -> If you get **ModuleNotFoundError: No module named 'pytorch_sphinx_theme' make: *** [html-noplot] Error 2** from /tutorials/src/pytorch-sphinx-theme or /venv/src/pytorch-sphinx-theme (while using virtualenv), run `python setup.py install`. - ## Building a single tutorial You can build a single tutorial by using the `GALLERY_PATTERN` environment variable. For example to run only `neural_style_transfer_tutorial.py`, run: @@ -61,8 +59,8 @@ The `GALLERY_PATTERN` variable respects regular expressions. ## About contributing to PyTorch Documentation and Tutorials -* You can find information about contributing to PyTorch documentation in the -PyTorch Repo [README.md](https://github.com/pytorch/pytorch/blob/master/README.md) file. +* You can find information about contributing to PyTorch documentation in the +PyTorch Repo [README.md](https://github.com/pytorch/pytorch/blob/master/README.md) file. * Additional information can be found in [PyTorch CONTRIBUTING.md](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md). From 202693f1c875a991edcaf7c0499fc26e326cd236 Mon Sep 17 00:00:00 2001 From: Svetlana Karslioglu Date: Thu, 29 Aug 2024 09:56:59 -0700 Subject: [PATCH 24/31] Fix hovering over the GCS search button (#3005) * Fix hovering over the GCS search button --- _static/css/custom.css | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/_static/css/custom.css b/_static/css/custom.css index 09aba28f258..cc195d99061 100755 --- a/_static/css/custom.css +++ b/_static/css/custom.css @@ -100,3 +100,15 @@ padding-left: 0px !important; padding-bottom: 0px !important; } + +.gsc-search-button .gsc-search-button-v2:focus { + border: transparent !important; + outline: none; + box-shadow: none; +} +.gsc-search-button-v2:active { + border: none !important; +} +.gsc-search-button-v2 { + border: none !important; +} From 5465f9b33fd38fe600648592b9c88c31d25ab60f Mon Sep 17 00:00:00 2001 From: Tim Statler Date: Fri, 30 Aug 2024 14:36:59 -0700 Subject: [PATCH 25/31] Added warnings to select Pytorch mobile tutorials directing users to ExecuTorch (#3016) * Added warnings to select ExecuTorch tutorials/recipes/prototypes * Added warnings to select ExecuTorch tutorials/recipes/prototypes * Added redirect for renamed prototype * Update deeplabv3_on_android.rst Fixed misplaced info tag. 
* Apply suggestions from code review --------- Co-authored-by: Svetlana Karslioglu --- beginner_source/deeplabv3_on_android.rst | 4 ++++ prototype_source/lite_interpreter.rst | 9 +++++++++ recipes_source/mobile_interpreter.rst | 3 +++ recipes_source/mobile_perf.rst | 5 ++++- recipes_source/ptmobile_recipes_summary.rst | 3 +++ 5 files changed, 23 insertions(+), 1 deletion(-) create mode 100644 prototype_source/lite_interpreter.rst diff --git a/beginner_source/deeplabv3_on_android.rst b/beginner_source/deeplabv3_on_android.rst index f2fe0e48f15..5ca7f01ad06 100644 --- a/beginner_source/deeplabv3_on_android.rst +++ b/beginner_source/deeplabv3_on_android.rst @@ -5,6 +5,10 @@ Image Segmentation DeepLabV3 on Android **Reviewed by**: `Jeremiah Chung `_ +.. warning:: + PyTorch Mobile is no longer actively supported. Please check out `ExecuTorch `_, PyTorch’s all-new on-device inference library. You can also review our `end-to-end workflows `_ and review the `source code for DeepLabV3 `_. + + Introduction ------------ diff --git a/prototype_source/lite_interpreter.rst b/prototype_source/lite_interpreter.rst new file mode 100644 index 00000000000..73e950d72e2 --- /dev/null +++ b/prototype_source/lite_interpreter.rst @@ -0,0 +1,9 @@ +(Prototype) Introduce lite interpreter workflow in Android and iOS +======================= + +This tutorial has been moved to https://pytorch.org/tutorials/recipes/mobile_interpreter.html + + +.. raw:: html + + diff --git a/recipes_source/mobile_interpreter.rst b/recipes_source/mobile_interpreter.rst index dda1dd92435..44036e74ffd 100644 --- a/recipes_source/mobile_interpreter.rst +++ b/recipes_source/mobile_interpreter.rst @@ -3,6 +3,9 @@ **Author**: `Chen Lai `_, `Martin Yuan `_ +.. warning:: + PyTorch Mobile is no longer actively supported. Please check out `ExecuTorch `_, PyTorch’s all-new on-device inference library. You can also review our new documentation to learn more about how to build `iOS `_ and `Android `_ apps with ExecuTorch. + Introduction ------------ diff --git a/recipes_source/mobile_perf.rst b/recipes_source/mobile_perf.rst index aae1447cbf8..14f183ab69e 100644 --- a/recipes_source/mobile_perf.rst +++ b/recipes_source/mobile_perf.rst @@ -1,6 +1,9 @@ Pytorch Mobile Performance Recipes ================================== +.. warning:: + PyTorch Mobile is no longer actively supported. Please check out `ExecuTorch `_, PyTorch’s all-new on-device inference library. You can also learn more about `quantization `_, `Hardware acceleration (op fusion using hw) `_, and `benchmarking `_ on ExecuTorch’s documentation pages. + Introduction ---------------- Performance (aka latency) is crucial to most, if not all, @@ -245,7 +248,7 @@ For example, using ResNet-50 and running the following script: -you would get the following result: +you would get the following result: :: diff --git a/recipes_source/ptmobile_recipes_summary.rst b/recipes_source/ptmobile_recipes_summary.rst index cddee940f2a..6cc8f6f7514 100644 --- a/recipes_source/ptmobile_recipes_summary.rst +++ b/recipes_source/ptmobile_recipes_summary.rst @@ -1,6 +1,9 @@ Summary of PyTorch Mobile Recipes ===================================== +.. warning:: + Note: PyTorch Mobile is no longer actively supported. Please check out `ExecuTorch `_, PyTorch’s all-new on-device inference library. You can also review these `ExecuTorch examples `_. + This summary provides a top level overview of recipes for PyTorch Mobile to help developers choose which recipes to follow for their PyTorch-powered mobile app development. 
 Introduction

From f45ddc242ba12e3e3590b23ea07bfd44f6e9acca Mon Sep 17 00:00:00 2001
From: ibartol <50273483+ignaciobartol@users.noreply.github.com>
Date: Fri, 30 Aug 2024 16:40:02 -0500
Subject: [PATCH 26/31] Patched docs for torch_compile_tutorial (#2936)

* Patched docs for torch_compile_tutorial

---------

Co-authored-by: Svetlana Karslioglu
---
 intermediate_source/torch_compile_tutorial.py | 102 +++++++++++++++++-
 1 file changed, 100 insertions(+), 2 deletions(-)

diff --git a/intermediate_source/torch_compile_tutorial.py b/intermediate_source/torch_compile_tutorial.py
index 5e7112f5b93..67b055d9ff2 100644
--- a/intermediate_source/torch_compile_tutorial.py
+++ b/intermediate_source/torch_compile_tutorial.py
@@ -73,17 +73,21 @@ def foo(x, y):

 ######################################################################
 # Alternatively, we can decorate the function.
+t1 = torch.randn(10, 10)
+t2 = torch.randn(10, 10)

 @torch.compile
 def opt_foo2(x, y):
     a = torch.sin(x)
     b = torch.cos(y)
     return a + b
-print(opt_foo2(torch.randn(10, 10), torch.randn(10, 10)))
+print(opt_foo2(t1, t2))

 ######################################################################
 # We can also optimize ``torch.nn.Module`` instances.

+t = torch.randn(10, 100)
+
 class MyModule(torch.nn.Module):
     def __init__(self):
         super().__init__()
@@ -94,7 +98,101 @@ def forward(self, x):

 mod = MyModule()
 opt_mod = torch.compile(mod)
-print(opt_mod(torch.randn(10, 100)))
+print(opt_mod(t))
+
+######################################################################
+# ``torch.compile`` and Nested Calls
+# ----------------------------------
+# Nested function calls within the decorated function will also be compiled.
+
+def nested_function(x):
+    return torch.sin(x)
+
+@torch.compile
+def outer_function(x, y):
+    a = nested_function(x)
+    b = torch.cos(y)
+    return a + b
+
+print(outer_function(t1, t2))
+
+######################################################################
+# In the same fashion, when compiling a module, all sub-modules and methods
+# within it that are not in a skip list are also compiled.
+
+class OuterModule(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.inner_module = MyModule()
+        self.outer_lin = torch.nn.Linear(10, 2)
+
+    def forward(self, x):
+        x = self.inner_module(x)
+        return torch.nn.functional.relu(self.outer_lin(x))
+
+outer_mod = OuterModule()
+opt_outer_mod = torch.compile(outer_mod)
+print(opt_outer_mod(t))
+
+######################################################################
+# We can also disable some functions from being compiled by using
+# ``torch.compiler.disable``. Suppose you want to disable the tracing on just
+# the ``complex_function`` function but continue tracing in
+# ``complex_conjugate``. In that case, you can use the
+# ``torch.compiler.disable(recursive=False)`` option. Otherwise, the default is
+# ``recursive=True``.
+
+def complex_conjugate(z):
+    return torch.conj(z)
+
+@torch.compiler.disable(recursive=False)
+def complex_function(real, imag):
+    # Assume this function causes problems in the compilation
+    z = torch.complex(real, imag)
+    return complex_conjugate(z)
+
+def outer_function():
+    real = torch.tensor([2, 3], dtype=torch.float32)
+    imag = torch.tensor([4, 5], dtype=torch.float32)
+    z = complex_function(real, imag)
+    return torch.abs(z)
+
+# Try to compile the outer_function
+try:
+    opt_outer_function = torch.compile(outer_function)
+    print(opt_outer_function())
+except Exception as e:
+    print("Compilation of outer_function failed:", e)
+
+######################################################################
+# Best Practices and Recommendations
+# ----------------------------------
+#
+# Behavior of ``torch.compile`` with Nested Modules and Function Calls
+#
+# When you use ``torch.compile``, the compiler will try to recursively compile
+# every function call inside the target function or module that is not in a
+# skip list (such as built-ins and some functions in the ``torch.*`` namespace).
+#
+# **Best Practices:**
+#
+# 1. **Top-Level Compilation:** One approach is to compile at the highest level
+# possible (that is, when the top-level module is initialized or called) and
+# selectively disable compilation when encountering excessive graph breaks or
+# errors. If there are still many compile issues, compile individual
+# subcomponents instead.
+#
+# 2. **Modular Testing:** Test individual functions and modules with ``torch.compile``
+# before integrating them into larger models to isolate potential issues.
+#
+# 3. **Disable Compilation Selectively:** If certain functions or sub-modules
+# cannot be handled by ``torch.compile``, use the ``torch.compiler.disable`` context
+# manager to recursively exclude them from compilation.
+#
+# 4. **Compile Leaf Functions First:** In complex models with multiple nested
+# functions and modules, start by compiling the leaf functions or modules first.
+# For more information, see `TorchDynamo APIs for fine-grained tracing `__.
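As a minimal, self-contained sketch of best practices 1 and 3 above (not part of the patch; the module and helper names are hypothetical, and ``torch.compiler.disable`` is used with its default ``recursive=True``):

    import torch

    @torch.compiler.disable  # best practice 3: exclude code torch.compile cannot handle
    def unsupported_helper(x):
        # Stand-in for code that breaks tracing, such as a NumPy round-trip.
        return float(x.detach().cpu().numpy().sum())

    class Net(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.lin = torch.nn.Linear(10, 10)

        def forward(self, x):
            x = torch.nn.functional.relu(self.lin(x))
            scale = unsupported_helper(x)  # graph break here; the rest stays compiled
            return x * scale

    net = torch.compile(Net())  # best practice 1: compile the top-level module
    print(net(torch.randn(4, 10)))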
 ######################################################################
 # Demonstrating Speedups

From c7a99c5ec844ea1833f917064ff50ae22cfaf563 Mon Sep 17 00:00:00 2001
From: Svetlana Karslioglu
Date: Wed, 4 Sep 2024 10:50:02 -0700
Subject: [PATCH 27/31] Upgrade pygame version to 2.6.0 (#3025)

This allows one to install pygame binaries for the Python 3.11/3.12 runtimes, while pygame 2.1.2 was only available up to Python 3.10; see https://pypi.org/project/pygame/2.1.2/#files

Fixes https://github.com/pytorch/tutorials/issues/3023
---
 .ci/docker/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.ci/docker/requirements.txt b/.ci/docker/requirements.txt
index 9668b17fc3a..bd3711bfb0e 100644
--- a/.ci/docker/requirements.txt
+++ b/.ci/docker/requirements.txt
@@ -64,7 +64,7 @@ pyopengl
 gymnasium[mujoco]==0.27.0
 timm
 iopath
-pygame==2.1.2
+pygame==2.6.0
 pycocotools
 semilearn==0.3.2
 torchao==0.0.3

From deb89baa8723268402fc80623677847ce3a472a3 Mon Sep 17 00:00:00 2001
From: Labintsev A I
Date: Thu, 5 Sep 2024 18:24:17 +0300
Subject: [PATCH 28/31] Update intro_onnx.py (#3033)

Add onnxruntime dependency
---
 beginner_source/onnx/intro_onnx.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/beginner_source/onnx/intro_onnx.py b/beginner_source/onnx/intro_onnx.py
index b5cbafc1c64..ec625ec78ff 100644
--- a/beginner_source/onnx/intro_onnx.py
+++ b/beginner_source/onnx/intro_onnx.py
@@ -39,13 +39,14 @@

   - `ONNX `_ standard library
   - `ONNX Script `_ library that enables developers to author ONNX operators,
-    functions and models using a subset of Python in an expressive, and yet simple fashion.
+    functions and models using a subset of Python in an expressive, and yet simple fashion
+  - `ONNX Runtime `_ accelerated machine learning library.

 They can be installed through `pip `_:

 .. code-block:: bash

-    pip install --upgrade onnx onnxscript
+    pip install --upgrade onnx onnxscript onnxruntime

 To validate the installation, run the following commands:

From a5d85eda40683000ae0753ee5bac1ce4797fdb16 Mon Sep 17 00:00:00 2001
From: Tim Statler
Date: Thu, 5 Sep 2024 09:06:14 -0700
Subject: [PATCH 29/31] Fixed RST formatting, minor text changes (#3029)

* Fixed RST formatting, minor text changes
* Removed a duplicate sentence about CUDA hardware that is already mentioned in the intro text. Minor text change.

---------

Co-authored-by: Svetlana Karslioglu
---
 prototype_source/gpu_quantization_torchao_tutorial.py | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/prototype_source/gpu_quantization_torchao_tutorial.py b/prototype_source/gpu_quantization_torchao_tutorial.py
index 513d54faba7..4050a88e56e 100644
--- a/prototype_source/gpu_quantization_torchao_tutorial.py
+++ b/prototype_source/gpu_quantization_torchao_tutorial.py
@@ -35,14 +35,12 @@
 #
 # Segment Anything Model checkpoint setup:
 #
-# 1. Go to the `segment-anything repo `_ and download the ``vit_h`` checkpoint. Alternatively, you can just use ``wget``: `wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth --directory-prefix=
+# 1. Go to the `segment-anything repo checkpoint `_ and download the ``vit_h`` checkpoint. Alternatively, you can use ``wget`` (for example, ``wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth --directory-prefix=``).
 # 2. Pass in that directory by editing the code below to say:
 #
-# .. code-block::
-#
-#    {sam_checkpoint_base_path}=
+# .. code-block:: bash
 #
-# This was run on an A100-PG509-200 power limited to 330.00 W
+#    {sam_checkpoint_base_path}=
 #

 import torch
@@ -297,7 +295,7 @@ def get_sam_model(only_one_block=False, batchsize=1):
 # -----------------
 # In this tutorial, we have learned about the quantization and optimization techniques
 # on the example of the segment anything model.
-
+#
 # In the end, we achieved a full-model apples to apples quantization speedup
 # of about 7.7% on batch size 16 (677.28ms to 729.65ms). We can push this a
 # bit further by increasing the batch size and optimizing other parts of

From 200c4e539ddd357e4b51e91d359bc85540c08078 Mon Sep 17 00:00:00 2001
From: Svetlana Karslioglu
Date: Thu, 5 Sep 2024 12:45:36 -0700
Subject: [PATCH 30/31] Add meta tag to torch_export_aoti_python (#3036)

* Add meta tag to torch_export_aoti_python
* Feature on the landing page

---
 conf.py                                    | 6 ++++++
 index.rst                                  | 1 +
 recipes_source/torch_export_aoti_python.py | 8 ++++++--
 3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/conf.py b/conf.py
index f0f4905844c..e4bca1ac7fa 100644
--- a/conf.py
+++ b/conf.py
@@ -67,6 +67,12 @@
 #
 # needs_sphinx = '1.0'

+html_meta = {
+    'description': 'Master PyTorch with our step-by-step tutorials for all skill levels. Start your journey to becoming a PyTorch expert today!',
+    'keywords': 'PyTorch, tutorials, Getting Started, deep learning, AI',
+    'author': 'PyTorch Contributors'
+}
+
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.

diff --git a/index.rst b/index.rst
index 91517834fd8..95c4a8f3efb 100644
--- a/index.rst
+++ b/index.rst
@@ -3,6 +3,7 @@ Welcome to PyTorch Tutorials

 **What's new in PyTorch tutorials?**

+* `torch.export AOTInductor Tutorial for Python runtime (Beta) `__
 * `A guide on good usage of non_blocking and pin_memory() in PyTorch `__
 * `Introduction to Distributed Pipeline Parallelism `__
 * `Introduction to Libuv TCPStore Backend `__

diff --git a/recipes_source/torch_export_aoti_python.py b/recipes_source/torch_export_aoti_python.py
index 136862078c1..312491b660f 100644
--- a/recipes_source/torch_export_aoti_python.py
+++ b/recipes_source/torch_export_aoti_python.py
@@ -1,7 +1,11 @@
 # -*- coding: utf-8 -*-

 """
-(Beta) ``torch.export`` AOTInductor Tutorial for Python runtime
+.. meta::
+   :description: An end-to-end example of how to use AOTInductor for Python runtime.
+   :keywords: torch.export, AOTInductor, torch._inductor.aot_compile, torch._export.aot_load
+
+``torch.export`` AOTInductor Tutorial for Python runtime (Beta)
 ===============================================================
 **Author:** Ankith Gunapal, Bin Bao, Angela Yi
 """
@@ -18,7 +22,7 @@
 # a shared library that can be run in a non-Python environment.
 #
 #
-# In this tutorial, you will learn an end-to-end example of how to use AOTInductor for python runtime.
+# In this tutorial, you will learn an end-to-end example of how to use AOTInductor for Python runtime.
 # We will look at how to use :func:`torch._inductor.aot_compile` along with :func:`torch.export.export` to generate a
 # shared library. Additionally, we will examine how to execute the shared library in Python runtime using :func:`torch._export.aot_load`.
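 #
 # As a quick sketch of that flow (assuming the private APIs
 # :func:`torch._inductor.aot_compile` and :func:`torch._export.aot_load`
 # behave as described here; the model, shapes, and device are illustrative,
 # not part of the patch):
 #
 #     import torch
 #
 #     class M(torch.nn.Module):
 #         def forward(self, x):
 #             return torch.nn.functional.relu(x) + 1
 #
 #     example_inputs = (torch.randn(8, 10),)
 #     ep = torch.export.export(M(), example_inputs)                        # capture the graph
 #     so_path = torch._inductor.aot_compile(ep.module(), example_inputs)   # compile to a shared library
 #     compiled = torch._export.aot_load(so_path, device="cpu")             # load it back in Python
 #     print(compiled(*example_inputs))
 #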
 # You will learn about the speedup seen in the first inference time using AOTInductor, especially when using

From 44493fb83d45f87ceb3ba022d9b41a1a58f26cdd Mon Sep 17 00:00:00 2001
From: Paul Angerer
Date: Thu, 5 Sep 2024 23:37:13 +0200
Subject: [PATCH 31/31] Fix reference to dcp in loading example (#2972)

* Fix reference to DCP
* Keep import style consistent with saving

---------

Co-authored-by: Iris Z <31293777+wz337@users.noreply.github.com>
---
 recipes_source/distributed_checkpoint_recipe.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/recipes_source/distributed_checkpoint_recipe.rst b/recipes_source/distributed_checkpoint_recipe.rst
index 2467db878eb..8f93c2222d6 100644
--- a/recipes_source/distributed_checkpoint_recipe.rst
+++ b/recipes_source/distributed_checkpoint_recipe.rst
@@ -289,7 +289,7 @@ the intent is to save or load in "non-distributed" style, meaning entirely in th
     import os

     import torch
-    import torch.distributed.checkpoint as DCP
+    import torch.distributed.checkpoint as dcp
     import torch.nn as nn
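To see the corrected lowercase alias in context, here is a minimal sketch of a "non-distributed" load, assuming the ``dcp.load``/``checkpoint_id`` API this recipe relies on; the model and checkpoint directory are illustrative:

.. code-block:: python

    import torch
    import torch.distributed.checkpoint as dcp
    import torch.nn as nn

    CHECKPOINT_DIR = "checkpoint"  # hypothetical directory written by an earlier dcp.save

    model = nn.Linear(16, 8)
    state_dict = {"model": model.state_dict()}

    # dcp.load populates the passed-in state_dict in place.
    dcp.load(state_dict, checkpoint_id=CHECKPOINT_DIR)
    model.load_state_dict(state_dict["model"])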