
Fix the Title: Underline too short warnings #2731


Merged 2 commits on Jan 16, 2024
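
This PR fixes Sphinx "Title underline too short" warnings across the tutorials: in reStructuredText, a section title's adornment line must be at least as long as the title text, and most of the commits below simply lengthen the underline to match. As a rough sketch of the rule being enforced, a hypothetical checker (not part of this PR) could flag offending headings like this:

    import re
    import sys

    # An RST adornment line: a run of one punctuation character such as =, -, ~, ^, * or #.
    ADORNMENT = re.compile(r"^([=\-~^*#])\1+\s*$")

    def find_short_underlines(lines):
        """Yield (line_no, title) for each title whose underline is shorter than the title."""
        for i in range(1, len(lines)):
            title = lines[i - 1].rstrip()
            under = lines[i].rstrip()
            # A real title is non-empty and not itself an adornment (that would be a transition).
            if title and not ADORNMENT.match(title) and ADORNMENT.match(under) and len(under) < len(title):
                yield i + 1, title

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            with open(path) as f:
                for line_no, title in find_short_underlines(f.read().splitlines()):
                    print(f"{path}:{line_no}: title underline too short for {title!r}")

The same rule applies to the sphinx-gallery ``.py`` tutorials in this diff, whose headings live inside ``#`` comment blocks, one ``# `` prefix deep.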
2 changes: 1 addition & 1 deletion advanced_source/ddp_pipeline.py
@@ -439,7 +439,7 @@ def evaluate(eval_model, data_source):

######################################################################
# Evaluate the model with the test dataset
-# -------------------------------------
+# ----------------------------------------
#
# Apply the best model to check the result with the test dataset.

2 changes: 1 addition & 1 deletion advanced_source/dispatcher.rst
@@ -129,7 +129,7 @@ for debugging in larger models where previously it can be hard to pin-point
exactly where the ``requires_grad``-ness is lost during the forward pass.

In-place or view ops
-^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^

To ensure correctness and best possible performance, if your op mutates an input
in-place or returns a tensor that aliases with one of the inputs, two additional
2 changes: 1 addition & 1 deletion advanced_source/usb_semisup_learn.py
@@ -157,7 +157,7 @@

######################################################################
# Use USB to Train ``SoftMatch`` with specific imbalanced algorithm on imbalanced CIFAR-10
-# ------------------------------------------------------------------------------------
+# ----------------------------------------------------------------------------------------
#
# Now let's say we have imbalanced labeled set and unlabeled set of CIFAR-10,
# and we want to train a ``SoftMatch`` model on it.
4 changes: 2 additions & 2 deletions beginner_source/basics/autogradqs_tutorial.py
@@ -10,7 +10,7 @@
`Save & Load Model <saveloadrun_tutorial.html>`_

Automatic Differentiation with ``torch.autograd``
-=======================================
+=================================================

When training neural networks, the most frequently used algorithm is
**back propagation**. In this algorithm, parameters (model weights) are
@@ -170,7 +170,7 @@

######################################################################
# Optional Reading: Tensor Gradients and Jacobian Products
-# --------------------------------------
+# --------------------------------------------------------
#
# In many cases, we have a scalar loss function, and we need to compute
# the gradient with respect to some parameters. However, there are cases
4 changes: 2 additions & 2 deletions beginner_source/basics/buildmodel_tutorial.py
@@ -10,7 +10,7 @@
`Save & Load Model <saveloadrun_tutorial.html>`_

Build the Neural Network
-===================
+========================

Neural networks comprise of layers/modules that perform operations on data.
The `torch.nn <https://pytorch.org/docs/stable/nn.html>`_ namespace provides all the building blocks you need to
@@ -197,5 +197,5 @@ def forward(self, x):

#################################################################
# Further Reading
-# --------------
+# -----------------
# - `torch.nn API <https://pytorch.org/docs/stable/nn.html>`_
14 changes: 7 additions & 7 deletions beginner_source/basics/data_tutorial.py
@@ -10,7 +10,7 @@
`Save & Load Model <saveloadrun_tutorial.html>`_

Datasets & DataLoaders
-===================
+======================

"""

@@ -69,7 +69,7 @@

#################################################################
# Iterating and Visualizing the Dataset
-# -----------------
+# -------------------------------------
#
# We can index ``Datasets`` manually like a list: ``training_data[index]``.
# We use ``matplotlib`` to visualize some samples in our training data.
@@ -144,7 +144,7 @@ def __getitem__(self, idx):


#################################################################
-# __init__
+# ``__init__``
# ^^^^^^^^^^^^^^^^^^^^
#
# The __init__ function is run once when instantiating the Dataset object. We initialize
@@ -167,7 +167,7 @@ def __init__(self, annotations_file, img_dir, transform=None, target_transform=N


#################################################################
-# __len__
+# ``__len__``
# ^^^^^^^^^^^^^^^^^^^^
#
# The __len__ function returns the number of samples in our dataset.
@@ -180,7 +180,7 @@ def __len__(self):


#################################################################
-# __getitem__
+# ``__getitem__``
# ^^^^^^^^^^^^^^^^^^^^
#
# The __getitem__ function loads and returns a sample from the dataset at the given index ``idx``.
@@ -220,7 +220,7 @@ def __getitem__(self, idx):

###########################
# Iterate through the DataLoader
-# --------------------------
+# -------------------------------
#
# We have loaded that dataset into the ``DataLoader`` and can iterate through the dataset as needed.
# Each iteration below returns a batch of ``train_features`` and ``train_labels`` (containing ``batch_size=64`` features and labels respectively).
@@ -243,5 +243,5 @@ def __getitem__(self, idx):

#################################################################
# Further Reading
-# --------------
+# ----------------
# - `torch.utils.data API <https://pytorch.org/docs/stable/data.html>`_
4 changes: 2 additions & 2 deletions beginner_source/basics/intro.py
@@ -31,15 +31,15 @@


Running the Tutorial Code
-------------------
+-------------------------
You can run this tutorial in a couple of ways:

- **In the cloud**: This is the easiest way to get started! Each section has a "Run in Microsoft Learn" and "Run in Google Colab" link at the top, which opens an integrated notebook in Microsoft Learn or Google Colab, respectively, with the code in a fully-hosted environment.
- **Locally**: This option requires you to setup PyTorch and TorchVision first on your local machine (`installation instructions <https://pytorch.org/get-started/locally/>`_). Download the notebook or copy the code into your favorite IDE.


How to Use this Guide
------------------
+---------------------
If you're familiar with other deep learning frameworks, check out the `0. Quickstart <quickstart_tutorial.html>`_ first
to quickly familiarize yourself with PyTorch's API.

4 changes: 2 additions & 2 deletions beginner_source/basics/tensorqs_tutorial.py
@@ -80,7 +80,7 @@

######################################################################
# Attributes of a Tensor
-# ~~~~~~~~~~~~~~~~~
+# ~~~~~~~~~~~~~~~~~~~~~~
#
# Tensor attributes describe their shape, datatype, and the device on which they are stored.

@@ -97,7 +97,7 @@

######################################################################
# Operations on Tensors
-# ~~~~~~~~~~~~~~~~~
+# ~~~~~~~~~~~~~~~~~~~~~~~
#
# Over 100 tensor operations, including arithmetic, linear algebra, matrix manipulation (transposing,
# indexing, slicing), sampling and more are
4 changes: 2 additions & 2 deletions beginner_source/blitz/autograd_tutorial.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
"""
A Gentle Introduction to ``torch.autograd``
----------------------------------
+===========================================

``torch.autograd`` is PyTorch’s automatic differentiation engine that powers
neural network training. In this section, you will get a conceptual
@@ -149,7 +149,7 @@

######################################################################
# Optional Reading - Vector Calculus using ``autograd``
-# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# Mathematically, if you have a vector valued function
# :math:`\vec{y}=f(\vec{x})`, then the gradient of :math:`\vec{y}` with
2 changes: 1 addition & 1 deletion beginner_source/blitz/cifar10_tutorial.py
@@ -115,7 +115,7 @@ def imshow(img):

########################################################################
# 2. Define a Convolutional Neural Network
-# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Copy the neural network from the Neural Networks section before and modify it to
# take 3-channel images (instead of 1-channel images as it was defined).

2 changes: 1 addition & 1 deletion beginner_source/blitz/tensor_tutorial.py
@@ -1,6 +1,6 @@
"""
Tensors
---------------------------------------------
+========

Tensors are a specialized data structure that are very similar to arrays
and matrices. In PyTorch, we use tensors to encode the inputs and
4 changes: 0 additions & 4 deletions beginner_source/ddp_series_fault_tolerance.rst
@@ -93,11 +93,7 @@ In elastic training, whenever there are any membership changes (adding or removi
on available devices. Having this structure ensures your training job can continue without manual intervention.


-
-
-
Diff for `multigpu.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py>`__ v/s `multigpu_torchrun.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu_torchrun.py>`__
------------------------------------------------------------

Process group initialization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 change: 0 additions & 1 deletion beginner_source/ddp_series_multigpu.rst
@@ -52,7 +52,6 @@ Along the way, we will talk through important concepts in distributed training w


Diff for `single_gpu.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/single_gpu.py>`__ v/s `multigpu.py <https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/multigpu.py>`__
-----------------------------------------------------

These are the changes you typically make to a single-GPU training script to enable DDP.

2 changes: 1 addition & 1 deletion beginner_source/dist_overview.rst
@@ -150,7 +150,7 @@ throws an exception, it is likely to lead to desynchronization (mismatched
adds fault tolerance and the ability to make use of a dynamic pool of machines (elasticity).

RPC-Based Distributed Training
-----------------------------
+------------------------------

Many training paradigms do not fit into data parallelism, e.g.,
parameter server paradigm, distributed pipeline parallelism, reinforcement
2 changes: 1 addition & 1 deletion beginner_source/knowledge_distillation_tutorial.py
@@ -25,7 +25,7 @@
# - How to improve the performance of lightweight models by using more complex models as teachers
#
# Prerequisites
-# ~~~~~~~~~~~
+# ~~~~~~~~~~~~~
#
# * 1 GPU, 4GB of memory
# * PyTorch v2.0 or later
2 changes: 1 addition & 1 deletion beginner_source/nn_tutorial.py
@@ -98,7 +98,7 @@

###############################################################################
# Neural net from scratch (without ``torch.nn``)
-# ---------------------------------------------
+# -----------------------------------------------
#
# Let's first create a model using nothing but PyTorch tensor operations. We're assuming
# you're already familiar with the basics of neural networks. (If you're not, you can
3 changes: 2 additions & 1 deletion beginner_source/profiler.py
@@ -1,6 +1,7 @@
"""
Profiling your PyTorch Module
-------------
+-----------------------------

**Author:** `Suraj Subramanian <https://github.com/suraj813>`_

PyTorch includes a profiler API that is useful to identify the time and
13 changes: 7 additions & 6 deletions beginner_source/pytorch_with_examples.rst
@@ -1,5 +1,6 @@
Learning PyTorch with Examples
-******************************
+==============================

**Author**: `Justin Johnson <https://github.com/jcjohnson/pytorch-examples>`_

.. note::
@@ -29,7 +30,7 @@ between the network output and the true output.
:local:

Tensors
-=======
+~~~~~~~

Warm-up: numpy
--------------
@@ -74,7 +75,7 @@ and backward passes through the network:


Autograd
-========
+~~~~~~~~

PyTorch: Tensors and autograd
-------------------------------
@@ -133,7 +134,7 @@ our model:
.. includenodoc:: /beginner/examples_autograd/polynomial_custom_function.py

``nn`` module
-===========
+~~~~~~~~~~~~~

PyTorch: ``nn``
---------------
@@ -219,7 +220,7 @@ We can easily implement this model as a Module subclass:
.. _examples-download:

Examples
-========
+~~~~~~~~

You can browse the above examples here.

@@ -261,7 +262,7 @@ Autograd
<div style='clear:both'></div>

``nn`` module
------------
+--------------

.. toctree::
:maxdepth: 2
8 changes: 5 additions & 3 deletions beginner_source/t5_tutorial.py
@@ -223,8 +223,10 @@ def process_labels(labels, x):


#######################################################################
-# Summarization Output (Might vary since we shuffle the dataloader)
+# Summarization Output
# --------------------
#
+# Summarization output might vary since we shuffle the dataloader.
+#
# .. code-block::
#
@@ -315,7 +317,7 @@ def process_labels(labels, x):
# Sentiment Output
# ----------------
#
-# ::
+# .. code-block:: bash
#
# Example 1:
#
@@ -408,7 +410,7 @@ def process_labels(labels, x):
# Translation Output
# ------------------
#
-# ::
+# .. code-block:: bash
#
# Example 1:
#
2 changes: 1 addition & 1 deletion beginner_source/vt_tutorial.py
@@ -1,6 +1,6 @@
"""
Optimizing Vision Transformer Model for Deployment
-===========================
+==================================================

`Jeff Tang <https://github.com/jeffxtang>`_,
`Geeta Chauhan <https://github.com/gchauhan/>`_
6 changes: 3 additions & 3 deletions intermediate_source/FSDP_tutorial.rst
@@ -1,5 +1,5 @@
Getting Started with Fully Sharded Data Parallel(FSDP)
-=====================================================
+======================================================

**Author**: `Hamid Shojanazeri <https://github.com/HamidShojanazeri>`__, `Yanli Zhao <https://github.com/zhaojuanmao>`__, `Shen Li <https://mrshenli.github.io/>`__

@@ -56,7 +56,7 @@ One way to view FSDP's sharding is to decompose the DDP gradient all-reduce into
FSDP Allreduce

How to use FSDP
---------------
+---------------
Here we use a toy model to run training on the MNIST dataset for demonstration purposes. The APIs and logic can be applied to training larger models as well.

*Setup*
@@ -267,7 +267,7 @@ We add the following code snippets to a python script “FSDP_mnist.py”.



-2.5 Finally parse the arguments and set the main function
+2.5 Finally, parse the arguments and set the main function

.. code-block:: python

4 changes: 2 additions & 2 deletions intermediate_source/ddp_tutorial.rst
@@ -236,7 +236,7 @@ and elasticity support, please refer to `TorchElastic <https://pytorch.org/elast
cleanup()

Combining DDP with Model Parallelism
-----------------------------------
+------------------------------------

DDP also works with multi-GPU models. DDP wrapping multi-GPU models is especially
helpful when training large models with a huge amount of data.
@@ -297,7 +297,7 @@ either the application or the model ``forward()`` method.
run_demo(demo_model_parallel, world_size)

Initialize DDP with torch.distributed.run/torchrun
-----------------------------------
+---------------------------------------------------

We can leverage PyTorch Elastic to simplify the DDP code and initialize the job more easily.
Let's still use the Toymodel example and create a file named ``elastic_ddp.py``.
2 changes: 1 addition & 1 deletion intermediate_source/dynamic_quantization_bert_tutorial.rst
@@ -414,7 +414,7 @@ We reuse the tokenize and evaluation function from `Huggingface <https://github.


3. Apply the dynamic quantization
--------------------------------
+---------------------------------

We call ``torch.quantization.quantize_dynamic`` on the model to apply
the dynamic quantization on the HuggingFace BERT model. Specifically,