Commit c8f7e41

Author: Vincent Moens
Merge remote-tracking branch 'origin/main' into pinmem-nonblock-tuto
2 parents: bff42d1 + c3882db

File tree: 8 files changed (+21, -18 lines)

.ci/docker/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ tqdm==4.66.1
 numpy==1.24.4
 matplotlib
 librosa
-torch==2.3
+torch==2.4
 torchvision
 torchtext
 torchdata

.jenkins/build.sh

Lines changed: 2 additions & 2 deletions
@@ -22,8 +22,8 @@ sudo apt-get install -y pandoc
 #Install PyTorch Nightly for test.
 # Nightly - pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
 # Install 2.4 to merge all 2.4 PRs - uncomment to install nightly binaries (update the version as needed).
-pip uninstall -y torch torchvision torchaudio torchtext torchdata
-pip3 install torch==2.4.0 torchvision torchaudio --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu124
+# pip uninstall -y torch torchvision torchaudio torchtext torchdata
+# pip3 install torch==2.4.0 torchvision torchaudio --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu124

 # Install two language tokenizers for Translation with TorchText tutorial
 python -m spacy download en_core_web_sm

beginner_source/knowledge_distillation_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -352,7 +352,7 @@ def train_knowledge_distillation(teacher, student, train_loader, epochs, learnin
 # Cosine loss minimization run
 # ----------------------------
 # Feel free to play around with the temperature parameter that controls the softness of the softmax function and the loss coefficients.
-# In neural networks, it is easy to include to include additional loss functions to the main objectives to achieve goals like better generalization.
+# In neural networks, it is easy to include additional loss functions to the main objectives to achieve goals like better generalization.
 # Let's try including an objective for the student, but now let's focus on their hidden states rather than their output layers.
 # Our goal is to convey information from the teacher's representation to the student by including a naive loss function,
 # whose minimization implies that the flattened vectors that are subsequently passed to the classifiers have become more *similar* as the loss decreases.
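
The corrected comment describes attaching an auxiliary loss on the student's hidden representations next to the main classification objective. As a minimal sketch of that idea only (the tensor shapes, the 0.25 weight, and the random stand-in tensors below are assumptions for illustration, not the tutorial's code):

import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
cosine_loss = nn.CosineEmbeddingLoss()

# Stand-ins for one batch: student logits, flattened hidden vectors from
# student and teacher, and ground-truth labels (all hypothetical shapes).
student_logits = torch.randn(8, 10, requires_grad=True)
student_hidden = torch.randn(8, 256, requires_grad=True)
teacher_hidden = torch.randn(8, 256)
labels = torch.randint(0, 10, (8,))

# Main objective plus an auxiliary term that pulls the student's flattened
# vectors toward the teacher's; target=1 tells CosineEmbeddingLoss to treat
# each pair as "should be similar".
target = torch.ones(student_hidden.size(0))
loss = ce_loss(student_logits, labels) + 0.25 * cosine_loss(student_hidden, teacher_hidden, target)
loss.backward()

Minimizing the cosine term drives the two flattened vectors toward higher cosine similarity, which is the effect the comment block describes.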

index.rst

Lines changed: 5 additions & 5 deletions
@@ -3,11 +3,11 @@ Welcome to PyTorch Tutorials
 
 **What's new in PyTorch tutorials?**
 
-* `Using User-Defined Triton Kernels with torch.compile <https://pytorch.org/tutorials/recipes/torch_compile_user_defined_triton_kernel_tutorial.html>`__
-* `Large Scale Transformer model training with Tensor Parallel (TP) <https://pytorch.org/tutorials/intermediate/TP_tutorial.html>`__
-* `Accelerating BERT with semi-structured (2:4) sparsity <https://pytorch.org/tutorials/advanced/semi_structured_sparse.html>`__
-* `torch.export Tutorial with torch.export.Dim <https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html>`__
-* `Extension points in nn.Module for load_state_dict and tensor subclasses <https://pytorch.org/tutorials/recipes/recipes/swap_tensors.html>`__
+* `Introduction to Distributed Pipeline Parallelism <https://pytorch.org/tutorials/intermediate/pipelining_tutorial.html>`__
+* `Introduction to Libuv TCPStore Backend <https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html>`__
+* `Asynchronous Saving with Distributed Checkpoint (DCP) <https://pytorch.org/tutorials/recipes/distributed_async_checkpoint_recipe.html>`__
+* `Python Custom Operators <https://pytorch.org/tutorials/advanced/python_custom_ops.html>`__
+* Updated `Getting Started with DeviceMesh <https://pytorch.org/tutorials/recipes/distributed_device_mesh.html>`__
 
 .. raw:: html
 

intermediate_source/FSDP_adavnced_tutorial.rst

Lines changed: 1 addition & 1 deletion
@@ -502,7 +502,7 @@ layer class (holding MHSA and FFN).
 
 
 model = FSDP(model,
-    fsdp_auto_wrap_policy=t5_auto_wrap_policy)
+    auto_wrap_policy=t5_auto_wrap_policy)
 
 To see the wrapped model, you can easily print the model and visually inspect
 the sharding and FSDP units as well.
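
The change in this file is only the keyword rename from fsdp_auto_wrap_policy to auto_wrap_policy. For context, a policy such as t5_auto_wrap_policy is commonly built from transformer_auto_wrap_policy so that each transformer block becomes its own FSDP unit; the sketch below assumes the Hugging Face T5Block is the layer class being wrapped, which is not shown in this diff:

import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.t5.modeling_t5 import T5Block  # assumes the transformers package is installed

# Seal an FSDP unit around every T5Block (the transformer layer holding MHSA and FFN).
t5_auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={T5Block},
)

# The policy is then passed with the renamed keyword:
# model = FSDP(model, auto_wrap_policy=t5_auto_wrap_policy)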

intermediate_source/FSDP_tutorial.rst

Lines changed: 7 additions & 7 deletions
@@ -70,7 +70,7 @@ We add the following code snippets to a python script “FSDP_mnist.py”.
 1.2 Import necessary packages
 
 .. note::
-   This tutorial is intended for PyTorch versions 1.12 and later. If you are using an earlier version, replace all instances of `size_based_auto_wrap_policy` with `default_auto_wrap_policy`.
+   This tutorial is intended for PyTorch versions 1.12 and later. If you are using an earlier version, replace all instances of `size_based_auto_wrap_policy` with `default_auto_wrap_policy` and `fsdp_auto_wrap_policy` with `auto_wrap_policy`.
 
 .. code-block:: python
 
@@ -308,7 +308,7 @@ We have recorded cuda events to measure the time of FSDP model specifics. The CU
 CUDA event elapsed time on training loop 40.67462890625sec
 
 Wrapping the model with FSDP, the model will look as follows, we can see the model has been wrapped in one FSDP unit.
-Alternatively, we will look at adding the fsdp_auto_wrap_policy next and will discuss the differences.
+Alternatively, we will look at adding the auto_wrap_policy next and will discuss the differences.
 
 .. code-block:: bash
 
@@ -335,12 +335,12 @@ The following is the peak memory usage from FSDP MNIST training on g4dn.12.xlarg
 
 FSDP Peak Memory Usage
 
-Applying *fsdp_auto_wrap_policy* in FSDP otherwise, FSDP will put the entire model in one FSDP unit, which will reduce computation efficiency and memory efficiency.
+Applying *auto_wrap_policy* in FSDP otherwise, FSDP will put the entire model in one FSDP unit, which will reduce computation efficiency and memory efficiency.
 The way it works is that, suppose your model contains 100 Linear layers. If you do FSDP(model), there will only be one FSDP unit which wraps the entire model.
 In that case, the allgather would collect the full parameters for all 100 linear layers, and hence won't save CUDA memory for parameter sharding.
 Also, there is only one blocking allgather call for the all 100 linear layers, there will not be communication and computation overlapping between layers.
 
-To avoid that, you can pass in an fsdp_auto_wrap_policy, which will seal the current FSDP unit and start a new one automatically when the specified condition is met (e.g., size limit).
+To avoid that, you can pass in an auto_wrap_policy, which will seal the current FSDP unit and start a new one automatically when the specified condition is met (e.g., size limit).
 In that way you will have multiple FSDP units, and only one FSDP unit needs to collect full parameters at a time. E.g., suppose you have 5 FSDP units, and each wraps 20 linear layers.
 Then, in the forward, the 1st FSDP unit will allgather parameters for the first 20 linear layers, do computation, discard the parameters and then move on to the next 20 linear layers. So, at any point in time, each rank only materializes parameters/grads for 20 linear layers instead of 100.
 
@@ -358,9 +358,9 @@ Finding an optimal auto wrap policy is challenging, PyTorch will add auto tuning
 model = Net().to(rank)
 
 model = FSDP(model,
-    fsdp_auto_wrap_policy=my_auto_wrap_policy)
+    auto_wrap_policy=my_auto_wrap_policy)
 
-Applying the fsdp_auto_wrap_policy, the model would be as follows:
+Applying the auto_wrap_policy, the model would be as follows:
 
 .. code-block:: bash
 
@@ -411,7 +411,7 @@ In 2.4 we just add it to the FSDP wrapper
 .. code-block:: python
 
 model = FSDP(model,
-    fsdp_auto_wrap_policy=my_auto_wrap_policy,
+    auto_wrap_policy=my_auto_wrap_policy,
     cpu_offload=CPUOffload(offload_params=True))
 
 
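
Every hunk in this file tracks the same rename: the FSDP keyword argument fsdp_auto_wrap_policy became auto_wrap_policy, the spelling used by PyTorch 1.12 and later. A self-contained sketch of the updated call follows; the toy model, the min_num_params=100 threshold, and the single-process CPU gloo group are illustrative assumptions rather than the tutorial's full multi-process setup:

import functools
import os

import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# A single-process group just so FSDP can be constructed in this sketch;
# the tutorial spawns one process per rank with torch.multiprocessing.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Toy model standing in for the tutorial's Net class.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# Seal the current FSDP unit and start a new one whenever a wrapped
# submodule exceeds min_num_params parameters.
my_auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=100)

# Renamed keyword: auto_wrap_policy (older releases used fsdp_auto_wrap_policy).
model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)
print(model)  # printing shows the resulting FSDP units

dist.destroy_process_group()

With a size-based policy, only one FSDP unit needs to materialize its full parameters at a time, which is the memory and overlap benefit the changed paragraphs explain.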

intermediate_source/TCPStore_libuv_backend.rst

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@ Introduction to Libuv TCPStore Backend
 .. grid:: 2
 
 .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
-:class-card: card-prerequisites
+:class-card: card-prerequisites
+
 * What is the new TCPStore backend
 * Compare the new libuv backend against the legacy backend
 * How to enable to use the legacy backend

intermediate_source/pipelining_tutorial.rst

Lines changed: 2 additions & 0 deletions
@@ -12,13 +12,15 @@ APIs.
 .. grid:: 2
 
 .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+:class-card: card-prerequisites
 
 * How to use ``torch.distributed.pipelining`` APIs
 * How to apply pipeline parallelism to a transformer model
 * How to utilize different schedules on a set of microbatches
 
 
 .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+:class-card: card-prerequisites
 
 * Familiarity with `basic distributed training <https://pytorch.org/tutorials/beginner/dist_overview.html>`__ in PyTorch
 
