Commit ad7f5c8

Merge branch 'main' into third-nav
2 parents: 6e36cac + 54273de

8 files changed (+73, −64 lines)

beginner_source/hyperparameter_tuning_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
 Fortunately, there are tools that help with finding the best combination of parameters.
 `Ray Tune <https://docs.ray.io/en/latest/tune.html>`_ is an industry standard tool for
 distributed hyperparameter tuning. Ray Tune includes the latest hyperparameter search
-algorithms, integrates with TensorBoard and other analysis libraries, and natively
+algorithms, integrates with various analysis libraries, and natively
 supports distributed training through `Ray's distributed machine learning engine
 <https://ray.io/>`_.

beginner_source/ptcheat.rst

Lines changed: 4 additions & 17 deletions
@@ -22,27 +22,12 @@ Neural Network API
    import torch.nn as nn            # neural networks
    import torch.nn.functional as F  # layers, activations and more
    import torch.optim as optim      # optimizers e.g. gradient descent, ADAM, etc.
-   from torch.jit import script, trace  # hybrid frontend decorator and tracing jit

 See `autograd <https://pytorch.org/docs/stable/autograd.html>`__,
 `nn <https://pytorch.org/docs/stable/nn.html>`__,
 `functional <https://pytorch.org/docs/stable/nn.html#torch-nn-functional>`__
 and `optim <https://pytorch.org/docs/stable/optim.html>`__

-TorchScript and JIT
--------------------
-
-.. code-block:: python
-
-   torch.jit.trace()  # takes your module or function and an example
-                      # data input, and traces the computational steps
-                      # that the data encounters as it progresses through the model
-
-   @script            # decorator used to indicate data-dependent
-                      # control flow within the code being traced
-
-See `Torchscript <https://pytorch.org/docs/stable/jit.html>`__
-
 ONNX
 ----

@@ -225,8 +210,10 @@ Optimizers

    opt = optim.x(model.parameters(), ...)   # create optimizer
    opt.step()                               # update weights
-   optim.X                                  # where X is SGD, Adadelta, Adagrad, Adam,
-                                            # AdamW, SparseAdam, Adamax, ASGD,
+   opt.zero_grad()                          # clear the gradients
+   optim.X                                  # where X is SGD, AdamW, Adam,
+                                            # Adafactor, NAdam, RAdam, Adadelta,
+                                            # Adagrad, SparseAdam, Adamax, ASGD,
                                             # LBFGS, RMSprop or Rprop

 See `optimizers <https://pytorch.org/docs/stable/optim.html>`__
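The ``opt.zero_grad()`` line added to the cheat sheet is easiest to read in the context of a full training step. The following is a minimal sketch of the usual ordering (the toy model and data are our illustration, not part of the cheat sheet):

    # Hedged sketch: where opt.zero_grad() fits in a typical training step.
    import torch
    import torch.nn as nn
    import torch.optim as optim

    model = nn.Linear(10, 1)
    opt = optim.AdamW(model.parameters(), lr=1e-3)
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    opt.zero_grad()                                   # clear gradients left over from the previous step
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                                   # compute gradients
    opt.step()                                        # update weights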

distributed/home.rst

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ Learn FSDP

    .. grid-item-card:: :octicon:`file-code;1em`
       FSDP Advanced
-      :link: https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced
+      :link: https://pytorch.org/tutorials/intermediate/FSDP_advanced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced
       :link-type: url

       In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5

en-wordlist.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,3 @@
-
 ACL
 ADI
 AOT
@@ -188,6 +187,7 @@ PIL's
 PPO
 PatchPredictor
 PennFudan
+Perfetto
 Pixman
 Plotly
 Pohjonen

index.rst

Lines changed: 2 additions & 2 deletions
@@ -763,7 +763,7 @@ Welcome to PyTorch Tutorials
    :header: Advanced Model Training with Fully Sharded Data Parallel (FSDP)
    :card_description: Explore advanced model training with Fully Sharded Data Parallel package.
    :image: _static/img/thumbnails/cropped/Getting-Started-with-FSDP.png
-   :link: intermediate/FSDP_adavnced_tutorial.html
+   :link: intermediate/FSDP_advanced_tutorial.html
    :tags: Parallel-and-Distributed-Training

 .. customcarditem::
@@ -1091,7 +1091,7 @@ Additional Resources
    intermediate/ddp_tutorial
    intermediate/dist_tuto
    intermediate/FSDP_tutorial
-   intermediate/FSDP_adavnced_tutorial
+   intermediate/FSDP_advanced_tutorial
    intermediate/TCPStore_libuv_backend
    intermediate/TP_tutorial
    intermediate/pipelining_tutorial

intermediate_source/FSDP_adavnced_tutorial.rst renamed to intermediate_source/FSDP_advanced_tutorial.rst

Lines changed: 54 additions & 39 deletions
@@ -6,25 +6,44 @@ Wright <https://github.com/lessw2020>`__, `Rohan Varma
 <https://github.com/rohan-varma/>`__, `Yanli Zhao
 <https://github.com/zhaojuanmao>`__

+.. grid:: 2
+
+   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+      :class-card: card-prerequisites
+
+      * PyTorch's Fully Sharded Data Parallel Module: A wrapper for sharding module parameters across
+        data parallel workers.
+
+
+   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+      :class-card: card-prerequisites
+
+      * PyTorch 1.12 or later
+      * Read about the `FSDP API <https://pytorch.org/docs/main/fsdp.html>`__.
+

 This tutorial introduces more advanced features of Fully Sharded Data Parallel
 (FSDP) as part of the PyTorch 1.12 release. To get familiar with FSDP, please
 refer to the `FSDP getting started tutorial
 <https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__.

 In this tutorial, we fine-tune a HuggingFace (HF) T5 model with FSDP for text
-summarization as a working example.
+summarization as a working example.

 The example uses Wikihow and for simplicity, we will showcase the training on a
-single node, P4dn instance with 8 A100 GPUs. We will soon have a blog post on
-large scale FSDP training on a multi-node cluster, please stay tuned for that on
-the PyTorch medium channel.
+single node, P4dn instance with 8 A100 GPUs. We now have several blog posts (
+`(link1), <https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/>`__
+`(link2) <https://engineering.fb.com/2021/07/15/open-source/fsdp/>`__)
+and a `paper <https://arxiv.org/abs/2304.11277>`__ on
+large scale FSDP training on a multi-node cluster.

 FSDP is a production ready package with focus on ease of use, performance, and
 long-term support. One of the main benefits of FSDP is reducing the memory
 footprint on each GPU. This enables training of larger models with lower total
 memory vs DDP, and leverages the overlap of computation and communication to
-train models efficiently.
+train models efficiently.
 This reduced memory pressure can be leveraged to either train larger models or
 increase batch size, potentially helping overall training throughput. You can
 read more about PyTorch FSDP `here
@@ -47,21 +66,21 @@ Recap on How FSDP Works

 At a high level FDSP works as follow:

-*In constructor*
+*In the constructor*

 * Shard model parameters and each rank only keeps its own shard

-*In forward pass*
+*In the forward pass*

 * Run `all_gather` to collect all shards from all ranks to recover the full
-  parameter for this FSDP unit Run forward computation
-* Discard non-owned parameter shards it has just collected to free memory
+  parameter for this FSDP unit and run the forward computation
+* Discard the non-owned parameter shards it has just collected to free memory

-*In backward pass*
+*In the backward pass*

 * Run `all_gather` to collect all shards from all ranks to recover the full
-  parameter in this FSDP unit Run backward computation
-* Discard non-owned parameters to free memory.
+  parameter in this FSDP unit and run backward computation
+* Discard non-owned parameters to free memory.
 * Run reduce_scatter to sync gradients


@@ -80,15 +99,11 @@ examples

 *Setup*

-1.1 Install PyTorch Nightlies
-
-We will install PyTorch nightlies, as some of the features such as activation
-checkpointing is available in nightlies and will be added in next PyTorch
-release after 1.12.
+1.1 Install the latest PyTorch

-.. code-block:: bash
+.. code-block:: bash

-   pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu113/torch_nightly.html
+   pip3 install torch torchvision torchaudio

 1.2 Dataset Setup

@@ -154,7 +169,7 @@ Next, we add the following code snippets to a Python script “T5_training.py”
    import tqdm
    from datetime import datetime

-1.4 Distributed training setup.
+1.4 Distributed training setup.
 Here we use two helper functions to initialize the processes for distributed
 training, and then to clean up after training completion. In this tutorial, we
 are going to use torch elastic, using `torchrun
191206
date_of_run = datetime.now().strftime("%Y-%m-%d-%I:%M:%S_%p")
192207
print(f"--> current date and time of run = {date_of_run}")
193208
return date_of_run
194-
209+
195210
def format_metrics_to_gb(item):
196211
"""quick function to format numbers to gigabyte and round to 4 digit precision"""
197212
metric_num = item / g_gigabyte
198213
metric_num = round(metric_num, ndigits=4)
199214
return metric_num
200-
215+
201216
202217
2.2 Define a train function:
203218

@@ -275,7 +290,7 @@ metrics.
275290

276291
.. code-block:: python
277292
278-
293+
279294
def fsdp_main(args):
280295
281296
model, tokenizer = setup_model("t5-base")
@@ -292,7 +307,7 @@ metrics.
292307
293308
294309
#wikihow(tokenizer, type_path, num_samples, input_length, output_length, print_text=False)
295-
train_dataset = wikihow(tokenizer, 'train', 1500, 512, 150, False)
310+
train_dataset = wikihow(tokenizer, 'train', 1500, 512, 150, False)
296311
val_dataset = wikihow(tokenizer, 'validation', 300, 512, 150, False)
297312
298313
sampler1 = DistributedSampler(train_dataset, rank=rank, num_replicas=world_size, shuffle=True)
@@ -430,7 +445,7 @@ metrics.
430445

431446
.. code-block:: python
432447
433-
448+
434449
if __name__ == '__main__':
435450
# Training settings
436451
parser = argparse.ArgumentParser(description='PyTorch T5 FSDP Example')
@@ -463,7 +478,7 @@ metrics.
463478
464479
To run the the training using torchrun:
465480

466-
.. code-block:: bash
481+
.. code-block:: bash
467482
468483
torchrun --nnodes 1 --nproc_per_node 4 T5_training.py
469484
@@ -487,7 +502,7 @@ communication efficient. In PyTorch 1.12, FSDP added this support and now we
487502
have a wrapping policy for transfomers.
488503

489504
It can be created as follows, where the T5Block represents the T5 transformer
490-
layer class (holding MHSA and FFN).
505+
layer class (holding MHSA and FFN).
491506

492507

493508
.. code-block:: python
@@ -499,7 +514,7 @@ layer class (holding MHSA and FFN).
499514
},
500515
)
501516
torch.cuda.set_device(local_rank)
502-
517+
503518
504519
model = FSDP(model,
505520
auto_wrap_policy=t5_auto_wrap_policy)
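The top of the wrapping-policy snippet falls outside the diff context above. As a hedged sketch (the imports and model construction here are our assumptions, not shown in the commit), a transformer auto-wrap policy for T5 is typically built like this:

    # Hedged sketch: building the t5_auto_wrap_policy referenced in the hunk above.
    # Assumes PyTorch >= 1.12 and an already-initialized process group (e.g. via torchrun).
    import functools
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from transformers import T5ForConditionalGeneration
    from transformers.models.t5.modeling_t5 import T5Block

    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    t5_auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={T5Block},   # each T5 transformer layer becomes its own FSDP unit
    )
    model = FSDP(model, auto_wrap_policy=t5_auto_wrap_policy)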
@@ -513,22 +528,22 @@ Mixed Precision
 FSDP supports flexible mixed precision training allowing for arbitrary reduced
 precision types (such as fp16 or bfloat16). Currently BFloat16 is only available
 on Ampere GPUs, so you need to confirm native support before you use it. On
-V100s for example, BFloat16 can still be run but due to it running non-natively,
+V100s for example, BFloat16 can still be run but because it runs non-natively,
 it can result in significant slowdowns.

 To check if BFloat16 is natively supported, you can use the following :

 .. code-block:: python

-
+
    bf16_ready = (
        torch.version.cuda
-       and torch.cuda.is_bf16_supported()
+       and torch.cuda.is_bf16_supported()
        and LooseVersion(torch.version.cuda) >= "11.0"
        and dist.is_nccl_available()
        and nccl.version() >= (2, 10)
    )

-One of the advantages of mixed percision in FSDP is providing granular control
+One of the advantages of mixed precision in FSDP is providing granular control
 over different precision levels for parameters, gradients, and buffers as
 follows:
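The policy definitions themselves are not part of this diff's context. A minimal sketch of such a policy, assuming the `MixedPrecision` class from `torch.distributed.fsdp` (the name `bfSixteen` mirrors the variable used in the hunks below):

    # Hedged sketch: a bfloat16 mixed precision policy for FSDP.
    import torch
    from torch.distributed.fsdp import MixedPrecision

    bfSixteen = MixedPrecision(
        param_dtype=torch.bfloat16,    # parameters are gathered and computed in bf16
        reduce_dtype=torch.bfloat16,   # gradient reduce_scatter runs in bf16
        buffer_dtype=torch.bfloat16,   # buffers are cast to bf16
    )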

@@ -571,7 +586,7 @@ with the following policy:

 .. code-block:: bash

    grad_bf16 = MixedPrecision(reduce_dtype=torch.bfloat16)
-
+

 In 2.4 we just add the relevant mixed precision policy to the FSDP wrapper:

@@ -604,9 +619,9 @@ CPU-based initialization:
        auto_wrap_policy=t5_auto_wrap_policy,
        mixed_precision=bfSixteen,
        device_id=torch.cuda.current_device())
-

-
+
+

 Sharding Strategy
 -----------------
 FSDP sharding strategy by default is set to fully shard the model parameters,
@@ -627,7 +642,7 @@ instead of "ShardingStrategy.FULL_SHARD" to the FSDP initialization as follows:
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP # ZERO2)

 This will reduce the communication overhead in FSDP, in this case, it holds full
-parameters after forward and through the backwards pass.
+parameters after forward and through the backwards pass.

 This saves an all_gather during backwards so there is less communication at the
 cost of a higher memory footprint. Note that full model params are freed at the
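To make the FULL_SHARD vs. SHARD_GRAD_OP trade-off concrete, here is a hedged sketch of selecting the strategy at wrap time (only the SHARD_GRAD_OP line appears in the diff context above; the toy module and the rest of the call are our assumptions):

    # Hedged sketch: choosing an FSDP sharding strategy; assumes an initialized process group.
    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

    model = nn.Linear(10, 10).cuda()   # toy stand-in module

    sharded_model = FSDP(
        model,
        # FULL_SHARD (default, ZeRO-3 style): shard params, grads and optimizer state.
        # SHARD_GRAD_OP (ZeRO-2 style): keep full params from forward through backward,
        # trading extra memory for one fewer all_gather per FSDP unit.
        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    )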
@@ -652,12 +667,12 @@ wrapper in 2.4 as follows:
        mixed_precision=bfSixteen,
        device_id=torch.cuda.current_device(),
        backward_prefetch = BackwardPrefetch.BACKWARD_PRE)
-
+
 `backward_prefetch` has two modes, `BACKWARD_PRE` and `BACKWARD_POST`.
 `BACKWARD_POST` means that the next FSDP unit's params will not be requested
 until the current FSDP unit processing is complete, thus minimizing memory
 overhead. In some cases, using `BACKWARD_PRE` can increase model training speed
-up to 2-10%, with even higher speed improvements noted for larger models.
+up to 2-10%, with even higher speed improvements noted for larger models.

 Model Checkpoint Saving, by streaming to the Rank0 CPU
 ------------------------------------------------------
@@ -696,7 +711,7 @@ Pytorch 1.12 and used HF T5 as the running example. Using the proper wrapping
 policy especially for transformer models, along with mixed precision and
 backward prefetch should speed up your training runs. Also, features such as
 initializing the model on device, and checkpoint saving via streaming to CPU
-should help to avoid OOM error in dealing with large models.
+should help to avoid OOM error in dealing with large models.

 We are actively working to add new features to FSDP for the next release. If
 you have feedback, feature requests, questions or are encountering issues

intermediate_source/tensorboard_profiler_tutorial.py

Lines changed: 8 additions & 0 deletions
@@ -4,6 +4,14 @@
 This tutorial demonstrates how to use TensorBoard plugin with PyTorch Profiler
 to detect performance bottlenecks of the model.

+.. warning::
+   The TensorBoard integration with the PyTorch profiler is now
+   deprecated. Instead, use Perfetto or the Chrome trace to
+   view ``trace.json`` files. After
+   `generating a trace <https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#using-tracing-functionality>`__,
+   simply drag the ``trace.json`` into `Perfetto UI <https://ui.perfetto.dev/>`__
+   or ``chrome://tracing`` to visualize your profile.
+
 Introduction
 ------------
 PyTorch 1.8 includes an updated profiler API capable of
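For context on the Perfetto workflow the new warning points to, here is a minimal sketch of producing a ``trace.json`` with the profiler (this example is our illustration, not part of the commit):

    # Hedged sketch: generate a trace.json, then drag it into https://ui.perfetto.dev/
    # or chrome://tracing, as the deprecation warning above suggests.
    import torch
    import torchvision.models as models
    from torch.profiler import profile, ProfilerActivity

    model = models.resnet18()
    inputs = torch.randn(8, 3, 224, 224)

    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        model(inputs)

    prof.export_chrome_trace("trace.json")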

recipes_source/recipes/profiler_recipe.py

Lines changed: 2 additions & 3 deletions
@@ -459,7 +459,6 @@ def trace_handler(p):
 #
 # Take a look at the following recipes/tutorials to continue your learning:
 #
-# - `PyTorch Benchmark <https://pytorch.org/tutorials/recipes/recipes/benchmark.html>`_
-# - `PyTorch Profiler with TensorBoard <https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html>`_ tutorial
-# - `Visualizing models, data, and training with TensorBoard <https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html>`_ tutorial
+# - `PyTorch Benchmark <https://pytorch.org/tutorials/recipes/recipes/benchmark.html>`_
+# - `Visualizing models, data, and training with TensorBoard <https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html>`_ tutorial
 #
