Release/1.6 #1087

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged: 38 commits, merged Jul 28, 2020

Commits
353174f  Add TorchScript fork/join tutorial (Jun 11, 2020)
28f044e  Merge pull request #1021 from jamesr66a/fork_join (Jun 18, 2020)
4e97bce  Add note about zipfile format in serialization tutorial (Jun 19, 2020)
90f4771  Merge pull request #1036 from jamesr66a/save_note (Jun 19, 2020)
823ddca  Profiler recipe (#1019) (ilia-cher, Jun 23, 2020)
910dfa9  [mobile] Mobile Perf Recipe (IvanKobzarev, Jun 16, 2020)
7b182fe  Merge branch 'master' into recipe_mobile_perf (Jun 24, 2020)
6285b8f  Merge pull request #1031 from IvanKobzarev/recipe_mobile_perf (Jun 24, 2020)
02aef1d  Minor syntax edits to mobile perf recipe (Jun 30, 2020)
d64b9aa  Remove built files (Jun 30, 2020)
8e4e379  [android] android native app recipe (IvanKobzarev, Jun 26, 2020)
9cc3ce8  [mobile_perf][recipe] Add ChannelsLast recommendation (IvanKobzarev, Jun 26, 2020)
a79c227  Merge branch 'master' into recipe_aarlink_customop (Jul 2, 2020)
7b615c1  Merge pull request #1041 from IvanKobzarev/recipe_aarlink_customop (Jul 2, 2020)
e6ac673  Merge branch 'release/1.6' into mobile_perf_channels_last (Jul 2, 2020)
86b5c4a  Adding distributed pipeline parallel tutorial (Apr 11, 2020)
fff3fba  Merge pull request #1042 from IvanKobzarev/mobile_perf_channels_last (Jul 2, 2020)
dec681d  Merge pull request #948 from mrshenli/pipeline (Jul 2, 2020)
f0be561  Add async execution tutorials (mrshenli, Jun 30, 2020)
aef2568  Merge pull request #1045 from mrshenli/batch (Jul 3, 2020)
51abc6a  Fix code block in pipeline tutorial (mrshenli, Jul 4, 2020)
2f3ab79  Merge pull request #1050 from mrshenli/fix (Jul 4, 2020)
f8465c3  Adding an Overview Page for PyTorch Distributed (#1056) (mrshenli, Jul 9, 2020)
d766eeb  [Mobile Perf Recipe] Add the benchmarking part for iOS (#1055) (xta0, Jul 13, 2020)
569e96f  RPC profiling recipe (#1068) (rohan-varma, Jul 15, 2020)
5aa9d48  Push latest changes from master into release/1.6 (#1074) (Jul 16, 2020)
bfb3bb5  Tutorial for DDP + RPC (#1071) (pritamdamania87, Jul 16, 2020)
f9f3088  Make RPC profiling recipe into prototype tutorial (#1078) (Jul 21, 2020)
c8e79e1  Add RPC tutorial (Jul 21, 2020)
f2c549d  Update to include recipes (Jul 21, 2020)
d9d152b  Add Graph Mode Dynamic Quant tutorial (#1065) (supriyar, Jul 21, 2020)
b53f272  Add mobile recipes images (jlin27, Jul 21, 2020)
a855e46  Update mobile recipe index (Jul 21, 2020)
f8b200d  Remove RPC Profiling recipe from index (Jul 22, 2020)
6e261b2  1.6 model freezing tutorial (#1077) (Jul 22, 2020)
def85bd  Update title (Jul 22, 2020)
7945534  Update recipes_index.rst (brianjo, Jul 28, 2020)
f4561a8  Update dcgan_faces_tutorial.py (gchanan, Jul 28, 2020)

4 changes: 4 additions & 0 deletions .gitignore
@@ -4,6 +4,7 @@ intermediate
advanced
pytorch_basics
recipes
prototype

#data things
_data/
@@ -117,3 +118,6 @@ ENV/
.DS_Store
cleanup.sh
*.swp

# PyTorch things
*.pt
16 changes: 13 additions & 3 deletions .jenkins/build.sh
@@ -86,6 +86,16 @@ if [[ "${JOB_BASE_NAME}" == *worker_* ]]; then
FILES_TO_RUN+=($(basename $filename .py))
fi
count=$((count+1))
done
for filename in $(find prototype_source/ -name '*.py' -not -path '*/data/*'); do
if [ $(($count % $NUM_WORKERS)) != $WORKER_ID ]; then
echo "Removing runnable code from "$filename
python $DIR/remove_runnable_code.py $filename $filename
else
echo "Keeping "$filename
FILES_TO_RUN+=($(basename $filename .py))
fi
count=$((count+1))
done
echo "FILES_TO_RUN: " ${FILES_TO_RUN[@]}

@@ -94,13 +104,13 @@ if [[ "${JOB_BASE_NAME}" == *worker_* ]]; then

# Step 4: If any of the generated files are not related the tutorial files we want to run,
# then we remove them
for filename in $(find docs/beginner docs/intermediate docs/advanced docs/recipes -name '*.html'); do
for filename in $(find docs/beginner docs/intermediate docs/advanced docs/recipes docs/prototype -name '*.html'); do
file_basename=$(basename $filename .html)
if [[ ! " ${FILES_TO_RUN[@]} " =~ " ${file_basename} " ]]; then
rm $filename
fi
done
for filename in $(find docs/beginner docs/intermediate docs/advanced docs/recipes -name '*.rst'); do
for filename in $(find docs/beginner docs/intermediate docs/advanced docs/recipes docs/prototype -name '*.rst'); do
file_basename=$(basename $filename .rst)
if [[ ! " ${FILES_TO_RUN[@]} " =~ " ${file_basename} " ]]; then
rm $filename
@@ -124,7 +134,7 @@ if [[ "${JOB_BASE_NAME}" == *worker_* ]]; then
rm $filename
fi
done
for filename in $(find docs/.doctrees/beginner docs/.doctrees/intermediate docs/.doctrees/advanced docs/.doctrees/recipes -name '*.doctree'); do
for filename in $(find docs/.doctrees/beginner docs/.doctrees/intermediate docs/.doctrees/advanced docs/.doctrees/recipes docs/.doctrees/prototype -name '*.doctree'); do
file_basename=$(basename $filename .doctree)
if [[ ! " ${FILES_TO_RUN[@]} " =~ " ${file_basename} " ]]; then
rm $filename
4 changes: 2 additions & 2 deletions README.md
@@ -14,8 +14,8 @@ We use sphinx-gallery's [notebook styled examples](https://sphinx-gallery.github
Here's how to create a new tutorial or recipe:
1. Create a notebook styled python file. If you want it executed while inserted into documentation, save the file with suffix `tutorial` so that file name is `your_tutorial.py`.
2. Put it in one of the beginner_source, intermediate_source, advanced_source based on the level. If it is a recipe, add to recipes_source.
2. For Tutorials, include it in the TOC tree at index.rst
3. For Tutorials, create a thumbnail in the [index.rst file](https://github.com/pytorch/tutorials/blob/master/index.rst) using a command like `.. customcarditem:: beginner/your_tutorial.html`. For Recipes, create a thumbnail in the [recipes_index.rst](https://github.com/pytorch/tutorials/blob/master/recipes_source/recipes_index.rst)
2. For Tutorials (except if it is a prototype feature), include it in the TOC tree at index.rst
3. For Tutorials (except if it is a prototype feature), create a thumbnail in the [index.rst file](https://github.com/pytorch/tutorials/blob/master/index.rst) using a command like `.. customcarditem:: beginner/your_tutorial.html`. For Recipes, create a thumbnail in the [recipes_index.rst](https://github.com/pytorch/tutorials/blob/master/recipes_source/recipes_index.rst)

In case you prefer to write your tutorial in jupyter, you can use [this script](https://gist.github.com/chsasank/7218ca16f8d022e02a9c0deb94a310fe) to convert the notebook to python file. After conversion and addition to the project, please make sure the sections headings etc are in logical order.

Binary file added _static/img/rpc-images/batch.png
Binary file added _static/img/rpc_trace_img.png
Binary file added _static/img/thumbnails/cropped/android.png
Binary file added _static/img/thumbnails/cropped/ios.png
Binary file added _static/img/thumbnails/cropped/mobile.png
Binary file added _static/img/thumbnails/cropped/profiler.png
Binary file added _static/img/trace_img.png
4 changes: 2 additions & 2 deletions advanced_source/dynamic_quantization_tutorial.py
@@ -1,5 +1,5 @@
"""
(experimental) Dynamic Quantization on an LSTM Word Language Model
(beta) Dynamic Quantization on an LSTM Word Language Model
==================================================================

**Author**: `James Reed <https://github.com/jamesr66a>`_
@@ -13,7 +13,7 @@
to int, which can result in smaller model size and faster inference with only a small
hit to accuracy.

In this tutorial, we'll apply the easiest form of quantization -
In this tutorial, we'll apply the easiest form of quantization -
`dynamic quantization <https://pytorch.org/docs/stable/quantization.html#torch.quantization.quantize_dynamic>`_ -
to an LSTM-based next word-prediction model, closely following the
`word language model <https://github.com/pytorch/examples/tree/master/word_language_model>`_
2 changes: 1 addition & 1 deletion advanced_source/neural_style_tutorial.py
@@ -83,7 +83,7 @@
# An important detail to note is that neural networks from the
# torch library are trained with tensor values ranging from 0 to 1. If you
# try to feed the networks with 0 to 255 tensor images, then the activated
# feature maps will be unable sense the intended content and style.
# feature maps will be unable to sense the intended content and style.
# However, pre-trained networks from the Caffe library are trained with 0
# to 255 tensor images.
#
159 changes: 159 additions & 0 deletions advanced_source/rpc_ddp_tutorial.rst
@@ -0,0 +1,159 @@
Combining Distributed DataParallel with Distributed RPC Framework
=================================================================
**Author**: `Pritam Damania <https://github.com/pritamdamania87>`_


This tutorial uses a simple example to demonstrate how you can combine
`DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__ (DDP)
with the `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
to mix distributed data parallelism with distributed model parallelism when
training a simple model. Source code of the example can be found `here <https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc>`__.

Previous tutorials,
`Getting Started With Distributed Data Parallel <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`__
and `Getting Started with Distributed RPC Framework <https://pytorch.org/tutorials/intermediate/rpc_tutorial.html>`__,
described how to perform distributed data parallel and distributed model
parallel training respectively. However, there are several training paradigms
where you might want to combine these two techniques. For example:

1) If we have a model with a sparse part (large embedding table) and a dense
part (FC layers), we might want to put the embedding table on a parameter
server and replicate the FC layer across multiple trainers using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
The `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
can be used to perform embedding lookups on the parameter server.
2) Enable hybrid parallelism as described in the `PipeDream <https://arxiv.org/abs/1806.03377>`__ paper.
We can use the `Distributed RPC framework <https://pytorch.org/docs/master/rpc.html>`__
to pipeline stages of the model across multiple workers and replicate each
stage (if needed) using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.

|
In this tutorial we will cover case 1 mentioned above. We have a total of 4
workers in our setup as follows:


1) 1 Master, which is responsible for creating an embedding table
(nn.EmbeddingBag) on the parameter server. The master also drives the
training loop on the two trainers.
2) 1 Parameter Server, which holds the embedding table in memory and
responds to RPCs from the Master and Trainers.
3) 2 Trainers, which store an FC layer (nn.Linear) that is replicated amongst
themselves using `DistributedDataParallel <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
The trainers are also responsible for executing the forward pass, backward
pass and optimizer step.

|
The entire training process is executed as follows:

1) The master creates an embedding table on the Parameter Server and holds an
`RRef <https://pytorch.org/docs/master/rpc.html#rref>`__ to it.
2) The master then kicks off the training loop on the trainers and passes the
embedding table RRef to the trainers.
3) The trainers create a ``HybridModel`` which first performs an embedding lookup
using the embedding table RRef provided by the master and then executes the
FC layer which is wrapped inside DDP.
4) The trainer executes the forward pass of the model and uses the loss to
execute the backward pass using `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__.
5) As part of the backward pass, the gradients for the FC layer are computed
first and synced to all trainers via allreduce in DDP.
6) Next, Distributed Autograd propagates the gradients to the parameter server,
where the gradients for the embedding table are updated.
7) Finally, the `Distributed Optimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__ is used to update all the parameters.


.. attention::

You should always use `Distributed Autograd <https://pytorch.org/docs/master/rpc.html#distributed-autograd-framework>`__
for the backward pass if you're combining DDP and RPC.


Now, let's go through each part in detail. First, we need to set up all of our
workers before we can perform any training. We create 4 processes such that
ranks 0 and 1 are our trainers, rank 2 is the master and rank 3 is the
parameter server.

We initialize the RPC framework on all 4 workers using the TCP init_method.
Once RPC initialization is done, the master creates an `EmbeddingBag <https://pytorch.org/docs/master/generated/torch.nn.EmbeddingBag.html>`__
on the Parameter Server using `rpc.remote <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.remote>`__.
The master then loops through each trainer and kicks off the training loop by
calling ``_run_trainer`` on each trainer using `rpc_async <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.rpc_async>`__.
Finally, the master waits for all training to finish before exiting.

The trainers first initialize a ``ProcessGroup`` for DDP with world_size=2
(for two trainers) using `init_process_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group>`__.
Next, they initialize the RPC framework using the TCP init_method. Note that
the ports are different in RPC initialization and ProcessGroup initialization.
This is to avoid port conflicts between initialization of both frameworks.
Once the initialization is done, the trainers just wait for the ``_run_trainer``
RPC from the master.

The parameter server just initializes the RPC framework and waits for RPCs from
the trainers and master.


.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
:language: py
:start-after: BEGIN run_worker
:end-before: END run_worker
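
The worker setup itself lives in the linked ``main.py`` and is only pulled in
through the ``literalinclude`` directive above. For illustration, a minimal
sketch of such a ``run_worker`` function follows; the TensorPipe backend
options, the port numbers, and the embedding table sizes are assumptions and
may differ from the actual example.

.. code:: python

    import torch.distributed as dist
    import torch.distributed.rpc as rpc
    import torch.multiprocessing as mp
    import torch.nn as nn

    NUM_EMBEDDINGS = 100   # assumed sizes, for illustration only
    EMBEDDING_DIM = 16

    def run_worker(rank, world_size):
        # Ranks 0 and 1 are trainers, rank 2 is the master, rank 3 is the parameter server.
        rpc_backend_options = rpc.TensorPipeRpcBackendOptions(
            init_method="tcp://localhost:29501")

        if rank == 2:
            # Master: create the embedding table on the parameter server and hold an RRef to it.
            rpc.init_rpc("master", rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)
            emb_rref = rpc.remote(
                "ps", nn.EmbeddingBag,
                args=(NUM_EMBEDDINGS, EMBEDDING_DIM), kwargs={"mode": "sum"})

            # Kick off the training loop on both trainers and wait for them to finish.
            # _run_trainer is sketched further below.
            futs = [rpc.rpc_async("trainer{}".format(r), _run_trainer, args=(emb_rref, r))
                    for r in [0, 1]]
            for fut in futs:
                fut.wait()
        elif rank <= 1:
            # Trainers: a separate ProcessGroup for DDP (note the different port),
            # then RPC initialization. They simply wait for the _run_trainer RPC.
            dist.init_process_group(
                backend="gloo", rank=rank, world_size=2,
                init_method="tcp://localhost:29500")
            rpc.init_rpc("trainer{}".format(rank), rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)
        else:
            # Parameter server: only initializes RPC and serves requests.
            rpc.init_rpc("ps", rank=rank, world_size=world_size,
                         rpc_backend_options=rpc_backend_options)

        # Block until all RPC activity across all workers has finished.
        rpc.shutdown()

    if __name__ == "__main__":
        mp.spawn(run_worker, args=(4,), nprocs=4, join=True)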

Before we discuss details of the Trainer, let's introduce the ``HybridModel`` that
the trainer uses. As described below, the ``HybridModel`` is initialized using an
RRef to the embedding table (emb_rref) on the parameter server and the ``device``
to use for DDP. The initialization of the model wraps an
`nn.Linear <https://pytorch.org/docs/master/generated/torch.nn.Linear.html>`__
layer inside DDP to replicate and synchronize this layer across all trainers.

The forward method of the model is pretty straightforward. It performs an
embedding lookup on the parameter server using an
`RRef helper <https://pytorch.org/docs/master/rpc.html#torch.distributed.rpc.RRef.rpc_sync>`__
and passes its output onto the FC layer.


.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
:language: py
:start-after: BEGIN hybrid_model
:end-before: END hybrid_model
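
As above, the real ``HybridModel`` comes from ``main.py``. A minimal sketch of
what it could look like is shown below, assuming CPU training with the gloo
backend and the 16-dimensional embeddings and 8 output classes used in the
previous sketch.

.. code:: python

    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    class HybridModel(nn.Module):
        """
        Remote EmbeddingBag (held via emb_rref on the parameter server)
        followed by a local FC layer replicated across trainers with DDP.
        """
        def __init__(self, emb_rref, device):
            super().__init__()
            self.emb_rref = emb_rref
            self.device = device
            # Only the local FC layer is wrapped in DDP; the embedding table
            # stays on the parameter server and is reached via RPC.
            self.fc = DDP(nn.Linear(16, 8).to(device))

        def forward(self, indices, offsets):
            # Embedding lookup runs on the parameter server via the RRef helper,
            # then the result is fed through the DDP-wrapped FC layer locally.
            emb_lookup = self.emb_rref.rpc_sync().forward(indices, offsets)
            return self.fc(emb_lookup.to(self.device))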

Next, let's look at the setup on the Trainer. The trainer first creates the
``HybridModel`` described above using an RRef to the embedding table on the
parameter server and its own rank.

Now, we need to retrieve a list of RRefs to all the parameters that we would
like to optimize with `DistributedOptimizer <https://pytorch.org/docs/master/rpc.html#module-torch.distributed.optim>`__.
To retrieve the parameters for the embedding table from the parameter server,
we define a simple helper function ``_retrieve_embedding_parameters``, which
walks through all the parameters for the embedding table and returns
a list of RRefs. The trainer calls this method on the parameter server via RPC
to receive a list of RRefs to the desired parameters. Since the
DistributedOptimizer always takes a list of RRefs to parameters that need to
be optimized, we need to create RRefs even for the local parameters for our
FC layers. This is done by walking ``model.parameters()``, creating an RRef for
each parameter and appending it to a list. Note that ``model.parameters()`` only
returns local parameters and doesn't include ``emb_rref``.

Finally, we create our DistributedOptimizer using all the RRefs and define a
CrossEntropyLoss function.

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
:language: py
:start-after: BEGIN setup_trainer
:end-before: END setup_trainer
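
A sketch of this trainer setup might look like the following.
``_run_trainer`` and ``_retrieve_embedding_parameters`` are named in the text
above; the SGD learning rate and the ``_train_loop`` helper are assumptions
for illustration.

.. code:: python

    import torch.nn as nn
    import torch.optim as optim
    import torch.distributed.rpc as rpc
    from torch.distributed.optim import DistributedOptimizer
    from torch.distributed.rpc import RRef

    def _retrieve_embedding_parameters(emb_rref):
        # Runs on the parameter server: wrap each embedding parameter in an RRef.
        return [RRef(p) for p in emb_rref.local_value().parameters()]

    def _run_trainer(emb_rref, rank):
        model = HybridModel(emb_rref, device="cpu")

        # RRefs to the remote embedding parameters, fetched from the parameter server...
        model_parameter_rrefs = rpc.rpc_sync(
            "ps", _retrieve_embedding_parameters, args=(emb_rref,))
        # ...plus RRefs for the local FC parameters (model.parameters() does not
        # include emb_rref, only the DDP-wrapped local layer).
        for param in model.parameters():
            model_parameter_rrefs.append(RRef(param))

        # One DistributedOptimizer over both remote and local parameters.
        opt = DistributedOptimizer(optim.SGD, model_parameter_rrefs, lr=0.05)
        criterion = nn.CrossEntropyLoss()

        # The per-batch training loop is sketched in the next block.
        _train_loop(model, opt, criterion, rank)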

Now we're ready to introduce the main training loop that is run on each trainer.
``get_next_batch`` is just a helper function to generate random inputs and
targets for training. We run the training loop for multiple epochs and for each
batch:

1) Set up a `Distributed Autograd Context <https://pytorch.org/docs/master/rpc.html#torch.distributed.autograd.context>`__
for Distributed Autograd.
2) Run the forward pass of the model and retrieve its output.
3) Compute the loss based on our outputs and targets using the loss function.
4) Use Distributed Autograd to execute a distributed backward pass using the loss.
5) Finally, run a Distributed Optimizer step to optimize all the parameters.

.. literalinclude:: ../advanced_source/rpc_ddp_tutorial/main.py
:language: py
:start-after: BEGIN run_trainer
:end-before: END run_trainer
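
To make the five steps above concrete, here is a sketch of such a training
loop. The shapes produced by ``get_next_batch`` are assumptions chosen to
match the embedding table and FC layer dimensions used in the earlier
sketches.

.. code:: python

    import torch
    import torch.distributed.autograd as dist_autograd

    def get_next_batch(rank):
        # Hypothetical helper: random EmbeddingBag indices/offsets and random targets.
        for _ in range(10):
            indices = torch.randint(0, 100, (32,))   # 100 = assumed number of embeddings
            offsets = torch.arange(0, 32, 4)         # 8 bags of 4 indices each
            target = torch.randint(0, 8, (8,))       # 8 = assumed number of classes
            yield indices, offsets, target

    def _train_loop(model, opt, criterion, rank, epochs=2):
        for epoch in range(epochs):
            for indices, offsets, target in get_next_batch(rank):
                # 1) Every forward/backward pair runs inside a distributed autograd context.
                with dist_autograd.context() as context_id:
                    # 2) Forward pass: remote embedding lookup + local DDP FC layer.
                    output = model(indices, offsets)
                    # 3) Loss for this batch.
                    loss = criterion(output, target)
                    # 4) Distributed backward pass: DDP allreduces the FC gradients,
                    #    distributed autograd propagates the embedding gradients
                    #    to the parameter server.
                    dist_autograd.backward(context_id, [loss])
                    # 5) Distributed optimizer step updates local and remote parameters.
                    opt.step(context_id)
            print("Trainer {} finished epoch {}".format(rank, epoch))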

Source code for the entire example can be found `here <https://github.com/pytorch/examples/tree/master/distributed/rpc/ddp_rpc>`__.