From 8c32c4dfda3e19162dc85ef9f8fa2509c5d58ad1 Mon Sep 17 00:00:00 2001
From: Jessica Lin
Date: Tue, 21 Jul 2020 10:57:23 -0700
Subject: [PATCH] Make RPC profiling recipe into prototype tutorial

---
 .../distributed_rpc_profiling.rst             | 70 +++++++++----------
 recipes_source/recipes_index.rst              |  7 --
 2 files changed, 35 insertions(+), 42 deletions(-)
 rename {recipes_source => prototype_source}/distributed_rpc_profiling.rst (88%)

diff --git a/recipes_source/distributed_rpc_profiling.rst b/prototype_source/distributed_rpc_profiling.rst
similarity index 88%
rename from recipes_source/distributed_rpc_profiling.rst
rename to prototype_source/distributed_rpc_profiling.rst
index d43c3a0e217..6af0fd883e9 100644
--- a/recipes_source/distributed_rpc_profiling.rst
+++ b/prototype_source/distributed_rpc_profiling.rst
@@ -19,10 +19,10 @@ What is the Distributed RPC Framework?
 ---------------------------------------
 
 The **Distributed RPC Framework** provides mechanisms for multi-machine model
-training through a set of primitives to allow for remote communication, and a 
+training through a set of primitives to allow for remote communication, and a
 higher-level API to automatically differentiate models split across several
 machines. For this recipe, it would be helpful to be familiar with the `Distributed RPC Framework`_
-as well as the `RPC Tutorials`_. 
+as well as the `RPC Tutorials`_.
 
 What is the PyTorch Profiler?
 ---------------------------------------
@@ -97,7 +97,7 @@ Running the above program should present you with the following output:
   DEBUG:root:Rank 1 shutdown RPC
   DEBUG:root:Rank 0 shutdown RPC
 
-Now that we have a skeleton setup of our RPC framework, we can move on to 
+Now that we have a skeleton setup of our RPC framework, we can move on to
 sending RPCs back and forth and using the profiler to obtain a view of what's
 happening under the hood. Let's add to the above ``worker`` function:
 
@@ -108,7 +108,7 @@ happening under the hood. Let's add to the above ``worker`` function:
     if rank == 0:
         dst_worker_rank = (rank + 1) % world_size
         dst_worker_name = f"worker{dst_worker_rank}"
-        t1, t2 = random_tensor(), random_tensor() 
+        t1, t2 = random_tensor(), random_tensor()
         # Send and wait RPC completion under profiling scope.
         with profiler.profile() as prof:
             fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
@@ -119,37 +119,37 @@ happening under the hood. Let's add to the above ``worker`` function:
 
     print(prof.key_averages().table())
 
-The aformented code creates 2 RPCs, specifying ``torch.add`` and ``torch.mul``, respectively, 
-to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API, 
+The aforementioned code creates 2 RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
+to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API,
 we are returned a ``torch.futures.Future`` object, which must be awaited
 for the result of the computation. Note that this wait must take place within
 the scope created by the profiling context manager in order for the RPC to be
 accurately profiled.
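+The ``random_tensor`` helper used above simply needs to produce the input
+tensors for the RPCs; a minimal sketch of such a helper (the 3x3 shape and the
+``requires_grad`` setting here are illustrative assumptions) could be:
+
+::
+
+    def random_tensor():
+        # Hypothetical helper: returns a small random tensor to use as RPC input.
+        return torch.rand((3, 3), requires_grad=True)
+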
 
 Running the code with this new worker function should result in the following output:
 
-:: 
+::
 
   # Some columns are omitted for brevity, exact output subject to randomness
-  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------- 
-  Name                                                              Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID 
-  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------- 
-  rpc_async#aten::add(worker0 -> worker1)                           0.00%             0.000us          0                20.462ms         20.462ms         1                0 
-  rpc_async#aten::mul(worker0 -> worker1)                           0.00%             0.000us          0                5.712ms          5.712ms          1                0 
-  rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul            1.84%             206.864us        2.69%            302.162us        151.081us        2                1 
-  rpc_async#aten::add(worker0 -> worker1)#remote_op: add            1.41%             158.501us        1.57%            176.924us        176.924us        1                1 
-  rpc_async#aten::mul(worker0 -> worker1)#remote_op: output_nr      0.04%             4.980us          0.04%            4.980us          2.490us          2                1 
-  rpc_async#aten::mul(worker0 -> worker1)#remote_op: is_leaf        0.07%             7.806us          0.07%            7.806us          1.952us          4                1 
-  rpc_async#aten::add(worker0 -> worker1)#remote_op: empty          0.16%             18.423us         0.16%            18.423us         18.423us         1                1 
-  rpc_async#aten::mul(worker0 -> worker1)#remote_op: empty          0.14%             15.712us         0.14%            15.712us         15.712us         1                1 
-  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------- 
+  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
+  Name                                                              Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
+  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
+  rpc_async#aten::add(worker0 -> worker1)                           0.00%             0.000us          0                20.462ms         20.462ms         1                0
+  rpc_async#aten::mul(worker0 -> worker1)                           0.00%             0.000us          0                5.712ms          5.712ms          1                0
+  rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul            1.84%             206.864us        2.69%            302.162us        151.081us        2                1
+  rpc_async#aten::add(worker0 -> worker1)#remote_op: add            1.41%             158.501us        1.57%            176.924us        176.924us        1                1
+  rpc_async#aten::mul(worker0 -> worker1)#remote_op: output_nr      0.04%             4.980us          0.04%            4.980us          2.490us          2                1
+  rpc_async#aten::mul(worker0 -> worker1)#remote_op: is_leaf        0.07%             7.806us          0.07%            7.806us          1.952us          4                1
+  rpc_async#aten::add(worker0 -> worker1)#remote_op: empty          0.16%             18.423us         0.16%            18.423us         18.423us         1                1
+  rpc_async#aten::mul(worker0 -> worker1)#remote_op: empty          0.14%             15.712us         0.14%            15.712us         15.712us         1                1
+  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
   Self CPU time total: 11.237ms
 
 Here we can see that the profiler has profiled our ``rpc_async`` calls made to ``worker1``
 from ``worker0``. In particular, the first 2 entries in the table show details (such as the
 operator name, originating worker, and destination worker) about each RPC call made
-and the ``CPU total`` column indicates the end-to-end latency of the RPC call. 
-We also have visibility into the actual operators invoked remotely on worker 1 due RPC.
+and the ``CPU total`` column indicates the end-to-end latency of the RPC call.
+We also have visibility into the actual operators invoked remotely on worker 1 due to RPC.
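+If the table becomes hard to read for larger workloads, note that ``key_averages().table()``
+accepts ``sort_by`` and ``row_limit`` arguments in recent PyTorch releases (check the profiler
+documentation for your version) to surface the most expensive events first, for example:
+
+::
+
+    # Show only the ten events with the largest total CPU time.
+    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
+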
-We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For 
+We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For
 example, we can interpret the row with name
 ``rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul`` as a ``mul`` operation taking place
 on the remote node, as a result of the RPC sent to ``worker1`` from ``worker0``, specifying ``worker1``
 to run the builtin ``mul`` operator on the input tensors.
@@ -157,7 +157,7 @@ Note that names of remote operations are prefixed with the name of the RPC event
 in them. For example, remote operations corresponding to the
 ``rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))`` call are prefixed with
-``rpc_async#aten::mul(worker0 -> worker1)``.
+``rpc_async#aten::add(worker0 -> worker1)``.
-We can also use the profiler gain insight into user-defined functions that are executed over RPC. 
+We can also use the profiler to gain insight into user-defined functions that are executed over RPC.
 For example, let's add the following to the above ``worker`` function:
 
 ::
 
@@ -184,19 +184,19 @@ run our user-defined function. Running this code should result in the following output:
 
 ::
 
   # Exact output subject to randomness
-  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------- 
-  Name                                                                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID 
-  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------- 
-  rpc_async#udf_with_ops(worker0 -> worker1)                            0.00%             0.000us          0                1.008s           1.008s           1                0 
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: rand            12.58%            80.037us         47.09%           299.589us        149.795us        2                1 
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: empty           15.40%            98.013us         15.40%           98.013us         24.503us         4                1 
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: uniform_        22.85%            145.358us        23.87%           151.870us        75.935us         2                1 
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_complex      1.02%             6.512us          1.02%            6.512us          3.256us          2                1 
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: add             25.80%            164.179us        28.43%           180.867us        180.867us        1                1 
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: mul             20.48%            130.293us        31.43%           199.949us        99.975us         2                1 
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: output_nr       0.71%             4.506us          0.71%            4.506us          2.253us          2                1 
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_leaf         1.16%             7.367us          1.16%            7.367us          1.842us          4                1 
-  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  --------------- 
+  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
+  Name                                                                  Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Node ID
+  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
+  rpc_async#udf_with_ops(worker0 -> worker1)                            0.00%             0.000us          0                1.008s           1.008s           1                0
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: rand            12.58%            80.037us         47.09%           299.589us        149.795us        2                1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: empty           15.40%            98.013us         15.40%           98.013us         24.503us         4                1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: uniform_        22.85%            145.358us        23.87%           151.870us        75.935us         2                1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_complex      1.02%             6.512us          1.02%            6.512us          3.256us          2                1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: add             25.80%            164.179us        28.43%           180.867us        180.867us        1                1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: mul             20.48%            130.293us        31.43%           199.949us        99.975us         2                1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: output_nr       0.71%             4.506us          0.71%            4.506us          2.253us          2                1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_leaf         1.16%             7.367us          1.16%            7.367us          1.842us          4                1
+  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
 
 Here we can see that the user-defined function has successfully been profiled with its name
 ``(rpc_async#udf_with_ops(worker0 -> worker1))``, and has the CPU total time we would roughly expect
diff --git a/recipes_source/recipes_index.rst b/recipes_source/recipes_index.rst
index e25cb199aa5..92059e9afbe 100644
--- a/recipes_source/recipes_index.rst
+++ b/recipes_source/recipes_index.rst
@@ -102,13 +102,6 @@ Recipes are bite-sized, actionable examples of how to use specific Py
    :link: ../recipes/recipes/profiler.html
    :tags: Basics
 
-.. customcarditem::
-   :header: Distributed RPC Profiling
-   :card_description: Learn how to use PyTorch's profiler in conjunction with the Distributed RPC Framework.
-   :image: ../_static/img/thumbnails/cropped/profiler.png
-   :link: ../recipes/distributed_rpc_profiling.html
-   :tags: Basics
-
 .. Customization
 
 .. customcarditem::
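
The ``udf_with_ops`` function profiled above is not shown in full here; a purely
hypothetical sketch that would be consistent with the profile output (a deliberate
one-second delay plus a few tensor operations) might look like:

::

    import time

    import torch


    def udf_with_ops():
        # Hypothetical reconstruction for illustration only: sleep to account for
        # the ~1s end-to-end CPU total, then run a few ops that would surface as
        # remote rand/add/mul entries in the profiler table.
        time.sleep(1)
        t1, t2 = torch.rand((3, 3)), torch.rand((3, 3))
        torch.add(t1, t2)
        torch.mul(t1, t2)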