
Commit f9f3088

Jessica Lin authored
Make RPC profiling recipe into prototype tutorial (#1078)
1 parent bfb3bb5 commit f9f3088

File tree

2 files changed: +35, -42 lines changed


recipes_source/distributed_rpc_profiling.rst renamed to prototype_source/distributed_rpc_profiling.rst

Lines changed: 35 additions & 35 deletions
@@ -19,10 +19,10 @@ What is the Distributed RPC Framework?
 ---------------------------------------
 
 The **Distributed RPC Framework** provides mechanisms for multi-machine model
-training through a set of primitives to allow for remote communication, and a
+training through a set of primitives to allow for remote communication, and a
 higher-level API to automatically differentiate models split across several machines.
 For this recipe, it would be helpful to be familiar with the `Distributed RPC Framework`_
-as well as the `RPC Tutorials`_.
+as well as the `RPC Tutorials`_.
 
 What is the PyTorch Profiler?
 ---------------------------------------
@@ -97,7 +97,7 @@ Running the above program should present you with the following output:
   DEBUG:root:Rank 1 shutdown RPC
   DEBUG:root:Rank 0 shutdown RPC
 
-Now that we have a skeleton setup of our RPC framework, we can move on to
+Now that we have a skeleton setup of our RPC framework, we can move on to
 sending RPCs back and forth and using the profiler to obtain a view of what's
 happening under the hood. Let's add to the above ``worker`` function:
 
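The skeleton ``worker`` function itself is unchanged by this commit and does not appear in these hunks. A minimal sketch consistent with the DEBUG output above (the function name, argument names, rendezvous settings, and logging calls are assumptions rather than lines taken from the tutorial) looks roughly like this:

::

  # Hypothetical sketch only; names and structure are inferred from the
  # surrounding text and the DEBUG output, not part of this commit.
  import logging
  import os

  import torch.distributed.rpc as rpc
  import torch.multiprocessing as mp

  logging.basicConfig(level=logging.DEBUG)

  def worker(rank, world_size):
      os.environ["MASTER_ADDR"] = "localhost"   # assumed rendezvous settings
      os.environ["MASTER_PORT"] = "29500"
      rpc.init_rpc(name=f"worker{rank}", rank=rank, world_size=world_size)
      logging.debug(f"Rank {rank} initialized RPC")
      # RPCs are issued from here in the snippets below.
      rpc.shutdown()
      logging.debug(f"Rank {rank} shutdown RPC")

  if __name__ == "__main__":
      world_size = 2
      mp.spawn(worker, args=(world_size,), nprocs=world_size)

Run with two processes, a setup of this shape produces the shutdown messages shown in the hunk above.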
@@ -108,7 +108,7 @@ happening under the hood. Let's add to the above ``worker`` function:
   if rank == 0:
       dst_worker_rank = (rank + 1) % world_size
       dst_worker_name = f"worker{dst_worker_rank}"
-      t1, t2 = random_tensor(), random_tensor()
+      t1, t2 = random_tensor(), random_tensor()
       # Send and wait RPC completion under profiling scope.
       with profiler.profile() as prof:
           fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
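The hunk cuts off inside the ``with`` block. As a point of reference, a plausible completion of that snippet (the second ``rpc_async`` call, the waits, and the helper name ``profile_two_rpcs`` are inferred from the surrounding text and the table below, not taken from this diff) is:

::

  # Sketch of how the snippet above plausibly continues; only the profiling
  # pattern matters here, the exact lines are not part of this diff.
  import torch
  import torch.autograd.profiler as profiler
  import torch.distributed.rpc as rpc

  def profile_two_rpcs(dst_worker_name, t1, t2):
      with profiler.profile() as prof:
          fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
          fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
          # Waiting inside the profiling scope lets the profiler record the
          # full end-to-end latency of each RPC.
          fut1.wait()
          fut2.wait()
      # Aggregate the recorded events, including ops run remotely on the
      # destination worker.
      print(prof.key_averages().table())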
@@ -119,45 +119,45 @@ happening under the hood. Let's add to the above ``worker`` function:
 
       print(prof.key_averages().table())
 
-The aforementioned code creates 2 RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
-to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API,
+The aforementioned code creates 2 RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
+to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API,
 we are returned a ``torch.futures.Future`` object, which must be awaited for the result
 of the computation. Note that this wait must take place within the scope created by
 the profiling context manager in order for the RPC to be accurately profiled. Running
 the code with this new worker function should result in the following output:
 
-::
+::
 
   # Some columns are omitted for brevity, exact output subject to randomness
-  ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
-  Name  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
-  ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
-  rpc_async#aten::add(worker0 -> worker1)  0.00%  0.000us  0  20.462ms  20.462ms  1  0
-  rpc_async#aten::mul(worker0 -> worker1)  0.00%  0.000us  0  5.712ms  5.712ms  1  0
-  rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul  1.84%  206.864us  2.69%  302.162us  151.081us  2  1
-  rpc_async#aten::add(worker0 -> worker1)#remote_op: add  1.41%  158.501us  1.57%  176.924us  176.924us  1  1
-  rpc_async#aten::mul(worker0 -> worker1)#remote_op: output_nr  0.04%  4.980us  0.04%  4.980us  2.490us  2  1
-  rpc_async#aten::mul(worker0 -> worker1)#remote_op: is_leaf  0.07%  7.806us  0.07%  7.806us  1.952us  4  1
-  rpc_async#aten::add(worker0 -> worker1)#remote_op: empty  0.16%  18.423us  0.16%  18.423us  18.423us  1  1
-  rpc_async#aten::mul(worker0 -> worker1)#remote_op: empty  0.14%  15.712us  0.14%  15.712us  15.712us  1  1
-  ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
+  ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
+  Name  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
+  ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
+  rpc_async#aten::add(worker0 -> worker1)  0.00%  0.000us  0  20.462ms  20.462ms  1  0
+  rpc_async#aten::mul(worker0 -> worker1)  0.00%  0.000us  0  5.712ms  5.712ms  1  0
+  rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul  1.84%  206.864us  2.69%  302.162us  151.081us  2  1
+  rpc_async#aten::add(worker0 -> worker1)#remote_op: add  1.41%  158.501us  1.57%  176.924us  176.924us  1  1
+  rpc_async#aten::mul(worker0 -> worker1)#remote_op: output_nr  0.04%  4.980us  0.04%  4.980us  2.490us  2  1
+  rpc_async#aten::mul(worker0 -> worker1)#remote_op: is_leaf  0.07%  7.806us  0.07%  7.806us  1.952us  4  1
+  rpc_async#aten::add(worker0 -> worker1)#remote_op: empty  0.16%  18.423us  0.16%  18.423us  18.423us  1  1
+  rpc_async#aten::mul(worker0 -> worker1)#remote_op: empty  0.14%  15.712us  0.14%  15.712us  15.712us  1  1
+  ---------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
   Self CPU time total: 11.237ms
 
 Here we can see that the profiler has profiled our ``rpc_async`` calls made to ``worker1``
 from ``worker0``. In particular, the first 2 entries in the table show details (such as
 the operator name, originating worker, and destination worker) about each RPC call made
-and the ``CPU total`` column indicates the end-to-end latency of the RPC call.
+and the ``CPU total`` column indicates the end-to-end latency of the RPC call.
 
 We also have visibility into the actual operators invoked remotely on worker 1 due to RPC.
-We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For
+We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For
 example, we can interpret the row with name ``rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul``
 as a ``mul`` operation taking place on the remote node, as a result of the RPC sent to ``worker1``
 from ``worker0``, specifying ``worker1`` to run the builtin ``mul`` operator on the input tensors.
 Note that names of remote operations are prefixed with the name of the RPC event that resulted
 in them. For example, remote operations corresponding to the ``rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))``
 call are prefixed with ``rpc_async#aten::add(worker0 -> worker1)``.
 
-We can also use the profiler to gain insight into user-defined functions that are executed over RPC.
+We can also use the profiler to gain insight into user-defined functions that are executed over RPC.
 For example, let's add the following to the above ``worker`` function:
 
 ::
@@ -184,19 +184,19 @@ run our user-defined function. Running this code should result in the following
 ::
 
   # Exact output subject to randomness
-  -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
-  Name  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
-  -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
-  rpc_async#udf_with_ops(worker0 -> worker1)  0.00%  0.000us  0  1.008s  1.008s  1  0
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: rand  12.58%  80.037us  47.09%  299.589us  149.795us  2  1
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: empty  15.40%  98.013us  15.40%  98.013us  24.503us  4  1
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: uniform_  22.85%  145.358us  23.87%  151.870us  75.935us  2  1
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_complex  1.02%  6.512us  1.02%  6.512us  3.256us  2  1
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: add  25.80%  164.179us  28.43%  180.867us  180.867us  1  1
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: mul  20.48%  130.293us  31.43%  199.949us  99.975us  2  1
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: output_nr  0.71%  4.506us  0.71%  4.506us  2.253us  2  1
-  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_leaf  1.16%  7.367us  1.16%  7.367us  1.842us  4  1
-  -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
+  -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
+  Name  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
+  -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
+  rpc_async#udf_with_ops(worker0 -> worker1)  0.00%  0.000us  0  1.008s  1.008s  1  0
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: rand  12.58%  80.037us  47.09%  299.589us  149.795us  2  1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: empty  15.40%  98.013us  15.40%  98.013us  24.503us  4  1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: uniform_  22.85%  145.358us  23.87%  151.870us  75.935us  2  1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_complex  1.02%  6.512us  1.02%  6.512us  3.256us  2  1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: add  25.80%  164.179us  28.43%  180.867us  180.867us  1  1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: mul  20.48%  130.293us  31.43%  199.949us  99.975us  2  1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: output_nr  0.71%  4.506us  0.71%  4.506us  2.253us  2  1
+  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_leaf  1.16%  7.367us  1.16%  7.367us  1.842us  4  1
+  -------------------------------------------------------------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
 
 Here we can see that the user-defined function has successfully been profiled with its name
 ``(rpc_async#udf_with_ops(worker0 -> worker1))``, and has the CPU total time we would roughly expect
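The definition of ``udf_with_ops`` is not part of this diff. A user-defined function consistent with the kinds of remote operations listed above (``rand``, ``uniform_``, ``add``, ``mul``) and with the roughly one-second end-to-end time could look like the sketch below; the tensor shapes and the sleep are assumptions.

::

  # Hypothetical user-defined function matching the profile above; the real
  # definition used by the tutorial is not shown in this diff.
  import time

  import torch

  def udf_with_ops():
      # A deliberate delay would explain the ~1s CPU total on the rpc_async row;
      # the exact duration is an assumption.
      time.sleep(1)
      t1, t2 = torch.rand(3, 3), torch.rand(3, 3)  # appears as remote_op: rand / uniform_
      torch.add(t1, t2)                            # appears as remote_op: add
      torch.mul(t1, t2)                            # appears as remote_op: mul

From worker 0 it would be profiled the same way as the builtin operators, for example via ``rpc.rpc_async(dst_worker_name, udf_with_ops)`` issued inside the ``profiler.profile()`` scope.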

recipes_source/recipes_index.rst

Lines changed: 0 additions & 7 deletions
@@ -102,13 +102,6 @@ Recipes are bite-sized, actionable examples of how to use specific Py
    :link: ../recipes/recipes/profiler.html
    :tags: Basics
 
-.. customcarditem::
-   :header: Distributed RPC Profiling
-   :card_description: Learn how to use PyTorch's profiler in conjunction with the Distributed RPC Framework.
-   :image: ../_static/img/thumbnails/cropped/profiler.png
-   :link: ../recipes/distributed_rpc_profiling.html
-   :tags: Basics
-
 .. Customization
 
 .. customcarditem::
