What is the Distributed RPC Framework?
---------------------------------------

The **Distributed RPC Framework** provides mechanisms for multi-machine model
training through a set of primitives that allow for remote communication, and a
higher-level API to automatically differentiate models split across several machines.
For this recipe, it would be helpful to be familiar with the `Distributed RPC Framework`_
as well as the `RPC Tutorials`_.
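
As a quick orientation (a minimal sketch rather than the recipe's own code), the core
primitives let one process invoke a function on another process by name:

::

  import torch
  import torch.distributed.rpc as rpc

  # Each participating process joins the RPC group under a unique name
  # (endpoint configuration such as MASTER_ADDR/MASTER_PORT is omitted here).
  rpc.init_rpc("worker0", rank=0, world_size=2)

  # Run torch.add on "worker1" and block until the result comes back;
  # rpc_async is the non-blocking variant used later in this recipe.
  result = rpc.rpc_sync("worker1", torch.add, args=(torch.ones(2), torch.ones(2)))

  rpc.shutdown()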

What is the PyTorch Profiler?
---------------------------------------
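
In this recipe we will use the profiler as a context manager around the code we want to
measure and then inspect the aggregated results. A minimal sketch of that pattern (not the
recipe's own code) looks like:

::

  import torch
  import torch.autograd.profiler as profiler

  x, y = torch.rand(4), torch.rand(4)
  # Record operator-level timing for everything executed inside this scope.
  with profiler.profile() as prof:
      torch.add(x, y)

  # Print a table of aggregated statistics, as done later in this recipe.
  print(prof.key_averages().table())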

Running the above program should present you with the following output:

::

  DEBUG:root:Rank 1 shutdown RPC
  DEBUG:root:Rank 0 shutdown RPC
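
For reference, a skeleton that produces output of this shape might look as follows. This is
a sketch rather than the recipe's exact code; it assumes a two-process run on a single
machine, launched with ``torch.multiprocessing``:

::

  import logging
  import os

  import torch.distributed.rpc as rpc
  import torch.multiprocessing as mp

  logging.basicConfig(level=logging.DEBUG)

  def worker(rank, world_size):
      # Rendezvous configuration for the default env:// init method.
      os.environ["MASTER_ADDR"] = "localhost"
      os.environ["MASTER_PORT"] = "29500"
      # Join the RPC group; worker names follow the "worker{rank}" convention.
      rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
      # ... RPCs will be added here in the next step ...
      rpc.shutdown()
      logging.debug(f"Rank {rank} shutdown RPC")

  if __name__ == "__main__":
      world_size = 2
      mp.spawn(worker, args=(world_size,), nprocs=world_size)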

Now that we have a skeleton setup of our RPC framework, we can move on to
sending RPCs back and forth and using the profiler to obtain a view of what's
happening under the hood. Let's add to the above ``worker`` function:

::

  if rank == 0:
      dst_worker_rank = (rank + 1) % world_size
      dst_worker_name = f"worker{dst_worker_rank}"
      t1, t2 = random_tensor(), random_tensor()
      # Send and wait RPC completion under profiling scope.
      with profiler.profile() as prof:
          fut1 = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
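          # (The next few lines are not shown in this excerpt; as described below,
          # a second RPC specifying torch.mul is sent and both futures are awaited
          # inside the profiling scope -- a reconstruction, not verbatim code.)
          fut2 = rpc.rpc_async(dst_worker_name, torch.mul, args=(t1, t2))
          fut1.wait()
          fut2.wait()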

      print(prof.key_averages().table())

The aforementioned code creates two RPCs, specifying ``torch.add`` and ``torch.mul``, respectively,
to be run with two random input tensors on worker 1. Since we use the ``rpc_async`` API,
we are returned a ``torch.futures.Future`` object, which must be awaited for the result
of the computation. Note that this wait must take place within the scope created by
the profiling context manager in order for the RPC to be accurately profiled.
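
For instance, waiting on the future only after the profiling scope has exited (a sketch of
what *not* to do) means the profiler may not capture the remote work accurately:

::

  with profiler.profile() as prof:
      fut = rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))
  # Too late: the wait happens outside the profiler context.
  fut.wait()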

Running the code with this new worker function should result in the following output:

::

  # Some columns are omitted for brevity, exact output subject to randomness
  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
  Name                                                              Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
  rpc_async#aten::add(worker0 -> worker1)  0.00%  0.000us  0  20.462ms  20.462ms  1  0
  rpc_async#aten::mul(worker0 -> worker1)  0.00%  0.000us  0  5.712ms  5.712ms  1  0
  rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul  1.84%  206.864us  2.69%  302.162us  151.081us  2  1
  rpc_async#aten::add(worker0 -> worker1)#remote_op: add  1.41%  158.501us  1.57%  176.924us  176.924us  1  1
  rpc_async#aten::mul(worker0 -> worker1)#remote_op: output_nr  0.04%  4.980us  0.04%  4.980us  2.490us  2  1
  rpc_async#aten::mul(worker0 -> worker1)#remote_op: is_leaf  0.07%  7.806us  0.07%  7.806us  1.952us  4  1
  rpc_async#aten::add(worker0 -> worker1)#remote_op: empty  0.16%  18.423us  0.16%  18.423us  18.423us  1  1
  rpc_async#aten::mul(worker0 -> worker1)#remote_op: empty  0.14%  15.712us  0.14%  15.712us  15.712us  1  1
  ----------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------

  Self CPU time total: 11.237ms

Here we can see that the profiler has profiled our ``rpc_async`` calls made to ``worker1``
from ``worker0``. In particular, the first two entries in the table show details (such as
the operator name, originating worker, and destination worker) about each RPC call made,
and the ``CPU total`` column indicates the end-to-end latency of the RPC call.

We also have visibility into the actual operators invoked remotely on worker 1 as a result of the RPC.
We can see operations that took place on ``worker1`` by checking the ``Node ID`` column. For
example, we can interpret the row with name ``rpc_async#aten::mul(worker0 -> worker1)#remote_op: mul``
as a ``mul`` operation taking place on the remote node, as a result of the RPC sent to ``worker1``
from ``worker0``, specifying ``worker1`` to run the built-in ``mul`` operator on the input tensors.
Note that names of remote operations are prefixed with the name of the RPC event that resulted
in them. For example, remote operations corresponding to the ``rpc.rpc_async(dst_worker_name, torch.add, args=(t1, t2))``
call are prefixed with ``rpc_async#aten::add(worker0 -> worker1)``.
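
A quick way to separate remote work from local work programmatically is to filter the
recorded events by their node ID rather than reading the table by eye. This is a minimal
sketch, assuming the ``prof`` object from the profiling scope above and that each recorded
event exposes a ``node_id`` attribute:

::

  # Events recorded on the remote node (Node ID 1) rather than locally (Node ID 0).
  remote_events = [evt for evt in prof.function_events if evt.node_id == 1]
  for evt in remote_events:
      # cpu_time_total is reported in microseconds.
      print(f"{evt.name}: {evt.cpu_time_total:.3f}us")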

We can also use the profiler to gain insight into user-defined functions that are executed over RPC.
For example, let's add the following to the above ``worker`` function:

::
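
  # A sketch reconstructed from the profiler output below; the recipe's exact code
  # for this step is not shown in this excerpt. random_tensor() is the helper used
  # earlier, and the one-second sleep is an assumption based on the ~1s CPU total
  # reported for the call.
  def udf_with_ops():
      import time
      time.sleep(1)
      t1, t2 = random_tensor(), random_tensor()
      torch.add(t1, t2)
      torch.mul(t1, t2)

  # As before (rank 0 only), send the RPC and wait on it inside the profiling scope.
  with profiler.profile() as prof:
      fut = rpc.rpc_async(dst_worker_name, udf_with_ops)
      fut.wait()

  print(prof.key_averages().table())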

As before, we use ``rpc_async`` to ask ``worker1`` to run our user-defined function.
Running this code should result in the following output:

::

  # Exact output subject to randomness
  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
  Name                                                                  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  Number of Calls  Node ID
  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
  rpc_async#udf_with_ops(worker0 -> worker1)  0.00%  0.000us  0  1.008s  1.008s  1  0
  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: rand  12.58%  80.037us  47.09%  299.589us  149.795us  2  1
  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: empty  15.40%  98.013us  15.40%  98.013us  24.503us  4  1
  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: uniform_  22.85%  145.358us  23.87%  151.870us  75.935us  2  1
  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_complex  1.02%  6.512us  1.02%  6.512us  3.256us  2  1
  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: add  25.80%  164.179us  28.43%  180.867us  180.867us  1  1
  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: mul  20.48%  130.293us  31.43%  199.949us  99.975us  2  1
  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: output_nr  0.71%  4.506us  0.71%  4.506us  2.253us  2  1
  rpc_async#udf_with_ops(worker0 -> worker1)#remote_op: is_leaf  1.16%  7.367us  1.16%  7.367us  1.842us  4  1
  --------------------------------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------

Here we can see that the user-defined function has successfully been profiled with its name
``rpc_async#udf_with_ops(worker0 -> worker1)``, and has the CPU total time we would roughly expect.