intermediate_source/pipelining_tutorial.rst
14 additions & 10 deletions
@@ -67,7 +67,7 @@ chunks. First, let us define the model:
            h = layer(h, h)

         h = self.norm(h) if self.norm else h
-        output = self.output(h).float() if self.output else h
+        output = self.output(h).clone() if self.output else h
         return output

Then, we need to import the necessary libraries in our script and initialize the distributed training process. In this case, we are defining some global variables to use
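As a rough sketch, that initialization typically looks like the following when the script is launched with ``torchrun``; the function and variable names here are illustrative and may differ from the tutorial's actual globals.

    import os
    import torch
    import torch.distributed as dist

    def init_distributed():
        # torchrun sets these environment variables for each spawned process.
        rank = int(os.environ["LOCAL_RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")
        dist.init_process_group()  # reads rank/world size from the environment
        # With one pipeline stage per rank, stage index and stage count follow directly.
        stage_index = rank
        num_stages = world_size
        return rank, world_size, device, stage_index, num_stages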
@@ -109,32 +109,29 @@ Step 1: Partition the Transformer Model
There are two different ways of partitioning the model:

First is the manual mode in which we can manually create two instances of the model by deleting portions of
-attributes of the model. In this example for a 2 stage (2 ranks) the model is cut in half.
+attributes of the model. In this example for two stages (2 ranks), the model is cut in half.
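A minimal sketch of such a manual two-stage split is shown below, assuming ``model.layers`` is a ``ModuleDict`` keyed by stringified layer indices as in the model defined earlier; the helper name and the even split are illustrative, not the tutorial's exact code.

    from torch.distributed.pipelining import PipelineStage

    def manual_model_split(model, stage_index, num_stages, device):
        # Illustrative split: each rank deletes the attributes it does not own,
        # which is why the forward pass guards each attribute with an `if ... else`.
        n_layers = len(model.layers)
        if stage_index == 0:
            # Stage 0 keeps the token embedding and the first half of the layers.
            for i in range(n_layers // 2, n_layers):
                del model.layers[str(i)]
            model.norm = None
            model.output = None
        else:
            # Stage 1 keeps the second half of the layers plus the final norm and output.
            for i in range(n_layers // 2):
                del model.layers[str(i)]
            model.tok_embeddings = None
        return PipelineStage(model, stage_index, num_stages, device)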
@@ -202,6 +200,7 @@ as well as multiple-stage-per-rank schedules such as ``Interleaved1F1B`` and ``L
   elif rank == 1:
      losses = []
      output = schedule.step(target=y, losses=losses)
+     print(f"losses: {losses}")
   dist.destroy_process_group()

In the example above, we are using the manual method to split the model, but the code can be uncommented to also try the
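The tracer-based alternative mentioned here can be sketched roughly as follows; the split point ``"layers.4"`` and the example micro-batch argument are assumptions for illustration rather than the tutorial's exact values.

    from torch.distributed.pipelining import SplitPoint, pipeline

    def tracer_model_split(model, example_input_microbatch, stage_index, device):
        # Instead of deleting attributes by hand, mark where the model should be cut
        # and let the tracer produce the per-stage submodules.
        pipe = pipeline(
            module=model,
            mb_args=(example_input_microbatch,),
            split_spec={"layers.4": SplitPoint.BEGINNING},  # assumed midpoint for 2 stages
        )
        return pipe.build_stage(stage_index, device=device)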
@@ -232,5 +231,10 @@ We discussed two methods of model partitioning, manual and tracer-based, and dem
micro-batches across different stages. Finally, we covered the execution of the pipeline schedule and the launch of distributed
processes using ``torchrun``.

-For a production ready usage of pipeline parallelism as well as composition with other distributed techniques, see also
+Additional Resources
+--------------------
+
+We have successfully integrated ``torch.distributed.pipelining`` into the `torchtitan repository <https://github.com/pytorch/torchtitan>`__. TorchTitan is a clean, minimal code base for
+large-scale LLM training using native PyTorch. For a production ready usage of pipeline
+parallelism as well as composition with other distributed techniques, see
`TorchTitan end to end example of 3D parallelism <https://github.com/pytorch/torchtitan>`__.