Introduction to Distributed Pipeline Parallelism
================================================
**Authors**: `Howard Huang <https://github.com/H-Huang>`_

.. note::
   |edit| View and edit this tutorial in `github <https://github.com/pytorch/tutorials/blob/main/intermediate_source/pipelining_tutorial.rst>`__.

This tutorial uses a GPT-style transformer model to demonstrate implementing distributed
pipeline parallelism with `torch.distributed.pipelining <https://pytorch.org/docs/main/distributed.pipelining.html>`__
APIs.
| 11 | + |
| 12 | +.. grid:: 2 |
| 13 | + |
| 14 | + .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn |
| 15 | + |
| 16 | + * How to use ``torch.distributed.pipelining`` APIs |
| 17 | + * How to apply pipeline parallelism to a transformer model |
| 18 | + * How to utilize different schedules on a set of microbatches |
| 19 | + |
| 20 | + |
| 21 | + .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites |
| 22 | + |
| 23 | + * Familiarity with `basic distributed training <https://pytorch.org/tutorials/beginner/dist_overview.html>`__ in PyTorch |
| 24 | + |
| 25 | +Setup |
| 26 | +----- |
| 27 | + |
| 28 | +With ``torch.distributed.pipelining`` we will be partitioning the execution of a model and scheduling computation on micro-batches. We will be using a simplified version |
| 29 | +of a transformer decoder model. The model architecture is for educational purposes and has multiple transformer decoder layers as we want to demonstrate how to split the model into different |
| 30 | +chunks. First, let us define the model: |
| 31 | + |
| 32 | +.. code:: python |
| 33 | +
|
| 34 | + import torch |
| 35 | + import torch.nn as nn |
| 36 | + from dataclasses import dataclass |
| 37 | +
|
| 38 | + @dataclass |
| 39 | + class ModelArgs: |
| 40 | + dim: int = 512 |
| 41 | + n_layers: int = 8 |
| 42 | + n_heads: int = 8 |
| 43 | + vocab_size: int = 10000 |
| 44 | +
|
| 45 | + class Transformer(nn.Module): |
| 46 | + def __init__(self, model_args: ModelArgs): |
| 47 | + super().__init__() |
| 48 | +
|
| 49 | + self.tok_embeddings = nn.Embedding(model_args.vocab_size, model_args.dim) |
| 50 | +
|
| 51 | + # Using a ModuleDict lets us delete layers witout affecting names, |
| 52 | + # ensuring checkpoints will correctly save and load. |
| 53 | + self.layers = torch.nn.ModuleDict() |
| 54 | + for layer_id in range(model_args.n_layers): |
| 55 | + self.layers[str(layer_id)] = nn.TransformerDecoderLayer(model_args.dim, model_args.n_heads) |
| 56 | +
|
| 57 | + self.norm = nn.LayerNorm(model_args.dim) |
| 58 | + self.output = nn.Linear(model_args.dim, model_args.vocab_size) |
| 59 | +
|
| 60 | + def forward(self, tokens: torch.Tensor): |
| 61 | + # Handling layers being 'None' at runtime enables easy pipeline splitting |
| 62 | + h = self.tok_embeddings(tokens) if self.tok_embeddings else tokens |
| 63 | +
|
| 64 | + for layer in self.layers.values(): |
| 65 | + h = layer(h, h) |
| 66 | +
|
| 67 | + h = self.norm(h) if self.norm else h |
| 68 | + output = self.output(h).float() if self.output else h |
| 69 | + return output |
| 70 | +
|
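Before moving on, it can help to sanity-check the unpartitioned model on dummy data. The following smoke test is just a sketch; the batch size and sequence length are arbitrary:

.. code:: python

    # Sanity check: run the full model once before any pipeline splitting.
    args = ModelArgs()
    model = Transformer(args)
    tokens = torch.randint(0, args.vocab_size, (2, 16))  # (batch, seq_len)
    logits = model(tokens)
    print(logits.shape)  # torch.Size([2, 16, 10000]) -- one logit per token per vocab entry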

Then, we need to import the necessary libraries in our script and initialize the distributed training process. In this case, we are defining some global variables to use
later in the script:

.. code:: python

    import os
    import torch.distributed as dist
    from torch.distributed.pipelining import pipeline, SplitPoint, PipelineStage, ScheduleGPipe

    global rank, device, pp_group, stage_index, num_stages
    def init_distributed():
        global rank, device, pp_group, stage_index, num_stages
        rank = int(os.environ["LOCAL_RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        device = torch.device(f"cuda:{rank}") if torch.cuda.is_available() else torch.device("cpu")
        dist.init_process_group()

        # This group can be a sub-group in the N-D parallel case
        pp_group = dist.new_group()
        stage_index = rank
        num_stages = world_size

The ``rank``, ``world_size``, and ``init_process_group()`` code should seem familiar to you, as they are commonly used in
all distributed programs. The globals specific to pipeline parallelism include ``pp_group``, the process
group that will be used for send/recv communications; ``stage_index``, the index of this rank's pipeline stage (in this example there is a single rank
per stage, so the index is equal to the rank); and ``num_stages``, which is equal to ``world_size``.

The ``num_stages`` is used to set the number of stages that will be used in the pipeline parallelism schedule. For example,
for ``num_stages=4``, a microbatch will need to go through 4 forwards and 4 backwards before it is completed. The ``stage_index``
is necessary for the framework to know how to communicate between stages. For example, the first stage (``stage_index=0``)
takes its data from the dataloader and does not need to receive data from any previous peer to perform its computation.

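To build intuition for how a microbatch travels through the stages, here is a minimal single-process sketch. It is purely illustrative: the stages are stand-in functions, and in the real schedule each stage lives on its own rank with hand-offs done via send/recv over ``pp_group``:

.. code:: python

    # Illustrative only: model 4 pipeline stages as plain functions on one process.
    num_stages = 4
    stages = [lambda h: h + 1 for _ in range(num_stages)]  # stand-ins for per-stage forwards

    microbatches = [0, 10, 20, 30]
    for mb_index, mb in enumerate(microbatches):
        h = mb
        for stage_forward in stages:  # stage k consumes stage k-1's output
            h = stage_forward(h)
        print(f"microbatch {mb_index}: entered as {mb}, exits the pipeline as {h}")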


Step 1: Partition the Transformer Model
---------------------------------------

There are two different ways of partitioning the model:

The first is manual mode, in which we manually create two instances of the model by deleting portions of its
attributes. In this example, for two stages (2 ranks), the model is cut in half.

.. code:: python

    def manual_model_split(model, example_input_microbatch, model_args) -> PipelineStage:
        if stage_index == 0:
            # prepare the first stage model
            for i in range(4, 8):
                del model.layers[str(i)]
            model.norm = None
            model.output = None
            stage_input_microbatch = example_input_microbatch

        elif stage_index == 1:
            # prepare the second stage model
            for i in range(4):
                del model.layers[str(i)]
            model.tok_embeddings = None
            stage_input_microbatch = torch.randn(example_input_microbatch.shape[0], example_input_microbatch.shape[1], model_args.dim)

        stage = PipelineStage(
            model,
            stage_index,
            num_stages,
            device,
            input_args=stage_input_microbatch,
        )
        return stage

As we can see, the first stage does not have the layer norm or the output layer, and it only includes the first four transformer blocks.
The second stage does not have the input embedding layer, but includes the output layer and the final four transformer blocks. The function
then returns the ``PipelineStage`` for the current rank.

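One way to confirm the split did what we expect is to compare parameter counts. This is a sketch, not part of the tutorial script; it assumes it runs after the split (for example, right after ``manual_model_split`` is called in the main function below), so that ``model`` has already been pruned and ``stage_index`` is set:

.. code:: python

    # Sketch: compare the pruned stage module against a fresh full model.
    def count_params(module: nn.Module) -> int:
        return sum(p.numel() for p in module.parameters())

    full_params = count_params(Transformer(ModelArgs()))
    stage_params = count_params(model)  # `model` was pruned in-place by manual_model_split
    print(f"stage {stage_index}: {stage_params:,} of {full_params:,} parameters")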

The second method is the tracer-based mode, which automatically splits the model based on a ``split_spec`` argument. Using the pipeline specification, we can instruct
``torch.distributed.pipelining`` where to split the model. In the following code block,
we are splitting before ``layers.4``, mirroring the manual split described above. Similarly,
we can retrieve a ``PipelineStage`` by calling ``build_stage`` after this splitting is done.

.. code:: python

    def tracer_model_split(model, example_input_microbatch) -> PipelineStage:
        pipe = pipeline(
            module=model,
            mb_args=(example_input_microbatch,),
            split_spec={
                "layers.4": SplitPoint.BEGINNING,
            }
        )
        stage = pipe.build_stage(stage_index, device, pp_group)
        return stage

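``SplitPoint.BEGINNING`` cuts the model just before the named submodule executes. If it reads more naturally to name the last module of the first stage instead, a split spec along the following lines should be equivalent; this is a sketch, relying on ``SplitPoint.END`` cutting just after the named submodule:

.. code:: python

    # Equivalent split expressed from the other side of the boundary:
    # end stage 0 right after layers.3 instead of beginning stage 1 at layers.4.
    split_spec = {
        "layers.3": SplitPoint.END,
    }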

Step 2: Define The Main Execution
---------------------------------

In the main function we will create a particular pipeline schedule that the stages should follow. ``torch.distributed.pipelining``
supports multiple schedules, including the single-stage-per-rank schedules ``GPipe`` and ``1F1B``,
as well as multiple-stage-per-rank schedules such as ``Interleaved1F1B`` and ``LoopedBFS``.
| 167 | + |
| 168 | +.. code:: python |
| 169 | +
|
| 170 | + if __name__ == "__main__": |
| 171 | + init_distributed() |
| 172 | + num_microbatches = 4 |
| 173 | + model_args = ModelArgs() |
| 174 | + model = Transformer(model_args) |
| 175 | +
|
| 176 | + # Dummy data |
| 177 | + x = torch.ones(32, 500, dtype=torch.long) |
| 178 | + y = torch.randint(0, model_args.vocab_size, (32, 500), dtype=torch.long) |
| 179 | + example_input_microbatch = x.chunk(num_microbatches)[0] |
| 180 | +
|
| 181 | + # Option 1: Manual model splitting |
| 182 | + stage = manual_model_split(model, example_input_microbatch, model_args) |
| 183 | +
|
| 184 | + # Option 2: Tracer model splitting |
| 185 | + # stage = tracer_model_split(model, example_input_microbatch) |
| 186 | +
|
| 187 | + x = x.to(device) |
| 188 | + y = y.to(device) |
| 189 | +
|
| 190 | + def tokenwise_loss_fn(outputs, targets): |
| 191 | + loss_fn = nn.CrossEntropyLoss() |
| 192 | + outputs = outputs.view(-1, model_args.vocab_size) |
| 193 | + targets = targets.view(-1) |
| 194 | + return loss_fn(outputs, targets) |
| 195 | +
|
| 196 | + schedule = ScheduleGPipe(stage, n_microbatches=num_microbatches, loss_fn=tokenwise_loss_fn) |
| 197 | +
|
| 198 | + if rank == 0: |
| 199 | + schedule.step(x) |
| 200 | + elif rank == 1: |
| 201 | + losses = [] |
| 202 | + output = schedule.step(target=y, losses=losses) |
| 203 | + dist.destroy_process_group() |
| 204 | +
|
In the example above, we are using the manual method to split the model; to try the tracer-based model splitting
function instead, uncomment the line for option 2. In our schedule, we need to pass in the number of microbatches and
the loss function used to evaluate the targets.

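The schedule class is the only thing that needs to change to try a different pipeline schedule. For instance, swapping GPipe for 1F1B might look like the following sketch, reusing the same stage, microbatch count, and loss function:

.. code:: python

    from torch.distributed.pipelining import Schedule1F1B

    # Drop-in replacement for ScheduleGPipe: 1F1B interleaves one forward and one
    # backward per stage, which reduces peak activation memory.
    schedule = Schedule1F1B(stage, n_microbatches=num_microbatches, loss_fn=tokenwise_loss_fn)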

The ``.step()`` function processes the entire minibatch and automatically splits it into microbatches based
on the ``n_microbatches`` value passed previously. The microbatches are then operated on according to the schedule class.
In the example above, we are using GPipe, which follows a simple all-forwards and then all-backwards schedule. The output
returned from rank 1 will be the same as if the model were on a single GPU and run with the entire batch. Similarly,
we can pass in a ``losses`` container to store the corresponding losses for each microbatch.

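For logging, the per-microbatch losses collected on the last stage can be reduced to a single scalar. A small sketch, assuming ``losses`` was populated by ``schedule.step`` on rank 1:

.. code:: python

    # On the last stage, `losses` holds one scalar tensor per microbatch.
    if rank == 1:
        mean_loss = torch.stack(losses).mean()
        print(f"mean loss over {len(losses)} microbatches: {mean_loss.item():.4f}")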

Step 3: Launch the Distributed Processes
----------------------------------------

Finally, we are ready to run the script. We will use ``torchrun`` to create a single-host, 2-process job.
Our script is already written so that rank 0 performs the required logic for pipeline stage 0, and rank 1
performs the logic for pipeline stage 1.

``torchrun --nnodes 1 --nproc_per_node 2 pipelining_tutorial.py``

Conclusion
----------

In this tutorial, we have learned how to implement distributed pipeline parallelism using PyTorch's ``torch.distributed.pipelining`` APIs.
We explored setting up the environment, defining a transformer model, and partitioning it for distributed training.
We discussed two methods of model partitioning, manual and tracer-based, and demonstrated how to schedule computations on
micro-batches across different stages. Finally, we covered the execution of the pipeline schedule and the launch of distributed
processes using ``torchrun``.

For production-ready usage of pipeline parallelism, as well as composition with other distributed techniques, see also the
`TorchTitan end-to-end example of 3D parallelism <https://github.com/pytorch/torchtitan>`__.