*************************************************************

**Author**: `Shen Li <https://mrshenli.github.io/>`_

Data parallel and model parallel are widely-used techniques in distributed
training. Previous posts have explained how to use
`DataParallel <https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html>`_
to train a neural network on multiple GPUs. ``DataParallel`` replicates the
same model to all GPUs, where each GPU consumes a different partition of the
input data. Although it can significantly accelerate the training process, it
does not work for some use cases where the model is too large to fit into a
single GPU. This post shows how to solve that problem by using model parallel
and also shares some insights on how to speed up model parallel training.

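As a quick reminder, a data-parallel setup is just a thin wrapper around an
existing model. The minimal sketch below is illustrative only; it assumes a
hypothetical ``MyModel`` and at least one visible GPU::

    import torch.nn as nn

    model = MyModel()                # hypothetical single-GPU model
    model = nn.DataParallel(model)   # replicate the model onto all visible GPUs
    model = model.to('cuda:0')       # inputs are scattered across replicas in forward
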
The high-level idea of model parallel is to place different sub-networks of a
model onto different devices, and implement the ``forward`` method accordingly
to move intermediate outputs across devices. As only part of a model operates
on any individual device, a set of devices can collectively serve a larger
model. In this post, we will not try to build huge models and squeeze them
into a limited number of GPUs. Instead, this post focuses on showing the idea
of model parallel. It is up to the readers to apply the ideas to real-world
applications.

**Recommended Reading:**

- https://pytorch.org/ For installation instructions
- :doc:`/beginner/blitz/data_parallel_tutorial` Single-Machine Data Parallel
- :doc:`/intermediate/ddp_tutorial` Combine Distributed Data Parallel and Model Parallel
29 | 31 | """
|
30 | 32 |
|
######################################################################
# Basic Usage
# =======================
#
# Let us start with a toy model that contains two linear layers. To run this
# model on two GPUs, simply put each linear layer on a different GPU, and move
# inputs and intermediate outputs to match the layer devices accordingly.

import torch
import torch.nn as nn
import torch.optim as optim
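
######################################################################
# The class below is one possible sketch of such a two-layer toy model. It
# assumes two CUDA devices, ``cuda:0`` and ``cuda:1``, are available; the
# layer sizes and the device assignment are illustrative, not prescriptive.

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10).to('cuda:0')  # first linear layer lives on GPU 0
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to('cuda:1')   # second linear layer lives on GPU 1

    def forward(self, x):
        # run the first layer on GPU 0 ...
        x = self.relu(self.net1(x.to('cuda:0')))
        # ... then move the intermediate activation to GPU 1 for the second layer
        return self.net2(x.to('cuda:1'))

######################################################################
# Training then looks much like it does for a single-GPU model, except that
# the labels must be placed on the same device as the outputs (``cuda:1`` in
# this sketch) before computing the loss.

model = ToyModel()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
labels = torch.randn(20, 5).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()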