Distributed and Parallel Training Tutorials
===========================================

Distributed training is a model training paradigm in which the training
workload is spread across multiple worker nodes, significantly improving
training speed and model accuracy. While distributed training can be used
for any type of ML model training, it is most beneficial for large models
and compute-demanding tasks such as deep learning.

There are a few ways you can perform distributed training in PyTorch,
each method having its advantages in certain use cases:

* `DistributedDataParallel (DDP) <#learn-ddp>`__
* `Fully Sharded Data Parallel (FSDP) <#learn-fsdp>`__
* `Remote Procedure Call (RPC) distributed training <#learn-rpc>`__
* `Custom Extensions <#custom-extensions>`__

Read more about these options in `Distributed Overview <../beginner/dist_overview.html>`__.

.. _learn-ddp:

Learn DDP
---------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      DDP Intro Video Tutorials
      :link: https://pytorch.org/tutorials/beginner/ddp_series_intro.html?utm_source=distr_landing&utm_medium=ddp_series_intro
      :link-type: url

      A step-by-step video series on how to get started with
      `DistributedDataParallel` and advance to more complex topics
      +++
      :octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with Distributed Data Parallel
      :link: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html?utm_source=distr_landing&utm_medium=intermediate_ddp_tutorial
      :link-type: url

      This tutorial provides a short and gentle intro to the PyTorch
      `DistributedDataParallel` module.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Distributed Training with Uneven Inputs Using
      the Join Context Manager
      :link: https://pytorch.org/tutorials/advanced/generic_join.html?utm_source=distr_landing&utm_medium=generic_join
      :link-type: url

      This tutorial describes the Join context manager and demonstrates
      how to use it for distributed training with uneven inputs.
      +++
      :octicon:`code;1em` Code
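
The tutorials above all build on the same basic pattern: initialize a
process group, wrap the model in `DistributedDataParallel`, and run one
process per GPU. Below is a minimal sketch of that pattern, assuming a
single-node launch with `torchrun` and an available NCCL backend:

.. code-block:: python

   import os
   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP

   def main():
       # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every worker.
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       # Wrap the model; DDP replicates it on each rank and synchronizes
       # gradients across all ranks during backward().
       model = nn.Linear(10, 10).cuda(local_rank)
       ddp_model = DDP(model, device_ids=[local_rank])

       optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
       loss = ddp_model(torch.randn(20, 10).cuda(local_rank)).sum()
       loss.backward()  # gradients are all-reduced across ranks here
       optimizer.step()

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()

Run with, for example, `torchrun --nproc_per_node=2 ddp_sketch.py`, where
`ddp_sketch.py` is a placeholder name for the script above.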

.. _learn-fsdp:

Learn FSDP
----------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with FSDP
      :link: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_getting_started
      :link-type: url

      This tutorial demonstrates how you can perform distributed training
      with FSDP on the MNIST dataset.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      FSDP Advanced
      :link: https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced
      :link-type: url

      In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5
      model with FSDP for text summarization.
      +++
      :octicon:`code;1em` Code
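
FSDP follows the same launch pattern as DDP, but instead of replicating
the whole model on every rank it shards parameters, gradients, and
optimizer state across them. A minimal sketch, assuming the same
`torchrun`/NCCL setup as the DDP example above:

.. code-block:: python

   import os
   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   def main():
       dist.init_process_group(backend="nccl")
       local_rank = int(os.environ["LOCAL_RANK"])
       torch.cuda.set_device(local_rank)

       # FSDP shards parameters across ranks and gathers them on demand,
       # so each GPU holds only a fraction of the full model at rest.
       model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
       fsdp_model = FSDP(model, device_id=local_rank)

       # Construct the optimizer after wrapping, over the sharded parameters.
       optimizer = torch.optim.Adam(fsdp_model.parameters(), lr=1e-3)
       loss = fsdp_model(torch.randn(32, 784).cuda(local_rank)).sum()
       loss.backward()
       optimizer.step()

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()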

.. _learn-rpc:

Learn RPC
---------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with Distributed RPC Framework
      :link: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html?utm_source=distr_landing&utm_medium=rpc_getting_started
      :link-type: url

      This tutorial demonstrates how to get started with RPC-based distributed
      training.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Implementing a Parameter Server Using Distributed RPC Framework
      :link: https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html?utm_source=distr_landing&utm_medium=rpc_param_server_tutorial
      :link-type: url

      This tutorial walks you through a simple example of implementing a
      parameter server using PyTorch’s Distributed RPC framework.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Implementing Batch RPC Processing Using Asynchronous Executions
      :link: https://pytorch.org/tutorials/intermediate/rpc_async_execution.html?utm_source=distr_landing&utm_medium=rpc_async_execution
      :link-type: url

      In this tutorial you will build batch-processing RPC applications
      with the `@rpc.functions.async_execution` decorator.
      +++
      :octicon:`code;1em` Code
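
At its core, the RPC framework lets one worker run a function on another
worker and receive the result back. A minimal sketch of that round trip,
assuming two workers launched with `torchrun` (which supplies the rank,
world size, and rendezvous environment variables):

.. code-block:: python

   import os
   import torch
   import torch.distributed.rpc as rpc

   def add_tensors(a, b):
       # Executes on whichever worker the caller targets.
       return a + b

   def main():
       rank = int(os.environ["RANK"])
       world_size = int(os.environ["WORLD_SIZE"])
       rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

       if rank == 0:
           # rpc_sync blocks until worker1 returns the result.
           result = rpc.rpc_sync(
               "worker1", add_tensors, args=(torch.ones(2), torch.ones(2))
           )
           print(result)  # tensor([2., 2.])

       rpc.shutdown()  # waits for all outstanding work to finish

   if __name__ == "__main__":
       main()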

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Combining Distributed DataParallel with Distributed RPC Framework
      :link: https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html?utm_source=distr_landing&utm_medium=rpc_plus_ddp
      :link-type: url

      In this tutorial you will learn how to combine distributed data
      parallelism with distributed model parallelism.
      +++
      :octicon:`code;1em` Code

.. _custom-extensions:

Custom Extensions
-----------------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Customize Process Group Backends Using Cpp Extensions
      :link: https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html?utm_source=distr_landing&utm_medium=custom_extensions_cpp
      :link-type: url

      In this tutorial you will learn to implement a custom `ProcessGroup`
      backend and plug it into the PyTorch distributed package using
      C++ extensions.
      +++
      :octicon:`code;1em` Code
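
Once a custom backend is compiled and registered, selecting it looks no
different from selecting a built-in backend. A hypothetical sketch,
assuming an extension module named `dummy_collectives` that, when
imported, registers a backend called "dummy" via
`torch.distributed.Backend.register_backend`:

.. code-block:: python

   import os
   import torch
   import torch.distributed as dist

   # Hypothetical extension module; importing it is assumed to register a
   # backend named "dummy" via torch.distributed.Backend.register_backend.
   import dummy_collectives  # noqa: F401

   os.environ["MASTER_ADDR"] = "localhost"
   os.environ["MASTER_PORT"] = "29500"
   dist.init_process_group(backend="dummy", rank=0, world_size=1)

   x = torch.ones(4)
   dist.all_reduce(x)  # dispatched to the custom backend's allreduce
   print(x)

   dist.destroy_process_group()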