Distributed and Parallel Training Tutorials
===========================================

Distributed training is a model training paradigm that involves
spreading the training workload across multiple worker nodes, thereby
significantly improving training speed and model accuracy. While
distributed training can be used for any type of ML model training, it
is most beneficial for large models and compute-demanding tasks such as
deep learning.

There are a few ways you can perform distributed training in
PyTorch, and each method has its advantages in certain use cases:

* `DistributedDataParallel (DDP) <#learn-ddp>`__
* `Fully Sharded Data Parallel (FSDP) <#learn-fsdp>`__
* `Remote Procedure Call (RPC) distributed training <#learn-rpc>`__
* `Pipeline Parallelism <#learn-pipeline-parallelism>`__

Read more about these options in `Distributed Overview <../beginner/dist_overview.rst>`__.
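
All of these approaches build on the same foundation: a group of worker
processes, each of which knows its own rank and the world size and can
exchange tensors with its peers. As a rough illustrative sketch (not part of
the tutorials linked on this page), the snippet below assumes it is launched
with ``torchrun --nproc_per_node=2`` so that the rendezvous environment
variables are already set:

.. code-block:: python

   # Common to every approach on this page: each worker joins a process
   # group, learns its rank, and communicates via collectives.
   import torch
   import torch.distributed as dist

   def main():
       # Reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT from the environment.
       dist.init_process_group("gloo")
       rank = dist.get_rank()
       world_size = dist.get_world_size()

       t = torch.tensor([float(rank)])
       dist.all_reduce(t)  # sums the tensor across all ranks
       print(f"rank {rank}/{world_size} sees {t.item()}")

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()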

.. _learn-ddp:

Learn DDP
---------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      DDP Intro Video Tutorials
      :shadow: none
      :link: https://example.com
      :link-type: url

      A step-by-step video series on how to get started with
      `DistributedDataParallel` and advance to more complex topics.
      +++
      :octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video

   .. grid-item-card:: :octicon:`file-code;1em`
      Single Machine Model Parallel Best Practices
      :shadow: none
      :link: https://example.com
      :link-type: url

      In this tutorial you will learn about best practices when
      using model parallelism.
      +++
      :octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with PyTorch Distributed
      :shadow: none
      :link: https://example.com
      :link-type: url

      This tutorial provides a short and gentle intro to the PyTorch
      DistributedDataParallel.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Writing Distributed Applications with PyTorch
      :shadow: none
      :link: https://example.com
      :link-type: url

      This tutorial demonstrates how to write a distributed application
      with PyTorch.
      +++
      :octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video
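
To make the DDP entry points above more concrete, here is a hedged, minimal
single-node sketch (the model, port, and ``gloo`` backend are illustrative
choices so it runs on CPU); the linked tutorials cover launching,
checkpointing, and multi-node setups in detail:

.. code-block:: python

   # Minimal DDP sketch: each spawned process trains a model replica and
   # gradients are averaged across processes during backward().
   import os
   import torch
   import torch.distributed as dist
   import torch.multiprocessing as mp
   import torch.nn as nn
   from torch.nn.parallel import DistributedDataParallel as DDP

   def worker(rank, world_size):
       os.environ["MASTER_ADDR"] = "localhost"
       os.environ["MASTER_PORT"] = "29500"
       dist.init_process_group("gloo", rank=rank, world_size=world_size)

       model = DDP(nn.Linear(10, 10))  # wrap the local model
       optim = torch.optim.SGD(model.parameters(), lr=0.01)

       out = model(torch.randn(20, 10))
       out.sum().backward()            # gradients are all-reduced here
       optim.step()

       dist.destroy_process_group()

   if __name__ == "__main__":
       world_size = 2
       mp.spawn(worker, args=(world_size,), nprocs=world_size)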

.. _learn-fsdp:

Learn FSDP
----------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with FSDP
      :shadow: none
      :link: https://example.com
      :link-type: url

      This tutorial demonstrates how you can perform distributed training
      with FSDP on the MNIST dataset.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      FSDP Advanced
      :shadow: none
      :link: https://example.com
      :link-type: url

      In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5
      model with FSDP for text summarization.
      +++
      :octicon:`code;1em` Code
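
As a hedged taste of what the FSDP tutorials build toward, the sketch below
wraps a toy model with ``FullyShardedDataParallel`` so that parameters,
gradients, and optimizer state are sharded across ranks. It assumes a
``torchrun`` launch with one GPU per process and the ``nccl`` backend; the
tutorials above add auto-wrapping policies, mixed precision, and checkpointing:

.. code-block:: python

   # Minimal FSDP sketch: shard a model's state across data-parallel workers.
   import torch
   import torch.distributed as dist
   import torch.nn as nn
   from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

   def main():
       dist.init_process_group("nccl")
       torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

       model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
       model = FSDP(model)  # parameters/gradients/optimizer state are sharded
       optim = torch.optim.AdamW(model.parameters(), lr=1e-3)

       x = torch.randn(8, 1024, device="cuda")
       loss = model(x).sum()
       loss.backward()
       optim.step()

       dist.destroy_process_group()

   if __name__ == "__main__":
       main()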

.. _learn-rpc:

Learn RPC
---------

The Distributed Remote Procedure Call (RPC) framework provides
mechanisms for multi-machine model training.

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with Distributed RPC Framework
      :shadow: none
      :link: https://example.com
      :link-type: url

      This tutorial demonstrates how to get started with RPC-based distributed
      training.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Implementing a Parameter Server Using Distributed RPC Framework
      :shadow: none
      :link: https://example.com
      :link-type: url

      This tutorial walks you through a simple example of implementing a
      parameter server using PyTorch's Distributed RPC framework.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Distributed Pipeline Parallelism Using RPC
      :shadow: none
      :link: https://example.com
      :link-type: url

      Learn how to use a ResNet50 model for distributed pipeline parallelism
      with the Distributed RPC APIs.
      +++
      :octicon:`code;1em` Code

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Implementing Batch RPC Processing Using Asynchronous Executions
      :shadow: none
      :link: https://example.com
      :link-type: url

      In this tutorial you will build batch-processing RPC applications
      with the @rpc.functions.async_execution decorator.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Combining Distributed DataParallel with Distributed RPC Framework
      :shadow: none
      :link: https://example.com
      :link-type: url

      In this tutorial you will learn how to combine distributed data
      parallelism with distributed model parallelism.
      +++
      :octicon:`code;1em` Code
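
The RPC tutorials above all revolve around the same primitives:
``rpc.init_rpc`` to join the group and ``rpc.rpc_sync`` / ``rpc.rpc_async`` /
``rpc.remote`` to run code on another worker. A hedged, minimal two-process
sketch (with a hypothetical ``add_on_worker`` helper) looks like this:

.. code-block:: python

   # Minimal RPC sketch: rank 0 invokes a function on worker1 and waits
   # for the result.
   import os
   import torch
   import torch.distributed.rpc as rpc
   import torch.multiprocessing as mp

   def add_on_worker(x, y):
       return x + y

   def run(rank, world_size):
       os.environ["MASTER_ADDR"] = "localhost"
       os.environ["MASTER_PORT"] = "29501"
       rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

       if rank == 0:
           # Executes add_on_worker on worker1 and blocks until it returns.
           result = rpc.rpc_sync("worker1", add_on_worker,
                                 args=(torch.ones(2), torch.ones(2)))
           print(result)  # tensor([2., 2.])

       rpc.shutdown()  # waits for all outstanding RPC work to finish

   if __name__ == "__main__":
       mp.spawn(run, args=(2,), nprocs=2)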

.. _learn-pipeline-parallelism: