Commit 0ebdf09
Author: Svetlana Karslioglu
Commit message: Update
1 parent 88a3db8

1 file changed: distributed/home.rst (115 additions, 24 deletions)
@@ -1,47 +1,53 @@
 Distributed and Parallel Training Tutorials
 ===========================================
 
-Distributed and
-at pytorch.org website.
+Distributed training is a model training paradigm that involves
+spreading the training workload across multiple worker nodes, thereby
+significantly improving the speed of training and model accuracy. While
+distributed training can be used for any type of ML model training, it
+is most beneficial for large models and compute-demanding tasks such
+as deep learning.
 
-Getting Started with Distributed Data-Parallel Training (DDP)
--------------------------------------------------------------
+There are a few ways you can perform distributed training in
+PyTorch, with each method having its advantages in certain use cases:
 
+* `DistributedDataParallel (DDP) <#learn-ddp>`__
+* `Fully Sharded Data Parallel (FSDP) <#learn-fsdp>`__
+* `Remote Procedure Call (RPC) distributed training <#learn-rpc>`__
+* `Pipeline Parallelism <#learn-pipeline-parallelism>`__
 
+Read more about these options in `Distributed Overview <../beginner/dist_overview.rst>`__.
 
-.. grid:: 3
+.. _learn-ddp:
 
-   .. grid-item-card:: :octicon:`file-code;1em`
-      Getting Started with PyTorch Distributed
-      :shadow: none
-      :link: https://example.com
-      :link-type: url
-
-      This tutorial provides a gentle intro to the PyTorch
-      DistributedData Parallel.
-
-      :octicon:`code;1em` Code
+Learn DDP
+---------
+
+.. grid:: 3
 
    .. grid-item-card:: :octicon:`file-code;1em`
-      Single Machine Model Parallel Best Practices
+      DDP Intro Video Tutorials
       :shadow: none
       :link: https://example.com
       :link-type: url
 
-      In this tutorial you will learn about best practices in
-      using model parallel.
-
-      :octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video
+      A step-by-step video series on how to get started with
+      `DistributedDataParallel` and advance to more complex topics.
+      +++
+      :octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video
 
-   .. grid-item-card:: :octicon:`file-code;1em` Writing Distributed Applications with PyTorch
+   .. grid-item-card:: :octicon:`file-code;1em`
+      Getting Started with PyTorch Distributed
       :shadow: none
       :link: https://example.com
       :link-type: url
 
-      This tutorial demonstrates how to write a distributed application
-      with PyTorch.
+      This tutorial provides a short and gentle intro to the PyTorch
+      DistributedDataParallel.
+      +++
+      :octicon:`code;1em` Code
 
-      :octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video
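
The DDP cards above link out to the full tutorials; for orientation, a minimal
DistributedDataParallel sketch, not part of this commit, looks roughly like the
following. It assumes a single-node, multi-GPU job launched with torchrun, and
the toy model and tensor shapes are invented for illustration.

    # Launch with: torchrun --nproc_per_node=<num_gpus> ddp_sketch.py
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = nn.Linear(10, 10).to(local_rank)          # toy model for illustration
        ddp_model = DDP(model, device_ids=[local_rank])   # replicate and sync across ranks

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        inputs = torch.randn(32, 10, device=local_rank)
        labels = torch.randn(32, 10, device=local_rank)

        loss = nn.functional.mse_loss(ddp_model(inputs), labels)
        loss.backward()        # gradients are all-reduced across workers here
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Because torchrun handles process spawning and the rendezvous environment
variables, the same script runs unchanged on any number of GPUs.
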
+.. _learn-fsdp:
 
 Learn FSDP
 ----------
@@ -53,8 +59,93 @@ models.
 
 .. grid:: 3
 
+   .. grid-item-card:: :octicon:`file-code;1em`
+      Getting Started with FSDP
+      :shadow: none
+      :link: https://example.com
+      :link-type: url
+
+      This tutorial demonstrates how you can perform distributed training
+      with FSDP on the MNIST dataset.
+      +++
+      :octicon:`code;1em` Code
+
+   .. grid-item-card:: :octicon:`file-code;1em`
+      FSDP Advanced
+      :shadow: none
+      :link: https://example.com
+      :link-type: url
+
+      In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5
+      model with FSDP for text summarization.
+      +++
+      :octicon:`code;1em` Code
+
.. _learn-rpc:
85+
5686
Learn RPC
5787
---------
5888

5989
Distributed Remote Procedure Call (RPC) framework provides
6090
mechanisms for multi-machine model training
91+
92+
.. grid:: 3
93+
94+
.. grid-item-card:: :octicon:`file-code;1em`
95+
Getting Started with Distributed RPC Framework
96+
:shadow: none
97+
:link: https://example.com
98+
:link-type: url
99+
100+
This tutorial demonstrates how to get started with RPC-based distributed
101+
training.
102+
+++
103+
:octicon:`code;1em` Code
104+
105+
.. grid-item-card:: :octicon:`file-code;1em`
106+
Implementing a Parameter Server Using Distributed RPC Framework
107+
:shadow: none
108+
:link: https://example.com
109+
:link-type: url
110+
111+
This tutorial walks you through a simple example of implementing a
112+
parameter server using PyTorch’s Distributed RPC framework.
113+
+++
114+
:octicon:`code;1em` Code
115+
116+
.. grid-item-card:: :octicon:`file-code;1em`
117+
Distributed Pipeline Parallelism Using RPC
118+
:shadow: none
119+
:link: https://example.com
120+
:link-type: url
121+
122+
Learn how to use a Resnet50 model for distributed pipeline parallelism
123+
with the Distributed RPC APIs.
124+
+++
125+
:octicon:`code;1em` Code
126+
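
For readers new to the RPC framework these cards refer to, a rough two-worker
sketch of a synchronous remote call follows. It is not part of this commit;
the worker names, port, and toy function are made up for illustration.

    import os
    import torch
    import torch.distributed.rpc as rpc
    import torch.multiprocessing as mp

    def remote_matmul(a, b):
        # Runs on the worker that receives the RPC (worker1 below).
        return torch.matmul(a, b)

    def run(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

        if rank == 0:
            # Block until worker1 executes remote_matmul and returns the result.
            result = rpc.rpc_sync("worker1", remote_matmul,
                                  args=(torch.ones(2, 3), torch.ones(3, 2)))
            print(result)

        rpc.shutdown()  # waits for all outstanding work on every worker

    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2, join=True)
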
+.. grid:: 3
+
+   .. grid-item-card:: :octicon:`file-code;1em`
+      Implementing Batch RPC Processing Using Asynchronous Executions
+      :shadow: none
+      :link: https://example.com
+      :link-type: url
+
+      In this tutorial you will build batch-processing RPC applications
+      with the @rpc.functions.async_execution decorator.
+      +++
+      :octicon:`code;1em` Code
+
+   .. grid-item-card:: :octicon:`file-code;1em`
+      Combining Distributed DataParallel with Distributed RPC Framework
+      :shadow: none
+      :link: https://example.com
+      :link-type: url
+
+      In this tutorial you will learn how to combine distributed data
+      parallelism with distributed model parallelism.
+      +++
+      :octicon:`code;1em` Code
+
+.. _learn-pipeline-parallelism:
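
The batch-processing card above names the @rpc.functions.async_execution
decorator. The pattern it enables, returning a Future from an RPC target so
the reply is sent only when that Future completes rather than blocking an RPC
thread, can be sketched as below; the function and worker names are
illustrative and not part of this commit.

    import torch
    import torch.distributed.rpc as rpc

    # The decorated target returns a Future instead of a value; the caller's reply
    # is sent when the Future completes, leaving the server free to batch other work.
    @rpc.functions.async_execution
    def async_add_chained(to, x, y, z):
        # rpc_async already returns a Future; .then() chains the extra add without waiting.
        return rpc.rpc_async(to, torch.add, args=(x, y)).then(
            lambda fut: fut.wait() + z
        )

    # Caller side, assuming rpc.init_rpc has been run on workers named
    # "worker1" and "worker2":
    #   ret = rpc.rpc_sync("worker1", async_add_chained,
    #                      args=("worker2", torch.ones(2), 1, 2))
    #   # ret == torch.ones(2) * 4
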
