
Commit 1c11ecf

Author: Svetlana Karslioglu

Add Distributed tutorials landing page (#1997)
* Add Distributed tutorials landing page
1 parent 0452dd8 commit 1c11ecf

4 files changed: +155 additions, -4 deletions

_static/css/custom.css

File mode changed: 100644 → 100755

Lines changed: 0 additions & 1 deletion

@@ -71,4 +71,3 @@
 .sd-card:hover:after {
     transform: scaleX(1);
 }
-
conf.py

Lines changed: 3 additions & 3 deletions
@@ -275,16 +275,16 @@ def setup(app):
     # and can be moved outside of this function (and the setup(app) function
     # can be deleted).
     #html_css_files = [
-    #    'https://cdn.jsdelivr.net/npm/katex@0.10.0-beta/dist/katex.min.css'
+    #    'https://cdn.jsdelivr.net/npm/katex@0.10.0-beta/dist/katex.min.css',
+    #    'css/custom.css'
     #]
     # In Sphinx 1.8 it was renamed to `add_css_file`, 1.7 and prior it is
     # `add_stylesheet` (deprecated in 1.8).
     #add_css = getattr(app, 'add_css_file', app.add_stylesheet)
     #for css_file in html_css_files:
     #    add_css(css_file)
-
     # Custom CSS
-    # app.add_stylesheet('css/pytorch_theme.css')
+    #app.add_stylesheet('css/pytorch_theme.css')
     # app.add_stylesheet('https://fonts.googleapis.com/css?family=Lato')
     # Custom directives
     app.add_directive('includenodoc', IncludeDirective)

distributed/home.rst

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
Distributed and Parallel Training Tutorials
===========================================

Distributed training is a model training paradigm that involves
spreading the training workload across multiple worker nodes, which
can significantly improve training speed and model accuracy. While
distributed training can be used for any type of ML model training, it
is most beneficial for large models and compute-demanding tasks such
as deep learning.

There are a few ways you can perform distributed training in
PyTorch, with each method having its advantages in certain use cases:

* `DistributedDataParallel (DDP) <#learn-ddp>`__
* `Fully Sharded Data Parallel (FSDP) <#learn-fsdp>`__
* `Remote Procedure Call (RPC) distributed training <#learn-rpc>`__
* `Custom Extensions <#custom-extensions>`__

Read more about these options in `Distributed Overview <../beginner/dist_overview.html>`__.
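Whichever of these options you choose, each worker process first joins a
process group through ``torch.distributed``. A minimal sketch of that common
setup step, assuming the script is launched with ``torchrun`` (which sets the
rank and world-size environment variables), might look like this::

    import torch.distributed as dist

    def setup():
        # "nccl" is the usual choice for multi-GPU training; "gloo" works for CPU-only runs.
        # Assumes MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE are set (torchrun does this).
        dist.init_process_group(backend="nccl")

    def cleanup():
        dist.destroy_process_group()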
.. _learn-ddp:

Learn DDP
---------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      DDP Intro Video Tutorials
      :link: https://pytorch.org/tutorials/beginner/ddp_series_intro.html?utm_source=distr_landing&utm_medium=ddp_series_intro
      :link-type: url

      A step-by-step video series on how to get started with
      `DistributedDataParallel` and advance to more complex topics
      +++
      :octicon:`code;1em` Code :octicon:`square-fill;1em` :octicon:`video;1em` Video

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with Distributed Data Parallel
      :link: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html?utm_source=distr_landing&utm_medium=intermediate_ddp_tutorial
      :link-type: url

      This tutorial provides a short and gentle intro to the PyTorch
      DistributedDataParallel.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Distributed Training with Uneven Inputs Using
      the Join Context Manager
      :link: https://pytorch.org/tutorials/advanced/generic_join.html?utm_source=distr_landing&utm_medium=generic_join
      :link-type: url

      This tutorial describes the Join context manager and demonstrates
      distributed training with uneven inputs across ranks.
      +++
      :octicon:`code;1em` Code

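The common thread in these tutorials is wrapping the model in
``DistributedDataParallel`` so that gradients are averaged across ranks during
the backward pass. A minimal, illustrative sketch (assuming one GPU per
process and an already initialized process group) looks roughly like this::

    import torch
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train_step(local_rank):
        model = nn.Linear(10, 10).to(local_rank)
        # Each rank holds a full replica; DDP synchronizes gradients during backward().
        ddp_model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)

        inputs = torch.randn(20, 10).to(local_rank)
        labels = torch.randn(20, 10).to(local_rank)

        optimizer.zero_grad()
        loss = nn.MSELoss()(ddp_model(inputs), labels)
        loss.backward()   # gradient all-reduce across ranks happens here
        optimizer.step()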
.. _learn-fsdp:

Learn FSDP
----------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with FSDP
      :link: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_getting_started
      :link-type: url

      This tutorial demonstrates how you can perform distributed training
      with FSDP on the MNIST dataset.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      FSDP Advanced
      :link: https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced
      :link-type: url

      In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5
      model with FSDP for text summarization.
      +++
      :octicon:`code;1em` Code

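FSDP differs from DDP in that parameters, gradients, and optimizer state are
sharded across ranks rather than replicated. A minimal, illustrative wrapping
step (assuming an initialized process group and one GPU per process) could
look like this::

    import torch.nn as nn
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    def wrap_model(local_rank):
        model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
        # Outside of forward/backward each rank keeps only its shard of the
        # parameters; full parameters are gathered on demand.
        return FSDP(model.to(local_rank))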
.. _learn-rpc:

Learn RPC
---------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Getting Started with Distributed RPC Framework
      :link: https://pytorch.org/tutorials/intermediate/rpc_tutorial.html?utm_source=distr_landing&utm_medium=rpc_getting_started
      :link-type: url

      This tutorial demonstrates how to get started with RPC-based distributed
      training.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Implementing a Parameter Server Using Distributed RPC Framework
      :link: https://pytorch.org/tutorials/intermediate/rpc_param_server_tutorial.html?utm_source=distr_landing&utm_medium=rpc_param_server_tutorial
      :link-type: url

      This tutorial walks you through a simple example of implementing a
      parameter server using PyTorch’s Distributed RPC framework.
      +++
      :octicon:`code;1em` Code

   .. grid-item-card:: :octicon:`file-code;1em`
      Implementing Batch RPC Processing Using Asynchronous Executions
      :link: https://pytorch.org/tutorials/intermediate/rpc_async_execution.html?utm_source=distr_landing&utm_medium=rpc_async_execution
      :link-type: url

      In this tutorial you will build batch-processing RPC applications
      with the @rpc.functions.async_execution decorator.
      +++
      :octicon:`code;1em` Code

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Combining Distributed DataParallel with Distributed RPC Framework
      :link: https://pytorch.org/tutorials/advanced/rpc_ddp_tutorial.html?utm_source=distr_landing&utm_medium=rpc_plus_ddp
      :link-type: url

      In this tutorial you will learn how to combine distributed data
      parallelism with distributed model parallelism.
      +++
      :octicon:`code;1em` Code

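Where DDP and FSDP replicate or shard a single training loop, the RPC
framework lets one process invoke functions on another, which is what the
parameter-server and batch-processing tutorials above build on. A minimal,
illustrative example (the worker names and the helper function are made up
for this sketch, and MASTER_ADDR/MASTER_PORT are assumed to be set) might
look like this::

    import torch
    import torch.distributed.rpc as rpc

    def add_tensors(a, b):
        return a + b

    def run(rank, world_size):
        rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
        if rank == 0:
            # Executes add_tensors on worker1 and blocks until the result returns.
            result = rpc.rpc_sync("worker1", add_tensors, args=(torch.ones(2), torch.ones(2)))
            print(result)  # tensor([2., 2.])
        rpc.shutdown()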
.. _custom-extensions:

Custom Extensions
-----------------

.. grid:: 3

   .. grid-item-card:: :octicon:`file-code;1em`
      Customize Process Group Backends Using Cpp Extensions
      :link: https://pytorch.org/tutorials/intermediate/process_group_cpp_extension_tutorial.html?utm_source=distr_landing&utm_medium=custom_extensions_cpp
      :link-type: url

      In this tutorial, you will learn how to implement a custom `ProcessGroup`
      backend and plug it into the PyTorch distributed package using
      cpp extensions.
      +++
      :octicon:`code;1em` Code
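On the Python side, a custom backend built as a C++ extension is typically
registered with ``torch.distributed`` and then selected by name. The sketch
below is purely illustrative: the ``dummy_collectives`` module and its
``createProcessGroupDummy`` constructor are hypothetical placeholders for
whatever the extension actually exports::

    import torch.distributed as dist
    import dummy_collectives  # hypothetical C++ extension module exposing the backend

    # Register the custom backend under a name so init_process_group can find it.
    dist.Backend.register_backend("dummy", dummy_collectives.createProcessGroupDummy)

    dist.init_process_group(backend="dummy", init_method="tcp://127.0.0.1:29500",
                            rank=0, world_size=1)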

index.rst

Lines changed: 1 addition & 0 deletions
@@ -891,6 +891,7 @@ Additional Resources
    :hidden:
    :caption: Parallel and Distributed Training

+   distributed/home
    beginner/dist_overview
    beginner/ddp_series_intro
    intermediate/model_parallel_tutorial

0 commit comments
