intermediate_source/dist_tuto.rst
+48 -99 lines changed: 48 additions & 99 deletions
@@ -2,7 +2,10 @@ Writing Distributed Applications with PyTorch
 =============================================
 **Author**: `Séb Arnold <https://seba1511.com>`_

-In this short tutorial, we will be going over the distributed package of PyTorch. We'll see how to set up the distributed setting, use the different communication strategies, and go over some the internals of the package.
+In this short tutorial, we will be going over the distributed package
+of PyTorch. We'll see how to set up the distributed setting, use the
+different communication strategies, and go over some of the internals of
+the package.

 Setup
 -----
@@ -17,7 +20,7 @@ Setup
 The distributed package included in PyTorch (i.e.,
 ``torch.distributed``) enables researchers and practitioners to easily
 parallelize their computations across processes and clusters of
-machines. To do so, it leverages the messaging passing semantics
+machines. To do so, it leverages message passing semantics
 allowing each process to communicate data to any of the other processes.
 As opposed to the multiprocessing (``torch.multiprocessing``) package,
 processes can use different communication backends and are not
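For readers skimming this diff, a hedged illustration of the message-passing semantics described above, in the spirit of the tutorial's point-to-point example. It assumes the process group from the template edited in the next hunks has already been initialized, and it is not part of this change:

.. code:: python

    """Sketch: blocking point-to-point exchange between rank 0 and rank 1."""
    import torch
    import torch.distributed as dist

    def run(rank, size):
        tensor = torch.zeros(1)
        if rank == 0:
            tensor += 1
            dist.send(tensor=tensor, dst=1)   # rank 0 sends the tensor to rank 1
        else:
            dist.recv(tensor=tensor, src=0)   # rank 1 blocks until the tensor arrives
        print('Rank', rank, 'has data', tensor[0])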
@@ -45,7 +48,7 @@ the following template.
     """ Distributed function to be implemented later. """
     pass

-def init_processes(rank, size, fn, backend='tcp'):
+def init_process(rank, size, fn, backend='gloo'):
     """ Initialize the distributed environment. """
     os.environ['MASTER_ADDR'] = '127.0.0.1'
     os.environ['MASTER_PORT'] = '29500'
@@ -57,7 +60,7 @@ the following template.
     size = 2
     processes = []
     for rank in range(size):
-        p = Process(target=init_processes, args=(rank, size, run))
+        p = Process(target=init_process, args=(rank, size, run))
         p.start()
         processes.append(p)
@@ -69,12 +72,10 @@ distributed environment, initialize the process group
 (``dist.init_process_group``), and finally execute the given ``run``
 function.

-Let's have a look at the ``init_processes`` function. It ensures that
+Let's have a look at the ``init_process`` function. It ensures that
 every process will be able to coordinate through a master, using the
-same ip address and port. Note that we used the TCP backend, but we
-could have used
-`MPI <https://en.wikipedia.org/wiki/Message_Passing_Interface>`__ or
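Putting the edited pieces together, here is a minimal runnable sketch of the post-change template: two processes coordinating through a local master over the ``gloo`` backend. The body of ``run`` is a placeholder added for illustration and is not part of the diff:

.. code:: python

    """Sketch of the full template assembled from the hunks above."""
    import os
    import torch.distributed as dist
    from torch.multiprocessing import Process

    def run(rank, size):
        """ Distributed function to be implemented later. """
        print('Hello from rank', rank, 'of', size)  # placeholder work

    def init_process(rank, size, fn, backend='gloo'):
        """ Initialize the distributed environment. """
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(backend, rank=rank, world_size=size)
        fn(rank, size)

    if __name__ == "__main__":
        size = 2
        processes = []
        for rank in range(size):
            p = Process(target=init_process, args=(rank, size, run))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()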
 Of course, this will be a didactic example and in a real-world
-situtation you should use the official, well-tested and well-optimized
+situation you should use the official, well-tested and well-optimized
 version linked above.

 Quite simply we want to implement a distributed version of stochastic
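This hunk leads into the tutorial's hand-rolled distributed SGD. As a hedged sketch of the gradient-averaging idea behind it (the helper name ``average_gradients`` and its use after ``backward()`` are illustrative assumptions, not lines from this diff):

.. code:: python

    """Sketch: average gradients across ranks with an all-reduce."""
    import torch.distributed as dist

    def average_gradients(model):
        """Call after loss.backward() so every rank applies the same mean gradient."""
        world_size = float(dist.get_world_size())
        for param in model.parameters():
            if param.grad is not None:
                # Sum the gradient from every process, then divide to get the mean.
                dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
                param.grad.data /= world_size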
@@ -443,43 +444,27 @@ Communication Backends
 One of the most elegant aspects of ``torch.distributed`` is its ability
 to abstract and build on top of different backends. As mentioned before,
-there are currently three backends implemented in PyTorch: TCP, MPI, and
-Gloo. They each have different specifications and tradeoffs, depending
-on the desired use-case. A comparative table of supported functions can
+there are currently three backends implemented in PyTorch: Gloo, NCCL, and
+MPI. They each have different specifications and tradeoffs, depending
+on the desired use case. A comparative table of supported functions can
 be found
-`here <https://pytorch.org/docs/stable/distributed.html#module-torch.distributed>`__. Note that a fourth backend, NCCL, has been added since the creation of this tutorial. See `this section <https://pytorch.org/docs/stable/distributed.html#multi-gpu-collective-functions>`__ of the ``torch.distributed`` docs for more information about its use and value.
-
-**TCP Backend**
-
-So far we have made extensive usage of the TCP backend. It is quite
-handy as a development platform, as it is guaranteed to work on most
-machines and operating systems. It also supports all point-to-point and
-collective functions on CPU. However, there is no support for GPUs and
-its communication routines are not as optimized as the MPI one.
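Since the rewritten hunk now leads with Gloo, NCCL, and MPI, here is a small hedged sketch of how a backend is typically selected at initialization: roughly, Gloo for CPU tensors, NCCL for CUDA tensors, MPI when a suitable MPI build is available. The ``pick_backend`` helper is an illustration, not part of the tutorial:

.. code:: python

    """Sketch: choose a backend at init time based on available hardware."""
    import os
    import torch
    import torch.distributed as dist

    def pick_backend():
        # NCCL provides optimized GPU collectives; Gloo works on CPU almost everywhere.
        return 'nccl' if torch.cuda.is_available() else 'gloo'

    def init_process(rank, size):
        os.environ['MASTER_ADDR'] = '127.0.0.1'
        os.environ['MASTER_PORT'] = '29500'
        dist.init_process_group(pick_backend(), rank=rank, world_size=size)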