
Commit 2f86fcc

Merge branch 'main' into redirecting-former-torchies
2 parents 2880d6b + 76bd6d3 commit 2f86fcc

File tree

2 files changed: +60 -24 lines changed


.github/workflows/link_checkPR.yml

Lines changed: 24 additions & 2 deletions
@@ -1,30 +1,52 @@
 #Checks links in a PR to ensure they are valid. If link is valid but failing, it can be added to the .lycheeignore file
+#Use the skip-link-check label on a PR to skip checking links on a PR

 name: link check on PR

 on:
   pull_request:
     branches: [main]
+
 jobs:
   linkChecker:
     runs-on: ubuntu-latest
+
     steps:
       - uses: actions/checkout@v4
         with:
           fetch-depth: 1
+
       - name: Get Changed Files
         id: changed-files
         uses: tj-actions/changed-files@v41
+
+      - name: Check for Skip Label
+        id: skip-label
+        uses: actions/github-script@v6
+        with:
+          script: |
+            const labels = await github.rest.issues.listLabelsOnIssue({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              issue_number: context.issue.number
+            });
+            return labels.data.some(label => label.name === 'skip-link-check');
+
       - name: Check Links
+        if: steps.skip-label.outputs.result == 'false'
         uses: lycheeverse/lychee-action@v1
         with:
           args: --accept=200,403,429 --base . --verbose --no-progress ${{ steps.changed-files.outputs.all_changed_files }}
           token: ${{ secrets.CUSTOM_TOKEN }}
           fail: true
+
+      - name: Skip Message
+        if: steps.skip-label.outputs.result == 'true'
+        run: echo "Link check was skipped due to the presence of the 'skip-link-check' label."
+
       - name: Suggestions
         if: failure()
         run: |
           echo -e "\nPlease review the links reported in the Check links step above."
-          echo -e "If a link is valid but fails due to a CAPTCHA challenge, IP blocking, login requirements, etc.,
-          consider adding such links to .lycheeignore file to bypass future checks.\n"
+          echo -e "If a link is valid but fails due to a CAPTCHA challenge, IP blocking, login requirements, etc., consider adding such links to .lycheeignore file to bypass future checks.\n"
           exit 1
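For reference, the label check performed by the new github-script step can be approximated outside of Actions with a plain GitHub REST API call. The sketch below is only an illustration and not part of the workflow; OWNER, REPO, PR_NUMBER, and TOKEN are placeholder values.

    # Illustrative only: reproduce the skip-label lookup with the GitHub REST API.
    import requests

    OWNER, REPO, PR_NUMBER, TOKEN = "owner", "repo", 1, "ghp_placeholder"

    resp = requests.get(
        f"https://api.github.com/repos/{OWNER}/{REPO}/issues/{PR_NUMBER}/labels",
        headers={"Authorization": f"Bearer {TOKEN}", "Accept": "application/vnd.github+json"},
    )
    resp.raise_for_status()
    # True when the PR carries the 'skip-link-check' label, mirroring the workflow condition.
    skip = any(label["name"] == "skip-link-check" for label in resp.json())
    print("skip link check:", skip)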

intermediate_source/dist_tuto.rst

Lines changed: 36 additions & 22 deletions
@@ -38,7 +38,7 @@ simultaneously. If you have access to compute cluster you should check
 with your local sysadmin or use your favorite coordination tool (e.g.,
 `pdsh <https://linux.die.net/man/1/pdsh>`__,
 `clustershell <https://cea-hpc.github.io/clustershell/>`__, or
-`others <https://slurm.schedmd.com/>`__). For the purpose of this
+`slurm <https://slurm.schedmd.com/>`__). For the purpose of this
 tutorial, we will use a single machine and spawn multiple processes using
 the following template.

@@ -64,11 +64,11 @@ the following template.


     if __name__ == "__main__":
-        size = 2
+        world_size = 2
         processes = []
         mp.set_start_method("spawn")
-        for rank in range(size):
-            p = mp.Process(target=init_process, args=(rank, size, run))
+        for rank in range(world_size):
+            p = mp.Process(target=init_process, args=(rank, world_size, run))
             p.start()
             processes.append(p)

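The hunk above shows only the ``__main__`` block of the tutorial's template. For context, a minimal sketch of the ``init_process`` helper it calls could look like the following, assuming the gloo backend and a localhost rendezvous as in the tutorial.

    import os
    import torch.distributed as dist

    def init_process(rank, world_size, fn, backend="gloo"):
        """Initialize the distributed environment, then run fn on this rank."""
        os.environ["MASTER_ADDR"] = "127.0.0.1"   # rendezvous address (single machine)
        os.environ["MASTER_PORT"] = "29500"       # any free port
        dist.init_process_group(backend, rank=rank, world_size=world_size)
        fn(rank, world_size)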
@@ -125,7 +125,7 @@ process 0 increments the tensor and sends it to process 1 so that they
 both end up with 1.0. Notice that process 1 needs to allocate memory in
 order to store the data it will receive.

-Also notice that ``send``/``recv`` are **blocking**: both processes stop
+Also notice that ``send/recv`` are **blocking**: both processes block
 until the communication is completed. On the other hand immediates are
 **non-blocking**; the script continues its execution and the methods
 return a ``Work`` object upon which we can choose to
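As a concrete illustration of the blocking behaviour this hunk describes, a minimal ``run`` function plugged into the template above might look like this sketch: rank 0 blocks in ``send`` and rank 1 blocks in ``recv`` until the transfer completes.

    import torch
    import torch.distributed as dist

    def run(rank, world_size):
        tensor = torch.zeros(1)
        if rank == 0:
            tensor += 1
            dist.send(tensor=tensor, dst=1)   # blocks until the matching recv is posted
        else:
            dist.recv(tensor=tensor, src=0)   # blocks until the data has arrived
        print(f"Rank {rank} has data {tensor[0]}")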
@@ -219,16 +219,23 @@ to obtain the sum of all tensors on all processes, we can use the
 Since we want the sum of all tensors in the group, we use
 ``dist.ReduceOp.SUM`` as the reduce operator. Generally speaking, any
 commutative mathematical operation can be used as an operator.
-Out-of-the-box, PyTorch comes with 4 such operators, all working at the
+Out-of-the-box, PyTorch comes with many such operators, all working at the
 element-wise level:

 - ``dist.ReduceOp.SUM``,
 - ``dist.ReduceOp.PRODUCT``,
 - ``dist.ReduceOp.MAX``,
-- ``dist.ReduceOp.MIN``.
+- ``dist.ReduceOp.MIN``,
+- ``dist.ReduceOp.BAND``,
+- ``dist.ReduceOp.BOR``,
+- ``dist.ReduceOp.BXOR``,
+- ``dist.ReduceOp.PREMUL_SUM``.

-In addition to ``dist.all_reduce(tensor, op, group)``, there are a total
-of 6 collectives currently implemented in PyTorch.
+The full list of supported operators is
+`here <https://pytorch.org/docs/stable/distributed.html#torch.distributed.ReduceOp>`__.
+
+In addition to ``dist.all_reduce(tensor, op, group)``, there are many additional collectives currently implemented in
+PyTorch. Here are a few supported collectives.

 - ``dist.broadcast(tensor, src, group)``: Copies ``tensor`` from
   ``src`` to all other processes.
@@ -244,6 +251,12 @@ of 6 collectives currently implemented in PyTorch.
 - ``dist.all_gather(tensor_list, tensor, group)``: Copies ``tensor``
   from all processes to ``tensor_list``, on all processes.
 - ``dist.barrier(group)``: Blocks all processes in `group` until each one has entered this function.
+- ``dist.all_to_all(output_tensor_list, input_tensor_list, group)``: Scatters list of input tensors to all processes in
+  a group and return gathered list of tensors in output list.
+
+The full list of supported collectives can be found by looking at the latest documentation for PyTorch Distributed
+`(link) <https://pytorch.org/docs/stable/distributed.html>`__.
+

 Distributed Training
 --------------------
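To make the reduce-operator discussion in the two hunks above concrete, a minimal all-reduce sketch (again assuming the tutorial's process template) would be: every rank contributes a tensor of ones, and after the call every rank holds the element-wise sum, i.e. the world size.

    import torch
    import torch.distributed as dist

    def run(rank, world_size):
        tensor = torch.ones(1)
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)   # in-place; result visible on every rank
        print(f"Rank {rank} has data {tensor[0]}")       # prints world_size on each rank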
@@ -275,7 +288,7 @@ gradients of their model on their batch of data and then average their
 gradients. In order to ensure similar convergence results when changing
 the number of processes, we will first have to partition our dataset.
 (You could also use
-`tnt.dataset.SplitDataset <https://github.com/pytorch/tnt/blob/master/torchnet/dataset/splitdataset.py#L4>`__,
+`torch.utils.data.random_split <https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split>`__,
 instead of the snippet below.)

 .. code:: python
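The partitioning snippet the tutorial refers to is not part of this hunk. As a rough sketch of the suggested ``torch.utils.data.random_split`` alternative, each process could keep the shard matching its rank; ``dataset``, ``world_size``, and ``rank`` are placeholders supplied by the caller.

    import torch
    from torch.utils.data import random_split

    def partition_dataset(dataset, world_size, rank, seed=1234):
        """Split dataset into world_size shards and return the shard for this rank."""
        shard_len = len(dataset) // world_size
        lengths = [shard_len] * world_size
        lengths[-1] += len(dataset) - sum(lengths)          # absorb any remainder
        shards = random_split(dataset, lengths,
                              generator=torch.Generator().manual_seed(seed))
        return shards[rank]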
@@ -389,7 +402,7 @@ could train any model on a large computer cluster.
 lot more tricks <https://seba-1511.github.io/dist_blog>`__ required to
 implement a production-level implementation of synchronous SGD. Again,
 use what `has been tested and
-optimized <https://pytorch.org/docs/stable/nn.html#torch.nn.parallel.DistributedDataParallel>`__.
+optimized <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`__.

 Our Own Ring-Allreduce
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -451,8 +464,9 @@ Communication Backends

 One of the most elegant aspects of ``torch.distributed`` is its ability
 to abstract and build on top of different backends. As mentioned before,
-there are currently three backends implemented in PyTorch: Gloo, NCCL, and
-MPI. They each have different specifications and tradeoffs, depending
+there are multiple backends implemented in PyTorch.
+Some of the most popular ones are Gloo, NCCL, and MPI.
+They each have different specifications and tradeoffs, depending
 on the desired use case. A comparative table of supported functions can
 be found
 `here <https://pytorch.org/docs/stable/distributed.html#module-torch.distributed>`__.
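As a small illustration of choosing between these backends at runtime, one common pattern (not taken from the tutorial) is to prefer NCCL when CUDA devices are available and fall back to Gloo otherwise.

    import torch
    import torch.distributed as dist

    # Pick NCCL for GPU collectives, Gloo as the CPU fallback; MPI requires a special build.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # rank and world_size would be supplied by the launcher, e.g. torchrun:
    # dist.init_process_group(backend, rank=rank, world_size=world_size)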
@@ -544,15 +558,15 @@ NCCL backend is included in the pre-built binaries with CUDA support.
 Initialization Methods
 ~~~~~~~~~~~~~~~~~~~~~~

-To finish this tutorial, let's talk about the very first function we
-called: ``dist.init_process_group(backend, init_method)``. In
-particular, we will go over the different initialization methods which
-are responsible for the initial coordination step between each process.
-Those methods allow you to define how this coordination is done.
-Depending on your hardware setup, one of these methods should be
-naturally more suitable than the others. In addition to the following
-sections, you should also have a look at the `official
-documentation <https://pytorch.org/docs/stable/distributed.html#initialization>`__.
+To conclude this tutorial, let's examine the initial function we invoked:
+``dist.init_process_group(backend, init_method)``. Specifically, we will discuss the various
+initialization methods responsible for the preliminary coordination step between each process.
+These methods enable you to define how this coordination is accomplished.
+
+The choice of initialization method depends on your hardware setup, and one method may be more
+suitable than others. In addition to the following sections, please refer to the `official
+documentation <https://pytorch.org/docs/stable/distributed.html#initialization>`__ for further information.
+

 **Environment Variable**
