
Commit 2fdff89

Merge branch 'main' into batchdata
2 parents 80c0860 + 9ec2625

8 files changed: +26 −21 lines changed


.circleci/config.yml

Lines changed: 2 additions & 2 deletions

@@ -142,14 +142,14 @@ pytorch_tutorial_build_defaults: &pytorch_tutorial_build_defaults

 pytorch_tutorial_build_worker_defaults: &pytorch_tutorial_build_worker_defaults
   environment:
-    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7"
+    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7"
     CUDA_VERSION: "9"
   resource_class: gpu.nvidia.small
   <<: *pytorch_tutorial_build_defaults

 pytorch_tutorial_build_manager_defaults: &pytorch_tutorial_build_manager_defaults
   environment:
-    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7"
+    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7"
   resource_class: medium
   <<: *pytorch_tutorial_build_defaults

.circleci/config.yml.in

Lines changed: 2 additions & 2 deletions

@@ -142,14 +142,14 @@ pytorch_tutorial_build_defaults: &pytorch_tutorial_build_defaults

 pytorch_tutorial_build_worker_defaults: &pytorch_tutorial_build_worker_defaults
   environment:
-    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7"
+    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7"
     CUDA_VERSION: "9"
   resource_class: gpu.nvidia.small
   <<: *pytorch_tutorial_build_defaults

 pytorch_tutorial_build_manager_defaults: &pytorch_tutorial_build_manager_defaults
   environment:
-    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7"
+    DOCKER_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7"
   resource_class: medium
   <<: *pytorch_tutorial_build_defaults
 {% raw %}

.jenkins/get_docker_tag.py

Lines changed: 9 additions & 4 deletions

@@ -5,9 +5,14 @@
 }

 if __name__ == "__main__":
-    url = "https://api.github.com/repos/pytorch/pytorch/contents/.circleci"
+    url = "https://api.github.com/repos/pytorch/pytorch/contents/.ci"

     response = requests.get(url, headers=REQUEST_HEADERS)
-    for file in response.json():
-        if file["name"] == "docker":
-            print(file["sha"])
+    docker_sha = None
+    for finfo in response.json():
+        if finfo["name"] == "docker":
+            docker_sha = finfo["sha"]
+            break
+    if docker_sha is None:
+        raise RuntimeError("Can't find sha sum of docker folder")
+    print(docker_sha)
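
For context, this script looks up the Git sha of the docker folder in pytorch/pytorch, presumably used as a Docker image tag by the tutorial CI (per the script's name); the change moves the lookup from .circleci to .ci and fails loudly when the folder is missing. A minimal self-contained sketch of the updated logic — the REQUEST_HEADERS dict sits above this hunk and is not shown, so its contents here are an assumption:

import requests

# Assumed shape of REQUEST_HEADERS; the real dict is defined earlier in
# .jenkins/get_docker_tag.py and is not part of this hunk.
REQUEST_HEADERS = {"Accept": "application/vnd.github.v3+json"}

if __name__ == "__main__":
    url = "https://api.github.com/repos/pytorch/pytorch/contents/.ci"

    response = requests.get(url, headers=REQUEST_HEADERS)
    docker_sha = None
    for finfo in response.json():
        if finfo["name"] == "docker":
            docker_sha = finfo["sha"]
            break
    if docker_sha is None:
        raise RuntimeError("Can't find sha sum of docker folder")
    print(docker_sha)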

beginner_source/ddp_series_fault_tolerance.rst

Lines changed: 3 additions & 3 deletions

@@ -42,8 +42,8 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/

 In distributed training, a single process failure can
 disrupt the entire training job. Since the susceptibility for failure can be higher here, making your training
-script robust is particularly important here. You might also prefer your training job to be *elastic* i.e.
-compute resources can join and leave dynamically over the course of the job.
+script robust is particularly important here. You might also prefer your training job to be *elastic*, for example,
+compute resources can join and leave dynamically over the course of the job.

 PyTorch offers a utility called ``torchrun`` that provides fault-tolerance and
 elastic training. When a failure occurs, ``torchrun`` logs the errors and

@@ -60,7 +60,7 @@ Why use ``torchrun``
 ``torchrun`` handles the minutiae of distributed training so that you
 don't need to. For instance,

-- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; torchrun assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
+- You don't need to set environment variables or explicitly pass the ``rank`` and ``world_size``; ``torchrun`` assigns this along with several other `environment variables <https://pytorch.org/docs/stable/elastic/run.html#environment-variables>`__.
 - No need to call ``mp.spawn`` in your script; you only need a generic ``main()`` entrypoint, and launch the script with ``torchrun``. This way the same script can be run in non-distributed as well as single-node and multinode setups.
 - Gracefully restarting training from the last saved training snapshot
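
To make those bullets concrete, here is a minimal sketch (not part of the diff) of the generic ``main()`` entrypoint pattern that ``torchrun`` expects, using only documented ``torch.distributed`` calls and the environment variables ``torchrun`` sets:

import os
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process,
    # so the script never passes rank/world_size explicitly; the default
    # "env://" rendezvous reads them from the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this node
    # ... move the model to `local_rank`, train, save snapshots ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch: torchrun --standalone --nproc_per_node=4 script.py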

beginner_source/ddp_series_multigpu.rst

Lines changed: 2 additions & 2 deletions

@@ -41,7 +41,7 @@ In this tutorial, we start with a single-GPU training script and migrate that to
 Along the way, we will talk through important concepts in distributed training while implementing them in our code.

 .. note::
-   If your model contains any ``BatchNorm`` layer, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
+   If your model contains any ``BatchNorm`` layers, it needs to be converted to ``SyncBatchNorm`` to sync the running stats of ``BatchNorm``
    layers across replicas.

 Use the helper function

@@ -57,7 +57,7 @@ Imports
 ~~~~~~~
 - ``torch.multiprocessing`` is a PyTorch wrapper around Python's native
   multiprocessing
-- The dsitributed process group contains all the processes that can
+- The distributed process group contains all the processes that can
   communicate and synchronize with each other.

 .. code:: diff
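
As a side note on the ``SyncBatchNorm`` conversion mentioned in the note above, PyTorch ships a one-call helper for it; a minimal sketch:

import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3),
    torch.nn.BatchNorm2d(16),
    torch.nn.ReLU(),
)
# Recursively swaps every BatchNorm*D layer for SyncBatchNorm so running
# stats are synchronized across replicas; do this before wrapping the
# model in DistributedDataParallel.
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)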

beginner_source/ddp_series_theory.rst

Lines changed: 2 additions & 2 deletions

@@ -54,8 +54,8 @@ DDP improves upon the architecture in a few ways:
 |                                       | machines                     |
 +---------------------------------------+------------------------------+
 | Slower; uses multithreading on a      | Faster (no GIL contention)   |
-| single process and runs into GIL      | because it uses              |
-| contention                            | multiprocessing              |
+| single process and runs into Global   | because it uses              |
+| Interpreter Lock (GIL) contention     | multiprocessing              |
 +---------------------------------------+------------------------------+

 Further Reading
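
For readers skimming the table, a minimal sketch of the two APIs being contrasted (illustrative only, not from the tutorial):

import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda()

# DataParallel: a single process fans work out to GPUs with threads,
# so Python's Global Interpreter Lock (GIL) becomes a bottleneck.
dp_model = nn.DataParallel(model)

# DistributedDataParallel: one process per GPU, no GIL contention;
# requires an initialized process group first (e.g. via torchrun).
# ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[0])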

intermediate_source/ddp_series_minGPT.rst

Lines changed: 4 additions & 4 deletions

@@ -48,9 +48,9 @@ Files used for training
 ~~~~~~~~~~~~~~~~~~~~~~~~
 - `trainer.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/trainer.py>`__ includes the Trainer class that runs the distributed training iterations on the model with the provided dataset.
 - `model.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/model.py>`__ defines the model architecture.
-- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the `Dataset`class for a character-level dataset.
+- `char_dataset.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/char_dataset.py>`__ contains the ``Dataset`` class for a character-level dataset.
 - `gpt2_train_cfg.yaml <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/gpt2_train_cfg.yaml>`__ contains the configurations for data, model, optimizer, and training run.
-- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the trainig job. It sets up the DDP process group, reads all the configurations and runs the training job.
+- `main.py <https://github.com/pytorch/examples/blob/main/distributed/minGPT-ddp/mingpt/main.py>`__ is the entry point to the training job. It sets up the DDP process group, reads all the configurations and runs the training job.


 Saving and Loading from the cloud

@@ -72,8 +72,8 @@ A typical training run's memory footprint consists of model weights, activations
 Since DDP replicates the model on each GPU, it only works when GPUs have sufficient capacity to accomodate the full footprint.
 When models grow larger, more aggressive techniques might be useful:

-- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
-- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.
+- `activation checkpointing <https://pytorch.org/docs/stable/checkpoint.html>`__: Instead of saving intermediate activations during the forward pass, the activations are recomputed during the backward pass. In this approach, we run more compute but save on memory footprint.
+- `Fully-Sharded Data Parallel <https://pytorch.org/docs/stable/fsdp.html>`__: Here the model is not replicated but "sharded" across all the GPUs, and computation is overlapped with communication in the forward and backward passes. Read our `blog <https://medium.com/pytorch/training-a-1-trillion-parameter-model-with-pytorch-fully-sharded-data-parallel-on-aws-3ac13aa96cff>`__ to learn how we trained a 1 Trillion parameter model with FSDP.


 Further Reading
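
To illustrate the first bullet in that hunk, activation checkpointing in PyTorch wraps a module's forward so its intermediate activations are recomputed rather than stored; a minimal sketch:

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)
x = torch.randn(8, 512, requires_grad=True)

# Activations inside `block` are not kept during the forward pass; they
# are recomputed during backward, trading extra compute for memory.
y = checkpoint(block, x)
y.sum().backward()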

intermediate_source/ddp_series_multinode.rst

Lines changed: 2 additions & 2 deletions

@@ -38,7 +38,7 @@ Follow along with the video below or on `youtube <https://www.youtube.com/watch/
 Multinode training involves deploying a training job across several
 machines. There are two ways to do this:

-- running a torchrun command on each machine with identical rendezvous arguments, or
+- running a ``torchrun`` command on each machine with identical rendezvous arguments, or
 - deploying it on a compute cluster using a workload manager (like SLURM)

 In this video we will go over the (minimal) code changes required to move from single-node multigpu to

@@ -50,7 +50,7 @@ on 4 GPUs on a single node will be faster than running it on 4 nodes with 1 GPU
 Local and Global ranks
 ~~~~~~~~~~~~~~~~~~~~~~~~
 In single-node settings, we were tracking the
-``gpu_id``s of the devices running our training processes. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``
+``gpu_id`` of each device running our training process. ``torchrun`` tracks this value in an environment variable ``LOCAL_RANK``
 which uniquely identifies each GPU-process on a node. For a unique identifier across all the nodes, ``torchrun`` provides another variable
 ``RANK`` which refers to the global rank of a process.
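
A minimal sketch of the rank bookkeeping described above, assuming the script was launched with ``torchrun`` on every node:

import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])  # unique per GPU within a node
global_rank = int(os.environ["RANK"])       # unique across all nodes

# Each process drives the GPU matching its local rank.
torch.cuda.set_device(local_rank)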
