Commit c8f7e41

Author: Vincent Moens
Merge remote-tracking branch 'origin/main' into pinmem-nonblock-tuto
2 parents: bff42d1 + c3882db

File tree: 8 files changed (+21, -18 lines)

.ci/docker/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ tqdm==4.66.1
 numpy==1.24.4
 matplotlib
 librosa
-torch==2.3
+torch==2.4
 torchvision
 torchtext
 torchdata

.jenkins/build.sh

Lines changed: 2 additions & 2 deletions
@@ -22,8 +22,8 @@ sudo apt-get install -y pandoc
 #Install PyTorch Nightly for test.
 # Nightly - pip install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu102/torch_nightly.html
 # Install 2.4 to merge all 2.4 PRs - uncomment to install nightly binaries (update the version as needed).
-pip uninstall -y torch torchvision torchaudio torchtext torchdata
-pip3 install torch==2.4.0 torchvision torchaudio --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu124
+# pip uninstall -y torch torchvision torchaudio torchtext torchdata
+# pip3 install torch==2.4.0 torchvision torchaudio --no-cache-dir --index-url https://download.pytorch.org/whl/test/cu124

 # Install two language tokenizers for Translation with TorchText tutorial
 python -m spacy download en_core_web_sm

beginner_source/knowledge_distillation_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -352,7 +352,7 @@ def train_knowledge_distillation(teacher, student, train_loader, epochs, learnin
 # Cosine loss minimization run
 # ----------------------------
 # Feel free to play around with the temperature parameter that controls the softness of the softmax function and the loss coefficients.
-# In neural networks, it is easy to include to include additional loss functions to the main objectives to achieve goals like better generalization.
+# In neural networks, it is easy to include additional loss functions to the main objectives to achieve goals like better generalization.
 # Let's try including an objective for the student, but now let's focus on their hidden states rather than their output layers.
 # Our goal is to convey information from the teacher's representation to the student by including a naive loss function,
 # whose minimization implies that the flattened vectors that are subsequently passed to the classifiers have become more *similar* as the loss decreases.
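
The corrected comment describes attaching an auxiliary loss on the student's hidden representations next to the main classification objective. As a minimal sketch of that idea only (the tensor shapes, the 0.25 weight, and the random stand-in tensors below are assumptions for illustration, not the tutorial's code):

import torch
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
cosine_loss = nn.CosineEmbeddingLoss()

# Stand-ins for one batch: student logits, flattened hidden vectors from
# student and teacher, and ground-truth labels (all hypothetical shapes).
student_logits = torch.randn(8, 10, requires_grad=True)
student_hidden = torch.randn(8, 256, requires_grad=True)
teacher_hidden = torch.randn(8, 256)
labels = torch.randint(0, 10, (8,))

# Main objective plus an auxiliary term that pulls the student's flattened
# vectors toward the teacher's; target=1 tells CosineEmbeddingLoss to treat
# each pair as "should be similar".
target = torch.ones(student_hidden.size(0))
loss = ce_loss(student_logits, labels) + 0.25 * cosine_loss(student_hidden, teacher_hidden, target)
loss.backward()

Minimizing the cosine term drives the two flattened vectors toward higher cosine similarity, which is the effect the comment block describes.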

index.rst

Lines changed: 5 additions & 5 deletions
@@ -3,11 +3,11 @@ Welcome to PyTorch Tutorials
 
 **What's new in PyTorch tutorials?**
 
-* `Using User-Defined Triton Kernels with torch.compile <https://pytorch.org/tutorials/recipes/torch_compile_user_defined_triton_kernel_tutorial.html>`__
-* `Large Scale Transformer model training with Tensor Parallel (TP) <https://pytorch.org/tutorials/intermediate/TP_tutorial.html>`__
-* `Accelerating BERT with semi-structured (2:4) sparsity <https://pytorch.org/tutorials/advanced/semi_structured_sparse.html>`__
-* `torch.export Tutorial with torch.export.Dim <https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html>`__
-* `Extension points in nn.Module for load_state_dict and tensor subclasses <https://pytorch.org/tutorials/recipes/recipes/swap_tensors.html>`__
+* `Introduction to Distributed Pipeline Parallelism <https://pytorch.org/tutorials/intermediate/pipelining_tutorial.html>`__
+* `Introduction to Libuv TCPStore Backend <https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html>`__
+* `Asynchronous Saving with Distributed Checkpoint (DCP) <https://pytorch.org/tutorials/recipes/distributed_async_checkpoint_recipe.html>`__
+* `Python Custom Operators <https://pytorch.org/tutorials/advanced/python_custom_ops.html>`__
+* Updated `Getting Started with DeviceMesh <https://pytorch.org/tutorials/recipes/distributed_device_mesh.html>`__
 
 .. raw:: html
 

intermediate_source/FSDP_adavnced_tutorial.rst

Lines changed: 1 addition & 1 deletion
@@ -502,7 +502,7 @@ layer class (holding MHSA and FFN).
 
 
 model = FSDP(model,
-    fsdp_auto_wrap_policy=t5_auto_wrap_policy)
+    auto_wrap_policy=t5_auto_wrap_policy)
 
 To see the wrapped model, you can easily print the model and visually inspect
 the sharding and FSDP units as well.
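
The change in this file is only the keyword rename from fsdp_auto_wrap_policy to auto_wrap_policy. For context, a policy such as t5_auto_wrap_policy is commonly built from transformer_auto_wrap_policy so that each transformer block becomes its own FSDP unit; the sketch below assumes the Hugging Face T5Block is the layer class being wrapped, which is not shown in this diff:

import functools

from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.t5.modeling_t5 import T5Block  # assumes the transformers package is installed

# Seal an FSDP unit around every T5Block (the transformer layer holding MHSA and FFN).
t5_auto_wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={T5Block},
)

# The policy is then passed with the renamed keyword:
# model = FSDP(model, auto_wrap_policy=t5_auto_wrap_policy)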

intermediate_source/FSDP_tutorial.rst

Lines changed: 7 additions & 7 deletions
@@ -70,7 +70,7 @@ We add the following code snippets to a python script “FSDP_mnist.py”.
 1.2 Import necessary packages
 
 .. note::
-   This tutorial is intended for PyTorch versions 1.12 and later. If you are using an earlier version, replace all instances of `size_based_auto_wrap_policy` with `default_auto_wrap_policy`.
+   This tutorial is intended for PyTorch versions 1.12 and later. If you are using an earlier version, replace all instances of `size_based_auto_wrap_policy` with `default_auto_wrap_policy` and `fsdp_auto_wrap_policy` with `auto_wrap_policy`.
 
 .. code-block:: python
 
@@ -308,7 +308,7 @@ We have recorded cuda events to measure the time of FSDP model specifics. The CU
 CUDA event elapsed time on training loop 40.67462890625sec
 
 Wrapping the model with FSDP, the model will look as follows, we can see the model has been wrapped in one FSDP unit.
-Alternatively, we will look at adding the fsdp_auto_wrap_policy next and will discuss the differences.
+Alternatively, we will look at adding the auto_wrap_policy next and will discuss the differences.
 
 .. code-block:: bash
 
@@ -335,12 +335,12 @@ The following is the peak memory usage from FSDP MNIST training on g4dn.12.xlarg
 
 FSDP Peak Memory Usage
 
-Applying *fsdp_auto_wrap_policy* in FSDP otherwise, FSDP will put the entire model in one FSDP unit, which will reduce computation efficiency and memory efficiency.
+Applying *auto_wrap_policy* in FSDP otherwise, FSDP will put the entire model in one FSDP unit, which will reduce computation efficiency and memory efficiency.
 The way it works is that, suppose your model contains 100 Linear layers. If you do FSDP(model), there will only be one FSDP unit which wraps the entire model.
 In that case, the allgather would collect the full parameters for all 100 linear layers, and hence won't save CUDA memory for parameter sharding.
 Also, there is only one blocking allgather call for the all 100 linear layers, there will not be communication and computation overlapping between layers.
 
-To avoid that, you can pass in an fsdp_auto_wrap_policy, which will seal the current FSDP unit and start a new one automatically when the specified condition is met (e.g., size limit).
+To avoid that, you can pass in an auto_wrap_policy, which will seal the current FSDP unit and start a new one automatically when the specified condition is met (e.g., size limit).
 In that way you will have multiple FSDP units, and only one FSDP unit needs to collect full parameters at a time. E.g., suppose you have 5 FSDP units, and each wraps 20 linear layers.
 Then, in the forward, the 1st FSDP unit will allgather parameters for the first 20 linear layers, do computation, discard the parameters and then move on to the next 20 linear layers. So, at any point in time, each rank only materializes parameters/grads for 20 linear layers instead of 100.
 
@@ -358,9 +358,9 @@ Finding an optimal auto wrap policy is challenging, PyTorch will add auto tuning
 model = Net().to(rank)
 
 model = FSDP(model,
-    fsdp_auto_wrap_policy=my_auto_wrap_policy)
+    auto_wrap_policy=my_auto_wrap_policy)
 
-Applying the fsdp_auto_wrap_policy, the model would be as follows:
+Applying the auto_wrap_policy, the model would be as follows:
 
 .. code-block:: bash
 
@@ -411,7 +411,7 @@ In 2.4 we just add it to the FSDP wrapper
 .. code-block:: python
 
 model = FSDP(model,
-    fsdp_auto_wrap_policy=my_auto_wrap_policy,
+    auto_wrap_policy=my_auto_wrap_policy,
     cpu_offload=CPUOffload(offload_params=True))
 
 
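
Every hunk in this file tracks the same rename: the FSDP keyword argument fsdp_auto_wrap_policy became auto_wrap_policy, the spelling used by PyTorch 1.12 and later. A self-contained sketch of the updated call follows; the toy model, the min_num_params=100 threshold, and the single-process CPU gloo group are illustrative assumptions rather than the tutorial's full multi-process setup:

import functools
import os

import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# A single-process group just so FSDP can be constructed in this sketch;
# the tutorial spawns one process per rank with torch.multiprocessing.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Toy model standing in for the tutorial's Net class.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# Seal the current FSDP unit and start a new one whenever a wrapped
# submodule exceeds min_num_params parameters.
my_auto_wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=100)

# Renamed keyword: auto_wrap_policy (older releases used fsdp_auto_wrap_policy).
model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)
print(model)  # printing shows the resulting FSDP units

dist.destroy_process_group()

With a size-based policy, only one FSDP unit needs to materialize its full parameters at a time, which is the memory and overlap benefit the changed paragraphs explain.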

intermediate_source/TCPStore_libuv_backend.rst

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@ Introduction to Libuv TCPStore Backend
 .. grid:: 2
 
 .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
-:class-card: card-prerequisites
+:class-card: card-prerequisites
+
 * What is the new TCPStore backend
 * Compare the new libuv backend against the legacy backend
 * How to enable to use the legacy backend

intermediate_source/pipelining_tutorial.rst

Lines changed: 2 additions & 0 deletions
@@ -12,13 +12,15 @@ APIs.
 .. grid:: 2
 
 .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
+:class-card: card-prerequisites
 
 * How to use ``torch.distributed.pipelining`` APIs
 * How to apply pipeline parallelism to a transformer model
 * How to utilize different schedules on a set of microbatches
 
 
 .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
+:class-card: card-prerequisites
 
 * Familiarity with `basic distributed training <https://pytorch.org/tutorials/beginner/dist_overview.html>`__ in PyTorch
 
