
Commit 89f791c

Merge branch 'main' into issue2346
2 parents 0ee1636 + dc448c2

File tree

4 files changed (+20, -4 lines)

_static/img/distributed/fsdp_sharding.png: 91 KB (binary image, rich diff not shown)

beginner_source/transformer_tutorial.py

Lines changed: 10 additions & 3 deletions
@@ -29,7 +29,7 @@

 ######################################################################
 # In this tutorial, we train a ``nn.TransformerEncoder`` model on a
-# language modeling task. Please note that this tutorial does not cover
+# causal language modeling task. Please note that this tutorial does not cover
 # the training of `nn.TransformerDecoder <https://pytorch.org/docs/stable/generated/torch.nn.TransformerDecoder.html#torch.nn.TransformerDecoder>`__, as depicted in
 # the right half of the diagram above. The language modeling task is to assign a
 # probability for the likelihood of a given word (or a sequence of words)

@@ -41,8 +41,10 @@
 # Along with the input sequence, a square attention mask is required because the
 # self-attention layers in ``nn.TransformerDecoder`` are only allowed to attend
 # the earlier positions in the sequence. For the language modeling task, any
-# tokens on the future positions should be masked. To produce a probability
-# distribution over output words, the output of the ``nn.TransformerEncoder``
+# tokens on the future positions should be masked. This masking, combined with the fact that
+# the output embeddings are offset with later positions, ensures that the
+# predictions for position i can depend only on the known outputs at positions less than i.
+# To produce a probability distribution over output words, the output of the ``nn.TransformerEncoder``
 # model is passed through a linear layer to output unnormalized logits.
 # The log-softmax function isn't applied here due to the later use of
 # `CrossEntropyLoss <https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html>`__,

@@ -91,6 +93,11 @@ def forward(self, src: Tensor, src_mask: Tensor = None) -> Tensor:
         """
         src = self.embedding(src) * math.sqrt(self.d_model)
         src = self.pos_encoder(src)
+        if src_mask is None:
+            """Generate a square causal mask for the sequence. The masked positions are filled with float('-inf').
+            Unmasked positions are filled with float(0.0).
+            """
+            src_mask = nn.Transformer.generate_square_subsequent_mask(len(src)).to(device)
         output = self.transformer_encoder(src, src_mask)
         output = self.linear(output)
         return output
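
For reference, this is what the mask generated by the new default looks like. A minimal standalone sketch, not part of the commit; the length-4 sequence is an arbitrary example:

import torch
import torch.nn as nn

# Causal mask for a length-4 sequence: position i may attend only to
# positions <= i. Masked entries are float('-inf'), visible entries 0.0.
mask = nn.Transformer.generate_square_subsequent_mask(4)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

# Added to the raw attention scores before the softmax, the -inf entries
# drive the probability of attending to any future position to exactly zero.
scores = torch.randn(4, 4)
attn = torch.softmax(scores + mask, dim=-1)
print(attn)  # upper triangle (future positions) is 0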

intermediate_source/FSDP_tutorial.rst

Lines changed: 9 additions & 0 deletions
@@ -46,6 +46,15 @@ At a high level FSDP works as follow:
 * Run reduce_scatter to sync gradients
 * Discard parameters.

+One way to view FSDP's sharding is to decompose the DDP gradient all-reduce into reduce-scatter and all-gather. Specifically, during the backward pass, FSDP reduces and scatters gradients, ensuring that each rank possesses a shard of the gradients. Then it updates the corresponding shard of the parameters in the optimizer step. Finally, in the subsequent forward pass, it performs an all-gather operation to collect and combine the updated parameter shards.
+
+.. figure:: /_static/img/distributed/fsdp_sharding.png
+    :width: 100%
+    :align: center
+    :alt: FSDP allreduce
+
+    FSDP Allreduce
+
 How to use FSDP
 --------------
 Here we use a toy model to run training on the MNIST dataset for demonstration purposes. The APIs and logic can be applied to training larger models as well.
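
To make the reduce-scatter/all-gather decomposition concrete, here is a small single-process sketch. Illustrative only: it emulates the arithmetic with plain tensors instead of a real process group (real code would use torch.distributed collectives such as reduce_scatter_tensor and all_gather_into_tensor):

import torch

# Emulate 4 ranks, each holding a full 8-element gradient, as in DDP
# just before gradient synchronization.
world_size, numel = 4, 8
shard_len = numel // world_size
grads = [torch.randn(numel) for _ in range(world_size)]

# DDP: a single all-reduce leaves every rank with the full summed gradient.
allreduced = torch.stack(grads).sum(dim=0)

# FSDP view: reduce-scatter performs the same sum, but rank r keeps only
# its shard of the result and updates only its shard of the parameters.
shards = [allreduced[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]

# The next forward pass all-gathers the updated shards back into the
# full parameter; the combined result matches the all-reduced one.
assert torch.equal(torch.cat(shards), allreduced)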

intermediate_source/tensorboard_profiler_tutorial.py

Lines changed: 1 addition & 1 deletion
@@ -178,7 +178,7 @@ def train(data):
 #

 ######################################################################
-# Open the TensorBoard profile URL in Google Chrome browser or Microsoft Edge browser.
+# Open the TensorBoard profile URL in Google Chrome browser or Microsoft Edge browser (**Safari is not supported**).
 #
 # .. code-block::
 #
