Commit a8dd4b0

Author: Svetlana Karslioglu
Merge branch 'main' into leslie/add_quantization_2_doc
2 parents ceb9843 + 3b20fe6, commit a8dd4b0

File tree

1 file changed (+9, -2 lines)


beginner_source/transformer_tutorial.py

Lines changed: 9 additions & 2 deletions
@@ -38,8 +38,15 @@
 # of the word (see the next paragraph for more details). The
 # ``nn.TransformerEncoder`` consists of multiple layers of
 # `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
-# To produce a probability distribution over output words, the output of
-# the ``nn.TransformerEncoder`` model is passed through a linear layer.
+# Along with the input sequence, a square attention mask is required because the
+# self-attention layers in ``nn.TransformerEncoder`` are only allowed to attend
+# to the earlier positions in the sequence. For the language modeling task, any
+# tokens in future positions should be masked. To produce a probability
+# distribution over output words, the output of the ``nn.TransformerEncoder``
+# model is passed through a linear layer to output unnormalized logits.
+# The log-softmax function isn't applied here due to the later use of
+# `CrossEntropyLoss <https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html>`__,
+# which requires the inputs to be unnormalized logits.
 #
 
 import math
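As a rough illustration of what the added comments describe, the sketch below builds the square causal mask and passes linear-layer logits directly to CrossEntropyLoss without a log-softmax. It is a minimal standalone example, not part of the tutorial's code: the names seq_len, d_model, vocab_size, hidden, and decoder are made up for illustration, and the generate_square_subsequent_mask helper is assumed to exist as a static method in recent PyTorch releases.

import torch

# Square "subsequent" mask: position i may attend only to positions <= i.
# Future positions get -inf so softmax assigns them zero attention weight.
seq_len = 5
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
# Roughly equivalent helper in recent PyTorch releases (an assumption, check your version):
# mask = torch.nn.Transformer.generate_square_subsequent_mask(seq_len)

# Unnormalized logits from a linear layer go straight into CrossEntropyLoss;
# no log-softmax is applied beforehand, as the updated comment explains.
d_model, vocab_size = 8, 10                      # illustrative sizes
hidden = torch.randn(seq_len, d_model)           # stand-in for nn.TransformerEncoder output
decoder = torch.nn.Linear(d_model, vocab_size)   # hypothetical output projection
logits = decoder(hidden)                         # unnormalized logits
targets = torch.randint(vocab_size, (seq_len,))  # dummy next-token targets
loss = torch.nn.CrossEntropyLoss()(logits, targets)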
