
Commit 29496ca

Update transformer_tutorial.py with description about the need for masking
1 parent 73ef5d0 commit 29496ca

File tree

1 file changed: +7, -2 lines

beginner_source/transformer_tutorial.py

Lines changed: 7 additions & 2 deletions
@@ -41,8 +41,10 @@
 # Along with the input sequence, a square attention mask is required because the
 # self-attention layers in ``nn.TransformerDecoder`` are only allowed to attend
 # the earlier positions in the sequence. For the language modeling task, any
-# tokens on the future positions should be masked. To produce a probability
-# distribution over output words, the output of the ``nn.TransformerEncoder``
+# tokens on the future positions should be masked. This masking, combined with the fact
+# that the output embeddings are offset by one position, ensures that the
+# predictions for position i can depend only on the known outputs at positions less than i.
+# To produce a probability distribution over output words, the output of the ``nn.TransformerEncoder``
 # model is passed through a linear layer to output unnormalized logits.
 # The log-softmax function isn't applied here due to the later use of
 # `CrossEntropyLoss <https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html>`__,
@@ -92,6 +94,9 @@ def forward(self, src: Tensor, src_mask: Tensor = None) -> Tensor:
         src = self.embedding(src) * math.sqrt(self.d_model)
         src = self.pos_encoder(src)
         if src_mask is None:
+            """Generate a square causal mask for the sequence. The masked positions are filled with float('-inf').
+            Unmasked positions are filled with float(0.0).
+            """
             src_mask = nn.Transformer.generate_square_subsequent_mask(len(src)).to(device)
         output = self.transformer_encoder(src, src_mask)
         output = self.linear(output)
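
For reference, here is a minimal sketch (not part of the commit) of the causal mask that ``nn.Transformer.generate_square_subsequent_mask`` produces and how it is passed to the encoder. The sequence length of 5 and the tiny model dimensions (d_model=16, nhead=2) are arbitrary values chosen only for illustration.

import torch
import torch.nn as nn

seq_len = 5  # hypothetical sequence length, for illustration only

# Square causal mask: row i allows attention to columns <= i.
# Future positions hold float('-inf'); allowed positions hold 0.0.
mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
print(mask)
# -> a 5x5 tensor with 0. on and below the diagonal and -inf above it

# The mask is added to the attention scores before the softmax, so the -inf
# entries remove any contribution from future positions. Applying it:
encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=2)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
src = torch.rand(seq_len, 1, 16)   # (sequence, batch, feature)
output = encoder(src, mask=mask)   # position i only attends to positions <= i

Because the -inf entries are added before the softmax, attention weights on future positions become exactly zero, which is what makes each prediction depend only on earlier positions.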
