From 550a28f0edcbcc0fc06759ee262dc849c468b239 Mon Sep 17 00:00:00 2001
From: frasertajima <69366820+frasertajima@users.noreply.github.com>
Date: Wed, 31 May 2023 10:59:40 -0700
Subject: [PATCH] Update transformer_tutorial.py
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix for "perhaps there is a misprint at line 40" (#2111). A review of the
referenced paper (https://arxiv.org/pdf/1706.03762.pdf, section 3.2.3)
supports the change:

"Similarly, self-attention layers in the decoder allow each position in the
decoder to attend to all positions in the decoder up to and including that
position. We need to prevent leftward information flow in the decoder to
preserve the auto-regressive property. We implement this inside of scaled
dot-product attention by masking out (setting to −∞) all values in the input
of the softmax which correspond to illegal connections. See Figure 2."

Thus the suggested change of the reference from nn.TransformerEncoder to
nn.TransformerDecoder seems reasonable.
---
 beginner_source/transformer_tutorial.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/beginner_source/transformer_tutorial.py b/beginner_source/transformer_tutorial.py
index d93b3d55fe7..57d1f8d8591 100644
--- a/beginner_source/transformer_tutorial.py
+++ b/beginner_source/transformer_tutorial.py
@@ -37,7 +37,7 @@
 # ``nn.TransformerEncoder`` consists of multiple layers of
 # `nn.TransformerEncoderLayer <https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html>`__.
 # Along with the input sequence, a square attention mask is required because the
-# self-attention layers in ``nn.TransformerEncoder`` are only allowed to attend
+# self-attention layers in ``nn.TransformerDecoder`` are only allowed to attend
 # the earlier positions in the sequence. For the language modeling task, any
 # tokens on the future positions should be masked. To produce a probability
 # distribution over output words, the output of the ``nn.TransformerEncoder``
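
The mask the quoted passage describes is the additive causal mask this comment
refers to: -inf on future positions, 0 elsewhere, added to the attention
logits before the softmax. A minimal sketch of building such a mask (the
helper name below is illustrative; recent PyTorch also ships an equivalent as
the static method nn.Transformer.generate_square_subsequent_mask):

import torch

def square_subsequent_mask(sz: int) -> torch.Tensor:
    # Position i may attend to positions <= i; illegal (future) connections
    # are set to -inf so their weight after the softmax is exactly 0.
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

print(square_subsequent_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])

Masking the logits rather than zeroing the attention weights keeps each
softmax row normalized over the legal positions only, which is exactly the
"setting to −∞ ... in the input of the softmax" construction from the paper.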