Description
Following is the attention decoder network used in this tutorial.
But I find it different from the model in this paper, whose main improvement is the attention mechanism.
For example, in the first picture the embedding vectors go through the attention layer, but in the second one the embedding vectors (which I take to be the Xs) first pass through the bidirectional RNN.
The same goes for the dropout applied before the embedding: I don't see any dropout in the paper.
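For reference, this is roughly what I understand the tutorial's decoder to be doing (a condensed sketch, with names and shapes simplified; `MAX_LENGTH` is an assumed cap on the source length): the attention weights come from the embedded decoder input (after dropout) concatenated with the previous hidden state, whereas in the paper the alignment model scores the annotations produced by a bidirectional encoder RNN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LENGTH = 10  # assumed maximum source sentence length


class AttnDecoderRNN(nn.Module):
    """Condensed sketch of the tutorial-style attention decoder."""

    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super().__init__()
        self.embedding = nn.Embedding(output_size, hidden_size)
        self.dropout = nn.Dropout(dropout_p)                      # dropout on the embedding
        self.attn = nn.Linear(hidden_size * 2, max_length)        # attention from embedding + hidden
        self.attn_combine = nn.Linear(hidden_size * 2, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, encoder_outputs):
        # The embedded (and dropped-out) decoder input feeds the attention layer,
        # not the outputs of a bidirectional RNN as in the paper.
        embedded = self.dropout(self.embedding(input).view(1, 1, -1))
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = F.relu(self.attn_combine(output).unsqueeze(0))
        output, hidden = self.gru(output, hidden)
        return F.log_softmax(self.out(output[0]), dim=1), hidden, attn_weights
```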
cc @pytorch/team-text-core @Nayef211