Remove legacy torchtext code from translation tutorial #1250
Conversation
Deploy preview for pytorch-tutorials-preview ready! Built with commit 087fe98 https://deploy-preview-1250--pytorch-tutorials-preview.netlify.app
        data_.append((de_tensor_, en_tensor_))
    return data_

train_data = data_process(iter(io.open(train_filepaths[0])),
I think merging this into the data_process function and then only passing _filepaths cleans this up a bit
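A rough sketch of what that suggestion could look like, shown only for illustration (the encoding argument and the loop body are assumptions modeled on the surrounding tutorial code, and de_vocab, en_vocab, de_tokenizer, en_tokenizer, and train_filepaths are assumed to be defined earlier in the tutorial):

import io
import torch

def data_process(filepaths):
    # Open the German and English files here, so callers pass only the filepaths.
    raw_de_iter = iter(io.open(filepaths[0], encoding="utf8"))
    raw_en_iter = iter(io.open(filepaths[1], encoding="utf8"))
    data_ = []
    for raw_de, raw_en in zip(raw_de_iter, raw_en_iter):
        de_tensor_ = torch.tensor([de_vocab[token] for token in de_tokenizer(raw_de)],
                                  dtype=torch.long)
        en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en)],
                                  dtype=torch.long)
        data_.append((de_tensor_, en_tensor_))
    return data_

train_data = data_process(train_filepaths)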
de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

def data_process(raw_de_iter, raw_en_iter):
How long does this function take? How long does the entire preprocessing step take? Should we tokenize asynchronously while training instead?
offline discussion: we will have a follow-up PR since this PR doesn't focus on the dataloader.
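For reference, one possible shape for that follow-up would be to keep raw (German, English) string pairs in the dataset and tokenize inside the DataLoader's collate_fn, so preprocessing overlaps with training. This is purely a sketch: raw_train_data, PAD_IDX, and collate_raw_batch are hypothetical names, not part of this PR, and de_vocab, en_vocab, and the tokenizers are assumed from the tutorial.

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def collate_raw_batch(raw_batch):
    # raw_batch is a list of (German string, English string) pairs; tokenize and
    # numericalize per batch inside the DataLoader instead of up front.
    de_batch, en_batch = [], []
    for raw_de, raw_en in raw_batch:
        de_batch.append(torch.tensor([de_vocab[t] for t in de_tokenizer(raw_de)],
                                     dtype=torch.long))
        en_batch.append(torch.tensor([en_vocab[t] for t in en_tokenizer(raw_en)],
                                     dtype=torch.long))
    return (pad_sequence(de_batch, padding_value=PAD_IDX),
            pad_sequence(en_batch, padding_value=PAD_IDX))

# raw_train_data: a list of (German string, English string) pairs.
train_iter = DataLoader(raw_train_data, batch_size=128, shuffle=True,
                        num_workers=2, collate_fn=collate_raw_batch)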
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

def data_process(raw_de_iter, raw_en_iter):
    data_ = []
nit: remove the trailing underscores
cc @brianjo
data from a well-known dataset containing sentences in both English and German and use it to
train a sequence-to-sequence model with attention that can translate German sentences
into English.

It is based off of
`this tutorial <https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb>`__
from PyTorch community member `Ben Trevett <https://github.com/bentrevett>`__
and was created by `Seth Weidman <https://github.com/SethHWeidman/>`__
with Ben's permission. We update the tutorials by removing some legecy code.
nit: legacy instead of legecy
Force-pushed from 0d504ea to 8c817f1
Force-pushed from a3b3e27 to 2f43983
* checkpoint
* checkpoint
* minor changes with review's feedback
* fix typo
* Fix ascii decode error

Co-authored-by: Guanheng Zhang <zhangguanheng@devfair0197.h2.fair>
Co-authored-by: Brian Johnson <brianjo@fb.com>
No description provided.