
Remove legacy torchtext code from translation tutorial #1250

Merged
9 commits merged on Dec 3, 2020

Conversation

zhangguanheng66
Contributor

No description provided.

netlify bot commented Nov 18, 2020

Deploy preview for pytorch-tutorials-preview ready!

Built with commit 087fe98

https://deploy-preview-1250--pytorch-tutorials-preview.netlify.app

    data_.append((de_tensor_, en_tensor_))
  return data_

train_data = data_process(iter(io.open(train_filepaths[0])),
Contributor

I think merging this into the data_process function and then only passing _filepaths cleans this up a bit
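
A minimal sketch of what this suggestion could look like, assuming the function receives only the pair of file paths and opens both files itself. The `de_encode` and `en_encode` parameters and the `toy_encode` helper are hypothetical stand-ins for the tutorial's tokenizer and vocab lookup, and plain lists of ids are used here instead of tensors so the example runs without torch.

```python
import io
import tempfile
from pathlib import Path


def data_process(filepaths, de_encode, en_encode):
    """Open both files internally and return a list of (de_ids, en_ids) pairs.

    `de_encode` and `en_encode` are hypothetical callables that map a raw
    line to a list of token ids (tokenizer plus vocab lookup in the tutorial).
    """
    data = []
    with io.open(filepaths[0], encoding="utf8") as raw_de, \
         io.open(filepaths[1], encoding="utf8") as raw_en:
        for de_line, en_line in zip(raw_de, raw_en):
            data.append((de_encode(de_line.rstrip("\n")),
                         en_encode(en_line.rstrip("\n"))))
    return data


def toy_encode(line):
    # Stand-in encoder: whitespace tokenization, token length as a fake id.
    return [len(tok) for tok in line.split()]


# Two throwaway files standing in for train_filepaths.
tmp_dir = Path(tempfile.mkdtemp())
de_path, en_path = tmp_dir / "train.de", tmp_dir / "train.en"
de_path.write_text("ein Hund\nzwei Katzen\n", encoding="utf8")
en_path.write_text("a dog\ntwo cats\n", encoding="utf8")

train_data = data_process((de_path, en_path), toy_encode, toy_encode)
print(train_data)
```

With this shape the call site shrinks to a single argument per split, and the file handles are closed as soon as processing finishes.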

de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

def data_process(raw_de_iter, raw_en_iter):
Contributor

How long does this function take? How long does the entire preprocessing step take? Should we tokenize asynchronously while training instead?

Contributor Author

offline discussion: we will have a follow-up PR since this PR doesn't focus on the dataloader.
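
For the follow-up, a rough sketch of the two questions raised in this thread: timing the eager preprocessing pass, and deferring tokenization until an example is requested. The class name `LazyTranslationPairs` and the `toy_encode` helper are hypothetical and not part of this PR; the encoder stands in for the tokenizer plus vocab lookup.

```python
import time


class LazyTranslationPairs:
    """Map-style dataset that encodes a sentence pair only when indexed."""

    def __init__(self, de_lines, en_lines, de_encode, en_encode):
        if len(de_lines) != len(en_lines):
            raise ValueError("source and target must have the same length")
        self.de_lines = de_lines
        self.en_lines = en_lines
        self.de_encode = de_encode
        self.en_encode = en_encode

    def __len__(self):
        return len(self.de_lines)

    def __getitem__(self, idx):
        # Tokenization cost is paid here, per example, not up front.
        return (self.de_encode(self.de_lines[idx]),
                self.en_encode(self.en_lines[idx]))


def toy_encode(line):
    # Stand-in for tokenizer + vocab lookup: token length as a fake id.
    return [len(tok) for tok in line.split()]


pairs = LazyTranslationPairs(
    ["ein Hund", "zwei Katzen"], ["a dog", "two cats"],
    toy_encode, toy_encode)

# Time the eager path to get a number for "how long does this take?".
start = time.perf_counter()
eager = [pairs[i] for i in range(len(pairs))]
elapsed = time.perf_counter() - start
print(f"encoded {len(eager)} pairs in {elapsed:.6f}s")
```

An object with `__len__` and `__getitem__` like this can be handed to `torch.utils.data.DataLoader`; with `num_workers` greater than zero, the per-item encoding would then run in worker processes while the model trains. Variable-length pairs would still need a padding collate function.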

en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

def data_process(raw_de_iter, raw_en_iter):
  data_ = []
Contributor

nit: remove the trailing underscores

@zhangguanheng66 changed the title from "[WIP] Remove legacy torchtext code from translation tutorial" to "Remove legacy torchtext code from translation tutorial" on Nov 29, 2020
@zhangguanheng66
Contributor Author

cc @brianjo

data from a well-known dataset containing sentences in both English and German and use it to
train a sequence-to-sequence model with attention that can translate German sentences
into English.

It is based off of
`this tutorial <https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb>`__
from PyTorch community member `Ben Trevett <https://github.com/bentrevett>`__
and was created by `Seth Weidman <https://github.com/SethHWeidman/>`__ with Ben's permission.
with Ben's permission. We update the tutorials by removing some legecy code.
Contributor

nit: legacy instead of legecy

@zhangguanheng66 force-pushed the translation_tutorial branch 2 times, most recently from 0d504ea to 8c817f1, on December 2, 2020 14:58
@brianjo merged commit 133e5b6 into pytorch:master on Dec 3, 2020
rodrigo-techera pushed a commit to Experience-Monks/tutorials that referenced this pull request Nov 29, 2021
* checkpoint

* checkpoint

* minor changes with review's feedback

* fix typo

* Fix ascii decode error

Co-authored-by: Guanheng Zhang <zhangguanheng@devfair0197.h2.fair>
Co-authored-by: Brian Johnson <brianjo@fb.com>
4 participants