A torchtext tutorial to pre-process a non-built-in dataset #2307
✅ Deploy Preview for pytorch-tutorials-preview ready!
Hi @pytorch/team-text-core, @Nayef211, could you please review and approve this pull request? I would be grateful for any feedback or suggestions. Best regards
@anp-scp thanks so much for contributing this detailed tutorial. I've left some nits around variable naming and sentence restructuring. Once these are addressed, I'm happy to accept and merge this tutorial.
Let us assume that we need to prepare a dataset to train a model that can perform English to
German translation. We will use a tab-delimited German - English sentence pairs provided by
the `Tatoeba Project <https://tatoeba.org/en>`_ which can be downloaded from this link: `Click
nit: can we just add the download link to the "this link" text? "Click Here" comes off a bit phishy.
I have made changes as suggested.
the `Tatoeba Project <https://tatoeba.org/en>`_ which can be downloaded from this link: `Click
Here <https://www.manythings.org/anki/deu-eng.zip>`__.

Sentence pairs for other languages can be found in this link:
nit: same comment as above. Let's hyperlink the "this link" text and get rid of the line below.
I have made changes as suggested.
# * `Torchdata 0.6.0 <https://pytorch.org/data/beta/index.html>`_ (Installation instructions: `C\
# lick here <https://github.com/pytorch/data>`__)
# * `Torchtext 0.15.0 <https://pytorch.org/text/stable/index.html>`_ (Installation instructions:\
# `Click here <https://github.com/pytorch/text>`__)
# * Spacy (Docs: `Click here <https://spacy.io/usage>`__)
nit: same as above. Get rid of "Click here" and hyperlink the text directly
I have made changes as suggested.
dataPipe = dp.iter.IterableWrapper([FILE_PATH])
dataPipe = dp.iter.FileOpener(dataPipe, mode='rb')
dataPipe = dataPipe.parse_csv(skip_lines=0, delimiter='\t', as_tuple=True)
nit: let's use snake case for variable names to be consistent with other tutorials. Change dataPipe to data_pipe.
I have made changes as suggested.
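For readers following the thread, here is a minimal sketch of what the renamed pipeline yields: one tuple of fields per tab-delimited row. It uses Python's standard `csv` module as a stand-in rather than torchdata, and the sample rows and the `read_pairs` helper are hypothetical, not taken from the tutorial.

```python
import csv
import io

# Hypothetical sample standing in for the tab-delimited sentence-pair file.
SAMPLE = "Go.\tGeh.\nHi.\tHallo!\n"


def read_pairs(handle):
    """Yield each tab-separated row as a tuple of strings."""
    for row in csv.reader(handle, delimiter="\t"):
        yield tuple(row)


# snake_case name, as requested in the review comment above
data_pipe = read_pairs(io.StringIO(SAMPLE))
pairs = list(data_pipe)
print(pairs)
```

The stand-in only mirrors the shape of the data; in the tutorial the same role is played by the `parse_csv` call shown in the hunk.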
# Data pipes can be thought of something like a dataset object, on which
# we can perform various operations.
# Check `this tutorial <https://pytorch.org/data/beta/dp_tutorial.html>`_ for more details on
# data pipes.
Change to DataPipes
I have made changes as suggested. I have also added the words "DataPipe" and "DataPipes" to en-wordlist.txt.
# We will build vocabulary for both our source and target now.
#
# Let us define a function to get tokens from elements of tuples in the iterator.
# The comments within the function specifies the need and working of it:
This sentence doesn't add too much value. Let's remove.
I have made changes as suggested.
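As a rough illustration of the function the hunk above describes, the sketch below yields a token list for one element (source or target) of each tuple. The name `get_tokens`, the `place` parameter, and the whitespace tokenizer are assumptions for illustration; the tutorial lists spaCy as a dependency, so its actual tokenizer is likely different.

```python
def get_tokens(pairs, place, tokenize=str.split):
    """Yield the token list for element `place` of every tuple in `pairs`.

    `tokenize` defaults to whitespace splitting, which is only a stand-in
    for a real tokenizer.
    """
    for pair in pairs:
        yield tokenize(pair[place])


# Hypothetical sentence pairs: index 0 is the source, index 1 the target.
sample_pairs = [("Go home.", "Geh nach Hause.")]
source_tokens = list(get_tokens(sample_pairs, 0))
target_tokens = list(get_tokens(sample_pairs, 1))
print(source_tokens, target_tokens)
```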
# which we will use on our sentence. Let us take a random sentence and check the working of
# the transform:
nit: rephrase to "and check how the transform works."
I have made changes as suggested.
# * At line 2, we take a source sentence from list that we created from dataPipe at line 1
# * At line 5, we get a transform based on a source vocabulary and apply it to a tokenized
# sentence. Note that transforms take list of words and not a sentence.
# * At line 8, we get the mapping of index to string and then use it get the transformed
# sentence
Remove 2 spaces in front of these bullets so the indentation is consistent with the rest of the tutorial.
I have made changes as suggested.
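The bullets in the hunk above describe a round trip: tokens to indices through a vocabulary transform, then indices back to strings through the index-to-string mapping. A pure-Python sketch of that idea follows. `build_vocab` and `apply_transform` are hypothetical helpers, not the torchtext API, and the `<unk>` fallback for unseen tokens is an assumption.

```python
def build_vocab(token_lists, specials=("<unk>",)):
    """Return (stoi, itos): a token -> index dict and an index -> token list."""
    itos = list(specials)
    seen = set(itos)
    for tokens in token_lists:
        for token in tokens:
            if token not in seen:
                seen.add(token)
                itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos


def apply_transform(tokens, stoi, unk="<unk>"):
    """Map a list of tokens (not a raw sentence) to vocabulary indices."""
    return [stoi.get(token, stoi[unk]) for token in tokens]


stoi, itos = build_vocab([["go", "home"], ["go", "away"]])
ids = apply_transform(["go", "away", "now"], stoi)
recovered = [itos[i] for i in ids]  # index-to-string mapping
print(ids, recovered)
```

Note that the transform receives an already tokenized list, matching the remark in the hunk that transforms take a list of words and not a sentence.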
# * At line 8, we get the mapping of index to string and then use it get the transformed
# sentence
#
# Now we will use functions of `dataPipe` to apply transform to all our sentences.
rephrase to "DataPipe functions"
I have made changes as suggested.
# Some parts of this tutorial was inspired from this article:
# Link: `https://medium.com/@bitdribble/migrate-torchtext-to-the-new-0-9-0-api-1ff1472b5d71\
# <https://medium.com/@bitdribble/migrate-torchtext-to-the-new-0-9-0-api-1ff1472b5d71>`__.
nit: directly link the hyperlink to "this article".
I have made changes as suggested.
beginner_source/torchtext_custom_dataset_tutorial.py
Hi @Nayef211, I have updated the tutorial as per the suggestions.
LGTM thanks for updating the tutorial with the suggestions. Will merge once all CI jobs complete! 😄
This tutorial illustrates the usage of torchtext (0.15.0) on a dataset that is not a built-in dataset in torchtext.
This tutorial shows how to:
cc @pytorch/team-text-core @Nayef211