Add torchmultimodal tutorial for flava finetuning #2054
Conversation
✅ Deploy Preview for pytorch-tutorials-preview ready!
######################################################################
# Installations
#
Suggested change:
#
# -----------------
#
# Installations
#
# We will use TextVQA dataset from HuggingFace for this
# tutorial. So we install datasets in addition to TorchMultimodal
Suggested change:
# tutorial. So we install datasets in addition to TorchMultimodal
# tutorial. We install datasets in addition to TorchMultimodal.
#
!wget http://dl.fbaipublicfiles.com/pythia/data/vocab.tar.gz
!tar xf vocab.tar.gz
You added this to the Makefile above I believe
# TODO: replace with install from pip when binary is ready
!git clone https://github.com/facebookresearch/multimodal.git
!pip install -r multimodal/requirements.txt
This should go to requirements.txt - I see only two items in that file, so you can add them to the requirements.txt.
sys.path.append(os.path.join(os.getcwd(),"multimodal"))
sys.path.append(os.getcwd())
!pip install datasets
!pip install transformers
Let's add this instead of lines 30 - 34:
# .. note::
#
#    When running this tutorial in Google Colab, install the required packages by
#    creating a new cell and running the following commands:
#
#    .. code-block::
#
#       !pip install torchmultimodal-nightly
#       !pip install datasets
#       !pip install transformers
Yes, but we want it to be present in the notebook
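One way to satisfy both points - sketched here purely as an illustration, not as part of the PR - is to run the installs inside the notebook but skip them when the packages are already available (for example, when they were installed from requirements.txt):

# Illustrative sketch only: install inside the notebook, but only when a package is missing.
import importlib.util
import subprocess
import sys

for package, module in [
    ("torchmultimodal-nightly", "torchmultimodal"),
    ("datasets", "datasets"),
    ("transformers", "transformers"),
]:
    if importlib.util.find_spec(module) is None:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])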
!tar xf vocab.tar.gz
with open("vocabs/answers_textvqa_more_than_1.txt") as f: |
This should point to where you have downloaded your data - probably 'data/'.
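For illustration, a minimal sketch of what that line could look like after the move (the data/ prefix follows the suggestion above; the variable names are assumptions, not taken from the PR):

# Sketch only: read the answer vocabulary from the extracted archive under data/
# and build the answer-to-index mapping used for the classification labels.
with open("data/vocabs/answers_textvqa_more_than_1.txt") as f:
    vocab = f.readlines()

answer_to_idx = {word.strip(): idx for idx, word in enumerate(vocab)}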
requirements.txt (outdated)
@@ -45,3 +48,7 @@ wget
gym==0.24.0
gym-super-mario-bros==7.3.0
timm

# flava tutorial - multimodal
packaging
I don't think we need packaging anymore.
I removed it in this PR.
# which is a multimodal model for object detection and
# `Omnivore <https://github.com/facebookresearch/multimodal/blob/main/torchmultimodal/models/omnivore.py>`__
# which is multitask model spanning image, video and 3d classification.
#
I can add it in a follow-up PR.
Left some feedback for more detail, and some suggested changes
for _ in range(epochs):
    for idx, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        out = model(text = batch["input_ids"], image = batch["image"], labels = batch["answers"], required_embedding="mm")
What is the required_embedding arg doing? It is not as obvious as the other params, maybe add a note in the plaintext above.
Removed, it's not required.
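For reference, a minimal sketch of the loop after dropping that argument, with the backward and optimizer steps filled in (the assumption that the model output exposes a loss attribute is mine, not something shown in the snippet above):

# Sketch only: finetuning loop without the required_embedding argument.
for _ in range(epochs):
    for idx, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        out = model(text=batch["input_ids"], image=batch["image"], labels=batch["answers"])
        loss = out.loss  # assumes the classification output carries a loss field
        loss.backward()
        optimizer.step()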
Does this need retraining the encoders too, or just the head?
It finetunes the encoders as well.
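For contrast, a hypothetical sketch of head-only finetuning (the parameter-name filter below is an assumption about the model's attribute names, not something confirmed by the PR):

# Sketch only: freeze everything except the classifier so only the head is trained.
import torch

for name, param in model.named_parameters():
    if "classifier" not in name:  # assumed naming; check model.named_parameters() first
        param.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5  # lr chosen arbitrarily
)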
Co-authored-by: Nikita Shulga <nshulga@fb.com>
@ankitade what's the status on this? Can you resolve the merge conflict?
LGTM, just a couple nits
# end examples, aiming to enable and accelerate research in
# multimodality**.
#
# In this tutorial, we will demonstrate how to use a **pretrained SoTA
nit: can we just say state-of-the-art here?
# TorchMultimodal library to finetune on a multimodal task i.e. visual
# question answering** (VQA). The model consists of two unimodal transformer
# based encoders for text and image and a multimodal encoder to combine
# the two embeddings. It is pretrained using contrastive, image text matching and
Can the losses be enumerated in a different way here? I feel the comma placement makes this kinda confusing
requirements.txt (outdated)
@@ -27,6 +26,9 @@ pytorch-lightning
torchx
ax-platform
nbformat>=4.2.0
datasets
transformers
torchmultimodal-nightly
Can this be updated to use stable?
Adding the first tutorial for TorchMultimodal, showing how to finetune FLAVA for VQA.
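For context, a rough sketch of the setup the tutorial builds toward (flava_model_for_classification is the TorchMultimodal builder the tutorial relies on, but the exact import path, variable names, and learning rate here are assumptions rather than the PR's final code):

# Sketch only: build a FLAVA classification model sized to the answer vocabulary
# and an optimizer for full finetuning.
import torch
from torchmultimodal.models.flava.model import flava_model_for_classification  # path may differ by release

# Assumed: `vocab` is the list of answer strings loaded from the vocab file.
model = flava_model_for_classification(num_classes=len(vocab))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)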