Adding model parallel tutorial #436
Conversation
Deploy preview for pytorch-tutorials-preview ready! Built with commit 43ed1b9: https://deploy-preview-436--pytorch-tutorials-preview.netlify.com
CI hits the following error:
It is interesting.
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).cuda(0)
I think adding `self.relu = torch.nn.ReLU()` would be consistent with the other toy linear models in the tutorial. Also, how about using `.to(device)` instead of `.cuda(0)`?
Yep, I agree with the `to(device)` -- it's easier to adapt it to non-cuda examples.
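For context, a minimal sketch of what the suggested changes could look like (the second linear layer, its size, and the two-device split are assumptions for illustration, not necessarily the tutorial's final code):

```python
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, dev0='cuda:0', dev1='cuda:1'):
        super(ToyModel, self).__init__()
        self.dev0, self.dev1 = dev0, dev1
        # .to(device) instead of .cuda(0), so the same code adapts to non-CUDA devices
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)  # assumed second layer for a two-device split

    def forward(self, x):
        x = self.relu(self.net1(x.to(self.dev0)))
        # move the intermediate activation to the second device before the second layer
        return self.net2(x.to(self.dev1))
```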
######################################################################
# Note that, the above ``ToyModel`` looks very similar to how one would
# implement it on a single GPU, except the four ``cuda(device)`` calls which
# place linear layers and tensors to on proper devices. That is the only
`to on` --> `on`?
some insights on how to speed up model parallel training.

The high-level idea of model parallel is to place different sub-networks of a
model onto different devices. All input data will run through all devices, but
I'm not sure what the "All input data..." sentence is trying to get across. Can you clarify?
It also doesn't seem true: all input data only runs through all devices if that's how the model is set up.
It also might be clearer to reverse the perspective here, i.e. say that only a part of the model operates on each device, rather than that each device only operates on a part of the model.
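As a rough illustration of the sentence under discussion (not the tutorial's code): for a model that is split sequentially across devices, the same input batch visits every device in turn, one sub-network at a time.

```python
# Illustrative sketch only: `stages` is an assumed list of (module, device) pairs,
# one sub-network per device.
def forward_split(stages, x):
    for module, device in stages:
        # the same input batch runs through every device, one sub-network at a time
        x = module(x.to(device))
    return x
```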
# place in the model that requires changes. The ``backward()`` and
# ``torch.optim`` will automatically take care of gradients as if the
# model is on one GPU. You only need to make sure that the labels are on the
# same device as the outputs when calling the loss function.
is it possible to highlight the changes?
@brianjo is there a way to highlight a line in the code? Thanks!
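To make the change the quoted text describes easier to spot, here is a hedged sketch of the training step (the loss function, learning rate, and label shape are assumptions; the `.to(outputs.device)` move on the labels is the point):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = ToyModel()                    # the two-device model sketched earlier (assumed)
loss_fn = nn.MSELoss()                # assumed loss for illustration
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
# the only extra requirement: labels must live on the same device as the outputs
labels = torch.randn(20, 5).to(outputs.device)
loss_fn(outputs, labels).backward()   # backward() handles gradients across devices
optimizer.step()
```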
# :alt:
#
# The result shows that the execution time of model parallel implementation is
# ``4.02/3.75-1=7%`` longer than the existing single-GPU implementation. There
you might write something like, "so we can conclude there is roughly 7% overhead in copying tensors back and forth across the GPUs" (or similar).
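For clarity, the 7% figure in the quoted text is just the ratio of the two measured times (a quick check using the numbers quoted above):

```python
single_gpu = 3.75       # seconds, single-GPU run (number quoted above)
model_parallel = 4.02   # seconds, model parallel run (number quoted above)
print(f"{model_parallel / single_gpu - 1:.1%}")  # ~7.2%, roughly the copy overhead across GPUs
```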
cc @brianjo We have a new tutorial.. :)
@mrshenli Could you rebase so we can see how this rebuilds? Thanks!
Hi @brianjo, how is worker 16 different from the others? It failed to import a local method in the timeit setup statement:
Not sure. I haven't been able to test this locally as I don't have a multi-gpu machine. Does the source file run for you without errors?
The build is sharded so the tutorials are run on different machines to save time on the build job. Something isn't working in this specific tutorial, I believe.
I see. Let me try not using string setup commands for timeit.
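One possible way to avoid string setup statements, in case it helps: `timeit` accepts a `globals` dict or a plain callable, so the local function never has to be importable on the worker. The `train` helper below is a hypothetical stand-in for the tutorial's training function, not the actual code:

```python
import timeit

def train(model):
    # hypothetical stand-in for the tutorial's training loop
    ...

model = None  # placeholder for the actual model

# instead of setup="from __main__ import train, model", which can fail
# when the build is sharded across workers:
elapsed = timeit.timeit("train(model)", globals=globals(), number=10)

# or skip strings entirely and pass a callable:
elapsed = timeit.timeit(lambda: train(model), number=10)
```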
Pasting CI error here:
It might be that
Hey all, if you want to test this locally, I have a harness project that will allow you to build tutorials with just a single file: https://github.com/brianjo/pttutorialtest (instructions are in the readme). This should make it easier to debug this.
Hi @brianjo, thanks for sharing the tool. Local build works for me, but worker_0 failed with the following error. The
Local build outputs:
Woohoo! @brianjo - are you happy with the copy on this? If so, let's merge.
Adding model parallel tutorial
I feel model parallel itself already has enough meat to make a separate tutorial page. The `DistributedDataParallel` wrapping `ModelParallel` tutorial will come in my next PR. Let me know if I should merge them into one instead.