
Adding model parallel tutorial #436


Merged: 6 commits into pytorch:master, Apr 16, 2019

Conversation

mrshenli (Contributor)

I feel model parallel itself already has enough meat to make a separate tutorial page. The tutorial on wrapping a model parallel model with DistributedDataParallel will come in my next PR. Let me know if I should merge them into one instead.

netlify bot commented Feb 26, 2019

Deploy preview for pytorch-tutorials-preview ready!

Built with commit 43ed1b9

https://deploy-preview-436--pytorch-tutorials-preview.netlify.com

mrshenli (Contributor, Author)

CC: @soumith @gchanan @dzhulgakov

mrshenli (Contributor, Author)

CI hits the following error:

W: An error occurred during the signature verification. The repository is not updated and the previous index files will be used. GPG error: https://packagecloud.io trusty InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 4E6910DFCB68C9CD

W: Failed to fetch https://packagecloud.io/circleci/trusty/ubuntu/dists/trusty/InRelease  

Zhaoyi-Yan left a comment

It is interesting.

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).cuda(0)


I think we should add self.relu = torch.nn.ReLU() to be consistent with the other toy linear models in the tutorials.
Also, how about using .to(device) instead of .cuda(0)?

Contributor

Yep, I agree with the to(device) -- it's easier to adapt it to non-CUDA examples.
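
A minimal sketch of the revised toy model these two comments suggest, using .to(device) and a named ReLU module; net2's shape and the two device names are illustrative assumptions here, not the tutorial's final code:

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        # .to(device) instead of .cuda(0), plus a named ReLU module,
        # as suggested above.
        self.net1 = nn.Linear(10, 10).to('cuda:0')
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to('cuda:1')  # shape assumed for illustration

    def forward(self, x):
        x = self.relu(self.net1(x.to('cuda:0')))
        # Move the intermediate activation to net2's device before the
        # second linear layer runs.
        return self.net2(x.to('cuda:1'))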

######################################################################
# Note that, the above ``ToyModel`` looks very similar to how one would
# implement it on a single GPU, except the four ``cuda(device)`` calls which
# place linear layers and tensors to on proper devices. That is the only


to on --> on ?

some insights on how to speed up model parallel training.

The high-level idea of model parallel is to place different sub-networks of a
model onto different devices. All input data will run through all devices, but
Contributor

I'm not sure what the "All input data..." sentence is trying to get across. Can you clarify?

It also doesn't seem true; all input data only runs through all devices if that's how the model is set up.

Contributor

It also might be clearer to reverse the perspective here, i.e. that only part of the model operates on each device, rather than that each device only operates on part of the model.


# place in the model that requires changes. The ``backward()`` and
# ``torch.optim`` will automatically take care of gradients as if the
# model is on one GPU. You only need to make sure that the labels are on the
# same device as the outputs when calling the loss function.
Contributor

is it possible to highlight the changes?

mrshenli (Contributor, Author)

@brianjo is there a way to highlight a line in the code? Thanks!
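
For reference, a minimal sketch of the training step the excerpt above describes, assuming the ToyModel from earlier in this thread and illustrative batch shapes; the one model parallel specific line is the one that moves the labels to the outputs' device:

import torch
import torch.nn as nn
import torch.optim as optim

model = ToyModel()                    # net1 on cuda:0, net2 on cuda:1
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))  # outputs end up on cuda:1
# The labels must sit on the same device as the outputs before the loss
# is computed; backward() and the optimizer then work unchanged.
labels = torch.randn(20, 5).to('cuda:1')
loss_fn(outputs, labels).backward()
optimizer.step()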

# :alt:
#
# The result shows that the execution time of model parallel implementation is
# ``4.02/3.75-1=7%`` longer than the existing single-GPU implementation. There
Contributor

you might write something like, "so we can conclude there is roughly 7% overhead in copying tensors back and forth across the GPUs" (or similar).
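
For reference, the arithmetic in the quoted excerpt, with the two mean run times it reports (4.02 for model parallel, 3.75 for single GPU) taken as given:

mp_mean, rn_mean = 4.02, 3.75        # mean run times from the excerpt
overhead = mp_mean / rn_mean - 1.0   # 0.072, i.e. roughly 7%
print(f"model parallel overhead: {overhead:.1%}")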

jspisak (Contributor)

jspisak commented Feb 28, 2019

cc @brianjo We have a new tutorial.. :)

brianjo (Contributor)

brianjo commented Mar 27, 2019

@mrshenli Could you rebase so we can see how this rebuilds? Thanks!

mrshenli (Contributor, Author)

mrshenli commented Apr 9, 2019

Hi @brianjo, how is worker 16 different from the others? It failed to import a local function in the timeit setup statement:

Apr 01 20:14:00 Unexpected failing examples:
Apr 01 20:14:00 /var/lib/jenkins/workspace/intermediate_source/model_parallel_tutorial.py failed leaving traceback:
Apr 01 20:14:00 Traceback (most recent call last):
Apr 01 20:14:00   File "/opt/conda/lib/python3.6/site-packages/sphinx_gallery/gen_rst.py", line 394, in _memory_usage
Apr 01 20:14:00     out = func()
Apr 01 20:14:00   File "/opt/conda/lib/python3.6/site-packages/sphinx_gallery/gen_rst.py", line 382, in __call__
Apr 01 20:14:00     exec(self.code, self.globals)
Apr 01 20:14:00   File "/var/lib/jenkins/workspace/intermediate_source/model_parallel_tutorial.py", line 178, in <module>
Apr 01 20:14:00     mp_run_times = timeit.repeat(stmt, setup, number=1, repeat=num_repeat)
Apr 01 20:14:00   File "/opt/conda/lib/python3.6/timeit.py", line 238, in repeat
Apr 01 20:14:00     return Timer(stmt, setup, timer, globals).repeat(repeat, number)
Apr 01 20:14:00   File "/opt/conda/lib/python3.6/timeit.py", line 206, in repeat
Apr 01 20:14:00     t = self.timeit(number)
Apr 01 20:14:00   File "/opt/conda/lib/python3.6/timeit.py", line 178, in timeit
Apr 01 20:14:00     timing = self.inner(it, self.timer)
Apr 01 20:14:00   File "<timeit-src>", line 3, in inner
Apr 01 20:14:00 ImportError: cannot import name 'train'

brianjo (Contributor)

brianjo commented Apr 10, 2019

> Hi @brianjo, how is worker 16 different from the others? It failed to import a local function in the timeit setup statement:

Not sure. I haven't been able to test this locally as I don't have a multi-gpu machine. Does the source file run for you without errors?

mrshenli (Contributor, Author)

@brianjo yes, the source file runs correctly for me. It seems worker #16 is the only one that complains, and the other 19 workers didn't hit any errors.

brianjo (Contributor)

brianjo commented Apr 10, 2019

> @brianjo yes, the source file runs correctly for me. It seems worker #16 is the only one that complains, and the other 19 workers didn't hit any errors.

The build is sharded so the tutorials run on different machines to save time on the build job. I believe something isn't working in this specific tutorial.

mrshenli (Contributor, Author)

> The build is sharded so the tutorials run on different machines to save time on the build job. I believe something isn't working in this specific tutorial.

I see. Let me try not using string setup commands for timeit.

yf225 (Contributor)

yf225 commented Apr 11, 2019

Pasting the CI error here; it's the same ImportError: cannot import name 'train' traceback quoted above.

It might be that from __main__ import train doesn't work in the sphinx_gallery generation code, and we need to find another way to have the same intended functionality.
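
One such alternative, sketched below: timeit.repeat accepts a globals argument (Python 3.5+), so the string setup and its from __main__ import can be dropped entirely. train and num_repeat here are stand-ins for the tutorial's own definitions:

import timeit

def train():
    pass  # stand-in for the tutorial's training loop

num_repeat = 10
stmt = "train()"

# Passing globals() lets timeit resolve train() directly, avoiding the
# "from __main__ import train" setup string that breaks when
# sphinx_gallery exec()s the file instead of running it as __main__.
mp_run_times = timeit.repeat(stmt, number=1, repeat=num_repeat, globals=globals())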

brianjo (Contributor)

brianjo commented Apr 12, 2019

Hey all, if you want to test this locally, I have a harness project that lets you build tutorials from just a single file: https://github.com/brianjo/pttutorialtest (instructions are in the readme). This should make it easier to debug.

mrshenli (Contributor, Author)

Hi @brianjo, thanks for sharing the tool. The local build works for me, but worker_0 failed with the following error. The .py file takes around 3 minutes to run in my local env; will that be too long for the CI env?

PyTorchDockerImageTag: 291
DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda9-cudnn7-py3:291
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:297: copying bootstrap data to pipe caused \"write init-p: broken pipe\"": unknown.
+ docker cp /home/circleci/project/. b9e0ad9ac27974a6a474476c391b8f87dd78db637963e4922d149cca451750e6:/var/lib/jenkins/workspace
+ export 'COMMAND=((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && ./ci_build_script.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
+ COMMAND='((echo "source ./workspace/env" && echo "sudo chown -R jenkins workspace && cd workspace && ./ci_build_script.sh") | docker exec -u jenkins -i "$id" bash) 2>&1'
+ echo '((echo' '"source' './workspace/env"' '&&' echo '"sudo' chown -R jenkins workspace '&&' cd workspace '&&' './ci_build_script.sh")' '|' docker exec -u jenkins -i '"$id"' 'bash)' '2>&1'
+ unbuffer bash ./command.sh
+ ts
Apr 15 14:14:45 Error response from daemon: Container b9e0ad9ac27974a6a474476c391b8f87dd78db637963e4922d149cca451750e6 is not running
Exited with code 1

Local build outputs:

...
copying static files... done
copying extra files... done
dumping search index in English (code: en) ... done
dumping object inventory... done
build succeeded, 18 warnings.

The HTML pages are in _build/html.

Sphinx-gallery successfully executed 1 out of 1 file subselected by:

    gallery_conf["filename_pattern"] = 'tutorial.py'
    gallery_conf["ignore_pattern"]   = '__init__\\.py'

after excluding 0 files that had previously been run (based on MD5).

embedding documentation hyperlinks...
embedding documentation hyperlinks for beginner... [ 50%] model_parallel_tutorial.html                                                                                
embedding documentation hyperlinks for beginner... [100%] sg_execution_times.html

make[1]: Leaving directory `/data/users/shenli/pttutorialtest'
rm -rf docs
cp -r _build/html docs
touch docs/.nojekyll

mrshenli (Contributor, Author)

Great, the rebase worked! @brianjo @yf225, what is the approval and landing process for tutorials?

jspisak (Contributor)

jspisak commented Apr 16, 2019

Woohoo! @brianjo - are you happy with the copy on this? If so, let’s merge.

brianjo merged commit f52c85d into pytorch:master on Apr 16, 2019
rodrigo-techera pushed a commit to Experience-Monks/tutorials that referenced this pull request on Nov 29, 2021: Adding model parallel tutorial