Adding model parallel tutorial #436
Conversation
Deploy preview for pytorch-tutorials-preview ready! Built with commit 43ed1b9: https://deploy-preview-436--pytorch-tutorials-preview.netlify.com
CI hits the following error:
It is interesting.
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = torch.nn.Linear(10, 10).cuda(0)
I think adding `self.relu = torch.nn.ReLU()` would be consistent with the other toy linear models in the tutorial. Also, how about using `.to(device)` instead of `.cuda(0)`?
Yep, I agree with the `to(device)` -- it's easier to adapt it to non-cuda examples.
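For context, a minimal sketch of what the suggested changes could look like (the second linear layer, its size, and the two-device split are assumptions for illustration, not necessarily the tutorial's final code):

```python
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, dev0='cuda:0', dev1='cuda:1'):
        super(ToyModel, self).__init__()
        self.dev0, self.dev1 = dev0, dev1
        # .to(device) instead of .cuda(0), so the same code adapts to non-CUDA devices
        self.net1 = nn.Linear(10, 10).to(dev0)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to(dev1)  # assumed second layer for a two-device split

    def forward(self, x):
        x = self.relu(self.net1(x.to(self.dev0)))
        # move the intermediate activation to the second device before the second layer
        return self.net2(x.to(self.dev1))
```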
######################################################################
# Note that, the above ``ToyModel`` looks very similar to how one would
# implement it on a single GPU, except the four ``cuda(device)`` calls which
# place linear layers and tensors to on proper devices. That is the only
`to on` --> `on`?
some insights on how to speed up model parallel training.

The high-level idea of model parallel is to place different sub-networks of a
model onto different devices. All input data will run through all devices, but
I'm not sure what the "All input data..." sentence is trying to get across. Can you clarify?
It also doesn't seem true: all input data only runs through all devices if that's how the model is set up.
It also might be clearer to reverse the perspective here, i.e. say that only a part of the model operates on each device, rather than that each device only operates on a part of the model.
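As a rough illustration of the sentence under discussion (not the tutorial's code): for a model that is split sequentially across devices, the same input batch visits every device in turn, one sub-network at a time.

```python
# Illustrative sketch only: `stages` is an assumed list of (module, device) pairs,
# one sub-network per device.
def forward_split(stages, x):
    for module, device in stages:
        # the same input batch runs through every device, one sub-network at a time
        x = module(x.to(device))
    return x
```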
# place in the model that requires changes. The ``backward()`` and
# ``torch.optim`` will automatically take care of gradients as if the
# model is on one GPU. You only need to make sure that the labels are on the
# same device as the outputs when calling the loss function.
is it possible to highlight the changes?
@brianjo is there a way to highlight a line in the code? Thanks!
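To make the change the quoted text describes easier to spot, here is a hedged sketch of the training step (the loss function, learning rate, and label shape are assumptions; the `.to(outputs.device)` move on the labels is the point):

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = ToyModel()                    # the two-device model sketched earlier (assumed)
loss_fn = nn.MSELoss()                # assumed loss for illustration
optimizer = optim.SGD(model.parameters(), lr=0.001)

optimizer.zero_grad()
outputs = model(torch.randn(20, 10))
# the only extra requirement: labels must live on the same device as the outputs
labels = torch.randn(20, 5).to(outputs.device)
loss_fn(outputs, labels).backward()   # backward() handles gradients across devices
optimizer.step()
```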
# :alt:
#
# The result shows that the execution time of model parallel implementation is
# ``4.02/3.75-1=7%`` longer than the existing single-GPU implementation. There
you might write something like, "so we can conclude there is roughly 7% overhead in copying tensors back and forth across the GPUs" (or similar).
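For clarity, the 7% figure in the quoted text is just the ratio of the two measured times (a quick check using the numbers quoted above):

```python
single_gpu = 3.75       # seconds, single-GPU run (number quoted above)
model_parallel = 4.02   # seconds, model parallel run (number quoted above)
print(f"{model_parallel / single_gpu - 1:.1%}")  # ~7.2%, roughly the copy overhead across GPUs
```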
cc @brianjo We have a new tutorial.. :)
@mrshenli Could you rebase so we can see how this rebuilds? Thanks!
Hi @brianjo, how is worker 16 different from the others? It failed to import a local method in the timeit setup statement:
Not sure. I haven't been able to test this locally as I don't have a multi-gpu machine. Does the source file run for you without errors?
The build is sharded so the tutorials are run on different machines to save time on the build job. Something isn't working in this specific tutorial, I believe.
I see. Let me try not using string setup commands for timeit.
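One possible way to avoid string setup statements, in case it helps: `timeit` accepts a `globals` dict or a plain callable, so the local function never has to be importable on the worker. The `train` helper below is a hypothetical stand-in for the tutorial's training function, not the actual code:

```python
import timeit

def train(model):
    # hypothetical stand-in for the tutorial's training loop
    ...

model = None  # placeholder for the actual model

# instead of setup="from __main__ import train, model", which can fail
# when the build is sharded across workers:
elapsed = timeit.timeit("train(model)", globals=globals(), number=10)

# or skip strings entirely and pass a callable:
elapsed = timeit.timeit(lambda: train(model), number=10)
```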
Pasting CI error here:
It might be that
Hey all, if you want to test this locally, I have a harness project that will allow you to build tutorials with just a single file: https://github.com/brianjo/pttutorialtest (instructions are in the readme). This should make it easier to debug this.
Hi @brianjo, thanks for sharing the tool. Local build works for me, but worker_0 failed with the following error. The
Local build outputs:
Woohoo! @brianjo - are you happy with the copy on this? If so, let's merge.
Adding model parallel tutorial
I feel model parallel itself already has enough meat to make a separate tutorial page. The `DistributedDataParallel` wrapping `ModelParallel` tutorial will come in my next PR. Let me know if I should merge them into one instead.