[CI] Spawn docker container with 2Gb shmem #2475

Merged
malfet merged 1 commit into main from malfet-patch-1 on Jun 16, 2023

Conversation

malfet (Contributor) commented on Jun 16, 2023

Should prevent crashes during NCCL initialization.

If `data_parallel_tutorial.py` is executed without this option, it crashes with a bus error in `ncclShmOpen` while executing `nn.DataParallel(model)`.
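
For reference, the change amounts to passing `--shm-size` when CI spawns the container. A minimal sketch (hypothetical invocation; the actual workflow wiring differs):

```sh
# Hypothetical sketch of the fix: Docker caps /dev/shm at 64MB by default,
# which is too small for the shared-memory segments NCCL maps during init;
# raise the limit to 2GB when starting the container.
docker run --gpus all --shm-size=2g <image> python data_parallel_tutorial.py
```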

For posterity:

% nvidia-smi 
Fri Jun 16 20:46:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    37W / 150W |    752MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   36C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

% NCCL_DEBUG=INFO python data_parallel_tutorial.py 
Let's use 4 GPUs!
c825878acf65:32373:32373 [0] NCCL INFO cudaDriverVersion 12010
c825878acf65:32373:32373 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
c825878acf65:32373:32373 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.14.3+cuda11.7
c825878acf65:32373:32443 [0] NCCL INFO NET/IB : No device found.
c825878acf65:32373:32443 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
c825878acf65:32373:32443 [0] NCCL INFO Using network Socket
c825878acf65:32373:32445 [2] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Using network Socket
c825878acf65:32373:32444 [1] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
c825878acf65:32373:32445 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c825878acf65:32373:32443 [0] NCCL INFO Channel 00/02 :    0   1   2   3
c825878acf65:32373:32444 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
c825878acf65:32373:32443 [0] NCCL INFO Channel 01/02 :    0   1   2   3
c825878acf65:32373:32443 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
Bus error (core dumped)

(lldb) bt
* thread #1, name = 'python', stop reason = signal SIGBUS
  * frame #0: 0x00007effcd6b0ded libc.so.6`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007eff3985e425 libnccl.so.2`ncclShmOpen(char*, int, void**, void**, int) at shmutils.cc:52
    frame #2: 0x00007eff3985e377 libnccl.so.2`ncclShmOpen(shmPath="/dev/shm/nccl-7dX4mg", shmSize=9637888, shmPtr=0x00007efe4a59ac30, devShmPtr=0x00007efe4a59ac38, create=1) at shmutils.cc:61
    frame #3: 0x00007eff39863322 libnccl.so.2`::shmRecvSetup(comm=<unavailable>, graph=<unavailable>, myInfo=<unavailable>, peerInfo=<unavailable>, connectInfo=0x00007efe57fe3fe0, recv=0x00007efe4a05d2f0, channelId=0, connIndex=0) at shm.cc:110
    frame #4: 0x00007eff398446a4 libnccl.so.2`ncclTransportP2pSetup(ncclComm*, ncclTopoGraph*, int, int*) at transport.cc:33
    frame #5: 0x00007eff398445c0 libnccl.so.2`ncclTransportP2pSetup(comm=0x0000000062355ab0, graph=0x00007efe57fe6a40, connIndex=0, highestTransportType=0x0000000000000000) at transport.cc:89
    frame #6: 0x00007eff398367cd libnccl.so.2`::initTransportsRank(comm=0x0000000062355ab0, commId=<unavailable>) at init.cc:790
    frame #7: 0x00007eff398383fe libnccl.so.2`::ncclCommInitRankFunc(job_=<unavailable>) at init.cc:1089
    frame #8: 0x00007eff3984de07 libnccl.so.2`ncclAsyncJobMain(arg=0x000000006476e6d0) at group.cc:62
    frame #9: 0x00007effce0bf6db libpthread.so.0`start_thread + 219
    frame #10: 0x00007effcd64361f libc.so.6`__GI___clone at clone.S:95
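
The backtrace makes the failure mode clear: `ncclShmOpen` maps a ~9.2MiB file under `/dev/shm` per shared-memory connection (`shmSize=9637888` in frame #2), and the `memset` that zeroes it dies with SIGBUS once it touches pages beyond what the tmpfs backing `/dev/shm` can provide. Docker mounts `/dev/shm` as a 64MB tmpfs by default, which several such segments (4 GPUs, 2 channels) easily overrun. A quick sanity check, assuming a shell inside the container:

```sh
# Show the size of the tmpfs backing /dev/shm; in an unconfigured container
# this reports the 64MB Docker default that NCCL's segments overrun.
df -h /dev/shm
```

(Setting `NCCL_SHM_DISABLE=1` would also avoid the crash by forcing NCCL off its shared-memory transport, but enlarging `/dev/shm` keeps the faster path.)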

pytorch-bot (bot) commented on Jun 16, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2475

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0ba3e39:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

netlify (bot) commented on Jun 16, 2023

Deploy Preview for pytorch-tutorials-preview ready!

🔨 Latest commit: 0ba3e39
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/648cca9d5e877a00089bf9de
😎 Deploy Preview: https://deploy-preview-2475--pytorch-tutorials-preview.netlify.app

malfet merged commit 3eef691 into main on Jun 16, 2023
malfet deleted the malfet-patch-1 branch on Jun 16, 2023 at 21:09