[CI] Spawn docker container with 2Gb shmem #2475

Merged
malfet merged 1 commit into main from malfet-patch-1 on Jun 16, 2023

Conversation

malfet (Contributor) commented on Jun 16, 2023

Should prevent crashes during NCCL initialization.

If `data_parallel_tutorial.py` is executed without this option, it crashes with a bus error in `ncclShmOpen` while executing `nn.DataParallel(model)`.
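
For reference, the change amounts to passing `--shm-size` when CI spawns the container. A minimal sketch (hypothetical invocation; the actual workflow wiring differs):

```sh
# Hypothetical sketch of the fix: Docker caps /dev/shm at 64MB by default,
# which is too small for the shared-memory segments NCCL maps during init;
# raise the limit to 2GB when starting the container.
docker run --gpus all --shm-size=2g <image> python data_parallel_tutorial.py
```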

For posterity:

% nvidia-smi 
Fri Jun 16 20:46:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   41C    P0    37W / 150W |    752MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   36C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    38W / 150W |    418MiB /  7680MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

% NCCL_DEBUG=INFO python data_parallel_tutorial.py 
Let's use 4 GPUs!
c825878acf65:32373:32373 [0] NCCL INFO cudaDriverVersion 12010
c825878acf65:32373:32373 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
c825878acf65:32373:32373 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
NCCL version 2.14.3+cuda11.7
c825878acf65:32373:32443 [0] NCCL INFO NET/IB : No device found.
c825878acf65:32373:32443 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
c825878acf65:32373:32443 [0] NCCL INFO Using network Socket
c825878acf65:32373:32445 [2] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Using network Socket
c825878acf65:32373:32444 [1] NCCL INFO Using network Socket
c825878acf65:32373:32446 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
c825878acf65:32373:32445 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
c825878acf65:32373:32443 [0] NCCL INFO Channel 00/02 :    0   1   2   3
c825878acf65:32373:32444 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
c825878acf65:32373:32443 [0] NCCL INFO Channel 01/02 :    0   1   2   3
c825878acf65:32373:32443 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
Bus error (core dumped)

(lldb) bt
* thread #1, name = 'python', stop reason = signal SIGBUS
  * frame #0: 0x00007effcd6b0ded libc.so.6`__memset_avx2_erms at memset-vec-unaligned-erms.S:145
    frame #1: 0x00007eff3985e425 libnccl.so.2`ncclShmOpen(char*, int, void**, void**, int) at shmutils.cc:52
    frame #2: 0x00007eff3985e377 libnccl.so.2`ncclShmOpen(shmPath="/dev/shm/nccl-7dX4mg", shmSize=9637888, shmPtr=0x00007efe4a59ac30, devShmPtr=0x00007efe4a59ac38, create=1) at shmutils.cc:61
    frame #3: 0x00007eff39863322 libnccl.so.2`::shmRecvSetup(comm=<unavailable>, graph=<unavailable>, myInfo=<unavailable>, peerInfo=<unavailable>, connectInfo=0x00007efe57fe3fe0, recv=0x00007efe4a05d2f0, channelId=0, connIndex=0) at shm.cc:110
    frame #4: 0x00007eff398446a4 libnccl.so.2`ncclTransportP2pSetup(ncclComm*, ncclTopoGraph*, int, int*) at transport.cc:33
    frame #5: 0x00007eff398445c0 libnccl.so.2`ncclTransportP2pSetup(comm=0x0000000062355ab0, graph=0x00007efe57fe6a40, connIndex=0, highestTransportType=0x0000000000000000) at transport.cc:89
    frame #6: 0x00007eff398367cd libnccl.so.2`::initTransportsRank(comm=0x0000000062355ab0, commId=<unavailable>) at init.cc:790
    frame #7: 0x00007eff398383fe libnccl.so.2`::ncclCommInitRankFunc(job_=<unavailable>) at init.cc:1089
    frame #8: 0x00007eff3984de07 libnccl.so.2`ncclAsyncJobMain(arg=0x000000006476e6d0) at group.cc:62
    frame #9: 0x00007effce0bf6db libpthread.so.0`start_thread + 219
    frame #10: 0x00007effcd64361f libc.so.6`__GI___clone at clone.S:95
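
The backtrace makes the failure mode clear: `ncclShmOpen` maps a ~9.2MiB file under `/dev/shm` per shared-memory connection (`shmSize=9637888` in frame #2), and the `memset` that zeroes it dies with SIGBUS once it touches pages beyond what the tmpfs backing `/dev/shm` can provide. Docker mounts `/dev/shm` as a 64MB tmpfs by default, which several such segments (4 GPUs, 2 channels) easily overrun. A quick sanity check, assuming a shell inside the container:

```sh
# Show the size of the tmpfs backing /dev/shm; in an unconfigured container
# this reports the 64MB Docker default that NCCL's segments overrun.
df -h /dev/shm
```

(Setting `NCCL_SHM_DISABLE=1` would also avoid the crash by forcing NCCL off its shared-memory transport, but enlarging `/dev/shm` keeps the faster path.)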

pytorch-bot (bot) commented on Jun 16, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/2475

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0ba3e39:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

netlify (bot) commented on Jun 16, 2023

Deploy Preview for pytorch-tutorials-preview ready!

🔨 Latest commit: 0ba3e39
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-tutorials-preview/deploys/648cca9d5e877a00089bf9de
😎 Deploy Preview: https://deploy-preview-2475--pytorch-tutorials-preview.netlify.app

malfet merged commit 3eef691 into main on Jun 16, 2023
malfet deleted the malfet-patch-1 branch on Jun 16, 2023 at 21:09