🐛 Bug
Cannot run DDP with 4 GPUs. I get the following error:
Traceback (most recent call last):
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 277, in <module>
    main_distributed()
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 187, in main_distributed
    spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 210, in train
    tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
  File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
    model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/miranda9/miniconda3/envs/automl-meta-learning/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
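The unhandled cuda error itself carries no detail about which CUDA call failed. One way to surface more context is to enable NCCL's own logging before the spawn; a minimal sketch using the standard NCCL_DEBUG / NCCL_DEBUG_SUBSYS environment variables (diagnostic only, not something the run above had set):

```python
# Sketch: enable NCCL's logging in the parent process before mp.spawn so the
# spawned workers inherit it and the failing CUDA call is named in their output.
# NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # very verbose; "INIT,COLL" is a lighter choice
```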
To Reproduce
To reproduce, train any DDP model with 4 GPUs of the type reported above, ('gpu_name', 'GeForce GTX TITAN X'); a minimal sketch of the failing pattern follows.
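Since the full training script is not self-contained, here is a minimal sketch reconstructed from the traceback: spawn workers, initialize the NCCL process group, then wrap the model in DistributedDataParallel the same way move_to_ddp does. The nn.Linear model and the MASTER_ADDR/MASTER_PORT rendezvous values are stand-ins, not values from the real script:

```python
# Minimal repro sketch of the pattern in the traceback. Fails on this machine
# during the DistributedDataParallel wrap with ncclUnhandledCudaError.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def train(rank: int, world_size: int):
    # Rendezvous settings are assumptions for a single-node run.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = torch.nn.Linear(4, 4).cuda(rank)  # trivial stand-in for the real model
    # This is the call that dies with ncclUnhandledCudaError in the traceback:
    model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[rank])
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # fails with 4 GeForce GTX TITAN X GPUs
    mp.spawn(fn=train, args=(world_size,), nprocs=world_size)
```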
Output from nvidia-smi
$ nvidia-smi
Tue Mar 23 17:13:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:04:00.0 Off |                  N/A |
| 22%   25C    P8    14W / 250W |      0MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:05:00.0 Off |                  N/A |
| 22%   36C    P2    69W / 250W |   5602MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:08:00.0 Off |                  N/A |
| 22%   54C    P2    76W / 250W |   1752MiB / 12212MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 00000000:09:00.0 Off |                  N/A |
| 22%   29C    P8    15W / 250W |      0MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX TIT...  Off  | 00000000:84:00.0 Off |                  N/A |
| 22%   24C    P8    14W / 250W |      0MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX TIT...  Off  | 00000000:85:00.0 Off |                  N/A |
| 22%   24C    P8    14W / 250W |      0MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX TIT...  Off  | 00000000:88:00.0 Off |                  N/A |
| 22%   23C    P8    15W / 250W |      0MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX TIT...  Off  | 00000000:89:00.0 Off |                  N/A |
| 22%   24C    P8    16W / 250W |      0MiB / 12212MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    1   N/A  N/A     16689      C   ...3/envs/repaint/bin/python     1610MiB |
|    1   N/A  N/A     28647      C   ...3/envs/repaint/bin/python     3988MiB |
|    2   N/A  N/A     23374      C   python                           1749MiB |
+-----------------------------------------------------------------------------+
Expected behavior
DDP wraps the model and training starts on all 4 GPUs without NCCL errors.
Environment
$ python collect_env.py
Collecting environment information...
PyTorch version: 1.8.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: CentOS Linux 7 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
Clang version: 3.4.2 (tags/RELEASE_34/dot2-final)
CMake version: version 2.8.12.2
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX TITAN X
GPU 2: GeForce GTX TITAN X
GPU 3: GeForce GTX TITAN X
GPU 4: GeForce GTX TITAN X
GPU 5: GeForce GTX TITAN X
GPU 6: GeForce GTX TITAN X
GPU 7: GeForce GTX TITAN X
Nvidia driver version: 450.36.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.8.0
[pip3] torchaudio==0.8.0a0+a751e1d
[pip3] torchmeta==1.6.1
[pip3] torchtext==0.9.0
[pip3] torchvision==0.9.0
[conda] blas                 1.0                           mkl
[conda] cudatoolkit          11.1.1   h6406543_8           conda-forge
[conda] ffmpeg               4.3      hf484d3e_0           pytorch
[conda] mkl                  2020.2   256
[conda] mkl-service          2.3.0    py38he904b0f_0
[conda] mkl_fft              1.2.0    py38h23d657b_0
[conda] mkl_random           1.1.1    py38h0573a6f_0
[conda] numpy                1.19.2   py38h54aff64_0
[conda] numpy-base           1.19.2   py38hfa32c7d_0
[conda] pytorch              1.8.0    py3.8_cuda11.1_cudnn8.0.5_0   pytorch
[conda] torchaudio           0.8.0    py38                 pytorch
[conda] torchmeta            1.6.1    pypi_0               pypi
[conda] torchtext            0.9.0    py38                 pytorch
[conda] torchvision          0.9.0    py38_cu111           pytorch
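One detail the dumps above surface: nvidia-smi reports the driver's CUDA version as 11.0, while this PyTorch build uses CUDA 11.1. A quick way to print the versions PyTorch itself sees, for comparison against the driver (standard torch APIs; the expected outputs in comments come from the report above):

```python
# Cross-check the toolchain versions PyTorch was built against with the
# driver's CUDA version shown by nvidia-smi.
import torch

print(torch.__version__)          # 1.8.0
print(torch.version.cuda)         # 11.1 (CUDA used to build PyTorch)
print(torch.cuda.nccl.version())  # 2708, i.e. NCCL 2.7.8, matching the error
print(torch.cuda.is_available())  # True
```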
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu