Auto3DSeg fails with 5-fold but not 2-fold on a multi-GPU setup #1884
Unanswered
ashleymicuda asked this question in Q&A
Replies: 1 comment 1 reply
Hi @ashleymicuda, Thanks for sharing this.
I've been trying to run the https://github.com/Project-MONAI/tutorials/blob/main/auto3dseg/notebooks/auto_runner.ipynb tutorial on a dual-GPU setup with my own data (~400 GB). The AutoRunner works and trains when my dataset is split into 2 folds; however, when I increase to 5 folds I get the following error:
Notably, the code runs for a few hours and then crashes. It seems that Auto3DSeg tries to load all of the data before training starts, but I can't figure out why it would run with 2 folds and not 5. I can run 5 folds on a single GPU, which suggests a multiprocessing issue (as the screenshot also suggests). Any ideas on how to run 5-fold training on multiple GPUs, or is there something in the default settings I need to change?
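For context, this is roughly how the run is configured (a minimal sketch with hypothetical paths and placeholder values, following the standard AutoRunner API from the tutorial):

```python
# Minimal sketch of the AutoRunner setup, assuming the standard
# monai.apps.auto3dseg API; paths and the modality value are placeholders.
from monai.apps.auto3dseg import AutoRunner

runner = AutoRunner(
    work_dir="./work_dir",  # hypothetical output directory
    input={
        "task": "segmentation",
        "modality": "MRI",              # placeholder modality
        "datalist": "./datalist.json",  # hypothetical datalist path
        "dataroot": "./dataset",        # hypothetical data root (~400 GB of images)
    },
)
runner.set_num_fold(num_fold=5)  # training completes with num_fold=2, fails with 5
runner.run()
```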
Update: Below is the full error output:
W0110 16:13:01.171000 126811119820800 torch/distributed/run.py:779]
W0110 16:13:01.171000 126811119820800 torch/distributed/run.py:779] *****************************************
W0110 16:13:01.171000 126811119820800 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0110 16:13:01.171000 126811119820800 torch/distributed/run.py:779] *****************************************
[rank1]:[E110 18:13:16.313613784 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200056 milliseconds before timing out.
[rank1]:[E110 18:13:16.327514822 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E110 18:13:16.356363179 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200098 milliseconds before timing out.
[rank0]:[E110 18:13:16.356536197 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E110 18:13:16.006191477 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E110 18:13:16.006209361 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E110 18:13:16.006213178 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E110 18:13:16.018073828 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200098 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77e31dc1ff86 in /home//.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x77e2cf3c88f2 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77e2cf3cf333 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x77e2cf3d171c in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x77e32eedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x77e32fc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x77e32fd26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200098 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77e31dc1ff86 in /home/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x77e2cf3c88f2 in /home//.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77e2cf3cf333 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x77e2cf3d171c in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x77e32eedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x77e32fc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x77e32fd26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77e31dc1ff86 in /home/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x77e2cf05aa84 in /home//.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x77e32eedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x77e32fc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x77e32fd26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E110 18:13:16.022142650 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E110 18:13:16.022159412 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E110 18:13:16.022163139 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E110 18:13:16.022933349 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76fc89d6df86 in /home/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x76fc3b5c88f2 in /home//.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x76fc3b5cf333 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x76fc3b5d171c in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x76fc9b0dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x76fc9be94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x76fc9bf26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76fc89d6df86 in /home/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x76fc3b5c88f2 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x76fc3b5cf333 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x76fc3b5d171c in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x76fc9b0dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x76fc9be94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x76fc9bf26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76fc89d6df86 in /home//.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x76fc3b25aa84 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x76fc9b0dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x76fc9be94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x76fc9bf26850 in /lib/x86_64-linux-gnu/libc.so.6)
W0110 18:13:16.992000 126811119820800 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 45121 closing signal SIGTERM
E0110 18:13:17.008000 126811119820800 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 45120) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/.local/bin/torchrun", line 8, in
sys.exit(main())
File "/home/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/model/dints_0/scripts/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-01-10_18:13:16
host :
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 45120)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 45120
CalledProcessError Traceback (most recent call last)
Cell In[7], line 1
----> 1 runner.run()
File ~/.local/lib/python3.10/site-packages/monai/apps/auto3dseg/auto_runner.py:878, in AutoRunner.run(self)
876 if len(history) > 0:
877 if not self.hpo:
--> 878 self._train_algo_in_sequence(history)
879 else:
880 self._train_algo_in_nni(history)
File ~/.local/lib/python3.10/site-packages/monai/apps/auto3dseg/auto_runner.py:728, in AutoRunner._train_algo_in_sequence(self, history)
726 algo = algo_dict[AlgoKeys.ALGO]
727 if has_option(algo.train, "device_setting"):
--> 728 algo.train(self.train_params, self.device_setting)
729 else:
730 algo.train(self.train_params)
File ~/Desktop/model/algorithm_templates/dints/scripts/algo.py:496, in DintsAlgo.train(self, train_params, device_setting, search)
494 cmd, devices_info = self._create_cmd(dints_train_params)
495 cmd = "OMP_NUM_THREADS=1 " + cmd
--> 496 return self._run_cmd(cmd, devices_info)
File ~/.local/lib/python3.10/site-packages/monai/apps/auto3dseg/bundle_gen.py:273, in BundleAlgo._run_cmd(self, cmd, devices_info)
271 return _run_cmd_bcprun(cmd, n=self.device_setting["NUM_NODES"], p=self.device_setting["n_devices"])
272 elif int(self.device_setting["n_devices"]) > 1:
--> 273 return _run_cmd_torchrun(
274 cmd, nnodes=1, nproc_per_node=self.device_setting["n_devices"], env=ps_environ, check=True
275 )
276 else:
277 return run_cmd(cmd.split(), run_cmd_verbose=True, env=ps_environ, check=True)
File ~/.local/lib/python3.10/site-packages/monai/auto3dseg/utils.py:502, in _run_cmd_torchrun(cmd, **kwargs)
500 torchrun_list += [f"--{arg}", str(params.pop(arg))]
501 torchrun_list += cmd_list
--> 502 return run_cmd(torchrun_list, run_cmd_verbose=True, **params)
File ~/.local/lib/python3.10/site-packages/monai/utils/misc.py:892, in run_cmd(cmd_list, **kwargs)
890 monai.apps.utils.get_logger("run_cmd").info(f"{cmd_list}") # type: ignore[attr-defined]
891 try:
--> 892 return subprocess.run(cmd_list, **kwargs)
893 except subprocess.CalledProcessError as e:
894 if not debug:
File /usr/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
524 retcode = process.poll()
525 if check and retcode:
--> 526 raise CalledProcessError(retcode, process.args,
527 output=stdout, stderr=stderr)
528 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['torchrun', '--nnodes', '1', '--nproc_per_node', '2', '/home/model/dints_0/scripts/train.py', 'run', "--config_file='/home/model/dints_0/configs/hyper_parameters.yaml,/home/model/dints_0/configs/hyper_parameters_search.yaml,/home/model/dints_0/configs/network.yaml,/home/model/dints_0/configs/network_search.yaml,/home/model/dints_0/configs/transforms_infer.yaml,/home/model/dints_0/configs/transforms_train.yaml,/home/model/dints_0/configs/transforms_validate.yaml'"]' returned non-zero exit status 1.
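Update 2: the watchdog fires on the very first ALLREDUCE (SeqNum=1) after the 2-hour limit (Timeout(ms)=7200000), which looks like one rank never reaching the first collective, e.g. because per-fold data preparation is still running on the other rank. One thing that might be worth trying is a longer distributed timeout. Below is a generic PyTorch sketch of the relevant API only; Auto3DSeg's generated train.py initializes the process group itself, so the equivalent change would presumably have to go into the bundle's training script or config rather than the notebook:

```python
# Generic illustration only: raising the collective timeout so that a slow
# first ALLREDUCE does not trip the NCCL watchdog. This is the plain
# torch.distributed API, not an Auto3DSeg setting.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=6),  # the log above shows a 2 h (7200000 ms) timeout
)
```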