Auto3DSeg fails with 5-fold but not 2-fold on a multi-GPU setup #1884
Unanswered
ashleymicuda asked this question in Q&A
Replies: 1 comment 1 reply
Hi @ashleymicuda, Thanks for sharing this.
I've been trying to run the https://github.com/Project-MONAI/tutorials/blob/main/auto3dseg/notebooks/auto_runner.ipynb tutorial on a dual-GPU setup with my own data (~400 GB). The AutoRunner works and trains when my dataset is split into 2 folds; however, when I increase to 5 folds I get the following error:
Notably, the code runs for a few hours and then crashes. It seems that Auto3DSeg tries to load all of the data before training starts, but I can't figure out why it would run with 2 folds and not 5. I can run 5 folds on a single GPU, which suggests a multiprocessing issue (as the screenshot also suggests). Any ideas on how to run 5-fold training on multiple GPUs, or is there something in the default settings I need to change?
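For context, this is roughly how the run is configured (a minimal sketch with hypothetical paths and placeholder values, following the standard AutoRunner API from the tutorial):

```python
# Minimal sketch of the AutoRunner setup, assuming the standard
# monai.apps.auto3dseg API; paths and the modality value are placeholders.
from monai.apps.auto3dseg import AutoRunner

runner = AutoRunner(
    work_dir="./work_dir",  # hypothetical output directory
    input={
        "task": "segmentation",
        "modality": "MRI",              # placeholder modality
        "datalist": "./datalist.json",  # hypothetical datalist path
        "dataroot": "./dataset",        # hypothetical data root (~400 GB of images)
    },
)
runner.set_num_fold(num_fold=5)  # training completes with num_fold=2, fails with 5
runner.run()
```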
Update: Below is the full error output:
W0110 16:13:01.171000 126811119820800 torch/distributed/run.py:779]
W0110 16:13:01.171000 126811119820800 torch/distributed/run.py:779] *****************************************
W0110 16:13:01.171000 126811119820800 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0110 16:13:01.171000 126811119820800 torch/distributed/run.py:779] *****************************************
[rank1]:[E110 18:13:16.313613784 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200056 milliseconds before timing out.
[rank1]:[E110 18:13:16.327514822 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E110 18:13:16.356363179 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200098 milliseconds before timing out.
[rank0]:[E110 18:13:16.356536197 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E110 18:13:16.006191477 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E110 18:13:16.006209361 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E110 18:13:16.006213178 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E110 18:13:16.018073828 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200098 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77e31dc1ff86 in /home//.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x77e2cf3c88f2 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77e2cf3cf333 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x77e2cf3d171c in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x77e32eedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x77e32fc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x77e32fd26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200098 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77e31dc1ff86 in /home/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x77e2cf3c88f2 in /home//.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77e2cf3cf333 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x77e2cf3d171c in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x77e32eedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x77e32fc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x77e32fd26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77e31dc1ff86 in /home/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x77e2cf05aa84 in /home//.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x77e32eedc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x77e32fc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x77e32fd26850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank1]:[E110 18:13:16.022142650 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E110 18:13:16.022159412 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E110 18:13:16.022163139 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E110 18:13:16.022933349 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76fc89d6df86 in /home/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x76fc3b5c88f2 in /home//.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x76fc3b5cf333 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x76fc3b5d171c in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x76fc9b0dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x76fc9be94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x76fc9bf26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=7200000) ran for 7200056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76fc89d6df86 in /home/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x76fc3b5c88f2 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x76fc3b5cf333 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x76fc3b5d171c in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x76fc9b0dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x76fc9be94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x76fc9bf26850 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76fc89d6df86 in /home//.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5aa84 (0x76fc3b25aa84 in /home/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xdc253 (0x76fc9b0dc253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: + 0x94ac3 (0x76fc9be94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: + 0x126850 (0x76fc9bf26850 in /lib/x86_64-linux-gnu/libc.so.6)
W0110 18:13:16.992000 126811119820800 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 45121 closing signal SIGTERM
E0110 18:13:17.008000 126811119820800 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 45120) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/home/.local/bin/torchrun", line 8, in
sys.exit(main())
File "/home/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/home/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/model/dints_0/scripts/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-01-10_18:13:16
host :
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 45120)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 45120
CalledProcessError Traceback (most recent call last)
Cell In[7], line 1
----> 1 runner.run()
File ~/.local/lib/python3.10/site-packages/monai/apps/auto3dseg/auto_runner.py:878, in AutoRunner.run(self)
876 if len(history) > 0:
877 if not self.hpo:
--> 878 self._train_algo_in_sequence(history)
879 else:
880 self._train_algo_in_nni(history)
File ~/.local/lib/python3.10/site-packages/monai/apps/auto3dseg/auto_runner.py:728, in AutoRunner._train_algo_in_sequence(self, history)
726 algo = algo_dict[AlgoKeys.ALGO]
727 if has_option(algo.train, "device_setting"):
--> 728 algo.train(self.train_params, self.device_setting)
729 else:
730 algo.train(self.train_params)
File ~/Desktop/model/algorithm_templates/dints/scripts/algo.py:496, in DintsAlgo.train(self, train_params, device_setting, search)
494 cmd, devices_info = self._create_cmd(dints_train_params)
495 cmd = "OMP_NUM_THREADS=1 " + cmd
--> 496 return self._run_cmd(cmd, devices_info)
File ~/.local/lib/python3.10/site-packages/monai/apps/auto3dseg/bundle_gen.py:273, in BundleAlgo._run_cmd(self, cmd, devices_info)
271 return _run_cmd_bcprun(cmd, n=self.device_setting["NUM_NODES"], p=self.device_setting["n_devices"])
272 elif int(self.device_setting["n_devices"]) > 1:
--> 273 return _run_cmd_torchrun(
274 cmd, nnodes=1, nproc_per_node=self.device_setting["n_devices"], env=ps_environ, check=True
275 )
276 else:
277 return run_cmd(cmd.split(), run_cmd_verbose=True, env=ps_environ, check=True)
File ~/.local/lib/python3.10/site-packages/monai/auto3dseg/utils.py:502, in _run_cmd_torchrun(cmd, **kwargs)
500 torchrun_list += [f"--{arg}", str(params.pop(arg))]
501 torchrun_list += cmd_list
--> 502 return run_cmd(torchrun_list, run_cmd_verbose=True, **params)
File ~/.local/lib/python3.10/site-packages/monai/utils/misc.py:892, in run_cmd(cmd_list, **kwargs)
890 monai.apps.utils.get_logger("run_cmd").info(f"{cmd_list}") # type: ignore[attr-defined]
891 try:
--> 892 return subprocess.run(cmd_list, **kwargs)
893 except subprocess.CalledProcessError as e:
894 if not debug:
File /usr/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
524 retcode = process.poll()
525 if check and retcode:
--> 526 raise CalledProcessError(retcode, process.args,
527 output=stdout, stderr=stderr)
528 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['torchrun', '--nnodes', '1', '--nproc_per_node', '2', '/home/model/dints_0/scripts/train.py', 'run', "--config_file='/home/model/dints_0/configs/hyper_parameters.yaml,/home/model/dints_0/configs/hyper_parameters_search.yaml,/home/model/dints_0/configs/network.yaml,/home/model/dints_0/configs/network_search.yaml,/home/model/dints_0/configs/transforms_infer.yaml,/home/model/dints_0/configs/transforms_train.yaml,/home/model/dints_0/configs/transforms_validate.yaml'"]' returned non-zero exit status 1.
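Update 2: the watchdog fires on the very first ALLREDUCE (SeqNum=1) after the 2-hour limit (Timeout(ms)=7200000), which looks like one rank never reaching the first collective, e.g. because per-fold data preparation is still running on the other rank. One thing that might be worth trying is a longer distributed timeout. Below is a generic PyTorch sketch of the relevant API only; Auto3DSeg's generated train.py initializes the process group itself, so the equivalent change would presumably have to go into the bundle's training script or config rather than the notebook:

```python
# Generic illustration only: raising the collective timeout so that a slow
# first ALLREDUCE does not trip the NCCL watchdog. This is the plain
# torch.distributed API, not an Auto3DSeg setting.
import datetime
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=6),  # the log above shows a 2 h (7200000 ms) timeout
)
```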