Closed
04:20:58 Running ./auto3dseg/notebooks/auto3dseg_autorunner_ref_api.ipynb
04:20:58 Checking PEP8 compliance...
04:20:58 Running notebook...
04:20:58 Before:
04:20:58 "max_epochs = 2000\n",
04:20:58 After:
04:20:58 "max_epochs = 1\n",
04:20:58 MONAI version: 1.0.0+13.g71041fd8
04:20:58 Numpy version: 1.22.4
04:20:58 Pytorch version: 1.10.2+cu102
04:20:58 MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
04:20:58 MONAI rev id: 71041fd89e11104a4ea79f11687a2469f331679c
04:20:58 MONAI __file__: /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/__init__.py
04:20:58
04:20:58 Optional dependencies:
04:20:58 Pytorch Ignite version: 0.4.8
04:20:58 Nibabel version: 4.0.2
04:20:58 scikit-image version: 0.19.3
04:20:58 Pillow version: 7.0.0
04:20:58 Tensorboard version: 2.10.1
04:20:58 gdown version: 4.5.1
04:20:58 TorchVision version: 0.11.3+cu102
04:20:58 tqdm version: 4.64.0
04:20:58 lmdb version: 1.3.0
04:20:58 psutil version: 5.9.1
04:20:58 pandas version: 1.1.5
04:20:58 einops version: 0.4.1
04:20:58 transformers version: 4.21.3
04:20:58 mlflow version: 1.29.0
04:20:58 pynrrd version: 0.4.3
04:20:58
04:20:58 For details about installing the optional dependencies, please visit:
04:20:58 https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
04:20:58
04:20:58 /opt/conda/lib/python3.8/site-packages/papermill/iorw.py:153: UserWarning: the file is not specified with any extension : -
04:20:58 warnings.warn(
05:39:06
Executing: 0%| | 0/24 [00:00<?, ?cell/s]
Executing: 4%|▍ | 1/24 [00:01<00:33, 1.45s/cell]
Executing: 12%|█▎ | 3/24 [00:06<00:44, 2.11s/cell]
Executing: 21%|██ | 5/24 [00:09<00:37, 1.96s/cell]
Executing: 46%|████▌ | 11/24 [01:39<02:21, 10.86s/cell]
Executing: 58%|█████▊ | 14/24 [01:48<01:23, 8.32s/cell]
Executing: 88%|████████▊ | 21/24 [1:18:07<17:19, 346.35s/cell]
Executing: 88%|████████▊ | 21/24 [1:18:09<11:09, 223.33s/cell]
05:39:06 /opt/conda/lib/python3.8/site-packages/papermill/iorw.py:153: UserWarning: the file is not specified with any extension : -
05:39:06 warnings.warn(
05:39:06 Traceback (most recent call last):
05:39:06 File "/opt/conda/bin/papermill", line 8, in <module>
05:39:06 sys.exit(papermill())
05:39:06 File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
05:39:06 return self.main(*args, **kwargs)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1053, in main
05:39:06 rv = self.invoke(ctx)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
05:39:06 return ctx.invoke(self.callback, **ctx.params)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
05:39:06 return __callback(*args, **kwargs)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
05:39:06 return f(get_current_context(), *args, **kwargs)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/papermill/cli.py", line 250, in papermill
05:39:06 execute_notebook(
05:39:06 File "/opt/conda/lib/python3.8/site-packages/papermill/execute.py", line 128, in execute_notebook
05:39:06 raise_for_execution_errors(nb, output_path)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/papermill/execute.py", line 232, in raise_for_execution_errors
05:39:06 raise error
05:39:06 papermill.exceptions.PapermillExecutionError:
05:39:06 ---------------------------------------------------------------------------
05:39:06 Exception encountered at "In [9]":
05:39:06 ---------------------------------------------------------------------------
05:39:06 CalledProcessError Traceback (most recent call last)
05:39:06 File /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/apps/auto3dseg/bundle_gen.py:183, in BundleAlgo._run_cmd(self, cmd, devices_info)
05:39:06 182 ps_environ["CUDA_VISIBLE_DEVICES"] = devices_info
05:39:06 --> 183 normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True, capture_output=True)
05:39:06 184 logger.info(repr(normal_out).replace("\\n", "\n").replace("\\t", "\t"))
05:39:06
05:39:06 File /opt/conda/lib/python3.8/subprocess.py:516, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
05:39:06 515 if check and retcode:
05:39:06 --> 516 raise CalledProcessError(retcode, process.args,
05:39:06 517 output=stdout, stderr=stderr)
05:39:06 518 return CompletedProcess(process.args, retcode, stdout, stderr)
05:39:06
05:39:06 CalledProcessError: Command '['python', './ref_api_work_dir/segresnet_4/scripts/train.py', 'run', "--config_file='./ref_api_work_dir/segresnet_4/configs/transforms_train.yaml','./ref_api_work_dir/segresnet_4/configs/transforms_infer.yaml','./ref_api_work_dir/segresnet_4/configs/hyper_parameters.yaml','./ref_api_work_dir/segresnet_4/configs/network.yaml','./ref_api_work_dir/segresnet_4/configs/transforms_validate.yaml'", '--num_iterations=32', '--num_iterations_per_validation=16', '--num_images_per_batch=2', '--num_epochs=2', '--num_warmup_iterations=16']' returned non-zero exit status 1.
05:39:06
05:39:06 The above exception was the direct cause of the following exception:
05:39:06
05:39:06 RuntimeError Traceback (most recent call last)
05:39:06 Input In [9], in <cell line: 2>()
05:39:06 2 for task in history:
05:39:06 3 for _, algo in task.items():
05:39:06 ----> 4 algo.train(train_param) # can use default params by `algo.train()`
05:39:06 5 acc = algo.get_score()
05:39:06 6 algo_to_pickle(algo, template_path=algo.template_path, best_metrics=acc)
05:39:06
05:39:06 File /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/apps/auto3dseg/bundle_gen.py:200, in BundleAlgo.train(self, train_params)
05:39:06 192 """
05:39:06 193 Load the run function in the training script of each model. Training parameter is predefined by the
05:39:06 194 algo_config.yaml file, which is pre-filled by the fill_template_config function in the same instance.
05:39:06 (...)
05:39:06 197 train_params: to specify the devices using a list of integers: ``{"CUDA_VISIBLE_DEVICES": [1,2,3]}``.
05:39:06 198 """
05:39:06 199 cmd, devices_info = self._create_cmd(train_params)
05:39:06 --> 200 return self._run_cmd(cmd, devices_info)
05:39:06
05:39:06 File /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/apps/auto3dseg/bundle_gen.py:188, in BundleAlgo._run_cmd(self, cmd, devices_info)
05:39:06 186 output = repr(e.stdout).replace("\\n", "\n").replace("\\t", "\t")
05:39:06 187 errors = repr(e.stderr).replace("\\n", "\n").replace("\\t", "\t")
05:39:06 --> 188 raise RuntimeError(f"subprocess call error {e.returncode}: {errors}, {output}") from e
05:39:06 189 return normal_out
05:39:06
05:39:06 RuntimeError: subprocess call error 1: b'2022-09-30 04:38:21.612872: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
05:39:06 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
05:39:06 2022-09-30 04:38:21.757287: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
05:39:06 2022-09-30 04:38:21.790175: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
05:39:06 2022-09-30 04:38:22.460274: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library \'libnvinfer.so.7\'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib/python3.8/site-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
05:39:06 2022-09-30 04:38:22.460358: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library \'libnvinfer_plugin.so.7\'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib/python3.8/site-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
05:39:06 2022-09-30 04:38:22.460366: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
05:39:06 Traceback (most recent call last):
05:39:06 File "./ref_api_work_dir/segresnet_4/scripts/train.py", line 405, in <module>
05:39:06 fire.Fire()
05:39:06 File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
05:39:06 component_trace = _Fire(component, args, parsed_flag_args, context, name)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
05:39:06 component, remaining_args = _CallAndUpdateTrace(
05:39:06 File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
05:39:06 component = fn(*varargs, **kwargs)
05:39:06 File "./ref_api_work_dir/segresnet_4/scripts/train.py", line 246, in run
05:39:06 scaler.scale(loss).backward()
05:39:06 File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 299, in backward
05:39:06 return handle_torch_function(
05:39:06 File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1355, in handle_torch_function
05:39:06 result = torch_func_method(public_api, types, args, kwargs)
05:39:06 File "/home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/data/meta_tensor.py", line 249, in __torch_function__
05:39:06 ret = super().__torch_function__(func, types, args, kwargs)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 1051, in __torch_function__
05:39:06 ret = func(*args, **kwargs)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
05:39:06 torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
05:39:06 File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
05:39:06 Variable._execution_engine.run_backward(
05:39:06 RuntimeError: CUDA out of memory. Tried to allocate 882.00 MiB (GPU 0; 31.75 GiB total capacity; 17.00 GiB already allocated; 232.50 MiB free; 17.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
05:39:06 ', b'[info] number of GPUs: 1
05:39:06 [info] world_size: 1
05:39:06 train_files: 32
05:39:06 val_files: 8
05:39:06 num_epochs 2
05:39:06 num_epochs_per_validation 1
05:39:06 [info] training from scratch
05:39:06 [info] amp enabled
05:39:06 ----------
05:39:06 epoch 1/2
05:39:06 learning rate is set to 2e-05
05:39:06 '
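A note on reading the numbers in the final `RuntimeError`: the sketch below parses the out-of-memory message quoted in the log and works out how much of the GPU is neither reserved by this PyTorch process nor reported as free. The helper names (`parse_oom`, `to_gib`) are hypothetical and exist only for this illustration; the figures are copied from the log. The interpretation is an inference from the arithmetic, not something the log states.

```python
import re

# Text of the CUDA OOM error, copied from the subprocess stderr in the log.
oom_message = (
    "CUDA out of memory. Tried to allocate 882.00 MiB (GPU 0; 31.75 GiB total capacity; "
    "17.00 GiB already allocated; 232.50 MiB free; 17.42 GiB reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def to_gib(value, unit):
    """Convert a (number, unit) pair from the message into GiB."""
    return float(value) * UNIT_TO_GIB[unit]


def parse_oom(message):
    """Extract the memory figures from the OOM message (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    return {name: to_gib(*re.search(p, message).groups()) for name, p in patterns.items()}


stats = parse_oom(oom_message)

# Memory that is neither reserved by this PyTorch process nor reported as free.
outside = stats["total"] - stats["reserved"] - stats["free"]
# Memory held by the caching allocator but not backing live tensors.
cached_unused = stats["reserved"] - stats["allocated"]

print(f"requested:          {stats['requested']:.2f} GiB")
print(f"reserved - alloc:   {cached_unused:.2f} GiB")
print(f"outside this proc:  {outside:.2f} GiB")
```

By this arithmetic roughly 14 GiB of the 31.75 GiB card is accounted for by neither this process's reserved pool nor free memory, which would be consistent with something else holding GPU memory while `train.py` ran (another job, or the notebook kernel itself). Reserved exceeds allocated by only about 0.4 GiB, so the `max_split_size_mb` fragmentation hint in the message may not be the relevant remedy here.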