test error: auto3dseg_autorunner_ref_api #967

Closed
@wyli

04:20:58  Running ./auto3dseg/notebooks/auto3dseg_autorunner_ref_api.ipynb
04:20:58  Checking PEP8 compliance...
04:20:58  Running notebook...
04:20:58  Before:
04:20:58      "max_epochs = 2000\n",
04:20:58  After:
04:20:58      "max_epochs = 1\n",
04:20:58  MONAI version: 1.0.0+13.g71041fd8
04:20:58  Numpy version: 1.22.4
04:20:58  Pytorch version: 1.10.2+cu102
04:20:58  MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
04:20:58  MONAI rev id: 71041fd89e11104a4ea79f11687a2469f331679c
04:20:58  MONAI __file__: /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/__init__.py
04:20:58  
04:20:58  Optional dependencies:
04:20:58  Pytorch Ignite version: 0.4.8
04:20:58  Nibabel version: 4.0.2
04:20:58  scikit-image version: 0.19.3
04:20:58  Pillow version: 7.0.0
04:20:58  Tensorboard version: 2.10.1
04:20:58  gdown version: 4.5.1
04:20:58  TorchVision version: 0.11.3+cu102
04:20:58  tqdm version: 4.64.0
04:20:58  lmdb version: 1.3.0
04:20:58  psutil version: 5.9.1
04:20:58  pandas version: 1.1.5
04:20:58  einops version: 0.4.1
04:20:58  transformers version: 4.21.3
04:20:58  mlflow version: 1.29.0
04:20:58  pynrrd version: 0.4.3
04:20:58  
04:20:58  For details about installing the optional dependencies, please visit:
04:20:58      https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
04:20:58  
04:20:58  /opt/conda/lib/python3.8/site-packages/papermill/iorw.py:153: UserWarning: the file is not specified with any extension : -
04:20:58    warnings.warn(
05:39:06  
Executing:   0%|          | 0/24 [00:00<?, ?cell/s]
Executing:   4%|▍         | 1/24 [00:01<00:33,  1.45s/cell]
Executing:  12%|█▎        | 3/24 [00:06<00:44,  2.11s/cell]
Executing:  21%|██        | 5/24 [00:09<00:37,  1.96s/cell]
Executing:  46%|████▌     | 11/24 [01:39<02:21, 10.86s/cell]
Executing:  58%|█████▊    | 14/24 [01:48<01:23,  8.32s/cell]
Executing:  88%|████████▊ | 21/24 [1:18:07<17:19, 346.35s/cell]
Executing:  88%|████████▊ | 21/24 [1:18:09<11:09, 223.33s/cell]
05:39:06  /opt/conda/lib/python3.8/site-packages/papermill/iorw.py:153: UserWarning: the file is not specified with any extension : -
05:39:06    warnings.warn(
05:39:06  Traceback (most recent call last):
05:39:06    File "/opt/conda/bin/papermill", line 8, in <module>
05:39:06      sys.exit(papermill())
05:39:06    File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
05:39:06      return self.main(*args, **kwargs)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1053, in main
05:39:06      rv = self.invoke(ctx)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
05:39:06      return ctx.invoke(self.callback, **ctx.params)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
05:39:06      return __callback(*args, **kwargs)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
05:39:06      return f(get_current_context(), *args, **kwargs)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/papermill/cli.py", line 250, in papermill
05:39:06      execute_notebook(
05:39:06    File "/opt/conda/lib/python3.8/site-packages/papermill/execute.py", line 128, in execute_notebook
05:39:06      raise_for_execution_errors(nb, output_path)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/papermill/execute.py", line 232, in raise_for_execution_errors
05:39:06      raise error
05:39:06  papermill.exceptions.PapermillExecutionError: 
05:39:06  ---------------------------------------------------------------------------
05:39:06  Exception encountered at "In [9]":
05:39:06  ---------------------------------------------------------------------------
05:39:06  CalledProcessError                        Traceback (most recent call last)
05:39:06  File /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/apps/auto3dseg/bundle_gen.py:183, in BundleAlgo._run_cmd(self, cmd, devices_info)
05:39:06      182     ps_environ["CUDA_VISIBLE_DEVICES"] = devices_info
05:39:06  --> 183 normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True, capture_output=True)
05:39:06      184 logger.info(repr(normal_out).replace("\\n", "\n").replace("\\t", "\t"))
05:39:06  
05:39:06  File /opt/conda/lib/python3.8/subprocess.py:516, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
05:39:06      515     if check and retcode:
05:39:06  --> 516         raise CalledProcessError(retcode, process.args,
05:39:06      517                                  output=stdout, stderr=stderr)
05:39:06      518 return CompletedProcess(process.args, retcode, stdout, stderr)
05:39:06  
05:39:06  CalledProcessError: Command '['python', './ref_api_work_dir/segresnet_4/scripts/train.py', 'run', "--config_file='./ref_api_work_dir/segresnet_4/configs/transforms_train.yaml','./ref_api_work_dir/segresnet_4/configs/transforms_infer.yaml','./ref_api_work_dir/segresnet_4/configs/hyper_parameters.yaml','./ref_api_work_dir/segresnet_4/configs/network.yaml','./ref_api_work_dir/segresnet_4/configs/transforms_validate.yaml'", '--num_iterations=32', '--num_iterations_per_validation=16', '--num_images_per_batch=2', '--num_epochs=2', '--num_warmup_iterations=16']' returned non-zero exit status 1.
05:39:06  
05:39:06  The above exception was the direct cause of the following exception:
05:39:06  
05:39:06  RuntimeError                              Traceback (most recent call last)
05:39:06  Input In [9], in <cell line: 2>()
05:39:06        2 for task in history:
05:39:06        3     for _, algo in task.items():
05:39:06  ----> 4         algo.train(train_param)  # can use default params by `algo.train()`
05:39:06        5         acc = algo.get_score()
05:39:06        6         algo_to_pickle(algo, template_path=algo.template_path, best_metrics=acc)
05:39:06  
05:39:06  File /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/apps/auto3dseg/bundle_gen.py:200, in BundleAlgo.train(self, train_params)
05:39:06      192 """
05:39:06      193 Load the run function in the training script of each model. Training parameter is predefined by the
05:39:06      194 algo_config.yaml file, which is pre-filled by the fill_template_config function in the same instance.
05:39:06     (...)
05:39:06      197     train_params:  to specify the devices using a list of integers: ``{"CUDA_VISIBLE_DEVICES": [1,2,3]}``.
05:39:06      198 """
05:39:06      199 cmd, devices_info = self._create_cmd(train_params)
05:39:06  --> 200 return self._run_cmd(cmd, devices_info)
05:39:06  
05:39:06  File /home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/apps/auto3dseg/bundle_gen.py:188, in BundleAlgo._run_cmd(self, cmd, devices_info)
05:39:06      186     output = repr(e.stdout).replace("\\n", "\n").replace("\\t", "\t")
05:39:06      187     errors = repr(e.stderr).replace("\\n", "\n").replace("\\t", "\t")
05:39:06  --> 188     raise RuntimeError(f"subprocess call error {e.returncode}: {errors}, {output}") from e
05:39:06      189 return normal_out
05:39:06  
05:39:06  RuntimeError: subprocess call error 1: b'2022-09-30 04:38:21.612872: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
05:39:06  To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
05:39:06  2022-09-30 04:38:21.757287: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
05:39:06  2022-09-30 04:38:21.790175: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
05:39:06  2022-09-30 04:38:22.460274: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library \'libnvinfer.so.7\'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib/python3.8/site-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
05:39:06  2022-09-30 04:38:22.460358: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library \'libnvinfer_plugin.so.7\'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/opt/conda/lib/python3.8/site-packages/torch/lib:/opt/conda/lib/python3.8/site-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
05:39:06  2022-09-30 04:38:22.460366: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
05:39:06  Traceback (most recent call last):
05:39:06    File "./ref_api_work_dir/segresnet_4/scripts/train.py", line 405, in <module>
05:39:06      fire.Fire()
05:39:06    File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
05:39:06      component_trace = _Fire(component, args, parsed_flag_args, context, name)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 466, in _Fire
05:39:06      component, remaining_args = _CallAndUpdateTrace(
05:39:06    File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
05:39:06      component = fn(*varargs, **kwargs)
05:39:06    File "./ref_api_work_dir/segresnet_4/scripts/train.py", line 246, in run
05:39:06      scaler.scale(loss).backward()
05:39:06    File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 299, in backward
05:39:06      return handle_torch_function(
05:39:06    File "/opt/conda/lib/python3.8/site-packages/torch/overrides.py", line 1355, in handle_torch_function
05:39:06      result = torch_func_method(public_api, types, args, kwargs)
05:39:06    File "/home/jenkins/agent/workspace/Monai-notebooks/MONAI/monai/data/meta_tensor.py", line 249, in __torch_function__
05:39:06      ret = super().__torch_function__(func, types, args, kwargs)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 1051, in __torch_function__
05:39:06      ret = func(*args, **kwargs)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/torch/_tensor.py", line 307, in backward
05:39:06      torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
05:39:06    File "/opt/conda/lib/python3.8/site-packages/torch/autograd/__init__.py", line 154, in backward
05:39:06      Variable._execution_engine.run_backward(
05:39:06  RuntimeError: CUDA out of memory. Tried to allocate 882.00 MiB (GPU 0; 31.75 GiB total capacity; 17.00 GiB already allocated; 232.50 MiB free; 17.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
05:39:06  ', b'[info] number of GPUs: 1
05:39:06  [info] world_size: 1
05:39:06  train_files: 32
05:39:06  val_files: 8
05:39:06  num_epochs 2
05:39:06  num_epochs_per_validation 1
05:39:06  [info] training from scratch
05:39:06  [info] amp enabled
05:39:06  ----------
05:39:06  epoch 1/2
05:39:06  learning rate is set to 2e-05
05:39:06  '
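
A rough reading of the OOM message above: the GPU reports 31.75 GiB total, but PyTorch in this training process had reserved only 17.42 GiB and just 232.50 MiB was free. So a large share of the card appears to be held outside this process (possibly another job on the same GPU, or an earlier notebook cell that did not release its memory; the log alone does not say which). The sketch below only does that arithmetic on the message text. `parse_oom` is a hypothetical helper written for this note, not part of MONAI or PyTorch, and the regular expressions assume the message wording shown in this log (PyTorch 1.10).

```python
import re

# Message copied from the traceback in this issue (trailing hint omitted).
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 882.00 MiB "
    "(GPU 0; 31.75 GiB total capacity; 17.00 GiB already allocated; "
    "232.50 MiB free; 17.42 GiB reserved in total by PyTorch)"
)

# Conversion factors from the units PyTorch prints to GiB.
_UNIT_TO_GIB = {"KiB": 1 / 1024**2, "MiB": 1 / 1024, "GiB": 1.0}

# One pattern per field; the wording is an assumption based on this log only.
_PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (KiB|MiB|GiB)",
    "total": r"([\d.]+) (KiB|MiB|GiB) total capacity",
    "allocated": r"([\d.]+) (KiB|MiB|GiB) already allocated",
    "free": r"([\d.]+) (KiB|MiB|GiB) free",
    "reserved": r"([\d.]+) (KiB|MiB|GiB) reserved",
}


def parse_oom(message):
    """Hypothetical helper: return the memory figures of an OOM message in GiB."""
    stats = {}
    for key, pattern in _PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field {key!r} not found in message")
        stats[key] = float(match.group(1)) * _UNIT_TO_GIB[match.group(2)]
    return stats


stats = parse_oom(OOM_MESSAGE)
# Memory that is neither free nor reserved by this PyTorch process.
outside = stats["total"] - stats["reserved"] - stats["free"]
print(f"requested:          {stats['requested']:.2f} GiB")
print(f"reserved by PyTorch: {stats['reserved']:.2f} GiB")
print(f"held elsewhere:      {outside:.2f} GiB")
```

For this log the "held elsewhere" figure comes out at roughly 14.1 GiB, which is why checking `nvidia-smi` on the CI agent for other processes may be worth doing before lowering `num_images_per_batch` in `train_param`.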