
[BUG] MAP (Docker image) fails to run on system with GPU of Ampere architecture  #249

Closed
@MMelQin


Describe the bug
I ran the Spleen Seg example app in JupyterLab and built the MONAI App Package (MAP) Docker image. The MAP Docker image fails when run on a system with an NVIDIA RTX A6000 GPU (Ampere architecture).

Steps/Code to reproduce bug

  • On a system with an Ampere GPU
  • Run the 5th MONAI Deploy tutorial as a Jupyter Notebook in JupyterLab (with Python >= 3.7; the Python version requirement is a separate known issue that has been fixed)
  • See the error below
app = AISpleenSegApp()

# app.run(input="dcm", output="output", model="model.ts")
app.run(input="/home/hju/monai-deploy-app-sdk/dcm", output="/home/hju/monai-deploy-app-sdk/output", model="/home/hju/monai-deploy-app-sdk/model.ts")
Going to initiate execution of operator DICOMDataLoaderOperator
Executing operator DICOMDataLoaderOperator (Process ID: 6802, Operator ID: cec9734d-e25e-4f14-b6f7-66bfd8f9b730)
[2022-01-21 20:34:04,558] [WARNING] (root) - No selection rules given; select all series.
[2022-01-21 20:34:04,558] [INFO] (root) - Working on study, instance UID: 1.2.826.0.1.3680043.2.1125.1.67295333199898911264201812221946213
[2022-01-21 20:34:04,559] [INFO] (root) - Working on series, instance UID: 1.2.826.0.1.3680043.2.1125.1.68102559796966796813942775094416763
Done performing execution of operator DICOMDataLoaderOperator

Going to initiate execution of operator DICOMSeriesSelectorOperator
Executing operator DICOMSeriesSelectorOperator (Process ID: 6802, Operator ID: 3445787f-7fed-4d96-84f7-7084edd57123)
Working on study, instance UID: 1.2.826.0.1.3680043.2.1125.1.67295333199898911264201812221946213
Working on series, instance UID: 1.2.826.0.1.3680043.2.1125.1.68102559796966796813942775094416763
Done performing execution of operator DICOMSeriesSelectorOperator

Going to initiate execution of operator DICOMSeriesToVolumeOperator
Executing operator DICOMSeriesToVolumeOperator (Process ID: 6802, Operator ID: 66f4d414-263c-4f79-9787-aff103886c7d)
Done performing execution of operator DICOMSeriesToVolumeOperator

Going to initiate execution of operator SpleenSegOperator
Executing operator SpleenSegOperator (Process ID: 6802, Operator ID: 5dc2958a-367d-4cc9-9d61-a20ce1c4f2d9)
Converted Image object metadata:
SeriesInstanceUID: 1.2.826.0.1.3680043.2.1125.1.68102559796966796813942775094416763, type <class 'str'>
Modality: CT, type <class 'str'>
SeriesDescription: No series description, type <class 'str'>
PatientPosition: HFS, type <class 'str'>
SeriesNumber: 1, type <class 'int'>
row_pixel_spacing: 1.0, type <class 'float'>
col_pixel_spacing: 1.0, type <class 'float'>
depth_pixel_spacing: 1.0, type <class 'float'>
row_direction_cosine: [-1.0, 0.0, 0.0], type <class 'list'>
col_direction_cosine: [0.0, -1.0, 0.0], type <class 'list'>
depth_direction_cosine: [0.0, 0.0, 1.0], type <class 'list'>
dicom_affine_transform: [[-1.  0.  0.  0.]
 [ 0. -1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]], type <class 'numpy.ndarray'>
nifti_affine_transform: [[ 1. -0. -0. -0.]
 [-0.  1. -0. -0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]], type <class 'numpy.ndarray'>
StudyInstanceUID: 1.2.826.0.1.3680043.2.1125.1.67295333199898911264201812221946213, type <class 'str'>
StudyID: SLICER10001, type <class 'str'>
StudyDate: 2019-09-16, type <class 'str'>
StudyTime: 010100.000000, type <class 'str'>
StudyDescription: spleen, type <class 'str'>
AccessionNumber: 1, type <class 'str'>
selection_name: 1.2.826.0.1.3680043.2.1125.1.68102559796966796813942775094416763, type <class 'str'>
/home/hju/anaconda3/envs/monai/lib/python3.7/site-packages/torch/cuda/__init__.py:143: UserWarning: 
NVIDIA RTX A6000 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA RTX A6000 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

  warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_6802/3567561955.py in <module>
      2 
      3 # app.run(input="dcm", output="output", model="model.ts")
----> 4 app.run(input="/home/hju/monai-deploy-app-sdk/dcm", output="/home/hju/monai-deploy-app-sdk/output", model="/home/hju/monai-deploy-app-sdk/model.ts")

~/anaconda3/envs/monai/lib/python3.7/site-packages/monai/deploy/core/application.py in run(self, log_level, input, output, model, workdir, datastore, executor)
    427         datastore_obj = DatastoreFactory.create(app_context.datastore)
    428         executor_obj = ExecutorFactory.create(app_context.executor, {"app": self, "datastore": datastore_obj})
--> 429         executor_obj.run()
    430 
    431     @abstractmethod

~/anaconda3/envs/monai/lib/python3.7/site-packages/monai/deploy/core/executors/single_process_executor.py in run(self)
    123                     + Fore.RESET
    124                 )
--> 125                 op.compute(op_exec_context.input_context, op_exec_context.output_context, op_exec_context)
    126 
    127                 # Execute post_compute()

/tmp/ipykernel_6802/1686202219.py in compute(self, op_input, op_output, context)
     42 
     43         # Now let the built-in operator handles the work with the I/O spec and execution context.
---> 44         infer_operator.compute(op_input, op_output, context)
     45 
     46     def pre_process(self, img_reader) -> Compose:

~/anaconda3/envs/monai/lib/python3.7/site-packages/monai/deploy/operators/monai_seg_inference_operator.py in compute(self, op_input, op_output, context)
    220                         sw_batch_size=sw_batch_size,
    221                         overlap=self.overlap,
--> 222                         predictor=model,
    223                     )
    224                     d = [post_transforms(i) for i in decollate_batch(d)]

~/anaconda3/envs/monai/lib/python3.7/site-packages/monai/inferers/utils.py in sliding_window_inference(inputs, roi_size, sw_batch_size, predictor, overlap, mode, sigma_scale, padding_mode, cval, sw_device, device, *args, **kwargs)
    115     # Create window-level importance map
    116     importance_map = compute_importance_map(
--> 117         get_valid_patch_size(image_size, roi_size), mode=mode, sigma_scale=sigma_scale, device=device
    118     )
    119 

~/anaconda3/envs/monai/lib/python3.7/site-packages/monai/data/utils.py in compute_importance_map(patch_size, mode, sigma_scale, device)
    761     device = torch.device(device)  # type: ignore[arg-type]
    762     if mode == BlendMode.CONSTANT:
--> 763         importance_map = torch.ones(patch_size, device=device).float()
    764     elif mode == BlendMode.GAUSSIAN:
    765         center_coords = [i // 2 for i in patch_size]

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Expected behavior
The example should work on a system with an Ampere GPU.

Environment details (please complete the following information)

  • OS/Platform: Ubuntu 20.04
  • Python Version: 3.7
  • Method of MONAI Deploy App SDK install: Jupyter Notebook in JupyterLab
  • SDK Version: v0.2

Additional context
It is known that since v1.8, torch needs to be installed via pip with a command that targets a specific CUDA version. The App SDK packager uses an NVIDIA PyTorch base image, which has torch pre-installed for the matching CUDA version. However, the packager may not process the app's torch dependency correctly, or may not ensure that torch is still properly installed after the MAP Docker image is built.

Note that the Spleen App does not pin the torch version, e.g.:
@md.env(pip_packages=["monai==0.6.0", "torch>=1.5", "numpy>=1.20", "nibabel"])
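
To make the unpinned dependency easier to spot, here is a small sketch that flags requirement strings lacking an exact `==` pin. The helper name `unpinned` is hypothetical and not part of the App SDK. It uses a simple regular expression rather than a full requirement parser, so treat it as an illustration only.

```python
import re


def unpinned(requirements):
    """Return the names of requirements that are not pinned with '=='."""
    names = []
    for req in requirements:
        # Split the package name from whatever version specifier follows it.
        match = re.match(r"^\s*([A-Za-z0-9_.\-]+)\s*(.*)$", req)
        if match is None:
            continue  # not a plain "name<spec>" string; ignore it here
        name, spec = match.group(1), match.group(2).strip()
        if not spec.startswith("=="):
            names.append(name)
    return names


# The list from the Spleen App's @md.env decorator quoted above.
pip_packages = ["monai==0.6.0", "torch>=1.5", "numpy>=1.20", "nibabel"]
print(unpinned(pip_packages))  # ['torch', 'numpy', 'nibabel']
```

With `torch>=1.5`, pip is free to resolve torch to any later release, which may replace the CUDA-matched build that came with the base image.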
