
Commit 632c212

add fast inference tutorial
Signed-off-by: Yiheng Wang <vennw@nvidia.com>
1 parent 4a40380 commit 632c212

File tree

3 files changed (+532, -0 lines changed)

acceleration/README.md

Lines changed: 2 additions & 0 deletions
@@ -4,6 +4,8 @@ Typically, model training is a time-consuming step during deep learning developm
 ### List of notebooks and examples
 #### [fast_model_training_guide](./fast_model_training_guide.md)
 The document introduces details of how to profile the training pipeline, how to analyze the dataset and select suitable algorithms, and how to optimize GPU utilization on a single GPU, multiple GPUs, or even multiple nodes.
+#### [fast_inference_tutorial](./fast_inference_tutorial)
+The example introduces details of how to use GDS, GPU-based transforms, and TensorRT to accelerate inference.
 #### [distributed_training](./distributed_training)
 The examples show how to execute distributed training and evaluation based on 3 different frameworks:
 - PyTorch native `DistributedDataParallel` module with `torchrun`.
Lines changed: 336 additions & 0 deletions
@@ -0,0 +1,336 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright (c) MONAI Consortium \n",
"Licensed under the Apache License, Version 2.0 (the \"License\"); \n",
"you may not use this file except in compliance with the License. \n",
"You may obtain a copy of the License at \n",
"&nbsp;&nbsp;&nbsp;&nbsp;http://www.apache.org/licenses/LICENSE-2.0 \n",
"Unless required by applicable law or agreed to in writing, software \n",
"distributed under the License is distributed on an \"AS IS\" BASIS, \n",
"WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. \n",
"See the License for the specific language governing permissions and \n",
"limitations under the License.\n",
"\n",
"# Fast Inference with MONAI features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tutorial demonstrates the performance comparison between a standard PyTorch training program and a MONAI-optimized inference program. The key features include:\n",
26+
"\n",
27+
"1. **Direct Data Loading**: Load data directly from disk to GPU memory, minimizing data transfer time and improving efficiency.\n",
28+
"2. **GPU-based Preprocessing**: Execute preprocessing transforms directly on the GPU, leveraging its computational power for faster data preparation.\n",
29+
"3. **TensorRT Inference**: Utilize TensorRT for running inference, which optimizes the model for high-performance execution on NVIDIA GPUs.\n",
30+
"\n",
31+
"This tutorial is modified from the `TensorRT_inference_acceleration` tutorial."
32+
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup environment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading data directly from disk to GPU memory requires the `kvikio` library. In addition, this tutorial requires many other dependencies such as `monai`, `torch`, `torch_tensorrt`, `numpy`, `ignite`, `pandas`, `matplotlib`, etc. We recommend using the [MONAI Docker](https://docs.monai.io/en/latest/installation.html#from-dockerhub) image to run this tutorial, which includes pre-configured dependencies and allows you to skip manual installation.\n",
"\n",
"If not using MONAI Docker, install `kvikio` using one of these methods:\n",
"\n",
"- **PyPI Installation** \n",
"  Use the appropriate package for your CUDA version:\n",
"  ```bash\n",
"  pip install kvikio-cu12  # For CUDA 12\n",
"  pip install kvikio-cu11  # For CUDA 11\n",
"  ```\n",
"\n",
"- **Conda/Mamba Installation** \n",
"  Follow the official [KvikIO installation guide](https://docs.rapids.ai/api/kvikio/nightly/install/) for Conda/Mamba installations.\n",
"\n",
"For convenience, we provide the cell below to install all the dependencies (please modify the cell based on your actual CUDA version, and note that only CUDA 11 and CUDA 12 are supported for now)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python -c \"import monai\" || pip install -q \"monai-weekly[nibabel, pydicom, tqdm]\"\n",
"!python -c \"import matplotlib\" || pip install -q matplotlib\n",
"!python -c \"import torch_tensorrt\" || pip install torch_tensorrt\n",
"!python -c \"import kvikio\" || pip install kvikio-cu12\n",
"!python -c \"import ignite\" || pip install pytorch-ignite\n",
"!python -c \"import pandas\" || pip install pandas\n",
"!python -c \"import requests\" || pip install requests\n",
"!python -c \"import fire\" || pip install fire\n",
"!python -c \"import onnx\" || pip install onnx\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import torch\n",
"import torch_tensorrt\n",
"import matplotlib.pyplot as plt\n",
"import monai\n",
"from monai.config import print_config\n",
"from monai.transforms import (\n",
"    EnsureChannelFirstd,\n",
"    EnsureTyped,\n",
"    LoadImaged,\n",
"    Orientationd,\n",
"    Spacingd,\n",
"    ScaleIntensityRanged,\n",
"    Compose,\n",
")\n",
"from monai.data import Dataset, ThreadDataLoader\n",
"import numpy as np\n",
"import copy\n",
"\n",
"print(f\"Torch-TensorRT version: {torch_tensorrt.__version__}.\")\n",
"\n",
"print_config()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare Test Data, Bundle, and TensorRT Model\n",
"\n",
"We provide a helper script, [`prepare_data.py`](./prepare_data.py), to simplify the setup process. This script performs the following tasks:\n",
"\n",
"- **Test Data**: Downloads and extracts the [Medical Segmentation Decathlon Task09 Spleen dataset](http://medicaldecathlon.com/).\n",
"- **Bundle**: Downloads the required `spleen_ct_segmentation` bundle.\n",
"- **TensorRT Model**: Exports the downloaded bundle model to a TensorRT engine-based TorchScript model. By default, the script exports the model using `fp16` precision, but you can modify it to use `fp32` precision if desired.\n",
"\n",
"The script automatically checks for existing data, bundles, and exported models before downloading or exporting. This ensures that repeated executions of the notebook do not result in redundant operations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from utils import prepare_test_datalist, prepare_test_bundle, prepare_tensorrt_model\n",
"\n",
"root_dir = \".\"\n",
"\n",
"train_files = prepare_test_datalist(root_dir)\n",
"bundle_path = prepare_test_bundle(bundle_dir=root_dir, bundle_name=\"spleen_ct_segmentation\")\n",
"trt_model_name = \"model_trt.ts\"\n",
"prepare_tensorrt_model(bundle_path, trt_model_name)"
]
},
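{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the export step inside `prepare_tensorrt_model` roughly corresponds to the `trt_export` API of `monai.bundle`. The cell below is a hedged sketch of that call, not the exact implementation in `utils.py`; the argument values (net id, file paths, precision) are assumptions based on the bundle layout used in this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: roughly what prepare_tensorrt_model does under the hood.\n",
"# The exact arguments in utils.py may differ; values here are assumptions.\n",
"from monai.bundle import trt_export\n",
"\n",
"trt_path = os.path.join(bundle_path, \"models\", trt_model_name)\n",
"if not os.path.exists(trt_path):\n",
"    trt_export(\n",
"        net_id=\"network_def\",\n",
"        filepath=trt_path,\n",
"        ckpt_file=os.path.join(bundle_path, \"models\", \"model.pt\"),\n",
"        meta_file=os.path.join(bundle_path, \"configs\", \"metadata.json\"),\n",
"        config_file=os.path.join(bundle_path, \"configs\", \"inference.json\"),\n",
"        precision=\"fp16\",  # the helper defaults to fp16; switch to \"fp32\" if desired\n",
"    )"
]
},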
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark the end-to-end bundle inference\n",
"\n",
"A variable `benchmark_type` is defined to specify the type of benchmark to run. For a fair comparison, each benchmark type should be run after restarting the notebook kernel.\n",
"\n",
"`benchmark_type` can be one of the following:\n",
"\n",
"- `\"original\"`: benchmark the original bundle inference.\n",
"- `\"trt\"`: benchmark the TensorRT accelerated bundle inference.\n",
"- `\"trt_gds\"`: benchmark the TensorRT accelerated bundle inference with GPU data loading and GPU transforms."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"benchmark_type = \"trt_gds\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `TimerHandler` is defined to benchmark every part of the inference process.\n",
"\n",
"Please refer to `utils.py` for the implementation of `CUDATimer` and `TimerHandler`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from utils import TimerHandler, prepare_workflow, benchmark_workflow"
]
},
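{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration of how such GPU timing can be implemented (the actual `CUDATimer` and `TimerHandler` in `utils.py` may differ), the cell below times a single GPU operation with `torch.cuda.Event`, which correctly accounts for asynchronous kernel execution:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: measure device-side latency with CUDA events.\n",
"start_evt = torch.cuda.Event(enable_timing=True)\n",
"end_evt = torch.cuda.Event(enable_timing=True)\n",
"\n",
"x = torch.randn(1024, 1024, device=\"cuda\")\n",
"start_evt.record()\n",
"y = x @ x  # placeholder workload standing in for an inference step\n",
"end_evt.record()\n",
"torch.cuda.synchronize()  # wait for both events before reading the time\n",
"print(f\"elapsed: {start_evt.elapsed_time(end_evt):.3f} ms\")"
]
},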
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Benchmark the Original Bundle Inference\n",
"\n",
"In this section, the `workflow` runs several iterations to benchmark the latency."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"model_weight = os.path.join(bundle_path, \"models\", \"model.pt\")\n",
"meta_config = os.path.join(bundle_path, \"configs\", \"metadata.json\")\n",
"inference_config = os.path.join(bundle_path, \"configs\", \"inference.json\")\n",
"\n",
"override = {\n",
"    \"dataset#data\": [{\"image\": i} for i in train_files],\n",
"    \"output_postfix\": benchmark_type,\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"if benchmark_type == \"original\":\n",
"\n",
"    workflow = prepare_workflow(inference_config, meta_config, bundle_path, override)\n",
"    torch_timer = TimerHandler()\n",
"    benchmark_df = benchmark_workflow(workflow, torch_timer, benchmark_type)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Benchmark the TensorRT Accelerated Bundle Inference\n",
"In this part, the TensorRT accelerated model is loaded to the `workflow`. The updated `workflow` runs the same iterations as before to benchmark the latency difference. Since the TensorRT accelerated model cannot be loaded through the `CheckpointLoader` and don't have `amp` mode, disable the `CheckpointLoader` in the `initialize` of the `workflow` and the `amp` parameter in the `evaluator` of the `workflow` needs to be set to `False`."
237+
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"if benchmark_type == \"trt\":\n",
"    trt_model_path = os.path.join(bundle_path, \"models\", trt_model_name)\n",
"    trt_model = torch.jit.load(trt_model_path)\n",
"\n",
"    override[\"load_pretrain\"] = False\n",
"    override[\"network_def\"] = trt_model\n",
"    override[\"evaluator#amp\"] = False\n",
"\n",
"    workflow = prepare_workflow(inference_config, meta_config, bundle_path, override)\n",
"    trt_timer = TimerHandler()\n",
"    benchmark_df = benchmark_workflow(workflow, trt_timer, benchmark_type)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Benchmark the TensorRT Accelerated Bundle Inference with GPU Data Loading and GPU-based Transforms\n",
"\n",
"In the previous section, the inference workflow utilized CPU-based transforms. In this section, we enhance performance by leveraging GPU acceleration:\n",
"\n",
"- **GPU Direct Storage (GDS)**: The `LoadImaged` transform enables GDS on `.nii` and `.dcm` files via specifying `to_gpu=True`.\n",
267+
"- **GPU-based Transforms**: After GDS, subsequent preprocessing transforms are executed directly on the GPU."
268+
]
269+
},
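{
"cell_type": "markdown",
"metadata": {},
"source": [
"Under the hood, GDS copies file bytes straight into device memory, skipping the CPU bounce buffer. The cell below is an optional, illustrative sketch of a raw GDS read using the `kvikio` API directly; it assumes `cupy` is installed and is not required for the benchmark (`LoadImaged(..., to_gpu=True)` handles this internally):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: read raw file bytes directly into GPU memory with kvikio.\n",
"import cupy\n",
"import kvikio\n",
"\n",
"path = train_files[0]\n",
"nbytes = os.path.getsize(path)\n",
"gpu_buf = cupy.empty(nbytes, dtype=cupy.uint8)\n",
"with kvikio.CuFile(path, \"r\") as f:\n",
"    f.read(gpu_buf)  # bytes land in device memory without a host copy\n",
"print(f\"read {nbytes} bytes of {path} directly to GPU\")"
]
},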
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"transforms = Compose([\n",
" LoadImaged(keys=\"image\", reader=\"NibabelReader\", to_gpu=False),\n",
278+
" EnsureTyped(keys=\"image\", device=torch.device(\"cuda:0\")),\n",
279+
" EnsureChannelFirstd(keys=\"image\"),\n",
280+
" Orientationd(keys=\"image\", axcodes=\"RAS\"),\n",
281+
" Spacingd(keys=\"image\", pixdim=[1.5, 1.5, 2.0], mode=\"bilinear\"),\n",
282+
" ScaleIntensityRanged(keys=\"image\", a_min=-57, a_max=164, b_min=0, b_max=1, clip=True),\n",
283+
"])\n",
284+
"\n",
285+
"dataset = Dataset(data=[{\"image\": i} for i in train_files], transform=transforms)\n",
286+
"dataloader = ThreadDataLoader(dataset, batch_size=1, shuffle=False, num_workers=0)"
287+
]
288+
},
289+
{
290+
"cell_type": "code",
291+
"execution_count": null,
292+
"metadata": {},
293+
"outputs": [],
294+
"source": [
295+
"if benchmark_type == \"trt_gds\":\n",
296+
"\n",
297+
" trt_model_path = os.path.join(bundle_path, \"models\", \"model_trt.ts\")\n",
298+
" trt_model = torch.jit.load(trt_model_path)\n",
299+
" override = {\n",
300+
" \"output_postfix\": benchmark_type,\n",
301+
" \"load_pretrain\": False,\n",
302+
" \"network_def\": trt_model,\n",
303+
" \"evaluator#amp\": False,\n",
304+
" \"preprocessing\": transforms,\n",
305+
" \"dataset\": dataset,\n",
306+
" \"dataloader\": dataloader,\n",
307+
" }\n",
308+
"\n",
309+
" workflow = prepare_workflow(inference_config, meta_config, bundle_path, override)\n",
310+
" trt_gpu_trans_timer = TimerHandler()\n",
311+
" benchmark_df = benchmark_workflow(workflow, trt_gpu_trans_timer, benchmark_type)"
312+
]
313+
}
314+
],
315+
"metadata": {
316+
"kernelspec": {
317+
"display_name": "kvikio_env",
318+
"language": "python",
319+
"name": "python3"
320+
},
321+
"language_info": {
322+
"codemirror_mode": {
323+
"name": "ipython",
324+
"version": 3
325+
},
326+
"file_extension": ".py",
327+
"mimetype": "text/x-python",
328+
"name": "python",
329+
"nbconvert_exporter": "python",
330+
"pygments_lexer": "ipython3",
331+
"version": "3.10.14"
332+
}
333+
},
334+
"nbformat": 4,
335+
"nbformat_minor": 4
336+
}
