
Commit 0391b9d

zmsunnyday and menzhang authored
Full gpu inference pipeline (#551)
* add full_gpu_inference_pipeline example
* Update README.md
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update README.md
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update README.md
* Update config.pbtxt
* Update config.pbtxt
* Update model.py
* Update README.md
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update README.md
* Update client.ipynb
* code style fix
* fixed ipynb codestyle
* Update README.md
* replace personal repo links with projec-MONAI links
* Delete full_gpu_inference_pipeline/triton_models directory
* add download_model_repo.sh
* Update README.md

Co-authored-by: menzhang <menzhang@nvidia.com>
1 parent 74ff6f7 commit 0391b9d

6 files changed: +439, -0 lines

full_gpu_inference_pipeline/README.md

Lines changed: 184 additions & 0 deletions
# Deploy MONAI pipeline by Triton and run the full pipeline on GPU step by step

- [Deploy MONAI pipeline by Triton and run the full pipeline on GPU step by step](#deploy-monai-pipeline-by-triton-and-run-the-full-pipeline-on-gpu-step-by-step)
  * [Overview](#overview)
  * [Prepare the model repository](#prepare-the-model-repository)
    + [Prepare the model repository file directories](#prepare-the-model-repository-file-directories)
  * [Environment Setup](#environment-setup)
    + [Setup Triton environment](#setup-triton-environment)
    + [Setup python execution environment](#setup-python-execution-environment)
  * [Run Triton server](#run-triton-server)
  * [Run Triton client](#run-triton-client)
  * [Benchmark](#benchmark)
    + [Understanding the benchmark output](#understanding-the-benchmark-output)
    + [HTTP vs. gRPC vs. shared memory](#http-vs-grpc-vs-shared-memory)
    + [Pre/Post-processing on GPU vs. CPU](#pre-post-processing-on-gpu-vs-cpu)
## Overview

This example implements a 3D medical imaging AI inference pipeline using MONAI models and transforms, and deploys the pipeline with Triton. The goal is to measure how different features of MONAI and Triton affect the performance of medical imaging AI inference.

In this repository, I will try the following features (a minimal sketch of how they fit together follows this list):
- [Python backend BLS](https://github.com/triton-inference-server/python_backend#business-logic-scripting) (Triton), which allows you to execute inference requests on other models being served by Triton as part of executing your Python model.
- Transforms on GPU (MONAI), which let you compose GPU-accelerated pre/post-processing chains.
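
To make the interplay concrete, here is a minimal, hypothetical sketch of a Python backend `model.py` that keeps tensors on GPU, applies a MONAI transform, and forwards the result to the TorchScript model via BLS. The tensor names (`INPUT0`, `OUTPUT0`, `INPUT__0`, `OUTPUT__0`) and transform parameters are assumptions for illustration; the actual `spleen_seg/1/model.py` in this example defines its own pipeline.

```python
# Hypothetical sketch of a Triton Python backend model using MONAI GPU
# transforms and BLS; tensor names and transform parameters are assumptions.
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import triton_python_backend_utils as pb_utils
from monai.transforms import Compose, ScaleIntensityRange


class TritonPythonModel:
    def initialize(self, args):
        # GPU-friendly pre-processing chain (parameters are illustrative only).
        self.pre = Compose(
            [ScaleIntensityRange(a_min=-57, a_max=164, b_min=0.0, b_max=1.0, clip=True)]
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            # With FORCE_CPU_ONLY_INPUT_TENSORS set to "no", this tensor may already be on GPU.
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            image = from_dlpack(in0.to_dlpack()).cuda().float()
            image = self.pre(image)  # MONAI pre-processing on GPU

            # BLS call: run the TorchScript segmentation model served by Triton.
            infer_request = pb_utils.InferenceRequest(
                model_name="segmentation_3d",
                requested_output_names=["OUTPUT__0"],
                inputs=[pb_utils.Tensor.from_dlpack("INPUT__0", to_dlpack(image))],
            )
            infer_response = infer_request.exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())
            seg = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT__0")

            # Post-processing (e.g. argmax, largest component) would also run on GPU here.
            out0 = pb_utils.Tensor.from_dlpack("OUTPUT0", seg.to_dlpack())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses
```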
Before starting, I highly recommend reading the following two links to get familiar with the basic features of the Triton python backend and MONAI:
- https://github.com/triton-inference-server/python_backend
- https://github.com/Project-MONAI/tutorials/blob/master/acceleration/fast_model_training_guide.md
## Prepare the model repository
The full pipeline is shown below:

<img src="https://github.com/Project-MONAI/tutorials/raw/master/full_gpu_inference_pipeline/pics/Picture3.png">

### Prepare the model repository file directories
The Triton model repository for this experiment can be set up quickly by:
```bash
git clone https://github.com/Project-MONAI/tutorials.git
cd tutorials/full_gpu_inference_pipeline
bash download_model_repo.sh
```
The model repository is in the folder `triton_models`. The file structure of the model repository should be:
```
triton_models/
├── segmentation_3d
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
└── spleen_seg
    ├── 1
    │   └── model.py
    └── config.pbtxt
```
## Environment Setup
### Setup Triton environment
The Triton environment can be set up quickly by running a Triton docker container:
```bash
docker run --gpus=1 -it --name='triton_monai' --ipc=host -p18100:8000 -p18101:8001 -p18102:8002 --shm-size=1g -v /yourfolderpath:/triton_monai nvcr.io/nvidia/tritonserver:21.12-py3
```
Please note that --ipc=host should be set when starting the docker container, so that shared memory can be used for data transmission between server and client. Also, allocate a relatively large shared memory region with the --shm-size option, because starting from the 21.04 release, the Python backend uses shared memory to connect user code to Triton.
### Setup python execution environment
Since we will use MONAI transforms in the Triton python backend, we need to set up the python execution environment inside the Triton container by following the instructions in the [Triton python backend repository](https://github.com/triton-inference-server/python_backend#using-custom-python-execution-environments). For the installation steps of MONAI, refer to [monai install](https://docs.monai.io/en/latest/installation.html "monai install"). Below are the steps used to set up the environment for this experiment.

Install the software packages below:
- conda
- cmake
- rapidjson and libarchive ([instructions](https://github.com/triton-inference-server/python_backend#building-from-source "instructions") for installing these packages on Ubuntu or Debian are included in the Building from Source section)
- conda-pack

Create and activate a conda environment.
```bash
conda create -n monai python=3.8
conda activate monai
```
Since the Triton 21.12 NGC docker image is used, in which the python version is 3.8, we create a conda env with python 3.8 for convenience. You can also specify other python versions. If the python version you use does not match that of the Triton container, please make sure you go through these extra [steps](https://github.com/triton-inference-server/python_backend#1-building-custom-python-backend-stub "steps").
Before installing the packages in your conda environment, make sure that you have exported the PYTHONNOUSERSITE environment variable:
```bash
export PYTHONNOUSERSITE=True
```
If this variable is not exported and similar packages are installed outside your conda environment, your tar file may not contain all the dependencies required for an isolated Python environment.
Install PyTorch with CUDA 11.3 support.
```bash
pip install torch==1.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
```
Install MONAI and the recommended dependencies.
```bash
BUILD_MONAI=1 pip install --no-build-isolation git+https://github.com/Project-MONAI/MONAI#egg=monai
```
Then we can verify the installation of MONAI and all its dependencies:
```bash
python -c 'import monai; monai.config.print_config()'
```
You'll see output similar to the following, which lists the versions of MONAI and the relevant dependencies.

```
MONAI version: 0.8.0+65.g4bd13fe
Numpy version: 1.21.4
Pytorch version: 1.10.1+cu113
MONAI flags: HAS_EXT = True, USE_COMPILED = False
MONAI rev id: 4bd13fefbafbd0076063201f0982a2af8b56ff09
MONAI __file__: /usr/local/lib/python3.8/dist-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.1
scikit-image version: 0.19.1
Pillow version: 9.0.0
Tensorboard version: NOT INSTALLED or UNKNOWN VERSION.
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: NOT INSTALLED or UNKNOWN VERSION.
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: NOT INSTALLED or UNKNOWN VERSION.
pandas version: NOT INSTALLED or UNKNOWN VERSION.
einops version: NOT INSTALLED or UNKNOWN VERSION.
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
```
Install the dependencies of MONAI:
```bash
pip install nibabel scikit-image pillow tensorboard gdown ignite torchvision itk tqdm lmdb psutil cucim pandas einops transformers mlflow matplotlib tensorboardX tifffile cupy
```
Next, we package the conda environment with the `conda-pack` command, which produces `monai.tar.gz`. This archive contains the whole environment needed by the python backend model and is portable (a sketch of the packaging and copy commands follows the snippet below). Put the created `monai.tar.gz` under the `spleen_seg` folder, and set the `config.pbtxt` as:
```
parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/monai.tar.gz"}
}
```
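
A minimal sketch of the packaging and copy step, assuming the environment name `monai` and the repository layout shown above:

```bash
# Package the conda environment (conda-pack must be installed)
conda pack -n monai -o monai.tar.gz
# Place the archive next to the model's config.pbtxt (path assumed from the layout above)
cp monai.tar.gz triton_models/spleen_seg/
```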
Also, please note that in the config.pbtxt, the parameter `FORCE_CPU_ONLY_INPUT_TENSORS` is set to `no`, so that Triton will not move input tensors to CPU for the Python model. Instead, Triton will provide the input tensors to the Python model in either CPU or GPU memory, depending on how those tensors were last used.
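
Based on the Triton Python backend documentation, the corresponding entry in `config.pbtxt` looks like this:

```
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {string_value: "no"}
}
```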
And now the file structure of the model repository should be:
```
triton_models/
├── segmentation_3d
│   ├── 1
│   │   └── model.pt
│   └── config.pbtxt
└── spleen_seg
    ├── 1
    │   └── model.py
    ├── config.pbtxt
    └── monai.tar.gz
```
## Run Triton server
Then you can start the Triton server with the command:
```bash
tritonserver --model-repository=/ROOT_PATH_OF_YOUR_MODEL_REPOSITORY
```
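
As a quick sanity check, once the server reports the models as READY you can query the standard health endpoint from the host through the HTTP port mapped in the `docker run` command above (18100):

```bash
curl -v localhost:18100/v2/health/ready
```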
## Run Triton client
We assume that the server and client are on the same machine. Open a new bash terminal and run the commands below to set up the client environment.
```bash
nvidia-docker run -it --ipc=host --shm-size=1g --name=triton_client --net=host nvcr.io/nvidia/tritonserver:21.12-py3-sdk
pip install monai
pip install nibabel
pip install jupyter
```
Then you can run the jupyter notebook in the client folder of this example; a minimal client sketch is also shown below.
Please note that --ipc=host should be set when starting the docker container, so that shared memory can be used for data transmission between server and client.
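
As a reference for what the notebook does, here is a minimal, hypothetical `tritonclient` HTTP sketch. The input/output tensor names (`INPUT0`, `OUTPUT0`), datatype, and shape are assumptions based on the perf_analyzer command in the next section, not taken from the notebook itself.

```python
# Hypothetical minimal HTTP client; tensor names, dtype and shape are assumptions.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:18100")

# A dummy 3D volume standing in for a loaded NIfTI image.
volume = np.zeros((512, 512, 114), dtype=np.float32)

inputs = [httpclient.InferInput("INPUT0", list(volume.shape), "FP32")]
inputs[0].set_data_from_numpy(volume)

outputs = [httpclient.InferRequestedOutput("OUTPUT0")]

result = client.infer(model_name="spleen_seg", inputs=inputs, outputs=outputs)
segmentation = result.as_numpy("OUTPUT0")
print(segmentation.shape)
```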
## Benchmark
The benchmark was run on an RTX 8000 GPU using perf_analyzer.
```bash
perf_analyzer -m spleen_seg -u localhost:18100 --input-data zero --shape "INPUT0":512,512,114 --shared-memory system
```
### Understanding the benchmark output
- HTTP: `send/recv` indicates the time the client spent sending the request and receiving the response. `response wait` indicates the time spent waiting for the response from the server.
- gRPC: `(un)marshal request/response` indicates the time spent marshalling the request data into the gRPC protobuf and unmarshalling the response data from the gRPC protobuf. `response wait` indicates the time spent writing the gRPC request to the network, waiting for the response, and reading the gRPC response from the network.
- `compute_input`: the count and cumulative duration to prepare input tensor data as required by the model framework/backend. For example, this duration should include the time to copy input tensor data to the GPU.
- `compute_infer`: the count and cumulative duration to execute the model.
- `compute_output`: the count and cumulative duration to extract output tensor data produced by the model framework/backend. For example, this duration should include the time to copy output tensor data from the GPU.
### HTTP vs. gRPC vs. shared memory
Since 3D medical images are generally large, the overhead introduced by the communication protocol cannot be ignored. In most medical imaging AI deployments the client is on the same machine as the server, so shared memory is an applicable way to reduce the send/receive overhead. In this experiment, perf_analyzer is used to compare the different ways of communicating between client and server.
Note that all the processing (pre/post and AI inference) is on GPU.
From the results, we can conclude that using shared memory greatly reduces latency when the amount of data transferred is large. A minimal sketch of using system shared memory from the Python client is shown after the chart.

![](https://github.com/Project-MONAI/tutorials/raw/master/full_gpu_inference_pipeline/pics/Picture2.png)
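
For reference, registering the input through system shared memory with the Python HTTP client looks roughly like this. It is a sketch following the Triton client shared-memory examples; region names, tensor names, and sizes are illustrative.

```python
# Sketch of system shared memory input registration; names and sizes are illustrative.
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

client = httpclient.InferenceServerClient(url="localhost:18100")

volume = np.zeros((512, 512, 114), dtype=np.float32)
byte_size = volume.nbytes

# Create a system shared memory region, copy the data in, and register it with Triton.
shm_handle = shm.create_shared_memory_region("input0_data", "/input0_data", byte_size)
shm.set_shared_memory_region(shm_handle, [volume])
client.register_system_shared_memory("input0_data", "/input0_data", byte_size)

inp = httpclient.InferInput("INPUT0", list(volume.shape), "FP32")
inp.set_shared_memory("input0_data", byte_size)

result = client.infer(model_name="spleen_seg", inputs=[inp])
segmentation = result.as_numpy("OUTPUT0")

# Clean up the region when done.
client.unregister_system_shared_memory("input0_data")
shm.destroy_shared_memory_region(shm_handle)
```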
### Pre/Post-processing on GPU vs. CPU
After moving pre- and post-processing to GPU, we get a 12x speedup for the full pipeline.

![](https://github.com/Project-MONAI/tutorials/raw/master/full_gpu_inference_pipeline/pics/Picture1.png)

full_gpu_inference_pipeline/client/non_ensemble/client.ipynb

Lines changed: 252 additions & 0 deletions
Large diffs are not rendered by default.

full_gpu_inference_pipeline/download_model_repo.sh

Lines changed: 3 additions & 0 deletions
wget https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/triton_models.zip
unzip triton_models.zip
rm triton_models.zip
(Three binary image files added: 91.4 KB, 104 KB, 49.4 KB — the pipeline and benchmark pictures referenced in the README; not rendered here.)
