From d2dbd3cafc603fc3db45318edbb70c617ad3632e Mon Sep 17 00:00:00 2001
From: Benedikt Rollik
Date: Wed, 4 Jun 2025 15:55:35 +0200
Subject: [PATCH 1/2] feat(infr): add quantization configuration for custom
 model deployment

---
 pages/managed-inference/how-to/create-deployment.mdx | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/pages/managed-inference/how-to/create-deployment.mdx b/pages/managed-inference/how-to/create-deployment.mdx
index ad5ed35260..9732ff0aca 100644
--- a/pages/managed-inference/how-to/create-deployment.mdx
+++ b/pages/managed-inference/how-to/create-deployment.mdx
@@ -27,6 +27,10 @@ dates:
     Some models may require acceptance of an end-user license agreement. If prompted, review the terms and conditions and accept the license accordingly.
 
     - Choose the geographical **region** for the deployment.
+    - For custom models: Choose the model quantization.
+
+      Each model comes with a default quantization. Select lower bits quantization to improve performance and enable model to run on smaller GPU Nodes, while potentially reducing precision.
+
     - Specify the GPU Instance type to be used with your deployment.
 4. Enter a **name** for the deployment, and optional tags.
 5. Configure the **network connectivity** settings for the deployment:

From aeb6e60f3f4f16e87ae1fac9f574f7a353c53671 Mon Sep 17 00:00:00 2001
From: Benedikt Rollik
Date: Thu, 5 Jun 2025 14:57:39 +0200
Subject: [PATCH 2/2] Apply suggestions from code review

Co-authored-by: Rowena Jones <36301604+RoRoJ@users.noreply.github.com>
---
 pages/managed-inference/how-to/create-deployment.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/pages/managed-inference/how-to/create-deployment.mdx b/pages/managed-inference/how-to/create-deployment.mdx
index 9732ff0aca..ecd53493ad 100644
--- a/pages/managed-inference/how-to/create-deployment.mdx
+++ b/pages/managed-inference/how-to/create-deployment.mdx
@@ -29,7 +29,7 @@ dates:
     - Choose the geographical **region** for the deployment.
     - For custom models: Choose the model quantization.
 
-      Each model comes with a default quantization. Select lower bits quantization to improve performance and enable model to run on smaller GPU Nodes, while potentially reducing precision.
+      Each model comes with a default quantization. Select lower bits quantization to improve performance and enable the model to run on smaller GPU nodes, while potentially reducing precision.
 
     - Specify the GPU Instance type to be used with your deployment.
 4. Enter a **name** for the deployment, and optional tags.
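
For readers who script deployments instead of using the console, the quantization choice this patch documents can in principle be set at creation time through the Managed Inference API. The sketch below is illustrative only: the endpoint path, the `node_type_name` value, and the shape of the `quantization` field are assumptions inferred from the patch's terminology, not confirmed by it. Check the official Scaleway API reference before relying on any of these names.

```python
# Hypothetical sketch: create a Managed Inference deployment with an
# explicit quantization setting via the Scaleway HTTP API.
# The endpoint path and the "quantization" field shape are ASSUMPTIONS
# based on the wording in the patch above, not verified API details.
import os

import requests

REGION = "fr-par"
API_URL = f"https://api.scaleway.com/inference/v1/regions/{REGION}/deployments"

payload = {
    "name": "my-custom-model-deployment",  # deployment name (step 4 in the doc)
    "project_id": os.environ["SCW_DEFAULT_PROJECT_ID"],
    # Placeholder ID for an imported custom model.
    "model_id": "11111111-2222-3333-4444-555555555555",
    "node_type_name": "L4",  # GPU Instance type; example value, not prescriptive
    # Lower-bit quantization lets the model fit on smaller GPU nodes,
    # trading some output precision for performance (assumed field shape).
    "quantization": {"bits": 8},
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"X-Auth-Token": os.environ["SCW_SECRET_KEY"]},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```

The tradeoff mirrors the patch text: fewer bits generally means a smaller memory footprint, so the model can run on smaller (and cheaper) GPU nodes, at some potential cost in precision.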