Containers & Kubernetes

Efficiently serve optimized AI models with NVIDIA NIM microservices on GKE

October 8, 2024

Brandon Royal

Senior Product Manager

Robert Bailey

Staff Software Engineer

Try Gemini 3

Our most intelligent model is now available on Vertex AI and Gemini Enterprise

Try now

In the rapidly evolving landscape of AI, efficiently serving AI models is critical to ensure the platform delivers value at optimal performance and cost. But the complexities of optimizing and operating an increasing variety of AI models prevents many organizations from fully realizing AI’s value. We’ve been partnering closely with NVIDIA to bring the power of the NVIDIA AI accelerated computing platform to Google Kubernetes Engine (GKE) to address these complexities. Today, we’re thrilled to announce the availability of NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, on GKE and discoverable via Google Cloud Marketplace, letting you deploy NIM microservices directly from the Google Cloud console.

NVIDIA NIM containerized microservices for accelerated computing optimize deployment for common AI models that can run across various environments, including Kubernetes clusters, with a single command, providing standard APIs for seamless integration into generative AI applications and workflows.

The combination of NVIDIA NIM and GKE unlocks new potential for AI model inference, helping organizations to deliver optimal latency and throughput with the scale and operational efficiency of GKE. And deploying these powerful capabilities on GKE is easier than ever. With NVIDIA NIM microservices available directly in the GKE console, you can deploy the latest NIM-optimized models including meta/llama-3.1-70b-instruct, mistralai/mixtral-8x7b-instruct-v0.1 and nvidia/nv-embedqa-e5-v5 on GKE with just a few clicks. This deployment experience expands upon the previously available helm-based deployment and enables customers to seamlessly deploy the latest NIM models from the NVIDIA API catalog on NVIDIA GPUs orchestrated by GKE, and integrated high-performance storage for model weights.

Writer is transforming work with enterprise-grade AI models optimized for NVIDIA NIM and delivered on GKE:

"Writer is excited to expand our partnership with Google Cloud and NVIDIA to enable us to deliver Writer’s advanced AI models in a highly performant, scalable and efficient way. Together, NVIDIA NIMs and GKE provide outstanding inference performance, making it easy to integrate and scale across different applications. This collaboration improves our deployment abilities and uses advanced technology to ensure top performance and reliability." - Waseem Alshikh, CTO, Writer, Inc.

The ability to deploy NVIDIA NIM microservices directly to GKE marks an important milestone in Google Cloud’s partnership with NVIDIA.

“With NVIDIA NIM microservices integrated as ready to deploy solution in Google Kubernetes Engine, organizations can bring AI to market faster with models optimized for NVIDIA GPUs that can be efficiently scaled and operated with GKE,” said Abhishek Sawarkar, product manager for NVIDIA AI Enterprise. “We're seeing significant latency and throughput improvements on popular GenAI models, which can be deployed and scaled on GKE’s production-ready platform in minutes rather than hours.”

Get started with NVIDIA NIM on GKE

Navigate to the Google Kubernetes Engine in the Google Cloud console and select NVIDIA NIM, then launch it to configure your deployment.

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_7nRtpZB.max-1800x1800.png

In the UI, specify the deployment name, service account information and confirm APIs are enabled. Specify a unique cluster name and GPU type and shape for the cluster. Select your model from the drop-down and click Deploy. The deployment will create a new GKE cluster and deploy the specified NIM.

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_G31cxvY.max-1100x1100.png

After deployment has successfully completed, connect to your NIM endpoint with the following commands, where $CLUSTER is the GKE cluster name, $DEPLOYMENT is the deployment name and $PROJECT is the GCP project in which it was deployed.

Send a test inference to your NIM endpoints with a curl command, specifying the model previously selected (this example shows how to query a llama-3.1-8b-instruct model)

And that’s it! Now you know how to deploy an NVIDIA NIM microservice to GKE from the console, with direct integration into Google Cloud Marketplace, making it easy to enjoy the power of NVIDIA GPUs and software on Google Cloud’s high-performance, reliable containerized infrastructure. Make sure to find NVIDIA NIMs on GKE at Google Cloud Marketplace here. Learn more about the Google Cloud and NVIDIA partnership at cloud.google.com/NVIDIA.

Posted in

Containers & Kubernetes

Accelerate model downloads on GKE with NVIDIA Run:ai Model Streamer

By Peter Schuurman • 4-minute read

Containers & Kubernetes

How Google Does It: Building the largest known Kubernetes cluster, with 130,000 nodes

By Besher Massri • 10-minute read

Containers & Kubernetes

GKE: From containers to agents, the unified platform for every modern workload

By Drew Bradstock • 9-minute read

Containers & Kubernetes

Introducing Agent Sandbox: Strong guardrails for agentic AI on Kubernetes and GKE

By Brandon Royal • 4-minute read

Efficiently serve optimized AI models with NVIDIA NIM microservices on GKE

Brandon Royal

Robert Bailey

Try Gemini 3

Get started with NVIDIA NIM on GKE

Related articles

Accelerate model downloads on GKE with NVIDIA Run:ai Model Streamer

How Google Does It: Building the largest known Kubernetes cluster, with 130,000 nodes

GKE: From containers to agents, the unified platform for every modern workload

Introducing Agent Sandbox: Strong guardrails for agentic AI on Kubernetes and GKE