Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. With GKE, you can implement a robust, production-ready AI/ML platform with all the benefits of managed Kubernetes and these capabilities:

  • Infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale.
  • Flexible integration with distributed computing and data processing frameworks.
  • Support for multiple teams on the same infrastructure to maximize utilization of resources.
This page provides an overview of the AI/ML capabilities of GKE and how to get started running optimized AI/ML workloads on GKE with GPUs, TPUs, and frameworks like Hugging Face TGI, vLLM, and JetStream.
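To make the GPU workflow concrete, the sketch below shows a minimal Pod manifest that requests one NVIDIA GPU on a GKE node. The Pod name and container image are placeholders you would replace with your own serving image (for example, a vLLM or TGI container); the `cloud.google.com/gke-accelerator` node selector and the `nvidia.com/gpu` resource name are the standard GKE mechanisms for targeting accelerator nodes, but the specific accelerator type shown here is only an example.

```yaml
# Sketch: a Pod that schedules onto a GKE GPU node pool and
# requests a single NVIDIA GPU. Names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-pod        # placeholder name
spec:
  nodeSelector:
    # Target nodes with the desired accelerator; T4 is an example type.
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: inference
    image: your-inference-image  # replace with your serving image (e.g., vLLM)
    resources:
      limits:
        nvidia.com/gpu: 1        # request one GPU for this container
```

You would apply this with `kubectl apply -f` against a cluster that has a GPU node pool; TPU workloads follow a similar pattern with TPU-specific node selectors and resource names.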