Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. With GKE, you can implement a robust, production-ready AI/ML platform with all the benefits of managed Kubernetes and these capabilities:

  • Infrastructure orchestration that supports GPUs and TPUs for training and serving workloads at scale.
  • Flexible integration with distributed computing and data processing frameworks.
  • Support for multiple teams on the same infrastructure to maximize resource utilization.
This page provides an overview of the AI/ML capabilities of GKE and how to get started running optimized AI/ML workloads on GKE with GPUs, TPUs, and frameworks like Hugging Face TGI, vLLM, and JetStream.
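As a concrete illustration of GPU orchestration, the following is a minimal sketch of a Kubernetes Pod manifest that requests a GPU on GKE. The Pod name, container name, and image are hypothetical placeholders; the `cloud.google.com/gke-accelerator` node selector and the `nvidia.com/gpu` resource limit are the standard mechanisms GKE and Kubernetes use to schedule GPU workloads, and this assumes the cluster already has a GPU node pool.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-pod        # hypothetical Pod name
spec:
  nodeSelector:
    # GKE node label selecting the accelerator type attached to the node pool
    cloud.google.com/gke-accelerator: nvidia-tesla-t4
  containers:
  - name: inference
    # Placeholder image; substitute your serving image (e.g., a vLLM or TGI build)
    image: your-registry/inference-server:latest
    resources:
      limits:
        # Request one GPU; Kubernetes schedules the Pod onto a node with a free GPU
        nvidia.com/gpu: 1
```

A TPU workload follows the same pattern, with a TPU-specific node selector and resource name instead of the GPU ones shown here.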
Explore self-paced training from Google Cloud Skills Boost, use cases, reference architectures, and code samples with examples of how to use and connect Google Cloud services.