Inference optimization is the practice of improving the performance and efficiency of running AI models in production. As large language models (LLMs) grow to tens or hundreds of billions of parameters and inference architectures become more complex, designing and maintaining applications becomes increasingly difficult. Optimization encompasses managing, monitoring, and updating these compute-intensive workloads to enable sub-second response times and higher throughput at lower cost.
It involves a set of techniques—ranging from model compression to advanced memory management—that shift the focus from simply "running a model" to "scaling an intelligence service." This allows developers to build more responsive applications while maintaining a sustainable infrastructure footprint.
Infrastructure-level optimization: This focuses on how the model is executed on the hardware. It includes using optimized runtimes (like NVIDIA NIM or vLLM), managing GPU memory with techniques such as PagedAttention, and using in-flight batching to process multiple requests simultaneously. This is often the most practical path for developers using open-source or proprietary models.
Model-level optimization: This involves modifying the model itself to reduce its size or complexity. Techniques like quantization (reducing precision from 16-bit to 4-bit), distillation (training a smaller "student" model to mimic a larger "teacher"), and sparsity (pruning unimportant parameters) can drastically reduce the memory and compute required for each token.
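To make the quantization idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization using NumPy. This is an illustration of the core memory-saving mechanism, not the algorithm any particular serving stack uses; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights onto int8 [-127, 127] with a single scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the quantized tensor."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than fp32 (2x smaller than fp16), at the cost
# of a small, bounded rounding error per weight.
print(f"fp32 size: {w.nbytes // 1024} KiB, int8 size: {q.nbytes // 1024} KiB")
print(f"max reconstruction error: {np.abs(w - w_hat).max():.4f}")
```

Production quantizers (4-bit, group-wise, activation-aware) are considerably more sophisticated, but the trade-off is the same: fewer bytes per parameter in exchange for bounded approximation error.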
To optimize effectively, you must understand the two distinct phases of LLM inference:
| Phase | Description | Key characteristic |
| --- | --- | --- |
| Prefill | The model processes the entire input prompt to compute intermediate states. | Highly parallelized; compute-bound (saturates the GPU). |
| Decode | The model generates output tokens one by one, autoregressively. | Sequential; memory-bound (limited by data transfer speed). |
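The two phases can be sketched with a toy autoregressive loop (the "model" here is a dummy arithmetic rule, not a transformer): prefill handles the whole prompt in one parallel pass, while decode must run one step per generated token and re-reads the growing KV cache each time.

```python
def toy_forward(tokens):
    """Stand-in for a transformer forward pass: returns per-token states."""
    return [t * 2 for t in tokens]  # pretend these are KV-cache entries

def generate(prompt_tokens, n_new):
    # Prefill: one pass over the entire prompt at once (compute-bound).
    kv_cache = toy_forward(prompt_tokens)
    out = []
    last = prompt_tokens[-1]
    # Decode: strictly sequential, one token per step; each step reads the
    # whole KV cache, which is why this phase is memory-bandwidth-bound.
    for _ in range(n_new):
        last = (sum(kv_cache) + last) % 100  # dummy "next token" rule
        kv_cache.extend(toy_forward([last]))
        out.append(last)
    return out

print(generate([1, 2, 3], 4))
```

The structural point carries over to real models: prefill cost grows with prompt length but parallelizes well, while decode latency is dominated by moving the KV cache and weights through memory at every step.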
Here’s how optimized inference compares to traditional "naive" model serving:
| Feature | Standard deployment | Optimized inference |
| --- | --- | --- |
| Throughput | Limited by static batch sizes and idle time. | High; utilizes in-flight batching and continuous iteration. |
| Latency | Linear growth with sequence length; high TTFT (Time to First Token). | Optimized; utilizes prefill acceleration and speculative decoding. |
| Memory management | Static allocation (over-provisioning for max length). | Dynamic (paging); minimal wastage through PagedAttention. |
| Hardware efficiency | Often underutilizes GPU/TPU compute capabilities. | Maximized; uses optimized kernels (TFE-IE, XLA). |
| Cost per request | Higher; requires more hardware for the same load. | Lower; packs more requests into the same infrastructure. |
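The throughput gap between static and in-flight (continuous) batching can be shown with a small scheduling simulation. This is a hypothetical model, not a benchmark of any real server: each request needs a different number of decode steps, and one "step" advances every request currently in the batch.

```python
def static_batching(decode_lengths, batch_size):
    """The whole batch is locked until its slowest request finishes."""
    steps = 0
    for i in range(0, len(decode_lengths), batch_size):
        steps += max(decode_lengths[i:i + batch_size])
    return steps

def continuous_batching(decode_lengths, batch_size):
    """Finished requests are replaced immediately from the queue."""
    queue = list(decode_lengths)
    active = [queue.pop(0) for _ in range(min(batch_size, len(queue)))]
    steps = 0
    while active:
        steps += 1
        # Decrement every active request; drop those that just finished.
        active = [r - 1 for r in active if r > 1]
        # Backfill freed slots right away instead of waiting for the batch.
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
    return steps

reqs = [3, 20, 4, 5, 18, 2, 6, 19]
print("static steps:", static_batching(reqs, 4))
print("continuous steps:", continuous_batching(reqs, 4))
```

With mixed request lengths, the static scheduler wastes slots on requests that finished early, while the continuous scheduler keeps the batch full, so it completes the same workload in noticeably fewer steps.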
Google Cloud offers a range of tools designed for different skill levels and architectural needs.
| Tool | Starting point | Skill level | Approach | Key feature |
| --- | --- | --- | --- | --- |
| Cloud Run (with GPUs) | A lightweight, event-driven AI service | Beginner | Serverless | Scale-to-zero inference for bursty, low-latency workloads |
| Vertex AI Model Garden | An OSS model (such as Llama 3) | Beginner to intermediate | Managed / low-code | One-click deployment with optimized vLLM or NVIDIA NIM runtimes |
| NVIDIA NIM | High-performance, production workloads | Intermediate to advanced | Accelerated inference | Pre-built microservices with state-of-the-art TensorRT-LLM optimizations |
| GKE | A custom, multi-model infrastructure | Advanced | Cloud-Native / custom | Full control over GPU sharding, orchestration, and custom inference servers |
| vLLM on Cloud TPU | Large-scale TPU-first development | Advanced | TPU-Optimized / XLA | Tailored for XLA with continuous batching and PagedAttention on Cloud TPU |
Model Garden is the fastest path to deploying optimized versions of leading open models like Llama, Gemma, and Mistral.
Go to Model Garden and find a supported OSS model. Click Deploy. In the configuration, select an Optimized Runtime such as vLLM or NVIDIA NIM.
Choose a quantized version of the model (for example, 4-bit or 8-bit) to reduce its memory footprint. This allows you to serve larger batch sizes on the same GPU, directly increasing throughput.
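The memory savings are straightforward arithmetic: weight memory is roughly parameter count times bytes per parameter. A quick back-of-the-envelope calculation for a hypothetical 7B-parameter model (illustrative numbers only, ignoring KV cache and activation memory):

```python
PARAMS = 7e9  # a hypothetical 7B-parameter model

# Weight memory at different precisions: params x bytes per parameter.
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name}: {gb:.1f} GB of weights")
```

Dropping from fp16 to 4-bit cuts weight memory by roughly 4x, which is what frees GPU memory for larger KV caches and bigger batch sizes.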
Ensure the serving container is configured to use PagedAttention. This technique allows the model to store its "memory" (Key-Value cache) in non-contiguous blocks, preventing memory wastage and allowing for longer context windows.
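The core idea behind PagedAttention can be sketched as a simple block allocator: the KV cache is stored in fixed-size blocks handed out on demand, so a sequence consumes memory proportional to its actual length rather than a pre-reserved maximum. This is a simplified toy, not vLLM's implementation; the class and method names are invented for illustration.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size          # tokens per physical block
        self.free_blocks = list(range(num_blocks))
        self.tables = {}    # sequence id -> list of physical block ids
        self.lengths = {}   # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def free(self, seq_id):
        # Finished sequences return their blocks to the shared pool.
        self.free_blocks.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(40):                # a 40-token sequence...
    cache.append_token("req-1")
print(len(cache.tables["req-1"]), "blocks used")  # ceil(40 / 16) = 3 blocks
```

Because blocks need not be contiguous, freed blocks from one request can be reused immediately by another, which is what eliminates the fragmentation and over-provisioning of static allocation.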
Once deployed, Vertex AI automatically handles in-flight batching, scheduling new requests as soon as an existing request completes a token. Use Vertex AI Model Monitoring to track latency and confirm that output quality remains high.
For teams needing granular control over their orchestration and custom inference kernels, GKE is the industry-standard choice.
Provision a GKE cluster with specialized GPU nodes (such as L4 or H100). Install the NVIDIA GPU Operator to handle driver management and performance tuning automatically.
Use a containerized inference engine like vLLM or Triton Inference Server. These servers support continuous batching and tensor parallelism, allowing you to shard large models across multiple GPUs. vLLM also lets you switch between TPUs and GPUs with minimal additional code.
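Tensor parallelism itself is just a sharded matrix multiply. The sketch below splits a weight matrix column-wise across two simulated "devices" with NumPy; real engines do the same split across physical GPUs and combine the shards with a collective operation (e.g., an all-gather) rather than a local concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # activations: batch x hidden
w = rng.standard_normal((8, 16))     # weight: hidden x output

w0, w1 = np.split(w, 2, axis=1)      # shard the output dimension in half
y0 = x @ w0                          # "GPU 0" computes half the columns
y1 = x @ w1                          # "GPU 1" computes the other half
y = np.concatenate([y0, y1], axis=1)

# The sharded result is identical to the unsharded matmul.
assert np.allclose(y, x @ w)
print("sharded matmul matches:", y.shape)
```

Each device holds only its slice of the weights, which is how models too large for a single GPU's memory can still be served.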
For mission-critical latency needs, configure speculative decoding. This involves using a smaller, faster "draft" model to predict tokens, which are then verified in parallel by your larger "target" model, often providing a 2x-3x speedup.
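The accept/reject logic of speculative decoding can be illustrated with dummy deterministic "models" (real systems compare probability distributions and verify all draft positions in one parallel target-model pass; everything below is invented for illustration):

```python
def draft_model(ctx, k):
    """Fast but imperfect: proposes the next k tokens."""
    out, t = [], ctx[-1]
    for _ in range(k):
        t = (t + 1) % 10                  # draft rule: just increment
        out.append(t)
    return out

def target_model(ctx):
    """Ground-truth next token; disagrees with the draft after a 5."""
    t = ctx[-1]
    return (t + 1) % 10 if t != 5 else 0

def speculative_step(ctx, k=4):
    proposed = draft_model(ctx, k)
    accepted = []
    for tok in proposed:
        truth = target_model(ctx + accepted)
        if tok != truth:
            accepted.append(truth)        # keep the correction, then stop
            break
        accepted.append(tok)              # draft token verified, keep going
    return accepted

print(speculative_step([3], k=4))
```

When the draft model is usually right, each target-model invocation yields several tokens instead of one, which is where the speedup comes from; the output is provably identical to what the target model alone would produce.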
GKE Inference Quickstart acts as a pre-configured database of tested inference stack configurations. By specifying your model, latency requirements, and cost priorities, the tool provides a set of recommendations based on best practices and the latest benchmarks. This allows you to monitor inference-specific performance metrics and dynamically fine-tune your deployment to ensure it always runs on optimized technology.
GKE Inference Gateway is now generally available, introducing two advanced capabilities for managing complex GenAI applications.
Anywhere Cache is a fully consistent read cache that works with existing Google Cloud Storage (GCS) buckets to cache data in the same zone as your accelerators. This reduces read latency by up to 96% and minimizes the network costs associated with read-heavy workloads.
Tying the entire infrastructure together is Cloud WAN, a fully managed global network built on Google's planet-scale infrastructure. Cloud WAN connects AI computing resources across different regions, clouds, and on-premises environments, delivering a 40% improvement in inference application experience and 40% lower TCO compared to traditional WAN solutions.
Start building on Google Cloud with $300 in free credits and 20+ always free products.