How to deploy serverless AI with Gemma 3 on Cloud Run
James Ma
Sr Product Manager
Vlad Kolesnikov
Developer Relations Engineer
Today, we introduced Gemma 3, a family of lightweight, open models built with the cutting-edge technology behind Gemini 2.0. The Gemma 3 family of models has been designed for speed and portability, empowering developers to build sophisticated AI applications at scale. Combined with Cloud Run, it has never been easier to deploy serverless workloads backed by AI models.
In this post, we’ll explore the functionalities of Gemma 3, and how you can run it on Cloud Run.
Gemma 3: Power and efficiency for Cloud deployments
Gemma 3 is engineered for exceptional performance with lower memory footprints, making it ideal for cost-effective inference workloads.
- Built with the world's best single-accelerator model: Gemma 3 delivers optimal performance for its size, outperforming Llama-405B, DeepSeek-V3 and o3-mini in preliminary human preference evaluations on LMArena's leaderboard. This helps you create engaging user experiences that can fit on a single GPU or TPU.
- Create AI with advanced text and visual reasoning capabilities: Easily build applications that analyze images, text and short videos, opening up possibilities for interactive applications.
- Handle complex tasks with a large context window: Gemma 3 offers a 128K-token context window that lets your applications process and understand vast amounts of information, even entire novels, enabling more sophisticated AI capabilities.
Serverless inference with Gemma 3 and Cloud Run
Gemma 3 is a great fit for inference workloads on Cloud Run using NVIDIA L4 GPUs. Cloud Run is Google Cloud's fully managed serverless platform, letting developers run container workloads without managing the underlying infrastructure. Models scale to zero when inactive and scale dynamically with demand, so you optimize costs and performance while paying only for what you use.
For example, you could host an LLM on one Cloud Run service and a chat agent on another, enabling independent scaling and management. With GPU acceleration, an instance starts in about 5 seconds, and a Cloud Run service can return its first AI inference results in under 30 seconds, so your applications deliver responsive user experiences. We've also reduced Cloud Run GPU pricing to roughly $0.60/hr. And of course, if your service isn't receiving requests, it scales back down to zero.
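As a rough sketch, attaching a GPU to a Cloud Run service looks like the following gcloud invocation. The service name, image path, and region are placeholders, and the GPU flags assume the beta command track and available L4 quota in your project:

```shell
# Deploy a container serving a model on Cloud Run with one NVIDIA L4 GPU.
# PROJECT_ID, repository, and image names below are placeholders.
gcloud beta run deploy gemma3-service \
  --image=us-docker.pkg.dev/PROJECT_ID/my-repo/gemma3-ollama:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --cpu=4 \
  --memory=16Gi
```

GPU-attached services require a larger CPU and memory allocation than the Cloud Run defaults, which is why the sketch pins 4 vCPUs and 16 GiB of memory.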
Get started today
Cloud Run and Gemma 3 combine to create a powerful, cost-effective, and scalable solution for deploying advanced AI applications. Gemma 3 is supported by a variety of tools and frameworks, such as Hugging Face Transformers, Ollama, and vLLM.
To get started, visit this guide, which shows you how to build a service with Gemma 3 on Cloud Run using Ollama.
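If your service runs Ollama, as in the guide above, a first request can be as simple as calling Ollama's generate endpoint. The service URL below is a placeholder for the URL that `gcloud run deploy` prints, and the model tag assumes a Gemma 3 variant has been pulled into the image:

```shell
# Send a prompt to the Ollama API exposed by the Cloud Run service.
# Replace SERVICE_URL with your own service's URL.
curl https://SERVICE_URL/api/generate \
  -d '{
    "model": "gemma3:4b",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
```

Setting `"stream": false` returns the full completion in a single JSON response instead of a stream of partial tokens, which is simpler for a quick smoke test.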