How to deploy serverless AI with Gemma 3 on Cloud Run
James Ma
Sr Product Manager
Vlad Kolesnikov
Developer Relations Engineer
Today, we introduced Gemma 3, a family of lightweight, open models built with the cutting-edge technology behind Gemini 2.0. The Gemma 3 family of models has been designed for speed and portability, empowering developers to build sophisticated AI applications at scale. Combined with Cloud Run, it has never been easier to deploy serverless workloads backed by AI models.
In this post, we’ll explore the functionalities of Gemma 3, and how you can run it on Cloud Run.
Gemma 3: Power and efficiency for Cloud deployments
Gemma 3 is engineered for exceptional performance with lower memory footprints, making it ideal for cost-effective inference workloads.
- Built with the world's best single-accelerator model: Gemma 3 delivers optimal performance for its size, outperforming Llama-405B, DeepSeek-V3 and o3-mini in preliminary human preference evaluations on LMArena's leaderboard. This helps you create engaging user experiences that can fit on a single GPU or TPU.
- Create AI with advanced text and visual reasoning capabilities: Easily build applications that analyze images, text and short videos, opening up possibilities for interactive applications.
- Handle complex tasks with a large context window: Gemma 3 offers a 128K-token context window that lets your applications process and understand vast amounts of information, even entire novels, enabling more sophisticated AI capabilities.
Serverless inference with Gemma 3 and Cloud Run
Gemma 3 is a great fit for inference workloads on Cloud Run using NVIDIA L4 GPUs. Cloud Run is Google Cloud's fully managed serverless platform, letting developers run container workloads without managing the underlying infrastructure. Models scale to zero when inactive and scale dynamically with demand, so you optimize costs and performance while paying only for what you use.
For example, you could host an LLM on one Cloud Run service and a chat agent on another, enabling independent scaling and management. With GPU acceleration, an instance starts in about 5 seconds, and a Cloud Run service can return its first AI inference results in under 30 seconds, so your applications deliver responsive user experiences. We've also reduced Cloud Run GPU pricing to roughly $0.60/hr. And of course, if your service isn't receiving requests, it scales back down to zero.
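As a rough sketch, attaching a GPU to a Cloud Run service looks like the following gcloud invocation. The service name, image path, and region are placeholders, and the GPU flags assume the beta command track and available L4 quota in your project:

```shell
# Deploy a container serving a model on Cloud Run with one NVIDIA L4 GPU.
# PROJECT_ID, repository, and image names below are placeholders.
gcloud beta run deploy gemma3-service \
  --image=us-docker.pkg.dev/PROJECT_ID/my-repo/gemma3-ollama:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --no-cpu-throttling \
  --cpu=4 \
  --memory=16Gi
```

GPU-attached services require a larger CPU and memory allocation than the Cloud Run defaults, which is why the sketch pins 4 vCPUs and 16 GiB of memory.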
Get started today
Cloud Run and Gemma 3 combine to create a powerful, cost-effective, and scalable solution for deploying advanced AI applications. Gemma 3 is supported by a variety of tools and frameworks, such as Hugging Face Transformers, Ollama, and vLLM.
To get started, visit this guide, which shows you how to build a service with Gemma 3 on Cloud Run using Ollama.
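If your service runs Ollama, as in the guide above, a first request can be as simple as calling Ollama's generate endpoint. The service URL below is a placeholder for the URL that `gcloud run deploy` prints, and the model tag assumes a Gemma 3 variant has been pulled into the image:

```shell
# Send a prompt to the Ollama API exposed by the Cloud Run service.
# Replace SERVICE_URL with your own service's URL.
curl https://SERVICE_URL/api/generate \
  -d '{
    "model": "gemma3:4b",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'
```

Setting `"stream": false` returns the full completion in a single JSON response instead of a stream of partial tokens, which is simpler for a quick smoke test.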