Run LLM inference on Cloud Run GPUs with vLLM

The following codelab shows how to run a backend service that runs vLLM, an inference engine for production systems, together with Google's Gemma 2, a 2-billion-parameter instruction-tuned model.

See the entire codelab at Run LLM inference on Cloud Run GPUs with vLLM.
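
As a rough illustration of how a client might talk to such a backend, the sketch below builds a request for the OpenAI-compatible completions endpoint (`/v1/completions`) that vLLM's API server exposes. It is not taken from the codelab: the service URL is a placeholder, the model ID `google/gemma-2-2b-it` is an assumption about which model the server loaded, the helper names are hypothetical, and the optional bearer token assumes the Cloud Run service requires authentication.

```python
import json
import urllib.request

# Placeholder values (assumptions, not from the codelab): use your own
# Cloud Run service URL and the model ID your vLLM server actually loaded.
SERVICE_URL = "https://YOUR-SERVICE-URL.run.app"
MODEL = "google/gemma-2-2b-it"


def build_completion_request(prompt, service_url=SERVICE_URL, model=MODEL,
                             max_tokens=128, id_token=None):
    """Build, without sending, a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    headers = {"Content-Type": "application/json"}
    if id_token:
        # Only needed if the service does not allow unauthenticated access.
        headers["Authorization"] = f"Bearer {id_token}"
    return urllib.request.Request(
        url=service_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Why is the sky blue?")` would send the prompt to the deployed service and return the generated text, provided the placeholders are replaced with real values. Building the request separately from sending it makes the payload easy to inspect before any network call is made.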

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-10-25 UTC.
