Best practices: AI inference on Cloud Run with GPUs

This page provides best practices for optimizing performance when using a Cloud Run service with a GPU for AI inference, focusing on large language models (LLMs).

You need to build and deploy a Cloud Run service that can respond in real time to scaling events. This means you need to:

  • Use models that load fast and require minimal transformation into GPU-ready structures, and optimize how they are loaded.
  • Use configurations that allow for maximum, efficient concurrent execution, reducing the number of GPUs needed to serve a target requests-per-second rate while keeping costs down.

Recommended ways to load large ML models on Cloud Run

Google recommends either storing ML models inside container images or optimizing how they are loaded from Cloud Storage.

Storing and loading ML models trade-offs

Here is a comparison of the options:

Container image
  • Deploy time: Slow. An image containing a large model takes longer to import into Cloud Run.
  • Development experience: Changes to the container image require redeployment, which may be slow for large images.
  • Container startup time: Depends on the size of the model. For very large models, use Cloud Storage for more predictable but slower performance.
  • Storage cost: Potentially multiple copies in Artifact Registry.

Cloud Storage, loaded using a Cloud Storage FUSE volume mount
  • Deploy time: Fast. Model downloaded during container startup.
  • Development experience: Not difficult to set up; does not require changes to the Docker image.
  • Container startup time: Fast with network optimizations, but the download is not parallelized.
  • Storage cost: One copy in Cloud Storage.

Cloud Storage, downloaded concurrently using the Google Cloud CLI command gcloud storage cp or the Cloud Storage API, as shown in the transfer manager concurrent download code sample
  • Deploy time: Fast. Model downloaded during container startup.
  • Development experience: Slightly more difficult to set up, because you need to either install the Google Cloud CLI in the image or update your code to use the Cloud Storage API.
  • Container startup time: Fast with network optimizations. The Google Cloud CLI downloads the model file in parallel, making it faster than a FUSE mount.
  • Storage cost: One copy in Cloud Storage.

Internet
  • Deploy time: Fast. Model downloaded during container startup.
  • Development experience: Typically simpler; many frameworks download models from central repositories.
  • Container startup time: Typically poor and unpredictable. Frameworks may apply model transformations during initialization (do this at build time instead). The model host and the libraries used to download the model may not be efficient. Downloading from the internet also carries reliability risk: your service could fail to start if the download target is down, and the underlying model could change, which decreases quality. We recommend hosting the model in your own Cloud Storage bucket.
  • Storage cost: Depends on the model hosting provider.

Store models in container images

Storing ML models in container images deployed to Cloud Run benefits from Cloud Run's built-in container image streaming optimizations, which minimize file loading time without requiring additional network optimizations.

Containers that include ML models can take a while to build. If you use Cloud Build, you can configure Cloud Build to use bigger machines for faster builds. To do this, build an image using a build config file that has the following steps:

steps:
# Build the container image that includes the ML model.
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'IMAGE', '.']
# Push the image to the registry.
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'IMAGE']
images:
- IMAGE
options:
  # Use a larger machine type and disk so images with large model layers build faster.
  machineType: 'E2_HIGHCPU_32'
  diskSizeGb: '500'

There could be additional Artifact Registry cost because Artifact Registry stores one copy of the model per image whenever the layer containing the model is distinct between images (that is, it has a different hash).

Store models in Cloud Storage

To optimize ML model loading from Cloud Storage, whether you use Cloud Storage volume mounts or the Cloud Storage API or command line directly, you must use Direct VPC egress with the egress setting value set to all-traffic, along with Private Service Connect.
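
For illustration, here is a minimal sketch of a Cloud Run service manifest that mounts a model bucket as a Cloud Storage FUSE volume and enables Direct VPC egress. The service name, network, subnetwork, bucket name, and mount path are placeholder assumptions; adapt them to your environment, and configure Private Service Connect separately.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-service                # placeholder service name
spec:
  template:
    metadata:
      annotations:
        # Direct VPC egress: send all outbound traffic through the VPC.
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
        run.googleapis.com/vpc-access-egress: all-traffic
    spec:
      containers:
      - image: IMAGE
        volumeMounts:
        - name: model-weights
          mountPath: /models       # the serving engine reads weights from this path
      volumes:
      - name: model-weights
        csi:
          driver: gcsfuse.run.googleapis.com
          readOnly: true
          volumeAttributes:
            bucketName: MODEL_BUCKET   # placeholder bucket that holds the model

With this layout, the serving engine reads the weights from /models at startup instead of baking them into the container image.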

Load models from the internet

To optimize ML model loading from the internet, route all traffic through the VPC network with the egress setting value set to all-traffic, and set up Cloud NAT to reach the public internet at high bandwidth.
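
The Direct VPC egress annotations are the same as in the previous sketch; as a reminder, they look like the following. The network and subnetwork names are assumptions, and the Cloud Router and Cloud NAT gateway for that subnetwork's region are created separately.

spec:
  template:
    metadata:
      annotations:
        # Route all egress, including internet traffic, through the VPC network.
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
        run.googleapis.com/vpc-access-egress: all-traffic
        # Internet-bound traffic then leaves through the Cloud NAT gateway
        # configured for that subnetwork's region (not part of this manifest).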

Build, deployment, runtime, and system design considerations

The following sections describe considerations for build, deployment, runtime, and system design.

At build time

The following list shows considerations you need to take into account when you are planning your build:

  • Choose a good base image. You should start with an image from the Deep Learning Containers or the NVIDIA container registry for the ML framework you're using. These images have the latest performance-related packages installed. We don't recommend creating a custom image.
  • Choose 4-bit quantized models to maximize concurrency unless you can prove they affect result quality. Quantization produces smaller and faster models, reducing the amount of GPU memory needed to serve the model, and can increase parallelism at run time. Ideally, the models should be trained at the target bit depth rather than quantized down to it.
  • Pick a model format with fast load times to minimize container startup time, such as GGUF. These formats more accurately reflect the target quantization type and require fewer transformations when loaded onto the GPU. For security reasons, don't use pickle-format checkpoints.
  • Create and warm LLM caches at build time. Start the LLM on the build machine while building the Docker image. Enable prompt caching and feed common or example prompts to help warm the cache for real-world use. Save the outputs it generates to be loaded at runtime.
  • Save your own inference model that you generate during build time. This saves significant time compared to loading less efficiently stored models and applying transforms like quantization at container startup. See the sketch after this list.
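
As one illustration of the last point, the following Cloud Build sketch runs a hypothetical convert_and_quantize.py script from your repository in an earlier step and bakes its output into the serving image. The script name, the builder image, and the /workspace/model output path are assumptions; your Dockerfile is expected to COPY that directory into the image.

steps:
# Hypothetical preparation step: convert and quantize the checkpoint into a
# GPU-ready artifact, writing it to the shared /workspace directory.
- name: 'python:3.11'     # in practice, use an image with your ML framework installed
  entrypoint: 'python3'
  args: ['convert_and_quantize.py', '--output-dir', '/workspace/model']
# Build the serving image; its Dockerfile COPYs the model directory produced above.
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'IMAGE', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'IMAGE']
images:
- IMAGE
options:
  machineType: 'E2_HIGHCPU_32'
  diskSizeGb: '500'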

At deployment

  1. Make sure you set service concurrency accurately in Cloud Run.
  2. Adjust your startup probes based on your configuration.

Startup probes determine whether the container has started and is ready to accept traffic. Consider these key points when configuring startup probes (a sample probe configuration is sketched at the end of this subsection):

  • Adequate startup time: Allow sufficient time for your container, including models, to fully initialize and load.
  • Model readiness verification: Configure your probe to pass only when your application is ready to serve requests. Most serving engines automatically achieve this when the model is loaded into GPU memory, preventing premature requests.

Note that Ollama can open a TCP port before a model is loaded. To address this:

  • Preload models: Refer to the Ollama documentation for guidance on preloading your model during startup.
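
For example, a startup probe for an LLM serving container might look like the following sketch (a fragment of the container spec). It assumes the serving engine exposes an HTTP /health endpoint on port 8080 that only returns success once the model is loaded into GPU memory; the timing values are illustrative and should be sized to your model's load time.

startupProbe:
  httpGet:
    path: /health           # assumed readiness endpoint exposed by the serving engine
    port: 8080
  initialDelaySeconds: 10   # give the container time to begin loading the model
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 20      # keep probing for a few minutes while the model loads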

At run time

  • Actively manage your supported context length. The smaller the context window you support, the more queries you can run in parallel. The details of how to do this depend on the framework; see the sketch after this list for one example.
  • Use the LLM caches you generated at build time. Supply the same flags you used during build time when you generated the prompt and prefix cache.
  • Load from the saved model you wrote at build time. See Storing and loading ML models trade-offs for a comparison of the ways to load the model.
  • Consider using a quantized key-value cache if your framework supports it. This can reduce per-query memory requirements and lets you configure more parallelism. However, it can also impact quality.
  • Tune the amount of GPU memory to reserve for model weights, activations and key-value caches. Set it as high as you can without getting an out-of-memory error.
  • Configure your concurrency correctly inside your service code. Make sure your service code is configured to work with your Cloud Run service concurrency settings.
  • Check to see whether your framework has any options for improving container startup performance (for example, using model loading parallelization).
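
As one concrete illustration, if you serve with vLLM, several of these knobs map to server flags that you can pass in the container's args in your service manifest. The values below are illustrative assumptions rather than recommendations, and other serving frameworks expose similar options under different names.

containers:
- image: VLLM_IMAGE                 # your vLLM serving image
  args:
  - --model=/models/model           # model saved at build time or mounted from Cloud Storage
  - --max-model-len=4096            # cap the supported context length
  - --max-num-seqs=24               # number of sequences processed in parallel
  - --gpu-memory-utilization=0.95   # fraction of GPU memory reserved for weights, activations, and KV cache
  - --kv-cache-dtype=fp8            # quantized key-value cache (may affect quality)

Align the parallelism you configure here with the Cloud Run maximum concurrent requests setting described in the autoscaling section below.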

At the system design level

  • Add semantic caches where appropriate. In some cases, caching whole queries and responses can be a great way of limiting the cost of common queries.
  • Control variance in your preambles. Prompt caches are effectively prefix caches: they are only useful when they contain the prompts in the same order, starting from the beginning of the sequence. Insertions or edits earlier in the sequence mean the cache is either missed or only partially used.

Autoscaling and GPUs

Cloud Run automatically scales the number of instances of each revision based on factors such as CPU utilization and request concurrency. However, Cloud Run does not automatically scale the number of instances based on GPU utilization.

For a revision with a GPU, if the revision does not have significant CPU usage, Cloud Run scales out for request concurrency. To achieve optimal scaling for request concurrency, you must set an optimal maximum concurrent requests per instance, as described in the next section.

Maximum concurrent requests per instance

The maximum concurrent requests per instance setting controls the maximum number of requests Cloud Run sends to a single instance at once. You must tune concurrency to match the maximum concurrency the code inside each instance can handle with good performance.

Maximum concurrency and AI workloads

When running an AI inference workload on a GPU in each instance, the maximum concurrency that the code can handle with good performance depends on specific framework and implementation details. The following impacts how you set the optimal maximum concurrent requests setting:

  • Number of model instances loaded onto the GPU
  • Number of parallel queries per model
  • Use of batching
  • Specific batch configuration parameters
  • Amount of non-GPU work

If maximum concurrent requests is set too high, requests might end up waiting inside the instance for access to the GPU, which leads to increased latency. If maximum concurrent requests is set too low, the GPU might be underutilized, causing Cloud Run to scale out more instances than necessary.

A rule of thumb for configuring maximum concurrent requests for AI workloads is:

(Number of model instances * parallel queries per model) + (number of model instances * ideal batch size)

For example, suppose an instance loads 3 model instances onto the GPU, and each model instance can handle 4 parallel queries. The ideal batch size is also 4 because that is the number of parallel queries each model instance can handle. Using the rule of thumb, you would set maximum concurrent requests to 24: (3 * 4) + (3 * 4) = 24.
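
In a Cloud Run service manifest, that value maps to the containerConcurrency field; a minimal sketch, using the 24 from the worked example above:

spec:
  template:
    spec:
      # (3 model instances * 4 parallel queries) + (3 model instances * batch size of 4) = 24
      containerConcurrency: 24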

Note that this formula is just a rule of thumb. The ideal maximum concurrent requests setting depends on the specific details of your implementation. To achieve your actual optimal performance, we recommend load testing your service with different maximum concurrent requests settings to evaluate which option performs best.

Throughput versus latency versus cost tradeoffs

See Throughput versus latency versus cost tradeoffs for the impact of maximum concurrent requests on throughput, latency, and cost. Note that all Cloud Run services using GPUs have CPU always allocated.