Accelerate model loading on GKE with Run:ai Model Streamer and vLLM

This document shows you how to accelerate loading large AI model weights from Cloud Storage by using Run:ai Model Streamer with the vLLM inference server on Google Kubernetes Engine (GKE).

The solution in this document assumes that your AI model weights, in safetensors format, are already uploaded to a Cloud Storage bucket.

By adding the --load-format=runai_streamer flag to your vLLM deployment, you can use the Run:ai Model Streamer to improve model download efficiency for your AI workloads on GKE.
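
For example, when the vLLM OpenAI-compatible server is started directly, the flag is passed as in the following sketch. The gs:// path is a placeholder for your own bucket and model path; the complete Kubernetes manifest appears later in this document.

# Minimal sketch: serve a model from Cloud Storage with the streamer enabled.
python3 -m vllm.entrypoints.openai.api_server \
    --model=gs://BUCKET_NAME/PATH_TO_MODEL \
    --load-format=runai_streamer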

This document is intended for the following users:

  • Machine learning (ML) engineers who need to load large AI models from object storage into GPU/TPU nodes as quickly as possible.
  • Platform admins and operators who automate and optimize model serving infrastructure on GKE.
  • Cloud architects who evaluate specialized data loading tools for AI/ML workloads.

To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE user roles and tasks.

Overview

The solution described in this document uses three core components—Run:ai Model Streamer, vLLM, and the safetensors file format—to accelerate the process of loading model weights from Cloud Storage onto GPU nodes.

Run:ai Model Streamer

Run:ai Model Streamer is an open-source Python SDK that speeds up loading large AI models onto GPUs. It streams model weights directly from storage, like Cloud Storage buckets, to your GPU's memory. The model streamer is particularly suitable for accessing safetensors files located in Cloud Storage.

safetensors

safetensors is a file format for storing tensors, the core data structures in AI models, in a way that enhances both security and speed. safetensors is designed as an alternative to Python's pickle format, and it enables fast loading times through a zero-copy approach. This approach lets tensors be accessed directly from the source without loading the entire file into local memory first.

vLLM

vLLM is an open-source library for LLM inference and serving. It's a high-performance inference server that's optimized for rapidly loading large AI models. In this document, vLLM is the core engine that runs your AI model on GKE and handles incoming inference requests. Run:ai Model Streamer's built-in authentication support for Cloud Storage requires vLLM version 0.11.1 or later.

How Run:ai Model Streamer accelerates model loading

When you start an LLM-based AI application for inference, a significant delay often occurs before the model is ready to use. This delay, known as a cold start, happens because the entire multi-gigabyte model file must be downloaded from a storage location, like a Cloud Storage bucket, to your machine's local disk. The file is then loaded into your GPU's memory. During this loading period, the expensive GPU is sitting idle, which is inefficient and costly.

Instead of this download-then-load process, the model streamer streams the model directly from Cloud Storage to the GPU's memory. The streamer uses a high-performance backend to read multiple parts of the model, called tensors, in parallel. Reading tensors concurrently is significantly faster than reading the file sequentially.

Architecture overview

The Run:ai Model Streamer integrates with vLLM on GKE to accelerate model loading by streaming model weights directly from Cloud Storage to GPU memory, bypassing local disk.

The following diagram shows this architecture:

Run:ai Model Streamer architecture: model weights stream from Cloud Storage to vLLM on GKE.

This architecture includes the following components and workflow:

  • Cloud Storage bucket: stores the AI model weights in safetensors format.
  • GKE Pod with GPU: runs the vLLM inference server.
  • vLLM inference server: configured with the --load-format=runai_streamer flag, which enables the model streamer functionality.
  • Run:ai Model Streamer: when vLLM starts, the model streamer reads model weights from the specified gs:// path in the Cloud Storage bucket. Instead of downloading files to disk, it streams tensor data directly into the GPU memory of the GKE Pod, where it's immediately available to vLLM for inference.
  • Cloud Storage Anywhere Cache (Optional): if enabled, Anywhere Cache caches bucket data in the same zone as the GKE nodes, further accelerating data access for the streamer.

Benefits

  • Reduced cold start times: the model streamer significantly reduces the time it takes for a model to start serving. It loads model weights up to six times faster than conventional methods. For more information, see Run:ai Model Streamer benchmarks.
  • Enhanced GPU utilization: by minimizing model loading delays, GPUs can dedicate more time to actual inference tasks, boosting overall efficiency and processing capacity.
  • Streamlined workflow: the solution described in this document integrates with GKE, enabling inference servers like vLLM or SGLang to directly access models in Cloud Storage buckets.

Before you begin

Make sure to complete the following prerequisites.

Select or create a project and enable the APIs

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Roles required to select or create a project

    • Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
    • Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

    Go to project selector

  • Verify that billing is enabled for your Google Cloud project.

  • Enable the Kubernetes Engine, Cloud Storage, Compute Engine, and IAM APIs.

    Roles required to enable APIs

    To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

    Enable the APIs


  • Set up Cloud Shell

    This document uses the Google Cloud CLI and kubectl commands to create and manage the resources required for this solution. You can run these commands in Cloud Shell by clicking Activate Cloud Shell at the top of the Google Cloud console.

    In the Google Cloud console, activate Cloud Shell.

    Activate Cloud Shell

    Alternatively, you can install and initialize the gcloud CLI in your local shell environment to run the commands. If you use a local shell, run the gcloud auth login command to authenticate with Google Cloud.
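
    For example, in a local shell the setup might look like the following. PROJECT_ID is a placeholder for your project ID.

    # Authenticate and set the default project for subsequent gcloud commands.
    gcloud auth login
    gcloud config set project PROJECT_ID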

    Grant IAM roles

    Ensure that your Google Cloud account has the following IAM roles on your project so you can create a GKE cluster and manage Cloud Storage:

    • roles/container.admin
    • roles/storage.admin

    To grant these roles, run the following commands:

    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="user:$(gcloud config get-value account)" \
        --role="roles/container.admin"
    gcloud projects add-iam-policy-binding PROJECT_ID \
        --member="user:$(gcloud config get-value account)" \
        --role="roles/storage.admin"
    

    Replace PROJECT_ID with your project ID.
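
    Optionally, you can list the roles that are granted to your account on the project with a command similar to the following:

    # List the roles that are bound to your account on the project.
    gcloud projects get-iam-policy PROJECT_ID \
        --flatten="bindings[].members" \
        --filter="bindings.members:user:$(gcloud config get-value account)" \
        --format="value(bindings.role)"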

    Prepare your environment

    This section guides you through setting up your GKE cluster and configuring permissions to access your model in Cloud Storage.

    Create a GKE cluster with GPUs

    The Run:ai Model Streamer can be used with both GKE Autopilot and Standard clusters. Choose the cluster mode that best suits your needs.

    1. Set variables for your project and cluster name:

      export PROJECT_ID=PROJECT_ID
      export CLUSTER_NAME=CLUSTER_NAME
      

      Replace the following:

      • PROJECT_ID: your Google Cloud project ID. You can find your project ID by running the gcloud config get-value project command.
      • CLUSTER_NAME: the name of your cluster. For example, run-ai-test.
    2. Create either an Autopilot or a Standard cluster:

      Autopilot

      Follow these steps to create a GKE Autopilot cluster:

      1. Set the region for your cluster:

        export REGION=REGION
        

        Replace REGION with the region where you want to create your cluster. For optimal performance, use the same region as your Cloud Storage bucket.

      2. Create the cluster:

        gcloud container clusters create-auto $CLUSTER_NAME \
            --project=$PROJECT_ID \
            --location=$REGION
        

      Autopilot clusters automatically provision nodes based on workload requirements. When you deploy the vLLM server in a later step, Autopilot provisions GPU nodes as needed. For more information, see About node pool auto-creation.

      Standard

      Follow these steps to create a GKE Standard cluster:

      1. Set the zone for your cluster:

        export ZONE=ZONE
        

        Replace ZONE with the zone where you want to create your cluster. For optimal performance, use a zone in the same region as your Cloud Storage bucket.

      2. Create the cluster:

        gcloud container clusters create $CLUSTER_NAME \
            --project=$PROJECT_ID \
            --zone=$ZONE \
            --workload-pool=$PROJECT_ID.svc.id.goog \
            --num-nodes=1
        
      3. Create a node pool with one G2 machine (NVIDIA L4 GPU):

        gcloud container node-pools create g2-gpu-pool \
            --cluster=$CLUSTER_NAME \
            --zone=$ZONE \
            --machine-type=g2-standard-16 \
            --num-nodes=1 \
            --accelerator=type=nvidia-l4
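
        After the cluster is created in either mode, you can fetch its credentials so that the kubectl commands in the following sections target the new cluster:

        # Configure kubectl for the new cluster. For an Autopilot cluster,
        # use --location=$REGION instead of --location=$ZONE.
        gcloud container clusters get-credentials $CLUSTER_NAME \
            --project=$PROJECT_ID \
            --location=$ZONE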
        

    Configure Workload Identity Federation for GKE

    Configure Workload Identity Federation for GKE to allow your GKE workloads to securely access the model in your Cloud Storage bucket.

    1. Set variables for your Kubernetes service account and namespace:

      export KSA_NAME=KSA_NAME
      export NAMESPACE=NAMESPACE
      

      Replace the following:

      • KSA_NAME: the name of the Kubernetes service account that your Pod uses to authenticate to Google Cloud APIs.
      • NAMESPACE: the namespace that you want your workloads to run in. Make sure to use the same namespace for all the resources that you create in this document.
    2. Create a Kubernetes namespace:

      kubectl create namespace $NAMESPACE
      
    3. Create a Kubernetes service account (KSA):

      kubectl create serviceaccount $KSA_NAME \
          --namespace=$NAMESPACE
      
    4. Grant your KSA the necessary permissions:

      1. Set the environment variables:

        export BUCKET_NAME=BUCKET_NAME
        export PROJECT_ID=PROJECT_ID
        export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID \
            --format 'get(projectNumber)')
        

        Replace the following:

        • BUCKET_NAME: the name of your Cloud Storage bucket that contains your safetensors files.
        • PROJECT_ID: your Google Cloud project ID.

        The PROJECT_NUMBER, PROJECT_ID, NAMESPACE, and KSA_NAME values are used in the following steps to construct the Workload Identity Federation for GKE principal identifier for your project.

      2. Grant the roles/storage.bucketViewer role to your KSA so that it can view your Cloud Storage bucket and its metadata:

        gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
            --member="principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/$NAMESPACE/sa/$KSA_NAME" \
            --role="roles/storage.bucketViewer"
        
      3. Grant the roles/storage.objectUser role to your KSA to read, write, and delete objects in your Cloud Storage bucket:

        gcloud storage buckets add-iam-policy-binding gs://$BUCKET_NAME \
            --member="principal://iam.googleapis.com/projects/$PROJECT_NUMBER/locations/global/workloadIdentityPools/$PROJECT_ID.svc.id.goog/subject/ns/$NAMESPACE/sa/$KSA_NAME" \
            --role="roles/storage.objectUser"
        

    You have now set up a GKE cluster with GPUs and configured Workload Identity Federation for GKE, granting a Kubernetes Service Account the necessary permissions to access your AI model in Cloud Storage. With the cluster and permissions in place, you are ready to deploy the vLLM inference server, which will use this service account to stream model weights with the Run:ai Model Streamer.

    Deploy vLLM with the Run:ai Model Streamer

    Deploy a Pod that runs the vLLM OpenAI-compatible server and is configured with the --load-format=runai_streamer flag to use the Run:ai Model Streamer. Use vLLM version 0.11.1 or later.

    The following sample manifest shows a vLLM configuration with the model streamer enabled for a small model, such as gemma-2-9b-it, on a single NVIDIA L4 GPU.

    • If you're using a large model that requires multiple GPUs, increase the --tensor-parallel-size value to the required number of GPUs.
    • The --model-loader-extra-config='{"distributed":true}' flag enables distributed loading of model weights and is a recommended setting for improving model loading performance from object storage.

    For more information, see Tensor parallelism and Tunable parameters.
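
    In addition to the flags shown in the manifest, the streamer's read concurrency can also be tuned through the --model-loader-extra-config flag. The following sketch is illustrative only; the value 16 is an assumption, and the defaults are often sufficient. For the supported keys, see Tunable parameters.

    # Optional tuning sketch: raise the number of concurrent streams that read
    # tensors from Cloud Storage (16 is an illustrative value).
    python3 -m vllm.entrypoints.openai.api_server \
        --model=gs://BUCKET_NAME/PATH_TO_MODEL \
        --load-format=runai_streamer \
        --model-loader-extra-config='{"concurrency":16}'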

    1. Save the following manifest as vllm-deployment.yaml. The manifest works with both Autopilot and Standard clusters.

          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: vllm-streamer-deployment
            namespace: NAMESPACE
          spec:
            replicas: 1
            selector:
              matchLabels:
                app: vllm-streamer
            template:
              metadata:
                labels:
                  app: vllm-streamer
              spec:
                serviceAccountName: KSA_NAME
                containers:
                  - name: vllm-container
                    image: vllm/vllm-openai:v0.11.1
                    command:
                      - python3
                      - -m
                      - vllm.entrypoints.openai.api_server
                    args:
                      - --model=gs://BUCKET_NAME/PATH_TO_MODEL
                      - --load-format=runai_streamer
                      - --model-loader-extra-config={"distributed":true}
                      - --host=0.0.0.0
                      - --port=8000
                      - --disable-log-requests
                      - --tensor-parallel-size=1
                    ports:
                      - containerPort: 8000
                        name: api
                    # startupProbe allows for longer startup times for large models
                    startupProbe:
                      httpGet:
                        path: /health
                        port: 8000
                      failureThreshold: 60  # 60 * 10s = 10 minutes timeout
                      periodSeconds: 10
                      initialDelaySeconds: 30
                    readinessProbe:
                      httpGet:
                        path: /health
                        port: 8000
                      failureThreshold: 3
                      periodSeconds: 10
                    resources:
                      limits:
                        nvidia.com/gpu: "1"
                      requests:
                        nvidia.com/gpu: "1"
                    volumeMounts:
                    - mountPath: /dev/shm
                      name: dshm
                nodeSelector:
                  cloud.google.com/gke-accelerator: nvidia-l4
                volumes:
                - emptyDir:
                    medium: Memory
                  name: dshm
      

      Replace the following:

      • NAMESPACE: your Kubernetes namespace.
      • KSA_NAME: your Kubernetes service account name.
      • BUCKET_NAME: your Cloud Storage bucket name.
      • PATH_TO_MODEL: the path to your model directory within the bucket, for example, models/my-llama.
    2. Apply the manifest to create the Deployment:

      kubectl create -f vllm-deployment.yaml
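
      Optionally, wait for the rollout to finish before you continue. Streaming a large model can take several minutes:

      # Block until the Deployment reports ready, or time out after 15 minutes.
      kubectl rollout status deployment/vllm-streamer-deployment \
          -n NAMESPACE --timeout=15m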
      

    You can generate more vLLM manifests using the GKE Inference Quickstart tool.

    Verify the Deployment

    1. Check the status of the Deployment:

      kubectl get deployments -n NAMESPACE
      
    2. Get the Pod name:

      kubectl get pods -n NAMESPACE | grep vllm-streamer
      

      Note the name of the Pod starting with vllm-streamer-deployment.

    3. To confirm that the model streamer is streaming the model weights, view the Pod logs:

      kubectl logs -f POD_NAME -n NAMESPACE
      

      Replace POD_NAME with the name of the Pod from the previous step. Successful streaming logs look similar to the following:

      [RunAI Streamer] Overall time to stream 15.0 GiB of all files: 13.4s, 1.1 GiB/s
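
      Optionally, you can send a test request to the server after it reports that it's ready. The following sketch assumes the default served model name, which matches the --model value in the manifest:

      # Forward the vLLM port locally, then call the OpenAI-compatible API.
      kubectl port-forward -n NAMESPACE deployment/vllm-streamer-deployment 8000:8000 &
      curl http://localhost:8000/v1/completions \
          -H "Content-Type: application/json" \
          -d '{"model": "gs://BUCKET_NAME/PATH_TO_MODEL", "prompt": "Hello", "max_tokens": 16}'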
      

    Optional: Boost performance with Anywhere Cache

    Cloud Storage Anywhere Cache can further accelerate model loading by caching data closer to your GKE nodes. Caching is especially beneficial when you scale out to multiple nodes in the same zone.

    You enable Anywhere Cache for a specific Cloud Storage bucket in a specific Google Cloud zone. To improve performance, the cache's zone must match the zone where your GKE inference Pods run. Your approach depends on whether your Pods run in predictable zones.

    • For GKE Standard zonal clusters, where you know which zone your Pods will run in, enable Anywhere Cache for that specific zone.

    • For regional GKE clusters (both Autopilot and Standard), where Pods can be scheduled across multiple zones, you have the following options:

      • Enable caching in all zones: Enable Anywhere Cache in every zone within the cluster's region. This ensures a cache is available regardless of where GKE schedules your Pods. Note that you incur costs for each zone where caching is enabled. For more information, see Anywhere Cache pricing.
      • Place Pods into a specific zone: Use a nodeSelector or nodeAffinity rule in your workload manifest to constrain your Pods to a single zone, as shown in the example after this list. You can then enable Anywhere Cache in only that zone. This is a more cost-effective approach if your workload tolerates being restricted to a single zone.
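
    For the second option, the following sketch pins the sample Deployment from this document to a single zone by adding the standard topology.kubernetes.io/zone label to its nodeSelector. The zone value is a placeholder; you can equivalently add the same key to the nodeSelector in vllm-deployment.yaml before you deploy.

    # Strategic merge patch: adds a zone constraint while keeping the existing
    # GPU nodeSelector key. Replace NAMESPACE and the zone value as needed.
    kubectl patch deployment vllm-streamer-deployment -n NAMESPACE \
        -p '{"spec":{"template":{"spec":{"nodeSelector":{"topology.kubernetes.io/zone":"us-central1-a"}}}}}'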

    To enable Anywhere Cache for the zone where your GKE cluster resides, run the following commands:

    # Enable the cache
    gcloud storage buckets anywhere-caches create gs://$BUCKET_NAME $ZONE
    
    # Check the status of the cache
    gcloud storage buckets anywhere-caches describe $BUCKET_NAME/$ZONE
    

    Clean up

    To avoid incurring charges to your Google Cloud account for the resources used in this document, either delete the project that contains the resources, or keep the project and delete the individual resources.

    To delete the individual resources, follow these steps:

    1. Delete the GKE cluster. This action removes all nodes and workloads.

      gcloud container clusters delete CLUSTER_NAME --location=ZONE_OR_REGION
      

      Replace the following:

      • CLUSTER_NAME: the name of your cluster.
      • ZONE_OR_REGION: the zone or region of your cluster.
    2. Disable Anywhere Cache, if you enabled it, to avoid ongoing costs. For more information, see Disable a cache.
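
      As a sketch, assuming the cache ID format from the describe command earlier in this document, the command looks like the following:

      # Disable the Anywhere Cache that you created earlier.
      gcloud storage buckets anywhere-caches disable $BUCKET_NAME/$ZONE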

    What's next