Serve an LLM using TPU Trillium (v6e) on GKE with vLLM


This tutorial shows you how to serve large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the vLLM serving framework. In this tutorial, you serve Llama 3.1 70b, use TPU Trillium (v6e), and set up horizontal Pod autoscaling using vLLM server metrics.

This document is a good starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when you deploy and serve your AI/ML workloads.

Background

By using TPU v6e on GKE, you can implement a robust, production-ready serving solution with all the benefits of managed Kubernetes, including efficient scalability and higher availability. This section describes the key technologies used in this guide.

TPU Trillium (v6e)

TPUs are Google's custom-developed application-specific integrated circuits (ASICs). TPUs are used to accelerate machine learning and AI models built using frameworks such as TensorFlow, PyTorch, and JAX. This tutorial uses TPU v6e, which is Google's latest generation AI accelerator.

Before you use TPUs in GKE, we recommend that you complete the following learning path:

  1. Learn about TPU v6e system architecture.
  2. Learn about TPUs in GKE.

vLLM

vLLM is a highly optimized, open-source framework for serving LLMs. vLLM can increase serving throughput on TPUs, with features such as the following:

  • Optimized transformer implementation with PagedAttention.
  • Continuous batching to improve the overall serving throughput.
  • Tensor parallelism and distributed serving on multiple TPUs.

To learn more, refer to the vLLM documentation.
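
For orientation, the following sketch shows how a vLLM OpenAI-compatible server is started with tensor parallelism. The entrypoint and flags are the same ones used in the Deployment manifest later in this tutorial; treat this as an illustration only, because in this tutorial the command runs inside the vLLM TPU container on GKE, not on your workstation.

# Illustrative only: the same entrypoint and flags appear in the Deployment
# manifest later in this tutorial.
python3 -m vllm.entrypoints.openai.api_server \
    --host=0.0.0.0 \
    --port=8000 \
    --tensor-parallel-size=8 \
    --max-model-len=8192 \
    --model=meta-llama/Meta-Llama-3.1-70B

The --tensor-parallel-size=8 flag shards the model across the eight TPU chips of the 2x4 TPU v6e slice used in this tutorial.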

Objectives

This tutorial is intended for MLOps or DevOps engineers or platform administrators who want to use GKE orchestration capabilities to serve LLMs.

This tutorial covers the following steps:

  1. Create a GKE Standard cluster with the recommended TPU v6e topology based on the model characteristics.
  2. Deploy the vLLM framework on a node pool in your cluster.
  3. Use the vLLM framework to serve Llama 3.1 70b using a load balancer.
  4. Set up horizontal Pod autoscaling using vLLM server metrics.
  5. Serve the model.

Before you begin

  • Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  • In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  • Make sure that billing is enabled for your Google Cloud project.

  • Enable the required API.

    Enable the API

  • Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/iam.securityAdmin, roles/artifactregistry.writer, roles/container.clusterAdmin

    Check for the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.

    4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

    Grant the roles

    1. In the Google Cloud console, go to the IAM page.

      Go to IAM
    2. Select the project.
    3. Click Grant access.
    4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.

    5. In the Select a role list, select a role.
    6. To grant additional roles, click Add another role and add each additional role.
    7. Click Save.

Prepare the environment

In this section, you provision the resources that you need to deploy vLLM and the model.

Get access to the model

You must sign the license consent agreement to use Llama 3.1 70b through the Hugging Face repository.

Generate an access token

If you don't already have one, generate a new Hugging Face token:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a Name of your choice and a Role of at least Read.
  4. Select Generate a token.
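
Optionally, you can confirm that the token works before you store it in your cluster. This check is not part of the original flow; it assumes the Hugging Face Hub whoami-v2 API endpoint, which returns the account associated with a valid token.

# Optional check: replace HUGGING_FACE_TOKEN with the token you generated.
curl -s -H "Authorization: Bearer HUGGING_FACE_TOKEN" \
    https://huggingface.co/api/whoami-v2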

Launch Cloud Shell

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

  1. In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. A session opens in the bottom pane of the Google Cloud console.

  2. Set the default environment variables:

    gcloud config set project PROJECT_ID && \
    export PROJECT_ID=$(gcloud config get project) && \
    export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)") && \
    export CLUSTER_NAME=CLUSTER_NAME && \
    export ZONE=ZONE && \
    export HF_TOKEN=HUGGING_FACE_TOKEN && \
    export CLUSTER_VERSION=CLUSTER_VERSION && \
    export GSBUCKET=GSBUCKET && \
    export KSA_NAME=KSA_NAME && \
    export NAMESPACE=NAMESPACE
    

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • CLUSTER_NAME: the name of your GKE cluster.
    • ZONE: a zone that supports v6e TPUs.
    • CLUSTER_VERSION: the GKE version, which must support the machine type that you want to use. Note that the default GKE version might not have availability for your target TPU. TPU v6e is supported in GKE versions 1.31.1-gke.1621000 or later.
    • GSBUCKET: the name of the Cloud Storage bucket to use for Cloud Storage FUSE.
    • KSA_NAME: the name of the Kubernetes ServiceAccount that's used to access Cloud Storage buckets. Bucket access is needed for Cloud Storage FUSE to work.
    • NAMESPACE: the Kubernetes namespace where you want to deploy the vLLM assets.
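
For example, a filled-in version of the preceding block might look like the following. All values shown here are hypothetical placeholders; in particular, verify that the zone you choose offers TPU v6e and that the cluster version meets the minimum noted previously.

# Hypothetical example values; substitute your own.
gcloud config set project my-project && \
export PROJECT_ID=$(gcloud config get project) && \
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)") && \
export CLUSTER_NAME=vllm-tpu-cluster && \
export ZONE=us-east5-b && \
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx && \
export CLUSTER_VERSION=1.31.4-gke.1183000 && \
export GSBUCKET=my-project-vllm-models && \
export KSA_NAME=vllm-ksa && \
export NAMESPACE=vllm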

Create a GKE cluster and a TPU slice node pool

In Cloud Shell, do the following:

  1. Create a GKE Standard cluster:

    gcloud container clusters create CLUSTER_NAME \
        --project=PROJECT_ID \
        --zone=ZONE \
        --cluster-version=CLUSTER_VERSION \
        --workload-pool=PROJECT_ID.svc.id.goog \
        --addons GcsFuseCsiDriver
    
  2. Create a TPU slice node pool:

    gcloud container node-pools create tpunodepool \
        --zone=ZONE \
        --num-nodes=1 \
        --machine-type=ct6e-standard-8t \
        --cluster=CLUSTER_NAME \
        --enable-autoscaling --total-min-nodes=1 --total-max-nodes=2
    

    These commands create the following resources for the LLM:

    • A GKE Standard cluster that uses Workload Identity Federation for GKE and has the Cloud Storage FUSE CSI driver enabled.
    • A TPU v6e slice node pool with the ct6e-standard-8t machine type (eight TPU chips per node). The node pool has one node and autoscaling enabled for one to two nodes.

  3. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME --location=ZONE
    
  4. Create a Kubernetes Secret that contains the Hugging Face token:

    kubectl create secret generic hf-secret \
        --from-literal=hf_api_token=HUGGING_FACE_TOKEN \
        --namespace NAMESPACE
    
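Optionally, you can confirm that the TPU slice node registered with the cluster and that the Secret exists before you continue. This is an extra sanity check rather than part of the original flow; the node label matches the nodeSelector that the Deployment manifest uses later in this tutorial.

# Optional check: list TPU v6e nodes and confirm the Hugging Face Secret.
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v6e-slice
kubectl get secret hf-secret --namespace NAMESPACE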

Create a Cloud Storage bucket

In Cloud Shell, run the following command:

gcloud storage buckets create gs://GSBUCKET \
    --uniform-bucket-level-access

This creates a Cloud Storage bucket to store the model files you download from Hugging Face.
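
If you want to confirm that the bucket was created with the expected settings, you can describe it. This is an optional check, not part of the original flow.

# Optional check: show the bucket's configuration.
gcloud storage buckets describe gs://GSBUCKET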

Set up a Kubernetes ServiceAccount to access the bucket

  1. Create a namespace to use for the Kubernetes ServiceAccount. You can skip this step if you are using the default namespace:

    kubectl create namespace NAMESPACE
    
  2. Create the Kubernetes ServiceAccount:

    kubectl create serviceaccount KSA_NAME --namespace NAMESPACE
    
  3. Grant the Kubernetes ServiceAccount read-write access to the Cloud Storage bucket:

    gcloud storage buckets add-iam-policy-binding gs://GSBUCKET \
      --member "principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/NAMESPACE/sa/KSA_NAME" \
      --role "roles/storage.objectUser"
    
  4. Alternatively, you can grant read-write access to all Cloud Storage buckets in the project:

    gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/NAMESPACE/sa/KSA_NAME" \
    --role "roles/storage.objectUser"
    

    In this tutorial, the LLM uses the following Cloud Storage resources:

    • A Cloud Storage bucket that stores the downloaded model and the compilation cache. The Cloud Storage FUSE CSI driver reads the content of the bucket.
    • Volumes with file caching enabled and the parallel download feature of Cloud Storage FUSE.

    Best practice:

    Use a file cache backed by RAM or pd-balanced storage, depending on the expected size of the model contents (for example, the weight files). In this tutorial, you use a Cloud Storage FUSE file cache backed by RAM.
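
Optionally, you can confirm that the IAM binding was applied by viewing the bucket's IAM policy. This is an extra check, not part of the original flow; look for the roles/storage.objectUser role bound to the Workload Identity Federation principal from the previous step.

# Optional check: inspect the bucket's IAM policy.
gcloud storage buckets get-iam-policy gs://GSBUCKET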

Build and deploy the TPU image

Containerize your vLLM server:

  1. Clone the vLLM repository and build the image:

    git clone https://github.com/vllm-project/vllm && \
    cd vllm && \
    git reset --hard cd34029e91ad2d38a58d190331a65f9096c0b157 && \
    docker build -f Dockerfile.tpu . -t vllm-tpu
    
  2. Push the image to Artifact Registry:

    gcloud artifacts repositories create vllm-tpu --repository-format=docker --location=REGION_NAME && \
    gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
    docker image tag vllm-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/vllm-tpu/vllm-tpu:latest && \
    docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/vllm-tpu/vllm-tpu:latest
    
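Optionally, verify that the image is available in Artifact Registry before you reference it from the Deployment manifest. This optional check lists the images in the repository that you created.

# Optional check: list images in the vllm-tpu repository.
gcloud artifacts docker images list REGION_NAME-docker.pkg.dev/PROJECT_ID/vllm-tpu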

Deploy vLLM model server

To deploy the vLLM model server, follow these steps:

  1. Save the following manifest as vllm-llama3-70b.yaml:

    
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-tpu
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: vllm-tpu
      template:
        metadata:
          labels:
            app: vllm-tpu
          annotations:
            gke-gcsfuse/volumes: "true"
            gke-gcsfuse/cpu-limit: "0"
            gke-gcsfuse/memory-limit: "0"
            gke-gcsfuse/ephemeral-storage-limit: "0"
        spec:
          serviceAccountName: KSA_NAME
          nodeSelector:
            cloud.google.com/gke-tpu-topology: 2x4
            cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
          containers:
          - name: vllm-tpu
            image: REGION_NAME-docker.pkg.dev/PROJECT_ID/vllm-tpu/vllm-tpu:latest
            command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
            args:
            - --host=0.0.0.0
            - --port=8000
            - --tensor-parallel-size=8
            - --max-model-len=8192
            - --model=meta-llama/Meta-Llama-3.1-70B
            - --download-dir=/data
            env: 
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: hf_api_token
            - name: VLLM_XLA_CACHE_PATH
              value: "/data"
            ports:
            - containerPort: 8000
            resources:
              limits:
                google.com/tpu: 8
            readinessProbe:
              tcpSocket:
                port: 8000
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
            - name: gcs-fuse-csi-ephemeral
              mountPath: /data
            - name: dshm
              mountPath: /dev/shm
          volumes:
          - name: gke-gcsfuse-cache
            emptyDir:
              medium: Memory
          - name: dshm
            emptyDir:
              medium: Memory
          - name: gcs-fuse-csi-ephemeral
            csi:
              driver: gcsfuse.csi.storage.gke.io
              volumeAttributes:
                bucketName: GSBUCKET
                mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true,file-cache:parallel-downloads-per-file:100,file-cache:max-parallel-downloads:-1,file-cache:download-chunk-size-mb:10,file-cache:max-size-mb:-1"
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: vllm-service
    spec:
      selector:
        app: vllm-tpu
      type: LoadBalancer	
      ports:
        - name: http
          protocol: TCP
          port: 8000  
          targetPort: 8000
    
  2. Apply the manifest by running the following command:

    kubectl apply -f vllm-llama3-70b.yaml
    
  3. View the logs from the running model server:

    kubectl logs -f -l app=vllm-tpu
    

    The output should look similar to the following:

    INFO 09-20 19:03:48 launcher.py:19] Available routes are:
    INFO 09-20 19:03:48 launcher.py:27] Route: /openapi.json, Methods: GET, HEAD
    INFO 09-20 19:03:48 launcher.py:27] Route: /docs, Methods: GET, HEAD
    INFO 09-20 19:03:48 launcher.py:27] Route: /docs/oauth2-redirect, Methods: GET, HEAD
    INFO 09-20 19:03:48 launcher.py:27] Route: /redoc, Methods: GET, HEAD
    INFO 09-20 19:03:48 launcher.py:27] Route: /health, Methods: GET
    INFO 09-20 19:03:48 launcher.py:27] Route: /tokenize, Methods: POST
    INFO 09-20 19:03:48 launcher.py:27] Route: /detokenize, Methods: POST
    INFO 09-20 19:03:48 launcher.py:27] Route: /v1/models, Methods: GET
    INFO 09-20 19:03:48 launcher.py:27] Route: /version, Methods: GET
    INFO 09-20 19:03:48 launcher.py:27] Route: /v1/chat/completions, Methods: POST
    INFO 09-20 19:03:48 launcher.py:27] Route: /v1/completions, Methods: POST
    INFO 09-20 19:03:48 launcher.py:27] Route: /v1/embeddings, Methods: POST
    INFO:     Started server process [1]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
    (RayWorkerWrapper pid=25987) INFO 09-20 19:03:46 tpu_model_runner.py:290] Compilation for decode done in 202.93 s.
    (RayWorkerWrapper pid=25987) INFO 09-20 19:03:46 tpu_model_runner.py:283] batch_size: 256, seq_len: 1 [repeated 7x across cluster]
    INFO 09-20 19:03:53 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
    
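Before you query the model, you can optionally confirm that the Pod is ready and that the Service received an external IP address. Model download and XLA compilation can take a while (note the roughly 200-second compilation step in the preceding logs), so the readiness probe might not pass immediately.

# Optional check: confirm the Pod is Ready and the Service has an external IP.
kubectl get pods -l app=vllm-tpu
kubectl get service vllm-service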

Serve the model

  1. Run the following command to get the external IP address of the Service:

    export vllm_service=$(kubectl get service vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    
  2. In a new terminal, interact with the model using curl:

    curl http://$vllm_service:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3.1-70B",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
    

    The output should be similar to the following:

    {"id":"cmpl-6b4bb29482494ab88408d537da1e608f","object":"text_completion","created":1727822657,"model":"meta-llama/Meta-Llama-3.1-70B","choices":[{"index":0,"text":" top holiday destination featuring scenic beauty and","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
    
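You can also query the /v1/models route that appears in the server logs to confirm which model IDs the server exposes. This optional check, run in the same terminal where you set the $vllm_service variable, helps when the model name in a request doesn't match what the server loaded.

# Optional check: list the models that the vLLM server exposes.
curl http://$vllm_service:8000/v1/models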

Set up the custom autoscaler

In this section, you set up horizontal Pod autoscaling using custom Prometheus metrics. You use the Google Cloud Managed Service for Prometheus metrics from the vLLM server.

To learn more, see Google Cloud Managed Service for Prometheus, which is enabled by default on GKE clusters.

  1. Set up the Custom Metrics Stackdriver Adapter on your cluster:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml
    
  2. Add the Monitoring Viewer role to the service account that the Custom Metrics Stackdriver Adapter uses:

    gcloud projects add-iam-policy-binding projects/PROJECT_ID \
        --role roles/monitoring.viewer \
        --member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/custom-metrics/sa/custom-metrics-stackdriver-adapter
    
  3. Save the following manifest as vllm_pod_monitor.yaml:

    
    apiVersion: monitoring.googleapis.com/v1
    kind: PodMonitoring
    metadata:
     name: vllm-pod-monitoring
    spec:
     selector:
       matchLabels:
         app: vllm-tpu
     endpoints:
     - path: /metrics
       port: 8000
       interval: 15s
    
  4. Apply it to the cluster:

    kubectl apply -f vllm_pod_monitor.yaml
    
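Optionally, confirm that the PodMonitoring resource was created. This assumes that the Managed Service for Prometheus custom resources are installed on the cluster, which is the default on GKE clusters with managed collection enabled.

# Optional check: confirm the PodMonitoring resource exists.
kubectl get podmonitoring vllm-pod-monitoring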

Create load on the vLLM endpoint

Generate load on the vLLM server to test how GKE autoscales with a custom vLLM metric.

  1. Save the following bash script as load.sh. It sends N parallel request loops to the vLLM endpoint:

    #!/bin/bash
    N=PARALLEL_PROCESSES
    export vllm_service=$(kubectl get service vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    for i in $(seq 1 $N); do
      while true; do
        curl http://$vllm_service:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Meta-Llama-3.1-70B", "prompt": "Write a story about san francisco", "max_tokens": 100, "temperature": 0}'
      done &  # Run in the background
    done
    wait
    

    Replace PARALLEL_PROCESSES with the number of parallel processes that you want to run.

  2. Make the script executable (for example, chmod +x load.sh), and then run it:

    nohup ./load.sh &
    

Verify that Google Cloud Managed Service for Prometheus ingests the metrics

After you add load to the vLLM endpoint and Google Cloud Managed Service for Prometheus scrapes the metrics, you can view the metrics in Cloud Monitoring.

  1. In the Google Cloud console, go to the Metrics explorer page.

    Go to Metrics explorer

  2. Click < > PromQL.

  3. Enter the following query to observe traffic metrics:

    vllm:avg_generation_throughput_toks_per_s{cluster='CLUSTER_NAME'}
    

In the line graph, the vLLM metric increases from 0 (before the load starts) to a positive value (while the load runs). This graph confirms that your vLLM metrics are being ingested into Google Cloud Managed Service for Prometheus.

The following image is an example of a graph after the load script execution. In this case, the model server is serving around 2,000 generation tokens per second.

[Image: Metrics Explorer graph showing vLLM average generation throughput increasing after the load script starts]

Deploy the Horizontal Pod Autoscaler configuration

In this section, you deploy the Horizontal Pod Autoscaler configuration.

  1. Save the following manifest as vllm-hpa.yaml:

    
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
     name: vllm-hpa
    spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: Deployment
       name: vllm-tpu
     minReplicas: 1
     maxReplicas: 2
     metrics:
       - type: Pods
         pods:
           metric:
             name: prometheus.googleapis.com|vllm:num_requests_waiting|gauge
           target:
             type: AverageValue
             averageValue: 1
    

    The vLLM metrics in Google Cloud Managed Service for Prometheus follow the vllm:metric_name format.

    Best practice:

    Use num_requests_waiting for scaling throughput. Use gpu_cache_usage_perc for latency-sensitive TPU use cases.

  2. Deploy the Horizontal Pod Autoscaler configuration:

    kubectl apply -f vllm-hpa.yaml
    

    GKE schedules another Pod, which triggers the node pool autoscaler to add a second node before it can place the second vLLM replica.

  3. Watch the progress of the Pod autoscaling:

    kubectl get hpa --watch
    

    The output is similar to the following:

    NAME       REFERENCE             TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
    vllm-hpa   Deployment/vllm-tpu   <unknown>/1   1         2         0          6s
    vllm-hpa   Deployment/vllm-tpu   34972m/1      1         2         1          16s
    vllm-hpa   Deployment/vllm-tpu   25112m/1      1         2         2          31s
    vllm-hpa   Deployment/vllm-tpu   35301m/1      1         2         2          46s
    vllm-hpa   Deployment/vllm-tpu   25098m/1      1         2         2          62s
    vllm-hpa   Deployment/vllm-tpu   35348m/1      1         2         2          77s
    
  4. Wait for 10 minutes and repeat the steps in the Verify that Google Cloud Managed Service for Prometheus ingests the metrics section. Google Cloud Managed Service for Prometheus ingests the metrics from both vLLM endpoints.

[Image: Metrics Explorer graph showing metrics ingested from both vLLM replicas]

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following commands:

ps -ef | grep load.sh | awk '{print $2}' | xargs -n1 kill -9
gcloud container clusters delete CLUSTER_NAME \
  --location=ZONE

What's next