Serve an LLM with multiple GPUs in GKE


This tutorial shows you how to serve a large language model (LLM) with GPUs in Google Kubernetes Engine (GKE). This tutorial creates a GKE cluster that uses multiple L4 GPUs and prepares the GKE infrastructure to serve any of the following models:

Depending on the data format of the model, the number of GPUs varies. In this tutorial, each model uses two L4 GPUs. To learn more, see Calculating the amount of GPUs.

Before you complete this tutorial in GKE, we recommend that you learn About GPUs in GKE.

Objectives

This tutorial is intended for MLOps or DevOps engineers or platform administrator that want to use GKE orchestration capabilities for serving LLMs.

This tutorial covers the following steps:

  1. Create a cluster and node pools.
  2. Prepare your workload.
  3. Deploy your workload.
  4. Interact with the LLM interface.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • Enable Google Kubernetes Engine API
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • Some models have additional requirements. Ensure you meet these requirements:

Prepare your environment

  1. In the Google Cloud console, start a Cloud Shell instance:
    Open Cloud Shell

  2. Set the default environment variables:

    gcloud config set project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export REGION=us-central1
    

    Replace the PROJECT_ID with your Google Cloud project ID.

Create a GKE cluster and node pool

You can serve LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use a Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

  1. In Cloud Shell, run the following command:

    gcloud container clusters create-auto l4-demo \
      --project=${PROJECT_ID} \
      --region=${REGION} \
      --release-channel=rapid
    

    GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.

  2. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials l4-demo --region=${REGION}
    

Standard

  1. In Cloud Shell, run the following command to create a Standard cluster that uses Workload Identity Federation for GKE:

    gcloud container clusters create l4-demo --location ${REGION} \
      --workload-pool ${PROJECT_ID}.svc.id.goog \
      --enable-image-streaming \
      --node-locations=$REGION-a \
      --workload-pool=${PROJECT_ID}.svc.id.goog \
      --machine-type n2d-standard-4 \
      --num-nodes 1 --min-nodes 1 --max-nodes 5 \
      --release-channel=rapid
    

    The cluster creation might take several minutes.

  2. Run the following command to create a node pool for your cluster:

    gcloud container node-pools create g2-standard-24 --cluster l4-demo \
      --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
      --machine-type g2-standard-24 \
      --enable-autoscaling --enable-image-streaming \
      --num-nodes=0 --min-nodes=0 --max-nodes=3 \
      --node-locations $REGION-a,$REGION-c --region $REGION --spot
    

    GKE creates the following resources for the LLM:

    • A public Google Kubernetes Engine (GKE) Standard edition cluster.
    • A node pool with g2-standard-24 machine type scaled down to 0 nodes. You aren't charged for any GPUs until you launch Pods. that request GPUs. This node pool provisions Spot VMs, which are priced lower than the default standard Compute Engine VMs and provide no guarantee of availability. You can remove the --spot flag from this command, and the cloud.google.com/gke-spot node selector in the text-generation-inference.yaml config to use on-demand VMs.
  3. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials l4-demo --region=${REGION}
    

Prepare your workload

The following section shows how to set up your workload depending on the model you want to use:

Llama 2 70b

  1. Set the default environment variables:

    export HF_TOKEN=HUGGING_FACE_TOKEN
    

    Replace the HUGGING_FACE_TOKEN with your HuggingFace token.

  2. Create a Kubernetes secret for the HuggingFace token:

    kubectl create secret generic l4-demo \
        --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
    
  3. Create the following text-generation-inference.yaml manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: ghcr.io/huggingface/text-generation-inference:1.4.3
            resources:
              requests:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: meta-llama/Llama-2-70b-chat-hf
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: l4-demo
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /data
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 150Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4 which means that the model is loaded in 4 bit instead of 32 bits. This allows GKE to reduce the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the GPUs to request, see Calculating the amount of GPUs.
  4. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml
    

    The output is similar to the following:

    deployment.apps/llm created
    
  5. Verify the status of the model:

    kubectl get deploy
    

    The output is similar to the following:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           20m
    
  6. View the logs from the running deployment:

    kubectl logs -l app=llm
    

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
    

Mixtral 8x7b

  1. Set the default environment variables:

    export HF_TOKEN=HUGGING_FACE_TOKEN
    

    Replace the HUGGING_FACE_TOKEN with your HuggingFace token.

  2. Create a Kubernetes secret for the HuggingFace token:

    kubectl create secret generic l4-demo \
        --from-literal=HUGGING_FACE_TOKEN=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
    
  3. Create the following text-generation-inference.yaml manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: ghcr.io/huggingface/text-generation-inference:1.4.3
            resources:
              requests:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: mistralai/Mixtral-8x7B-Instruct-v0.1
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: l4-demo
                  key: HUGGING_FACE_TOKEN          
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /data
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 100Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4 which means that the model is loaded in 4 bit instead of 32 bits. This allows GKE to reduce the amount of GPU memory needed and improves the inference speed. However, this may reduce model accuracy. To learn how to calculate the GPUs to request, see Calculating the amount of GPUs.
  4. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml
    

    The output is similar to the following:

    deployment.apps/llm created
    
  5. Verify the status of the model:

    watch kubectl get deploy
    

    The output is similar to the following when the deployment is ready. To exit the watch, type CTRL + C:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           10m
    
  6. View the logs from the running deployment:

    kubectl logs -l app=llm
    

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
    

Falcon 40b

  1. Create the following text-generation-inference.yaml manifest:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llm
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          containers:
          - name: llm
            image: ghcr.io/huggingface/text-generation-inference:1.4.3
            resources:
              requests:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "10"
                memory: "60Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: tiiuae/falcon-40b-instruct
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
              - mountPath: /data
                name: ephemeral-volume
          volumes:
            - name: dshm
              emptyDir:
                  medium: Memory
            - name: ephemeral-volume
              ephemeral:
                volumeClaimTemplate:
                  metadata:
                    labels:
                      type: ephemeral
                  spec:
                    accessModes: ["ReadWriteOnce"]
                    storageClassName: "premium-rwo"
                    resources:
                      requests:
                        storage: 175Gi
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-l4"
            cloud.google.com/gke-spot: "true"

    In this manifest:

    • NUM_SHARD must be 2 because the model requires two NVIDIA L4 GPUs.
    • QUANTIZE is set to bitsandbytes-nf4 which means that the model is loaded in 4 bit instead of 32 bits. This allows GKE to reduce the amount of GPU memory needed and improves the inference speed. However, the model accuracy can decrease. To learn how to calculate the GPUs to request, see Calculating the amount of GPUs.
  2. Apply the manifest:

    kubectl apply -f text-generation-inference.yaml
    

    The output is similar to the following:

    deployment.apps/llm created
    
  3. Verify the status of the model:

    watch kubectl get deploy
    

    The output is similar to the following when the deployment is ready. To exit the watch, type CTRL + C:

    NAME          READY   UP-TO-DATE   AVAILABLE   AGE
    llm           1/1     1            1           10m
    
  4. View the logs from the running deployment:

    kubectl logs -l app=llm
    

    The output is similar to the following:

    {"timestamp":"2024-03-09T05:08:14.751646Z","level":"INFO","message":"Warming up model","target":"text_generation_router","filename":"router/src/main.rs","line_number":291}
    {"timestamp":"2024-03-09T05:08:19.961136Z","level":"INFO","message":"Setting max batch total tokens to 133696","target":"text_generation_router","filename":"router/src/main.rs","line_number":328}
    {"timestamp":"2024-03-09T05:08:19.961164Z","level":"INFO","message":"Connected","target":"text_generation_router","filename":"router/src/main.rs","line_number":329}
    {"timestamp":"2024-03-09T05:08:19.961171Z","level":"WARN","message":"Invalid hostname, defaulting to 0.0.0.0","target":"text_generation_router","filename":"router/src/main.rs","line_number":343}
    

Create a Service of type ClusterIP

  1. Create the following llm-service.yaml manifest:

    apiVersion: v1
    kind: Service
    metadata:
      name: llm-service
    spec:
      selector:
        app: llm
      type: ClusterIP
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8080
    
  2. Apply the manifest:

    kubectl apply -f llm-service.yaml
    

Deploy a chat interface

Use Gradio to build a web application that lets you interact with your model. Gradio is a Python library that has a ChatInterface wrapper that creates user interfaces for chatbots.

Llama 2 70b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy: 
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.0
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "llama-2-70b"
            - name: USER_PROMPT
              value: "[INST] prompt [/INST]"
            - name: SYSTEM_PROMPT
              value: "prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
    
  3. Find the external IP address of the Service:

    kubectl get svc
    

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
    
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP
    

Mixtral 8x7b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy: 
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.0
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "mixtral-8x7b"
            - name: USER_PROMPT
              value: "[INST] prompt [/INST]"
            - name: SYSTEM_PROMPT
              value: "prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
    
  3. Find the external IP address of the Service:

    kubectl get svc
    

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
    
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP
    

Falcon 40b

  1. Create a file named gradio.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gradio
      labels:
        app: gradio
    spec:
      strategy: 
        type: Recreate
      replicas: 1
      selector:
        matchLabels:
          app: gradio
      template:
        metadata:
          labels:
            app: gradio
        spec:
          containers:
          - name: gradio
            image: us-docker.pkg.dev/google-samples/containers/gke/gradio-app:v1.0.0
            resources:
              requests:
                cpu: "512m"
                memory: "512Mi"
              limits:
                cpu: "1"
                memory: "512Mi"
            env:
            - name: CONTEXT_PATH
              value: "/generate"
            - name: HOST
              value: "http://llm-service"
            - name: LLM_ENGINE
              value: "tgi"
            - name: MODEL_ID
              value: "falcon-40b-instruct"
            - name: USER_PROMPT
              value: "User: prompt"
            - name: SYSTEM_PROMPT
              value: "Assistant: prompt"
            ports:
            - containerPort: 7860
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: gradio-service
    spec:
      type: LoadBalancer
      selector:
        app: gradio
      ports:
      - port: 80
        targetPort: 7860
  2. Apply the manifest:

    kubectl apply -f gradio.yaml
    
  3. Find the external IP address of the Service:

    kubectl get svc
    

    The output is similar to the following:

    NAME             TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)        AGE
    gradio-service   LoadBalancer   10.24.29.197   34.172.115.35   80:30952/TCP   125m
    
  4. Copy the external IP address from the EXTERNAL-IP column.

  5. View the model interface from your web browser by using the external IP address with the exposed port:

    http://EXTERNAL_IP
    

Calculating the amount of GPUs

The amount of GPUs depends on the value of the QUANTIZE flag. In this tutorial, QUANTIZE is set to bitsandbytes-nf4, which means that the model is loaded in 4 bits.

A 70 billion parameter model would require a minimum of 40 GB of GPU memory which equals to 70 billion times 4 bits (70 billion x 4 bits= 35 GB) and considers a 5 GB of overhead. In this case, a single L4 GPU wouldn't have enough memory. Therefore, the examples in this tutorial use two L4 GPU of memory (2 x 24 = 48 GB). This configuration is sufficient for running Falcon 40b or Llama 2 70b in L4 GPUs.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the cluster

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, delete the GKE cluster:

gcloud container clusters delete l4-demo --region ${REGION}

What's next