This guide shows you how to optimize costs for LLM-serving workloads on GKE. It uses a combination of Flex-start VMs, Spot VMs, and custom compute class profiles for cost-effective inference.
This guide uses Mixtral 8x7b as an example LLM you can deploy.
This guide is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities for serving LLMs. For more information about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.
Flex-start pricing
Flex-start is recommended if your workload needs resources that are provisioned dynamically as needed, for up to seven days, with short-term reservations, no complex quota management, and cost-effective access. Flex-start is powered by Dynamic Workload Scheduler and is billed using Dynamic Workload Scheduler pricing:
- Discounted (up to 53%) for vCPUs, GPUs, and TPUs.
- You pay as you go.
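As a rough illustration of what the discount means, the following sketch computes the cost of a run at a given discount. The function name `discounted_cost` and the hourly rate of 1.00 are hypothetical and chosen only for the arithmetic. Look up the actual Dynamic Workload Scheduler price for your machine type and region, and remember that 53% is the upper bound quoted above, not a guaranteed rate.

```shell
# Print the discounted cost of a run, rounded to two decimals.
#   $1: on-demand hourly price (hypothetical input; look up the real price)
#   $2: hours of use
#   $3: discount percentage (this guide quotes "up to 53")
discounted_cost() {
  LC_ALL=C awk -v rate="$1" -v hours="$2" -v pct="$3" \
    'BEGIN { printf "%.2f\n", rate * hours * (1 - pct / 100) }'
}

# Seven days (168 hours) at a hypothetical 1.00 per hour with the maximum discount.
discounted_cost 1.00 168 53
```

With these inputs, the function prints 78.96, compared to 168.00 without a discount.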
Background
This section describes the available techniques that you can use to obtain computing resources, including GPU accelerators, based on the requirements of your AI/ML workloads. These techniques are called accelerator obtainability strategies in GKE.
GPUs
Graphics processing units (GPUs) let you accelerate specific workloads such as machine learning and data processing. GKE offers nodes that are equipped with these GPUs to optimize the performance of machine learning and data processing tasks. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, A100, and L4 GPUs.
For more information, see About GPUs in GKE.
Flex-start
Flex-start, powered by Dynamic Workload Scheduler, is a GPU consumption option where GKE persists your GPU request and automatically provisions Flex-start VMs when capacity becomes available. Consider using flex-start for workloads that need GPU capacity for a limited time, up to seven days, and don't have a fixed start date. For more information, see flex-start.
Spot VMs
You can use GPUs with Spot VMs if your workloads can tolerate frequent node disruptions. Using Spot VMs or flex-start reduces the price of running GPUs. Combining Spot VMs with flex-start provides a fallback option when Spot VM capacity is unavailable.
For more information, see Using Spot VMs with GPU node pools.
Custom compute classes
You can request GPUs by using custom compute classes. Custom compute classes let you define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware. For more information, see About custom compute classes.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
  Roles required to select or create a project:
  - Select a project: Selecting a project doesn't require a specific IAM role. You can select any project that you've been granted a role on.
  - Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
- Verify that billing is enabled for your Google Cloud project.
- Make sure that you have the required roles on the project.
  Check for the roles:
  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
  - For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
  Grant the roles:
  - In the Google Cloud console, go to the IAM page.
  - Select the project.
  - Click Grant access.
  - In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
  - In the Select a role list, select a role.
  - To grant additional roles, click Add another role and add each additional role.
  - Click Save.
- Ensure that you have a GKE Autopilot or Standard cluster that runs version 1.32.2-gke.1652000 or later. Your cluster must have node auto-provisioning enabled and GPU limits configured.
- Create a Hugging Face account, if you don't already have one.
- Ensure your project has sufficient preemptible quota for NVIDIA L4 GPUs. For more information, see Preemptible quotas.
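The version requirement above can be checked with a small comparison helper. The following is a minimal sketch, not an official tool: the function name `gke_version_at_least` is hypothetical, and it assumes both versions use the MAJOR.MINOR.PATCH-gke.BUILD form shown in this guide.

```shell
# Return success when GKE version $1 is at least version $2.
# Assumes both look like MAJOR.MINOR.PATCH-gke.BUILD, for example 1.32.2-gke.1652000.
gke_version_at_least() {
  # Rewrite each version as four space-separated numbers,
  # then compare them field by field, most significant first.
  printf '%s\n%s\n' "$1" "$2" \
    | sed -e 's/-gke\./ /' -e 's/\./ /g' \
    | awk '
        NR == 1 { for (i = 1; i <= 4; i++) have[i] = $i + 0 }
        NR == 2 {
          for (i = 1; i <= 4; i++) {
            if (have[i] > $i + 0) exit 0
            if (have[i] < $i + 0) exit 1
          }
          exit 0
        }'
}

# The node version from the sample output later in this guide meets the minimum.
if gke_version_at_least "1.32.3-gke.1152000" "1.32.2-gke.1652000"; then
  echo "version OK"
else
  echo "upgrade required"
fi
```

To check a real cluster, pass in the version that your cluster reports, for example the value in the VERSION column of `kubectl get nodes` without the leading `v`.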
Get access to the model
If you don't already have one, generate a new Hugging Face token:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a name of your choice and a role of at least Read.
- Select Generate a token.
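Before you store the token in your cluster, it can help to guard against an empty value or a placeholder that was never replaced. The following sketch shows one way to do that. The function name `token_is_set` and the `HF_TOKEN` variable in the usage comment are hypothetical, and the check only catches obvious mistakes. It doesn't validate the token with Hugging Face.

```shell
# Return success only when the value is not empty, is not the unreplaced
# HUGGING_FACE_TOKEN placeholder, and contains no spaces.
token_is_set() {
  case "$1" in
    ""|HUGGING_FACE_TOKEN) return 1 ;;
    *" "*) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical usage, with HF_TOKEN exported in your shell:
#   token_is_set "$HF_TOKEN" || echo "Set HF_TOKEN before creating the Secret" >&2
```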
Create custom compute class profile
In this section, you create a custom compute class profile. Custom compute class profiles define the types and relationships between multiple compute resources used by your workload.
- In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. A session opens in the bottom pane of the Google Cloud console.
- Create a dws-flex-start.yaml manifest file:

    apiVersion: cloud.google.com/v1
    kind: ComputeClass
    metadata:
      name: dws-model-inference-class
    spec:
      priorities:
      - machineType: g2-standard-24
        spot: true
      - machineType: g2-standard-24
        flexStart:
          enabled: true
          nodeRecycling:
            leadTimeSeconds: 3600
      nodePoolAutoCreation:
        enabled: true

- Apply the dws-flex-start.yaml manifest:

    kubectl apply -f dws-flex-start.yaml
GKE deploys g2-standard-24 machines with L4 accelerators. GKE uses compute classes to prioritize Spot VMs first, and Flex-start VMs second.
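If you want to reuse the same Spot-first, flex-start-second pattern with a different class name or machine type, you could generate the manifest from a small template. The following is a sketch under assumptions: the function name `render_compute_class` is hypothetical, the fields are copied from the manifest in this section, and you need to confirm that the machine type you pass supports both Spot VMs and flex-start.

```shell
# Print a ComputeClass manifest that tries Spot VMs first and flex-start second.
#   $1: compute class name
#   $2: machine type used for both priorities
#   $3: node recycling lead time in seconds (defaults to 3600, as in this guide)
render_compute_class() {
  cat <<EOF
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: $1
spec:
  priorities:
  - machineType: $2
    spot: true
  - machineType: $2
    flexStart:
      enabled: true
      nodeRecycling:
        leadTimeSeconds: ${3:-3600}
  nodePoolAutoCreation:
    enabled: true
EOF
}

# Print the same manifest that this guide uses.
render_compute_class dws-model-inference-class g2-standard-24
```

You can pipe the output to `kubectl apply -f -` instead of saving it to a file.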
Deploy the LLM workload
- Create a Kubernetes Secret that contains the Hugging Face token by using the following command:

    kubectl create secret generic model-inference-secret \
        --from-literal=HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN \
        --dry-run=client -o yaml | kubectl apply -f -

  Replace HUGGING_FACE_TOKEN with your Hugging Face access token.
- Create a file named mixtral-deployment.yaml:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: inference-mixtral-ccc
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llm
      template:
        metadata:
          labels:
            app: llm
        spec:
          nodeSelector:
            cloud.google.com/compute-class: dws-model-inference-class
          containers:
          - name: llm
            image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
            resources:
              requests:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
              limits:
                cpu: "5"
                memory: "40Gi"
                nvidia.com/gpu: "2"
            env:
            - name: MODEL_ID
              value: mistralai/Mixtral-8x7B-Instruct-v0.1
            - name: NUM_SHARD
              value: "2"
            - name: PORT
              value: "8080"
            - name: QUANTIZE
              value: bitsandbytes-nf4
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: model-inference-secret
                  key: HUGGING_FACE_TOKEN
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
            - mountPath: /tmp
              name: ephemeral-volume
          volumes:
          - name: dshm
            emptyDir:
              medium: Memory
          - name: ephemeral-volume
            ephemeral:
              volumeClaimTemplate:
                metadata:
                  labels:
                    type: ephemeral
                spec:
                  accessModes: ["ReadWriteOnce"]
                  storageClassName: "premium-rwo"
                  resources:
                    requests:
                      storage: 100Gi

  In this manifest, the mountPath field is set to /tmp because that is where the HF_HOME environment variable points in the Deep Learning Container (DLC) for Text Generation Inference (TGI), instead of the default /data path that's set within the TGI default image. The downloaded model is stored in this directory.
- Deploy the model:

    kubectl apply -f mixtral-deployment.yaml

  GKE schedules a new Pod, which triggers the autoscaler to provision a GPU node that matches the compute class before the Pod starts.
- Verify the status of the model:

    watch kubectl get deploy inference-mixtral-ccc

  If the model was deployed successfully, the output is similar to the following:

    NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
    inference-mixtral-ccc   1/1     1            1           10m

  To exit the watch, press CTRL + C.
- Wait for the container to download and start serving the model:

    watch "kubectl logs $(kubectl get pods -l app=llm -o custom-columns=:metadata.name --no-headers) | tail"

  To exit the watch, press CTRL + C.
- View the node pools that GKE provisioned:

    kubectl get nodes -L cloud.google.com/gke-nodepool

  The output is similar to the following:

    NAME                                         STATUS   ROLES    AGE   VERSION               GKE-NODEPOOL
    gke-flex-na-nap-g2-standard--0723b782-fg7v   Ready    <none>   10m   v1.32.3-gke.1152000   nap-g2-standard-24-spot-gpu2-1gbdlbxz
    gke-flex-nap-zo-default-pool-09f6fe53-fzm8   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
    gke-flex-nap-zo-default-pool-09f6fe53-lv2v   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
    gke-flex-nap-zo-default-pool-09f6fe53-pq6m   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool

  The name of the created node pool indicates the type of machine. In this case, GKE provisioned Spot VMs.
- Expose the model:

    kubectl expose deployment/inference-mixtral-ccc --port 8080 --name=llm-service
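Reading the provisioning type from the node pool name works for a quick check. As an alternative, the following sketch classifies nodes from their labels instead. It rests on two assumptions that you should verify on your cluster: that Spot nodes carry the `cloud.google.com/gke-spot=true` label and that flex-start nodes carry the `cloud.google.com/gke-flex-start=true` label. The function name `classify_nodes` is hypothetical.

```shell
# Classify nodes from "NAME SPOT FLEX" rows read on stdin.
# A column holds "true" when the matching label is set, and "<none>" otherwise.
classify_nodes() {
  awk '{
    kind = "standard"
    if ($2 == "true") kind = "spot"
    else if ($3 == "true") kind = "flex-start"
    print $1, kind
  }'
}

# Example with canned rows in place of live cluster output.
printf '%s\n' \
  'gpu-node-a true <none>' \
  'gpu-node-b <none> true' \
  'default-node <none> <none>' | classify_nodes

# Against a live cluster, the rows could come from (label names are assumptions):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,SPOT:.metadata.labels.cloud\.google\.com/gke-spot,FLEX:.metadata.labels.cloud\.google\.com/gke-flex-start' \
#     | classify_nodes
```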
Interact with the model using curl
This section shows how you can perform a basic inference test to verify your deployed model.
- Set up port forwarding to the model:

    kubectl port-forward service/llm-service 8080:8080

  The output is similar to the following:

    Forwarding from 127.0.0.1:8080 -> 8080
- In a new terminal session, chat with your model by using curl:

    curl http://localhost:8080/v1/completions \
        -X POST \
        -H "Content-Type: application/json" \
        -d '{
            "model": "mixtral-8x7b-instruct-gptq",
            "prompt": "<s>[INST]Who was the first president of the United States?[/INST]",
            "max_tokens": 40}'

  The output looks similar to the following:

    George Washington was a Founding Father and the first president of the United States, serving from 1789 to 1797.
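To send other questions without hand-editing the JSON, you could build the request body with a few helper functions. The following is a minimal sketch: the function names are hypothetical, the instruct template and model field are copied from the request above, and the escaping only handles backslashes and double quotes, so it assumes single-line questions.

```shell
# Wrap a question in the instruct template used in the request above.
mixtral_prompt() {
  printf '<s>[INST]%s[/INST]' "$1"
}

# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Assumption: the input has no newlines or other control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a JSON request body for the /v1/completions endpoint.
#   $1: question text
#   $2: max_tokens (defaults to 40, as in this guide)
completion_body() (
  prompt=$(json_escape "$(mixtral_prompt "$1")")
  printf '{"model": "mixtral-8x7b-instruct-gptq", "prompt": "%s", "max_tokens": %s}\n' \
    "$prompt" "${2:-40}"
)

# Print the body for the question used in this guide.
completion_body 'Who was the first president of the United States?' 40

# With port forwarding active, the body could be sent like this:
#   curl http://localhost:8080/v1/completions -X POST \
#     -H "Content-Type: application/json" \
#     -d "$(completion_body 'Who was the first president of the United States?')"
```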
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, either delete the project that contains the resources, or keep the project and delete the individual resources.
Delete the project
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the individual resources
- Delete the Kubernetes resources that you created from this guide:

    kubectl delete deployment inference-mixtral-ccc
    kubectl delete service llm-service
    kubectl delete computeclass dws-model-inference-class
    kubectl delete secret model-inference-secret

- Delete the cluster:

    gcloud container clusters delete CLUSTER_NAME

  Replace CLUSTER_NAME with the name of your cluster.
What's next
- Learn how to train a small workload with flex-start.
- Learn more about GPUs in GKE.