This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.
This guide provides a starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.
This tutorial is intended for generative AI customers in the Hugging Face ecosystem, new or existing users of GKE, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve LLMs.
You have multiple options for LLM inference on Google Cloud, spanning offerings such as Vertex AI, GKE, and Google Compute Engine, and you can incorporate serving libraries such as JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.
Optimum TPU supports the following features:
- Continuous batching
- Token streaming
- Greedy search and multinomial sampling using transformers.
Get access to the model
You can use the Gemma 2B or Llama3 8B models. This tutorial focuses on these two models, but Optimum TPU supports more models.
Gemma 2B
To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement, and then generate a Hugging Face access token.
Sign the license consent agreement
You must sign the consent agreement to use Gemma. Follow these instructions:
- Access the model consent page.
- Verify consent using your Hugging Face account.
- Accept the model terms.
Generate an access token
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Click New Token.
- Specify a Name of your choice and a Role of at least Read.
- Click Generate a token.
- Copy the generated token to your clipboard.
Llama3 8B
To use Llama3 8B, you must sign the consent agreement in the Hugging Face repository.
Generate an access token
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
Create a GKE cluster
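The commands in this tutorial use placeholder values such as PROJECT_ID, CLUSTER_NAME, ZONE, and REGION_NAME, and later steps reference ${CLUSTER_NAME} and ${ZONE} as shell variables. One option, sketched here with placeholder values you must replace, is to export them once before you start:
export PROJECT_ID=PROJECT_ID        # your Google Cloud project ID
export CLUSTER_NAME=CLUSTER_NAME    # a name for the new cluster
export ZONE=ZONE                    # a zone with TPU v5e (ct5lp) capacity
export REGION_NAME=REGION_NAME      # the region for your Artifact Registry repository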
Create a GKE Standard cluster with 1 CPU node:
gcloud container clusters create CLUSTER_NAME \
--project=PROJECT_ID \
--num-nodes=1 \
--location=ZONE
Create a TPU node pool
Create a v5e TPU node pool with 1 node and 8 chips:
gcloud container node-pools create tpunodepool \
--location=ZONE \
--num-nodes=1 \
--machine-type=ct5lp-hightpu-8t \
--cluster=CLUSTER_NAME
If TPU resources are available, GKE provisions the node pool. If
TPU resources are temporarily unavailable, the output shows a GCE_STOCKOUT
error message. To troubleshoot TPU stockout errors, refer to
Insufficient TPU resources to satisfy the TPU request.
Configure kubectl to communicate with your cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${ZONE}
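Optionally, verify that the TPU node pool has joined the cluster. The label key in this check is the standard node label that GKE applies to TPU nodes:
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator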
Build the container
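The tpu-tgi make target is defined in Hugging Face's Optimum TPU repository. If you don't already have a local copy, clone it first:
git clone https://github.com/huggingface/optimum-tpu.git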
Run the make command to build the image:
cd optimum-tpu && make tpu-tgi
Push the image to Artifact Registry:
gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
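To confirm that the push succeeded, you can list the images in the repository (using the same placeholder values):
gcloud artifacts docker images list REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu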
Create a Kubernetes Secret for Hugging Face credentials
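The command that follows reads the token from the HF_TOKEN environment variable. If you haven't exported the token you generated earlier, do so first (the value shown is a placeholder):
export HF_TOKEN=YOUR_HUGGING_FACE_TOKEN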
Create a Kubernetes Secret that contains the Hugging Face token:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN} \
--dry-run=client -o yaml | kubectl apply -f -
Deploy Optimum TPU
To deploy Optimum TPU, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Gemma 2B
Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml. This manifest describes an Optimum TPU deployment with an internal load balancer on TCP port 8080.
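A minimal sketch of such a manifest is shown below. The image path, the hf-secret Secret, the app: tgi-tpu label, and the service Service name reuse values from other commands in this tutorial; the node selector values, TGI container arguments, and port numbers are assumptions that you should adapt to your environment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      # Schedule onto the TPU v5e node pool created earlier (2x4 topology, 8 chips).
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        # Image built and pushed to Artifact Registry in the previous section.
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        # Illustrative TGI arguments; tune them for your workload.
        args:
        - --model-id=google/gemma-2b
        - --max-input-length=32
        - --max-total-tokens=64
        securityContext:
          # Privileged mode may be required for the container to access the TPU devices.
          privileged: true
        env:
        # Hugging Face token from the hf-secret Secret, needed to download the gated model.
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        ports:
        # TGI listens on port 80 by default.
        - containerPort: 80
        resources:
          limits:
            # Request all 8 TPU chips on the node.
            google.com/tpu: 8
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  # Exposes the server inside the cluster on TCP port 8080, reachable through kubectl port-forward below.
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80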
Apply the manifest:
kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml
Llama3 8B
Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml. This manifest describes an Optimum TPU deployment with an internal load balancer on TCP port 8080.
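This manifest follows the same shape as the Gemma sketch above. Assuming that sketch, the main change is the model identifier passed to the container, for example:
args:
- --model-id=meta-llama/Meta-Llama-3-8B   # assumed identifier; use the gated Llama3 8B repository you were granted access to on Hugging Face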
Apply the manifest:
kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml
View the logs from the running Deployment:
kubectl logs -f -l app=tgi-tpu
The output should be similar to the following:
2024-07-09T22:39:34.365472Z WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0
Make sure the model is fully downloaded before proceeding to the next section.
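One way to check is to confirm that the Pod reports Ready, using the same app=tgi-tpu label as the logs command above:
kubectl get pods -l app=tgi-tpu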
Serve the model
Set up port forwarding to the model:
kubectl port-forward svc/service 8080:8080
Interact with the model server using curl
Verify your deployed models:
In a new terminal session, use curl to chat with the model:
curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' -H 'Content-Type: application/json'
The output should be similar to the following:
{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}