This tutorial shows you how to serve open source large language models (LLMs) using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with the Optimum TPU serving framework from Hugging Face. In this tutorial, you download open source models from Hugging Face and deploy them on a GKE Standard cluster using a container that runs Optimum TPU.
This guide provides a starting point if you need the granular control, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when deploying and serving your AI/ML workloads.
This tutorial is intended for generative AI customers in the Hugging Face ecosystem, new or existing users of GKE, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve LLMs.
You have multiple options for LLM inference on Google Cloud, spanning offerings such as Vertex AI, GKE, and Google Compute Engine, and you can incorporate serving libraries such as JetStream, vLLM, and other partner offerings. For example, you can use JetStream to get the latest optimizations from the project. If you prefer Hugging Face options, you can use Optimum TPU.
Optimum TPU supports the following features:
- Continuous batching
- Token streaming
- Greedy search and multinomial sampling using transformers.
Get access to the model
You can use the Gemma 2B or Llama3 8B models. This tutorial focuses on these two models, but Optimum TPU supports more models.
Gemma 2B
To get access to the Gemma models for deployment to GKE, you must first sign the license consent agreement, and then generate a Hugging Face access token.
Sign the license consent agreement
You must sign the consent agreement to use Gemma. Follow these instructions:
- Access the model consent page.
- Verify consent using your Hugging Face account.
- Accept the model terms.
Generate an access token
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Click New Token.
- Specify a Name of your choice and a Role of at least Read.
- Click Generate a token.
- Copy the generated token to your clipboard.
Llama3 8B
To use Llama3 8B, you must sign the consent agreement in the Hugging Face repository.
Generate an access token
Generate a new Hugging Face token if you don't already have one:
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
Create a GKE cluster
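The commands in this tutorial use placeholder values such as PROJECT_ID, CLUSTER_NAME, ZONE, and REGION_NAME, and later steps reference ${CLUSTER_NAME} and ${ZONE} as shell variables. One option, sketched here with placeholder values you must replace, is to export them once before you start:
export PROJECT_ID=PROJECT_ID        # your Google Cloud project ID
export CLUSTER_NAME=CLUSTER_NAME    # a name for the new cluster
export ZONE=ZONE                    # a zone with TPU v5e (ct5lp) capacity
export REGION_NAME=REGION_NAME      # the region for your Artifact Registry repository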
Create a GKE Standard cluster with 1 CPU node:
gcloud container clusters create CLUSTER_NAME \
--project=PROJECT_ID \
--num-nodes=1 \
--location=ZONE
Create a TPU node pool
Create a v5e TPU node pool with 1 node and 8 chips:
gcloud container node-pools create tpunodepool \
--location=ZONE \
--num-nodes=1 \
--machine-type=ct5lp-hightpu-8t \
--cluster=CLUSTER_NAME
If TPU resources are available, GKE provisions the node pool. If
TPU resources are temporarily unavailable, the output shows a GCE_STOCKOUT
error message. To troubleshoot TPU stockout errors, refer to
Insufficient TPU resources to satisfy the TPU request.
Configure kubectl to communicate with your cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${ZONE}
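Optionally, verify that the TPU node pool has joined the cluster. The label key in this check is the standard node label that GKE applies to TPU nodes:
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator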
Build the container
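The tpu-tgi make target is defined in Hugging Face's Optimum TPU repository. If you don't already have a local copy, clone it first:
git clone https://github.com/huggingface/optimum-tpu.git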
Run the make command to build the image:
cd optimum-tpu && make tpu-tgi
Push the image to Artifact Registry:
gcloud artifacts repositories create optimum-tpu --repository-format=docker --location=REGION_NAME && \
gcloud auth configure-docker REGION_NAME-docker.pkg.dev && \
docker image tag huggingface/optimum-tpu REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest && \
docker push REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
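To confirm that the push succeeded, you can list the images in the repository (using the same placeholder values):
gcloud artifacts docker images list REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu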
Create a Kubernetes Secret for Hugging Face credentials
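The command that follows reads the token from the HF_TOKEN environment variable. If you haven't exported the token you generated earlier, do so first (the value shown is a placeholder):
export HF_TOKEN=YOUR_HUGGING_FACE_TOKEN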
Create a Kubernetes Secret that contains the Hugging Face token:
kubectl create secret generic hf-secret \
--from-literal=hf_api_token=${HF_TOKEN} \
--dry-run=client -o yaml | kubectl apply -f -
Deploy Optimum TPU
To deploy Optimum TPU, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.
Gemma 2B
Save the following Deployment manifest as optimum-tpu-gemma-2b-2x4.yaml. This manifest describes an Optimum TPU deployment with an internal load balancer on TCP port 8080.
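A minimal sketch of such a manifest is shown below. The image path, the hf-secret Secret, the app: tgi-tpu label, and the service Service name reuse values from other commands in this tutorial; the node selector values, TGI container arguments, and port numbers are assumptions that you should adapt to your environment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-tpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tgi-tpu
  template:
    metadata:
      labels:
        app: tgi-tpu
    spec:
      # Schedule onto the TPU v5e node pool created earlier (2x4 topology, 8 chips).
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 2x4
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      containers:
      - name: tgi-tpu
        # Image built and pushed to Artifact Registry in the previous section.
        image: REGION_NAME-docker.pkg.dev/PROJECT_ID/optimum-tpu/tgi-tpu:latest
        # Illustrative TGI arguments; tune them for your workload.
        args:
        - --model-id=google/gemma-2b
        - --max-input-length=32
        - --max-total-tokens=64
        securityContext:
          # Privileged mode may be required for the container to access the TPU devices.
          privileged: true
        env:
        # Hugging Face token from the hf-secret Secret, needed to download the gated model.
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        ports:
        # TGI listens on port 80 by default.
        - containerPort: 80
        resources:
          limits:
            # Request all 8 TPU chips on the node.
            google.com/tpu: 8
---
apiVersion: v1
kind: Service
metadata:
  name: service
spec:
  selector:
    app: tgi-tpu
  # Exposes the server inside the cluster on TCP port 8080, reachable through kubectl port-forward below.
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80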
Apply the manifest:
kubectl apply -f optimum-tpu-gemma-2b-2x4.yaml
Llama3 8B
Save the following manifest as optimum-tpu-llama3-8b-2x4.yaml. This manifest describes an Optimum TPU deployment with an internal load balancer on TCP port 8080.
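This manifest follows the same shape as the Gemma sketch above. Assuming that sketch, the main change is the model identifier passed to the container, for example:
args:
- --model-id=meta-llama/Meta-Llama-3-8B   # assumed identifier; use the gated Llama3 8B repository you were granted access to on Hugging Face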
Apply the manifest:
kubectl apply -f optimum-tpu-llama3-8b-2x4.yaml
View the logs from the running Deployment:
kubectl logs -f -l app=tgi-tpu
The output should be similar to the following:
2024-07-09T22:39:34.365472Z WARN text_generation_router: router/src/main.rs:295: no pipeline tag found for model google/gemma-2b
2024-07-09T22:40:47.851405Z INFO text_generation_router: router/src/main.rs:314: Warming up model
2024-07-09T22:40:54.559269Z INFO text_generation_router: router/src/main.rs:351: Setting max batch total tokens to 64
2024-07-09T22:40:54.559291Z INFO text_generation_router: router/src/main.rs:352: Connected
2024-07-09T22:40:54.559295Z WARN text_generation_router: router/src/main.rs:366: Invalid hostname, defaulting to 0.0.0.0
Make sure the model is fully downloaded before proceeding to the next section.
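One way to check is to confirm that the Pod reports Ready, using the same app=tgi-tpu label as the logs command above:
kubectl get pods -l app=tgi-tpu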
Serve the model
Set up port forwarding to the model:
kubectl port-forward svc/service 8080:8080
Interact with the model server using curl
Verify your deployed models:
In a new terminal session, use curl to chat with the model:
curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":40}}' -H 'Content-Type: application/json'
The output should be similar to the following:
{"generated_text":"\n\nDeep learning is a subset of machine learning that uses artificial neural networks to learn from data.\n\nArtificial neural networks are inspired by the way the human brain works. They are made up of multiple layers"}