Serve an LLM using multi-host TPUs on GKE with Saxml

Standard

This tutorial shows you how to deploy and serve a large language model (LLM) using multi-host TPU slice node pool on Google Kubernetes Engine (GKE) with Saxml for efficient scalable architecture.

Background

Saxml is an experimental system that serves Paxml, JAX, and PyTorch frameworks. You can use TPUs to accelerate data processing with these frameworks. To demo the deployment of TPUs in GKE, this tutorial serves the 175B LmCloudSpmd175B32Test test model. GKE deploys this test model on two v5e TPU slice node pools with 4x8 topology respectively.

To properly deploy the test model, the TPU topology has been defined based on the size of the model. Given that the N billion 16 bit model approximately requires around 2 times (2xN) GB of memory, the 175B LmCloudSpmd175B32Test model requires about 350 GB of memory. The TPU v5e single TPU chip has 16 GB. To support 350 GB, GKE needs 21 v5e TPU chips (350/16= 21). Based on the mapping of TPU configuration, the proper TPU configuration for this tutorial is:

Machine type: ct5lp-hightpu-4t
Topology: 4x8 (32 number of TPU chips)

Selecting the right TPU topology for serving a model is important when deploying TPUs in GKE. To learn more, see Plan your TPU configuration.

Objectives

This tutorial is intended for MLOps or DevOps engineers or platform administrators that want to use GKE orchestration capabilities for serving data models.

This tutorial covers the following steps:

Prepare your environment with a GKE Standard cluster. The cluster has two v5e TPU slice node pools with 4x8 topology.
Deploy Saxml. Saxml needs an administrator server, a group of Pods that work as the model server, a prebuilt HTTP server, and a load balancer.
Use the Saxml to serve the LLM.

The following diagram shows the architecture that the following tutorial implements:

Architecture of a multi-host TPU on GKE. — **Figure:** Example architecture of a multi-host TPU on GKE.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin, roles/iam.policyAdmin
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Ensure your project has sufficient quota for Cloud TPU in GKE.

Prepare the environment

In the Google Cloud console, start a Cloud Shell instance:
Open Cloud Shell
Set the default environment variables:
```
  gcloud config set project PROJECT_ID
  export PROJECT_ID=$(gcloud config get project)
  export ZONE=COMPUTE_ZONE
  export BUCKET_NAME=PROJECT_ID-gke-bucket
```
Replace the following values:
- PROJECT_ID: Your Google Cloud project ID.
- COMPUTE_ZONE: The zone where the ct5lp-hightpu-4t is available.
In this command, BUCKET_NAME specifies the name of the Google Cloud storage bucket to store Saxml administrator server configurations.

Create a GKE Standard cluster

Use Cloud Shell to do the following:

Create a Standard cluster that uses Workload Identity Federation for GKE:
```
gcloud container clusters create saxml \
    --zone=${ZONE} \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --cluster-version=VERSION \
    --num-nodes=4
```
Replace the VERSION with the GKE version number. GKE supports TPU v5e in version 1.27.2-gke.2100 and later. For more information, see TPU availability in GKE.

The cluster creation might take several minutes.

Create the first node pool named tpu1:

gcloud container node-pools create tpu1 \
    --zone=${ZONE} \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=4x8 \
    --num-nodes=8 \
    --cluster=saxml

The value for the --num-nodes flag is calculated by dividing the TPU topology by the number of TPU chips per TPU slice. In this case: (4 * 8) / 4.

Create the second node pool named tpu2:

gcloud container node-pools create tpu2 \
    --zone=${ZONE} \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=4x8 \
    --num-nodes=8 \
    --cluster=saxml

The value for the --num-nodes flag is calculated by dividing the TPU topology by the number of TPU chips per TPU slice. In this case: (4 * 8) / 4.

You have created the following resources:

A Standard cluster with four CPU nodes.
Two v5e TPU slice node pools with 4x8 topology. Each node pools represent eight TPU slice nodes with 4 TPU chips each.

The 175B model has to be served on a multi-host v5e TPU slice with 4x8 topology slice (32 v5e TPU chips) at minimum.

Create a Cloud Storage bucket

Create a Cloud Storage bucket to store Saxml administrator server configurations. A running administrator server periodically saves its state and the details of the published models.

In Cloud Shell, run the following:

gcloud storage buckets create gs://${BUCKET_NAME}

Configure your workloads access using Workload Identity Federation for GKE

Assign a Kubernetes ServiceAccount to the application and configure that Kubernetes ServiceAccount to act as an IAM service account.

Configure kubectl to communicate with your cluster:

gcloud container clusters get-credentials saxml --zone=${ZONE}

Create a Kubernetes ServiceAccount for your application to use:

kubectl create serviceaccount sax-sa --namespace default

Create an IAM service account for your application:

gcloud iam service-accounts create sax-iam-sa

Add an IAM policy binding for your IAM service account to read and write to Cloud Storage:

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member "serviceAccount:sax-iam-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role roles/storage.admin

Allow the Kubernetes ServiceAccount to impersonate the IAM service account by adding an IAM policy binding between the two service accounts. This binding allows the Kubernetes ServiceAccount to act as the IAM service account, so that the Kubernetes ServiceAccount can read and write to Cloud Storage.
```
gcloud iam service-accounts add-iam-policy-binding sax-iam-sa@${PROJECT_ID}.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT_ID}.svc.id.goog[default/sax-sa]"
```
Annotate the Kubernetes service account with the email address of the IAM service account. This lets your sample app know which service account to use to access Google Cloud services. So when the app uses any standard Google API Client Libraries to access Google Cloud services, it uses that IAM service account.
```
kubectl annotate serviceaccount sax-sa \
  iam.gke.io/gcp-service-account=sax-iam-sa@${PROJECT_ID}.iam.gserviceaccount.com
```

Deploy Saxml

In this section, you deploy the Saxml administrator server and the Saxml model server.

Deploy the Saxml administrator server

Create the following sax-admin-server.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sax-admin-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sax-admin-server
  template:
    metadata:
      labels:
        app: sax-admin-server
    spec:
      hostNetwork: false
      serviceAccountName: sax-sa
      containers:
      - name: sax-admin-server
        image: us-docker.pkg.dev/cloud-tpu-images/inference/sax-admin-server:v1.1.0
        securityContext:
          privileged: true
        ports:
        - containerPort: 10000
        env:
        - name: GSBUCKET
          value: BUCKET_NAME

Replace BUCKET_NAME with your Cloud Storage created earlier:

perl -pi -e 's|BUCKET_NAME|BUCKET_NAME|g' sax-admin-server.yaml

Apply the manifest:
```
kubectl apply -f sax-admin-server.yaml
```

Verify that the administrator server Pod is up and running:

kubectl get deployment

The output is similar to the following:

NAME               READY   UP-TO-DATE   AVAILABLE   AGE
sax-admin-server   1/1     1            1           52s

Deploy Saxml model server

Workloads running in multi-host TPU slices require a stable network identifier for each Pod to discover peers in the same TPU slice. To define these identifiers, use IndexedJob, StatefulSet with a headless Service or JobSet which automatically creates a headless Service for all the Jobs that belong to the JobSet. A Jobset is a workload API that lets you manage a group of Kubernetes Jobs as a unit. The most common use case for a JobSet is distributed training, but you can also use it to run batch workloads.

The following section shows how to manage multiple groups of model server Pods with JobSet.

Install JobSet v0.2.3 or later.

kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/JOBSET_VERSION/manifests.yaml

Replace the JOBSET_VERSION with the JobSet version. For example, v0.2.3.

Validate JobSet controller is running in the jobset-system namespace:

kubectl get pod -n jobset-system

The output is similar to the following:

NAME                                        READY   STATUS    RESTARTS   AGE
jobset-controller-manager-69449d86bc-hp5r6   2/2     Running   0          2m15s

Deploy two model servers in two TPU slice node pools. Save the following sax-model-server-set manifest:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: sax-model-server-set
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 4
  replicatedJobs:
    - name: sax-model-server
      replicas: 2
      template:
        spec:
          parallelism: 8
          completions: 8
          backoffLimit: 0
          template:
            spec:
              serviceAccountName: sax-sa
              hostNetwork: true
              dnsPolicy: ClusterFirstWithHostNet
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                cloud.google.com/gke-tpu-topology: 4x8
              containers:
              - name: sax-model-server
                image: us-docker.pkg.dev/cloud-tpu-images/inference/sax-model-server:v1.1.0
                args: ["--port=10001","--sax_cell=/sax/test", "--platform_chip=tpuv5e"]
                ports:
                - containerPort: 10001
                - containerPort: 8471
                securityContext:
                  privileged: true
                env:
                - name: SAX_ROOT
                  value: "gs://BUCKET_NAME/sax-root"
                - name: MEGASCALE_NUM_SLICES
                  value: ""
                resources:
                  requests:
                    google.com/tpu: 4
                  limits:
                    google.com/tpu: 4

Replace BUCKET_NAME with your Cloud Storage created earlier:
```
perl -pi -e 's|BUCKET_NAME|BUCKET_NAME|g' sax-model-server-set.yaml
```
In this manifest:
- replicas: 2 is the number of Job replicas. Each job represents a model server. Therefore, a group of 8 Pods.
- parallelism: 8 and completions: 8 are equal to the number of nodes in each node pool.
- backoffLimit: 0 must be zero to mark the Job as failed if any Pod fails.
- ports.containerPort: 8471 is the default port for the VMs communication
- name: MEGASCALE_NUM_SLICES unsets the environment variable because GKE isn't running Multislice training.

Apply the manifest:

kubectl apply -f sax-model-server-set.yaml

Verify the status of the Saxml Admin Server and Model Server Pods:

kubectl get pods

The output is similar to the following:

NAME                                              READY   STATUS    RESTARTS   AGE
sax-admin-server-557c85f488-lnd5d                 1/1     Running   0          35h
sax-model-server-set-sax-model-server-0-0-nj4sm   1/1     Running   0          24m
sax-model-server-set-sax-model-server-0-1-sl8w4   1/1     Running   0          24m
sax-model-server-set-sax-model-server-0-2-hb4rk   1/1     Running   0          24m
sax-model-server-set-sax-model-server-0-3-qv67g   1/1     Running   0          24m
sax-model-server-set-sax-model-server-0-4-pzqz6   1/1     Running   0          24m
sax-model-server-set-sax-model-server-0-5-nm7mz   1/1     Running   0          24m
sax-model-server-set-sax-model-server-0-6-7br2x   1/1     Running   0          24m
sax-model-server-set-sax-model-server-0-7-4pw6z   1/1     Running   0          24m
sax-model-server-set-sax-model-server-1-0-8mlf5   1/1     Running   0          24m
sax-model-server-set-sax-model-server-1-1-h6z6w   1/1     Running   0          24m
sax-model-server-set-sax-model-server-1-2-jggtv   1/1     Running   0          24m
sax-model-server-set-sax-model-server-1-3-9v8kj   1/1     Running   0          24m
sax-model-server-set-sax-model-server-1-4-6vlb2   1/1     Running   0          24m
sax-model-server-set-sax-model-server-1-5-h689p   1/1     Running   0          24m
sax-model-server-set-sax-model-server-1-6-bgv5k   1/1     Running   0          24m
sax-model-server-set-sax-model-server-1-7-cd6gv   1/1     Running   0          24m

In this example, there are 16 model server containers: sax-model-server-set-sax-model-server-0-0-nj4sm and sax-model-server-set-sax-model-server-1-0-8mlf5 are the two primary model servers in each group.

Your Saxml cluster has two model servers deployed on two v5e TPU slice node pools with 4x8 topology respectively.

Deploy Saxml HTTP Server and load balancer

Use the following prebuilt image HTTP server image. Save the following sax-http.yaml manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sax-http
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sax-http
  template:
    metadata:
      labels:
        app: sax-http
    spec:
      hostNetwork: false
      serviceAccountName: sax-sa
      containers:
      - name: sax-http
        image: us-docker.pkg.dev/cloud-tpu-images/inference/sax-http:v1.0.0
        ports:
        - containerPort: 8888
        env:
        - name: SAX_ROOT
          value: "gs://BUCKET_NAME/sax-root"
---
apiVersion: v1
kind: Service
metadata:
  name: sax-http-lb
spec:
  selector:
    app: sax-http
  ports:
  - protocol: TCP
    port: 8888
    targetPort: 8888
  type: LoadBalancer

Replace BUCKET_NAME with your Cloud Storage created earlier:

perl -pi -e 's|BUCKET_NAME|BUCKET_NAME|g' sax-http.yaml

Apply the sax-http.yaml manifest:
```
kubectl apply -f sax-http.yaml
```
Note: You can use your own HTTP server built for Saxml. To learn more, see Inferencing using Saxml and an HTTP Server.

Wait for the HTTP Server container to finish creating:

kubectl get pods

The output is similar to the following:

NAME                                              READY   STATUS    RESTARTS   AGE
sax-admin-server-557c85f488-lnd5d                 1/1     Running   0          35h
sax-http-65d478d987-6q7zd                         1/1     Running   0          24m
sax-model-server-set-sax-model-server-0-0-nj4sm   1/1     Running   0          24m
...

Wait for the Service to have an external IP address assigned:

kubectl get svc

The output is similar to the following:

NAME           TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
sax-http-lb    LoadBalancer   10.48.11.80   10.182.0.87   8888:32674/TCP   7m36s

Use Saxml

Load, deploy, and serve the model on the Saxml in the v5e TPU multihost slice:

Load the model

Retrieve the load balancer IP address for Saxml.

LB_IP=$(kubectl get svc sax-http-lb -o jsonpath='{.status.loadBalancer.ingress[*].ip}')
PORT="8888"

Load the LmCloudSpmd175B test model in two v5e TPU slice node pools:

curl --request POST \
--header "Content-type: application/json" \
-s ${LB_IP}:${PORT}/publish --data \
'{
    "model": "/sax/test/spmd",
    "model_path": "saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd175B32Test",
    "checkpoint": "None",
    "replicas": 2
}'

The test model does not have a fine-tuned checkpoint, the weights are randomly generated. The model loading could take up to 10 minutes.

The output is similar to the following:

{
    "model": "/sax/test/spmd",
    "path": "saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd175B32Test",
    "checkpoint": "None",
    "replicas": 2
}

Check the model readiness:

kubectl logs sax-model-server-set-sax-model-server-0-0-nj4sm

The output is similar to the following:

...
loading completed.
Successfully loaded model for key: /sax/test/spmd

The model is fully loaded.

Get information about the model:

curl --request GET \
--header "Content-type: application/json" \
-s ${LB_IP}:${PORT}/listcell --data \
'{
    "model": "/sax/test/spmd"
}'

The output is similar to the following:

{
"model": "/sax/test/spmd",
"model_path": "saxml.server.pax.lm.params.lm_cloud.LmCloudSpmd175B32Test",
"checkpoint": "None",
"max_replicas": 2,
"active_replicas": 2
}

Serve the model

Serve a prompt request:

curl --request POST \
--header "Content-type: application/json" \
-s ${LB_IP}:${PORT}/generate --data \
'{
  "model": "/sax/test/spmd",
  "query": "How many days are in a week?"
}'

The output shows an example of the model response. This response might not be meaningful because the test model has random weights.

Unpublish the model

Run the following command to unpublish the model:

curl --request POST \
--header "Content-type: application/json" \
-s ${LB_IP}:${PORT}/unpublish --data \
'{
    "model": "/sax/test/spmd"
}'

The output is similar to the following:

{
  "model": "/sax/test/spmd"
}

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the deployed resources

Delete the cluster you created for this tutorial:

gcloud container clusters delete saxml --zone ${ZONE}

Delete the service account:

gcloud iam service-accounts delete sax-iam-sa@${PROJECT_ID}.iam.gserviceaccount.com

Delete the Cloud Storage bucket:

gcloud storage rm -r gs://${BUCKET_NAME}

What's next

Learn about current TPU versions with the Cloud TPU system architecture.
Learn more about TPUs in GKE.

Serve an LLM using multi-host TPUs on GKE with Saxml Stay organized with collections Save and categorize content based on your preferences.

Background

Objectives

Before you begin

Check for the roles

Grant the roles

Prepare the environment

Create a GKE Standard cluster

Create a Cloud Storage bucket

Configure your workloads access using Workload Identity Federation for GKE

Deploy Saxml

Deploy the Saxml administrator server

Deploy Saxml model server

Deploy Saxml HTTP Server and load balancer

Use Saxml

Load the model

Serve the model

Unpublish the model

Clean up

Delete the deployed resources

What's next

Serve an LLM using multi-host TPUs on GKE with Saxml