Deploy a Ray Serve application with a Stable Diffusion model on Google Kubernetes Engine (GKE) with TPUs


This guide demonstrates how to deploy and serve a Stable Diffusion model on Google Kubernetes Engine (GKE) using TPUs, Ray Serve, and the Ray Operator add-on.

This guide is intended for Generative AI customers, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving models using Ray.

About Ray and Ray Serve

Ray is an open-source scalable compute framework for AI/ML applications. Ray Serve is a model serving library for Ray used for scaling and serving models in a distributed environment. For more information, see Ray Serve in the Ray documentation.

About TPUs

Tensor Processing Units (TPUs) are specialized hardware accelerators designed to significantly speed up the training and inference of large-scale machine learning models. Using Ray with TPUs lets you seamlessly scale high-performance ML applications. For more information about TPUs, see Introduction to Cloud TPU in the Cloud TPU documentation.

About the KubeRay TPU initialization webhook

As part of the Ray Operator add-on, GKE provides validating and mutating webhooks that handle TPU Pod scheduling and set certain TPU environment variables required by frameworks like JAX for container initialization. The KubeRay TPU webhook mutates Pods that have the app.kubernetes.io/name: kuberay label and request TPUs, adding the following properties:

  • TPU_WORKER_ID: A unique integer for each worker Pod in the TPU slice.
  • TPU_WORKER_HOSTNAMES: A list of DNS hostnames for all TPU workers that need to communicate with each other within the slice. This variable is only injected for TPU Pods in a multi-host group.
  • replicaIndex: A Pod label that contains a unique identifier for the worker-group replica the Pod belongs to. This is useful for multi-host worker groups, where multiple worker Pods might belong to the same replica, and is used by Ray to enable multi-host autoscaling.
  • TPU_NAME: A string representing the GKE TPU PodSlice this Pod belongs to, set to the same value as the replicaIndex label.
  • podAffinity: Ensures GKE schedules TPU Pods with matching replicaIndex labels on the same node pool. This lets GKE scale multi-host TPUs atomically by node pools, rather than single nodes.
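
To confirm what the webhook injected, you can inspect a TPU worker Pod directly once your Ray cluster is running (later in this guide). The following commands are a sketch; POD_NAME is a placeholder for the name of one of your TPU worker Pods.

# List the Ray Pods created by the KubeRay operator, with their replicaIndex labels.
kubectl get pods -l app.kubernetes.io/name=kuberay -L replicaIndex

# Show the TPU environment variables (such as TPU_WORKER_ID and TPU_NAME)
# injected into the first container of a TPU worker Pod.
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[0].env}'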

Objectives

  • Create a GKE cluster with a TPU node pool.
  • Deploy a Ray cluster with TPUs.
  • Deploy a RayService custom resource.
  • Interact with the Stable Diffusion model server.

Costs

In this document, you use the following billable components of Google Cloud:

  • GKE
  • Cloud TPU

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

Cloud Shell is preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI. If you don't use Cloud Shell, install the gcloud CLI.
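
To confirm that the tools are available in your shell, you can check their versions:

# Verify the gcloud CLI and kubectl installations.
gcloud --version
kubectl version --client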

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. Install the Google Cloud CLI.
  3. To initialize the gcloud CLI, run the following command:

    gcloud init
  4. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  5. Make sure that billing is enabled for your Google Cloud project.

  6. Enable the GKE API:

    gcloud services enable container.googleapis.com
  7. Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.clusterAdmin, roles/container.admin

    gcloud projects add-iam-policy-binding PROJECT_ID --member="USER_IDENTIFIER" --role=ROLE
    • Replace PROJECT_ID with your project ID.
    • Replace USER_IDENTIFIER with the identifier for your user account. For example, user:myemail@example.com.

    • Replace ROLE with each individual role.

Ensure sufficient quota

Ensure that your Google Cloud project has sufficient TPU quota in your Compute Engine region or zone. For more information, see Ensure sufficient TPU and GKE quotas in the Cloud TPU documentation. You might also need to increase your quotas for:

  • Persistent Disk SSD (GB)
  • In-use IP addresses
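
As a quick check of the Compute Engine quotas listed above, you can describe the region that contains your TPU zone. The following command is a sketch and assumes the us-central2 region, which contains the us-central2-b zone used later in this guide:

# List the current Compute Engine quotas for the region, including
# SSD_TOTAL_GB and IN_USE_ADDRESSES.
gcloud compute regions describe us-central2 --format="yaml(quotas)"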

Prepare your environment

To prepare your environment, follow these steps:

  1. Launch a Cloud Shell session from the Google Cloud console by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.

  2. Set environment variables:

    export PROJECT_ID=PROJECT_ID
    export CLUSTER_NAME=ray-cluster
    export COMPUTE_REGION=us-central2-b
    export CLUSTER_VERSION=CLUSTER_VERSION
    

    Replace the following:

    • PROJECT_ID: your Google Cloud project ID.
    • CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
  3. Clone the GitHub repository:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
    
  4. Change to the working directory:

    cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/stable-diffusion
    

Create a cluster with a TPU node pool

Create an Autopilot or Standard GKE cluster with a TPU node pool:

Autopilot

Create an Autopilot mode cluster with the Ray Operator enabled:

gcloud container clusters create-auto ${CLUSTER_NAME} \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

To use TPUs with Autopilot mode, you must select a Compute Engine location with capacity for TPU accelerators. TPU v4 accelerators are available in the us-central2-b zone. For more information about TPU availability by region and zone, see About TPUs in GKE.

Standard

  1. Create a Standard mode cluster with the Ray Operator enabled:

    gcloud container clusters create ${CLUSTER_NAME} \
        --addons=RayOperator \
        --cluster-version=${CLUSTER_VERSION} \
        --location=${COMPUTE_REGION}
    
  2. Create a single-host TPU node pool. Set the NODEPOOL_NAME environment variable to a name for the new node pool (for example, export NODEPOOL_NAME=tpu-pool), then run:

    gcloud container node-pools create ${NODEPOOL_NAME} \
        --location=${COMPUTE_REGION} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct4p-hightpu-4t \
        --num-nodes=1 \
        --tpu-topology=2x2x1
    

To use TPUs with Standard mode, you must select:

  • A Compute Engine location with capacity for TPU accelerators
  • A compatible machine type for the TPU
  • The physical topology of the TPU PodSlice
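
After you create the cluster (and, in Standard mode, the TPU node pool), you can optionally run a quick sanity check to confirm that the Ray Operator add-on installed the Ray CRDs and that TPU nodes carry the expected labels. These commands are a sketch; in Autopilot mode, TPU nodes are provisioned only after a workload requests them, so the node list might be empty at this point.

# Confirm that the Ray Operator add-on installed the Ray CRDs.
kubectl get crds | grep ray.io

# List TPU nodes with their topology labels (Standard mode).
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v4-podslice \
    -L cloud.google.com/gke-tpu-topology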

Configure a RayCluster resource with TPUs

Configure your RayCluster manifest to prepare your TPU workload:

Configure TPU nodeSelector

GKE uses Kubernetes nodeSelectors to ensure that TPU workloads are scheduled on the appropriate TPU topology and accelerator. For more information about selecting TPU nodeSelectors, see Deploy TPU workloads in GKE Standard.

Update the ray-cluster.yaml manifest to schedule your Pod on a v4 TPU podslice with a 2x2x1 topology:

nodeSelector:
  cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
  cloud.google.com/gke-tpu-topology: 2x2x1

Configure a TPU container resource

To use a TPU accelerator, you must specify the number of TPU chips that GKE should allocate to each Pod by configuring the google.com/tpu resource limits and requests in the TPU container of the workerGroupSpecs in your RayCluster manifest.

Update the ray-cluster.yaml manifest with resource limits and requests:

resources:
  limits:
    cpu: "1"
    ephemeral-storage: 10Gi
    google.com/tpu: "4"
    memory: "2G"
  requests:
    cpu: "1"
    ephemeral-storage: 10Gi
    google.com/tpu: "4"
    memory: "2G"

Configure worker group numOfHosts

KubeRay v1.1.0 adds a numOfHosts field to the RayCluster custom resource, which specifies the number of TPU hosts to create per worker group replica. For multi-host worker groups, replicas are treated as PodSlices rather than individual workers, with numOfHosts worker nodes being created per replica.

Update the ray-cluster.yaml manifest with the following:

workerGroupSpecs:
  # Several lines omitted
  numOfHosts: 1 # the number of "hosts" or workers per replica
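
After the Ray cluster is running (you create it through a RayService resource in the next section), one way to see the effect of numOfHosts is to list the TPU worker Pods for the worker group. This sketch assumes the group name tpu-group used in the manifest in the next section and relies on the ray.io/group label that KubeRay sets on worker Pods:

# List the TPU worker Pods for the tpu-group worker group; with numOfHosts: 1,
# each worker group replica has exactly one worker Pod.
kubectl get pods -l ray.io/group=tpu-group -L replicaIndex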

Create a RayService custom resource

Create a RayService custom resource:

  1. Review the following manifest:

    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: stable-diffusion-tpu
    spec:
      serveConfigV2: |
        applications:
          - name: stable_diffusion
            import_path: ai-ml.gke-ray.rayserve.stable-diffusion.stable_diffusion_tpu:deployment
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/refs/heads/main.zip"
              pip:
                - pydantic<2
                - google-api-python-client
                - pillow
                - diffusers==0.7.2
                - transformers==4.24.0
                - flax
                - ml_dtypes==0.2.0
                - jax[tpu]==0.4.11
                - -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                - fastapi
      rayClusterConfig:
        rayVersion: '2.9.0'
        headGroupSpec:
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: ray-head
                image: rayproject/ray:2.9.0-py310
                ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                resources:
                  limits:
                    cpu: "2"
                    memory: "8G"
                  requests:
                    cpu: "2"
                    memory: "8G"
        workerGroupSpecs:
        - replicas: 1
          minReplicas: 1
          maxReplicas: 10
          numOfHosts: 1
          groupName: tpu-group
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0-py310
                resources:
                  limits:
                    cpu: "100"
                    ephemeral-storage: 20Gi
                    google.com/tpu: "4"
                    memory: 200G
                  requests:
                    cpu: "100"
                    ephemeral-storage: 20Gi
                    google.com/tpu: "4"
                    memory: 200G
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
                cloud.google.com/gke-tpu-topology: 2x2x1

    This manifest describes a RayService custom resource that creates a RayCluster resource with 1 head node and a TPU worker group with a 2x2x1 topology, meaning each worker node will have 4 v4 TPU chips.

    The TPU node belongs to a single v4 TPU podslice with a 2x2x1 topology. To create a multi-host worker group, replace the gke-tpu nodeSelector values, google.com/tpu container limits and requests, and numOfHosts values with your multi-host configuration. For more information about TPU multi-host topologies, see System architecture in the Cloud TPU documentation.

  2. Apply the manifest to your cluster:

    kubectl apply -f ray-service-tpu.yaml
    
  3. Verify the RayService resource is running:

    kubectl get rayservices
    

    The output is similar to the following:

    NAME                   SERVICE STATUS   NUM SERVE ENDPOINTS
    stable-diffusion-tpu   Running          2
    

    In this output, Running in the SERVICE STATUS column indicates the RayService resource is ready.
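
If the RayService doesn't reach Running, you can inspect the underlying resources. For example:

# Show the RayService status and events.
kubectl describe rayservice stable-diffusion-tpu

# List the Ray head and TPU worker Pods created for the RayCluster.
kubectl get pods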

(Optional) View the Ray Dashboard

You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.

  1. Establish a port-forwarding session to the Ray dashboard from the Ray head service:

    kubectl port-forward svc/stable-diffusion-tpu-head-svc 8265:8265
    
  2. In a web browser, go to http://localhost:8265/.

  3. Click the Serve tab.

Send prompts to the model server

  1. Establish a port-forwarding session to the Serve endpoint from the Ray head service:

    kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000
    
  2. Open a new Cloud Shell session and change to the sample directory (kubernetes-engine-samples/ai-ml/gke-ray/rayserve/stable-diffusion).

  3. Submit a text-to-image prompt to the Stable Diffusion model server:

    python stable_diffusion_tpu_req.py --save_pictures
    

    The results of the Stable Diffusion inference are saved to a file named diffusion_results.png.

    Image generated by Stable Diffusion with 8 sections: a green chair, a man standing outside a house, a robot on the street, a family sitting at a table, a dog walking in a park, a flying dragon, a Japanese-style portrait of bears, and a waterfall.
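
If you run the client script in Cloud Shell, you can download the generated image to your local machine. The cloudshell command is available only inside Cloud Shell:

# Download the generated image from Cloud Shell to your local machine.
cloudshell download diffusion_results.png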

Clean up

Delete the project

Delete the Google Cloud project:

gcloud projects delete PROJECT_ID

Delete individual resources

To delete the cluster, run the following command:

gcloud container clusters delete ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

What's next