Deploy a Ray Serve application with a Stable Diffusion model on Google Kubernetes Engine (GKE) with TPUs


This guide demonstrates how to deploy and serve a Stable Diffusion model on Google Kubernetes Engine (GKE) using TPUs, Ray Serve, and the Ray Operator add-on.

This guide is intended for Generative AI customers, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving models using Ray.

About Ray and Ray Serve

Ray is an open-source scalable compute framework for AI/ML applications. Ray Serve is a model serving library for Ray used for scaling and serving models in a distributed environment. For more information, see Ray Serve in the Ray documentation.

About TPUs

Tensor Processing Units (TPUs) are specialized hardware accelerators designed to significantly speed up the training and inference of large-scale machine learning models. Using Ray with TPUs lets you seamlessly scale high-performance ML applications. For more information about TPUs, see Introduction to Cloud TPU in the Cloud TPU documentation.

About the KubeRay TPU initialization webhook

As part of the Ray Operator add-on, GKE provides validating and mutating webhooks that handle TPU Pod scheduling and set certain TPU environment variables required by frameworks like JAX for container initialization. The KubeRay TPU webhook mutates Pods that have the app.kubernetes.io/name: kuberay label and request TPUs, adding the following properties:

  • TPU_WORKER_ID: A unique integer for each worker Pod in the TPU slice.
  • TPU_WORKER_HOSTNAMES: A list of DNS hostnames for all TPU workers that need to communicate with each other within the slice. This variable is only injected for TPU Pods in a multi-host group.
  • replicaIndex: A Pod label that contains a unique identifier for the worker-group replica the Pod belongs to. This is useful for multi-host worker groups, where multiple worker Pods might belong to the same replica, and is used by Ray to enable multi-host autoscaling.
  • TPU_NAME: A string representing the GKE TPU PodSlice this Pod belongs to, set to the same value as the replicaIndex label.
  • podAffinity: Ensures GKE schedules TPU Pods with matching replicaIndex labels on the same node pool. This lets GKE scale multi-host TPUs atomically by node pools, rather than single nodes.
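
To confirm what the webhook injected, you can inspect a TPU worker Pod directly once your Ray cluster is running (later in this guide). The following commands are a sketch; POD_NAME is a placeholder for the name of one of your TPU worker Pods.

# List the Ray Pods created by the KubeRay operator, with their replicaIndex labels.
kubectl get pods -l app.kubernetes.io/name=kuberay -L replicaIndex

# Show the TPU environment variables (such as TPU_WORKER_ID and TPU_NAME)
# injected into the first container of a TPU worker Pod.
kubectl get pod POD_NAME -o jsonpath='{.spec.containers[0].env}'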

Objectives

  • Create a GKE cluster with a TPU node pool.
  • Deploy a Ray cluster with TPUs.
  • Deploy a RayService custom resource.
  • Interact with the Stable Diffusion model server.

Costs

In this document, you use the following billable components of Google Cloud:

  • GKE
  • Cloud TPU

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin

Cloud Shell is preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI. If you don't use Cloud Shell, install the gcloud CLI.
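
To confirm that the tools are available in your shell, you can check their versions:

# Verify the gcloud CLI and kubectl installations.
gcloud --version
kubectl version --client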

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. Install the Google Cloud CLI.
  3. To initialize the gcloud CLI, run the following command:

    gcloud init
  4. Create or select a Google Cloud project.

    • Create a Google Cloud project:

      gcloud projects create PROJECT_ID

      Replace PROJECT_ID with a name for the Google Cloud project you are creating.

    • Select the Google Cloud project that you created:

      gcloud config set project PROJECT_ID

      Replace PROJECT_ID with your Google Cloud project name.

  5. Make sure that billing is enabled for your Google Cloud project.

  6. Enable the GKE API:

    gcloud services enable container.googleapis.com
  7. Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.clusterAdmin, roles/container.admin

    gcloud projects add-iam-policy-binding PROJECT_ID --member="USER_IDENTIFIER" --role=ROLE
    • Replace PROJECT_ID with your project ID.
    • Replace USER_IDENTIFIER with the identifier for your user account. For example, user:myemail@example.com.

    • Replace ROLE with each individual role.

Ensure sufficient quota

Ensure that your Google Cloud project has sufficient TPU quota in your Compute Engine region or zone. For more information, see Ensure sufficient TPU and GKE quotas in the Cloud TPU documentation. You might also need to increase your quotas for:

  • Persistent Disk SSD (GB)
  • In-use IP addresses
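
As a quick check of the Compute Engine quotas listed above, you can describe the region that contains your TPU zone. The following command is a sketch and assumes the us-central2 region, which contains the us-central2-b zone used later in this guide:

# List the current Compute Engine quotas for the region, including
# SSD_TOTAL_GB and IN_USE_ADDRESSES.
gcloud compute regions describe us-central2 --format="yaml(quotas)"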

Prepare your environment

To prepare your environment, follow these steps:

  1. Launch a Cloud Shell session from the Google Cloud console by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.

  2. Set environment variables:

    export PROJECT_ID=PROJECT_ID
    export CLUSTER_NAME=ray-cluster
    export COMPUTE_REGION=us-central2-b
    export CLUSTER_VERSION=CLUSTER_VERSION
    

    Replace the following:

    • PROJECT_ID: your Google Cloud project ID.
    • CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
  3. Clone the GitHub repository:

    git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
    
  4. Change to the working directory:

    cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/stable-diffusion
    

Create a cluster with a TPU node pool

Create an Autopilot or Standard GKE cluster with a TPU node pool:

Autopilot

Create an Autopilot mode cluster with the Ray Operator enabled:

gcloud container clusters create-auto ${CLUSTER_NAME} \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

To use TPUs with Autopilot mode, you must select a Compute Engine location with capacity for TPU accelerators. TPU v4 accelerators are available in the us-central2-b zone. For more information about TPU availability by region and zone, see About TPUs in GKE.

Standard

  1. Create a Standard mode cluster with the Ray Operator enabled:

    gcloud container clusters create ${CLUSTER_NAME} \
        --addons=RayOperator \
        --cluster-version=${CLUSTER_VERSION} \
        --location=${COMPUTE_REGION}
    
  2. Create a single-host TPU node pool. Set the NODEPOOL_NAME environment variable to a name for the new node pool (for example, export NODEPOOL_NAME=tpu-pool), then run:

    gcloud container node-pools create ${NODEPOOL_NAME} \
        --location=${COMPUTE_REGION} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=ct4p-hightpu-4t \
        --num-nodes=1 \
        --tpu-topology=2x2x1
    

To use TPUs with Standard mode, you must select:

  • A Compute Engine location with capacity for TPU accelerators
  • A compatible machine type for the TPU
  • The physical topology of the TPU PodSlice
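
After you create the cluster (and, in Standard mode, the TPU node pool), you can optionally run a quick sanity check to confirm that the Ray Operator add-on installed the Ray CRDs and that TPU nodes carry the expected labels. These commands are a sketch; in Autopilot mode, TPU nodes are provisioned only after a workload requests them, so the node list might be empty at this point.

# Confirm that the Ray Operator add-on installed the Ray CRDs.
kubectl get crds | grep ray.io

# List TPU nodes with their topology labels (Standard mode).
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v4-podslice \
    -L cloud.google.com/gke-tpu-topology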

Configure a RayCluster resource with TPUs

Configure your RayCluster manifest to prepare your TPU workload:

Configure TPU nodeSelector

GKE uses Kubernetes nodeSelectors to ensure that TPU workloads are scheduled on the appropriate TPU topology and accelerator. For more information about selecting TPU nodeSelectors, see Deploy TPU workloads in GKE Standard.

Update the ray-cluster.yaml manifest to schedule your Pod on a v4 TPU podslice with a 2x2x1 topology:

nodeSelector:
  cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
  cloud.google.com/gke-tpu-topology: 2x2x1

Configure a TPU container resource

To use a TPU accelerator, you must specify the number of TPU chips that GKE should allocate to each Pod by configuring the google.com/tpu resource limits and requests in the TPU container of the workerGroupSpecs in your RayCluster manifest.

Update the ray-cluster.yaml manifest with resource limits and requests:

resources:
  limits:
    cpu: "1"
    ephemeral-storage: 10Gi
    google.com/tpu: "4"
    memory: "2G"
  requests:
    cpu: "1"
    ephemeral-storage: 10Gi
    google.com/tpu: "4"
    memory: "2G"

Configure worker group numOfHosts

KubeRay v1.1.0 adds a numOfHosts field to the RayCluster custom resource, which specifies the number of TPU hosts to create per worker group replica. For multi-host worker groups, replicas are treated as PodSlices rather than individual workers, with numOfHosts worker nodes being created per replica.

Update the ray-cluster.yaml manifest with the following:

workerGroupSpecs:
  # Several lines omitted
  numOfHosts: 1 # the number of "hosts" or workers per replica
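
After the Ray cluster is running (you create it through a RayService resource in the next section), one way to see the effect of numOfHosts is to list the TPU worker Pods for the worker group. This sketch assumes the group name tpu-group used in the manifest in the next section and relies on the ray.io/group label that KubeRay sets on worker Pods:

# List the TPU worker Pods for the tpu-group worker group; with numOfHosts: 1,
# each worker group replica has exactly one worker Pod.
kubectl get pods -l ray.io/group=tpu-group -L replicaIndex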

Create a RayService custom resource

Create a RayService custom resource:

  1. Review the following manifest:

    apiVersion: ray.io/v1
    kind: RayService
    metadata:
      name: stable-diffusion-tpu
    spec:
      serveConfigV2: |
        applications:
          - name: stable_diffusion
            import_path: ai-ml.gke-ray.rayserve.stable-diffusion.stable_diffusion_tpu:deployment
            runtime_env:
              working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/refs/heads/main.zip"
              pip:
                - pydantic<2
                - google-api-python-client
                - pillow
                - diffusers==0.7.2
                - transformers==4.24.0
                - flax
                - ml_dtypes==0.2.0
                - jax[tpu]==0.4.11
                - -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                - fastapi
      rayClusterConfig:
        rayVersion: '2.9.0'
        headGroupSpec:
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: ray-head
                image: rayproject/ray:2.9.0-py310
                ports:
                - containerPort: 6379
                  name: gcs
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
                resources:
                  limits:
                    cpu: "2"
                    memory: "8G"
                  requests:
                    cpu: "2"
                    memory: "8G"
        workerGroupSpecs:
        - replicas: 1
          minReplicas: 1
          maxReplicas: 10
          numOfHosts: 1
          groupName: tpu-group
          rayStartParams: {}
          template:
            spec:
              containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0-py310
                resources:
                  limits:
                    cpu: "100"
                    ephemeral-storage: 20Gi
                    google.com/tpu: "4"
                    memory: 200G
                  requests:
                    cpu: "100"
                    ephemeral-storage: 20Gi
                    google.com/tpu: "4"
                    memory: 200G
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
                cloud.google.com/gke-tpu-topology: 2x2x1

    This manifest describes a RayService custom resource that creates a RayCluster resource with 1 head node and a TPU worker group with a 2x2x1 topology, meaning each worker node will have 4 v4 TPU chips.

    The TPU node belongs to a single v4 TPU podslice with a 2x2x1 topology. To create a multi-host worker group, replace the gke-tpu nodeSelector values, google.com/tpu container limits and requests, and numOfHosts values with your multi-host configuration. For more information about TPU multi-host topologies, see System architecture in the Cloud TPU documentation.

  2. Apply the manifest to your cluster:

    kubectl apply -f ray-service-tpu.yaml
    
  3. Verify the RayService resource is running:

    kubectl get rayservices
    

    The output is similar to the following:

    NAME                   SERVICE STATUS   NUM SERVE ENDPOINTS
    stable-diffusion-tpu   Running          2
    

    In this output, Running in the SERVICE STATUS column indicates the RayService resource is ready.
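
If the RayService doesn't reach Running, you can inspect the underlying resources. For example:

# Show the RayService status and events.
kubectl describe rayservice stable-diffusion-tpu

# List the Ray head and TPU worker Pods created for the RayCluster.
kubectl get pods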

(Optional) View the Ray Dashboard

You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.

  1. Establish a port-forwarding session to the Ray dashboard from the Ray head service:

    kubectl port-forward svc/stable-diffusion-tpu-head-svc 8265:8265
    
  2. In a web browser, go to http://localhost:8265/.

  3. Click the Serve tab.

Send prompts to the model server

  1. Establish a port-forwarding session to the Serve endpoint from the Ray head service:

    kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000
    
  2. Open a new Cloud Shell session and change to the sample directory (kubernetes-engine-samples/ai-ml/gke-ray/rayserve/stable-diffusion).

  3. Submit a text-to-image prompt to the Stable Diffusion model server:

    python stable_diffusion_tpu_req.py --save_pictures
    

    The results of the Stable Diffusion inference are saved to a file named diffusion_results.png.

    Image generated by Stable Diffusion with 8 sections: a green chair, a man standing outside a house, a robot on the street, a family sitting at a table, a dog walking in a park, a flying dragon, a Japanese-style portrait of bears, and a waterfall.
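
If you run the client script in Cloud Shell, you can download the generated image to your local machine. The cloudshell command is available only inside Cloud Shell:

# Download the generated image from Cloud Shell to your local machine.
cloudshell download diffusion_results.png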

Clean up

Delete the project

Delete the Google Cloud project:

gcloud projects delete PROJECT_ID

Delete individual resources

To delete the cluster, run the following command:

gcloud container clusters delete ${CLUSTER_NAME} \
    --location=${COMPUTE_REGION}

What's next