This guide demonstrates how to deploy and serve a Stable Diffusion model on Google Kubernetes Engine (GKE) using TPUs, Ray Serve, and the Ray Operator add-on.
This guide is intended for Generative AI customers, new or existing users of GKE, ML Engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities for serving models using Ray.
About Ray and Ray Serve
Ray is an open-source scalable compute framework for AI/ML applications. Ray Serve is a model serving library for Ray used for scaling and serving models in a distributed environment. For more information, see Ray Serve in the Ray documentation.
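To make the Ray Serve programming model concrete, the following minimal sketch defines and runs a trivial Serve deployment. It is unrelated to the Stable Diffusion application deployed later in this guide and assumes a local Ray installation (for example, pip install "ray[serve]"):
import time
from ray import serve

@serve.deployment
class Hello:
    async def __call__(self, request):
        # Serve passes the incoming HTTP request; returning a string sends a plain-text response.
        return "Hello from Ray Serve"

# Deploy the application; Serve routes HTTP requests on port 8000 to __call__.
serve.run(Hello.bind())

# Keep the driver process alive so the deployment stays available for local testing,
# for example: curl http://localhost:8000/
while True:
    time.sleep(60)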
About TPUs
Tensor Processing Units (TPUs) are specialized hardware accelerators designed to significantly speed up the training and inference of large-scale machine learning models. Using Ray with TPUs lets you seamlessly scale high-performance ML applications. For more information about TPUs, see Introduction to Cloud TPU in the Cloud TPU documentation.
About the KubeRay TPU initialization webhook
As part of the Ray Operator add-on, GKE provides validating and mutating webhooks that handle TPU Pod scheduling and certain TPU environment variables required by frameworks like JAX for container initialization. The KubeRay TPU webhook mutates Pods with the app.kubernetes.io/name: kuberay label requesting TPUs with the following properties:
- TPU_WORKER_ID: A unique integer for each worker Pod in the TPU slice.
- TPU_WORKER_HOSTNAMES: A list of DNS hostnames for all TPU workers that need to communicate with each other within the slice. This variable is only injected for TPU Pods in a multi-host group.
- replicaIndex: A Pod label that contains a unique identifier for the worker-group replica the Pod belongs to. This is useful for multi-host worker groups, where multiple worker Pods might belong to the same replica, and is used by Ray to enable multi-host autoscaling.
- TPU_NAME: A string representing the GKE TPU PodSlice this Pod belongs to, set to the same value as the replicaIndex label.
- podAffinity: Ensures GKE schedules TPU Pods with matching replicaIndex labels on the same node pool. This lets GKE scale multi-host TPUs atomically by node pools, rather than single nodes.
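For illustration, after mutation a single-host TPU worker Pod might carry fields similar to the following. This is a sketch rather than the webhook's literal output; the exact values depend on your worker group name, replica, and topology:
metadata:
  labels:
    replicaIndex: tpu-group-0          # illustrative worker-group replica identifier
spec:
  containers:
  - name: ray-worker
    env:
    - name: TPU_WORKER_ID
      value: "0"                       # unique per worker Pod in the slice
    - name: TPU_NAME
      value: tpu-group-0               # same value as the replicaIndex label
    # TPU_WORKER_HOSTNAMES is only injected for Pods in a multi-host group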
Objectives
- Create a GKE cluster with a TPU node pool.
- Deploy a Ray cluster with TPUs.
- Deploy a RayService custom resource.
- Interact with the Stable Diffusion model server.
Costs
In this document, you use billable components of Google Cloud, including GKE and Cloud TPU resources.
To generate a cost estimate based on your projected usage, use the pricing calculator.
When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.
Before you begin
Cloud Shell is preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI. If you don't use Cloud Shell, install the gcloud CLI.
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- Install the Google Cloud CLI.
- To initialize the gcloud CLI, run the following command:
  gcloud init
- Create or select a Google Cloud project:
  - Create a Google Cloud project:
    gcloud projects create PROJECT_ID
    Replace PROJECT_ID with a name for the Google Cloud project you are creating.
  - Select the Google Cloud project that you created:
    gcloud config set project PROJECT_ID
    Replace PROJECT_ID with your Google Cloud project name.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the GKE API:
  gcloud services enable container.googleapis.com
- Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.clusterAdmin, roles/container.admin
  gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
  Replace the following:
  - PROJECT_ID: your project ID.
  - USER_IDENTIFIER: the identifier for your user account. For example, user:myemail@example.com.
  - ROLE: each individual role.
Ensure sufficient quota
Ensure that your Google Cloud project has sufficient TPU quota in your Compute Engine region or zone. For more information, see Ensure sufficient TPU and GKE quotas in the Cloud TPU documentation. You might also need to increase your quotas for:
- Persistent Disk SSD (GB)
- In-use IP addresses
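If you want to inspect the regional Compute Engine quotas listed above (such as SSD capacity and in-use IP addresses) from the command line, one option is to describe the region that contains your chosen zone; this is a general-purpose check, not TPU-specific, and the --format flag only narrows the output to the quotas field:
gcloud compute regions describe us-central2 \
    --format="yaml(quotas)"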
Prepare your environment
To prepare your environment, follow these steps:
Launch a Cloud Shell session by clicking Activate Cloud Shell in the Google Cloud console. This launches a session in the bottom pane of the Google Cloud console.
Set environment variables:
export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=ray-cluster
export COMPUTE_REGION=us-central2-b
export CLUSTER_VERSION=CLUSTER_VERSION
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
Clone the GitHub repository:
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
Change to the working directory:
cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/stable-diffusion
Create a cluster with a TPU node pool
Create a Standard GKE cluster with a TPU node pool:
Create a Standard mode cluster with the Ray Operator enabled:
gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --machine-type=n1-standard-8 \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}
Create a single-host TPU node pool:
gcloud container node-pools create tpu-pool \
    --location=${COMPUTE_REGION} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct4p-hightpu-4t \
    --num-nodes=1 \
    --tpu-topology=2x2x1
To use TPUs with Standard mode, you must select:
- A Compute Engine location with capacity for TPU accelerators
- A compatible machine type for the TPU, and
- The physical topology of the TPU PodSlice
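Once the cluster and node pool exist, you can optionally verify that the TPU nodes registered with the labels that the later nodeSelector relies on. These verification commands are an addition to the original steps, but use only standard gcloud and kubectl functionality:
# Fetch credentials so kubectl targets the new cluster
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${COMPUTE_REGION}

# List nodes carrying the v4 TPU accelerator label
kubectl get nodes -l cloud.google.com/gke-tpu-accelerator=tpu-v4-podslice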
Configure a RayCluster resource with TPUs
Configure your RayCluster manifest to prepare your TPU workload:
Configure TPU nodeSelector
GKE uses Kubernetes nodeSelectors to ensure that TPU workloads are scheduled on the appropriate TPU topology and accelerator. For more information about selecting TPU nodeSelectors, see Deploy TPU workloads in GKE Standard.
Update the ray-cluster.yaml manifest to schedule your Pod on a v4 TPU podslice with a 2x2x1 topology:
nodeSelector:
cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
cloud.google.com/gke-tpu-topology: 2x2x1
Configure a TPU container resource
To use a TPU accelerator, you must specify the number of TPU chips that
GKE should allocate to each Pod by configuring the
google.com/tpu
resource limits
and requests
in the TPU container field
of your RayCluster manifest workerGroupSpecs
.
Update the ray-cluster.yaml manifest with resource limits and requests:
resources:
limits:
cpu: "1"
ephemeral-storage: 10Gi
google.com/tpu: "4"
memory: "2G"
requests:
cpu: "1"
ephemeral-storage: 10Gi
google.com/tpu: "4"
memory: "2G"
Configure worker group numOfHosts
KubeRay v1.1.0 adds a numOfHosts field to the RayCluster custom resource, which specifies the number of TPU hosts to create per worker group replica. For multi-host worker groups, replicas are treated as PodSlices rather than individual workers, with numOfHosts worker nodes being created per replica.
Update the ray-cluster.yaml manifest with the following:
workerGroupSpecs:
# Several lines omitted
numOfHosts: 1 # the number of "hosts" or workers per replica
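For comparison, a multi-host worker group pairs a larger topology with a matching numOfHosts value. The following sketch is illustrative only, assuming a v4 2x2x4 slice served by ct4p-hightpu-4t machines (16 chips at 4 chips per host, so 4 hosts per replica); the group name is hypothetical and several required fields are omitted:
workerGroupSpecs:
- groupName: tpu-group               # illustrative name; other fields omitted
  replicas: 1
  numOfHosts: 4                      # 2x2x4 slice = 16 chips, 4 chips per ct4p host
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
        cloud.google.com/gke-tpu-topology: 2x2x4
      containers:
      - name: ray-worker
        resources:
          limits:
            google.com/tpu: "4"      # chips per host, not per replica
          requests:
            google.com/tpu: "4"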
Create a RayService custom resource
Create a RayService custom resource:
Review the following manifest:
This manifest describes a RayService custom resource that creates a RayCluster resource with 1 head node and a TPU worker group with a 2x2x1 topology, meaning each worker node will have 4 v4 TPU chips.
The TPU node belongs to a single v4 TPU podslice with a 2x2x1 topology. To create a multi-host worker group, replace the gke-tpu nodeSelector values, google.com/tpu container limits and requests, and numOfHosts values with your multi-host configuration. For more information about TPU multi-host topologies, see System architecture in the Cloud TPU documentation.
Apply the manifest to your cluster:
kubectl apply -f ray-service-tpu.yaml
Verify the RayService resource is running:
kubectl get rayservices
The output is similar to the following:
NAME                   SERVICE STATUS   NUM SERVE ENDPOINTS
stable-diffusion-tpu   Running          2
In this output, Running in the SERVICE STATUS column indicates the RayService resource is ready.
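You can also list the Pods that KubeRay created for the RayService. This extra check is not part of the original procedure, but it uses the app.kubernetes.io/name: kuberay label described earlier; you should see the head Pod and the TPU worker Pod in the Running state:
kubectl get pods -l app.kubernetes.io/name=kuberay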
(Optional) View the Ray Dashboard
You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.
Establish a port-forwarding session to the Ray dashboard from the Ray head service:
kubectl port-forward svc/stable-diffusion-tpu-head-svc 8265:8265
In a web browser, go to http://localhost:8265/.
Click the Serve tab.
Send prompts to the model server
Establish a port-forwarding session to the Serve endpoint from the Ray head service:
kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000
Open a new Cloud Shell session.
Submit a text-to-image prompt to the Stable Diffusion model server:
python stable_diffusion_tpu_req.py --save_pictures
The results of the Stable Diffusion inference are saved to a file named diffusion_results.png.
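The repository's stable_diffusion_tpu_req.py script handles the request details for you. Purely as an illustration of what such a client does, the following hypothetical sketch sends a prompt to the forwarded Serve endpoint and writes the returned image to disk; the endpoint path (/imagine), query parameter (prompt), and PNG response format are assumptions here, so refer to the sample script for the actual request format:
import requests  # third-party library: pip install requests

# Hypothetical route and parameter names; check stable_diffusion_tpu_req.py
# for what the sample Serve application actually exposes.
prompt = "a photograph of an astronaut riding a horse"
resp = requests.get(
    "http://localhost:8000/imagine",
    params={"prompt": prompt},
    timeout=600,  # image generation can take a while on the first request
)
resp.raise_for_status()

# Assume the application returns raw PNG bytes and save them to disk.
with open("diffusion_results.png", "wb") as f:
    f.write(resp.content)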
Clean up
Delete the project
Delete a Google Cloud project:
gcloud projects delete PROJECT_ID
Delete individual resources
To delete the cluster, type:
gcloud container clusters delete ${CLUSTER_NAME}
What's next
- Learn about Ray on Kubernetes.
- Explore the KubeRay documentation.
- Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.