Deploy a Ray Serve application with a Stable Diffusion model on Google Kubernetes Engine (GKE) with TPUs

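This guide demonstrates how to deploy and serve a Stable Diffusion model on Google Kubernetes Engine (GKE) using TPUs, Ray Serve, and the Ray Operator add-on.

This guide is intended for generative AI customers, new or existing GKE users, ML engineers, MLOps (DevOps) engineers, and platform administrators who are interested in using Kubernetes container orchestration capabilities to serve models with Ray.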

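About Ray and Ray Serve

Ray is an open-source, scalable compute framework for AI/ML applications. Ray Serve is Ray's model-serving library, used for scaling and serving models in a distributed environment. For more information, see Ray Serve in the Ray documentation.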

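About TPUs

Tensor Processing Units (TPUs) are specialized hardware accelerators designed to significantly speed up the training and inference of large-scale machine learning models. Using Ray with TPUs lets you scale high-performance ML applications. For more information about TPUs, see Introduction to Cloud TPU in the Cloud TPU documentation.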

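About the KubeRay TPU initialization webhook

As part of the Ray Operator add-on, GKE provides validating and mutating webhooks that handle TPU Pod scheduling and certain TPU environment variables that frameworks such as JAX require for container initialization. The KubeRay TPU webhook mutates Pods that carry the app.kubernetes.io/name: kuberay label and request TPUs, adding the following properties: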

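TPU_WORKER_ID: A unique integer for each worker Pod in the TPU slice.
TPU_WORKER_HOSTNAMES: A list of DNS hostnames for all TPU workers that need to communicate with each other within the slice. This variable is injected only for TPU Pods in a multi-host group.
replicaIndex: A Pod label that contains a unique identifier for the worker-group replica that the Pod belongs to. This is useful for multi-host worker groups, where multiple worker Pods might belong to the same replica, and Ray uses it to enable multi-host autoscaling.
TPU_NAME: A string that represents the GKE TPU PodSlice that this Pod belongs to, set to the same value as the replicaIndex label.
podAffinity: Ensures that GKE schedules TPU Pods with matching replicaIndex labels on the same node pool. This lets GKE scale multi-host TPUs atomically by node pool rather than by single node.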
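
As a rough illustration of how a workload might consume the injected variables, the following Python sketch reads TPU_WORKER_ID and TPU_WORKER_HOSTNAMES from a mapping. The helper name read_tpu_env, the comma-separated hostname format, and the fallback worker ID of 0 are assumptions made for illustration; they are not part of the webhook's documented contract.

```python
import os
from typing import Mapping, NamedTuple


class TpuEnv(NamedTuple):
    worker_id: int
    hostnames: list


def read_tpu_env(env: Mapping[str, str] = os.environ) -> TpuEnv:
    """Parse the TPU variables that the webhook injects into a worker Pod."""
    # TPU_WORKER_ID is described above as a unique integer per worker Pod.
    # Assumption: fall back to 0 when the variable is absent.
    worker_id = int(env.get("TPU_WORKER_ID", "0"))
    # Assumption: hostnames are comma-separated. The variable is injected
    # only for multi-host groups, so a missing value becomes an empty list.
    raw = env.get("TPU_WORKER_HOSTNAMES", "")
    hostnames = [host.strip() for host in raw.split(",") if host.strip()]
    return TpuEnv(worker_id, hostnames)
```

Calling read_tpu_env() with no arguments reads from the process environment; passing a plain dict makes the helper easy to exercise outside a Pod.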

Objectives

Create a GKE cluster with a TPU node pool.
Deploy a Ray cluster with TPUs.
Deploy a RayService custom resource.
Interact with the Stable Diffusion model server.

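Costs

In this document, you use billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.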

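Before you begin

Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl and the gcloud CLI. If you don't use Cloud Shell, install the gcloud CLI.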

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

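Note: If you installed the gcloud CLI previously, make sure you have the latest version by running gcloud components update.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command: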

gcloud init

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the GKE API:

gcloud services enable container.googleapis.com

Grant roles to your user account. Run the following command once for each of the following IAM roles: roles/container.clusterAdmin, roles/container.admin
```
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:USER_IDENTIFIER" --role=ROLE
```
Replace the following:
- PROJECT_ID: your project ID.
- USER_IDENTIFIER: the identifier for your user account—for example, myemail@example.com.
- ROLE: the IAM role that you grant to your user account.

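Ensure sufficient quota

Ensure that your Google Cloud project has sufficient TPU quota in your Compute Engine region or zone. For more information, see Ensure sufficient TPU and GKE quotas in the Cloud TPU documentation. You might also need to increase your quotas for the following:

Persistent Disk SSD (GB)
In-use IP addresses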

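Prepare your environment

To prepare your environment, follow these steps:

In the Google Cloud console, click Activate Cloud Shell to launch a Cloud Shell session. The session opens in the bottom pane of the Google Cloud console.
Set environment variables: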
```
export PROJECT_ID=PROJECT_ID
export CLUSTER_NAME=ray-cluster
# us-central2-b is a zone; the --location flag accepts a zone or a region.
export COMPUTE_REGION=us-central2-b
export CLUSTER_VERSION=CLUSTER_VERSION
```
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
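
Before you create the cluster, it can help to confirm that the chosen version satisfies the 1.30.1 minimum. The following Python sketch compares only the leading major.minor.patch triple; the helper name meets_minimum and the example version string with a -gke suffix are hypothetical and used only for illustration.

```python
import re


def meets_minimum(version: str, minimum: tuple = (1, 30, 1)) -> bool:
    """Return True if a GKE version string is at or above the minimum."""
    # Only the leading major.minor.patch triple is compared; any suffix
    # after it (for example a "-gke.N" build tag) is ignored.
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups()) >= minimum


print(meets_minimum("1.30.1-gke.1000000"))  # hypothetical version string
```

Comparing integer tuples avoids the pitfall of string comparison, where "1.9.0" would incorrectly sort after "1.30.1".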

Clone the GitHub repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples

Change to the working directory:

cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/stable-diffusion

Create a cluster with a TPU node pool

Create a Standard GKE cluster with a TPU node pool:

Create a Standard mode cluster with the Ray Operator add-on enabled:

gcloud container clusters create ${CLUSTER_NAME} \
    --addons=RayOperator \
    --machine-type=n1-standard-8 \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

Create a single-host TPU node pool:

gcloud container node-pools create tpu-pool \
    --location=${COMPUTE_REGION} \
    --cluster=${CLUSTER_NAME} \
    --machine-type=ct4p-hightpu-4t \
    --num-nodes=1

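To use TPUs in Standard mode, you must select:

A Compute Engine location that supports TPU accelerators
A machine type that is compatible with the TPU, and
The physical topology of the TPU PodSlice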

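Configure a RayCluster resource with TPUs

Configure your RayCluster manifest to prepare your TPU workload:

Configure the TPU `nodeSelector`

GKE uses Kubernetes nodeSelectors to ensure that TPU workloads are scheduled on the appropriate TPU topology and accelerator. For more information about selecting TPU nodeSelectors, see Deploy TPU workloads in GKE Standard.

Update the ray-cluster.yaml manifest to schedule your Pod on a v4 TPU PodSlice with a 2x2x1 topology: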

nodeSelector:
  cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
  cloud.google.com/gke-tpu-topology: 2x2x1

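Configure TPU container resources

To use a TPU accelerator, you must specify the number of TPU chips that GKE should allocate to each Pod. To do so, configure the google.com/tpu resource limits and requests in the TPU container field of the workerGroupSpecs in your RayCluster manifest.

Update the ray-cluster.yaml manifest with resource limits and requests: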

resources:
  limits:
    cpu: "1"
    ephemeral-storage: 10Gi
    google.com/tpu: "4"
    memory: "2G"
  requests:
    cpu: "1"
    ephemeral-storage: 10Gi
    google.com/tpu: "4"
    memory: "2G"
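
Kubernetes treats google.com/tpu as an extended resource, and extended resources cannot be overcommitted, so a request must equal the limit when both are set. The following Python sketch checks that rule on a plain dictionary that mirrors the resources block; the helper name tpu_chips is hypothetical, and the sketch does not parse YAML itself.

```python
TPU_RESOURCE = "google.com/tpu"


def tpu_chips(resources: dict) -> int:
    """Return the TPU chip count, checking that requests equal limits."""
    limit = resources.get("limits", {}).get(TPU_RESOURCE)
    request = resources.get("requests", {}).get(TPU_RESOURCE)
    if limit is None:
        raise ValueError(f"{TPU_RESOURCE} limit is missing")
    # Extended resources cannot be overcommitted: a request, if present,
    # must equal the limit.
    if request is not None and str(request) != str(limit):
        raise ValueError(f"{TPU_RESOURCE} request {request} != limit {limit}")
    return int(limit)


# Mirrors the resources block in ray-cluster.yaml.
resources = {
    "limits": {"cpu": "1", "ephemeral-storage": "10Gi", TPU_RESOURCE: "4", "memory": "2G"},
    "requests": {"cpu": "1", "ephemeral-storage": "10Gi", TPU_RESOURCE: "4", "memory": "2G"},
}
print(tpu_chips(resources))  # prints 4
```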

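Configure the worker group `numOfHosts`

KubeRay v1.1.0 adds a numOfHosts field to the RayCluster custom resource. The field specifies the number of TPU hosts to create per worker group replica. For multi-host worker groups, a replica is treated as a PodSlice rather than an individual worker, and each replica creates numOfHosts worker nodes.

Update the ray-cluster.yaml manifest with the following: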

workerGroupSpecs:
  # Several lines omitted
  numOfHosts: 1 # the number of "hosts" or workers per replica
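
The relationship between replicas and numOfHosts described above is simple multiplication. The following Python sketch makes it explicit; the function name worker_pod_count and the multi-host values in the example are hypothetical.

```python
def worker_pod_count(replicas: int, num_of_hosts: int) -> int:
    """Total worker Pods created for one worker group."""
    if replicas < 0 or num_of_hosts < 1:
        raise ValueError("replicas must be >= 0 and numOfHosts must be >= 1")
    # Each replica is one PodSlice made up of numOfHosts worker Pods.
    return replicas * num_of_hosts


print(worker_pod_count(replicas=1, num_of_hosts=1))  # prints 1 (this guide)
print(worker_pod_count(replicas=2, num_of_hosts=4))  # prints 8 (hypothetical multi-host group)
```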

Create a RayService custom resource

Create a RayService custom resource:

Review the following manifest:

apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stable-diffusion-tpu
spec:
  serveConfigV2: |
    applications:
      - name: stable_diffusion
        import_path: ai-ml.gke-ray.rayserve.stable-diffusion.stable_diffusion_tpu:deployment
        runtime_env:
          working_dir: "https://github.com/GoogleCloudPlatform/kubernetes-engine-samples/archive/refs/heads/main.zip"
          pip:
            - diffusers==0.7.2
            - flax
            - jax[tpu]==0.4.11
            - -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
            - fastapi
  rayClusterConfig:
    rayVersion: '2.9.0'
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-ml:2.9.0-py310
            ports:
            - containerPort: 6379
              name: gcs
            - containerPort: 8265
              name: dashboard
            - containerPort: 10001
              name: client
            - containerPort: 8000
              name: serve
            resources:
              limits:
                cpu: "2"
                memory: "8G"
              requests:
                cpu: "2"
                memory: "8G"
    workerGroupSpecs:
    - replicas: 1
      minReplicas: 1
      maxReplicas: 10
      numOfHosts: 1
      groupName: tpu-group
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-worker
            image: rayproject/ray-ml:2.9.0-py310
            resources:
              limits:
                cpu: "100"
                ephemeral-storage: 20Gi
                google.com/tpu: "4"
                memory: 200G
              requests:
                cpu: "100"
                ephemeral-storage: 20Gi
                google.com/tpu: "4"
                memory: 200G
          nodeSelector:
            cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
            cloud.google.com/gke-tpu-topology: 2x2x1

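This manifest describes a RayService custom resource that creates a RayCluster resource with one head node and a TPU worker group with a 2x2x1 topology, which means that each worker node has 4 v4 TPU chips.

The TPU node belongs to a single v4 TPU PodSlice with a 2x2x1 topology. To create a multi-host worker group, replace the gke-tpu nodeSelector values, the google.com/tpu container limits and requests, and the numOfHosts value with your multi-host configuration. For more information about TPU multi-host topologies, see System architecture in the Cloud TPU documentation.

Apply the manifest to your cluster: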
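
The link between the topology, the chips per Pod, and numOfHosts can be sketched as arithmetic. The following Python example assumes that the chip count is the product of the topology dimensions and that each host exposes a fixed number of chips (4 here, matching google.com/tpu: "4" in the manifest). The function names and the 2x2x2 example are hypothetical.

```python
from math import prod


def chips_in_topology(topology: str) -> int:
    """Multiply the dimensions of a topology string such as "2x2x1"."""
    return prod(int(dim) for dim in topology.split("x"))


def hosts_for_topology(topology: str, chips_per_host: int) -> int:
    """Number of TPU hosts (numOfHosts) needed to cover a topology."""
    chips = chips_in_topology(topology)
    if chips % chips_per_host != 0:
        raise ValueError(f"{topology} does not divide into {chips_per_host}-chip hosts")
    return chips // chips_per_host


print(chips_in_topology("2x2x1"))      # prints 4
print(hosts_for_topology("2x2x1", 4))  # prints 1 (single host, as in this guide)
print(hosts_for_topology("2x2x2", 4))  # prints 2 (hypothetical multi-host slice)
```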
```
kubectl apply -f ray-service-tpu.yaml
```
Verify that the RayService resource is running:
```
kubectl get rayservices
```
The output is similar to the following:
```
NAME                   SERVICE STATUS   NUM SERVE ENDPOINTS
stable-diffusion-tpu   Running          2
```
In this output, Running in the SERVICE STATUS column indicates that the RayService resource is ready.
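
If you want to check the status from a script, one option is to parse the table that kubectl prints. The following Python sketch works on the text shown above; the helper names parse_table and is_running are hypothetical, and the parser assumes that columns are separated by at least two spaces and that no cell is empty.

```python
import re


def parse_table(output: str) -> list:
    """Parse column-aligned kubectl output into one dict per row."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    # Column names can contain single spaces ("SERVICE STATUS"), so split
    # on runs of two or more spaces instead of on any whitespace.
    header = re.split(r"\s{2,}", lines[0].strip())
    return [dict(zip(header, re.split(r"\s{2,}", line.strip()))) for line in lines[1:]]


def is_running(output: str, name: str) -> bool:
    """Return True if the named RayService reports SERVICE STATUS Running."""
    return any(
        row.get("NAME") == name and row.get("SERVICE STATUS") == "Running"
        for row in parse_table(output)
    )


sample = (
    "NAME                   SERVICE STATUS   NUM SERVE ENDPOINTS\n"
    "stable-diffusion-tpu   Running          2\n"
)
print(is_running(sample, "stable-diffusion-tpu"))  # prints True
```

To use the helper against a live cluster, pass it the standard output of kubectl get rayservices, for example captured with subprocess.run(..., capture_output=True, text=True).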

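(Optional) View the Ray Dashboard

You can view your Ray Serve deployment and relevant logs from the Ray Dashboard.

Establish a port-forwarding session to the Ray Dashboard from the Ray head service: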

kubectl port-forward svc/stable-diffusion-tpu-head-svc 8265:8265

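In a web browser, go to http://localhost:8265/.
Click the Serve tab.

Send prompts to the model server

Establish a port-forwarding session to the Serve endpoint through the stable-diffusion-tpu-serve-svc Service: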

kubectl port-forward svc/stable-diffusion-tpu-serve-svc 8000

Open a new Cloud Shell session.
Submit a text-to-image prompt to the Stable Diffusion model server:
```
python stable_diffusion_tpu_req.py  --save_pictures
```
The results of the Stable Diffusion inference are saved to a file named diffusion_results.png.

Clean up

Delete the project
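
For readers who want to see roughly what such a request could look like, the following Python sketch builds a GET URL for the forwarded port and writes the response body to a file. This is not the code of stable_diffusion_tpu_req.py. The /imagine route, the prompt query parameter, the output file name, and the assumption that the response body is raw image bytes are all hypothetical; check the deployment code in the sample repository for the actual interface.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_prompt_url(prompt: str, base_url: str = "http://localhost:8000/imagine") -> str:
    """Build a GET URL that carries the prompt as a query parameter."""
    # Assumption: the server reads the prompt from a "prompt" query
    # parameter at the /imagine route.
    query = urlencode({"prompt": prompt})
    return f"{base_url}?{query}"


def fetch_image(prompt: str, path: str = "result.png") -> None:
    """Send the prompt and write the response body to a file."""
    # Requires the port-forwarding session on port 8000 to be active.
    # Assumption: the response body is the raw image data.
    with urlopen(build_prompt_url(prompt), timeout=600) as response:
        data = response.read()
    with open(path, "wb") as image_file:
        image_file.write(data)


print(build_prompt_url("a cat on a mat"))
# prints http://localhost:8000/imagine?prompt=a+cat+on+a+mat
```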

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete individual resources

To delete the cluster, enter the following command:

gcloud container clusters delete ${CLUSTER_NAME} --location=${COMPUTE_REGION}

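What's next

Learn about Ray on Kubernetes.
Explore the KubeRay documentation.
Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at the Cloud Architecture Center.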
