Serve Stable Diffusion XL (SDXL) using TPUs on GKE with MaxDiffusion

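This tutorial shows you how to serve the SDXL image generation model using Tensor Processing Units (TPUs) on Google Kubernetes Engine (GKE) with MaxDiffusion. In this tutorial, you download the model from Hugging Face and deploy it on an Autopilot or Standard cluster by using a container that runs MaxDiffusion.

This guide is a good starting point if you need the granular control, customization, scalability, resilience, portability, and cost-effectiveness of managed Kubernetes when you deploy and serve AI/ML workloads. If you need a unified managed AI platform to build and serve ML models quickly and cost-effectively, we recommend that you try our Vertex AI deployment solution.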

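Background

By serving SDXL using TPUs on GKE with MaxDiffusion, you can build a robust, production-ready serving solution with the benefits of managed Kubernetes, including cost efficiency, scalability, and higher availability. This section describes the key technologies used in this tutorial.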

Stable Diffusion XL (SDXL)

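Stable Diffusion XL (SDXL) is a latent diffusion model (LDM) that MaxDiffusion supports for inference. In generative AI, you can use LDMs to generate high-quality images from text descriptions. LDMs are useful for applications such as image search and image captioning.

SDXL supports single-host or multi-host inference with sharding annotations. This lets SDXL be trained and run across multiple machines, which can improve efficiency.

To learn more, see the Generative Models repository by Stability AI and the SDXL paper.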

TPU

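TPUs are application-specific integrated circuits (ASICs) that Google custom-developed to accelerate machine learning and AI models built with frameworks such as TensorFlow, PyTorch, and JAX.

Before you use TPUs in GKE, we recommend that you complete the following learning path:

- Learn about current TPU version availability in the Cloud TPU system architecture.
- Learn about TPUs in GKE.

This tutorial covers serving the SDXL model. GKE deploys the model on single-host TPU v5e nodes, with the TPU topology configured according to the model requirements so that prompts are served with low latency. In this guide, the model uses a TPU v5e chip with a 1x1 topology.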

MaxDiffusion

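MaxDiffusion is a collection of reference implementations, written in Python and JAX, of various latent diffusion models that run on XLA devices, including TPUs and GPUs. MaxDiffusion is a starting point for diffusion projects in both research and production.

To learn more, see the MaxDiffusion repository.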

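Objectives

This tutorial is intended for generative AI customers who use JAX, new or existing SDXL users, and any ML engineers, MLOps (DevOps) engineers, or platform administrators who are interested in using Kubernetes container orchestration capabilities to serve generative AI models such as SDXL.

This tutorial covers the following steps:

1. Create a GKE Autopilot or Standard cluster with the recommended TPU topology, based on the model characteristics.
2. Build the SDXL inference container image.
3. Deploy the SDXL inference server on GKE.
4. Interact with the model through a web app.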

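Architecture

This section describes the GKE architecture used in this tutorial. The architecture consists of a GKE Autopilot or Standard cluster that provisions TPUs and hosts the MaxDiffusion components. GKE uses these components to deploy and serve the model.

The following diagram shows the components of this architecture:

[Figure: Example architecture for serving MaxDiffusion with TPU v5e on GKE.]

This architecture includes the following components:

- A GKE Autopilot or Standard regional cluster.
- One single-host TPU slice node pool that hosts the SDXL model in the MaxDiffusion Deployment.
- A Service of type ClusterIP that distributes inbound traffic to all MaxDiffusion HTTP replicas.
- A WebApp HTTP server with an external LoadBalancer Service, which distributes inbound traffic and forwards model-serving traffic to the ClusterIP Service.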

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the required API.

Enable the API

Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
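  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.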

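Make sure that you have sufficient quota for TPU v5e PodSlice Lite chips. In this tutorial, you use on-demand instances.

Prepare the environment

In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software that you need for this tutorial, including kubectl and the gcloud CLI.

To set up your environment with Cloud Shell, follow these steps:

In the Google Cloud console, click Activate Cloud Shell to launch a Cloud Shell session. The session opens in the bottom pane of the Google Cloud console.
Set the default environment variables: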
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION_NAME
export ZONE=ZONE
```
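Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_NAME: the name of your GKE cluster.
- REGION_NAME: the region where your GKE cluster, Cloud Storage buckets, and TPU nodes are located. The region must contain zones where TPU v5e machine types are available (for example, us-west1, us-west4, us-central1, us-east1, us-east5, or europe-west4).
- (Standard cluster only) ZONE: the zone where the TPU resources are available (for example, us-west4-a). For Autopilot clusters, you don't need to specify the zone, only the region.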
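
Before you continue, it can help to confirm that these variables are actually set in the current shell session. The following is a minimal sketch and not part of the tutorial's sample code: `missing_vars` is a hypothetical helper name, and the sketch assumes a POSIX-compatible shell such as the Bash shell in Cloud Shell.

```shell
# Hypothetical helper (not part of the sample repository): prints the names
# of the listed variables that are unset or empty, separated by spaces.
missing_vars() {
  missing=""
  for name in "$@"; do
    # Indirectly read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  # Strip the leading space before printing.
  printf '%s\n' "${missing# }"
}
```

For example, `missing_vars PROJECT_ID CLUSTER_NAME REGION ZONE` prints an empty line when all four variables are set. ZONE is only needed for Standard clusters.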

Clone the sample repository, open the tutorial directory, and create an Artifact Registry repository for the container images:

```
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
cd kubernetes-engine-samples/ai-ml/maxdiffusion-tpu
WORK_DIR=$(pwd)
gcloud artifacts repositories create gke-llm --repository-format=docker --location=$REGION
gcloud auth configure-docker $REGION-docker.pkg.dev
```

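Create and configure Google Cloud resources

Follow these instructions to create the required resources.

Create a GKE cluster

You can serve SDXL on TPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that best fits your workloads, see Choose a GKE mode of operation.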

Autopilot

In Cloud Shell, run the following command:

```
gcloud container clusters create-auto ${CLUSTER_NAME} \
  --project=${PROJECT_ID} \
  --region=${REGION} \
  --release-channel=rapid \
  --cluster-version=1.29
```

GKE creates an Autopilot cluster with CPU and TPU nodes as requested by the deployed workloads.

Configure kubectl to communicate with your cluster:

```
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}
```

Standard

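Create a regional GKE Standard cluster that uses Workload Identity Federation for GKE.

```
gcloud container clusters create ${CLUSTER_NAME} \
    --enable-ip-alias \
    --machine-type=n2-standard-4 \
    --num-nodes=2 \
    --workload-pool=${PROJECT_ID}.svc.id.goog \
    --location=${REGION}
```

The cluster creation might take several minutes.

Run the following command to create a node pool for your cluster: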
```
gcloud container node-pools create maxdiffusion-tpu-nodepool \
  --cluster=${CLUSTER_NAME} \
  --machine-type=ct5lp-hightpu-1t \
  --num-nodes=1 \
  --region=${REGION} \
  --node-locations=${ZONE} \
  --spot
```
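GKE creates a TPU v5e node pool with a 1x1 topology and one node. Because the command passes the --spot flag, the node runs on Spot VMs. To use on-demand capacity instead, as assumed in the quota note in "Before you begin", omit that flag.

To create node pools with different topologies, learn how to plan your TPU configuration. Make sure that you update the sample values in this tutorial, such as cloud.google.com/gke-tpu-topology and google.com/tpu.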
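
When you change the topology, the chip count changes with it. As a rough sketch for single-host slices, where every chip in the slice sits on one node, the chip count is the product of the topology dimensions, and the `google.com/tpu` value would need to match it. `topology_chips` is a hypothetical helper, and the single-host assumption matters: in a multi-host slice each Pod requests only the chips on its own node, so consult the TPU planning documentation rather than relying on this product.

```shell
# Hypothetical helper: multiplies the dimensions of a topology string such
# as 1x1 or 2x4. Assumes a single-host slice (all chips on one node).
topology_chips() {
  chips=1
  old_ifs=$IFS
  IFS=x                      # split the topology string on "x"
  for dim in $1; do
    chips=$((chips * dim))
  done
  IFS=$old_ifs               # restore the original field separator
  printf '%s\n' "$chips"
}
```

For the 1x1 topology used in this tutorial, `topology_chips 1x1` prints 1, which matches the `google.com/tpu: 1` value in the manifest.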

Configure kubectl to communicate with your cluster:

```
gcloud container clusters get-credentials ${CLUSTER_NAME} --location=${REGION}
```

Build the SDXL inference container

Follow these instructions to build a container image for the SDXL inference server.

Open the build/server/cloudbuild.yaml manifest:

```
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', '$LOCATION-docker.pkg.dev/$PROJECT_ID/gke-llm/max-diffusion:latest', '.' ]
images:
- '$LOCATION-docker.pkg.dev/$PROJECT_ID/gke-llm/max-diffusion:latest'
```

Run the build to create the inference container image:
```
cd $WORK_DIR/build/server
gcloud builds submit . --region=$REGION
```
The output contains the path of the container image.
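
The build config tags the image with the `$LOCATION` and `$PROJECT_ID` substitutions, so a build submitted with `--region=$REGION` should produce a path that follows the pattern below. This small sketch composes that path from the variables set earlier so that you can compare it with the build output. `image_path` is a hypothetical helper, not something the sample repository provides.

```shell
# Hypothetical helper: prints the expected Artifact Registry path for an
# image in the gke-llm repository, based on $REGION and $PROJECT_ID.
image_path() {
  # $1 = image name, for example max-diffusion
  printf '%s-docker.pkg.dev/%s/gke-llm/%s:latest\n' "$REGION" "$PROJECT_ID" "$1"
}
```

For example, `image_path max-diffusion` prints the path that the server manifest's `image` field should contain after the placeholders are substituted.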

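Deploy the SDXL inference server

In this section, you deploy the SDXL inference server. This tutorial uses a Kubernetes Deployment to deploy the server. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Explore the serve_sdxl_v5e.yaml manifest: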

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stable-diffusion-deployment
spec:
  selector:
    matchLabels:
      app: max-diffusion-server
  replicas: 1  # number of nodes in node-pool
  template:
    metadata:
      labels:
        app: max-diffusion-server
    spec:
      nodeSelector:
        cloud.google.com/gke-tpu-topology: 1x1  # target topology
        cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
        #cloud.google.com/gke-spot: "true"
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      containers:
      - name: serve-stable-diffusion
        image: REGION-docker.pkg.dev/PROJECT_ID/gke-llm/max-diffusion:latest
        env:
        - name: MODEL_NAME
          value: 'stable_diffusion'
        ports:
        - containerPort: 8000
        resources:
          requests:
            google.com/tpu: 1  # TPU chip request
          limits:
            google.com/tpu: 1  # TPU chip limit
        volumeMounts:
        - mountPath: /dev/shm
          name: dshm
---
apiVersion: v1
kind: Service
metadata:
  name: max-diffusion-server
  labels:
    app: max-diffusion-server
spec:
  type: ClusterIP
  ports:
    - port: 8000
      targetPort: 8000
      name: http-max-diffusion-server
      protocol: TCP
  selector:
    app: max-diffusion-server
```

Update the project ID and region in the manifest:

```
cd $WORK_DIR
sed -i "s|PROJECT_ID|$PROJECT_ID|g" serve_sdxl_v5e.yaml
sed -i "s|REGION|$REGION|g" serve_sdxl_v5e.yaml
```
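
Because `sed -i` rewrites the file in place, you might prefer to preview the substitution before you run it. The following sketch applies the same two replacements to standard input and prints the result without touching the file. `render_manifest` is a hypothetical helper name.

```shell
# Hypothetical helper: applies the tutorial's PROJECT_ID and REGION
# substitutions to a manifest read from stdin and writes it to stdout.
render_manifest() {
  # $1 = project ID, $2 = region
  sed -e "s|PROJECT_ID|$1|g" -e "s|REGION|$2|g"
}
```

For example, `render_manifest "$PROJECT_ID" "$REGION" < serve_sdxl_v5e.yaml | grep image:` shows the image path that the Deployment would use, provided the file still contains the placeholders.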

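Apply the manifest:

```
kubectl apply -f serve_sdxl_v5e.yaml
```

The output is similar to the following:

```
deployment.apps/stable-diffusion-deployment created
service/max-diffusion-server created
```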

Verify the status of the model:

```
watch kubectl get deploy
```

The output is similar to the following:

```
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
stable-diffusion-deployment   1/1     1            1           8m21s
```
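
If you would rather block until the Deployment is available instead of watching the table, `kubectl rollout status` is one option. The sketch below wraps it in a hypothetical helper named `wait_for_server`; the 15-minute default timeout is an assumption, not a value from the tutorial. Note that the manifest shown earlier defines no readiness probe, so this reports only that the Pod is available, not that model compilation has finished. The logs remain the place to confirm that.

```shell
# Hypothetical helper: waits for the server Deployment rollout to complete.
wait_for_server() {
  # $1 = timeout as a duration, for example 15m (assumed default)
  kubectl rollout status deployment/stable-diffusion-deployment \
    --timeout="${1:-15m}"
}
```

Call it as `wait_for_server` or, with a shorter limit, `wait_for_server 5m`.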

Retrieve the ClusterIP address:
```
kubectl get service max-diffusion-server
```
The output contains a CLUSTER-IP column. Note the CLUSTER-IP value.

Validate the Deployment:

```
export ClusterIP=CLUSTER_IP
kubectl run curl --image=curlimages/curl \
    -it --rm --restart=Never \
    -- "$ClusterIP:8000"
```

Replace CLUSTER_IP with the CLUSTER-IP value that you noted earlier. The output is similar to the following:

```
{"message":"Hello world! From FastAPI running on Uvicorn with Gunicorn."}
pod "curl" deleted
```
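
As an alternative to copying the address by hand, you could read it directly from the Service object with a JSONPath query. This is a minimal sketch; `get_cluster_ip` is a hypothetical helper name, and it assumes the Service keeps the `max-diffusion-server` name from the manifest.

```shell
# Hypothetical helper: prints the ClusterIP of the inference Service.
get_cluster_ip() {
  kubectl get service max-diffusion-server \
    -o jsonpath='{.spec.clusterIP}'
}
```

With the helper defined, `export ClusterIP=$(get_cluster_ip)` sets the same variable that the commands in this section use.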

View the logs from the Deployment:

```
kubectl logs -l app=max-diffusion-server
```

After the Deployment finishes starting, the output is similar to the following:

```
2024-06-12 15:45:45,459 [INFO] __main__: replicate params:
2024-06-12 15:45:46,175 [INFO] __main__: start initialized compiling
2024-06-12 15:45:46,175 [INFO] __main__: Compiling ...
2024-06-12 15:45:46,175 [INFO] __main__: aot compiling:
2024-06-12 15:45:46,176 [INFO] __main__: tokenize prompts:2024-06-12 15:48:49,093 [INFO] __main__: Compiled in 182.91802048683167
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

Deploy the webapp client

In this section, you deploy the webapp client that you use to interact with the SDXL model.

Explore the build/webapp/cloudbuild.yaml manifest:

```
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: [ 'build', '-t', '$LOCATION-docker.pkg.dev/$PROJECT_ID/gke-llm/max-diffusion-web:latest', '.' ]
images:
- '$LOCATION-docker.pkg.dev/$PROJECT_ID/gke-llm/max-diffusion-web:latest'
```

Run the build in the build/webapp directory to create the client container image:
```
cd $WORK_DIR/build/webapp
gcloud builds submit . --region=$REGION
```
The output contains the path of the container image.

Open the serve_sdxl_client.yaml manifest:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: max-diffusion-client
spec:
  selector:
    matchLabels:
      app: max-diffusion-client
  template:
    metadata:
      labels:
        app: max-diffusion-client
    spec:
      containers:
      - name: webclient
        image: REGION-docker.pkg.dev/PROJECT_ID/gke-llm/max-diffusion-web:latest
        env:
          - name: SERVER_URL
            value: "http://ClusterIP:8000"
        resources:
          requests:
            memory: "128Mi"
            cpu: "250m"
          limits:
            memory: "256Mi"
            cpu: "500m"
        ports:
        - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: max-diffusion-client-service
spec:
  type: LoadBalancer
  selector:
    app: max-diffusion-client
  ports:
  - port: 8080
    targetPort: 5000
```

Update the project ID, ClusterIP address, and region in the manifest:

```
cd $WORK_DIR
sed -i "s|PROJECT_ID|$PROJECT_ID|g" serve_sdxl_client.yaml
sed -i "s|ClusterIP|$ClusterIP|g" serve_sdxl_client.yaml
sed -i "s|REGION|$REGION|g" serve_sdxl_client.yaml
```

Apply the manifest:

```
kubectl apply -f serve_sdxl_client.yaml
```

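Retrieve the LoadBalancer IP address:
```
kubectl get service max-diffusion-client-service
```
The output contains an EXTERNAL-IP column. Note the EXTERNAL-IP value. If the column shows a pending state, wait a short while and run the command again.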
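
The external address can take a little while to be assigned, so a small polling loop can save repeated manual checks. The following is a sketch under stated assumptions: `get_external_ip` and `wait_for_external_ip` are hypothetical helpers, the retry count and delay defaults are arbitrary, and the loop assumes the load balancer reports an IP address rather than a hostname.

```shell
# Hypothetical helper: prints the external IP of the client Service,
# or nothing if no address has been assigned yet.
get_external_ip() {
  kubectl get service max-diffusion-client-service \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}' 2>/dev/null
}

# Hypothetical helper: polls until an external IP exists, then prints the
# web app URL on port 8080. Returns 1 if no address appears in time.
wait_for_external_ip() {
  attempts=${1:-30}   # assumed default: 30 tries
  delay=${2:-10}      # assumed default: 10 seconds between tries
  while [ "$attempts" -gt 0 ]; do
    ip=$(get_external_ip)
    if [ -n "$ip" ]; then
      printf 'http://%s:8080\n' "$ip"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}
```

Running `wait_for_external_ip` prints the URL to open in a browser once the address is available.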

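Interact with the model by using the web page

Access the following URL from a web browser:
```
http://EXTERNAL_IP:8080
```
Replace EXTERNAL_IP with the EXTERNAL-IP value that you noted earlier.

Interact with SDXL by using the chat interface. Add a prompt and click Submit. For example:

```
Create a detailed image of a fictional historical site, capturing its unique architecture and cultural significance
```

The output is a model-generated image similar to the following example:

[Figure: Image generated by SDXL.]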

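Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project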

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

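Delete the individual resources

To keep the project and delete only the individual resources, run the following command and follow the prompts:

```
gcloud container clusters delete ${CLUSTER_NAME} --region=${REGION}
```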
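
Deleting the cluster does not remove the gke-llm Artifact Registry repository that you created while preparing the environment, and the images stored there can continue to incur storage charges. If you keep the project, a sketch of the extra cleanup step might look like the following. `delete_image_repository` is a hypothetical helper name, and it assumes that `REGION` still holds the region used earlier.

```shell
# Hypothetical helper: deletes the Artifact Registry repository that holds
# the max-diffusion and max-diffusion-web images.
delete_image_repository() {
  gcloud artifacts repositories delete gke-llm --location="${REGION}"
}
```

Run `delete_image_repository` and confirm the prompt only when you no longer need the container images.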

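What's next

- Configure this tutorial with other TPU topologies. To learn more about other TPU topologies, see Plan your TPU configuration.
- Explore the MaxDiffusion inference server sample code in the sample repository that you cloned in this tutorial.
- Learn more about TPUs in GKE.
- Explore the JetStream GitHub repository.
- Explore the Vertex AI Model Garden.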
