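Run Cloud TPU applications on GKE

This guide describes how to complete the following tasks:

Set up a Cloud TPU configuration in preparation for running under Google Kubernetes Engine
Create a GKE cluster with Cloud TPU support
Use TensorBoard to visualize Cloud TPU metrics and analyze application performance
Build and containerize your model in Docker

For more information about TPU VM architectures, see System Architecture. This guide applies only to the TPU Nodes architecture.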

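Benefits of running Cloud TPU applications on GKE

Cloud TPU training applications can be configured to run in containers within GKE Pods. This configuration has the following benefits:

Improved workflow setup and management: GKE manages the TPU lifecycle. Once Cloud TPU initialization and training are set up with GKE, GKE can repeat and manage your workloads, including recovery from Job failures.
Optimized cost: You pay for the TPU only while the Job is active. GKE automatically creates and deletes TPUs according to a Pod's resource requirements.

Flexible usage: A small change in your Pod spec is enough to request a different hardware accelerator (CPU, GPU, or TPU):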

kind: Pod
metadata:
  name: example-tpu
  annotations:
    # The Cloud TPUs that will be created for this Job will support
    # TensorFlow 2.12.1. This version MUST match the
    # TensorFlow version that your model is built on.
    tf-version.cloud-tpus.google.com: "2.12.1"
spec:
  containers:
  - name: example-container
    resources:
      limits:
        cloud-tpus.google.com/v2: 8
        # See the line above for TPU, or below for CPU / GPU.
        # cpu: 2
        # nvidia.com/gpu: 1

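Scalability: GKE provides APIs (Job and Deployment) that can scale to hundreds of GKE Pods and TPU nodes.
Fault tolerance: GKE's Job API, together with the TensorFlow checkpoint mechanism, provides run-to-completion semantics. If a VM instance or Cloud TPU node fails, your training job automatically reruns with the latest state read from the checkpoint.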
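
The fault-tolerance behavior described above depends on the training program resuming from its latest checkpoint when the Pod is rerun. The following is a minimal, framework-free sketch of that resume pattern. The JSON checkpoint file, the step counter, and the function names are all hypothetical; a real training job would use TensorFlow's checkpoint APIs and a Cloud Storage path instead.

```python
import json
import os


def load_latest_step(checkpoint_path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["step"]


def save_step(checkpoint_path, step):
    """Write the checkpoint via a temp file so a crash never leaves a partial file."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp_path, checkpoint_path)


def train(checkpoint_path, total_steps, fail_at=None):
    """Run from the last checkpoint to total_steps; optionally simulate a failure."""
    step = load_latest_step(checkpoint_path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        step += 1  # Stand-in for one real training step.
        save_step(checkpoint_path, step)
    return step
```

Because the entry point always starts by reading the checkpoint, a rerun after a failure continues from the last saved step instead of from step 0.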

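Cloud TPU and GKE configuration requirements and limitations

Note the following when defining your GKE configuration:

Cloud TPU is not supported in Windows Server node pools.
You must create your GKE cluster and node pools in a zone where Cloud TPU is available. You must also create the Cloud Storage buckets that hold your training data and models in the same region as your GKE cluster. For the list of available zones, see the types and zones document.
You must use RFC 1918 compliant IP addresses for your GKE clusters. For more information, see GKE Networking.
Each container can request at most one Cloud TPU, but multiple containers in a Pod can each request one Cloud TPU.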

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

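When you use Cloud TPU with GKE, your project uses billable components of Google Cloud. To estimate your costs, check Cloud TPU pricing and GKE pricing, and follow the instructions to clean up resources when you have finished with them.

Enable the following APIs on the Google Cloud console:

Create a new cluster with Cloud TPU support

Use the following instructions to set up your environment and create a GKE cluster with Cloud TPU support, using the gcloud CLI:

Install the gcloud components that you need to run GKE with Cloud TPU: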
```
$ gcloud components install kubectl 
```
Configure gcloud with your Google Cloud project ID:
```
$ gcloud config set project project-name
```
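Replace project-name with your Google Cloud project ID.

The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to use your credentials.
Configure gcloud with the zone where you plan to use a Cloud TPU resource. This example uses us-central1-b, but you can use a TPU in any supported zone.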
```
$ gcloud config set compute/zone us-central1-b
```
Use the gcloud container clusters create command to create a cluster on GKE with support for Cloud TPU.

Note: If you need to specify a Shared VPC, use the command shown in Use an existing CIDR range with Shared VPC.
```
$ gcloud container clusters create cluster-name \
  --release-channel=stable \
  --scopes=cloud-platform \
  --enable-ip-alias \
  --enable-tpu
```
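Command flag descriptions

release-channel

Release channels provide a way to manage automatic upgrades for your clusters. When you create a new cluster, you can choose its release channel. Your cluster is upgraded only to versions offered in that channel.

scopes

Ensures that all nodes in the cluster have access to your Cloud Storage bucket. For this access to work, the cluster and the storage bucket must be in the same project. Note that Kubernetes Pods by default inherit the scopes of the nodes to which they are deployed. Therefore, scopes=cloud-platform gives all Kubernetes Pods running in the cluster the cloud-platform scope. If you want to limit access on a per-Pod basis, see the GKE guide to authenticating with service accounts.

enable-ip-alias

Indicates that the cluster uses alias IP ranges. This is required for using Cloud TPU on GKE.

enable-tpu

Indicates that the cluster must support Cloud TPU.

tpu-ipv4-cidr (optional, not specified above)

Indicates the CIDR range to use for Cloud TPU. Specify the IP_RANGE in the form IP/20, such as 10.100.0.0/20. If you do not specify this flag, a CIDR range of size /20 is automatically allocated and assigned.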
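
Before passing a value to tpu-ipv4-cidr, it can help to check locally that it really is an aligned /20 range. The sketch below uses Python's standard ipaddress module. The function name check_tpu_cidr is hypothetical, and the RFC 1918 check is an assumption carried over from the cluster addressing requirement stated earlier in this guide, not a documented rule for this flag.

```python
import ipaddress

# Private IPv4 blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def check_tpu_cidr(cidr):
    """Validate a candidate CIDR value as an aligned, private IPv4 /20 network."""
    # strict=True rejects values with host bits set, such as 10.100.0.1/20.
    network = ipaddress.ip_network(cidr, strict=True)
    if network.version != 4:
        raise ValueError("expected an IPv4 range")
    if network.prefixlen != 20:
        raise ValueError("expected a /20 range, got /%d" % network.prefixlen)
    # Assumption: the range should fall inside an RFC 1918 block.
    if not any(network.subnet_of(block) for block in RFC1918_BLOCKS):
        raise ValueError("range is not inside an RFC 1918 block")
    return network


network = check_tpu_cidr("10.100.0.0/20")
print(network, "contains", network.num_addresses, "addresses")
```

A /20 range holds 4096 addresses, which is the size of the block that is allocated when the flag is omitted.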

When the cluster has been created, you should see a message similar to the following:

NAME             LOCATION       MASTER_VERSION    MASTER_IP     MACHINE_TYPE   NODE_VERSION      NUM_NODES  STATUS
cluster-name  us-central1-b  1.16.15-gke.4901  34.71.245.25  n1-standard-1  1.16.15-gke.4901  3          RUNNING

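Request a Cloud TPU in your Kubernetes Pod spec

In your Kubernetes Pod spec, do the following:

Build the model in your container using the same TensorFlow version that the tf-version.cloud-tpus.google.com annotation specifies. See the supported versions.

Specify the Cloud TPU resource in the limits section under the resources field of the container spec.

Note that the unit of the Cloud TPU resource is the number of Cloud TPU cores. The following table lists examples of valid resource requests. See TPU types and zones for a complete list of valid TPU resources.

If the resource you intend to use is a Cloud TPU Pod, request quota, because the default quota for Cloud TPU Pod is zero.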

Resource request	Cloud TPU type
cloud-tpus.google.com/v2: 8	A Cloud TPU v2 device (8 cores)
cloud-tpus.google.com/preemptible-v2: 8	A Preemptible Cloud TPU v2 device (8 cores)
cloud-tpus.google.com/v3: 8	A Cloud TPU v3 device (8 cores)
cloud-tpus.google.com/preemptible-v3: 8	A Preemptible Cloud TPU v3 device (8 cores)
cloud-tpus.google.com/v2: 32	A v2-32 Cloud TPU Pod (32 cores)
cloud-tpus.google.com/v3: 32	A v3-32 Cloud TPU Pod (32 cores) (beta)

For more information about specifying resources and limits in the Pod spec, see the Kubernetes documentation.
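
The resource names in the table follow a regular pattern: an optional preemptible- prefix, the TPU version, and the core count as the value. As an illustration only, the sketch below builds the limits entry from those three inputs. The helper name tpu_resource_limit is hypothetical, and it accepts only the v2 and v3 versions that appear in the table; it does not check whether a given core count is actually offered.

```python
import json


def tpu_resource_limit(version, cores, preemptible=False):
    """Build the `limits` mapping for a Cloud TPU request in a container spec."""
    if version not in ("v2", "v3"):
        raise ValueError("only the v2 and v3 types from the table are handled")
    if not isinstance(cores, int) or cores <= 0:
        raise ValueError("cores must be a positive integer")
    prefix = "preemptible-" if preemptible else ""
    return {"cloud-tpus.google.com/%s%s" % (prefix, version): cores}


# Example: the limits block for one Preemptible v2 device with 8 cores.
resources = {"limits": tpu_resource_limit("v2", 8, preemptible=True)}
print(json.dumps(resources, indent=2))
```

Generating the key this way keeps the preemptible and non-preemptible variants of a spec from drifting apart when the TPU type changes.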

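The following sample Pod spec requests one Preemptible Cloud TPU v2-8 TPU with TensorFlow 2.12.1.

The lifetime of Cloud TPU nodes is bound to the Kubernetes Pods that request them. The Cloud TPU is created on demand when the Kubernetes Pod is scheduled, and is recycled when the Kubernetes Pod is deleted.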

apiVersion: v1
kind: Pod
metadata:
  name: gke-tpu-pod
  annotations:
     # The Cloud TPUs that will be created for this Job will support
     # TensorFlow 2.12.1. This version MUST match the
     # TensorFlow version that your model is built on.
     tf-version.cloud-tpus.google.com: "2.12.1"
spec:
  restartPolicy: Never
  containers:
  - name: gke-tpu-container
    # The official TensorFlow 2.12.1 image.
    # https://hub.docker.com/r/tensorflow/tensorflow
    image: tensorflow/tensorflow:2.12.1
    command:
    - python
    - -c
    - |
      import tensorflow as tf
      print("Tensorflow version " + tf.__version__)

      tpu = tf.distribute.cluster_resolver.TPUClusterResolver('$(KUBE_GOOGLE_CLOUD_TPU_ENDPOINTS)')
      print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])

      tf.config.experimental_connect_to_cluster(tpu)
      tf.tpu.experimental.initialize_tpu_system(tpu)
      strategy = tf.distribute.TPUStrategy(tpu)

      @tf.function
      def add_fn(x,y):
          z = x + y
          return z

      x = tf.constant(1.)
      y = tf.constant(1.)
      z = strategy.run(add_fn, args=(x,y))
      print(z)
    resources:
      limits:
        # Request a single Preemptible v2-8 Cloud TPU device to train the model.
        cloud-tpus.google.com/preemptible-v2: 8

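Create the Job

Follow these steps to create the Job in the GKE cluster. These steps require kubectl to be installed.

Using a text editor, create a Pod spec file, example-job.yaml, and copy and paste in the Pod spec shown previously.
Run the Job: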
```
$ kubectl create -f example-job.yaml
```
```
pod "gke-tpu-pod" created
```
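This command creates the gke-tpu-pod Pod, which GKE then schedules automatically.

Verify that the GKE Pod has been scheduled and that the Cloud TPU nodes have been provisioned. A GKE Pod that requests Cloud TPU nodes can stay pending for up to 5 minutes before it runs. Until the GKE Pod is scheduled, you see output similar to the following.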

$ kubectl get pods -w

NAME          READY     STATUS    RESTARTS   AGE
gke-tpu-pod   0/1       Pending   0          1m

After approximately 5 minutes, you should see something like this:

NAME          READY     STATUS              RESTARTS   AGE
gke-tpu-pod   0/1       Pending             0          21s
gke-tpu-pod   0/1       Pending             0          2m18s
gke-tpu-pod   0/1       Pending             0          2m18s
gke-tpu-pod   0/1       ContainerCreating   0          2m18s
gke-tpu-pod   1/1       Running             0          2m48s
gke-tpu-pod   0/1       Completed           0          3m8s

Press Ctrl-C to exit the "kubectl get" command.
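
If you capture the watch output in a script instead of reading it by eye, the status transitions can be extracted from the STATUS column. The following sketch parses text in the format shown above. The function name parse_pod_statuses is hypothetical, and the parser assumes the default five-column layout (NAME, READY, STATUS, RESTARTS, AGE).

```python
SAMPLE_OUTPUT = """\
NAME          READY     STATUS              RESTARTS   AGE
gke-tpu-pod   0/1       Pending             0          21s
gke-tpu-pod   0/1       Pending             0          2m18s
gke-tpu-pod   0/1       Pending             0          2m18s
gke-tpu-pod   0/1       ContainerCreating   0          2m18s
gke-tpu-pod   1/1       Running             0          2m48s
gke-tpu-pod   0/1       Completed           0          3m8s
"""


def parse_pod_statuses(output, pod_name):
    """Return the sequence of distinct consecutive STATUS values for pod_name."""
    statuses = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and rows for other Pods.
        if len(fields) < 5 or fields[0] != pod_name:
            continue
        status = fields[2]
        # Record a status only when it differs from the previous one.
        if not statuses or statuses[-1] != status:
            statuses.append(status)
    return statuses


print(parse_pod_statuses(SAMPLE_OUTPUT, "gke-tpu-pod"))
```

For the sample above, the result is the four transitions Pending, ContainerCreating, Running, and Completed.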

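You can print log information and retrieve more detailed information about each GKE Pod using the following kubectl commands. For example, to see the log output for your GKE Pod, use: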

$ kubectl logs gke-tpu-pod

You should see output similar to the following:

2021-09-24 18:55:25.400699: I tensorflow/core/platform/cpu_feature_guard.cc:142]
This TensorFlow binary is optimized with oneAPI Deep Neural Network Library
(oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-24 18:55:25.405947: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272]
Initialize GrpcChannelCache for job worker -> {0 -> 10.0.16.2:8470}
2021-09-24 18:55:25.406058: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272]
Initialize GrpcChannelCache for job localhost -> {0 -> localhost:32769}
2021-09-24 18:55:28.091729: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272]
Initialize GrpcChannelCache for job worker -> {0 -> 10.0.16.2:8470}
2021-09-24 18:55:28.091896: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272]
Initialize GrpcChannelCache for job localhost -> {0 -> localhost:32769}
2021-09-24 18:55:28.092579: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:427]
Started server with target: grpc://localhost:32769
Tensorflow version 2.12.1
Running on TPU  ['10.0.16.2:8470']
PerReplica:{
  0: tf.Tensor(2.0, shape=(), dtype=float32),
  1: tf.Tensor(2.0, shape=(), dtype=float32),
  2: tf.Tensor(2.0, shape=(), dtype=float32),
  3: tf.Tensor(2.0, shape=(), dtype=float32),
  4: tf.Tensor(2.0, shape=(), dtype=float32),
  5: tf.Tensor(2.0, shape=(), dtype=float32),
  6: tf.Tensor(2.0, shape=(), dtype=float32),
  7: tf.Tensor(2.0, shape=(), dtype=float32)
}

To see a full description of the GKE Pod, use:

$ kubectl describe pod gke-tpu-pod

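For more details, see Application Introspection and Debugging.

Build and containerize your model in a Docker image

For details on this process, see build and containerize your own model.

Enable Cloud TPU support on an existing cluster

To enable Cloud TPU support on an existing GKE cluster, perform the following steps in the Google Cloud CLI:

Enable Cloud TPU support: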
```
gcloud beta container clusters update cluster-name --enable-tpu
```
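Replace cluster-name with the name of your cluster.

Important: After the cluster is updated, the original API server endpoint might be unavailable.

Update the kubeconfig entry: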

gcloud container clusters get-credentials cluster-name

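Set a custom CIDR range

By default, GKE allocates a CIDR block of size /20 for the TPUs provisioned by the cluster. You can specify a custom CIDR range for Cloud TPU by running the following command: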

gcloud beta container clusters update cluster-name \
  --enable-tpu \
  --tpu-ipv4-cidr 10.100.0.0/20

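Replace the following:

cluster-name: the name of your existing cluster.
10.100.0.0/20: your custom CIDR range.

Use an existing CIDR range with Shared VPC

Follow the guide on TPU in GKE clusters using a Shared VPC to verify that your Shared VPC is configured correctly.

GKE clusters with Cloud TPU using a Shared VPC

Disable Cloud TPU in a cluster

To disable Cloud TPU support on an existing GKE cluster, perform the following steps in the Google Cloud CLI:

Verify that none of your workloads are using Cloud TPU: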
```
$ kubectl get tpu
```
Disable Cloud TPU support in your cluster:
```
$ gcloud beta container clusters update cluster-name --no-enable-tpu
```
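Replace cluster-name with the name of your cluster.

This operation takes around 5 minutes for zonal clusters and roughly 15 minutes for regional clusters, depending on the cluster's region.
Once the operation completes with no errors, you can verify that the TPUs provisioned by the cluster have been removed: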
```
$ gcloud compute tpus list
```
The names of the TPUs created by Cloud TPU have the following format:
```
gke-cluster-name-cluster-id-tpu-tpu-id
```
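Replace the following:
- cluster-name: the name of your existing cluster.
- cluster-id: the ID of your existing cluster.
- tpu-id: the ID of the Cloud TPU.
If any TPUs appear, you can delete them manually by running the following command: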
```
$ gcloud compute tpus delete gke-cluster-name-cluster-id-tpu-tpu-id
```
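
The name format above is plain string composition, so the name to pass to the delete command can be assembled from its three parts. The sketch below is an illustration only: the helper name gke_tpu_name and the example values are hypothetical, and the real cluster ID and TPU ID must be taken from the output of the list command.

```python
def gke_tpu_name(cluster_name, cluster_id, tpu_id):
    """Compose a TPU name in the gke-<cluster-name>-<cluster-id>-tpu-<tpu-id> format."""
    for label, value in (("cluster_name", cluster_name),
                         ("cluster_id", cluster_id),
                         ("tpu_id", tpu_id)):
        if not value:
            raise ValueError(label + " must not be empty")
    return "gke-%s-%s-tpu-%s" % (cluster_name, cluster_id, tpu_id)


# Hypothetical example values, for illustration only.
name = gke_tpu_name("my-cluster", "abc123", "def456")
print("gcloud compute tpus delete " + name)
```

The function only composes a name. Parsing a name back into its parts is ambiguous when the cluster name itself contains hyphens, so this sketch does not attempt it.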

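Clean up

When you have finished with Cloud TPU on GKE, clean up the resources to avoid incurring extra charges to your Cloud Billing account.

Run the following command to delete your GKE cluster, replacing cluster-name with your cluster name and project-name with your Google Cloud project ID: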
```
$ gcloud container clusters delete cluster-name \
--project=project-name --zone=us-central1-b
```
When you have finished examining the data, use the gcloud CLI to delete the Cloud Storage bucket that you created. Replace bucket-name with the name of your Cloud Storage bucket:
```
$ gcloud storage rm gs://bucket-name --recursive
```