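Legacy AI Platform Training is deprecated and will no longer be available on Google Cloud after January 31, 2025. Migrate your resources to Vertex AI custom training to get new machine learning features that are unavailable in AI Platform.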

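Using TPUs to train your model

Tensor Processing Units (TPUs) are Google's custom-developed ASICs used to accelerate machine learning workloads. You can run your training jobs on AI Platform Training using Cloud TPU. AI Platform Training provides a job management interface, so you don't need to manage the TPU yourself. Instead, you can use the AI Platform Training jobs API in the same way as you use it for training on a CPU or a GPU.

High-level TensorFlow APIs help you get your models running on the Cloud TPU hardware.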

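Setting up your Google Cloud environment

Configure your Google Cloud environment by working through the setup section of the getting-started guide.

Authorizing your Cloud TPU to access your project

Follow these steps to authorize the Cloud TPU service account name associated with your Google Cloud project:

Get your Cloud TPU service account name by calling projects.getConfig. Example: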

PROJECT_ID=PROJECT_ID

curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
    https://ml.googleapis.com/v1/projects/$PROJECT_ID:getConfig

Save the values of the serviceAccountProject and tpuServiceAccount fields returned by the API.

Initialize the Cloud TPU service account:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)"  \
  -H "Content-Type: application/json" -d '{}'  \
  https://serviceusage.googleapis.com/v1beta1/projects/<serviceAccountProject>/services/tpu.googleapis.com:generateServiceIdentity
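
The two requests above can also be prepared from Python. The following is a minimal sketch, not an official client: the helper names `get_config_url`, `service_identity_url`, and `extract_tpu_fields` are hypothetical, and the fallback lookup inside a nested `config` object is an assumption about where `tpuServiceAccount` may appear in the response. It only builds the URLs and reads the two fields from an already-parsed JSON response; sending the requests with an access token is left out.

```python
from typing import Optional


def get_config_url(project_id: str) -> str:
    """Returns the projects.getConfig URL used in the first curl call."""
    return f"https://ml.googleapis.com/v1/projects/{project_id}:getConfig"


def service_identity_url(service_account_project: str) -> str:
    """Returns the generateServiceIdentity URL used in the second curl call."""
    return (
        "https://serviceusage.googleapis.com/v1beta1/projects/"
        f"{service_account_project}/services/tpu.googleapis.com"
        ":generateServiceIdentity"
    )


def extract_tpu_fields(response: dict) -> dict:
    """Picks the two values that the steps above ask you to save."""

    def find(key: str) -> Optional[str]:
        # Assumption: a field may sit at the top level or under "config".
        if key in response:
            return response[key]
        return response.get("config", {}).get(key)

    return {
        "serviceAccountProject": find("serviceAccountProject"),
        "tpuServiceAccount": find("tpuServiceAccount"),
    }
```

The value returned under `serviceAccountProject` is what replaces `<serviceAccountProject>` in the second URL.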

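Now add the Cloud TPU service account as a member of your project and grant it the Cloud ML Service Agent role. Complete the following steps in the Google Cloud console or using the gcloud command:

Console

Log in to the Google Cloud console and choose the project in which you're using the TPU.
Choose IAM & Admin > IAM.
Click the Add button to add a member to the project.
Enter the TPU service account in the Members text box.
Click the Roles dropdown list.
Enable the Cloud ML Service Agent role (Service Agent > Cloud ML Service Agent).

gcloud

Set environment variables containing your project ID and the Cloud TPU service account: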

PROJECT_ID=PROJECT_ID
SVC_ACCOUNT=your-tpu-sa-123@your-tpu-sa.google.com.iam.gserviceaccount.com

Grant the ml.serviceAgent role to the Cloud TPU service account:

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member serviceAccount:$SVC_ACCOUNT --role roles/ml.serviceAgent

For more details about granting roles to service accounts, see the IAM documentation.
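
To illustrate what granting the role means at the policy level, here is a small sketch that adds a member to a role in an IAM policy represented as a plain dictionary. It is an illustration only, not how gcloud is implemented: the function name `add_iam_binding` is hypothetical, and the sketch ignores policy conditions and etags.

```python
import copy


def add_iam_binding(policy: dict, role: str, member: str) -> dict:
    """Returns a copy of `policy` in which `member` holds `role`."""
    updated = copy.deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding.get("role") == role:
            members = binding.setdefault("members", [])
            if member not in members:
                members.append(member)
            return updated
    # No binding for this role yet, so create one.
    bindings.append({"role": role, "members": [member]})
    return updated


policy = {"bindings": [{"role": "roles/viewer", "members": ["user:a@example.com"]}]}
# The service account address below is a placeholder.
updated = add_iam_binding(
    policy, "roles/ml.serviceAgent", "serviceAccount:tpu-sa@example.com"
)
```

The function leaves the input policy untouched and is idempotent, so applying the same binding twice yields the same result.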

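Example: Training a sample MNIST model

This section shows you how to train a sample MNIST model using a TPU and runtime version 2.11. The example job uses the predefined BASIC_TPU scale tier for your machine configuration. Later sections of this guide show you how to set up a custom configuration.

This example assumes you are using a Bash shell with the gcloud CLI installed. Run the following commands to get the code and submit your training job to AI Platform Training:

Download the code for TensorFlow's reference models and navigate to the directory with the sample code: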

git clone https://github.com/tensorflow/models.git \
  --branch=v2.11.0 \
  --depth=1

cd models

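Create a setup.py file in the models directory. This ensures that the gcloud ai-platform jobs submit training command includes all the necessary subpackages within the models/official directory when it creates a tarball of your training code, and that AI Platform Training installs TensorFlow Datasets as a dependency when it runs the training job. This training code relies on TensorFlow Datasets to load the MNIST data.

To create the setup.py file, run the following command in your shell: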
```
cat << END > setup.py
from setuptools import find_packages
from setuptools import setup

setup(
    name='official',
    install_requires=[
        'tensorflow-datasets~=3.1',
        'tensorflow-model-optimization>=0.4.1',
    ],
    packages=find_packages()
)
END
```

Submit your training job using the gcloud ai-platform jobs submit training command:

gcloud ai-platform jobs submit training tpu_mnist_1 \
  --staging-bucket=gs://BUCKET_NAME \
  --package-path=official \
  --module-name=official.vision.image_classification.mnist_main \
  --runtime-version=2.11 \
  --python-version=3.7 \
  --scale-tier=BASIC_TPU \
  --region=us-central1 \
  -- \
  --distribution_strategy=tpu \
  --data_dir=gs://tfds-data/datasets \
  --model_dir=gs://BUCKET_NAME/tpu_mnist_1_output

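Replace BUCKET_NAME with the name of a Cloud Storage bucket in your Google Cloud project. The gcloud CLI uploads your packaged training code to this bucket, and AI Platform Training saves training output in the bucket.

Monitor your training job. When the job has completed, you can view its output in the gs://BUCKET_NAME/tpu_mnist_1_output directory.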
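
For readers who work with the jobs API directly, the flags in the submit command correspond roughly to fields of a job's `trainingInput`. The sketch below builds such a request body as a dictionary. The function name `build_mnist_job` and the package URI in the usage lines are hypothetical; with gcloud, the package is uploaded for you and its Cloud Storage path is filled in automatically.

```python
def build_mnist_job(job_id: str, bucket_name: str, package_uri: str) -> dict:
    """Builds a job body mirroring the gcloud command for the MNIST sample."""
    output_dir = f"gs://{bucket_name}/{job_id}_output"
    return {
        "jobId": job_id,
        "trainingInput": {
            "scaleTier": "BASIC_TPU",
            "region": "us-central1",
            "runtimeVersion": "2.11",
            "pythonVersion": "3.7",
            "pythonModule": "official.vision.image_classification.mnist_main",
            # Assumption: the packaged training code already sits at this URI.
            "packageUris": [package_uri],
            # Everything after the bare "--" in the gcloud command.
            "args": [
                "--distribution_strategy=tpu",
                "--data_dir=gs://tfds-data/datasets",
                f"--model_dir={output_dir}",
            ],
        },
    }


job = build_mnist_job(
    "tpu_mnist_1", "my-bucket", "gs://my-bucket/packages/official.tar.gz"
)
```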

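More about training a model on Cloud TPU

This section tells you more about configuring a job and training a model on AI Platform Training with Cloud TPU.

Specifying a region that offers TPUs

You need to run your job in a region where TPUs are available. The following regions currently provide access to TPUs:

us-central1
europe-west4

To fully understand the available regions for AI Platform Training services, including model training and online/batch prediction, read the guide to regions.

TensorFlow and AI Platform Training versioning

AI Platform Training runtime versions 1.15, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, and 2.11 are available for training your models on Cloud TPU. See more about AI Platform Training runtime versions and the corresponding TensorFlow versions.

The versioning policy is the same as for Cloud TPU. In your training job request, make sure to specify a runtime version that is available for TPUs and matches the TensorFlow version used in your training code.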
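
A quick local check can catch an unsupported region or runtime version before a job is submitted. The following sketch hard-codes the regions and runtime versions listed on this page; the names `TPU_REGIONS`, `TPU_RUNTIME_VERSIONS`, and `check_tpu_job_settings` are hypothetical, and the lists would need updating if the service's offering changes.

```python
# Values copied from the lists on this page.
TPU_REGIONS = frozenset({"us-central1", "europe-west4"})
TPU_RUNTIME_VERSIONS = frozenset(
    {"1.15", "2.1", "2.2", "2.3", "2.4", "2.5", "2.6", "2.7", "2.8", "2.9", "2.11"}
)


def check_tpu_job_settings(region: str, runtime_version: str) -> list:
    """Returns a list of problems; an empty list means both settings look fine."""
    problems = []
    if region not in TPU_REGIONS:
        problems.append(f"region {region!r} does not offer TPUs")
    if runtime_version not in TPU_RUNTIME_VERSIONS:
        problems.append(f"runtime version {runtime_version!r} is not available for TPUs")
    return problems
```

Runtime versions are kept as strings on purpose, so that a version such as 2.1 is never confused with a different version through numeric parsing.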

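Connecting to the TPU gRPC server

In your TensorFlow program, use TPUClusterResolver to connect to the TPU gRPC server running on the TPU VM.

The TensorFlow guide to using TPUs shows how to use TPUClusterResolver with the TPUStrategy distribution strategy.

However, you must make one important change when you use TPUClusterResolver for code that runs on AI Platform Training: do not provide any arguments when you construct the TPUClusterResolver instance. When the tpu, zone, and project keyword arguments are all set to their default value of None, AI Platform Training automatically provides the cluster resolver with the necessary connection details through environment variables.

The following TensorFlow 2 example shows how to initialize a cluster resolver and a distribution strategy for training on AI Platform Training: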

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

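Using TPUs in your TensorFlow code

To make use of the TPUs on a machine, use TensorFlow 2's TPUStrategy API. The TensorFlow guide to using TPUs shows how to do this.

To train with TPUs in TensorFlow 1, you can use the TPUEstimator API instead. The Cloud TPU guide to the TPUEstimator API shows how to do this.

The Cloud TPU documentation also provides a list of low-level TensorFlow operations available on Cloud TPU.

Using TPUs in your PyTorch code

To make use of a TPU when you use a pre-built PyTorch container, use the torch_xla package. Learn how to use torch_xla for TPUs in the PyTorch documentation. For more examples of using torch_xla, see the tutorials in the PyTorch XLA GitHub repository.

Note that when you train using a TPU on AI Platform Training, you are using a single XLA device, not multiple XLA devices.

See also the section later on this page about using a pre-built PyTorch container on a TPU worker, which covers configuring your training job for PyTorch and TPU.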

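Configuring a custom TPU machine

A TPU training job runs on a two-VM configuration. One VM (the master) runs your Python code. The master drives the TensorFlow server running on a TPU worker.

To use a TPU with AI Platform Training, configure your training job to access a TPU-enabled machine in one of three ways:

Use the BASIC_TPU scale tier. You can use this method to access TPU v2 accelerators.
Use a cloud_tpu worker and a legacy machine type for the master VM. You can use this method to access TPU v2 accelerators.
Use a cloud_tpu worker and a Compute Engine machine type for the master VM. You can use this method to access TPU v2 or TPU v3 accelerators. TPU v3 accelerators are available in beta.

Basic TPU-enabled machine

Set the scale tier to BASIC_TPU to get a master VM and a TPU VM including one TPU with eight TPU v2 cores, as you did when running the previous sample.

TPU worker in a legacy machine type configuration

Alternatively, you can set up a custom machine configuration if you need more computing resources on the master VM:

Set the scale tier to CUSTOM.
Configure the master VM to use a legacy machine type that suits your job requirements.
Set workerType to cloud_tpu to get a TPU VM including one Cloud TPU with eight TPU v2 cores.
Set workerCount to 1.
Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.

The following example shows a config.yaml file that uses this type of configuration: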

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: cloud_tpu
  workerCount: 1
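
The rules for this configuration can be checked locally before submitting a job. The sketch below applies them to a `trainingInput` loaded into a dictionary; the function name `validate_legacy_tpu_input` is hypothetical, and the checks cover only the rules stated in this section, not everything the service validates.

```python
def validate_legacy_tpu_input(training_input: dict) -> list:
    """Returns a list of rule violations for a legacy-machine TPU configuration."""
    errors = []
    if training_input.get("scaleTier") != "CUSTOM":
        errors.append("scaleTier must be CUSTOM")
    if not training_input.get("masterType"):
        errors.append("masterType must name a machine type for the master VM")
    if training_input.get("workerType") != "cloud_tpu":
        errors.append("workerType must be cloud_tpu")
    if training_input.get("workerCount") != 1:
        errors.append("workerCount must be 1")
    if training_input.get("parameterServerCount", 0) > 0:
        errors.append("parameterServerCount must not be greater than zero")
    return errors


# The config.yaml example above, expressed as a dictionary.
example = {
    "scaleTier": "CUSTOM",
    "masterType": "complex_model_m",
    "workerType": "cloud_tpu",
    "workerCount": 1,
}
```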

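TPU worker in a Compute Engine machine type configuration

You can also set up a custom machine configuration with a Compute Engine machine type for your master VM and an acceleratorConfig attached to your TPU VM.

You can use this type of configuration to set up a TPU worker with eight TPU v2 cores (similar to a configuration without an acceleratorConfig) or a TPU worker with eight TPU v3 cores (beta). Read more about the difference between TPU v2 and TPU v3 accelerators.

Using a Compute Engine machine type also provides more flexibility for configuring your master VM:

Set the scale tier to CUSTOM.
Configure the master VM to use a Compute Engine machine type that suits your job requirements.
Set workerType to cloud_tpu.
Add a workerConfig with an acceleratorConfig field. Inside that acceleratorConfig, set type to TPU_V2 or TPU_V3 and count to 8. You may not attach any other number of TPU cores.
Set workerCount to 1.
Do not specify a parameter server when using a Cloud TPU. The service rejects the job request if parameterServerCount is greater than zero.

The following examples show config.yaml files that use this type of configuration: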

TPU v2

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V2
      count: 8

TPU v3 (beta)

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V3
      count: 8
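
The two files above differ only in the accelerator type, so the same structure can be generated from a small helper. This is a sketch under the rules stated in this section; the function name `tpu_worker_training_input` is hypothetical.

```python
def tpu_worker_training_input(master_type: str, tpu_type: str) -> dict:
    """Builds the body of a config.yaml for a single TPU worker with 8 cores."""
    if tpu_type not in ("TPU_V2", "TPU_V3"):
        raise ValueError("type must be TPU_V2 or TPU_V3")
    return {
        "trainingInput": {
            "scaleTier": "CUSTOM",
            "masterType": master_type,
            "workerType": "cloud_tpu",
            "workerCount": 1,
            "workerConfig": {
                # The count is fixed: no other number of cores can be attached.
                "acceleratorConfig": {"type": tpu_type, "count": 8},
            },
        }
    }


v3_config = tpu_worker_training_input("n1-highcpu-16", "TPU_V3")
```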

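Using a TPU Pod

A TPU Pod is a collection of TPU devices connected by dedicated high-speed network interfaces. A TPU Pod can have up to 2,048 TPU cores, allowing you to distribute the processing load across multiple TPUs.

To use TPU Pods, you must first file a quota increase request.

The following example config.yaml files show how to use TPU Pods: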

TPU v2 Pod

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V2_POD
      count: 128

TPU v3 Pod

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    acceleratorConfig:
      type: TPU_V3_POD
      count: 32

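There are limitations on the number of Pod cores that can be used for each TPU type. Available configurations:

TPU Pod type	Number of Pod cores available for use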
`TPU_V2_POD`	`32, 128, 256, 512`
`TPU_V3_POD`	`32, 128, 256`

For more details on how to make full use of TPU Pod cores, see the Cloud TPU documentation about TPU Pods.
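
The table of available configurations translates directly into a lookup. The following sketch checks a Pod type and core count against the values listed on this page; the names `POD_CORE_COUNTS` and `is_valid_pod_config` are hypothetical.

```python
# Core counts copied from the table of available configurations.
POD_CORE_COUNTS = {
    "TPU_V2_POD": (32, 128, 256, 512),
    "TPU_V3_POD": (32, 128, 256),
}


def is_valid_pod_config(tpu_type: str, count: int) -> bool:
    """Returns True if `count` cores are listed for the given Pod type."""
    return count in POD_CORE_COUNTS.get(tpu_type, ())
```

Both example files above pass this check: TPU_V2_POD with 128 cores and TPU_V3_POD with 32 cores.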

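Using a pre-built PyTorch container on a TPU worker

If you want to perform PyTorch training with a TPU, you must specify the tpuTfVersion field in your training job's trainingInput. Set tpuTfVersion to match the version of the pre-built PyTorch container that you are using for training.

AI Platform Training supports training with TPUs for the following pre-built PyTorch containers:

Container image URI	`tpuTfVersion`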
`gcr.io/cloud-ml-public/training/pytorch-xla.1-11`	`pytorch-1.11`
`gcr.io/cloud-ml-public/training/pytorch-xla.1-10`	`pytorch-1.10`
`gcr.io/cloud-ml-public/training/pytorch-xla.1-9`	`pytorch-1.9`
`gcr.io/cloud-ml-public/training/pytorch-xla.1-7`	`pytorch-1.7`
`gcr.io/cloud-ml-public/training/pytorch-xla.1-6`	`pytorch-1.6`

For example, to train using the PyTorch 1.11 pre-built container, you might use the following config.yaml file to configure training:

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    imageUri: gcr.io/cloud-ml-public/training/pytorch-xla.1-11
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    imageUri: gcr.io/cloud-ml-public/training/pytorch-xla.1-11
    tpuTfVersion: pytorch-1.11
    acceleratorConfig:
      type: TPU_V2
      count: 8

See also the earlier section on this page about using TPUs in your PyTorch code.
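
Because the image URI and tpuTfVersion must agree, it can help to derive both from a single value. The sketch below uses the pairs from the table in this section; the names `PYTORCH_TPU_IMAGES` and `pytorch_tpu_training_input` are hypothetical, and the default machine and accelerator types simply repeat the example file.

```python
# tpuTfVersion -> container image URI, copied from the table in this section.
PYTORCH_TPU_IMAGES = {
    "pytorch-1.11": "gcr.io/cloud-ml-public/training/pytorch-xla.1-11",
    "pytorch-1.10": "gcr.io/cloud-ml-public/training/pytorch-xla.1-10",
    "pytorch-1.9": "gcr.io/cloud-ml-public/training/pytorch-xla.1-9",
    "pytorch-1.7": "gcr.io/cloud-ml-public/training/pytorch-xla.1-7",
    "pytorch-1.6": "gcr.io/cloud-ml-public/training/pytorch-xla.1-6",
}


def pytorch_tpu_training_input(
    tpu_tf_version: str,
    master_type: str = "n1-highcpu-16",
    tpu_type: str = "TPU_V2",
) -> dict:
    """Builds a config body in which image and tpuTfVersion always match."""
    if tpu_tf_version not in PYTORCH_TPU_IMAGES:
        raise ValueError(f"unsupported tpuTfVersion: {tpu_tf_version}")
    image_uri = PYTORCH_TPU_IMAGES[tpu_tf_version]
    return {
        "trainingInput": {
            "scaleTier": "CUSTOM",
            "masterType": master_type,
            "masterConfig": {"imageUri": image_uri},
            "workerType": "cloud_tpu",
            "workerCount": 1,
            "workerConfig": {
                "imageUri": image_uri,
                "tpuTfVersion": tpu_tf_version,
                "acceleratorConfig": {"type": tpu_type, "count": 8},
            },
        }
    }


config = pytorch_tpu_training_input("pytorch-1.11")
```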

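Using a custom container on a TPU worker

If you want to run a custom container on your TPU worker instead of using one of the AI Platform Training runtime versions that support TPUs, you must specify an additional configuration field when you submit your training job. Set tpuTfVersion to a runtime version that includes the version of TensorFlow that your container uses. You must specify a runtime version currently supported for training with TPUs.

Because you are configuring your job to use a custom container, AI Platform Training doesn't use this runtime version's environment when it runs your training job. However, AI Platform Training requires this field so it can properly prepare the TPU worker for the version of TensorFlow that your custom container uses.

The following examples show config.yaml files with a TPU configuration similar to the ones shown earlier on this page, except that in these cases the master VM and the TPU worker each run different custom containers: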

TPU v2

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    imageUri: gcr.io/YOUR_PROJECT_ID/your-master-image-name:your-master-tag-name
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    imageUri: gcr.io/YOUR_PROJECT_ID/your-worker-image-name:your-worker-tag-name
    tpuTfVersion: 2.11
    acceleratorConfig:
      type: TPU_V2
      count: 8

TPU v3 (beta)

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  masterConfig:
    imageUri: gcr.io/YOUR_PROJECT_ID/your-master-image-name:your-master-tag-name
  workerType: cloud_tpu
  workerCount: 1
  workerConfig:
    imageUri: gcr.io/YOUR_PROJECT_ID/your-worker-image-name:your-worker-tag-name
    tpuTfVersion: 2.11
    acceleratorConfig:
      type: TPU_V3
      count: 8

If you use the gcloud beta ai-platform jobs submit training command to submit your training job, you can specify the tpuTfVersion API field with the --tpu-tf-version flag instead of in a config.yaml file.
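
As an illustration of the flag-based approach, the sketch below assembles such a command as an argument list. It assumes that the rest of the machine configuration stays in a config file passed with `--config`; the function name `submit_command`, the job ID, and the file path are hypothetical.

```python
def submit_command(job_id: str, region: str, config_path: str, tpu_tf_version: str) -> list:
    """Builds the argument list for a custom-container TPU job submission."""
    return [
        "gcloud", "beta", "ai-platform", "jobs", "submit", "training", job_id,
        f"--region={region}",
        # Assumption: machine types and image URIs live in this file.
        f"--config={config_path}",
        f"--tpu-tf-version={tpu_tf_version}",
    ]


argv = submit_command("tpu_custom_1", "us-central1", "config.yaml", "2.11")
```

Passing the version as a string keeps it exactly as written. The list can be handed to `subprocess.run(argv, check=True)` on a machine where the gcloud CLI is installed and authenticated.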

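Using `TPUClusterResolver` after the TPU is provisioned

When you use a custom container, you must wait for the TPU to be provisioned before you can call TPUClusterResolver to use it. The following sample code shows how to handle the TPUClusterResolver logic: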

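import json
import os
import time

# Note: tf.logging and tf.contrib below are TensorFlow 1.x APIs.
import tensorflow as tf


def wait_for_tpu_cluster_resolver_ready():
  """Waits for `TPUClusterResolver` to be ready and returns it.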

  Returns:
    A TPUClusterResolver if there is TPU machine (in TPU_CONFIG). Otherwise,
    return None.
  Raises:
    RuntimeError: if failed to schedule TPU.
  """
  tpu_config_env = os.environ.get('TPU_CONFIG')
  if not tpu_config_env:
    tf.logging.info('Missing TPU_CONFIG, use CPU/GPU for training.')
    return None

  tpu_node = json.loads(tpu_config_env)
  tf.logging.info('Waiting for TPU to be ready: \n%s.', tpu_node)

  num_retries = 40
  for i in range(num_retries):
    try:
      tpu_cluster_resolver = (
          tf.contrib.cluster_resolver.TPUClusterResolver(
              tpu=[tpu_node['tpu_node_name']],
              zone=tpu_node['zone'],
              project=tpu_node['project'],
              job_name='worker'))
      tpu_cluster_resolver_dict = tpu_cluster_resolver.cluster_spec().as_dict()
      if 'worker' in tpu_cluster_resolver_dict:
        tf.logging.info('Found TPU worker: %s', tpu_cluster_resolver_dict)
        return tpu_cluster_resolver
    except Exception as e:
      if i < num_retries - 1:
        tf.logging.info('Still waiting for provisioning of TPU VM instance.')
      else:
        # Chain the original exception so that its traceback is preserved.
        raise RuntimeError('Failed to schedule TPU: {}'.format(e)) from e
    time.sleep(10)

  # Raise error when failed to get TPUClusterResolver after retry.
  raise RuntimeError('Failed to schedule TPU.')
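
The first step of the function above, reading the TPU_CONFIG environment variable, can be tried without TensorFlow or a TPU. The following sketch isolates that step. The function name `parse_tpu_config` is hypothetical, and the three keys are the ones the sample code reads; the actual variable may carry more fields.

```python
import json
from typing import Mapping, Optional

REQUIRED_KEYS = ("tpu_node_name", "zone", "project")


def parse_tpu_config(environ: Mapping[str, str]) -> Optional[dict]:
    """Returns the TPU connection details, or None when TPU_CONFIG is unset."""
    raw = environ.get("TPU_CONFIG")
    if not raw:
        return None  # No TPU attached; train on CPU/GPU instead.
    node = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in node]
    if missing:
        raise ValueError(f"TPU_CONFIG is missing keys: {missing}")
    return {key: node[key] for key in REQUIRED_KEYS}


# Placeholder values for illustration only.
fake_env = {
    "TPU_CONFIG": json.dumps(
        {"tpu_node_name": "my-tpu", "zone": "us-central1-b", "project": "my-project"}
    )
}
details = parse_tpu_config(fake_env)
```

In a real job you would pass `os.environ` instead of the fake mapping.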

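Learn more about distributed training with custom containers.

What's next

Learn more about training models on AI Platform Training.
Learn about hyperparameter tuning on AI Platform Training, with particular attention to the details for hyperparameter tuning with Cloud TPU.
Explore additional reference models for Cloud TPU.
Optimize your models for Cloud TPU by following Cloud TPU best practices.
See Cloud TPU troubleshooting and FAQ for help diagnosing and solving problems.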
