Pre-training FairSeq RoBERTa on Cloud TPU using PyTorch

This tutorial shows how to train FairSeq's RoBERTa on a Cloud TPU. Specifically, it follows the FairSeq tutorial and pre-trains the model on the public wikitext-103 dataset.

Objectives

Create and configure the PyTorch environment
Prepare the dataset
Run the training job
Verify that you can view the output results

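Costs

In this document, you use the following billable components of Google Cloud:

Compute Engine
Cloud TPU

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.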

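Before you begin

Before starting this tutorial, check that your Google Cloud project is correctly set up.

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.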

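This walkthrough uses billable components of Google Cloud. Check the Cloud TPU pricing page to estimate your costs. Be sure to clean up the resources you create when you've finished with them to avoid unnecessary charges.

Set up a Compute Engine instance

Open a Cloud Shell window.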

Create a variable for your project ID.
```
export PROJECT_ID=project-id
```
Configure the Google Cloud CLI to use the project where you want to create the Cloud TPU.
```
gcloud config set project ${PROJECT_ID}
```
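The first time you run this command in a new Cloud Shell VM, an Authorize Cloud Shell page is displayed. Click Authorize at the bottom of the page to allow gcloud to make Google Cloud API calls with your credentials.

From the Cloud Shell, launch the Compute Engine resource required for this tutorial.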

gcloud compute instances create roberta-tutorial \
--zone=us-central1-a \
--machine-type=n1-standard-16  \
--image-family=torch-xla \
--image-project=ml-images  \
--boot-disk-size=200GB \
--scopes=https://www.googleapis.com/auth/cloud-platform

Connect to the new Compute Engine instance.
```
gcloud compute ssh roberta-tutorial --zone=us-central1-a
```
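Key point: From this point on, the prefix (vm) $ means that you should run the command on the Compute Engine VM instance.

Launch a Cloud TPU resource

From the Compute Engine virtual machine, launch a Cloud TPU resource using the following command: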

(vm) $ gcloud compute tpus create roberta-tutorial \
--zone=us-central1-a \
--network=default \
--version=pytorch-2.0  \
--accelerator-type=v3-8

Identify the IP address of the Cloud TPU resource.
```
(vm) $ gcloud compute tpus describe --zone=us-central1-a roberta-tutorial
```
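Important: The IP address is listed under the NETWORK_ENDPOINTS column. You will need this IP address, without the port number, when you create and configure the PyTorch environment.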

Create and configure the PyTorch environment

Start a conda environment.
```
(vm) $ conda activate torch-xla-2.0
```
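Configure environment variables for the Cloud TPU resource.

Note: The TPU_IP_ADDRESS variable must be set to the IP address of the Cloud TPU that you identified when you launched the Cloud TPU resource. It is the internal IP address shown under Compute Engine > TPU in the Google Cloud console.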
```
(vm) $ export TPU_IP_ADDRESS=ip-address
```
```
(vm) $ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
```
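
As a hedged illustration of how the two variables relate, the sketch below starts from a hypothetical endpoint value of the form `ip:port`, strips the port, and builds the `XRT_TPU_CONFIG` string in the same format as the command above. The sample address `10.0.0.2:8470` and the `ENDPOINT` variable are invented for the example and are not taken from a real TPU.

```shell
# Hypothetical endpoint as it might appear under NETWORK_ENDPOINTS (made-up value).
ENDPOINT="10.0.0.2:8470"

# Drop everything from the first colon onward to keep only the IP address.
TPU_IP_ADDRESS="${ENDPOINT%%:*}"

# Build the config string in the format used in this tutorial.
XRT_TPU_CONFIG="tpu_worker;0;${TPU_IP_ADDRESS}:8470"
export TPU_IP_ADDRESS XRT_TPU_CONFIG

echo "TPU_IP_ADDRESS=${TPU_IP_ADDRESS}"
echo "XRT_TPU_CONFIG=${XRT_TPU_CONFIG}"
```

Quoting the value matters here because the semicolons in the config string would otherwise be read by the shell as command separators.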

Set up the data

Run the following command to install FairSeq:

(vm) $ pip install --editable /usr/share/torch-xla-2.0/tpu-examples/deps/fairseq

Create a directory, pytorch-tutorial-data, to store the model data.

(vm) $ mkdir $HOME/pytorch-tutorial-data
(vm) $ cd $HOME/pytorch-tutorial-data

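Follow the instructions in the "Preprocess the data" section of the FairSeq RoBERTa README. Preparing the dataset takes about 10 minutes.

Train the model

To train the model, first set up some environment variables: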

(vm) $ export TOTAL_UPDATES=125000    # Total number of training steps
(vm) $ export WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
(vm) $ export PEAK_LR=0.0005          # Peak learning rate, adjust as needed
(vm) $ export TOKENS_PER_SAMPLE=512   # Max sequence length
(vm) $ export UPDATE_FREQ=16          # Increase the batch size 16x
(vm) $ export DATA_DIR=${HOME}/pytorch-tutorial-data/data-bin/wikitext-103

Then, run the following script:

(vm) $ python3 \
      /usr/share/torch-xla-2.0/tpu-examples/deps/fairseq/train.py $DATA_DIR \
      --task=masked_lm --criterion=masked_lm \
      --arch=roberta_base --sample-break-mode=complete \
      --tokens-per-sample=512 \
      --optimizer=adam \
      --adam-betas='(0.9,0.98)' \
      --adam-eps=1e-6 \
      --clip-norm=0.0 \
      --lr-scheduler=polynomial_decay \
      --lr=0.0005 \
      --warmup-updates=10000 \
      --dropout=0.1 \
      --attention-dropout=0.1 \
      --weight-decay=0.01 \
      --update-freq=16 \
      --train-subset=train \
      --valid-subset=valid \
      --num_cores=8 \
      --metrics_debug \
      --save-dir=checkpoints \
      --log_steps=30 \
      --log-format=simple \
      --skip-invalid-size-inputs-valid-test \
      --suppress_loss_report \
      --input_shapes 16x512 18x480 21x384 \
      --max-epoch=1
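
The relationship between `--input_shapes`, `--num_cores`, and `--update-freq` can be made concrete with a little arithmetic. The sketch below assumes that the first shape, `16x512`, means 16 sequences of 512 tokens per core, and that gradients from all 8 cores are accumulated over 16 steps before each optimizer update. Both readings are assumptions about how the script interprets these flags, not something this tutorial states, and the variable names are chosen for illustration only.

```shell
# Values taken from the training command above.
BATCH_PER_CORE=16       # first entry of --input_shapes (16x512), assumed to be per core
TOKENS_PER_SAMPLE=512   # --tokens-per-sample
NUM_CORES=8             # --num_cores
UPDATE_FREQ=16          # --update-freq

# Upper bound on sequences and tokens contributing to one optimizer update.
SEQS_PER_UPDATE=$(( BATCH_PER_CORE * NUM_CORES * UPDATE_FREQ ))
TOKENS_PER_UPDATE=$(( SEQS_PER_UPDATE * TOKENS_PER_SAMPLE ))

echo "Sequences per update (at most): ${SEQS_PER_UPDATE}"
echo "Tokens per update (at most):    ${TOKENS_PER_UPDATE}"
```

Under these assumptions the figures are upper bounds, since padded positions and the smaller shapes `18x480` and `21x384` carry fewer real tokens per step.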

The training script runs for approximately 15 minutes, and when it finishes it generates a message similar to the following:

saved checkpoint /home/user/checkpoints/checkpoint1.pt
(epoch 1 @ 119 updates) (writing took 25.19265842437744 seconds)
| done training in 923.8 seconds
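
To relate the last log line to the "approximately 15 minutes" estimate, the elapsed time can be pulled out of the message and converted to minutes. This is a minimal sketch that works on the sample line above; the variable names are illustrative, and the pattern assumes the log line keeps the wording shown in this tutorial.

```shell
# Sample log line copied from the output above.
LOG_LINE="| done training in 923.8 seconds"

# Extract the number that precedes the word "seconds".
SECONDS_ELAPSED="$(printf '%s\n' "$LOG_LINE" \
  | sed -n 's/.*done training in \([0-9.]*\) seconds.*/\1/p')"

# Convert to minutes with one decimal place (awk does the floating-point math).
MINUTES_ELAPSED="$(awk -v s="$SECONDS_ELAPSED" 'BEGIN { printf "%.1f", s / 60 }')"

echo "Training took ${MINUTES_ELAPSED} minutes"
```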

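Verify output results

After the training job completes, you can find the model checkpoints in the following directory. Because --save-dir=checkpoints is a relative path, the checkpoints are written under the directory from which you launched the training command; the path below assumes that directory was your home directory: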

$HOME/checkpoints
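
One way to confirm that a checkpoint was written is to look for the most recently modified `checkpoint*.pt` file. The helper below is a hypothetical sketch, not part of FairSeq; the `latest_checkpoint` function and the `CKPT_DIR` variable are names introduced here for illustration, and the default path assumes the directory shown above.

```shell
# Hypothetical helper: print the most recently modified checkpoint*.pt in a directory.
latest_checkpoint() {
  newest=""
  for f in "$1"/checkpoint*.pt; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$f" ] || continue
    # Keep the file with the latest modification time seen so far.
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  printf '%s\n' "$newest"
}

# Assumed location; override CKPT_DIR if you launched training elsewhere.
CKPT_DIR="${CKPT_DIR:-$HOME/checkpoints}"
NEWEST="$(latest_checkpoint "$CKPT_DIR")"

if [ -n "$NEWEST" ]; then
  echo "Newest checkpoint: $NEWEST"
else
  echo "No checkpoint*.pt files found in $CKPT_DIR"
fi
```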

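Clean up

To avoid incurring unnecessary charges to your account, clean up the resources you created after you finish using them:

Disconnect from the Compute Engine instance, if you have not already done so: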
```
(vm) $ exit
```
Your prompt should now be user@projectname, showing that you are in the Cloud Shell.
In your Cloud Shell, use the Google Cloud CLI to delete the Compute Engine instance.
```
$ gcloud compute instances delete roberta-tutorial --zone=us-central1-a
```

Use the Google Cloud CLI to delete the Cloud TPU resource.

$ gcloud compute tpus delete roberta-tutorial --zone=us-central1-a

What's next

Try the PyTorch colabs: