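Serve Llama 3 open models using multi-host Cloud TPUs on Vertex AI with Saxml

Llama 3 is an open source large language model (LLM) from Meta. This guide shows how to serve a Llama 3 LLM using multi-host Tensor Processing Units (TPUs) on Vertex AI with Saxml.

In this guide, you download the Llama 3 70B model weights and tokenizer and deploy them on Vertex AI running Saxml on TPUs.

Before you begin

We recommend that you use an M2 memory-optimized VM to download the model and convert it to Saxml format. The conversion process needs a large amount of memory and might fail if you choose a machine type that doesn't have enough.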

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI and Artifact Registry APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
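Install Docker by following the Artifact Registry documentation.
Make sure that you have sufficient quota for 16 TPU v5e chips for Vertex AI.

This tutorial assumes that you use Cloud Shell to interact with Google Cloud. If you use a different shell instead of Cloud Shell, perform the following additional configuration: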

Install the Google Cloud CLI.
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
To initialize the gcloud CLI, run the following command:
```
gcloud init
```

If you use a different shell (not Cloud Shell) to deploy the model, make sure that your Google Cloud CLI version is later than 475.0.0. To update the Google Cloud CLI, run the gcloud components update command.
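
The version requirement can be checked from the shell before you deploy. The following is a minimal sketch, not an official check: version_gt is a hypothetical helper that compares two three-part dotted versions, and the commented usage assumes that the first line of gcloud version output has the form "Google Cloud SDK X.Y.Z".

```shell
# version_gt A B: succeeds (exit 0) when version A is strictly later than B.
# Hypothetical helper; handles versions with up to three numeric parts.
version_gt() {
  IFS=. read -r a1 a2 a3 <<EOF
$1
EOF
  IFS=. read -r b1 b2 b3 <<EOF
$2
EOF
  # Compare major, then minor, then patch; missing parts count as 0.
  [ "${a1:-0}" -gt "${b1:-0}" ] && return 0
  [ "${a1:-0}" -lt "${b1:-0}" ] && return 1
  [ "${a2:-0}" -gt "${b2:-0}" ] && return 0
  [ "${a2:-0}" -lt "${b2:-0}" ] && return 1
  [ "${a3:-0}" -gt "${b3:-0}" ]
}

# Example usage (assumption: the version is the fourth field of the
# "Google Cloud SDK X.Y.Z" line printed by `gcloud version`):
#   installed=$(gcloud version | awk '/^Google Cloud SDK/ {print $4}')
#   version_gt "$installed" 475.0.0 || gcloud components update
```

The comparison is strict because the requirement is a version later than 475.0.0, so 475.0.0 itself does not pass.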

If you deploy the model with the Vertex AI SDK, make sure that you have version 1.50.0 or later.

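Get access to the model and download the model weights

The following steps are for a Vertex AI Workbench instance with an M2 memory-optimized VM. For information about changing the machine type of a Vertex AI Workbench instance, see Change the machine type of a Vertex AI Workbench instance.

Go to the Llama model consent page.
Select Llama 3, fill out the consent form, and accept the terms and conditions.
Check your inbox for an email that contains a signed URL.

Download the download.sh script from GitHub by running the following commands: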

```
wget https://raw.githubusercontent.com/meta-llama/llama3/main/download.sh
chmod +x download.sh
```

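To download the model weights, run the download.sh script that you downloaded from GitHub.
When prompted, enter the signed URL from the email that you received in the previous section.
When prompted for the models to download, enter 70B.

Convert the model weights to Saxml format

Download Saxml by running the following command: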

```
git clone https://github.com/google/saxml.git
```

Configure a Python virtual environment by running the following commands:
```
python -m venv .
source bin/activate
```

Install the dependencies by running the following commands:

```
pip install --upgrade pip
pip install paxml
pip install praxis
pip install torch
```

To convert the model weights to Saxml format, run the following command:
```
python3 saxml/saxml/tools/convert_llama_ckpt.py \
    --base PATH_TO_META_LLAMA3 \
    --pax PATH_TO_PAX_LLAMA3 \
    --model-size llama3_70b
```
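Replace the following:
- PATH_TO_META_LLAMA3: the path to the directory that contains the downloaded model weights
- PATH_TO_PAX_LLAMA3: the path to the directory in which to store the converted model weights
Note: You can use this command with any Llama 2 or Llama 3 model.

The converted model is placed in the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder. The remaining commands in this section reference PATH_TO_META_LLAMA3 and PATH_TO_PAX_LLAMA3 as shell variables, so set variables with those names to the same paths before you run them.
Copy the tokenizer file from the original directory to a subfolder named vocabs, as follows: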
```
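# Create the vocabs subfolder first; cp fails if the target directory is missing.
mkdir -p $PATH_TO_PAX_LLAMA3/vocabs
cp $PATH_TO_META_LLAMA3/tokenizer.model $PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model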
```

Add an empty commit_success.txt file to the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder and to its metadata and state subfolders, as follows:

```
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
```

The $PATH_TO_PAX_LLAMA3 folder now contains the following folders and files:

```
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
$PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
```
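
Before you copy the weights to Cloud Storage, it can help to confirm that this layout is in place. The following is a minimal sketch: check_pax_layout is a hypothetical helper, not part of Saxml, and it checks only the four files listed above rather than the checkpoint contents.

```shell
# check_pax_layout DIR: succeeds only if DIR contains the marker files and
# tokenizer listed in this guide. Hypothetical helper for illustration.
check_pax_layout() {
  root=$1
  missing=0
  for path in \
    checkpoint_00000000/commit_success.txt \
    checkpoint_00000000/metadata/commit_success.txt \
    checkpoint_00000000/state/commit_success.txt \
    vocabs/tokenizer.model
  do
    if [ ! -f "$root/$path" ]; then
      # Report every missing file instead of stopping at the first one.
      echo "missing: $root/$path" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, running check_pax_layout "$PATH_TO_PAX_LLAMA3" prints any missing paths to standard error and returns a nonzero status if at least one file is absent.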

Create a Cloud Storage bucket

Create a Cloud Storage bucket to store the converted model weights.

In Cloud Shell, run the following commands, replacing PROJECT_ID with your project ID:
```
projectid=PROJECT_ID
gcloud config set project ${projectid}
```
To create the bucket, run the following command:
```
gcloud storage buckets create gs://WEIGHTS_BUCKET_NAME
```
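Replace WEIGHTS_BUCKET_NAME with the name that you want to use for the bucket.

Copy the model weights to the Cloud Storage bucket

To copy the model weights to your bucket, run the following command: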

```
gcloud storage cp PATH_TO_PAX_LLAMA3/* gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/ --recursive
```

Upload the model

A prebuilt Saxml container is available at us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest.

To upload a Model resource to Vertex AI using the prebuilt Saxml container, run the gcloud ai models upload command as follows:

```
gcloud ai models upload \
    --region=LOCATION \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest \
    --artifact-uri='gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b' \
    --container-args='--model_path=saxml.server.pax.lm.params.lm_cloud.LLaMA3_70BFP16x16' \
    --container-args='--platform_chip=tpuv5e' \
    --container-args='--platform_topology=4x4' \
    --container-args='--ckpt_path_suffix=checkpoint_00000000' \
    --container-deployment-timeout-seconds=2700 \
    --container-ports=8502 \
    --project=PROJECT_ID
```

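Make the following replacements:

- LOCATION: the region where you are using Vertex AI. Note that TPUs are only available in us-west1.
- MODEL_DISPLAY_NAME: the display name that you want for your model
- PROJECT_ID: the ID of your Google Cloud project

Create an online inference endpoint

To create the endpoint, run the following command: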

```
gcloud ai endpoints create \
    --region=LOCATION \
    --display-name=ENDPOINT_DISPLAY_NAME \
    --project=PROJECT_ID
```

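Replace ENDPOINT_DISPLAY_NAME with the display name that you want for your endpoint.

Deploy the model to the endpoint

After the endpoint is ready, you can deploy the model to it.

In this tutorial, you deploy a Llama 3 70B model that is sharded across 16 Cloud TPU v5e chips using a 4x4 topology. However, you can specify any of the following supported multi-host Cloud TPU topologies:

Machine type	Topology	Number of TPU chips	Number of hosts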
`ct5lp-hightpu-4t`	4x4	16	2
`ct5lp-hightpu-4t`	4x8	32	4
`ct5lp-hightpu-4t`	8x8	64	8
`ct5lp-hightpu-4t`	8x16	128	16
`ct5lp-hightpu-4t`	16x16	256	32
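
In every row of the table, the chip count is the product of the two topology dimensions, which is also the amount of TPU v5e quota that the deployment needs. The following is a minimal sketch of that arithmetic: topology_chips is a hypothetical helper and handles only two-dimensional topologies written as ROWSxCOLS.

```shell
# topology_chips ROWSxCOLS: print the number of TPU chips in a 2D topology.
# Hypothetical helper; for example, 4x4 has 4 * 4 = 16 chips.
topology_chips() {
  rows=${1%%x*}   # text before the "x"
  cols=${1#*x}    # text after the "x"
  echo $((rows * cols))
}
```

For the topology used in this tutorial, topology_chips 4x4 prints 16, which matches the quota requirement stated earlier in this guide.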

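If you deploy a different Llama model that is defined in the Saxml GitHub repository, make sure that it is partitioned to match the number of devices that you are targeting and that Cloud TPU has enough memory to load the model.

For information about deploying a model on single-host Cloud TPUs, see Deploy a model.

For the full list of supported Cloud TPU types and regions, see Vertex AI locations.

Get the endpoint ID for the online inference endpoint: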

```
ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_DISPLAY_NAME \
    --format="value(name)")
```

Get the model ID for your model:

```
MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=MODEL_DISPLAY_NAME \
    --format="value(name)")
```

Deploy the model to the endpoint:
```
gcloud ai endpoints deploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --model=$MODEL_ID \
    --display-name=DEPLOYED_MODEL_NAME \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=4x4 \
    --traffic-split=0=100
```
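Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).

The deployment operation might time out.

The deploy-model command returns an operation ID that you can use to check when the operation finishes. Poll the status of the operation until the response includes "done": true. To poll the status, use the following command: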
```
gcloud ai operations describe \
    --region=LOCATION \
    OPERATION_ID
```
Replace OPERATION_ID with the operation ID that was returned by the previous command.
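
The polling described above can be wrapped in a loop. The following is a minimal sketch under stated assumptions: wait_for_operation is a hypothetical helper, the 60-second interval and 60-attempt limit are arbitrary defaults, and the loop assumes that --format="value(done)" prints True (or true) once the operation has finished and prints nothing before then.

```shell
# wait_for_operation OPERATION_ID LOCATION [INTERVAL_SECONDS] [MAX_ATTEMPTS]
# Hypothetical helper: polls the operation until it reports done.
wait_for_operation() {
  op_id=$1
  region=$2
  interval=${3:-60}
  attempts=${4:-60}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Assumption: the "done" field renders as True/true when finished.
    done_flag=$(gcloud ai operations describe "$op_id" \
      --region="$region" --format="value(done)")
    case "$done_flag" in
      True|true)
        echo "Operation $op_id finished."
        return 0
        ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  echo "Gave up waiting for operation $op_id." >&2
  return 1
}
```

For example, wait_for_operation OPERATION_ID LOCATION checks once a minute for up to an hour. A finished operation can still have failed, so inspect the full gcloud ai operations describe output for an error before you send requests.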

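Get online inferences from the deployed model

To get online inferences from the Vertex AI endpoint, run the gcloud ai endpoints predict command.

Create a request.json file that contains a sample inference request by running the following command: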

```
cat << EOF > request.json
{"instances": [{"text_batch": "the distance between Earth and Moon is "}]}
EOF
```
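
To send a different prompt, the request body can be generated instead of typed by hand, which avoids JSON escaping mistakes when the prompt contains quotation marks. The following is a minimal sketch: make_request is a hypothetical helper, and it assumes that python3 is available on your PATH.

```shell
# make_request PROMPT: print a request body in the same shape as the
# sample request.json, with PROMPT safely JSON-escaped.
# Hypothetical helper; assumes python3 is installed.
make_request() {
  python3 -c 'import json, sys
print(json.dumps({"instances": [{"text_batch": sys.argv[1]}]}))' "$1"
}
```

For example, make_request "the distance between Earth and Moon is " > request.json writes a file equivalent to the one created above.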

To send the online inference request to the endpoint, run the following command:

```
gcloud ai endpoints predict $ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION \
    --json-request=request.json
```

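Clean up

To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial:

To undeploy the model from the endpoint and delete the endpoint, run the following commands: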

```
ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_DISPLAY_NAME \
    --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
    --region=LOCATION \
    --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
    --region=LOCATION \
    --quiet
```

To delete the model, run the following commands:

```
MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=MODEL_DISPLAY_NAME \
    --format="value(name)")

gcloud ai models delete $MODEL_ID \
    --region=LOCATION \
    --quiet
```