Train Llama2 with Megatron-LM on A3 Mega virtual machines

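Overview

In this quickstart, you learn how to run a container-based Megatron-LM PyTorch workload on A3 Mega. The code is available in the megatron-gke GitHub repository.

Before you begin

Take the following steps to enable the Google Kubernetes Engine (GKE) API: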

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Make sure that billing is enabled for your Google Cloud project.

Enable the GKE API.

Make sure that you have the following role or roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser
Check for the roles
1. In the Google Cloud console, go to the IAM page.
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
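2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.

Create an A3 Mega cluster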

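Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.

Set up your environment

Create environment variables for some common parameters: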
```
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION
export ZONE=ZONE
export PROJECT_ID=PROJECT_ID
```
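Replace the following:
- CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
- REGION: the region where you created your cluster.
- ZONE: the zone where you created your cluster.
- PROJECT_ID: your Google Cloud project ID.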
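
Before moving on, it can help to confirm that the placeholders were actually replaced. The following is a minimal sketch and not part of the original quickstart: the function name `check_env` is hypothetical, and the zone check assumes the usual Google Cloud naming in which a zone is its region plus a suffix (for example, `us-central1-a` in `us-central1`).

```shell
# Hypothetical helper: fail if a variable is empty, still holds its
# placeholder text, or if ZONE does not belong to REGION.
check_env() {
  for name in CLUSTER_NAME REGION ZONE PROJECT_ID; do
    eval "value=\${$name:-}"
    if [ -z "$value" ] || [ "$value" = "$name" ]; then
      echo "error: $name is not set to a real value" >&2
      return 1
    fi
  done
  # Assumption: zones are named REGION-<suffix>, e.g. us-central1-a.
  case "$ZONE" in
    "$REGION"-*) ;;
    *) echo "error: zone $ZONE is not in region $REGION" >&2; return 1 ;;
  esac
  echo "environment looks consistent"
}
```

Calling `check_env` after the `export` lines returns a non-zero status if any of the four variables still needs attention.
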
Configure the Google Cloud CLI to use your Google Cloud credentials for authentication:
```
gcloud auth login
```
For more information, see Authenticate for using the Google Cloud CLI.

Install kubectl and the GKE gcloud CLI plugin:
```
sudo apt-get install kubectl
sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
```

Fetch credentials for your GKE cluster:
```
gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --zone=${ZONE} \
  --project=${PROJECT_ID}
```
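
To confirm that kubectl now points at the intended cluster, one option is to compare the active context with the name you expect. This is only a sketch: it assumes that gcloud names the kubeconfig context for a zonal cluster `gke_PROJECT_ZONE_CLUSTER`, and the function names are hypothetical.

```shell
# Build the kubeconfig context name that gcloud is assumed to create
# for a zonal cluster: gke_<project>_<zone>_<cluster>.
expected_context() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}

# Compare the active kubectl context with the expected one.
verify_context() {
  want=$(expected_context "$PROJECT_ID" "$ZONE" "$CLUSTER_NAME")
  have=$(kubectl config current-context)
  if [ "$have" = "$want" ]; then
    echo "kubectl is pointing at $want"
  else
    echo "unexpected context: $have (wanted $want)" >&2
    return 1
  fi
}
```

Run `verify_context` after the `get-credentials` command; a non-zero status suggests that another cluster is still selected.
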

If Helm isn't already installed, install it:
```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh && rm get_helm.sh
sudo chmod +x /usr/local/bin/helm
```
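
Because the installation step applies only when Helm is missing, a small guard can decide whether to run it. The sketch below uses a hypothetical `needs_install` helper built on the POSIX `command -v` lookup; it is one possible check, not something the quickstart prescribes.

```shell
# Return success when the named command is NOT on PATH.
needs_install() {
  ! command -v "$1" >/dev/null 2>&1
}

# Example: only run the installer when helm is missing.
if needs_install helm; then
  echo "helm not found; run the installation commands above"
else
  helm version --short
fi
```
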

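Use the topology-aware scheduler to deploy Pods

You can use the topology-aware scheduler to deploy your GKE Pods to nodes that have a specified GPU topology.

The following kubectl commands use files directly from a repository. Alternatively, you can clone the repository locally and point the kubectl commands at the local files instead.

For more information, see Topology scheduler.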
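
If you take the local-clone route, each raw file URL used on this page corresponds to a path inside the clone. The sketch below assumes the files live in the GoogleCloudPlatform/container-engine-accelerators repository on the `master` branch, as the URLs suggest; the helper name `local_path` and the clone directory are hypothetical.

```shell
# Prefix shared by the raw file URLs used on this page.
RAW_PREFIX="https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/"

# Hypothetical helper: turn one of those URLs into the matching path
# inside a local clone located at directory $2.
local_path() {
  case "$1" in
    "$RAW_PREFIX"*) printf '%s/%s\n' "$2" "${1#"$RAW_PREFIX"}" ;;
    *) echo "error: not a URL from the expected repository" >&2; return 1 ;;
  esac
}

# Usage (not executed here):
#   git clone https://github.com/GoogleCloudPlatform/container-engine-accelerators.git
#   kubectl apply -f "$(local_path "$URL" container-engine-accelerators)"
```
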

Set up the service account:
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml
```

Install the topology scheduler scripts in a configmap:
```
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py

kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py=schedule-daemon.py \
    --from-file=label-nodes-daemon.py=label-nodes-daemon.py
```
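
`kubectl create configmap` reports an error if the configmap already exists, which can happen when you repeat this step. A common workaround, sketched here under the assumption that both scripts are in the current directory, is to render the configmap client-side and pipe it into `kubectl apply`. The wrapper name `apply_scheduler_scripts` is hypothetical.

```shell
# Hypothetical wrapper: create the configmap, or update it in place
# if it already exists, by rendering YAML client-side and applying it.
apply_scheduler_scripts() {
  kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py=schedule-daemon.py \
    --from-file=label-nodes-daemon.py=label-nodes-daemon.py \
    --dry-run=client -o yaml |
    kubectl -n kube-system apply -f -
}
```
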

Install the topology label daemonset and the topology scheduler Pod:
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml
```

Observe the actions of the topology scheduler:
```
kubectl -n kube-system logs topology-scheduler-pod
```
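
The logs command can fail if the Pod has not started yet. As a sketch, you could wait for the Pod to become Ready first by using `kubectl wait`. The helper name `scheduler_logs`, the 120-second timeout, and the default of 50 log lines are arbitrary choices made for illustration.

```shell
# Hypothetical helper: wait until the scheduler Pod is Ready, then
# print its most recent log lines. The 120s timeout is arbitrary.
scheduler_logs() {
  kubectl -n kube-system wait --for=condition=Ready \
    pod/topology-scheduler-pod --timeout=120s || return 1
  kubectl -n kube-system logs topology-scheduler-pod --tail="${1:-50}"
}
```
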

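Run the workload

Build the Docker image and push it to Google Cloud Artifact Registry

Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones that you created, and then run the script: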
```
bash scripts/setup-and-configure-resources.sh
```
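Build the pytorch-megatron:23.11-py3 image and push it to your repository. Make sure that the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name that you used in the scripts/setup-and-configure-resources.sh script. You can also edit the Docker image tag name before pushing.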
```
bash scripts/build-and-push-docker-image.sh
```
Note: This image is based on nvcr.io/nvidia/pytorch:23.11-py3, with minor changes.
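
When you check that the repository names match, it may help to see how an Artifact Registry image reference is put together. The sketch below follows the standard `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG` layout. The helper `image_ref` and the repository name `my-repo` are hypothetical, and the scripts in the repository may assemble the reference differently.

```shell
# Compose an Artifact Registry image reference of the form
# LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG.
image_ref() {
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example with placeholder values; "my-repo" is a made-up repository name.
image_ref us-central1 my-project my-repo pytorch-megatron 23.11-py3
```
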

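Launch the Megatron-LM Llama2 benchmark

Modify the helm/values.yaml file to specify the Cloud Storage bucket and the Docker image that you created in the previous sections. For some example configurations, see sample-configurations.
Optional: You can also edit the selected-configuration.sh file to specify any changes that you made to the default Helm configuration.
Then install the Helm chart to start the experiment: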
```
helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml
```
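Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.

Note: To run the Helm experiment more than once, either remove the previous release with the helm uninstall command or create a new experiment under a different name.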
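
For repeated runs under different names, one way to avoid name collisions is to append a timestamp to a fixed prefix. This is only an illustrative sketch; the helper `experiment_name` and the prefix `llama2` are hypothetical.

```shell
# Hypothetical helper: derive a fresh release name from a prefix and
# the current UTC time, e.g. llama2-20240101-120000.
experiment_name() {
  printf '%s-%s\n' "$1" "$(date -u +%Y%m%d-%H%M%S)"
}

# Usage (not executed here):
#   helm install "$(experiment_name llama2)" helm/ --values helm/values.yaml
# or reuse a name after removing the previous release:
#   helm uninstall HELM_EXPERIMENT_NAME
```
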

The experiment writes metrics from the Nsight Systems profiling tool to the specified Cloud Storage bucket, in the megatron-experiments directory.

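Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the GKE cluster:

Go to the Clusters page.

Select the checkbox for CLUSTER_NAME.
Click Delete.
To confirm the deletion, type CLUSTER_NAME and click Delete.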
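
If you prefer the command line to the console, the cluster can also be deleted with `gcloud container clusters delete`. The sketch below reuses the environment variables set earlier. The wrapper `delete_cluster` is hypothetical; it prints the command by default so that you can review it, and runs the command only when called with `run`.

```shell
# Print the gcloud command that deletes the cluster so it can be
# reviewed first; pass "run" as the first argument to execute it.
delete_cluster() {
  set -- "${1:-print}" gcloud container clusters delete "$CLUSTER_NAME" \
    --zone="$ZONE" --project="$PROJECT_ID"
  mode=$1; shift
  if [ "$mode" = "run" ]; then "$@"; else echo "$*"; fi
}
```
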

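Delete the Cloud Storage bucket

Go to the Buckets page.

Select the checkbox for the Cloud Storage bucket that you created for this quickstart.
Click Delete.
To confirm the deletion, type DELETE and click Delete.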
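
A command-line alternative is `gcloud storage rm --recursive`, which removes the bucket together with its contents. In this sketch, `BUCKET_NAME` is an assumed variable that this page does not define; set it to the bucket you created. The wrapper name `delete_bucket` is hypothetical.

```shell
# BUCKET_NAME is an assumed variable; this page does not define it.
delete_bucket() {
  if [ -z "${BUCKET_NAME:-}" ]; then
    echo "error: set BUCKET_NAME first" >&2
    return 1
  fi
  # Deletes every object in the bucket and then the bucket itself.
  gcloud storage rm --recursive "gs://${BUCKET_NAME}"
}
```
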

What's next

Learn more about using GPUs in GKE.
