Train Llama2 with Megatron-LM on A3 Mega virtual machines

Standard

Overview

This quickstart shows you how to run a container-based Megatron-LM PyTorch workload on A3 Mega. The code is available in the megatron-gke GitHub repository.

Before you begin

Take the following steps to enable the Google Kubernetes Engine (GKE) API:

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the GKE API.

Enable the API
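
The API can also be enabled from the command line instead of the console. The following is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The helper name enable_gke_api is hypothetical and not part of this quickstart; container.googleapis.com is the service name of the GKE API.

```shell
# Sketch: enable the GKE API for one project.
# enable_gke_api is a hypothetical helper name used only for illustration.
enable_gke_api() {
  local project_id="${1:-}"
  if [ -z "${project_id}" ]; then
    echo "usage: enable_gke_api PROJECT_ID" >&2
    return 1
  fi
  # container.googleapis.com is the GKE API service.
  gcloud services enable container.googleapis.com --project="${project_id}"
}
```

Call it as `enable_gke_api PROJECT_ID`, replacing PROJECT_ID with your Google Cloud project ID.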

Make sure that you have the following role or roles on the project: roles/container.admin, roles/compute.networkAdmin, roles/iam.serviceAccountUser
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
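  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.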
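
The three required roles can also be granted with the gcloud CLI rather than the console. This is a sketch only: the helper name grant_quickstart_roles is hypothetical, and it assumes that the principal is a user account identified by an email address and that you have permission to change the project's IAM policy.

```shell
# Sketch: grant the roles required by this quickstart to one user.
# grant_quickstart_roles is a hypothetical helper name.
grant_quickstart_roles() {
  local project_id="${1:-}" user_email="${2:-}" role
  if [ -z "${project_id}" ] || [ -z "${user_email}" ]; then
    echo "usage: grant_quickstart_roles PROJECT_ID USER_EMAIL" >&2
    return 1
  fi
  for role in roles/container.admin roles/compute.networkAdmin roles/iam.serviceAccountUser; do
    # Adds one role binding per iteration; stops at the first failure.
    gcloud projects add-iam-policy-binding "${project_id}" \
      --member="user:${user_email}" \
      --role="${role}" || return 1
  done
}
```

Call it as `grant_quickstart_roles PROJECT_ID USER_EMAIL`.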
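Create an A3 Mega cluster

Create an A3 Mega GKE cluster with GPUDirect-TCPXO and multi-networking. For more information, see Maximize GPU network bandwidth with GPUDirect and multi-networking.

Set up your environment

Create environment variables for some common parameters: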
```
export CLUSTER_NAME=CLUSTER_NAME
export REGION=REGION
export ZONE=ZONE
export PROJECT_ID=PROJECT_ID
```
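Replace the following:
- CLUSTER_NAME: the name of your A3 Mega GKE cluster that has GPUDirect-TCPXO and multi-networking enabled.
- REGION: the region where you created your cluster.
- ZONE: the zone where you created your cluster.
- PROJECT_ID: your Google Cloud project ID.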
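
Because later commands depend on these variables, it can help to confirm that each one holds a real value before continuing. The following is a minimal sketch; the helper name check_quickstart_env is hypothetical, and treating a value equal to the variable's own name as an unreplaced placeholder is an assumption of this example.

```shell
# Sketch: fail early if a variable is unset, empty, or still a placeholder.
# check_quickstart_env is a hypothetical helper name.
check_quickstart_env() {
  local name value missing=0
  for name in CLUSTER_NAME REGION ZONE PROJECT_ID; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ] || [ "${value}" = "${name}" ]; then
      echo "${name} is not set to a real value" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Running `check_quickstart_env` after the export commands returns a nonzero status and names each variable that still needs a value.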
Configure the Google Cloud CLI to use your Google Cloud credentials for authentication:
```
gcloud auth login
```
For more information, see Authenticate for using the Google Cloud CLI.

Install kubectl and the GKE gcloud CLI plugin:

```
sudo apt-get install kubectl
sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
```

Fetch credentials for your GKE cluster:

```
gcloud container clusters get-credentials ${CLUSTER_NAME} \
  --zone=${ZONE} \
  --project=${PROJECT_ID}
```
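
Once credentials are fetched, listing the GPU nodes is a quick way to confirm that kubectl can reach the cluster. This is a sketch under assumptions: the helper name list_a3_mega_nodes is hypothetical, and the default label value nvidia-h100-mega-80gb is assumed to be the accelerator type of your A3 Mega node pool. Pass a different value if your nodes are labeled otherwise.

```shell
# Sketch: list nodes that carry the GKE accelerator label.
# list_a3_mega_nodes is a hypothetical helper; the default accelerator
# type is an assumption about how A3 Mega nodes are labeled.
list_a3_mega_nodes() {
  kubectl get nodes \
    --selector="cloud.google.com/gke-accelerator=${1:-nvidia-h100-mega-80gb}" \
    --output=name
}
```

An empty result usually means that the label value does not match your node pool or that the current kubectl context points at a different cluster.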

Install Helm, if it isn't already installed:

```
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh && rm get_helm.sh
sudo chmod +x /usr/local/bin/helm
```

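Deploy Pods using the topology-aware scheduler

You can use the topology-aware scheduler to deploy your GKE Pods to nodes that have a specified GPU topology.

The following kubectl commands use files directly from a repository. Alternatively, you can clone the repository locally and have the kubectl commands reference the local files instead.

For more information, see Topology scheduler.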
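
The local-clone alternative can be sketched as a single helper that performs the same steps described in the rest of this section, but against files in a local checkout. The helper name apply_topology_scheduler_from_clone and the default directory name are hypothetical; the repository URL and file paths are taken from the raw URLs used by the commands in this section.

```shell
# Sketch: install the topology scheduler from a local clone.
# apply_topology_scheduler_from_clone is a hypothetical helper name.
apply_topology_scheduler_from_clone() {
  local repo_dir="${1:-container-engine-accelerators}"
  local dir="${repo_dir}/gpudirect-tcpxo/topology-scheduler"
  # Clone only if the checkout does not exist yet.
  if [ ! -d "${repo_dir}" ]; then
    git clone https://github.com/GoogleCloudPlatform/container-engine-accelerators.git "${repo_dir}" || return 1
  fi
  kubectl apply -f "${dir}/service-account.yaml" || return 1
  kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py="${dir}/schedule-daemon.py" \
    --from-file=label-nodes-daemon.py="${dir}/label-nodes-daemon.py" || return 1
  kubectl apply -f "${dir}/label-nodes-daemon.yaml" || return 1
  kubectl apply -f "${dir}/schedule-daemon.yaml"
}
```

Use either this helper or the URL-based commands, not both: creating the configmap a second time fails because it already exists.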

Set up the service account:

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/service-account.yaml
```

Install the topology scheduler scripts in a configmap:

```
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.py
curl -OL https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.py

kubectl -n kube-system create configmap topology-scheduler-scripts \
    --from-file=schedule-daemon.py=schedule-daemon.py \
    --from-file=label-nodes-daemon.py=label-nodes-daemon.py
```

Install the topology label DaemonSet and the topology scheduler Pod:

```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/label-nodes-daemon.yaml
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpxo/topology-scheduler/schedule-daemon.yaml
```

Observe the actions of the topology scheduler:
```
kubectl -n kube-system logs topology-scheduler-pod
```

Run the workload

Build the Dockerfile and push to the Google Cloud Artifact Registry

Create a Cloud Storage bucket and a Docker repository. In the scripts/setup-and-configure-resources.sh script, replace the bucket and repository names with the ones that you created, and then run the script:
```
bash scripts/setup-and-configure-resources.sh
```
Build the pytorch-megatron:23.11-py3 image and push it to your repository. Make sure that the Docker repository name in the scripts/build-and-push-docker-image.sh file matches the repository name that you used in the scripts/setup-and-configure-resources.sh script. You can also edit the Docker image tag name before pushing.
```
bash scripts/build-and-push-docker-image.sh
```
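Note: This image is based on nvcr.io/nvidia/pytorch:23.11-py3 with minimal changes.

Launch the Megatron-LM Llama2 benchmark

Edit the helm/values.yaml file to specify your Cloud Storage bucket and the Docker image that you created in the previous section. For example configurations, see the sample configurations.
Optional: You can also edit the selected-configuration.sh file to specify any changes that you made to the default Helm configuration.
Launch the benchmark by installing the Helm chart: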
```
helm install HELM_EXPERIMENT_NAME helm/ --values helm/values.yaml
```
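Replace HELM_EXPERIMENT_NAME with an arbitrary name for your experiment.

Note: If you run a Helm experiment more than once, either use the helm uninstall command to clear the existing experiment, or create a new experiment with a different name.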
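
Re-running an experiment under the same name can be sketched as an uninstall followed by a fresh install. The helper name rerun_experiment is hypothetical. It assumes that you run it from the repository root, so that the helm/ chart directory and helm/values.yaml resolve, and that helm status returns a nonzero exit code when no release with that name exists.

```shell
# Sketch: clear a previous release with the same name, then install again.
# rerun_experiment is a hypothetical helper name.
rerun_experiment() {
  local name="${1:-}"
  if [ -z "${name}" ]; then
    echo "usage: rerun_experiment HELM_EXPERIMENT_NAME" >&2
    return 1
  fi
  # helm status succeeds only if a release with this name already exists.
  if helm status "${name}" >/dev/null 2>&1; then
    helm uninstall "${name}" || return 1
  fi
  helm install "${name}" helm/ --values helm/values.yaml
}
```

Using a new name for each run instead keeps earlier releases visible in `helm list`, which can be useful when comparing configurations.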

The experiment writes metrics from the Nsight Systems profiling tool to the megatron-experiments directory in the Cloud Storage bucket that you specified.
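
To check that an experiment has produced output, you can list that directory with the gcloud CLI. This is a minimal sketch: the helper name list_experiment_outputs is hypothetical, and BUCKET_NAME stands for the bucket you configured in helm/values.yaml. The layout of files below megatron-experiments is not described in this quickstart, so the sketch lists only the top level.

```shell
# Sketch: list the experiment output directory in the bucket.
# list_experiment_outputs is a hypothetical helper name.
list_experiment_outputs() {
  local bucket_name="${1:-}"
  if [ -z "${bucket_name}" ]; then
    echo "usage: list_experiment_outputs BUCKET_NAME" >&2
    return 1
  fi
  gcloud storage ls "gs://${bucket_name}/megatron-experiments/"
}
```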

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

Delete the GKE cluster

Go to the Clusters page:

Go to Clusters

Select the checkbox for CLUSTER_NAME.
Click Delete.
To confirm deletion, type CLUSTER_NAME and click Delete.

Delete the Cloud Storage bucket

Go to the Buckets page:

Go to Buckets

Select the checkbox for the Cloud Storage bucket that you created for this quickstart.
Click Delete.
To confirm deletion, type DELETE and click Delete.
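
The same cleanup can be done from the command line. The following sketch assumes that CLUSTER_NAME, ZONE, and PROJECT_ID are still exported from the environment setup; the helper name delete_quickstart_resources and the BUCKET_NAME argument are hypothetical. Both commands are destructive, so check the values before running them.

```shell
# Sketch: delete the cluster and the bucket created for this quickstart.
# delete_quickstart_resources is a hypothetical helper name.
delete_quickstart_resources() {
  local bucket_name="${1:-}"
  if [ -z "${bucket_name}" ] || [ -z "${CLUSTER_NAME:-}" ] || \
     [ -z "${ZONE:-}" ] || [ -z "${PROJECT_ID:-}" ]; then
    echo "usage: delete_quickstart_resources BUCKET_NAME" >&2
    echo "CLUSTER_NAME, ZONE, and PROJECT_ID must be set" >&2
    return 1
  fi
  # Deletes the cluster; gcloud prompts for confirmation by default.
  gcloud container clusters delete "${CLUSTER_NAME}" \
    --zone="${ZONE}" --project="${PROJECT_ID}" || return 1
  # Deletes the bucket together with every object stored in it.
  gcloud storage rm --recursive "gs://${bucket_name}"
}
```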

What's next

Learn more about using GPUs in GKE
