Serve Llama 3 open models using multi-host Cloud TPUs on Vertex AI with Saxml

Llama 3 is an open source large language model (LLM) from Meta. This guide shows you how to serve a Llama 3 LLM with Saxml on multi-host Tensor Processing Units (TPUs) in Vertex AI.

In this guide, you download the Llama 3 70B model weights and tokenizer, and then deploy them to Vertex AI, which runs Saxml on TPUs.

Before you begin

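We recommend that you use an M2 memory-optimized VM to download the model and convert it to Saxml format. The model conversion process requires a large amount of memory and can fail on a machine type that doesn't have enough.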
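Because the conversion step is memory-bound, it can be worth confirming how much RAM the VM has before you start the download. The following is a minimal sketch that assumes a Linux VM exposing /proc/meminfo. The helper names mem_total_gib and has_memory are hypothetical, and this guide doesn't state a minimum amount of memory, so the threshold is left to the caller.

```shell
# Print total system memory in whole GiB (MemTotal in /proc/meminfo is in kB).
mem_total_gib() {
  awk '/^MemTotal:/ {printf "%d\n", $2 / 1024 / 1024}' /proc/meminfo
}

# Succeed only when the machine has at least $1 GiB of memory.
# The threshold is supplied by the caller; it is not a documented requirement.
has_memory() {
  [ "$(mem_total_gib)" -ge "$1" ]
}

echo "Total memory: $(mem_total_gib) GiB"
```

For example, `has_memory N || echo "consider a larger machine type"` would flag a VM with less than N GiB, for whatever value of N you settle on.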

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role; you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI and Artifact Registry APIs.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Follow the Artifact Registry documentation to install Docker.
Make sure that you have sufficient Vertex AI quota for 16 TPU v5e chips.

This tutorial assumes that you are using Cloud Shell to interact with Google Cloud. If you want to use a different shell instead of Cloud Shell, perform the following additional configuration:

Install the Google Cloud CLI.
If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.
To initialize the gcloud CLI, run the following command:
```
gcloud init
```

If you're using a different shell instead of Cloud Shell to deploy the model, make sure that the Google Cloud CLI version is 475.0.0 or later. To update the Google Cloud CLI, run the gcloud components update command.

If you're deploying the model by using the Vertex AI SDK, make sure that you have version 1.50.0 or later.
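The version requirements above can be checked from the shell. This is a sketch rather than an official check: version_ge is a hypothetical helper, it relies on sort -V (available in GNU coreutils), and it assumes that the first line of gcloud --version has the form "Google Cloud SDK X.Y.Z".

```shell
# Return success when version $1 is greater than or equal to version $2.
# sort -V orders dotted versions numerically, so 1.9.0 sorts before 1.50.0.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only run the check when gcloud is installed.
if command -v gcloud >/dev/null 2>&1; then
  installed="$(gcloud --version 2>/dev/null | awk '/^Google Cloud SDK/ {print $NF}')"
  if version_ge "$installed" 475.0.0; then
    echo "gcloud $installed is recent enough"
  else
    echo "gcloud $installed is too old; run: gcloud components update" >&2
  fi
fi
```

The same helper could compare an installed Vertex AI SDK version against 1.50.0. A plain string comparison would get cases such as 1.9.0 versus 1.50.0 wrong, which is why the sketch uses a version-aware sort.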

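Get access to the model and download the model weights

The following steps are for a Vertex AI Workbench instance that has an M2 memory-optimized VM. For information about changing the machine type of a Vertex AI Workbench instance, see Change the machine type of a Vertex AI Workbench instance.

Go to the Llama model consent page.
Select Llama 3, fill out the consent form, and accept the terms and conditions.
Check your inbox for an email that contains a signed URL.
Download the download.sh script from GitHub by running the following commands: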
```
wget https://raw.githubusercontent.com/meta-llama/llama3/main/download.sh
chmod +x download.sh
```
To download the model weights, run the download.sh script that you downloaded from GitHub.
When prompted, enter the signed URL from the email that you received in the earlier step.
When prompted for the models to download, enter 70B.

Convert the model weights to Saxml format

Run the following command to download Saxml:
```
git clone https://github.com/google/saxml.git
```
Run the following commands to configure a Python virtual environment:
```
python -m venv .
source bin/activate
```

Run the following commands to install the dependencies:

pip install --upgrade pip

pip install paxml

pip install praxis

pip install torch

To convert the model weights to Saxml format, run the following command:
```
python3 saxml/saxml/tools/convert_llama_ckpt.py \
    --base PATH_TO_META_LLAMA3 \
    --pax PATH_TO_PAX_LLAMA3 \
    --model-size llama3_70b
```
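Replace the following:
- PATH_TO_META_LLAMA3: the path to the directory that contains the downloaded model weights
- PATH_TO_PAX_LLAMA3: the path to the directory in which to store the converted model weights
Note: This command can be used with Llama 2 or Llama 3 models.
The converted model is placed in the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder.
Copy the tokenizer file from the original directory into a subfolder named vocabs, as follows: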
```
mkdir -p $PATH_TO_PAX_LLAMA3/vocabs
cp $PATH_TO_META_LLAMA3/tokenizer.model $PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
```

Add an empty commit_success.txt file to the $PATH_TO_PAX_LLAMA3/checkpoint_00000000 folder and to its metadata and state subfolders, as follows:

touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
touch $PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt

The $PATH_TO_PAX_LLAMA3 folder now contains the following folders and files:

$PATH_TO_PAX_LLAMA3/checkpoint_00000000/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/metadata/commit_success.txt
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/
$PATH_TO_PAX_LLAMA3/checkpoint_00000000/state/commit_success.txt
$PATH_TO_PAX_LLAMA3/vocabs/tokenizer.model
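Before uploading anything, the layout listed above can be verified with a small script. This is a sketch only: check_pax_layout is a hypothetical helper, and it checks just the files named in the list above, not the contents of the checkpoint itself.

```shell
# Verify that the converted model directory ($1) contains the expected files.
# Prints each missing path to stderr and returns 1 if anything is missing.
check_pax_layout() {
  root="$1"
  missing=0
  for path in \
    checkpoint_00000000/commit_success.txt \
    checkpoint_00000000/metadata/commit_success.txt \
    checkpoint_00000000/state/commit_success.txt \
    vocabs/tokenizer.model
  do
    if [ ! -f "$root/$path" ]; then
      echo "missing: $root/$path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (not run here):
# check_pax_layout "$PATH_TO_PAX_LLAMA3" && echo "layout looks complete"
```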

Create a Cloud Storage bucket

Create a Cloud Storage bucket to store the converted model weights.

In Cloud Shell, run the following commands, replacing PROJECT_ID with your project ID:
```
projectid=PROJECT_ID
gcloud config set project ${projectid}
```
To create the bucket, run the following command:
```
gcloud storage buckets create gs://WEIGHTS_BUCKET_NAME
```
Replace WEIGHTS_BUCKET_NAME with the name that you want to use for the bucket.
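Bucket creation fails if the name breaks the Cloud Storage naming rules, so a quick local check can save a round trip. The following sketch covers only a simplified subset of those rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The helper name valid_bucket_name is hypothetical, and the full rules, such as reserved prefixes and longer dotted names, are not modeled here.

```shell
# Return success when $1 passes a simplified Cloud Storage bucket name check.
valid_bucket_name() {
  name="$1"
  len=${#name}
  # Length must be between 3 and 63 characters in this simplified check.
  if [ "$len" -lt 3 ] || [ "$len" -gt 63 ]; then
    return 1
  fi
  # Allowed characters, with a letter or digit at both ends.
  printf '%s' "$name" | grep -Eq '^[a-z0-9][a-z0-9._-]*[a-z0-9]$'
}

# Example (not run here):
# valid_bucket_name "WEIGHTS_BUCKET_NAME" || echo "choose a different bucket name"
```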

Copy the model weights to the Cloud Storage bucket

To copy the model weights to your bucket, run the following command:

gcloud storage cp PATH_TO_PAX_LLAMA3/* gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b/ --recursive

Upload the model

A prebuilt Saxml container is available at us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest.

To upload a Model resource to Vertex AI using the prebuilt Saxml container, run the gcloud ai models upload command as follows:

gcloud ai models upload \
    --region=LOCATION \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/sax-tpu:latest \
    --artifact-uri='gs://WEIGHTS_BUCKET_NAME/llama3/pax_70b' \
    --container-args='--model_path=saxml.server.pax.lm.params.lm_cloud.LLaMA3_70BFP16x16' \
    --container-args='--platform_chip=tpuv5e' \
    --container-args='--platform_topology=4x4' \
    --container-args='--ckpt_path_suffix=checkpoint_00000000' \
    --container-deployment-timeout-seconds=2700 \
    --container-ports=8502 \
    --project=PROJECT_ID

Replace the following:

LOCATION: the region where you are using Vertex AI. TPUs are available only in us-west1.
MODEL_DISPLAY_NAME: the display name that you want for your model
PROJECT_ID: the ID of your Google Cloud project

Create an online inference endpoint

To create the endpoint, run the following command:

gcloud ai endpoints create \
    --region=LOCATION \
    --display-name=ENDPOINT_DISPLAY_NAME \
    --project=PROJECT_ID

Replace ENDPOINT_DISPLAY_NAME with the display name that you want for your endpoint.

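Deploy the model to the endpoint

After the endpoint is ready, deploy the model to the endpoint.

In this tutorial, you deploy a Llama 3 70B model that is sharded across 16 Cloud TPU v5e chips using a 4x4 topology. However, you can specify any of the following supported multi-host Cloud TPU topologies: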

Machine type	Topology	Number of TPU chips	Number of hosts
`ct5lp-hightpu-4t`	4x4	16	2
`ct5lp-hightpu-4t`	4x8	32	4
`ct5lp-hightpu-4t`	8x8	64	8
`ct5lp-hightpu-4t`	8x16	128	16
`ct5lp-hightpu-4t`	16x16	256	32
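In every row of the table, the chip count is the product of the topology dimensions, which is also the number of chips you need quota for. The sketch below computes that product. topology_chips is a hypothetical helper, and it does not derive the host count, which is taken from the table as given.

```shell
# Print the number of TPU chips implied by a topology string such as 4x4.
# Returns 1 when the argument is not of the form <int>x<int>[x<int>...].
topology_chips() {
  printf '%s' "$1" | grep -Eq '^[0-9]+(x[0-9]+)+$' || return 1
  total=1
  old_ifs="$IFS"
  IFS=x
  # With IFS set to "x", the unquoted $1 splits into its dimensions.
  for dim in $1; do
    total=$((total * dim))
  done
  IFS="$old_ifs"
  echo "$total"
}

for topology in 4x4 4x8 8x8 8x16 16x16; do
  echo "$topology -> $(topology_chips "$topology") chips"
done
```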

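If you want to deploy a different Llama model that is defined in the Saxml GitHub repository, make sure that it is partitioned to match the number of devices that you are targeting and that the Cloud TPUs have enough memory to load the model.

For information about deploying a model on single-host Cloud TPUs, see Deploy a model.

For a full list of supported Cloud TPU types and regions, see Vertex AI locations.

Get the endpoint ID for the online inference endpoint: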

ENDPOINT_ID=$(gcloud ai endpoints list \
    --region=LOCATION \
    --filter=display_name=ENDPOINT_DISPLAY_NAME \
    --format="value(name)")

Get the model ID for your model:

MODEL_ID=$(gcloud ai models list \
    --region=LOCATION \
    --filter=display_name=MODEL_DISPLAY_NAME \
    --format="value(name)")
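A list command with a display-name filter prints nothing when no resource matches, and can print several IDs because display names are not unique. A guard like the following can catch both cases before the deploy step. It is a sketch, and require_single_id is a hypothetical helper name.

```shell
# Fail when a captured ID is empty or contains more than one line.
# $1: label used in messages, $2: value captured from a gcloud list command.
require_single_id() {
  if [ -z "$2" ]; then
    echo "No $1 matched; check the display name and region." >&2
    return 1
  fi
  case "$2" in
    *"
"*)
      echo "More than one $1 matched; narrow the filter." >&2
      return 1
      ;;
  esac
  return 0
}

# Example (not run here):
# require_single_id "endpoint" "$ENDPOINT_ID" && require_single_id "model" "$MODEL_ID"
```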

Deploy the model to the endpoint:
```
gcloud ai endpoints deploy-model $ENDPOINT_ID \
    --region=LOCATION \
    --model=$MODEL_ID \
    --display-name=DEPLOYED_MODEL_NAME \
    --machine-type=ct5lp-hightpu-4t \
    --tpu-topology=4x4 \
    --traffic-split=0=100
```
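Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).

The deploy operation might time out.

The deploy-model command returns an operation ID that you can use to check when the operation is finished. You can poll the status of the operation until the response includes "done": true. Use the following command to poll the status: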
```
gcloud ai operations describe \
    --region=LOCATION \
    OPERATION_ID
```
Replace OPERATION_ID with the operation ID that was returned by the previous command.
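Polling by hand gets tedious for a deployment that can take many minutes, so the check can be wrapped in a loop. The following is a sketch under stated assumptions: wait_until_done is a hypothetical helper, and the example assumes that adding --format="value(done)" to the describe command prints True once the operation has finished and prints nothing before that. The interval and attempt count in the example are illustrative; 60 seconds times 45 attempts matches the 2700-second container deployment timeout used when uploading the model.

```shell
# Poll a status command until it prints True, or give up after max_attempts.
# $1: seconds between polls, $2: maximum number of polls, rest: command to run.
wait_until_done() {
  interval="$1"
  max_attempts="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Treat a failed status command as "not done yet" and keep polling.
    state="$("$@" 2>/dev/null)" || state=""
    case "$state" in
      True|true) return 0 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Example (not run here): poll once a minute for up to 45 minutes.
# wait_until_done 60 45 gcloud ai operations describe OPERATION_ID \
#     --region=LOCATION --format="value(done)"
```

The function returns 0 as soon as the command reports completion and 1 if the attempts run out, so it can gate the inference request that follows.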

Get online inferences from the deployed model

To get online inferences from the Vertex AI endpoint, run the gcloud ai endpoints predict command.

Run the following command to create a request.json file that contains a sample inference request:
```
cat << EOF > request.json
{"instances": [{"text_batch": "the distance between Earth and Moon is "}]}
EOF
```
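To try other prompts, the request body can be generated from a shell argument instead of being edited by hand. This is a rough sketch: make_request is a hypothetical helper that escapes only backslashes and double quotes, so it suits single-line prompts without control characters. The instances and text_batch field names are taken from the sample request above.

```shell
# Print a JSON inference request for the prompt given in $1.
make_request() {
  # Escape backslashes first, then double quotes, so the prompt stays valid JSON.
  escaped="$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')"
  printf '{"instances": [{"text_batch": "%s"}]}\n' "$escaped"
}

make_request "the distance between Earth and Moon is "
```

Redirecting the output, for example `make_request "your prompt" > request.json`, produces a file in the same shape as the sample.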

To send the online inference request to the endpoint, run the following command:

gcloud ai endpoints predict $ENDPOINT_ID \
    --project=PROJECT_ID \
    --region=LOCATION \
    --json-request=request.json

Clean up

To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created in this tutorial.

To undeploy the model from the endpoint and delete the endpoint, run the following commands:

ENDPOINT_ID=$(gcloud ai endpoints list \
   --region=LOCATION \
   --filter=display_name=ENDPOINT_DISPLAY_NAME \
   --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
   --region=LOCATION \
   --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
   --region=LOCATION \
   --quiet

To delete the model, run the following commands:

MODEL_ID=$(gcloud ai models list \
   --region=LOCATION \
   --filter=display_name=MODEL_DISPLAY_NAME \
   --format="value(name)")

gcloud ai models delete $MODEL_ID \
   --region=LOCATION \
   --quiet