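Serve DeepSeek-V3 models using multi-host GPU deployment

Overview

Vertex AI supports multi-host GPU deployment for serving models that exceed the memory capacity of a single GPU node, such as DeepSeek-V3, DeepSeek-R1, and Meta Llama 3.1 405B (non-quantized version).

This guide describes how to serve a DeepSeek-V3 model on Vertex AI using vLLM and multi-host graphics processing units (GPUs). Setup for other models is similar. For more information, see vLLM serving for text and multimodal language models.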

Before you begin, make sure that you're familiar with the following:

To generate a cost estimate based on your projected usage, use the pricing calculator.

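Containers

To support multi-host deployments, this guide uses a prebuilt vLLM container image with Ray integration from Model Garden. Ray enables the distributed processing required to run models across multiple GPU nodes. This container also supports serving streaming requests using the Chat Completions API.

If needed, you can create your own vLLM multi-node image. The custom container image must be compatible with Vertex AI.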

Before you begin

Before you begin your model deployment, complete the prerequisites listed in this section.

Set up a Google Cloud project

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

Request GPU quota

To deploy DeepSeek-V3, you need two a3-highgpu-8g VMs with eight H100 GPUs each, for a total of 16 H100 GPUs. Because the default quota value is less than 16, you will likely need to request an H100 GPU quota increase.

To view your H100 GPU quota, go to the Quotas & System Limits page in the Google Cloud console.

If your H100 GPU quota is less than 16, request a quota adjustment.
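
The quota requirement is simple arithmetic, sketched below in shell. The variable names are illustrative and not part of any Google Cloud tooling. The node and GPU counts come from this section, and the parallelism sizes are the --tensor-parallel-size and --pipeline-parallel-size values used later in this guide. The check that their product equals the GPU count reflects how vLLM typically assigns one worker per GPU, which is an assumption here rather than something this guide states.

```shell
# Illustrative variable names; the values come from this guide.
NODE_COUNT=2              # a3-highgpu-8g VMs
GPUS_PER_NODE=8           # H100 GPUs per VM
TENSOR_PARALLEL_SIZE=8    # --tensor-parallel-size
PIPELINE_PARALLEL_SIZE=2  # --pipeline-parallel-size

TOTAL_GPUS=$((NODE_COUNT * GPUS_PER_NODE))
WORLD_SIZE=$((TENSOR_PARALLEL_SIZE * PIPELINE_PARALLEL_SIZE))

echo "H100 GPUs required: ${TOTAL_GPUS}"

# Assumption: one vLLM worker per GPU, so the two products should match.
if [ "$WORLD_SIZE" -ne "$TOTAL_GPUS" ]; then
  echo "Parallelism sizes do not match the GPU count." >&2
fi
```

If you adapt this guide to a different model or machine type, recomputing these two products first can help you request the right quota.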

Upload the model

To upload your model as a Model resource to Vertex AI, run the gcloud ai models upload command as follows:

gcloud ai models upload \
    --region=LOCATION \
    --project=PROJECT_ID \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01 \
    --container-args='^;^/vllm-workspace/ray_launcher.sh;python;-m;vllm.entrypoints.api_server;--host=0.0.0.0;--port=7080;--model=deepseek-ai/DeepSeek-V3;--tensor-parallel-size=8;--pipeline-parallel-size=2;--gpu-memory-utilization=0.82;--max-model-len=163840;--max-num-seqs=64;--enable-chunked-prefill;--kv-cache-dtype=auto;--trust-remote-code;--disable-log-requests' \
    --container-deployment-timeout-seconds=7200 \
    --container-ports=7080 \
    --container-env-vars=MODEL_ID=deepseek-ai/DeepSeek-V3

Replace the following:

LOCATION: the region where you are using Vertex AI
PROJECT_ID: the ID of your Google Cloud project
MODEL_DISPLAY_NAME: the display name for your model
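
The --container-args value begins with ^;^, which is the gcloud CLI's alternate-delimiter syntax for list flags: it tells gcloud to split the value on ";" instead of the default ",". The following sketch shows one way to assemble such a string from individual arguments. The join_args helper is hypothetical and not part of the gcloud CLI, and the argument list is a shortened subset of the one used above.

```shell
# join_args is a hypothetical helper: it joins its arguments with ";".
# The function body runs in a subshell so the IFS change does not leak out.
join_args() (
  IFS=';'
  printf '%s' "$*"
)

# A shortened subset of the container arguments used in this guide.
CONTAINER_ARGS="^;^$(join_args \
  /vllm-workspace/ray_launcher.sh \
  python -m vllm.entrypoints.api_server \
  --host=0.0.0.0 \
  --port=7080 \
  --model=deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size=8 \
  --pipeline-parallel-size=2)"

echo "$CONTAINER_ARGS"
```

You could then pass the result as --container-args="$CONTAINER_ARGS". Building the string this way keeps each vLLM flag on its own line, which can make long argument lists easier to review.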

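Create a dedicated online inference endpoint

To support chat completion requests, the Model Garden container requires a dedicated endpoint. Dedicated endpoints are in preview and don't support the Google Cloud CLI, so you need to use the REST API to create the endpoint.

To create the dedicated endpoint, run the following command: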

PROJECT_ID=PROJECT_ID
REGION=LOCATION
ENDPOINT="${REGION}-aiplatform.googleapis.com"

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints \
  -d '{
    "displayName": "ENDPOINT_DISPLAY_NAME",
    "dedicatedEndpointEnabled": true
    }'

Replace the following:

ENDPOINT_DISPLAY_NAME: the display name for your endpoint
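
Before sending the request, it can help to print the URL and body that the command above will use. The following sketch does only that and makes no network calls. The project ID, region, and display name are sample values chosen for illustration, and the sketch assumes the display name contains no characters that need JSON escaping.

```shell
# Sample values for illustration only; substitute your own.
PROJECT_ID="my-project"
REGION="us-central1"
ENDPOINT_DISPLAY_NAME="deepseek-v3-endpoint"

ENDPOINT="${REGION}-aiplatform.googleapis.com"
CREATE_URL="https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints"

# Assumption: the display name needs no JSON escaping.
REQUEST_BODY=$(printf '{"displayName": "%s", "dedicatedEndpointEnabled": true}' \
  "$ENDPOINT_DISPLAY_NAME")

echo "POST ${CREATE_URL}"
echo "$REQUEST_BODY"
```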

Deploy the model

Get the endpoint ID for the online inference endpoint by running the gcloud ai endpoints list command:

ENDPOINT_ID=$(gcloud ai endpoints list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'ENDPOINT_DISPLAY_NAME' \
 --format="value(name)")

Get the model ID for your model by running the gcloud ai models list command:

MODEL_ID=$(gcloud ai models list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'MODEL_DISPLAY_NAME' \
 --format="value(name)")
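
Depending on your gcloud version and output settings, the name field may be returned as a full resource name (for example, a path ending in /endpoints/NUMERIC_ID) rather than a bare numeric ID. That is an assumption worth checking with echo "$ENDPOINT_ID". If you do see a full path, the following sketch trims it to the final segment, which is the form the request URLs later in this guide expect. The short_id helper is hypothetical and the resource names are made-up samples.

```shell
# short_id is a hypothetical helper: it keeps only the text after the last "/".
# A value with no "/" is returned unchanged.
short_id() {
  printf '%s\n' "${1##*/}"
}

# Sample resource names for illustration only.
short_id "projects/123456/locations/us-central1/endpoints/7890"
short_id "7890"
```

For example, ENDPOINT_ID=$(short_id "$ENDPOINT_ID") would normalize the variable in place.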

Deploy the model to the endpoint by running the gcloud alpha ai endpoints deploy-model command:
```
gcloud alpha ai endpoints deploy-model $ENDPOINT_ID \
 --project=PROJECT_ID \
 --region=LOCATION \
 --model=$MODEL_ID \
 --display-name="DEPLOYED_MODEL_NAME" \
 --machine-type=a3-highgpu-8g \
 --traffic-split=0=100 \
 --accelerator=type=nvidia-h100-80gb,count=8 \
 --multihost-gpu-node-count=2
```
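Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).

Deploying large models like DeepSeek-V3 can take longer than the default deployment timeout. If the deploy-model command times out, the deployment process continues to run in the background.

The deploy-model command returns an operation ID that you can use to check when the operation is finished. You can poll the status of the operation until the response includes "done": true. Use the following command to poll the status: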
```
gcloud ai operations describe \
--region=LOCATION \
OPERATION_ID
```
Replace OPERATION_ID with the operation ID that was returned by the previous command.
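
Polling by hand can be tedious for a long deployment. The following is a minimal sketch of a polling loop, not an official tool. The describe_operation and wait_until_done functions are hypothetical names, and the sketch assumes that REGION and OPERATION_ID are set in your shell and that the command output contains either done: true or "done": true once the operation has finished.

```shell
# describe_operation is a hypothetical wrapper around the command above.
# Assumption: REGION and OPERATION_ID are already set in your shell.
describe_operation() {
  gcloud ai operations describe --region="${REGION}" "${OPERATION_ID}"
}

# wait_until_done MAX_ATTEMPTS INTERVAL_SECONDS
# Returns 0 once the output reports done: true, or 1 if attempts run out.
wait_until_done() {
  max_attempts=$1
  interval_seconds=$2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Matches both YAML (done: true) and JSON ("done": true) output.
    if describe_operation | grep -Eq '"?done"?: *true'; then
      echo "Operation finished after ${attempt} check(s)."
      return 0
    fi
    sleep "$interval_seconds"
    attempt=$((attempt + 1))
  done
  echo "Operation still running after ${max_attempts} checks." >&2
  return 1
}
```

For example, wait_until_done 120 60 would check once a minute for up to two hours. A finished operation can still report an error, so inspect the final output rather than relying on the done flag alone.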

Get online inferences from the deployed model

This section describes how to send an online inference request to the dedicated public endpoint where the DeepSeek-V3 model is deployed.

Get the project number by running the gcloud projects describe command:
```
PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
```

Send a raw predict request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict \
-d '{
   "prompt": "Write a short story about a robot.",
   "stream": false,
   "max_tokens": 50,
   "temperature": 0.7
   }'

Send a chat completion request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
-d '{"stream":false, "messages":[{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40,"temperature":0.4,"top_k":10,"top_p":0.95, "n":1}'

To enable streaming, change the value of "stream" from false to true.
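
Both requests above target the same dedicated endpoint host and differ only in the path suffix. The following sketch assembles the two URLs so that you can inspect them before calling curl. The dedicated_url helper is hypothetical and the IDs are made-up sample values. The URL pattern is copied from the commands above.

```shell
# dedicated_url is a hypothetical helper, not part of the gcloud CLI.
# Arguments: endpoint ID, region, project number, path suffix.
dedicated_url() {
  printf 'https://%s.%s-%s.prediction.vertexai.goog/v1/projects/%s/locations/%s/endpoints/%s%s\n' \
    "$1" "$2" "$3" "$3" "$2" "$1" "$4"
}

# Sample values for illustration only.
dedicated_url 1234567890 us-central1 987654321 ":rawPredict"
dedicated_url 1234567890 us-central1 987654321 "/chat/completions"
```

When you set "stream" to true, you may also want to pass -N (--no-buffer) to curl so that chunks are printed as they arrive.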

Clean up

To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created during this tutorial.

To undeploy the model from the endpoint and delete the endpoint, run the following commands:

ENDPOINT_ID=$(gcloud ai endpoints list \
   --region=LOCATION \
   --filter=display_name=ENDPOINT_DISPLAY_NAME \
   --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
   --region=LOCATION \
   --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
   --region=LOCATION \
   --quiet

To delete your model, run the following commands:

MODEL_ID=$(gcloud ai models list \
   --region=LOCATION \
   --filter=display_name=MODEL_DISPLAY_NAME \
   --format="value(name)")

gcloud ai models delete $MODEL_ID \
   --region=LOCATION \
   --quiet

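What's next

For comprehensive reference information about multi-host GPU deployment on Vertex AI with vLLM, see vLLM serving for text and multimodal language models.
Learn how to create your own vLLM multi-node image. Your custom container image must be compatible with Vertex AI.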