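Serve DeepSeek-V3 models using multihost GPU deployment

Overview

Vertex AI supports multihost GPU deployment for serving models that exceed the memory capacity of a single GPU node, such as DeepSeek-V3, DeepSeek-R1, and Meta Llama 3.1 405B (non-quantized version).

This guide describes how to serve a DeepSeek-V3 model using multihost graphics processing units (GPUs) on Vertex AI with vLLM. Setup for other models is similar. For more information, see "Serve text-only and multimodal language models with vLLM".

Before you begin, review the following:

Use the Pricing Calculator to generate a cost estimate based on your projected usage.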

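Containers

To support multihost deployments, this guide uses a prebuilt vLLM container image with Ray integration from Model Garden. Ray provides the distributed processing needed to run a model across multiple GPU nodes. This container also supports serving streaming requests using the Chat Completions API.

You can also create your own vLLM multi-node image. Note that the custom container image must be compatible with Vertex AI.

Before you begin

Before you deploy the model, complete the prerequisites listed in this section.

Set up a Google Cloud project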

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Enable the API

In the Google Cloud console, activate Cloud Shell.

Activate Cloud Shell

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

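Request GPU quota

To deploy DeepSeek-V3, you need two a3-highgpu-8g VMs with eight H100 GPUs each, for a total of 16 H100 GPUs. Because the default H100 GPU quota is less than 16, you will likely need to request a quota increase.

To view your H100 GPU quota, go to the Quotas & System Limits page in the Google Cloud console.

Go to Quotas & System Limits

Then request a quota adjustment.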
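
The quota arithmetic above can be sketched in shell. This is only an illustration: CURRENT_H100_QUOTA is a hypothetical placeholder for the value you read from the Quotas & System Limits page, and nothing in the CLI sets it for you.

```shell
# Node count and GPUs per node come from the deployment described above.
NODE_COUNT=2          # two a3-highgpu-8g VMs
GPUS_PER_NODE=8       # eight H100 GPUs per VM
CURRENT_H100_QUOTA=8  # hypothetical: replace with the quota shown in the console

REQUIRED_GPUS=$((NODE_COUNT * GPUS_PER_NODE))
SHORTFALL=$((REQUIRED_GPUS - CURRENT_H100_QUOTA))

if [ "$SHORTFALL" -gt 0 ]; then
  echo "Request at least $SHORTFALL more H100 GPUs (need $REQUIRED_GPUS, have $CURRENT_H100_QUOTA)."
else
  echo "Quota is sufficient for $REQUIRED_GPUS H100 GPUs."
fi
```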

Upload the model

To upload your model as a Model resource to Vertex AI, run the gcloud ai models upload command as follows:

gcloud ai models upload \
    --region=LOCATION \
    --project=PROJECT_ID \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250130_0916_RC01 \
    --container-args='^;^/vllm-workspace/ray_launcher.sh;python;-m;vllm.entrypoints.api_server;--host=0.0.0.0;--port=8080;--model=deepseek-ai/DeepSeek-V3;--tensor-parallel-size=16;--pipeline-parallel-size=1;--gpu-memory-utilization=0.9;--trust-remote-code;--max-model-len=32768' \
    --container-deployment-timeout-seconds=4500 \
    --container-ports=8080 \
    --container-env-vars=MODEL_ID=deepseek-ai/DeepSeek-V3

Make the following replacements:

LOCATION: the region where you are using Vertex AI
PROJECT_ID: the ID of your Google Cloud project
MODEL_DISPLAY_NAME: the display name you want for your model
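
The --container-args value in the command above starts with ^;^, which is gcloud's alternate-delimiter syntax (described in gcloud topic escaping): it makes ; the list separator instead of the default comma. The following sketch assembles the same value from individual arguments, which can be easier to review and edit. The join_args helper and the CONTAINER_ARGS variable are hypothetical names introduced here for illustration.

```shell
# Hypothetical helper: join all arguments with ';'.
# The function body runs in a subshell so the IFS change does not leak to the caller.
join_args() (
  IFS=';'
  printf '%s' "$*"
)

# Same arguments as in the upload command, one per line.
CONTAINER_ARGS="^;^$(join_args \
  /vllm-workspace/ray_launcher.sh \
  python -m vllm.entrypoints.api_server \
  --host=0.0.0.0 \
  --port=8080 \
  --model=deepseek-ai/DeepSeek-V3 \
  --tensor-parallel-size=16 \
  --pipeline-parallel-size=1 \
  --gpu-memory-utilization=0.9 \
  --trust-remote-code \
  --max-model-len=32768)"

echo "$CONTAINER_ARGS"
```

You could then pass the result as --container-args="$CONTAINER_ARGS" in place of the literal string.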

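Create a dedicated online inference endpoint

To support chat completion requests, the Model Garden container requires a dedicated endpoint. Dedicated endpoints are in Preview and are not supported by the Google Cloud CLI, so you must use the REST API to create the endpoint.

To create the dedicated endpoint, run the following command: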

PROJECT_ID=PROJECT_ID
REGION=LOCATION
ENDPOINT="${REGION}-aiplatform.googleapis.com"

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints \
  -d '{
    "displayName": "ENDPOINT_DISPLAY_NAME",
    "dedicatedEndpointEnabled": true
    }'

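Make the following replacements:

ENDPOINT_DISPLAY_NAME: the display name for your endpoint

Deploy the model

Get the endpoint ID for the online inference endpoint by running the gcloud ai endpoints list command: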

ENDPOINT_ID=$(gcloud ai endpoints list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'ENDPOINT_DISPLAY_NAME' \
 --format="value(name)")

Get the model ID for your model by running the gcloud ai models list command:

MODEL_ID=$(gcloud ai models list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'MODEL_DISPLAY_NAME' \
 --format="value(name)")
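
Depending on the gcloud version and output format, value(name) may yield either a bare numeric ID or a full resource name that ends in the ID. If you get the longer form and need only the ID, a small helper like the following can trim it. This is a sketch: short_id is a hypothetical helper name, and the sample values are made up.

```shell
# Hypothetical helper: keep only the text after the last '/'.
# A bare ID (no slash) passes through unchanged.
short_id() {
  printf '%s\n' "${1##*/}"
}

# Illustrative inputs only; these are not real resources.
short_id "projects/123456/locations/us-central1/endpoints/987654321"
short_id "987654321"
```

For example, ENDPOINT_ID=$(short_id "$ENDPOINT_ID") normalizes the variable set above.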

Deploy the model to the endpoint by running the gcloud alpha ai endpoints deploy-model command:
```
gcloud alpha ai endpoints deploy-model $ENDPOINT_ID \
 --project=PROJECT_ID \
 --region=LOCATION \
 --model=$MODEL_ID \
 --display-name="DEPLOYED_MODEL_NAME" \
 --machine-type=a3-highgpu-8g \
 --traffic-split=0=100 \
 --accelerator=type=nvidia-h100-80gb,count=8 \
 --multihost-gpu-node-count=2
```
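Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).

Deploying a large model such as DeepSeek-V3 can take longer than the default deployment timeout. If the deploy-model command times out, the deployment continues to run in the background.

The deploy-model command returns an operation ID that you can use to check when the operation has finished. Poll the operation status until the response includes "done": true. Use the following command to poll the status: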
```
gcloud ai operations describe \
--region=LOCATION \
OPERATION_ID
```
Replace OPERATION_ID with the operation ID returned by the previous command.
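
Rather than rerunning the describe command by hand, the polling can be wrapped in a loop. The following is a minimal sketch, not an official utility: wait_for_operation is a hypothetical helper that takes the describe command as its arguments, so it makes no assumptions about flags beyond the ones shown above. It uses plain shell variables, which are global.

```shell
# Hypothetical helper: poll until a describe command reports the operation as done.
# Usage: wait_for_operation INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
wait_for_operation() {
  interval="$1"
  max_attempts="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the supplied command; treat a failed call as "not done yet".
    output="$("$@" 2>/dev/null)" || output=""
    # Accept both the JSON ("done": true) and YAML (done: true) renderings.
    if printf '%s\n' "$output" | grep -Eq '"?done"?: *true'; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Example (not run here): check once a minute, for at most 90 attempts.
# wait_for_operation 60 90 gcloud ai operations describe OPERATION_ID --region=LOCATION
```

The function returns 0 as soon as the output reports completion and 1 if the attempts run out, so it can gate the inference requests that follow.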

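Get online inferences from the deployed model

This section describes how to send an online inference request to the dedicated public endpoint where the DeepSeek-V3 model is deployed.

Get the project number by running the gcloud projects describe command:

PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")

Send a raw predict request: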

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict \
-d '{
   "prompt": "Write a short story about a robot.",
   "stream": false,
   "max_tokens": 50,
   "temperature": 0.7
   }'

Send a chat completion request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
-d '{"stream":false, "messages":[{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40,"temperature":0.4,"top_k":10,"top_p":0.95, "n":1}'

To enable streaming, change the value of "stream" from false to true.
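
Both requests above share the same dedicated-endpoint host pattern and differ only in the path suffix. The sketch below builds the URL from its parts so the pattern is written once. The dedicated_url helper is a hypothetical name, and the IDs in the sample calls are made up for illustration.

```shell
# Hypothetical helper: build a dedicated-endpoint URL from its parts.
# Usage: dedicated_url ENDPOINT_ID REGION PROJECT_NUMBER SUFFIX
dedicated_url() {
  printf 'https://%s.%s-%s.prediction.vertexai.goog/v1/projects/%s/locations/%s/endpoints/%s%s\n' \
    "$1" "$2" "$3" "$3" "$2" "$1" "$4"
}

# Illustrative values only.
dedicated_url 1234 us-central1 5678 ":rawPredict"
dedicated_url 1234 us-central1 5678 "/chat/completions"

# Example streaming call (not run here); --no-buffer makes curl print chunks as they arrive.
# curl --no-buffer -X POST \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   "$(dedicated_url "$ENDPOINT_ID" "$REGION" "$PROJECT_NUMBER" "/chat/completions")" \
#   -d '{"stream": true, "messages": [{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40}'
```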

Clean up

To undeploy the model from the endpoint and delete the endpoint, run the following commands:

ENDPOINT_ID=$(gcloud ai endpoints list \
   --region=LOCATION \
   --filter=display_name=ENDPOINT_DISPLAY_NAME \
   --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
   --region=LOCATION \
   --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
   --region=LOCATION \
   --quiet
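
If more than one model is deployed to the endpoint, value(deployedModels.id) may return several IDs in a single string. The sketch below assumes the IDs are separated by ; (this is the default list delimiter for gcloud's value format as far as this guide's editor is aware; confirm with gcloud topic formats before relying on it). The split_ids helper is a hypothetical name and the IDs are made up.

```shell
# Hypothetical helper: print each ID from a ';'-separated list on its own line.
# Assumption: gcloud joins list values with ';' in value() output.
split_ids() {
  printf '%s\n' "$1" | tr ';' '\n'
}

# Illustrative IDs, not real deployed models.
for id in $(split_ids "111;222"); do
  echo "would undeploy: $id"
done
```

To act on each ID, replace the echo with the gcloud ai endpoints undeploy-model command shown above, passing --deployed-model-id="$id".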

To delete the model, run the following commands:

MODEL_ID=$(gcloud ai models list \
   --region=LOCATION \
   --filter=display_name=MODEL_DISPLAY_NAME \
   --format="value(name)")

gcloud ai models delete $MODEL_ID \
   --region=LOCATION \
   --quiet

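What's next

To learn more about multihost GPU deployment on Vertex AI with vLLM, see "Serve text-only and multimodal models with vLLM and GPUs".
Learn how to create your own vLLM multi-node image. Note that the custom container image must be compatible with Vertex AI.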