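Deploy the DeepSeek-V3 model with a multihost GPU deployment

Overview

Vertex AI supports multihost GPU deployment for serving models that exceed the memory capacity of a single GPU node, such as DeepSeek-V3, DeepSeek-R1, and Meta Llama 3.1 405B (non-quantized version).

This guide describes how to serve the DeepSeek-V3 model using multihost graphics processing units (GPUs) on Vertex AI with vLLM. Setup for other models is similar. For more information, see vLLM serving for text and multimodal language models.

Before you begin, make sure that you are familiar with the following:

Use the Pricing Calculator to generate a cost estimate based on your projected usage.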

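Containers

To support multihost deployments, this guide uses a prebuilt vLLM container image from Model Garden that has Ray integrated. Ray enables the distributed processing required to run a model across multiple GPU nodes. This container also supports serving streaming requests through the Chat Completions API.

If needed, you can create your own vLLM multi-node image. Note that this custom container image must be compatible with Vertex AI.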

Before you begin

Before you start deploying the model, complete the prerequisites listed in this section.

Set up a Google Cloud project

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Verify that billing is enabled for your Google Cloud project.

Enable the Vertex AI API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

In the Google Cloud console, activate Cloud Shell.

At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.

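Request GPU quota

To deploy DeepSeek-V3, you need two a3-highgpu-8g VMs with eight H100 GPUs each, for a total of 16 H100 GPUs. You will likely need to request an H100 GPU quota increase, because the default value is less than 16.

To view your H100 GPU quota, go to the Quotas & System Limits page in the Google Cloud console. If the quota is below 16, request a quota adjustment.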
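
The quota arithmetic above can be sketched in a few lines of shell. This is only an illustration: the variable names are hypothetical, and CURRENT_QUOTA is an assumed value that you would replace with the number shown on the quota page.

```shell
# Illustrative only: work out how many H100 GPUs to request.
NODE_COUNT=2        # two a3-highgpu-8g VMs, as stated in this guide
GPUS_PER_NODE=8     # eight H100 GPUs per VM
CURRENT_QUOTA=8     # assumption: replace with your actual H100 quota

REQUIRED_GPUS=$((NODE_COUNT * GPUS_PER_NODE))
QUOTA_SHORTFALL=$((REQUIRED_GPUS - CURRENT_QUOTA))

if [ "$QUOTA_SHORTFALL" -gt 0 ]; then
  echo "Need ${REQUIRED_GPUS} H100 GPUs; request ${QUOTA_SHORTFALL} more."
else
  echo "Quota of ${CURRENT_QUOTA} covers the ${REQUIRED_GPUS} GPUs required."
fi
```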

Upload the model

To upload your model as a Model resource to Vertex AI, run the gcloud ai models upload command as follows:

gcloud ai models upload \
    --region=LOCATION \
    --project=PROJECT_ID \
    --display-name=MODEL_DISPLAY_NAME \
    --container-image-uri=us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-vllm-serve:20250312_0916_RC01 \
    --container-args='^;^/vllm-workspace/ray_launcher.sh;python;-m;vllm.entrypoints.api_server;--host=0.0.0.0;--port=7080;--model=deepseek-ai/DeepSeek-V3;--tensor-parallel-size=8;--pipeline-parallel-size=2;--gpu-memory-utilization=0.82;--max-model-len=163840;--max-num-seqs=64;--enable-chunked-prefill;--kv-cache-dtype=auto;--trust-remote-code;--disable-log-requests' \
    --container-deployment-timeout-seconds=7200 \
    --container-ports=7080 \
    --container-env-vars=MODEL_ID=deepseek-ai/DeepSeek-V3

Make the following replacements:

LOCATION: the region where you are using Vertex AI
PROJECT_ID: the ID of your Google Cloud project
MODEL_DISPLAY_NAME: the display name that you want for your model
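
The --container-args value in the upload command starts with ^;^, which is the gcloud alternate-delimiter syntax: it tells gcloud to split the list on ; instead of the default comma. As a rough sketch, the value can be assembled from individual arguments as shown here. The helper name join_container_args is hypothetical, and only a subset of the flags from the command above is included to keep the example short.

```shell
# Hypothetical helper: join arguments with ';' and add the '^;^' prefix
# that makes gcloud split the list on ';' rather than ','.
join_container_args() {
  joined=""
  for arg in "$@"; do
    if [ -z "$joined" ]; then
      joined="$arg"
    else
      joined="${joined};${arg}"
    fi
  done
  printf '^;^%s\n' "$joined"
}

TENSOR_PARALLEL=8     # GPUs per node that share each layer
PIPELINE_PARALLEL=2   # pipeline stages, one per node in this guide

CONTAINER_ARGS=$(join_container_args \
  /vllm-workspace/ray_launcher.sh python -m vllm.entrypoints.api_server \
  --host=0.0.0.0 --port=7080 --model=deepseek-ai/DeepSeek-V3 \
  "--tensor-parallel-size=${TENSOR_PARALLEL}" \
  "--pipeline-parallel-size=${PIPELINE_PARALLEL}")

echo "$CONTAINER_ARGS"
# The product of the two parallel sizes is the number of GPUs vLLM uses.
echo "GPUs used by this layout: $((TENSOR_PARALLEL * PIPELINE_PARALLEL))"
```

With a tensor-parallel size of 8 and a pipeline-parallel size of 2, the layout uses 16 GPUs, which matches two nodes with eight GPUs each.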

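Create a dedicated online inference endpoint

To support chat completion requests, the Model Garden container requires a dedicated endpoint. Dedicated endpoints are currently in preview and are not supported by the Google Cloud CLI, so you need to use the REST API to create the endpoint.

To create the dedicated endpoint, run the following command: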

PROJECT_ID=PROJECT_ID
REGION=LOCATION
ENDPOINT="${REGION}-aiplatform.googleapis.com"

curl \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  https://${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/endpoints \
  -d '{
    "displayName": "ENDPOINT_DISPLAY_NAME",
    "dedicatedEndpointEnabled": true
    }'

Make the following replacements:

ENDPOINT_DISPLAY_NAME: the display name for your endpoint
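
As a minimal sketch, the URL and JSON body of the request above can be built by two small helpers, which makes it easier to inspect them before sending anything. The function names and the values my-project, us-central1, and deepseek-v3-endpoint are illustrative assumptions, not names from this guide.

```shell
# Hypothetical helpers; they only build strings and do not call any API.
endpoints_url() {
  # $1: project ID, $2: region
  printf 'https://%s-aiplatform.googleapis.com/v1/projects/%s/locations/%s/endpoints\n' \
    "$2" "$1" "$2"
}

endpoint_body() {
  # $1: display name (assumed to contain no double quotes or backslashes)
  printf '{"displayName": "%s", "dedicatedEndpointEnabled": true}\n' "$1"
}

CREATE_URL=$(endpoints_url my-project us-central1)
CREATE_BODY=$(endpoint_body deepseek-v3-endpoint)
echo "$CREATE_URL"
echo "$CREATE_BODY"
```

The two values can then be passed to curl in place of the inline URL and the -d argument.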

Deploy the model

Get the endpoint ID of the online inference endpoint by running the gcloud ai endpoints list command:

ENDPOINT_ID=$(gcloud ai endpoints list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'ENDPOINT_DISPLAY_NAME' \
 --format="value(name)")

Get the model ID of your model by running the gcloud ai models list command:

MODEL_ID=$(gcloud ai models list \
 --project=PROJECT_ID \
 --region=LOCATION \
 --filter=display_name~'MODEL_DISPLAY_NAME' \
 --format="value(name)")

Deploy the model to the endpoint by running the gcloud alpha ai endpoints deploy-model command:
```
gcloud alpha ai endpoints deploy-model $ENDPOINT_ID \
 --project=PROJECT_ID \
 --region=LOCATION \
 --model=$MODEL_ID \
 --display-name="DEPLOYED_MODEL_NAME" \
 --machine-type=a3-highgpu-8g \
 --traffic-split=0=100 \
 --accelerator=type=nvidia-h100-80gb,count=8 \
 --multihost-gpu-node-count=2
```
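Replace DEPLOYED_MODEL_NAME with a name for the deployed model. This can be the same as the model display name (MODEL_DISPLAY_NAME).

Deploying large models such as DeepSeek-V3 can take longer than the default deployment timeout. If the deploy-model command times out, the deployment process continues to run in the background.

The deploy-model command returns an operation ID that you can use to check when the operation has finished. Poll the status of the operation until the response includes "done": true. Use the following command to poll the status: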
```
gcloud ai operations describe \
--region=LOCATION \
OPERATION_ID
```
Replace OPERATION_ID with the operation ID that was returned by the previous command.
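
The polling described above can be automated with a small loop. The following is a sketch rather than an official tool: poll_until_done is a hypothetical helper that reruns whatever command you give it until that command prints true (in any letter case). It assumes that adding --format="value(done)" to the describe command prints the done field and prints nothing while the operation is still running.

```shell
# Hypothetical helper: poll_until_done MAX_ATTEMPTS INTERVAL_SECONDS CMD...
# Runs CMD repeatedly until its output is "true" (case-insensitive).
poll_until_done() {
  max_attempts=$1
  interval=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$("$@" | tr '[:upper:]' '[:lower:]')
    if [ "$status" = "true" ]; then
      echo "Operation finished after ${attempt} check(s)."
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "Gave up after ${max_attempts} checks." >&2
  return 1
}

# Intended use (not executed here): check once a minute for up to two hours.
# poll_until_done 120 60 gcloud ai operations describe OPERATION_ID \
#   --region=LOCATION --format="value(done)"
```

Passing the command as arguments keeps the loop independent of gcloud, so it can be tried out with a stub command first.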

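Get online inferences from the deployed model

This section describes how to send an online inference request to the dedicated public endpoint where the DeepSeek-V3 model is deployed.

Get the project number by running the gcloud projects describe command: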

PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")

Send a raw prediction request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}:rawPredict \
-d '{
   "prompt": "Write a short story about a robot.",
   "stream": false,
   "max_tokens": 50,
   "temperature": 0.7
   }'
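
The dedicated endpoint URL is built from the endpoint ID, the region, and the project number. As a hedged sketch, the helper below assembles it and first reduces the endpoint ID to its last path segment, in case the stored value is a full resource name of the form projects/.../endpoints/ID rather than a bare ID. The function names and the sample numbers are hypothetical.

```shell
# Hypothetical helpers; they only build strings and do not call any API.
endpoint_short_id() {
  # Keep only the text after the last '/', so a full resource name
  # and a bare numeric ID both yield the bare ID.
  printf '%s\n' "${1##*/}"
}

dedicated_endpoint_url() {
  # $1: endpoint ID or resource name, $2: region, $3: project number,
  # $4: suffix, either ":rawPredict" or "/chat/completions"
  id=$(endpoint_short_id "$1")
  printf 'https://%s.%s-%s.prediction.vertexai.goog/v1/projects/%s/locations/%s/endpoints/%s%s\n' \
    "$id" "$2" "$3" "$3" "$2" "$id" "$4"
}

# Sample values for illustration only.
RAW_PREDICT_URL=$(dedicated_endpoint_url \
  projects/123456789/locations/us-central1/endpoints/987654321 \
  us-central1 123456789 :rawPredict)
echo "$RAW_PREDICT_URL"
```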

Send a chat completion request:

curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${ENDPOINT_ID}.${REGION}-${PROJECT_NUMBER}.prediction.vertexai.goog/v1/projects/${PROJECT_NUMBER}/locations/${REGION}/endpoints/${ENDPOINT_ID}/chat/completions \
-d '{"stream":false, "messages":[{"role": "user", "content": "Summer travel plan to Paris"}], "max_tokens": 40,"temperature":0.4,"top_k":10,"top_p":0.95, "n":1}'

To enable streaming, change the value of "stream" from false to true.
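
One way to switch between the two modes without editing the JSON by hand is to generate the body from a flag. This is a minimal sketch: chat_body is a hypothetical helper, and it keeps only a few of the fields from the request above.

```shell
# Hypothetical helper: build a chat completion request body.
chat_body() {
  # $1: "true" or "false" for streaming
  # $2: user message (assumed to contain no double quotes or backslashes)
  printf '{"stream": %s, "messages": [{"role": "user", "content": "%s"}], "max_tokens": 40}\n' \
    "$1" "$2"
}

STREAMING_BODY=$(chat_body true "Summer travel plan to Paris")
echo "$STREAMING_BODY"
```

When sending a streaming request with curl, the --no-buffer (-N) option makes curl print each chunk as it arrives instead of buffering the output.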

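Clean up

To avoid incurring further Vertex AI charges, delete the Google Cloud resources that you created in this tutorial:

To undeploy the model from the endpoint and delete the endpoint, run the following commands: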

ENDPOINT_ID=$(gcloud ai endpoints list \
   --region=LOCATION \
   --filter=display_name=ENDPOINT_DISPLAY_NAME \
   --format="value(name)")

DEPLOYED_MODEL_ID=$(gcloud ai endpoints describe $ENDPOINT_ID \
   --region=LOCATION \
   --format="value(deployedModels.id)")

gcloud ai endpoints undeploy-model $ENDPOINT_ID \
  --region=LOCATION \
  --deployed-model-id=$DEPLOYED_MODEL_ID

gcloud ai endpoints delete $ENDPOINT_ID \
   --region=LOCATION \
   --quiet
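
The commands above assume that the endpoint has a single deployed model. If an endpoint has several, each one needs its own undeploy-model call before the endpoint can be deleted. The sketch below is a dry run that only prints the commands. The helper name is hypothetical, and it assumes that the list of deployed model IDs arrives separated by semicolons, commas, or spaces, which you should verify against the actual output of your gcloud version.

```shell
# Hypothetical dry-run helper: prints one undeploy command per deployed
# model ID and executes nothing.
print_undeploy_commands() {
  # $1: endpoint ID, $2: region, $3: ID list separated by ';', ',' or spaces
  for deployed_id in $(printf '%s' "$3" | tr ';,' '  '); do
    echo "gcloud ai endpoints undeploy-model $1 --region=$2 --deployed-model-id=$deployed_id"
  done
}

# Sample values for illustration only.
print_undeploy_commands 987654321 us-central1 "111;222"
```

Review the printed commands, then run them before deleting the endpoint.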

To delete your model, run the following commands:

MODEL_ID=$(gcloud ai models list \
   --region=LOCATION \
   --filter=display_name=MODEL_DISPLAY_NAME \
   --format="value(name)")

gcloud ai models delete $MODEL_ID \
   --region=LOCATION \
   --quiet

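What's next

To learn more about multihost GPU deployment on Vertex AI with vLLM, see vLLM serving for text and multimodal language models.
Learn how to create your own vLLM multi-node image. Note that your custom container image must be compatible with Vertex AI.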