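Deploy LLMs such as DeepSeek-R1 671B or Llama 3.1 405B on bare metal

Overview

This guide shows how to serve state-of-the-art large language models (LLMs), such as DeepSeek-R1 671B or Llama 3.1 405B, on Google Distributed Cloud (software only) on bare metal, using graphics processing units (GPUs) across multiple nodes.

This guide demonstrates how to use portable open-source technologies (Kubernetes, vLLM, and the LeaderWorkerSet (LWS) API) to deploy and serve AI/ML workloads on bare metal clusters. Google Distributed Cloud extends GKE to on-premises environments while keeping the benefits of GKE: granular control, scalability, resilience, portability, and cost-effectiveness.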

Background

This section describes the key technologies used in this guide, including the two LLMs used as examples: DeepSeek-R1 and Llama 3.1 405B.

DeepSeek-R1

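DeepSeek-R1 is a 671B-parameter large language model from DeepSeek, designed for logical inference, mathematical reasoning, and real-time problem solving across a range of text-based tasks. Google Distributed Cloud handles the computational demands of DeepSeek-R1 and supports its capabilities with scalable resources, distributed computing, and efficient networking.

To learn more, see the DeepSeek documentation.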

Llama 3.1 405B

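Llama 3.1 405B is a large language model from Meta, designed for a wide range of natural language processing tasks, including text generation, translation, and question answering. Google Distributed Cloud provides the robust infrastructure needed to support the distributed training and serving needs of models of this scale.

To learn more, see the Llama documentation.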

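Google Distributed Cloud managed Kubernetes service

Google Distributed Cloud offers a range of services, including Google Distributed Cloud (software only) for bare metal, which is well suited to deploying and managing AI/ML workloads in your own data center. Google Distributed Cloud is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. It provides the infrastructure that LLMs need to meet their computational demands, including scalable resources, distributed computing, and efficient networking.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about Google Distributed Cloud and how it helps you scale, automate, and manage Kubernetes, see the Google Distributed Cloud (software only) for bare metal overview.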

GPU

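Graphics processing units (GPUs) let you accelerate specific workloads, such as machine learning and data processing. Google Distributed Cloud provides nodes equipped with these GPUs, so you can configure your cluster for the best performance in machine learning and data processing tasks. Google Distributed Cloud provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.

To learn more, see Set up and use NVIDIA GPUs.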

LeaderWorkerSet (LWS)

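LeaderWorkerSet (LWS) is a Kubernetes deployment API that addresses common deployment patterns of AI/ML multi-node inference workloads. Multi-node serving uses multiple Pods, each potentially running on a different node, to handle the distributed inference workload. LWS treats multiple Pods as a group, which simplifies the management of distributed model serving.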

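vLLM and multi-host serving

When serving computationally intensive LLMs, we recommend using vLLM and running the workloads across multiple GPUs.

vLLM is a highly optimized open-source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:

Optimized transformer implementation with PagedAttention
Continuous batching to improve the overall serving throughput
Distributed serving on multiple GPUs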

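You can use multiple GPU nodes to serve a model, which is especially useful for computationally intensive LLMs that don't fit on a single GPU node. vLLM supports running workloads across GPUs with two strategies:

Tensor parallelism splits the matrix multiplications in the transformer layers across multiple GPUs. Because the GPUs must communicate with each other, this strategy requires a fast network, which makes it less suitable for running workloads across nodes.
Pipeline parallelism splits the model by layer, or vertically. This strategy doesn't require constant communication between GPUs, which makes it a better option when running models across nodes.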

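You can use both strategies in multi-node serving. For example, with two nodes that each have eight H100 GPUs, you can combine the two strategies:

Two-way pipeline parallelism to shard the model across the two nodes
Eight-way tensor parallelism to shard the model across the eight GPUs on each node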
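
The combined layout can be checked with simple arithmetic: the number of GPUs used by one model replica is the product of the two parallelism degrees. The Python sketch below also gives a rough per-GPU weight estimate. It assumes the weights are split evenly and ignores the KV cache, activations, and runtime overhead; the figure of 2 bytes per parameter is an assumption about 16-bit weights, not a value taken from this guide, and the function names are illustrative.

```python
def gpus_required(tensor_parallel: int, pipeline_parallel: int) -> int:
    """Total GPUs used by one model replica: one GPU per (TP rank, PP stage) pair."""
    return tensor_parallel * pipeline_parallel


def weight_gb_per_gpu(params_billion: float, bytes_per_param: float,
                      tensor_parallel: int, pipeline_parallel: int) -> float:
    """Rough weight footprint per GPU in GB, assuming an even split of the weights."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    total_gb = params_billion * bytes_per_param
    return total_gb / gpus_required(tensor_parallel, pipeline_parallel)


# Two nodes with eight GPUs each, as in the example above.
tp, pp = 8, 2
print(gpus_required(tp, pp))  # 16

# Assumed precision: 2 bytes per parameter (16-bit weights) for a 405B model.
print(round(weight_gb_per_gpu(405, 2, tp, pp), 1))  # 50.6
```

An estimate like this only indicates whether the weights could plausibly fit in GPU memory; the memory actually needed at serving time is higher.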

To learn more, see the vLLM documentation.

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token by using the following command:

kubectl create secret generic hf-secret \
    --kubeconfig KUBECONFIG \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

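Replace KUBECONFIG with the path of the kubeconfig file for the cluster where you intend to host the LLM. The command reads the token from the HF_TOKEN environment variable, so set that variable to your Hugging Face access token first.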
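
To see what the command above produces, the following Python sketch builds an equivalent Secret manifest. It relies on the fact that Kubernetes stores Secret `data` values base64-encoded. The token value `hf_example_token` and the helper name `build_hf_secret` are placeholders for illustration and are not part of the original guide.

```python
import base64
import json


def build_hf_secret(token: str, name: str = "hf-secret") -> dict:
    """Return a Secret manifest holding the token under the hf_api_token key."""
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {"hf_api_token": encoded},
    }


# Placeholder token for illustration only; never commit a real token.
manifest = build_hf_secret("hf_example_token")
print(json.dumps(manifest, indent=2))
```

The key name `hf_api_token` matters because the model server manifests later in this guide reference it through `secretKeyRef`.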

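Create your own vLLM multi-node image

To facilitate cross-node communication for vLLM, you can use Ray. The LeaderWorkerSet repository provides a Dockerfile that includes a bash script for configuring Ray with vLLM.

To create your own vLLM multi-node image, clone the LeaderWorkerSet repository, build a Docker image from the provided Dockerfile (which configures Ray for cross-node communication), and then push the image to Artifact Registry for deployment on Google Distributed Cloud.

Build the container

To build the container, follow these steps:

Clone the LeaderWorkerSet repository: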

git clone https://github.com/kubernetes-sigs/lws.git

Build the image:

cd lws/docs/examples/vllm/build/ && docker build -f Dockerfile.GPU . -t vllm-multihost

Push the image to Artifact Registry

To make sure that your Kubernetes deployment can access the image, store it in Artifact Registry in your Google Cloud project:

docker image tag vllm-multihost ${IMAGE_NAME}
docker push ${IMAGE_NAME}
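
The commands above assume that the IMAGE_NAME environment variable holds the full Artifact Registry path of the image. Artifact Registry Docker paths take the form LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG. The following Python sketch assembles such a path; the location, project, and repository values are hypothetical placeholders, and the helper name is illustrative.

```python
def artifact_registry_image(location: str, project_id: str, repository: str,
                            image: str, tag: str = "latest") -> str:
    """Assemble a Docker image path in Artifact Registry's naming format."""
    return f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


# All values except the image name are hypothetical placeholders.
image_name = artifact_registry_image(
    location="us-central1",
    project_id="my-project",
    repository="my-repo",
    image="vllm-multihost",
)
print(image_name)  # us-central1-docker.pkg.dev/my-project/my-repo/vllm-multihost:latest
```

The same path is the value to substitute for IMAGE_NAME in the model server manifests.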

Install LeaderWorkerSet

To install LWS, run the following command:

kubectl apply --server-side \
    --kubeconfig KUBECONFIG \
    -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml

Verify that the LeaderWorkerSet controller is running in the lws-system namespace by using the following command:

kubectl get pod -n lws-system --kubeconfig KUBECONFIG

The output is similar to the following:

NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-5c4ff67cbd-9jsfc   2/2     Running   0          6d23h

Deploy the vLLM model server

To deploy the vLLM model server, follow these steps:

Create and apply the manifest for the LLM that you want to deploy.

DeepSeek-R1

Create a YAML manifest, vllm-deepseek-r1-A3.yaml, for the vLLM model server:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --max-model-len 4096"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Apply the manifest by running the following command:

kubectl apply -f vllm-deepseek-r1-A3.yaml \
    --kubeconfig KUBECONFIG

Llama 3.1 405B

Create a YAML manifest, vllm-llama3-405b-A3.yaml, for the vLLM model server:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm   
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Apply the manifest by running the following command:

kubectl apply -f vllm-llama3-405b-A3.yaml \
    --kubeconfig KUBECONFIG

View the logs from the running model server by using the following command:

kubectl logs vllm-0 -c vllm-leader \
    --kubeconfig KUBECONFIG

The output should look similar to the following:

INFO 08-09 21:01:34 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /version, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

Serve the model

Set up port forwarding to the model by running the following command:

kubectl port-forward svc/vllm-leader 8080:8080 \
    --kubeconfig KUBECONFIG

Interact with the model using curl

To interact with the model using curl, follow these instructions:

DeepSeek-R1

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "prompt": "I have four boxes. I put the red box on the bottom and put the blue box on top. Then I put the yellow box on top the blue. Then I take the blue box out and put it on top. And finally I put the green box on the top. Give me the final order of the boxes from bottom to top. Show your reasoning but be brief",
    "max_tokens": 1024,
    "temperature": 0
}'

The output should look similar to the following:

{
  "id": "cmpl-f2222b5589d947419f59f6e9fe24c5bd",
  "object": "text_completion",
  "created": 1738269669,
  "model": "deepseek-ai/DeepSeek-R1",
  "choices": [
    {
      "index": 0,
      "text": ".\n\nOkay, let's see. The user has four boxes and is moving them around. Let me try to visualize each step. \n\nFirst, the red box is placed on the bottom. So the stack starts with red. Then the blue box is put on top of red. Now the order is red (bottom), blue. Next, the yellow box is added on top of blue. So now it's red, blue, yellow. \n\nThen the user takes the blue box out. Wait, blue is in the middle. If they remove blue, the stack would be red and yellow. But where do they put the blue box? The instruction says to put it on top. So after removing blue, the stack is red, yellow. Then blue is placed on top, making it red, yellow, blue. \n\nFinally, the green box is added on the top. So the final order should be red (bottom), yellow, blue, green. Let me double-check each step to make sure I didn't mix up any steps. Starting with red, then blue, then yellow. Remove blue from the middle, so yellow is now on top of red. Then place blue on top of that, so red, yellow, blue. Then green on top. Yes, that seems right. The key step is removing the blue box from the middle, which leaves yellow on red, then blue goes back on top, followed by green. So the final order from bottom to top is red, yellow, blue, green.\n\n**Final Answer**\nThe final order from bottom to top is \\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}.\n</think>\n\n1. Start with the red box at the bottom.\n2. Place the blue box on top of the red box. Order: red (bottom), blue.\n3. Place the yellow box on top of the blue box. Order: red, blue, yellow.\n4. Remove the blue box (from the middle) and place it on top. Order: red, yellow, blue.\n5. Place the green box on top. Final order: red, yellow, blue, green.\n\n\\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 76,
    "total_tokens": 544,
    "completion_tokens": 468,
    "prompt_tokens_details": null
  }
}
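
In the sample response above, the text field holds the model's reasoning, then a closing </think> marker, then the final answer. If only the answer is needed, a client could split the text on that marker. The following Python sketch assumes the marker appears at most once and that the response has the shape shown above; `split_reasoning` is an illustrative helper, not part of vLLM.

```python
import json


def split_reasoning(text: str, marker: str = "</think>") -> tuple[str, str]:
    """Split completion text into (reasoning, answer) at the closing marker.

    If the marker is absent, the whole text is treated as the answer.
    """
    reasoning, found, answer = text.partition(marker)
    if not found:
        return "", text.strip()
    return reasoning.strip(), answer.strip()


# Abbreviated stand-in for the response body shown above.
body = json.loads(
    '{"choices": [{"index": 0, "text": "Okay, let me think.\\n</think>\\n\\nred, yellow, blue, green"}]}'
)
reasoning, answer = split_reasoning(body["choices"][0]["text"])
print(answer)  # red, yellow, blue, green
```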

Llama 3.1 405B

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

The output should look similar to the following:

{"id":"cmpl-0a2310f30ac3454aa7f2c5bb6a292e6c",
"object":"text_completion","created":1723238375,"model":"meta-llama/Meta-Llama-3.1-405B-Instruct","choices":[{"index":0,"text":" top destination for foodies, with","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":5,"total_tokens":12,"completion_tokens":7}}
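
The same request can also be sent from Python using only the standard library. This sketch assumes the port forwarding set up earlier is still active on localhost:8080; the helper names are illustrative. The request is built separately from being sent, so the payload can be inspected without a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumes the port-forward from earlier is running


def build_completion_request(model: str, prompt: str, max_tokens: int,
                             temperature: float = 0) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_text(response_body: bytes) -> str:
    """Extract the generated text of the first choice from a response body."""
    return json.loads(response_body)["choices"][0]["text"]


request = build_completion_request(
    "meta-llama/Meta-Llama-3.1-405B-Instruct", "San Francisco is a", max_tokens=7
)

# Uncomment to send the request once the server is reachable:
# with urllib.request.urlopen(request) as response:
#     print(completion_text(response.read()))
```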