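Serve LLMs such as DeepSeek-R1 671B or Llama 3.1 405B on bare metal

Overview

This guide shows how to serve state-of-the-art large language models (LLMs), such as DeepSeek-R1 671B or Llama 3.1 405B, on Google Distributed Cloud (software only) on bare metal by using graphics processing units (GPUs) across multiple nodes.

The guide uses portable open source technologies (Kubernetes, vLLM, and the LeaderWorkerSet (LWS) API) to deploy and serve AI/ML workloads on bare metal clusters. Google Distributed Cloud extends GKE for use in on-premises environments, while providing the GKE benefits of granular control, scalability, resilience, portability, and cost-effectiveness.

Background

This section describes the key technologies used in this guide, including the two LLMs used as examples: DeepSeek-R1 and Llama 3.1 405B.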

DeepSeek-R1

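DeepSeek-R1 is a 671-billion-parameter large language model from DeepSeek, designed for logical inference, mathematical reasoning, and real-time problem solving in a variety of text-based tasks. Google Distributed Cloud handles the computational demands of DeepSeek-R1 and supports its capabilities with scalable resources, distributed computing, and efficient networking.

To learn more, see the DeepSeek documentation.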

Llama 3.1 405B

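Llama 3.1 405B is a large language model from Meta, designed for a wide range of natural language processing tasks, including text generation, translation, and question answering. Google Distributed Cloud provides the robust infrastructure needed to support the distributed training and serving needs of models of this scale.

To learn more, see the Llama documentation.

Google Distributed Cloud managed Kubernetes service

Google Distributed Cloud offers a range of services, including Google Distributed Cloud (software only) for bare metal, which is well suited to deploying and managing AI/ML workloads in your own data center. Google Distributed Cloud is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. It provides the infrastructure needed to handle the computational demands of LLMs, including scalable resources, distributed computing, and high-performance networking.

To learn more about key Kubernetes concepts, see Start learning about Kubernetes. To learn more about Google Distributed Cloud and how it helps you scale, automate, and manage Kubernetes, see the Google Distributed Cloud (software only) for bare metal overview.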

GPU

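Graphics processing units (GPUs) accelerate specific workloads, such as machine learning and data processing. Google Distributed Cloud supports nodes equipped with these GPUs, so you can configure your cluster for optimal performance in machine learning and data processing tasks. Google Distributed Cloud provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.

To learn more, see Set up and use NVIDIA GPUs.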

LeaderWorkerSet (LWS)

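LeaderWorkerSet (LWS) is a Kubernetes deployment API that addresses common deployment patterns of AI/ML multi-node inference workloads. Multi-node serving uses multiple Pods, each potentially running on a different node, to handle the distributed inference workload. LWS treats multiple Pods as a group, which simplifies the management of distributed model serving.

vLLM and multi-host serving

When you serve computationally intensive LLMs, we recommend using vLLM and running the workloads on GPUs.

vLLM is a highly optimized open source LLM serving framework that can increase serving throughput on GPUs, with features such as the following:

Optimized transformer implementation with PagedAttention
Continuous batching to improve overall serving throughput
Distributed serving on multiple GPUs

For compute-intensive LLMs that can't fit on a single GPU node, you can use multiple GPU nodes to serve the model. vLLM supports two strategies for running workloads across GPUs:

Tensor parallelism splits the matrix multiplications in the transformer layer across multiple GPUs. However, this strategy requires a fast network because of the communication needed between the GPUs, which makes it less suitable for running workloads across nodes.
Pipeline parallelism splits the model by layer, or vertically. This strategy doesn't require constant communication between GPUs, which makes it a better option when you run models across nodes.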
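The difference between the two strategies can be illustrated with a toy sketch. The following Python is not vLLM code: plain lists stand in for GPU tensors, and the function names are hypothetical. It only shows that splitting a weight matrix by columns (tensor parallelism) and splitting a stack of layers into stages (pipeline parallelism) both reproduce the unsplit result.

```python
# Toy illustration only: plain Python lists stand in for GPU tensors.

def matmul(x, w):
    """Multiply matrix x (n x k) by matrix w (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, parts):
    """Split the columns of w into equal shards, one per simulated GPU."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly"
    step = cols // parts
    return [[row[i:i + step] for row in w] for i in range(0, cols, step)]

def tensor_parallel_matmul(x, w, parts):
    """Each shard computes a slice of the output; the slices are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    # Gathering the partial outputs models the per-layer GPU-to-GPU communication.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

def pipeline_parallel_forward(x, layers, stages):
    """Assign contiguous groups of layers to stages and run the stages in order."""
    per_stage = len(layers) // stages
    assert per_stage * stages == len(layers), "layers must divide evenly"
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)
        # Only the activations in x cross the stage boundary here.
    return x

x = [[1, 2], [3, 4]]
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]

full = matmul(matmul(x, w1), w2)
tp = tensor_parallel_matmul(x, w1, parts=2)
pp = pipeline_parallel_forward(x, [w1, w2], stages=2)
print(tp == matmul(x, w1), pp == full)
```

In the sketch, the tensor-parallel path needs a gather step inside every layer, whereas the pipeline path hands off activations only once per stage, which is the communication pattern the two descriptions above refer to.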

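You can combine both strategies in multi-node serving. For example, with two nodes that have eight H100 GPUs each, you can use both of the following:

Two-way pipeline parallelism to shard the model across the two nodes
Eight-way tensor parallelism to shard the model across the eight GPUs on each node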
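A rough back-of-the-envelope check shows why two eight-GPU nodes are used in this example. The sketch below counts model weights only and ignores the KV cache, activations, and runtime overhead, so real requirements are higher. The bytes-per-parameter values are assumptions about checkpoint precision (2 bytes for BF16 Llama weights, 1 byte for FP8 DeepSeek-R1 weights) and are not stated in this guide.

```python
H100_MEMORY_GB = 80  # per-GPU memory of an 80 GB H100

def weight_memory_gb(params_billions, bytes_per_param):
    """Approximate weight size in GB: billions of parameters times bytes each."""
    return params_billions * bytes_per_param

def gpu_memory_gb(nodes, gpus_per_node):
    """Total GPU memory in GB across all nodes."""
    return nodes * gpus_per_node * H100_MEMORY_GB

# Assumed precisions (hypothetical inputs, adjust for your checkpoint).
models = {"Llama 3.1 405B": (405, 2), "DeepSeek-R1 671B": (671, 1)}

for name, (params, width) in models.items():
    need = weight_memory_gb(params, width)
    one_node = need <= gpu_memory_gb(1, 8)
    two_nodes = need <= gpu_memory_gb(2, 8)
    print(f"{name}: ~{need} GB of weights, fits on 1 node: {one_node}, on 2 nodes: {two_nodes}")
```

Under these assumptions, neither model's weights fit in the 640 GB of a single eight-GPU node, while both fit within the 1280 GB available across two nodes.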

To learn more, see the vLLM documentation.

Create a Kubernetes Secret for Hugging Face credentials

Create a Kubernetes Secret that contains the Hugging Face token by using the following command:

kubectl create secret generic hf-secret \
    --kubeconfig KUBECONFIG \
    --from-literal=hf_api_token=${HF_TOKEN} \
    --dry-run=client -o yaml | kubectl apply -f -

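Replace KUBECONFIG with the path of the kubeconfig file for the cluster where you intend to host the LLM. The command reads the token from the HF_TOKEN environment variable, so set that variable to your Hugging Face token before you run the command.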
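Kubernetes stores the values in a Secret's data field as base64-encoded strings, which is an encoding and not encryption. The following minimal sketch shows the round trip for a token value like the one stored under the hf_api_token key. The token string is a made-up placeholder and the helper names are hypothetical.

```python
import base64

def encode_secret_value(value: str) -> str:
    """Encode a value the way it appears in a Secret's data field."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")

def decode_secret_value(encoded: str) -> str:
    """Recover the original value from the base64 text."""
    return base64.b64decode(encoded).decode("utf-8")

token = "hf_example_token"  # placeholder, not a real Hugging Face token
encoded = encode_secret_value(token)
print(encoded)
print(decode_secret_value(encoded) == token)
```

Because anyone who can read the Secret can decode the value, access to the Secret should be restricted accordingly.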

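Create your own vLLM multi-node image

To facilitate cross-node communication for vLLM, you can use Ray. The LeaderWorkerSet repository provides a Dockerfile, which includes a Bash script for configuring Ray with vLLM.

To create your own vLLM multi-node image, clone the LeaderWorkerSet repository, build a Docker image by using the provided Dockerfile (which configures Ray for cross-node communication), and then push the image to Artifact Registry for deployment on Google Distributed Cloud.

Build the container

To build the container, follow these steps:

Clone the LeaderWorkerSet repository: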

git clone https://github.com/kubernetes-sigs/lws.git

Build the image:

cd lws/docs/examples/vllm/build/ && docker build -f Dockerfile.GPU . -t vllm-multihost

Push the image to Artifact Registry

To ensure that your Kubernetes deployment can access the image, store it in Artifact Registry in your Google Cloud project:

docker image tag vllm-multihost ${IMAGE_NAME}
docker push ${IMAGE_NAME}
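The guide doesn't define the value of IMAGE_NAME. Artifact Registry Docker image paths generally take the form LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG, so one way to assemble the value is sketched below. The location, project, and repository names are hypothetical placeholders, and the helper function is illustrative only.

```python
def artifact_registry_image(location, project_id, repository, image, tag="latest"):
    """Build an Artifact Registry Docker image path from its components."""
    return f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"

# All values below are placeholders; substitute your own.
image_name = artifact_registry_image(
    location="us-central1",
    project_id="my-project",
    repository="my-repo",
    image="vllm-multihost",
)
print(image_name)
```

The resulting string is what you would export as IMAGE_NAME before running the commands above, and what replaces the IMAGE_NAME placeholder in the manifests later in this guide.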

Install LeaderWorkerSet

To install LWS, run the following command:

kubectl apply --server-side \
    --kubeconfig KUBECONFIG \
    -f https://github.com/kubernetes-sigs/lws/releases/latest/download/manifests.yaml

Verify that the LeaderWorkerSet controller is running in the lws-system namespace by using the following command:

kubectl get pod -n lws-system --kubeconfig KUBECONFIG

The output is similar to the following:

NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-5c4ff67cbd-9jsfc   2/2     Running   0          6d23h
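If you want to script this verification, the tabular output can be checked programmatically. The following is a minimal sketch, assuming the default column layout shown above. The function name is hypothetical, and the sample text is copied from the example output.

```python
def all_pods_ready(kubectl_output: str) -> bool:
    """Return True if every listed Pod is Running with all containers ready."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return False  # header only, or no output at all
    header = lines[0].split()
    ready_col = header.index("READY")
    status_col = header.index("STATUS")
    for row in lines[1:]:
        fields = row.split()
        ready, total = fields[ready_col].split("/")
        if ready != total or fields[status_col] != "Running":
            return False
    return True

sample = """
NAME                                      READY   STATUS    RESTARTS   AGE
lws-controller-manager-5c4ff67cbd-9jsfc   2/2     Running   0          6d23h
"""
print(all_pods_ready(sample))
```

Parsing text output is fragile across kubectl versions, so for anything beyond a quick check, structured output such as kubectl get pod -o json is usually a safer input.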

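Deploy the vLLM model server

To deploy the vLLM model server, follow these steps:

Create and apply the manifest for the LLM that you want to deploy.

DeepSeek-R1

Create a YAML manifest, vllm-deepseek-r1-A3.yaml, for the vLLM model server: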

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code --max-model-len 4096"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Apply the manifest by running the following command:

kubectl apply -f vllm-deepseek-r1-A3.yaml \
    --kubeconfig KUBECONFIG
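In both manifests in this guide, the number of GPUs requested by the group (size multiplied by the per-Pod nvidia.com/gpu limit) equals the product of the --tensor-parallel-size and --pipeline-parallel-size flags. If you adapt a manifest to different hardware, a quick consistency check like the following sketch can catch a mismatch before you apply it. The helper names are hypothetical, and the command string is abbreviated from the manifest.

```python
import re

def parallel_sizes(command: str):
    """Extract the tensor and pipeline parallel sizes from a vLLM command line."""
    tp = int(re.search(r"--tensor-parallel-size\s+(\d+)", command).group(1))
    pp = int(re.search(r"--pipeline-parallel-size\s+(\d+)", command).group(1))
    return tp, pp

def gpus_match(group_size: int, gpus_per_pod: int, command: str) -> bool:
    """Check that the GPUs requested by the group equal TP times PP."""
    tp, pp = parallel_sizes(command)
    return group_size * gpus_per_pod == tp * pp

command = (
    "python3 -m vllm.entrypoints.openai.api_server --port 8080 "
    "--model deepseek-ai/DeepSeek-R1 --tensor-parallel-size 8 --pipeline-parallel-size 2"
)
# size: 2 and nvidia.com/gpu: "8" come from the manifest.
print(parallel_sizes(command), gpus_match(group_size=2, gpus_per_pod=8, command=command))
```

Here the group requests 2 × 8 = 16 GPUs, which matches 8 × 2 = 16 from the flags.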

Llama 3.1 405B

Create a YAML manifest, vllm-llama3-405b-A3.yaml, for the vLLM model server:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      metadata:
        labels:
          role: leader
      spec:
        nodeSelector:
          cloud.google.com/gke-accelerator: nvidia-h100-80gb
        containers:
          - name: vllm-leader
            image: IMAGE_NAME
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); 
                python3 -m vllm.entrypoints.openai.api_server --port 8080 --model meta-llama/Meta-Llama-3.1-405B-Instruct --tensor-parallel-size 8 --pipeline-parallel-size 2"
            resources:
              limits:
                nvidia.com/gpu: "8"
            ports:
              - containerPort: 8080
            readinessProbe:
              tcpSocket:
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 10
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
    workerTemplate:
      spec:
        containers:
          - name: vllm-worker
            image: IMAGE_NAME
            command:
              - sh
              - -c
              - "/vllm-workspace/ray_init.sh worker --ray_address=$(LWS_LEADER_ADDRESS)"
            resources:
              limits:
                nvidia.com/gpu: "8"
            env:
              - name: HUGGING_FACE_HUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: hf-secret
                    key: hf_api_token
            volumeMounts:
              - mountPath: /dev/shm
                name: dshm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 15Gi
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-leader
spec:
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    leaderworkerset.sigs.k8s.io/name: vllm
    role: leader
  type: ClusterIP

Apply the manifest by running the following command:

kubectl apply -f vllm-llama3-405b-A3.yaml \
    --kubeconfig KUBECONFIG

View the logs from the running model server by running the following command:

kubectl logs vllm-0 -c vllm-leader \
    --kubeconfig KUBECONFIG

The output should look similar to the following:

INFO 08-09 21:01:34 api_server.py:297] Route: /detokenize, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /version, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/chat/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [7428]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
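Loading a model of this size can take a while, so it can help to check the logs for the startup message before you send requests. The sketch below scans log text like the sample above. It assumes the log wording shown in this guide, which may differ between vLLM versions, and the function names are hypothetical.

```python
def server_ready(log_text: str) -> bool:
    """Return True once the log reports that application startup completed."""
    return any("Application startup complete" in line for line in log_text.splitlines())

def served_routes(log_text: str):
    """Collect the route paths announced in the log, in order of appearance."""
    routes = []
    for line in log_text.splitlines():
        if "Route:" in line:
            routes.append(line.split("Route:", 1)[1].split(",", 1)[0].strip())
    return routes

sample_log = """\
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/models, Methods: GET
INFO 08-09 21:01:34 api_server.py:297] Route: /v1/completions, Methods: POST
INFO:     Waiting for application startup.
INFO:     Application startup complete.
"""
print(server_ready(sample_log), served_routes(sample_log))
```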

Serve the model

Set up port forwarding to the model by running the following command:

kubectl port-forward svc/vllm-leader 8080:8080 \
    --kubeconfig KUBECONFIG

Interact with the model using curl

To interact with the model using curl, follow these instructions:

DeepSeek-R1

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "prompt": "I have four boxes. I put the red box on the bottom and put the blue box on top. Then I put the yellow box on top the blue. Then I take the blue box out and put it on top. And finally I put the green box on the top. Give me the final order of the boxes from bottom to top. Show your reasoning but be brief",
    "max_tokens": 1024,
    "temperature": 0
}'

The output should be similar to the following:

{
  "id": "cmpl-f2222b5589d947419f59f6e9fe24c5bd",
  "object": "text_completion",
  "created": 1738269669,
  "model": "deepseek-ai/DeepSeek-R1",
  "choices": [
    {
      "index": 0,
      "text": ".\n\nOkay, let's see. The user has four boxes and is moving them around. Let me try to visualize each step. \n\nFirst, the red box is placed on the bottom. So the stack starts with red. Then the blue box is put on top of red. Now the order is red (bottom), blue. Next, the yellow box is added on top of blue. So now it's red, blue, yellow. \n\nThen the user takes the blue box out. Wait, blue is in the middle. If they remove blue, the stack would be red and yellow. But where do they put the blue box? The instruction says to put it on top. So after removing blue, the stack is red, yellow. Then blue is placed on top, making it red, yellow, blue. \n\nFinally, the green box is added on the top. So the final order should be red (bottom), yellow, blue, green. Let me double-check each step to make sure I didn't mix up any steps. Starting with red, then blue, then yellow. Remove blue from the middle, so yellow is now on top of red. Then place blue on top of that, so red, yellow, blue. Then green on top. Yes, that seems right. The key step is removing the blue box from the middle, which leaves yellow on red, then blue goes back on top, followed by green. So the final order from bottom to top is red, yellow, blue, green.\n\n**Final Answer**\nThe final order from bottom to top is \\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}.\n</think>\n\n1. Start with the red box at the bottom.\n2. Place the blue box on top of the red box. Order: red (bottom), blue.\n3. Place the yellow box on top of the blue box. Order: red, blue, yellow.\n4. Remove the blue box (from the middle) and place it on top. Order: red, yellow, blue.\n5. Place the green box on top. Final order: red, yellow, blue, green.\n\n\\boxed{red}, \\boxed{yellow}, \\boxed{blue}, \\boxed{green}",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 76,
    "total_tokens": 544,
    "completion_tokens": 468,
    "prompt_tokens_details": null
  }
}
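In the sample response, the model's reasoning and its final answer are separated by a closing think tag inside the text field. If you only need the final answer, you can split on that marker. This is a minimal sketch that assumes the marker appears as it does in the sample above. The function name is hypothetical.

```python
def split_reasoning(text: str, marker: str = "</think>"):
    """Split completion text into (reasoning, answer) at the first marker."""
    reasoning, sep, answer = text.partition(marker)
    if not sep:
        # No marker found: treat the whole text as the answer.
        return "", text.strip()
    return reasoning.strip(), answer.strip()

sample_text = "Okay, let's see. Red, then yellow.\n</think>\n\nFinal order: red, yellow, blue, green."
reasoning, answer = split_reasoning(sample_text)
print(answer)
```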

Llama 3.1 405B

In a new terminal, send a request to the server:

curl http://localhost:8080/v1/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
}'

The output should be similar to the following:

{
  "id": "cmpl-0a2310f30ac3454aa7f2c5bb6a292e6c",
  "object": "text_completion",
  "created": 1723238375,
  "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
  "choices": [
    {
      "index": 0,
      "text": " top destination for foodies, with",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 12,
    "completion_tokens": 7
  }
}
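The same requests can be sent from Python instead of curl. The sketch below uses only the standard library and targets the port-forwarded address used in this guide. The helper names are hypothetical. It also interprets two fields from the sample responses: a finish_reason of "length" means the completion stopped at max_tokens, and the usage counts should add up.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # the port-forwarded Service from this guide

def build_completion_request(model, prompt, max_tokens, temperature=0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def summarize(response_body: str) -> dict:
    """Pull the text and a few sanity checks out of a completions response."""
    body = json.loads(response_body)
    choice = body["choices"][0]
    usage = body["usage"]
    return {
        "text": choice["text"],
        "truncated": choice["finish_reason"] == "length",
        "tokens_consistent": usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"],
    }

def complete(model, prompt, max_tokens):
    """Send the request and summarize the response (requires the port forward)."""
    request = build_completion_request(model, prompt, max_tokens)
    with urllib.request.urlopen(request) as response:
        return summarize(response.read().decode("utf-8"))
```

For example, complete("meta-llama/Meta-Llama-3.1-405B-Instruct", "San Francisco is a", 7) mirrors the Llama curl request above. With max_tokens set to 7, the sample response reports a finish_reason of "length", so the summary would mark it as truncated.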