Serve LLMs on GKE with a cost-optimized and high-availability GPU provisioning strategy

Applies to: Autopilot and Standard clusters

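This guide shows how to minimize the cost of LLM serving workloads on GKE. It combines flex-start, Spot VMs, and custom compute class profiles for cost-effective inference.

This guide uses Mixtral 8x7b as an example LLM that you can deploy.

This guide is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using Kubernetes container orchestration capabilities to serve LLMs. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.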

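Flex-start pricing

Flex-start is recommended if your workload needs resources that are provisioned dynamically as needed and held for up to seven days, without complex quota management and with cost-effective access. Flex-start is powered by Dynamic Workload Scheduler and is billed using Dynamic Workload Scheduler pricing:

Discounted (up to 53%) for vCPUs, GPUs, and TPUs.
You pay as you go.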
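
To make the discount ceiling concrete, the following sketch computes the lowest hourly price that a 53% discount would allow. The on-demand price of 10.00 is a made-up placeholder, not a real GKE price, and `flex_start_floor_price` is a hypothetical helper name. The actual discount varies by resource and can be lower than 53%.

```shell
# Hypothetical helper: lowest possible price after a percentage discount.
# $1 = on-demand price (placeholder value), $2 = discount in percent.
flex_start_floor_price() {
  awk -v price="$1" -v discount="$2" \
    'BEGIN { printf "%.2f\n", price * (100 - discount) / 100 }'
}

# Placeholder on-demand price of 10.00 per hour with the maximum 53% discount.
flex_start_floor_price 10.00 53   # prints 4.70
```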

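Background

This section describes the techniques that you can use to obtain computing resources, including GPU accelerators, based on the requirements of your AI/ML workloads. In GKE, these techniques are called accelerator obtainability strategies.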

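GPUs

Graphics processing units (GPUs) accelerate specific workloads, such as machine learning and data processing. GKE offers nodes equipped with these GPUs, with machine type options for node configuration that include NVIDIA H100, A100, and L4 GPUs.

For more information, see About GPUs in GKE.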

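Flex-start provisioning mode

Flex-start provisioning mode, powered by Dynamic Workload Scheduler, is a type of GPU consumption. In this mode, GKE persists your GPU request and automatically provisions resources when capacity becomes available. Consider flex-start for workloads that need GPU capacity for a limited time (up to seven days) and have no fixed start date. For more information, see flex-start.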

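Spot VMs

You can use GPUs with Spot VMs if your workloads can tolerate frequent node disruptions. Using Spot VMs or flex-start reduces the price of running GPUs. Combining Spot VMs with flex-start provides a fallback option when Spot VM capacity is unavailable.

For more information, see Use Spot VMs with GPU node pools.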

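Custom compute classes

You can request GPUs by using custom compute classes. Custom compute classes let you define a hierarchy of node configurations for GKE to prioritize during node scaling decisions, so that workloads run on your selected hardware. For more information, see About custom compute classes.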

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Make sure that you have the following role or roles on the project:
Check for the roles
1. In the Google Cloud console, go to the IAM page.
  Go to IAM
2. Select the project.
3. In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
4. For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
1. In the Google Cloud console, go to the IAM page.
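  Go to IAM
2. Select the project.
3. Click Grant access.
4. In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
5. In the Select a role list, select a role.
6. To grant additional roles, click Add another role and add each additional role.
7. Click Save.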

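Ensure that you have a GKE Autopilot or Standard cluster that runs version 1.32.2-gke.1652000 or later. The cluster must have node auto-provisioning enabled and GPU limits configured.
Create a Hugging Face account, if you don't already have one.
Ensure that your project has sufficient preemptible quota for NVIDIA L4 GPUs. For more information, see Preemptible quotas.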
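
One way to check the version requirement from the command line is sketched below. `version_at_least` is a hypothetical helper that relies on GNU `sort -V`, and the commented `gcloud` call uses the placeholders CLUSTER_NAME and LOCATION, which you would replace with your own values.

```shell
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on the version-sort ordering of GNU sort (-V).
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

MIN_VERSION="1.32.2-gke.1652000"

# To read the real control plane version, replace the sample value with, for example:
#   gcloud container clusters describe CLUSTER_NAME --location=LOCATION \
#     --format="value(currentMasterVersion)"
CLUSTER_VERSION="1.32.3-gke.1152000"  # sample value for illustration only

if version_at_least "$CLUSTER_VERSION" "$MIN_VERSION"; then
  echo "OK: $CLUSTER_VERSION meets the minimum $MIN_VERSION"
else
  echo "Upgrade needed: $CLUSTER_VERSION is older than $MIN_VERSION"
fi
```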

Get access to the model

If you don't already have one, generate a new Hugging Face token:

Click Your Profile > Settings > Access Tokens.
Select New Token.
Specify a name of your choice and a role of at least Read.
Select Generate a token.

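Create a custom compute class profile

In this section, you create a custom compute class profile. A custom compute class profile defines the types of compute resources that your workload uses and the relationships between them.

In the Google Cloud console, start a Cloud Shell session by clicking Activate Cloud Shell. The session opens in the bottom pane of the Google Cloud console.

Create a dws-flex-start.yaml manifest file: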

apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: dws-model-inference-class
spec:
  priorities:
    - machineType: g2-standard-24
      spot: true
    - machineType: g2-standard-24
      flexStart:
        enabled: true
        nodeRecycling:
          leadTimeSeconds: 3600
  nodePoolAutoCreation:
    enabled: true

Apply the dws-flex-start.yaml manifest:
```
kubectl apply -f dws-flex-start.yaml
```

GKE deploys g2-standard-24 machines with L4 accelerators. GKE uses the compute class to prioritize Spot VMs first, and then flex-start.
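
To confirm the priority order that the cluster stored, a sketch like the following lists each priority rule on its own line. `show_priorities` is a hypothetical helper name. The JSONPath fields mirror the manifest in this section, and a rule that lacks a given field should print an empty column.

```shell
# Hypothetical helper: print one line per priority rule of a ComputeClass,
# in list order (machine type, spot flag, flex-start flag), tab-separated.
# $1 = name of the ComputeClass resource.
show_priorities() {
  kubectl get computeclass "$1" \
    -o jsonpath='{range .spec.priorities[*]}{.machineType}{"\t"}{.spot}{"\t"}{.flexStart.enabled}{"\n"}{end}'
}
```

After you apply the manifest, run `show_priorities dws-model-inference-class`. The Spot rule should be listed before the flex-start rule.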

Deploy the LLM workload

Create a Kubernetes Secret that contains the Hugging Face token by using the following command:

kubectl create secret generic model-inference-secret \
    --from-literal=HUGGING_FACE_TOKEN=HUGGING_FACE_TOKEN \
    --dry-run=client -o yaml | kubectl apply -f -

Replace HUGGING_FACE_TOKEN with your Hugging Face access token.

Create a file named mixtral-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-mixtral-ccc
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      nodeSelector:
        cloud.google.com/compute-class: dws-model-inference-class
      containers:
      - name: llm
        image: us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu124.2-3.ubuntu2204.py311
        resources:
          requests:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
          limits:
            cpu: "5"
            memory: "40Gi"
            nvidia.com/gpu: "2"
        env:
        - name: MODEL_ID
          value: mistralai/Mixtral-8x7B-Instruct-v0.1
        - name: NUM_SHARD
          value: "2"
        - name: PORT
          value: "8080"
        - name: QUANTIZE
          value: bitsandbytes-nf4
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: model-inference-secret
              key: HUGGING_FACE_TOKEN
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
          - mountPath: /tmp
            name: ephemeral-volume
      volumes:
        - name: dshm
          emptyDir:
              medium: Memory
        - name: ephemeral-volume
          ephemeral:
            volumeClaimTemplate:
              metadata:
                labels:
                  type: ephemeral
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: "premium-rwo"
                resources:
                  requests:
                    storage: 100Gi

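In this manifest, the mountPath field is set to /tmp because that is the path that the HF_HOME environment variable points to in the Deep Learning Container (DLC) for Text Generation Inference (TGI), rather than the default /data path that is set in the TGI default image. The downloaded model is stored in this directory.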
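
The `QUANTIZE` and `NUM_SHARD` values in this manifest can be motivated with a rough sizing estimate. The sketch below assumes figures that are not stated in this guide: about 46.7 billion total parameters for Mixtral 8x7B and 24 GB of memory per L4 GPU. It counts weights only and ignores the KV cache, activations, and quantization overhead, so treat the results as lower bounds. `weights_gb` is a hypothetical helper name.

```shell
# Hypothetical helper: approximate weight size in GB (weights only).
# $1 = parameters in billions (assumed value), $2 = bits per weight.
weights_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.2f\n", params * bits / 8 }'
}

GPU_MEMORY_GB=48   # assumed: 2 L4 GPUs x 24 GB each, matching nvidia.com/gpu: "2"

echo "16-bit weights: $(weights_gb 46.7 16) GB of ${GPU_MEMORY_GB} GB"   # 93.40 GB
echo "4-bit weights:  $(weights_gb 46.7 4) GB of ${GPU_MEMORY_GB} GB"    # 23.35 GB
```

Under these assumptions, 16-bit weights would not fit in the two GPUs that the Pod requests, while 4-bit NF4 weights leave headroom for the runtime's other memory needs.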

Deploy the model:
```
kubectl apply -f mixtral-deployment.yaml
```
GKE schedules a new Pod, which triggers the node pool autoscaler to add a GPU node before the model replica is deployed.

Check the status of the model:

watch kubectl get deploy inference-mixtral-ccc

If the model deployed successfully, the output is similar to the following:

NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
inference-mixtral-ccc  1/1     1            1           10m

To exit the watch, press CTRL+C.
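
For scripts, where an interactive watch is impractical, a blocking wait is one alternative. The sketch below wraps `kubectl wait` in a hypothetical helper named `wait_for_model`; the 30-minute default timeout is an assumption, not a value from this guide.

```shell
# Hypothetical helper: block until the Deployment reports the Available condition.
# $1 = Deployment name, $2 = optional timeout (assumed default: 30m).
wait_for_model() {
  kubectl wait --for=condition=Available "deployment/$1" --timeout="${2:-30m}"
}
```

For example, run `wait_for_model inference-mixtral-ccc`. Because the manifest defines no readiness probe, the Available condition only indicates that the container started, so it is still worth checking the container logs before you send requests.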

Wait for the container to download and begin serving the model:
```
watch "kubectl logs $(kubectl get pods -l app=llm -o custom-columns=:metadata.name --no-headers) | tail"
```
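To exit the watch, press CTRL+C.

Note: If you see an HTTP 403 Forbidden status code instead of download logs, you might need to go to the Mixtral 8x7b repository on Hugging Face and sign the consent agreement.

View the node pools that GKE provisioned: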

kubectl get nodes -L cloud.google.com/gke-nodepool

The output is similar to the following:

  NAME                                                  STATUS   ROLES    AGE   VERSION               GKE-NODEPOOL
  gke-flex-na-nap-g2-standard--0723b782-fg7v   Ready    <none>   10m   v1.32.3-gke.1152000   nap-g2-standard-24-spot-gpu2-1gbdlbxz
  gke-flex-nap-zo-default-pool-09f6fe53-fzm8   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
  gke-flex-nap-zo-default-pool-09f6fe53-lv2v   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
  gke-flex-nap-zo-default-pool-09f6fe53-pq6m   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool

The name of the created node pool indicates the machine type. In this case, GKE provisioned Spot VMs.
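
The node pool name is a naming hint rather than a guarantee. As a sketch, the hypothetical `spot_nodes` helper below filters a node listing by the `-spot-` substring seen in the sample output; the sample rows are copied from that output. On a live cluster, selecting nodes by the `cloud.google.com/gke-spot=true` label is likely the more reliable check.

```shell
# Hypothetical helper: print names of nodes whose pool name contains "-spot-".
# Expects the output of
#   kubectl get nodes -L cloud.google.com/gke-nodepool --no-headers
# on stdin, where the node pool is the last column.
spot_nodes() {
  awk '$NF ~ /-spot-/ { print $1 }'
}

# Sample rows from the output above; on a cluster, pipe the kubectl command instead.
spot_nodes <<'EOF'
gke-flex-na-nap-g2-standard--0723b782-fg7v   Ready    <none>   10m   v1.32.3-gke.1152000   nap-g2-standard-24-spot-gpu2-1gbdlbxz
gke-flex-nap-zo-default-pool-09f6fe53-fzm8   Ready    <none>   32m   v1.32.3-gke.1152000   default-pool
EOF
```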

Expose the model:
```
kubectl expose deployment/inference-mixtral-ccc --port 8080 --name=llm-service
```
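Success: You have served an LLM by using flex-start, Spot VMs, and a custom compute class profile, which optimizes GPU provisioning and cost. You can now interact with the model.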

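Interact with the model by using `curl`

This section shows how to run a basic inference test to verify your deployed model.

Set up port forwarding to the model: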

kubectl port-forward service/llm-service 8080:8080

The output is similar to the following:

Forwarding from 127.0.0.1:8080 -> 8080

In a new terminal session, chat with the model by using curl:

curl http://localhost:8080/v1/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
    "model": "mixtral-8x7b-instruct-gptq",
    "prompt": "<s>[INST]Who was the first president of the United States?[/INST]",
    "max_tokens": 40}'

The output is similar to the following:

George Washington was a Founding Father and the first president of the United States, serving from 1789 to 1797.

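Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project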

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

Delete the Kubernetes resources that you created in this guide:

kubectl delete deployment inference-mixtral-ccc
kubectl delete service llm-service
kubectl delete computeclass dws-model-inference-class
kubectl delete secret model-inference-secret

Delete the cluster:

gcloud container clusters delete CLUSTER_NAME

What's next

Learn more about how to train a small workload with flex-start.
Learn more about GPUs in GKE.