Serve an LLM on L4 GPUs with Ray

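This guide shows how to serve large language models (LLMs) with Ray and the Ray Operator add-on on Google Kubernetes Engine (GKE). The Ray framework provides an end-to-end AI/ML platform for training, fine-tuning, and inference of machine learning workloads. Ray Serve is a framework in Ray that you can use to serve popular LLMs from Hugging Face.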

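Before reading this guide, make sure that you are familiar with the model that you want to serve in this tutorial. You can serve any of the following models:

Gemma 2B IT
Gemma 7B IT
Llama 2 7B
Llama 3 8B
Mistral 7B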

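This page is intended for machine learning (ML) engineers and for platform administrators and operators who facilitate ML workloads. To learn more about the common roles and example tasks referenced in Google Cloud content, see "Common GKE user roles and tasks".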

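This guide covers the following steps:

Create an Autopilot or Standard GKE cluster with the Ray Operator add-on enabled.
Deploy a RayService resource that downloads and serves a large language model (LLM) from Hugging Face.
Deploy a chat interface and converse with the LLM.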

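Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, run the gcloud components update command to get the latest version. Earlier gcloud CLI versions might not support the commands in this document.
Note: For existing gcloud CLI installations, make sure to set the compute/region property. If you primarily use zonal clusters, set compute/zone instead. Setting a default location avoids gcloud CLI errors such as the following: One of [--zone, --region] must be supplied: Please specify location. If the location of your cluster differs from the default that you set, you might need to specify the location in certain commands.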

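Create a Hugging Face account, if you don't already have one.
Ensure that you have a Hugging Face token.
Ensure that you have access to the Hugging Face model that you want to use. You usually gain this access by signing an agreement and requesting access from the model owner on the Hugging Face model page.
Ensure that you have GPU quota in the us-central1 region. To learn more, see "GPU quota".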

Prepare your environment

In the Google Cloud console, start a Cloud Shell instance.

Clone the sample repository:
```
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/llm
export TUTORIAL_HOME=`pwd`
```

Set the default environment variables:
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export COMPUTE_REGION=us-central1
export CLUSTER_VERSION=CLUSTER_VERSION
export HF_TOKEN=HUGGING_FACE_TOKEN
```
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
- HUGGING_FACE_TOKEN: your Hugging Face access token.
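
Later commands can fail in confusing ways when one of these variables is empty. The following is a minimal sketch of a pre-flight check. `require_vars` is a hypothetical helper name, not part of gcloud or the sample repository.

```shell
# Hypothetical helper: print each named variable that is unset or empty
# and return a non-zero status if any are missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call, using the variable names from the step above:
#   require_vars PROJECT_ID COMPUTE_REGION CLUSTER_VERSION HF_TOKEN
```

If you plan to create a Standard cluster, add COMPUTE_ZONE to the list, because the Standard cluster command reads that variable.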

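Create a cluster with a GPU node pool

You can serve an LLM on L4 GPUs with Ray in a GKE Autopilot or Standard cluster by using the Ray Operator add-on. We generally recommend an Autopilot cluster for a fully managed Kubernetes experience. Choose a Standard cluster instead if your use case requires high scalability or if you want more control over the cluster configuration. To choose the GKE mode of operation that best fits your workloads, see "Choose a GKE mode of operation".

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

Create an Autopilot cluster with the Ray Operator add-on enabled:
```
gcloud container clusters create-auto rayserve-cluster \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}
```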

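Standard

Create a Standard cluster with the Ray Operator add-on enabled. This command reads COMPUTE_ZONE, which the environment setup step does not define, so export it first with a zone in ${COMPUTE_REGION} that offers L4 GPUs:
```
gcloud container clusters create rayserve-cluster \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION} \
    --machine-type=g2-standard-24 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=2 \
    --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest
```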

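Create a Kubernetes Secret for Hugging Face credentials

In Cloud Shell, create the Kubernetes Secret by doing the following:

Configure kubectl to communicate with your cluster. If you created a Standard cluster in a zone, pass that zone as the location instead of the region:
```
gcloud container clusters get-credentials rayserve-cluster --location=${COMPUTE_REGION}
```

Create a Kubernetes Secret that contains the Hugging Face token:
```
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
```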

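Deploy the LLM model

The GitHub repository that you cloned has a directory for each model, and each directory includes a RayService configuration. The configuration for each model includes the following components:

Ray Serve deployment: The Ray Serve deployment, including its resource configuration and runtime dependencies.
Model: The Hugging Face model ID.
Ray cluster: The underlying Ray cluster and the resources required for each component, including the head and worker Pods.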

Gemma 2B IT

Deploy the model:
```
kubectl apply -f gemma-2b-it/
```

Wait for the RayService resource to be ready:
```
kubectl get rayservice gemma-2b-it -o yaml
```

The output is similar to the following:
```
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
```

In this output, status: RUNNING indicates that the RayService resource is ready.
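
Rather than rerunning the command by hand, you can script the readiness check for any of the models. The sketch below assumes that a `status: RUNNING` line in the YAML is a sufficient readiness signal, as described above. `rayservice_is_running` and `wait_for_rayservice` are hypothetical helper names, and the 10-second interval and 60-attempt default are arbitrary choices.

```shell
# Hypothetical helper: succeed when the YAML on stdin contains a
# "status: RUNNING" line (deployment-level "HEALTHY" lines do not match).
rayservice_is_running() {
  grep -Eq '^[[:space:]]*status:[[:space:]]*RUNNING[[:space:]]*$'
}

# Hypothetical helper: poll the named RayService until it reports RUNNING
# or the attempts run out. Arguments: NAME [ATTEMPTS].
wait_for_rayservice() {
  name=$1
  attempts=${2:-60}
  while [ "$attempts" -gt 0 ]; do
    if kubectl get rayservice "$name" -o yaml | rayservice_is_running; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 10
  done
  return 1
}

# Example call:
#   wait_for_rayservice gemma-2b-it && echo "RayService is ready"
```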

Verify that GKE created the Service for the Ray Serve application:
```
kubectl get service gemma-2b-it-serve-svc
```

The output is similar to the following:
```
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gemma-2b-it-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
```

Gemma 7B IT

Deploy the model:
```
kubectl apply -f gemma-7b-it/
```

Wait for the RayService resource to be ready:
```
kubectl get rayservice gemma-7b-it -o yaml
```

The output is similar to the following:
```
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
```

In this output, status: RUNNING indicates that the RayService resource is ready.

Verify that GKE created the Service for the Ray Serve application:
```
kubectl get service gemma-7b-it-serve-svc
```

The output is similar to the following:
```
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gemma-7b-it-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
```

Llama 2 7B

Deploy the model:
```
kubectl apply -f llama-2-7b/
```

Wait for the RayService resource to be ready:
```
kubectl get rayservice llama-2-7b -o yaml
```

The output is similar to the following:
```
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
```

In this output, status: RUNNING indicates that the RayService resource is ready.

Verify that GKE created the Service for the Ray Serve application:
```
kubectl get service llama-2-7b-serve-svc
```

The output is similar to the following:
```
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
llama-2-7b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m
```

Llama 3 8B

Deploy the model:
```
kubectl apply -f llama-3-8b/
```

Wait for the RayService resource to be ready:
```
kubectl get rayservice llama-3-8b -o yaml
```

The output is similar to the following:
```
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
```

In this output, status: RUNNING indicates that the RayService resource is ready.

Verify that GKE created the Service for the Ray Serve application:
```
kubectl get service llama-3-8b-serve-svc
```

The output is similar to the following:
```
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
llama-3-8b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m
```

Mistral 7B

Deploy the model:
```
kubectl apply -f mistral-7b/
```

Wait for the RayService resource to be ready:
```
kubectl get rayservice mistral-7b -o yaml
```

The output is similar to the following:
```
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING
```

In this output, status: RUNNING indicates that the RayService resource is ready.

Verify that GKE created the Service for the Ray Serve application:
```
kubectl get service mistral-7b-serve-svc
```

The output is similar to the following:
```
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
mistral-7b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m
```

Serve the model

The Llama 2 7B and Llama 3 8B models use the OpenAI API chat spec. The other models support only text generation, a technique that generates text based on a prompt.

Set up port forwarding

Set up port forwarding to the inference server:

Gemma 2B IT
```
kubectl port-forward svc/gemma-2b-it-serve-svc 8000:8000
```

Gemma 7B IT
```
kubectl port-forward svc/gemma-7b-it-serve-svc 8000:8000
```

Llama 2 7B
```
kubectl port-forward svc/llama-2-7b-serve-svc 8000:8000
```

Llama 3 8B
```
kubectl port-forward svc/llama-3-8b-serve-svc 8000:8000
```

Mistral 7B
```
kubectl port-forward svc/mistral-7b-serve-svc 8000:8000
```

Interact with the model using curl

Use curl to chat with your model:

Gemma 2B IT

In a new terminal session:
```
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
```

Gemma 7B IT

In a new terminal session:
```
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
```

Llama 2 7B

In a new terminal session:
```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-2-7b-chat-hf",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
      ],
      "temperature": 0.7
    }'
```

Llama 3 8B

In a new terminal session:
```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
      ],
      "temperature": 0.7
    }'
```

Mistral 7B

In a new terminal session:
```
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'
```

Because the models that you served don't retain any history, each message and reply must be sent back to the model to create an interactive dialogue experience. The following example shows how to create an interactive dialogue with the Llama 3 8B model.

Create a dialogue with the model using curl:
```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."},
        {"role": "assistant", "content": " \n1. Java\n2. Python\n3. C++\n4. C#\n5. JavaScript"},
        {"role": "user", "content": "Can you give me a brief description?"}
      ],
      "temperature": 0.7
    }'
```

The output is similar to the following:

```
{
  "id": "cmpl-3cb18c16406644d291e93fff65d16e41",
  "object": "chat.completion",
  "created": 1719035491,
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a brief description of each:\n\n1. **Java**: A versatile language for building enterprise-level applications, Android apps, and web applications.\n2. **Python**: A popular language for data science, machine learning, web development, and scripting, known for its simplicity and ease of use.\n3. **C++**: A high-performance language for building operating systems, games, and other high-performance applications, with a focus on efficiency and control.\n4. **C#**: A modern, object-oriented language for building Windows desktop and mobile applications, as well as web applications using .NET.\n5. **JavaScript**: A versatile language for client-side scripting on the web, commonly used for creating interactive web pages, web applications, and mobile apps.\n\nNote: These descriptions are brief and don't do justice to the full capabilities and uses of each language."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 73,
    "total_tokens": 245,
    "completion_tokens": 172
  }
}
```
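
The history bookkeeping behind this dialogue can be sketched in shell: keep every message in one variable and resend the whole list with each request. `MESSAGES`, `add_message`, and `chat_payload` are hypothetical names, and the sketch assumes that message text contains no double quotes, backslashes, or newlines, because it performs no JSON escaping.

```shell
# Hypothetical helpers, not part of the tutorial's sample code.
# MESSAGES holds the comma-separated JSON message objects sent so far.
MESSAGES=""

# Append one message. Arguments: ROLE CONTENT.
# Assumes neither argument needs JSON escaping.
add_message() {
  entry=$(printf '{"role": "%s", "content": "%s"}' "$1" "$2")
  if [ -z "$MESSAGES" ]; then
    MESSAGES=$entry
  else
    MESSAGES="$MESSAGES, $entry"
  fi
}

# Print a chat completions request body that carries the full history.
# Argument: MODEL_ID.
chat_payload() {
  printf '{"model": "%s", "messages": [%s], "temperature": 0.7}\n' "$1" "$MESSAGES"
}

# Example turn, assuming port forwarding to the Llama 3 8B Service is active:
#   add_message system "You are a helpful assistant."
#   add_message user "Can you give me a brief description?"
#   curl http://localhost:8000/v1/chat/completions \
#       -H "Content-Type: application/json" \
#       -d "$(chat_payload meta-llama/Meta-Llama-3-8B-Instruct)"
```

After each reply, add the assistant's text with `add_message assistant` before adding the next user message, so that the model sees the whole exchange.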

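(Optional) Connect to the chat interface

You can use Gradio to build web applications that let you interact with your model. Gradio is a Python library that has a ChatInterface wrapper, which creates user interfaces for chatbots. For Llama 2 7B and Llama 3 8B, you installed Gradio when you deployed the LLM model.

Set up port forwarding to the gradio Service:
```
kubectl port-forward service/gradio 8080:8080 &
```

Open http://localhost:8080 in your browser to chat with the model.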

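Serve multiple models with model multiplexing

Model multiplexing is a technique for serving multiple models within the same Ray cluster. You can route traffic to a specific model by using request headers or by load balancing.

In this example, you create a multiplexed Ray Serve application that consists of two models: Gemma 7B IT and Llama 3 8B.

Deploy the RayService resource: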
```
kubectl apply -f model-multiplexing/
```

Wait for the RayService resource to be ready:
```
kubectl get rayservice model-multiplexing -o yaml
```

The output is similar to the following:
```
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T14:00:41Z"
        serveDeploymentStatuses:
          MutliModelDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment_1:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
        status: RUNNING
```

In this output, status: RUNNING indicates that the RayService resource is ready.

Verify that GKE created the Kubernetes Service for the Ray Serve application:
```
kubectl get service model-multiplexing-serve-svc
```

The output is similar to the following:
```
NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
model-multiplexing-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m
```

Set up port forwarding to the Ray Serve application:
```
kubectl port-forward svc/model-multiplexing-serve-svc 8000:8000
```

Send a request to the Gemma 7B IT model:
```
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" --header "serve_multiplexed_model_id: google/gemma-7b-it" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'
```

The output is similar to the following:
```
{"text": ["What are the top 5 most popular programming languages? Please be brief.\n\n1. JavaScript\n2. Java\n3. C++\n4. Python\n5. C#"]}
```

Send a request to the Llama 3 8B model:
```
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" --header "serve_multiplexed_model_id: meta-llama/Meta-Llama-3-8B-Instruct" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'
```

The output is similar to the following:
```
{"text": ["What are the top 5 most popular programming languages? Please be brief. Here are your top 5 most popular programming languages, based on the TIOBE Index, a widely used measure of the popularity of programming languages.\r\n\r\n1. **Java**: Used in Android app development, web development, and enterprise software development.\r\n2. **Python**: A versatile language used in data science, machine learning, web development, and automation.\r\n3. **C++**: A high-performance language used in game development, system programming, and high-performance computing.\r\n4. **C#**: Used in Windows and web application development, game development, and enterprise software development.\r\n5. **JavaScript**: Used in web development, mobile app development, and server-side programming with technologies like Node.js.\r\n\r\nSource: TIOBE Index (2022).\r\n\r\nThese rankings can vary depending on the source and methodology used, but this gives you a general idea of the most popular programming languages."]}
```

Send a request to a random model by omitting the serve_multiplexed_model_id header:
```
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'
```

The output is one of the outputs from the previous steps.
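
The three requests in this section differ only in the serve_multiplexed_model_id header, so they can be wrapped in one helper. This is a sketch under stated assumptions: `ask_model` and the `CURL` override are hypothetical, the prompt must not need JSON escaping, and port forwarding to the multiplexed application is assumed to be running.

```shell
# Hypothetical helper: send a prompt to the multiplexed application.
# Arguments: MODEL_ID PROMPT. An empty MODEL_ID omits the header, so the
# application picks a model itself.
# CURL is an assumed override hook: set it to another command to inspect
# the arguments instead of sending a request.
ask_model() {
  model_id=$1
  prompt=$2
  # Assumes the prompt contains no characters that need JSON escaping.
  body=$(printf '{"prompt": "%s", "max_tokens": 200}' "$prompt")
  if [ -n "$model_id" ]; then
    ${CURL:-curl} -X POST http://localhost:8000/ \
      -H "Content-Type: application/json" \
      -H "serve_multiplexed_model_id: $model_id" \
      -d "$body"
  else
    ${CURL:-curl} -X POST http://localhost:8000/ \
      -H "Content-Type: application/json" \
      -d "$body"
  fi
}

# Example: send the same prompt to both models in turn.
#   for id in google/gemma-7b-it meta-llama/Meta-Llama-3-8B-Instruct; do
#     ask_model "$id" "What are the top 5 most popular programming languages?"
#   done
```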

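Compose multiple models with model composition

Model composition is a technique for combining multiple models into a single application. With model composition, you can chain inputs and outputs across multiple LLMs and scale the models as a single application.

In this example, you compose two models, Gemma 7B IT and Llama 3 8B, into a single application. The first model is the assistant model, which answers the question in the prompt. The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is a summarized version of the assistant model's response.

Deploy the RayService resource: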
```
kubectl apply -f model-composition/
```

Wait for the RayService resource to be ready:
```
kubectl get rayservice model-composition -o yaml
```

The output is similar to the following:
```
status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T14:00:41Z"
        serveDeploymentStatuses:
          MutliModelDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment_1:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
        status: RUNNING
```

In this output, status: RUNNING indicates that the RayService resource is ready.

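Verify that GKE created the Service for the Ray Serve application:
```
kubectl get service model-composition-serve-svc
```

The output is similar to the following:
```
NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
model-composition-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m
```

Send a request to the model. The request targets localhost:8000, so it assumes that local port 8000 is forwarded to the model-composition-serve-svc Service, in the same way as in the earlier sections:
```
curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'
```

The output is similar to the following:
```
{"text": ["\n\n**Sure, here is a summary in a single sentence:**\n\nThe most popular programming language for machine learning is Python due to its ease of use, extensive libraries, and growing community."]}
```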

Delete the project

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

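Delete individual resources

If you used an existing project and you don't want to delete it, you can delete the individual resources.

Delete the cluster:
```
gcloud container clusters delete rayserve-cluster
```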

What's next

Discover how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Train a model with GPUs on GKE Standard mode.
Learn how to use RayServe on GKE by viewing the sample code in GitHub.