Serve an LLM on L4 GPUs with Ray

Autopilot Standard

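This guide demonstrates how to serve large language models (LLMs) by using Ray and the Ray Operator add-on with Google Kubernetes Engine (GKE).

In this guide, you can serve any of the following models:

- Gemma 2B IT
- Gemma 7B IT
- Llama 2 7B
- Llama 3 8B
- Mistral 7B

This guide also covers model deployment techniques that the Ray Serve framework supports, such as model multiplexing and model composition.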

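Background

The Ray framework provides an end-to-end AI/ML platform for training, fine-tuning, and inference of machine learning workloads. Ray Serve is a framework in Ray that you can use to serve popular LLMs from Hugging Face.

The number of GPUs that a model needs depends on its data format. In this guide, your model can use one or two L4 GPUs.

This guide covers the following steps:

Create an Autopilot or Standard GKE cluster with the Ray Operator add-on enabled.
Deploy a RayService resource that downloads and serves a large language model (LLM) from Hugging Face.
Deploy a chat interface and hold a dialogue with the LLM.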

Before you begin

Before you start, make sure that you have performed the following tasks:

Enable the Google Kubernetes Engine API.

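If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, run gcloud components update to get the latest version.
Note: For existing gcloud CLI installations, make sure to set the compute/region and compute/zone properties. Setting default locations avoids gcloud CLI errors such as the following: One of [--zone, --region] must be supplied: Please specify location.

Create a Hugging Face account, if you don't already have one.
Ensure that you have a Hugging Face token.
Ensure that you have access to the Hugging Face model that you want to use. Access is usually granted by signing an agreement on the Hugging Face model page and requesting access from the model owner.
Ensure that you have GPU quota in the us-central1 region. To learn more, see GPU quota.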

Prepare your environment

In the Google Cloud console, start a Cloud Shell instance.

Clone the sample repository:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples/ai-ml/gke-ray/rayserve/llm
export TUTORIAL_HOME=`pwd`

Set the default environment variables:
```
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export COMPUTE_REGION=us-central1
export CLUSTER_VERSION=CLUSTER_VERSION
export HF_TOKEN=HUGGING_FACE_TOKEN
```
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
- CLUSTER_VERSION: the GKE version to use. Must be 1.30.1 or later.
- HUGGING_FACE_TOKEN: your Hugging Face access token.

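Create a cluster with a GPU node pool

With the Ray Operator add-on, you can serve an LLM on L4 GPUs with Ray in a GKE Autopilot or Standard cluster. We generally recommend an Autopilot cluster for a fully managed Kubernetes experience. Choose a Standard cluster if your use case requires high scalability or if you want more control over the cluster configuration. To choose the GKE mode of operation that best fits your workloads, see Choose a GKE mode of operation.

Use Cloud Shell to create an Autopilot or Standard cluster:

Autopilot

Create an Autopilot cluster with the Ray Operator add-on enabled: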

gcloud container clusters create-auto rayserve-cluster \
    --enable-ray-operator \
    --cluster-version=${CLUSTER_VERSION} \
    --location=${COMPUTE_REGION}

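Standard

Create a Standard cluster with the Ray Operator add-on enabled. The command reads COMPUTE_ZONE, which the environment setup earlier in this guide does not export; set it to a zone in us-central1 that offers L4 GPUs before you run the command: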

gcloud container clusters create rayserve-cluster \
    --addons=RayOperator \
    --cluster-version=${CLUSTER_VERSION} \
    --machine-type=g2-standard-24 \
    --location=${COMPUTE_ZONE} \
    --num-nodes=2 \
    --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest

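Create a Kubernetes Secret for Hugging Face credentials

In Cloud Shell, create the Kubernetes Secret by doing the following:

Configure kubectl to communicate with your cluster. If you created a zonal Standard cluster, pass its zone to --location instead of the region:

gcloud container clusters get-credentials rayserve-cluster --location=${COMPUTE_REGION}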

Create a Kubernetes Secret that contains the Hugging Face token:

kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -

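Deploy the LLM model

The GitHub repository that you cloned has a directory for each model, and each directory contains the corresponding RayService configuration. The configuration for each model includes the following components:

Ray Serve deployment: the Ray Serve deployment, including its resource configuration and runtime dependencies.
Model: the Hugging Face model ID.
Ray cluster: the underlying Ray cluster and the resources required for each component, including the head Pod and the worker Pods.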

Gemma 2B IT

Deploy the model:
```
kubectl apply -f gemma-2b-it/
```

Wait for the RayService resource to be ready:

kubectl get rayservice gemma-2b-it -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.
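
If you want to script this wait instead of reading the YAML by eye, the check can be sketched in Python. This is a minimal sketch, not part of the sample repository: the helper names `rayservice_ready` and `fetch_status` are hypothetical, and the code assumes that `kubectl get rayservice NAME -o json` returns the same status fields as the YAML output shown above.

```python
import json
import subprocess


def rayservice_ready(status: dict) -> bool:
    """Return True when every Serve application is RUNNING and every
    Serve deployment is HEALTHY (field names taken from the output above)."""
    apps = status.get("activeServiceStatus", {}).get("applicationStatuses", {})
    if not apps:
        return False  # nothing has been reported yet
    for app in apps.values():
        if app.get("status") != "RUNNING":
            return False
        deployments = app.get("serveDeploymentStatuses", {})
        if any(d.get("status") != "HEALTHY" for d in deployments.values()):
            return False
    return True


def fetch_status(name: str) -> dict:
    """Read the status of a RayService through kubectl.
    Assumes kubectl is already configured for the cluster."""
    out = subprocess.run(
        ["kubectl", "get", "rayservice", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out).get("status", {})
```

For example, `rayservice_ready(fetch_status("gemma-2b-it"))` could be called in a loop with a short sleep until it returns `True`.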

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service gemma-2b-it-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gemma-2b-it-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Gemma 7B IT

Deploy the model:
```
kubectl apply -f gemma-7b-it/
```

Wait for the RayService resource to be ready:

kubectl get rayservice gemma-7b-it -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service gemma-7b-it-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
gemma-7b-it-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Llama 2 7B

Deploy the model:
```
kubectl apply -f llama-2-7b/
```

Wait for the RayService resource to be ready:

kubectl get rayservice llama-2-7b -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service llama-2-7b-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
llama-2-7b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Llama 3 8B

Deploy the model:
```
kubectl apply -f llama-3-8b/
```

Wait for the RayService resource to be ready:

kubectl get rayservice llama-3-8b -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service llama-3-8b-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
llama-3-8b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Mistral 7B

Deploy the model:
```
kubectl apply -f mistral-7b/
```

Wait for the RayService resource to be ready:

kubectl get rayservice mistral-7b -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T02:51:52Z"
        serveDeploymentStatuses:
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T02:51:52Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service mistral-7b-serve-svc

The output is similar to the following:

NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
mistral-7b-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m

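Serve the model

The Llama 2 7B and Llama 3 8B models use the OpenAI API chat specification. The other models support only text generation, which generates text from a prompt.

Set up port-forwarding

Set up port-forwarding to the inference server: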

Gemma 2B IT

kubectl port-forward svc/gemma-2b-it-serve-svc 8000:8000

Gemma 7B IT

kubectl port-forward svc/gemma-7b-it-serve-svc 8000:8000

Llama 2 7B

kubectl port-forward svc/llama-2-7b-serve-svc 8000:8000

Llama 3 8B

kubectl port-forward svc/llama-3-8b-serve-svc 8000:8000

Mistral 7B

kubectl port-forward svc/mistral-7b-serve-svc 8000:8000

Interact with the model by using curl

Use curl to chat with your model:

Gemma 2B IT

In a new terminal session:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Gemma 7B IT

In a new terminal session:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

Llama 2 7B

In a new terminal session:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-2-7b-chat-hf",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
      ],
      "temperature": 0.7
    }'

Llama 3 8B

In a new terminal session:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."}
      ],
      "temperature": 0.7
    }'

Mistral 7B

In a new terminal session:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Be brief.", "max_tokens": 1024}'

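Because the models that you serve don't retain any history, each message and reply must be sent back to the model to create an interactive dialogue experience. The following example shows how to create an interactive dialogue with the Llama 3 8B model:

Create a dialogue with the model by using curl: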

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Meta-Llama-3-8B-Instruct",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the top 5 most popular programming languages? Please be brief."},
        {"role": "assistant", "content": " \n1. Java\n2. Python\n3. C++\n4. C#\n5. JavaScript"},
        {"role": "user", "content": "Can you give me a brief description?"}
      ],
      "temperature": 0.7
}'

The output is similar to the following:

{
  "id": "cmpl-3cb18c16406644d291e93fff65d16e41",
  "object": "chat.completion",
  "created": 1719035491,
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Here's a brief description of each:\n\n1. **Java**: A versatile language for building enterprise-level applications, Android apps, and web applications.\n2. **Python**: A popular language for data science, machine learning, web development, and scripting, known for its simplicity and ease of use.\n3. **C++**: A high-performance language for building operating systems, games, and other high-performance applications, with a focus on efficiency and control.\n4. **C#**: A modern, object-oriented language for building Windows desktop and mobile applications, as well as web applications using .NET.\n5. **JavaScript**: A versatile language for client-side scripting on the web, commonly used for creating interactive web pages, web applications, and mobile apps.\n\nNote: These descriptions are brief and don't do justice to the full capabilities and uses of each language."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 73,
    "total_tokens": 245,
    "completion_tokens": 172
  }
}
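
The same history handling can be sketched in Python. This is a minimal sketch under stated assumptions rather than code from the sample repository: the helper names `build_payload`, `record_reply`, `send`, and `chat` are hypothetical, and the endpoint is assumed to be the port-forwarded Llama 3 8B server from the preceding steps. Each turn appends the user message, posts the whole history, and then appends the assistant reply taken from `choices[0].message` in the response.

```python
import json
import urllib.request

# Assumed values: the port-forwarded endpoint and model ID used in this guide.
URL = "http://localhost:8000/v1/chat/completions"
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"


def build_payload(messages, temperature=0.7):
    """Build a chat completion request body from the full history."""
    return {"model": MODEL, "messages": list(messages), "temperature": temperature}


def record_reply(messages, response):
    """Append the assistant message from a chat.completion response to the history."""
    reply = response["choices"][0]["message"]
    messages.append({"role": reply["role"], "content": reply["content"]})
    return reply["content"]


def send(payload, url=URL):
    """POST the payload as JSON and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def chat(messages, user_text, post=send):
    """Run one turn: the model sees every earlier message plus the new one."""
    messages.append({"role": "user", "content": user_text})
    response = post(build_payload(messages))
    return record_reply(messages, response)
```

A conversation then starts from `history = [{"role": "system", "content": "You are a helpful assistant."}]` and calls `chat(history, "...")` once per user message. The `post` parameter exists only so that the transport can be swapped out, for example in tests.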

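(Optional) Connect to the chat interface

You can use Gradio to build web applications that let you interact with your model. Gradio is a Python library that has a ChatInterface wrapper for creating chatbot user interfaces. For Llama 2 7B and Llama 3 8B, Gradio was installed when you deployed the LLM model.

Set up port-forwarding to the gradio Service: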

kubectl port-forward service/gradio 8080:8080 &

Open http://localhost:8080 in your browser to chat with the model.

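Serve multiple models with model multiplexing

Model multiplexing is a technique for serving multiple models within the same Ray cluster. You can route traffic to a specific model by using request headers or by load balancing.

In this example, you create a multiplexed Ray Serve application that consists of two models: Gemma 7B IT and Llama 3 8B.

Deploy the RayService resource: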
```
kubectl apply -f model-multiplexing/
```

Wait for the RayService resource to be ready:

kubectl get rayservice model-multiplexing -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T14:00:41Z"
        serveDeploymentStatuses:
          MutliModelDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment_1:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Kubernetes Service for the Ray Serve application:

kubectl get service model-multiplexing-serve-svc

The output is similar to the following:

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
model-multiplexing-serve-svc   ClusterIP   34.118.226.104   <none>        8000/TCP   45m

Set up port-forwarding to the Ray Serve application:

kubectl port-forward svc/model-multiplexing-serve-svc 8000:8000

Send a request to the Gemma 7B IT model:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" --header "serve_multiplexed_model_id: google/gemma-7b-it" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'

The output is similar to the following:

{"text": ["What are the top 5 most popular programming languages? Please be brief.\n\n1. JavaScript\n2. Java\n3. C++\n4. Python\n5. C#"]}

Send a request to the Llama 3 8B model:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" --header "serve_multiplexed_model_id: meta-llama/Meta-Llama-3-8B-Instruct" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'

The output is similar to the following:

{"text": ["What are the top 5 most popular programming languages? Please be brief. Here are your top 5 most popular programming languages, based on the TIOBE Index, a widely used measure of the popularity of programming languages.\r\n\r\n1. **Java**: Used in Android app development, web development, and enterprise software development.\r\n2. **Python**: A versatile language used in data science, machine learning, web development, and automation.\r\n3. **C++**: A high-performance language used in game development, system programming, and high-performance computing.\r\n4. **C#**: Used in Windows and web application development, game development, and enterprise software development.\r\n5. **JavaScript**: Used in web development, mobile app development, and server-side programming with technologies like Node.js.\r\n\r\nSource: TIOBE Index (2022).\r\n\r\nThese rankings can vary depending on the source and methodology used, but this gives you a general idea of the most popular programming languages."]}

Send a request to a random model by omitting the serve_multiplexed_model_id header:

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What are the top 5 most popular programming languages? Please be brief.", "max_tokens": 200}'

The output is one of the outputs from the previous steps.
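
The header-based routing used in the curl commands can also be sketched in Python. This is a minimal client-side sketch, not code from the sample repository: the helper names `build_request` and `generate` are hypothetical, and the endpoint is assumed to be the port-forwarded multiplexing Service from the preceding steps. The only difference between the three requests is whether the `serve_multiplexed_model_id` header is present and which model ID it carries.

```python
import json
import urllib.request

URL = "http://localhost:8000/"  # assumed: the port-forwarded Service from this guide
MODEL_HEADER = "serve_multiplexed_model_id"


def build_request(prompt, model_id=None, max_tokens=200):
    """Return (headers, body) for a text generation request.
    With model_id=None the routing header is left out, so the
    application picks a model, as in the last curl example."""
    headers = {"Content-Type": "application/json"}
    if model_id is not None:
        headers[MODEL_HEADER] = model_id
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return headers, body


def generate(prompt, model_id=None):
    """Send the request and return the "text" field of the JSON response."""
    headers, body = build_request(prompt, model_id)
    # urllib changes the capitalization of header names; HTTP header
    # names are case-insensitive, so the routing is unaffected.
    req = urllib.request.Request(URL, data=body, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```

For example, `generate("What are the top 5 most popular programming languages?", "google/gemma-7b-it")` targets the Gemma 7B IT model, and the same call without the second argument leaves the choice of model to the application.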

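Compose multiple models with model composition

Model composition is a technique for combining multiple models into a single application. With model composition, you can chain the inputs and outputs of multiple LLMs together and scale your models as a single application.

In this example, you compose two models, Gemma 7B IT and Llama 3 8B, into a single application. The first model is the assistant model, which answers the question in the prompt. The second model is the summarizer model. The output of the assistant model is chained into the input of the summarizer model. The final result is a summarized version of the response from the assistant model.

Deploy the RayService resource: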
```
kubectl apply -f model-composition/
```

Wait for the RayService resource to be ready:

kubectl get rayservice model-composition -o yaml

The output is similar to the following:

status:
  activeServiceStatus:
    applicationStatuses:
      llm:
        healthLastUpdateTime: "2024-06-22T14:00:41Z"
        serveDeploymentStatuses:
          MutliModelDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
          VLLMDeployment_1:
            healthLastUpdateTime: "2024-06-22T14:00:41Z"
            status: HEALTHY
        status: RUNNING

In this output, status: RUNNING indicates that the RayService resource is ready.

Confirm that GKE created the Service for the Ray Serve application:

kubectl get service model-composition-serve-svc

The output is similar to the following:

NAME                           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
model-composition-serve-svc    ClusterIP   34.118.226.104   <none>        8000/TCP   45m

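Send a request to the model. This step assumes that local port 8000 is forwarded to svc/model-composition-serve-svc, in the same way as in the earlier port-forwarding steps: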

curl -X POST http://localhost:8000/ -H "Content-Type: application/json" -d '{"prompt": "What is the most popular programming language for machine learning and why?", "max_tokens": 1000}'

The output is similar to the following:

{"text": ["\n\n**Sure, here is a summary in a single sentence:**\n\nThe most popular programming language for machine learning is Python due to its ease of use, extensive libraries, and growing community."]}
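
The chaining idea behind this application can be sketched in plain Python. This is an illustration only, not the sample repository's code: `compose` is a hypothetical helper, each model is represented by any callable that maps a prompt to generated text, and the summarization instruction is illustrative wording rather than the prompt the sample uses. In the real application, the two callables would be calls to the two Ray Serve deployments.

```python
from typing import Callable

# A "model" here is any function from prompt text to generated text.
Generate = Callable[[str], str]


def compose(assistant: Generate, summarizer: Generate) -> Generate:
    """Chain two models: the assistant's output becomes the summarizer's input."""

    def pipeline(prompt: str) -> str:
        answer = assistant(prompt)
        # Illustrative instruction; the sample application may phrase this differently.
        return summarizer("Summarize the following text in a single sentence: " + answer)

    return pipeline
```

With stand-in functions such as `compose(str.upper, len)` you can confirm the data flow before wiring in real models: the caller sends one prompt and receives only the second model's output, which matches the single summarized response shown above.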

Delete the project

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

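Delete individual resources

If you used an existing project and you don't want to delete it, delete the individual resources that you no longer need.

Delete the cluster. If no default location is set in the gcloud CLI, also pass --location with the region or zone that you used to create the cluster: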

gcloud container clusters delete rayserve-cluster

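What's next

Learn how to run optimized AI/ML workloads with GKE platform orchestration capabilities.
Train a model with GPUs on GKE Standard mode.
View the sample code in GitHub to learn how to use Ray Serve on GKE.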