Serve scalable LLMs on GKE with TorchServe

Autopilot

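This tutorial shows you how to deploy a scalable machine learning (ML) model to a Google Kubernetes Engine (GKE) cluster by using the TorchServe framework. You deploy a pre-trained PyTorch model that generates predictions based on user requests. After you deploy the model, you get a prediction URL that your application uses to send prediction requests. This approach lets you scale the model and the web application independently. When you deploy ML workloads and applications on Autopilot, GKE chooses the most efficient underlying machine type and size to run the workloads.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who are interested in using GKE Autopilot to reduce the administrative overhead of node configuration, scaling, and upgrades. To learn more about common roles and example tasks that we reference in Google Cloud content, see Common GKE user roles and tasks.

Before you read this page, make sure that you're familiar with GKE Autopilot mode.

About the tutorial application

The application is a small Python web application created with the Fast Dash framework. You use the application to send prediction requests to the T5 model. The application captures the user's text input and language pair and sends them to the model. The model translates the text and returns the result to the application, which displays the result to the user. For more information about Fast Dash, see the Fast Dash documentation.

Objectives

Prepare a pre-trained T5 model from the Hugging Face repository for serving by packaging it as a container image and pushing it to Artifact Registry
Deploy the model to an Autopilot cluster
Deploy the Fast Dash application that communicates with the model
Autoscale the model based on Prometheus metrics

Costs

In this document, you use billable components of Google Cloud.

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When you finish the tasks that are described in this document, you can avoid continued billing by deleting the resources that you created. For more information, see Clean up.

Before you begin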

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

Install the Google Cloud CLI.

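Note: If you previously installed the gcloud CLI, make sure that you have the latest version by running gcloud components update.

If you're using an external identity provider (IdP), you must first sign in to the gcloud CLI with your federated identity.

To initialize the gcloud CLI, run the following command: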

gcloud init

Create or select a Google Cloud project.

Create a Google Cloud project:
```
gcloud projects create PROJECT_ID
```
Replace PROJECT_ID with a name for the Google Cloud project you are creating.
Select the Google Cloud project that you created:
```
gcloud config set project PROJECT_ID
```
Replace PROJECT_ID with your Google Cloud project name.

Verify that billing is enabled for your Google Cloud project.

Enable the Kubernetes Engine, Cloud Storage, Artifact Registry, and Cloud Build APIs:

gcloud services enable container.googleapis.com storage.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com

Prepare the environment

Clone the example repository and open the tutorial directory:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples.git
cd kubernetes-engine-samples/ai-ml/t5-model-serving

Create the cluster

Run the following command:

gcloud container clusters create-auto ml-cluster \
    --release-channel=RELEASE_CHANNEL \
    --cluster-version=CLUSTER_VERSION \
    --location=us-central1

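Replace the following:

RELEASE_CHANNEL: the release channel for your cluster. Must be one of rapid, regular, or stable. To use L4 GPUs, choose a channel that offers GKE version 1.28.3-gke.1203000 or later. To see the versions available in a specific channel, see View the default and available versions for release channels.
CLUSTER_VERSION: the GKE version to use. Must be 1.28.3-gke.1203000 or later.

This operation takes several minutes to complete.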
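If you want to check a candidate version against the 1.28.3-gke.1203000 minimum before you create the cluster, a comparison along the following lines might help. This is a minimal sketch: the function names are hypothetical, and it assumes that versions always follow the MAJOR.MINOR.PATCH-gke.BUILD pattern shown above.

```python
import re

# Minimum version stated in this tutorial for L4 GPU support.
MIN_VERSION = "1.28.3-gke.1203000"


def parse_gke_version(version):
    """Split a version such as '1.28.3-gke.1203000' into a tuple of integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-gke\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a full GKE version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=MIN_VERSION):
    """Return True if `version` is the same as or later than `minimum`."""
    return parse_gke_version(version) >= parse_gke_version(minimum)


# The candidate versions below are illustrative only.
for candidate in ("1.27.8-gke.1067004", "1.28.3-gke.1203000", "1.29.1-gke.1589000"):
    print(candidate, meets_minimum(candidate))
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically, where "1.9" would sort after "1.28".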

Create an Artifact Registry repository

Create a new Artifact Registry standard repository with the Docker format in the same region as your cluster:

gcloud artifacts repositories create models \
    --repository-format=docker \
    --location=us-central1 \
    --description="Repo for T5 serving image"

Verify the repository name:

gcloud artifacts repositories describe models \
    --location=us-central1

The output is similar to the following:

Encryption: Google-managed key
Repository Size: 0.000MB
createTime: '2023-06-14T15:48:35.267196Z'
description: Repo for T5 serving image
format: DOCKER
mode: STANDARD_REPOSITORY
name: projects/PROJECT_ID/locations/us-central1/repositories/models
updateTime: '2023-06-14T15:48:35.267196Z'

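Package the model

In this section, you use Cloud Build to package the model and the serving framework in a single container image, and then push the resulting image to the Artifact Registry repository.

Review the Dockerfile for the container image: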

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG BASE_IMAGE=pytorch/torchserve:0.12.0-cpu

FROM alpine/git

ARG MODEL_NAME=t5-small
ARG MODEL_REPO=https://huggingface.co/${MODEL_NAME}
ENV MODEL_NAME=${MODEL_NAME}
ENV MODEL_VERSION=${MODEL_VERSION}

RUN git clone "${MODEL_REPO}" /model

FROM ${BASE_IMAGE}

ARG MODEL_NAME=t5-small
ARG MODEL_VERSION=1.0
ENV MODEL_NAME=${MODEL_NAME}
ENV MODEL_VERSION=${MODEL_VERSION}

COPY --from=0 /model/. /home/model-server/
COPY handler.py \
     model.py \
     requirements.txt \
     setup_config.json /home/model-server/

RUN  torch-model-archiver \
     --model-name="${MODEL_NAME}" \
     --version="${MODEL_VERSION}" \
     --model-file="model.py" \
     --serialized-file="pytorch_model.bin" \
     --handler="handler.py" \
     --extra-files="config.json,spiece.model,tokenizer.json,setup_config.json" \
     --runtime="python" \
     --export-path="model-store" \
     --requirements-file="requirements.txt"

FROM ${BASE_IMAGE}

ENV PATH /home/model-server/.local/bin:$PATH
ENV TS_CONFIG_FILE /home/model-server/config.properties
# CPU inference logs a CUDA warning (not an error):
# Could not load dynamic library 'libnvinfer_plugin.so.7'
# This is expected behaviour. see: https://stackoverflow.com/a/61137388
ENV TF_CPP_MIN_LOG_LEVEL 2

COPY --from=1 /home/model-server/model-store/ /home/model-server/model-store
COPY config.properties /home/model-server/

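This Dockerfile defines the following multi-stage build process:

Download the model artifacts from the Hugging Face repository.
Package the model by using the Torch Model Archiver tool (torch-model-archiver). This stage creates a model archive (.mar) file that the inference server uses to load the model.
Build the final image with TorchServe.

Build and push the image by using Cloud Build: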

gcloud builds submit model/ \
    --region=us-central1 \
    --config=model/cloudbuild.yaml \
    --substitutions=_LOCATION=us-central1,_MACHINE=gpu,_MODEL_NAME=t5-small,_MODEL_VERSION=1.0

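The build process takes several minutes to complete. If you use a model that is larger than t5-small, the build process might take significantly longer.

Check that the image is in the repository: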

gcloud artifacts docker images list us-central1-docker.pkg.dev/PROJECT_ID/models

Replace PROJECT_ID with your Google Cloud project ID.

The output is similar to the following:

IMAGE                                                     DIGEST         CREATE_TIME          UPDATE_TIME
us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small     sha256:0cd...  2023-06-14T12:06:38  2023-06-14T12:06:38
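The image path follows the Artifact Registry pattern LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE. The following sketch assembles the full image reference from the values used in this tutorial. The function name is hypothetical, and the VERSION-MACHINE tag format is an assumption inferred from the 1.0-gpu tag that appears in the serving manifest, not something confirmed by the build configuration.

```python
def image_uri(project_id, location="us-central1", repository="models",
              model_name="t5-small", model_version="1.0", machine="gpu"):
    """Build an Artifact Registry image reference for the serving image."""
    host = f"{location}-docker.pkg.dev"
    # Assumption: the build tags the image as MODEL_VERSION-MACHINE.
    tag = f"{model_version}-{machine}"
    return f"{host}/{project_id}/{repository}/{model_name}:{tag}"


# "my-project" is a placeholder project ID.
print(image_uri("my-project"))
```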

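Deploy the packaged model to GKE

To deploy the image, this tutorial uses a Kubernetes Deployment. A Deployment is a Kubernetes API object that lets you run multiple replicas of Pods that are distributed among the nodes in a cluster.

Modify the Kubernetes manifest in the example repository to match your environment.

Review the manifest for the inference workload: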

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: t5-inference
  labels:
    model: t5
    version: v1.0
    machine: gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      model: t5
      version: v1.0
      machine: gpu
  template:
    metadata:
      labels:
        model: t5
        version: v1.0
        machine: gpu
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-l4
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
        - name: inference
          image: us-central1-docker.pkg.dev/PROJECT_ID/models/t5-small:1.0-gpu
          imagePullPolicy: IfNotPresent
          args: ["torchserve", "--start", "--foreground"]
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "3000m"
              memory: 16Gi
              ephemeral-storage: 10Gi
            requests:
              nvidia.com/gpu: "1"
              cpu: "3000m"
              memory: 16Gi
              ephemeral-storage: 10Gi
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 8081
              name: management
            - containerPort: 8082
              name: metrics
          readinessProbe:
            httpGet:
              path: /ping
              port: http
            initialDelaySeconds: 120
            failureThreshold: 10
          livenessProbe:
            httpGet:
              path: /models/t5-small
              port: management
            initialDelaySeconds: 150
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: t5-inference
  labels:
    model: t5
    version: v1.0
    machine: gpu
spec:
  type: ClusterIP
  selector:
    model: t5
    version: v1.0
    machine: gpu
  ports:
    - port: 8080
      name: http
      targetPort: http
    - port: 8081
      name: management
      targetPort: management
    - port: 8082
      name: metrics
      targetPort: metrics

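Replace PROJECT_ID in the manifest with your Google Cloud project ID. In the following command, substitute your project ID for the second PROJECT_ID, which is the replacement text: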
```
sed -i "s/PROJECT_ID/PROJECT_ID/g" "kubernetes/serving-gpu.yaml"
```
This step ensures that the container image path in the Deployment specification matches the path to the T5 model image in Artifact Registry.
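The readinessProbe and livenessProbe settings in the serving-gpu.yaml manifest determine how long the model has to load before Kubernetes intervenes. The following sketch works through that arithmetic. It is an approximation that ignores probe timeouts and kubelet scheduling jitter, and the function name is hypothetical. The defaults (periodSeconds of 10 and failureThreshold of 3) are the Kubernetes defaults that apply when a probe omits those fields.

```python
# Kubernetes defaults that apply when a probe omits the field.
DEFAULT_PERIOD_SECONDS = 10
DEFAULT_FAILURE_THRESHOLD = 3


def seconds_until_threshold(initial_delay, period=DEFAULT_PERIOD_SECONDS,
                            failure_threshold=DEFAULT_FAILURE_THRESHOLD):
    """Approximate seconds from container start until `failure_threshold`
    consecutive probe failures are observed, assuming every probe fails."""
    return initial_delay + (failure_threshold - 1) * period


# readinessProbe: initialDelaySeconds=120, so the first check runs at about 120s.
readiness_first_probe = 120
# livenessProbe: initialDelaySeconds=150, periodSeconds=5, default failureThreshold.
liveness_restart_after = seconds_until_threshold(initial_delay=150, period=5)

print(f"Pod cannot report Ready before about {readiness_first_probe}s")
print(f"Container restarts after about {liveness_restart_after}s "
      "if the liveness probe never succeeds")
```

If you serve a larger model that takes longer to load, you might need to raise initialDelaySeconds so that the liveness probe doesn't restart the container while the model is still loading.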

Create the Kubernetes resources:

kubectl create -f kubernetes/serving-gpu.yaml

To verify that the model deployed successfully, do the following:

Get the status of the Deployment and the Service:

kubectl get -f kubernetes/serving-gpu.yaml

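Wait for the output to show ready Pods, similar to the following. Depending on the size of the image, the first image pull might take several minutes.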

NAME                            READY   UP-TO-DATE    AVAILABLE   AGE
deployment.apps/t5-inference    1/1     1             0           66s

NAME                    TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)                       AGE
service/t5-inference    ClusterIP   10.48.131.86      <none>        8080/TCP,8081/TCP,8082/TCP    66s

Open a local port for the t5-inference Service:

kubectl port-forward svc/t5-inference 8080

Open a new terminal window and send a test request to the Service:
```
curl -v -X POST -H 'Content-Type: application/json' -d '{"text": "this is a test sentence", "from": "en", "to": "fr"}' "http://localhost:8080/predictions/t5-small/1.0"
```
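If you prefer to send the same test request from Python, the following sketch builds an equivalent POST request with the standard library. The function names and the timeout value are hypothetical. The URL path and the text, from, and to fields mirror the curl command above, and the port-forward from the previous step must still be running.

```python
import json
import urllib.request


def build_prediction_request(text, source_lang, target_lang,
                             base_url="http://localhost:8080",
                             model_name="t5-small", model_version="1.0"):
    """Build a POST request that mirrors the curl test command."""
    payload = {"text": text, "from": source_lang, "to": target_lang}
    url = f"{base_url}/predictions/{model_name}/{model_version}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text, source_lang, target_lang, timeout=30.0):
    """Send the request to the port-forwarded Service and return the raw body."""
    request = build_prediction_request(text, source_lang, target_lang)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, calling translate("this is a test sentence", "en", "fr") sends the same payload as the curl command.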
If the test request fails and the Pod connection closes, check the logs:
```
kubectl logs deployments/t5-inference
```
If the output is similar to the following, TorchServe failed to install some model dependencies:
```
org.pytorch.serve.archive.model.ModelException: Custom pip package installation failed for t5-small
```
To resolve this issue, restart the Deployment:
```
kubectl rollout restart deployment t5-inference
```
The Deployment controller creates a new Pod. Repeat the previous steps to open a port on the new Pod.
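If you collect Pod logs in a script, a check like the following can flag the dependency installation failure described above. This is a minimal sketch: the function name is hypothetical, and it only looks for the message text shown in the example log line.

```python
# Message text taken from the ModelException log line shown above.
DEPENDENCY_FAILURE_MARKER = "Custom pip package installation failed"


def needs_rollout_restart(log_text):
    """Return True if the logs contain the dependency installation failure."""
    return any(DEPENDENCY_FAILURE_MARKER in line for line in log_text.splitlines())


sample_logs = (
    "org.pytorch.serve.archive.model.ModelException: "
    "Custom pip package installation failed for t5-small\n"
)
print(needs_rollout_restart(sample_logs))
```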

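Access the deployed model by using the web application

To access the deployed model with the Fast Dash web application, complete the following steps:

Build the Fast Dash web application as a container image and push it to Artifact Registry: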

gcloud builds submit client-app/ \
    --region=us-central1 \
    --config=client-app/cloudbuild.yaml

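Open kubernetes/application.yaml in a text editor, and replace PROJECT_ID in the image: field with your project ID. Alternatively, run the following command, substituting your project ID for the second PROJECT_ID: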
```
sed -i "s/PROJECT_ID/PROJECT_ID/g" "kubernetes/application.yaml"
```
Create the Kubernetes resources:
```
kubectl create -f kubernetes/application.yaml
```
The Deployment and Service might take some time to fully provision.

To check the status, run the following command:

kubectl get -f kubernetes/application.yaml

Wait for the output to show ready Pods, similar to the following:

NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/fastdash   1/1     1            0           1m

NAME               TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/fastdash   NodePort   203.0.113.12    <none>        8050/TCP         1m

The web application is now running, but it isn't exposed on an external IP address. To access the web application, open a local port:
```
kubectl port-forward service/fastdash 8050
```
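In a browser, open the web interface:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8050.
- If you're using Cloud Shell, click Web preview, and then click Change port. Specify port 8050.
To send a request to the T5 model, specify values in the TEXT, FROM LANG, and TO LANG fields in the web interface, and then click Submit. For a list of available languages, see the T5 documentation.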

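Enable autoscaling for the model

This section shows you how to enable autoscaling for the model based on metrics from Google Cloud Managed Service for Prometheus by doing the following:

Install Custom Metrics Stackdriver Adapter
Apply PodMonitoring and HorizontalPodAutoscaling configurations

Google Cloud Managed Service for Prometheus is enabled by default in Autopilot clusters that run version 1.25 and later.

Install Custom Metrics Stackdriver Adapter

This adapter lets your cluster use metrics from Prometheus to make Kubernetes autoscaling decisions.

Deploy the adapter: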

kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

Create an IAM service account for the adapter to use:

gcloud iam service-accounts create monitoring-viewer

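Grant the IAM service account the monitoring.viewer role on the project, and grant the adapter's Kubernetes ServiceAccount the iam.workloadIdentityUser role on the IAM service account: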

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/monitoring.viewer
gcloud iam service-accounts add-iam-policy-binding monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"

Replace PROJECT_ID with your Google Cloud project ID.
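The two --member values in the preceding commands identify different principals: the first is the IAM service account itself, and the second is the adapter's Kubernetes ServiceAccount as seen through the cluster's workload identity pool. The following sketch shows how each string is composed. The function names are hypothetical, and the default values come from the commands above.

```python
def iam_service_account_member(project_id, account="monitoring-viewer"):
    """Member string for the IAM service account."""
    return f"serviceAccount:{account}@{project_id}.iam.gserviceaccount.com"


def workload_identity_member(project_id, namespace="custom-metrics",
                             ksa="custom-metrics-stackdriver-adapter"):
    """Member string for a Kubernetes ServiceAccount in the workload identity pool."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa}]"


# "my-project" is a placeholder project ID.
print(iam_service_account_member("my-project"))
print(workload_identity_member("my-project"))
```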

Annotate the adapter's Kubernetes ServiceAccount so that it can impersonate the IAM service account:

kubectl annotate serviceaccount custom-metrics-stackdriver-adapter \
    --namespace custom-metrics \
    iam.gke.io/gcp-service-account=monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com

Restart the adapter to propagate the changes:

kubectl rollout restart deployment custom-metrics-stackdriver-adapter \
    --namespace=custom-metrics

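Apply PodMonitoring and HorizontalPodAutoscaling configurations

PodMonitoring is a Google Cloud Managed Service for Prometheus custom resource that enables metrics ingestion and target scraping in a specific namespace.

Deploy the PodMonitoring resource in the same namespace as the TorchServe Deployment: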
```
kubectl apply -f kubernetes/pod-monitoring.yaml
```
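This page doesn't show the contents of kubernetes/pod-monitoring.yaml. As a rough illustration of what a PodMonitoring resource for this Deployment might contain, the following sketch builds one as a Python dictionary and prints it as JSON, which kubectl also accepts. The resource name and the 15s scrape interval are assumptions. The label selector and the metrics port name come from the serving manifest shown earlier.

```python
import json


def pod_monitoring_manifest(name="t5-inference", port="metrics", interval="15s"):
    """Build a PodMonitoring resource that scrapes the TorchServe metrics port."""
    return {
        "apiVersion": "monitoring.googleapis.com/v1",
        "kind": "PodMonitoring",
        "metadata": {"name": name},
        "spec": {
            # Labels copied from the t5-inference Deployment's Pod template.
            "selector": {
                "matchLabels": {"model": "t5", "version": "v1.0", "machine": "gpu"}
            },
            # The port refers to the named container port; the interval is assumed.
            "endpoints": [{"port": port, "interval": interval}],
        },
    }


print(json.dumps(pod_monitoring_manifest(), indent=2))
```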

Review the HorizontalPodAutoscaler manifest:

# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: t5-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: t5-inference
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: prometheus.googleapis.com|ts_queue_latency_microseconds|counter
      target:
        type: AverageValue
        averageValue: "30000"

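The HorizontalPodAutoscaler scales the number of T5 model Pods based on the cumulative duration of the request queue. Autoscaling is based on the ts_queue_latency_microseconds metric, which reports the cumulative queue duration in microseconds.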
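To get a feel for how the autoscaler reacts to this metric, the following sketch applies the scaling formula from the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), clamped to minReplicas and maxReplicas. It is a simplification that ignores the autoscaler's tolerance band and stabilization behavior. The function name and the metric values in the examples are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_average, target_average,
                     min_replicas=1, max_replicas=5):
    """Apply the basic HPA formula, then clamp to the configured bounds."""
    raw = math.ceil(current_replicas * current_average / target_average)
    return max(min_replicas, min(max_replicas, raw))


# Target of 30000 matches averageValue in the manifest; observed values are made up.
print(desired_replicas(1, 90_000, 30_000))   # three times the target
print(desired_replicas(1, 300_000, 30_000))  # far above the target, capped by maxReplicas
print(desired_replicas(3, 5_000, 30_000))    # well below the target
```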

Create the HorizontalPodAutoscaler:
```
kubectl apply -f kubernetes/hpa.yaml
```

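Verify autoscaling by using a load generator

To test your autoscaling configuration, generate load for the serving application. This tutorial uses a Locust load generator to send requests to the prediction endpoint for the model.

Create the load generator: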
```
kubectl apply -f kubernetes/loadgenerator.yaml
```
Wait for the load generator Pods to become ready.
Expose the load generator web interface locally:
```
kubectl port-forward svc/loadgenerator 8080
```
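If you see an error message, try again when the Pod is running.
In a browser, open the load generator web interface:
- If you're using a local shell, open a browser and go to http://127.0.0.1:8080.
- If you're using Cloud Shell, click Web preview, and then click Change port. Enter port 8080.
Click the Charts tab to observe performance over time.
Open a new terminal window and watch the replica count of the horizontal Pod autoscaler: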
```
kubectl get hpa -w
```
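The number of replicas increases as the load increases. The scale-up might take approximately ten minutes. As new replicas start, the number of successful requests in the Locust chart increases.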
```
NAME           REFERENCE                 TARGETS           MINPODS   MAXPODS   REPLICAS   AGE
t5-inference   Deployment/t5-inference   71352001470m/7M   1         5         1          2m11s
```
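The TARGETS column uses Kubernetes quantity notation, where the m suffix means thousandths and M means millions. The following sketch decodes the sample value shown above. The function name is hypothetical, and the parser handles only a few decimal suffixes, not binary suffixes such as Ki or Mi. The 7M target in this sample output differs from the averageValue of 30000 in the manifest shown earlier, so expect your own output to reflect the target in the manifest that you applied.

```python
from decimal import Decimal

# Decimal suffixes only; binary suffixes such as Ki or Mi are not handled here.
_SUFFIX_MULTIPLIERS = {
    "m": Decimal("0.001"),
    "k": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
}


def parse_quantity(text):
    """Convert a quantity such as '71352001470m' or '7M' to a Decimal."""
    suffix = text[-1:]
    if suffix in _SUFFIX_MULTIPLIERS:
        return Decimal(text[:-1]) * _SUFFIX_MULTIPLIERS[suffix]
    return Decimal(text)


current, target = (parse_quantity(part) for part in "71352001470m/7M".split("/"))
print(f"current average: {current}")
print(f"target average:  {target}")
print(f"ratio:           {current / target:.2f}")
```

A ratio well above 1 means that the observed average exceeds the target, which is what prompts the autoscaler to add replicas up to maxReplicas.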

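Recommendations

Build your model with the same version of the base Docker image that you use for serving.
If your model has special package dependencies, or if your dependencies are large, create a custom version of the base Docker image.
Track the versions across your model's dependency tree. Ensure that your package dependencies support each other's versions. For example, pandas version 2.0.3 supports NumPy version 1.20.3 and later.
Run GPU-intensive models on GPU nodes and CPU-intensive models on CPU nodes. Doing so can improve the stability of model serving and helps ensure that you use node resources efficiently.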

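Observe model performance

To observe model performance, you can use the TorchServe dashboard integration in Cloud Monitoring. With this dashboard, you can view critical performance metrics such as token throughput, request latency, and error rates.

To use the TorchServe dashboard, you must enable Google Cloud Managed Service for Prometheus, which collects the metrics from TorchServe, in your GKE cluster. TorchServe exposes metrics in Prometheus format by default, so you don't need to install an additional exporter.

You can then view the metrics by using the TorchServe dashboard. For information about using Google Cloud Managed Service for Prometheus to collect metrics from your model, see the TorchServe observability guidance in the Cloud Monitoring documentation.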
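Because TorchServe exposes metrics in the Prometheus text format, you can also inspect them directly, for example after port-forwarding the metrics port (8082) of the t5-inference Service. The following sketch extracts the sample values for one metric from such text. The function name is hypothetical, the parser covers only simple sample lines, and the sample text is illustrative, not real output from this deployment.

```python
def parse_metric_samples(exposition_text, metric_name):
    """Return the float value of every sample line for `metric_name`."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            head, _, tail = line.rpartition("}")
            name = head.split("{", 1)[0]
        else:
            name, _, tail = line.partition(" ")
        fields = tail.split()
        if name == metric_name and fields:
            values.append(float(fields[0]))
    return values


# Illustrative text only; the label values and numbers are made up.
SAMPLE = """\
# TYPE ts_queue_latency_microseconds counter
ts_queue_latency_microseconds{uuid="example",model_name="t5-small",model_version="1.0"} 364.07
ts_inference_requests_total{uuid="example",model_name="t5-small",model_version="1.0"} 4.0
"""

print(parse_metric_samples(SAMPLE, "ts_queue_latency_microseconds"))
```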

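Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project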

Delete a Google Cloud project:

gcloud projects delete PROJECT_ID

Delete individual resources

Delete the Kubernetes resources:

kubectl delete -f kubernetes/loadgenerator.yaml
kubectl delete -f kubernetes/hpa.yaml
kubectl delete -f kubernetes/pod-monitoring.yaml
kubectl delete -f kubernetes/application.yaml
kubectl delete -f kubernetes/serving-gpu.yaml
kubectl delete -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter_new_resource_model.yaml

Delete the GKE cluster:

gcloud container clusters delete "ml-cluster" \
    --location="us-central1" --quiet

Delete the IAM service account and the IAM policy bindings:

gcloud projects remove-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com" \
    --role roles/monitoring.viewer
gcloud iam service-accounts remove-iam-policy-binding monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[custom-metrics/custom-metrics-stackdriver-adapter]"
gcloud iam service-accounts delete monitoring-viewer@PROJECT_ID.iam.gserviceaccount.com

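Delete the images in Artifact Registry. Optionally, delete the entire repository. For instructions, see the Artifact Registry documentation about deleting images.

Component overview

This section describes the components used in this tutorial, such as the model, the web application, the framework, and the cluster.

About the T5 model

This tutorial uses a pre-trained multilingual T5 model. T5 is a text-to-text transformer that converts text from one language to another. In T5, inputs and outputs are always text strings, in contrast to BERT-style models, which can only output either a class label or a span of the input. The T5 model can also be used for tasks such as summarization, question answering, or text classification. The model is trained on a large quantity of text from the Colossal Clean Crawled Corpus (C4) and Wiki-DPR.

For more information, see the T5 model documentation.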
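Because every T5 task is expressed as text, the task itself is selected with a text prefix such as "translate English to German:", as described in the T5 paper. The following sketch shows how a request's text and language pair might be turned into such a prompt. The function name and the language-code mapping are hypothetical, and the handler.py file in the example repository might build its prompt differently.

```python
# Hypothetical mapping from request language codes to the names used in prompts.
LANGUAGE_NAMES = {"en": "English", "fr": "French", "de": "German", "ro": "Romanian"}


def build_t5_prompt(text, source_lang, target_lang):
    """Prefix the input text with a T5-style translation task description."""
    try:
        source = LANGUAGE_NAMES[source_lang]
        target = LANGUAGE_NAMES[target_lang]
    except KeyError as err:
        raise ValueError(f"unsupported language code: {err.args[0]}") from err
    return f"translate {source} to {target}: {text}"


print(build_t5_prompt("this is a test sentence", "en", "fr"))
```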

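The T5 model was presented by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, published in the Journal of Machine Learning Research.

The T5 model comes in various sizes, with different levels of complexity that suit specific use cases. This tutorial uses the default size, t5-small, but you can also choose a different size. The following T5 sizes are distributed under the Apache 2.0 license:

t5-small: 60 million parameters
t5-base: 220 million parameters
t5-large: 770 million parameters. 3 GB download.
t5-3b: 3 billion parameters. 11 GB download.
t5-11b: 11 billion parameters. 45 GB download.

For other available T5 models, see the Hugging Face repository.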
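When you choose a model size, a rough estimate of the weight footprint can help you decide whether the memory request in the serving manifest is sufficient. The following sketch multiplies the parameter counts listed above by the bytes per parameter. It assumes 32-bit floating-point weights (4 bytes each) and ignores activation memory and framework overhead, so treat the results as lower bounds. The function name is hypothetical.

```python
# Parameter counts as listed in this section.
PARAMETER_COUNTS = {
    "t5-small": 60_000_000,
    "t5-base": 220_000_000,
    "t5-large": 770_000_000,
    "t5-3b": 3_000_000_000,
    "t5-11b": 11_000_000_000,
}


def weights_size_gb(model_name, bytes_per_parameter=4):
    """Estimate the size of the model weights in gigabytes (10**9 bytes)."""
    return PARAMETER_COUNTS[model_name] * bytes_per_parameter / 1e9


for name in PARAMETER_COUNTS:
    print(f"{name}: about {weights_size_gb(name):.2f} GB of weights")
```

The estimates for the larger models are roughly in line with the download sizes listed above.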

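About TorchServe

TorchServe is a flexible tool for serving PyTorch models. It provides out-of-the-box support for all major deep learning frameworks, including PyTorch, TensorFlow, and ONNX. You can use TorchServe to deploy models in production, or for rapid prototyping and experimentation.

What's next

Serve an LLM with multiple GPUs.
Explore reference architectures, diagrams, and best practices about Google Cloud. Take a look at our Cloud Architecture Center.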
