Serve an LLM with GKE Inference Gateway

This tutorial shows you how to deploy a large language model (LLM) on Google Kubernetes Engine (GKE) by using GKE Inference Gateway. It includes steps for cluster setup, model deployment, GKE Inference Gateway configuration, and serving LLM requests.

This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who want to deploy and manage LLM applications on GKE with GKE Inference Gateway.

Before you read this page, make sure that you're familiar with the following:

Background

This section describes the key technologies used in this tutorial. For more information about model serving concepts and terminology, and how GKE generative AI capabilities can improve and support model serving performance, see Introduction to model inference on GKE.

vLLM

vLLM is a highly optimized open source LLM serving framework that increases serving throughput on GPUs. Its key features include the following:

  • Optimized transformer implementation with PagedAttention
  • Continuous batching to improve the overall serving throughput
  • Tensor parallelism and distributed serving on multiple GPUs

For more information, see the vLLM documentation.

GKE Inference Gateway

GKE Inference Gateway builds on the capabilities of GKE for serving LLMs. It optimizes inference workloads with features such as the following:

  • Inference-optimized load balancing based on load metrics.
  • Dense multi-workload serving of LoRA adapters.
  • Model-aware routing for simplified operations.

For more information, see About GKE Inference Gateway.

Get access to the model

To deploy the Llama3.1 model to GKE, sign the license consent agreement and generate a Hugging Face access token.

You must sign the consent agreement to use the Llama3.1 model. Follow these instructions:

  1. Access the consent page and verify consent by using your Hugging Face account.
  2. Accept the model terms.

Generate an access token

To access the model through Hugging Face, you need a Hugging Face token.

Follow these steps to generate a new token if you don't have one already:

  1. Click Your Profile > Settings > Access Tokens.
  2. Select New Token.
  3. Specify a name of your choice and a role of at least Read.
  4. Select Generate a token.
  5. Copy the generated token to your clipboard.

准备环境

在本教程中,您将使用 Cloud Shell 来管理Google Cloud上托管的资源。Cloud Shell 中预安装了本教程所需的软件,包括 kubectlgcloud CLI

如需使用 Cloud Shell 设置您的环境,请执行以下步骤:

  1. 在 Google Cloud 控制台中,点击 Google Cloud 控制台中的 Cloud Shell 激活图标 激活 Cloud Shell 以启动 Cloud Shell 会话。此操作会在 Google Cloud 控制台的底部窗格中启动会话。

  2. 设置默认环境变量:

    gcloud config set project PROJECT_ID
    gcloud config set billing/quota_project PROJECT_ID
    export PROJECT_ID=$(gcloud config get project)
    export REGION=REGION
    export CLUSTER_NAME=CLUSTER_NAME
    export HF_TOKEN=HF_TOKEN
    

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • REGION: a region that supports the accelerator type that you want to use, for example, us-central1 for H100 GPUs.
    • CLUSTER_NAME: the name of your cluster.
    • HF_TOKEN: the Hugging Face token that you generated earlier.

Create and configure Google Cloud resources

Create a GKE cluster and node pool

Deploy your LLM on GPUs in a GKE Autopilot or Standard cluster. We recommend that you use an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.

Autopilot

In Cloud Shell, run the following command:

gcloud container clusters create-auto CLUSTER_NAME \
    --project=PROJECT_ID \
    --location=CONTROL_PLANE_LOCATION \
    --release-channel=rapid

Replace the following values:

  • PROJECT_ID: your Google Cloud project ID.
  • CONTROL_PLANE_LOCATION: the Compute Engine region of your cluster's control plane. Provide a region that supports the accelerator type that you want to use, for example, us-central1 for H100 GPUs.
  • CLUSTER_NAME: the name of your cluster.

GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.
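
Optionally, verify that the cluster was created and is running before you continue. This check is a minimal sketch and not part of the original workflow:

gcloud container clusters describe CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION \
    --format="value(status)"

The command prints RUNNING when the cluster is ready.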

Standard

  1. In Cloud Shell, run the following command to create a Standard cluster:

    gcloud container clusters create CLUSTER_NAME \
        --project=PROJECT_ID \
        --location=CONTROL_PLANE_LOCATION \
        --workload-pool=PROJECT_ID.svc.id.goog \
        --release-channel=rapid \
        --num-nodes=1 \
        --enable-managed-prometheus \
        --monitoring=SYSTEM,DCGM \
        --gateway-api=standard
    

    Replace the following values:

    • PROJECT_ID: your Google Cloud project ID.
    • CONTROL_PLANE_LOCATION: the Compute Engine region of your cluster's control plane. Provide a region that supports the accelerator type that you want to use, for example, us-central1 for H100 GPUs.
    • CLUSTER_NAME: the name of your cluster.

    Cluster creation might take several minutes.

  2. To create a node pool with the appropriate disk size for running the Llama-3.1-8B-Instruct model, run the following command:

    gcloud container node-pools create gpupool \
        --accelerator type=nvidia-h100-80gb,count=2,gpu-driver-version=latest \
        --project=PROJECT_ID \
        --location=CONTROL_PLANE_LOCATION \
        --node-locations=CONTROL_PLANE_LOCATION-a \
        --cluster=CLUSTER_NAME \
        --machine-type=a3-highgpu-2g \
        --num-nodes=1
    

    GKE creates a node pool with a single node that has two H100 GPUs attached.
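
    Optionally, confirm the node pool status and machine type before you continue. This is a minimal verification sketch, not a required step:

    gcloud container node-pools describe gpupool \
        --cluster=CLUSTER_NAME \
        --location=CONTROL_PLANE_LOCATION \
        --format="value(status,config.machineType)"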

Set up authorization to scrape metrics

To set up authorization to scrape metrics, create the inference-gateway-sa-metrics-reader-secret Secret:

kubectl apply -f - <<EOF
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-metrics-reader
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference-gateway-sa-metrics-reader
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: inference-gateway-sa-metrics-reader-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: inference-gateway-sa-metrics-reader
  namespace: default
roleRef:
  kind: ClusterRole
  name: inference-gateway-metrics-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Secret
metadata:
  name: inference-gateway-sa-metrics-reader-secret
  namespace: default
  annotations:
    kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-sa-metrics-reader-secret-read
rules:
- resources:
  - secrets
  apiGroups: [""]
  verbs: ["get", "list", "watch"]
  resourceNames: ["inference-gateway-sa-metrics-reader-secret"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gmp-system:collector:inference-gateway-sa-metrics-reader-secret-read
  namespace: default
roleRef:
  name: inference-gateway-sa-metrics-reader-secret-read
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gmp-system
  kind: ServiceAccount
EOF
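
Optionally, confirm that the objects were created. The following commands are a minimal check and assume that kubectl is already configured for your cluster (the get-credentials step appears later in this tutorial):

kubectl get clusterrole inference-gateway-metrics-reader
kubectl get secret inference-gateway-sa-metrics-reader-secret -n default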

Create a Kubernetes Secret for Hugging Face credentials

In Cloud Shell, do the following:

  1. Configure kubectl to communicate with your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME \
        --location=REGION
    

    Replace the following values:

    • REGION: a region that supports the accelerator type that you want to use, for example, us-central1 for H100 GPUs.
    • CLUSTER_NAME: the name of your cluster.
  2. Create a Kubernetes Secret that contains the Hugging Face token:

    kubectl create secret generic hf-secret \
        --from-literal=hf_api_token=${HF_TOKEN} \
        --dry-run=client -o yaml | kubectl apply -f -
    

    Replace HF_TOKEN with the Hugging Face token that you generated earlier.
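
    Optionally, verify that the Secret exists and exposes the hf_api_token key. This optional check doesn't print the token value:

    kubectl describe secret hf-secret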

Install the InferencePool and InferenceObjective CRDs

In this section, you install the necessary custom resource definitions (CRDs) for GKE Inference Gateway.

CRDs extend the Kubernetes API, which lets you define new resource types. To use GKE Inference Gateway, install the InferencePool and InferenceObjective CRDs in your GKE cluster by running the following command:

kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.0/manifests.yaml
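
Optionally, confirm that the CRDs are installed before you continue. This is a minimal check; the exact resource and group names depend on the release that you installed:

kubectl get crd | grep -i inference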

Deploy the model server

This example deploys a Llama3.1 model by using a vLLM model server. The Deployment is labeled app:vllm-llama3.1-8b-instruct. It also uses two LoRA adapters from Hugging Face named food-review and cad-fabricator. You can update this Deployment with your own model server and model container, serving port, and deployment name. You can optionally configure LoRA adapters in the Deployment, or deploy the base model.

  1. To deploy on the nvidia-h100-80gb accelerator type, save the following manifest as vllm-llama3.1-8b-instruct.yaml. This manifest defines a Kubernetes Deployment that includes your model and the model server:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: vllm-llama3.1-8b-instruct
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: vllm-llama3.1-8b-instruct
      template:
        metadata:
          labels:
            app: vllm-llama3.1-8b-instruct
        spec:
          containers:
            - name: vllm
              # Versions of vllm after v0.8.5 have an issue due to an update in NVIDIA driver path.
              # The following workaround can be used until the fix is applied to the vllm release
              # BUG: https://github.com/vllm-project/vllm/issues/18859
              image: "vllm/vllm-openai:latest"
              imagePullPolicy: Always
              command: ["sh", "-c"]
              args:
              - >-
                PATH=$PATH:/usr/local/nvidia/bin
                LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
                python3 -m vllm.entrypoints.openai.api_server
                --model meta-llama/Llama-3.1-8B-Instruct
                --tensor-parallel-size 1
                --port 8000
                --enable-lora
                --max-loras 2
                --max-cpu-loras 12
              env:
                # Enabling LoRA support temporarily disables automatic v1, we want to force it on
                # until 0.8.3 vLLM is released.
                - name: VLLM_USE_V1
                  value: "1"
                - name: PORT
                  value: "8000"
                - name: HUGGING_FACE_HUB_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-secret
                      key: hf_api_token
                - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
                  value: "true"
              ports:
                - containerPort: 8000
                  name: http
                  protocol: TCP
              lifecycle:
                preStop:
                  # vLLM stops accepting connections when it receives SIGTERM, so we need to sleep
                  # to give upstream gateways a chance to take us out of rotation. The time we wait
                  # is dependent on the time it takes for all upstreams to completely remove us from
                  # rotation. Older or simpler load balancers might take upwards of 30s, but we expect
                  # our deployment to run behind a modern gateway like Envoy which is designed to
                  # probe for readiness aggressively.
                  sleep:
                    # Upstream gateway probers for health should be set on a low period, such as 5s,
                    # and the shorter we can tighten that bound the faster that we release
                    # accelerators during controlled shutdowns. However, we should expect variance,
                    # as load balancers may have internal delays, and we don't want to drop requests
                    # normally, so we're often aiming to set this value to a p99 propagation latency
                    # of readiness -> load balancer taking backend out of rotation, not the average.
                    #
                    # This value is generally stable and must often be experimentally determined
                    # for a given load balancer and health check period. We set the value here to
                    # the highest value we observe on a supported load balancer, and we recommend
                    # tuning this value down and verifying no requests are dropped.
                    #
                    # If this value is updated, be sure to update terminationGracePeriodSeconds.
                    #
                    seconds: 30
                  #
                  # IMPORTANT: preStop.sleep is beta as of Kubernetes 1.30 - for older versions
                  # replace with this exec action.
                  #exec:
                  #  command:
                  #  - /usr/bin/sleep
                  #  - 30
              livenessProbe:
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                # vLLM's health check is simple, so we can more aggressively probe it.  Liveness
                # check endpoints should always be suitable for aggressive probing.
                periodSeconds: 1
                successThreshold: 1
                # vLLM has a very simple health implementation, which means that any failure is
                # likely significant. However, any liveness triggered restart requires the very
                # large core model to be reloaded, and so we should bias towards ensuring the
                # server is definitely unhealthy vs immediately restarting. Use 5 attempts as
                # evidence of a serious problem.
                failureThreshold: 5
                timeoutSeconds: 1
              readinessProbe:
                httpGet:
                  path: /health
                  port: http
                  scheme: HTTP
                # vLLM's health check is simple, so we can more aggressively probe it.  Readiness
                # check endpoints should always be suitable for aggressive probing, but may be
                # slightly more expensive than liveness probes.
                periodSeconds: 1
                successThreshold: 1
                # vLLM has a very simple health implementation, which means that any failure is
                # likely significant,
                failureThreshold: 1
                timeoutSeconds: 1
              # We set a startup probe so that we don't begin directing traffic or checking
              # liveness to this instance until the model is loaded.
              startupProbe:
                # Failure threshold is when we believe startup will not happen at all, and is set
                # to the maximum possible time we believe loading a model will take. In our
                # default configuration we are downloading a model from HuggingFace, which may
                # take a long time, then the model must load into the accelerator. We choose
                # 10 minutes as a reasonable maximum startup time before giving up and attempting
                # to restart the pod.
                #
                # IMPORTANT: If the core model takes more than 10 minutes to load, pods will crash
                # loop forever. Be sure to set this appropriately.
                failureThreshold: 600
                # Set delay to start low so that if the base model changes to something smaller
                # or an optimization is deployed, we don't wait unnecessarily.
                initialDelaySeconds: 2
                # As a startup probe, this stops running and so we can more aggressively probe
                # even a moderately complex startup - this is a very important workload.
                periodSeconds: 1
                httpGet:
                  # vLLM does not start the OpenAI server (and hence make /health available)
                  # until models are loaded. This may not be true for all model servers.
                  path: /health
                  port: http
                  scheme: HTTP
    
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /data
                  name: data
                - mountPath: /dev/shm
                  name: shm
                - name: adapters
                  mountPath: "/adapters"
          # This is the second container in the Pod, a sidecar to the vLLM container.
          # It watches the ConfigMap and downloads LoRA adapters.
            - name: lora-adapter-syncer
              image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
              imagePullPolicy: Always
              env:
                - name: DYNAMIC_LORA_ROLLOUT_CONFIG
                  value: "/config/configmap.yaml"
              volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths
              - name: config-volume
                mountPath: /config
          restartPolicy: Always
    
          # vLLM allows VLLM_PORT to be specified as an environment variable, but a user might
          # create a 'vllm' service in their namespace. That auto-injects VLLM_PORT in docker
          # compatible form as `tcp://<IP>:<PORT>` instead of the numeric value vLLM accepts
          # causing CrashLoopBackoff. Set service environment injection off by default.
          enableServiceLinks: false
    
          # Generally, the termination grace period needs to last longer than the slowest request
          # we expect to serve plus any extra time spent waiting for load balancers to take the
          # model server out of rotation.
          #
          # An easy starting point is the p99 or max request latency measured for your workload,
          # although LLM request latencies vary significantly if clients send longer inputs or
          # trigger longer outputs. Since steady state p99 will be higher than the latency
          # to drain a server, you may wish to slightly lower this value either experimentally or
          # via the calculation below.
          #
          # For most models you can derive an upper bound for the maximum drain latency as
          # follows:
          #
          #   1. Identify the maximum context length the model was trained on, or the maximum
          #      allowed length of output tokens configured on vLLM (llama2-7b was trained to
          #      4k context length, while llama3-8b was trained to 128k).
          #   2. Output tokens are the more compute intensive to calculate and the accelerator
          #      will have a maximum concurrency (batch size) - the time per output token at
          #      maximum batch with no prompt tokens being processed is the slowest an output
          #      token can be generated (for this model it would be about 10ms TPOT at a max
          #      batch size around 50, or 100 tokens/sec)
          #   3. Calculate the worst case request duration if a request starts immediately
          #      before the server stops accepting new connections - generally when it receives
          #      SIGTERM (for this model that is about 4096 / 100 ~ 40s)
          #   4. If there are any requests generating prompt tokens that will delay when those
          #      output tokens start, and prompt token generation is roughly 6x faster than
          #      compute-bound output token generation, so add 40% to the time from above (40s +
          #      16s = 56s)
          #
          # Thus we think it will take us at worst about 56s to complete the longest possible
          # request the model is likely to receive at maximum concurrency (highest latency)
          # once requests stop being sent.
          #
          # NOTE: This number will be lower than steady state p99 latency since we stop receiving
          #       new requests which require continuous prompt token computation.
          # NOTE: The max timeout for backend connections from gateway to model servers should
          #       be configured based on steady state p99 latency, not drain p99 latency
          #
          #   5. Add the time the pod takes in its preStop hook to allow the load balancers to
          #      stop sending us new requests (56s + 30s = 86s).
          #
          # Because the termination grace period controls when the Kubelet forcibly terminates a
          # stuck or hung process (a possibility due to a GPU crash), there is operational safety
          # in keeping the value roughly proportional to the time to finish serving. There is also
          # value in adding a bit of extra time to deal with unexpectedly long workloads.
          #
          #   6. Add a 50% safety buffer to this time (86s * 1.5 ≈ 130s).
          #
          # One additional source of drain latency is that some workloads may run close to
          # saturation and have queued requests on each server. Since traffic in excess of the
          # max sustainable QPS will result in timeouts as the queues grow, we assume that failure
          # to drain in time due to excess queues at the time of shutdown is an expected failure
          # mode of server overload. If your workload occasionally experiences high queue depths
          # due to periodic traffic, consider increasing the safety margin above to account for
          # time to drain queued requests.
          terminationGracePeriodSeconds: 130
          nodeSelector:
            cloud.google.com/gke-accelerator: "nvidia-h100-80gb"
          volumes:
            - name: data
              emptyDir: {}
            - name: shm
              emptyDir:
                medium: Memory
            - name: adapters
              emptyDir: {}
            - name: config-volume
              configMap:
                name: vllm-llama3.1-8b-adapters
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: vllm-llama3.1-8b-adapters
    data:
      configmap.yaml: |
          vLLMLoRAConfig:
            name: vllm-llama3.1-8b-instruct
            port: 8000
            defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
            ensureExist:
              models:
              - id: food-review
                source: Kawon/llama3.1-food-finetune_v14_r8
              - id: cad-fabricator
                source: redcathode/fabricator
    ---
    kind: HealthCheckPolicy
    apiVersion: networking.gke.io/v1
    metadata:
      name: health-check-policy
      namespace: default
    spec:
      targetRef:
        group: "inference.networking.k8s.io"
        kind: InferencePool
        name: vllm-llama3.1-8b-instruct
      default:
        config:
          type: HTTP
          httpHealthCheck:
              requestPath: /health
              port: 8000
    
  2. Apply the manifest to your cluster:

    kubectl apply -f vllm-llama3.1-8b-instruct.yaml
    
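
    Optionally, watch the rollout until the model server Pods are ready. Downloading and loading the model can take several minutes, as noted in the startup probe comments in the manifest. This is an optional check, not a required step:

    kubectl rollout status deployment/vllm-llama3.1-8b-instruct
    kubectl get pods -l app=vllm-llama3.1-8b-instruct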

Create the InferencePool resource

The InferencePool Kubernetes custom resource defines a group of Pods that share a common base LLM and compute configuration.

The InferencePool custom resource includes the following key fields:

  • selector: specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods.
  • targetPort: defines the port that the model server uses within the Pods.

The InferencePool resource lets GKE Inference Gateway route traffic to your model server Pods.

To create an InferencePool by using Helm, run the following command:

helm install vllm-llama3.1-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=vllm-llama3.1-8b-instruct \
  --set provider.name=gke \
  --set healthCheckPolicy.create=false \
  --version v1.0.0 \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool

Change the following field to match your Deployment:

  • inferencePool.modelServers.matchLabels.app: the key of the label used to select your model server Pods.

This command creates an InferencePool object that logically represents your model server Deployment, and it references the model endpoint services within the Pods that the Selector selects.
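
Optionally, confirm that the Helm release and the InferencePool object exist. This is a minimal check; the fully qualified resource name below assumes the inference.networking.k8s.io API group that's referenced elsewhere in this tutorial:

helm status vllm-llama3.1-8b-instruct
kubectl get inferencepools.inference.networking.k8s.io vllm-llama3.1-8b-instruct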

Create InferenceObjective resources with serving criticality

The InferenceObjective custom resource defines the serving properties of a model, including its priority. You must create InferenceObjective resources to define which models are served on the InferencePool. These resources can reference base models or LoRA adapters that are supported by the model servers in the InferencePool.

The metadata.name field specifies the name of the model, the priority field sets its serving criticality, and the poolRef field links to the InferencePool on which the model is served.

To create an InferenceObjective, follow these steps:

  1. Save the following sample manifest as inferenceobjective.yaml:

    apiVersion: inference.networking.x-k8s.io/v1alpha2
    kind: InferenceObjective
    metadata:
      name: MODEL_NAME
    spec:
      priority: VALUE
      poolRef:
        name: INFERENCE_POOL_NAME
        kind: "InferencePool"
    

    Replace the following:

    • MODEL_NAME: the name of the base model or LoRA adapter. For example, food-review.
    • VALUE: the priority of the inference objective. This is an integer where a higher value indicates a more critical request. For example, 10.
    • INFERENCE_POOL_NAME: the name of the InferencePool that you created in the previous step. For example, vllm-llama3.1-8b-instruct.
  2. Apply the sample manifest to your cluster:

    kubectl apply -f inferenceobjective.yaml
    

The following example creates two InferenceObjective objects. The first configures the food-review LoRA model on the vllm-llama3.1-8b-instruct InferencePool with a priority of 10. The second configures llama3-base-model to be served with a higher priority of 20.

apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: food-review
spec:
  priority: 10
  poolRef:
    name: vllm-llama3.1-8b-instruct
    kind: "InferencePool"
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: llama3-base-model
spec:
  priority: 20
  poolRef:
    name: vllm-llama3.1-8b-instruct
    kind: "InferencePool"

Create the Gateway

The Gateway resource acts as the entry point for external traffic into your Kubernetes cluster. It defines the listeners that accept incoming connections.

GKE Inference Gateway supports the gke-l7-rilb and gke-l7-regional-external-managed GatewayClasses. For more information, see GatewayClasses in the GKE documentation.

To create a Gateway, follow these steps:

  1. Save the following sample manifest as gateway.yaml:

    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: GATEWAY_NAME
    spec:
      gatewayClassName: gke-l7-regional-external-managed
      listeners:
        - protocol: HTTP # Or HTTPS for production
          port: 80 # Or 443 for HTTPS
          name: http
    

    Replace GATEWAY_NAME with a unique name for your Gateway resource. For example, inference-gateway.

  2. Apply the manifest to your cluster:

    kubectl apply -f gateway.yaml
    
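
    Optionally, wait for GKE to program the Gateway and assign it an IP address; this can take several minutes. The Programmed condition used below is part of the standard Gateway API status, and this check is optional:

    kubectl wait --for=condition=Programmed gateway/GATEWAY_NAME --timeout=15m
    kubectl get gateway GATEWAY_NAME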

Create the HTTPRoute resource

In this section, you create an HTTPRoute resource to define how the Gateway routes incoming HTTP requests to your InferencePool.

The HTTPRoute resource defines how the GKE Gateway routes incoming HTTP requests to backend services, which in this case is your InferencePool. It specifies matching rules (for example, headers or paths) and the backend that traffic should be forwarded to.

To create an HTTPRoute, follow these steps:

  1. Save the following sample manifest as httproute.yaml:

    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: HTTPROUTE_NAME
    spec:
      parentRefs:
      - name: GATEWAY_NAME
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: PATH_PREFIX
        backendRefs:
        - name: INFERENCE_POOL_NAME
          group: inference.networking.k8s.io
          kind: InferencePool
    

    Replace the following:

    • HTTPROUTE_NAME: a unique name for your HTTPRoute resource. For example, my-route.
    • GATEWAY_NAME: the name of the Gateway resource that you created. For example, inference-gateway.
    • PATH_PREFIX: the path prefix that you use to match incoming requests. For example, / to match all paths.
    • INFERENCE_POOL_NAME: the name of the InferencePool resource that you want to route traffic to. For example, vllm-llama3.1-8b-instruct.
  2. Apply the manifest to your cluster:

    kubectl apply -f httproute.yaml
    
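
    Optionally, check that the Gateway accepted the route by inspecting the status conditions in the HTTPRoute. This is an optional verification step:

    kubectl describe httproute HTTPROUTE_NAME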

Send an inference request

After you configure GKE Inference Gateway, you can send inference requests to your deployed model.

To send inference requests, follow these steps:

  • Retrieve the Gateway endpoint.
  • Construct a properly formatted JSON request.
  • Use curl to send the request to the /v1/completions endpoint.

This lets you generate text based on your input prompt and the parameters that you specify.

  1. To get the Gateway endpoint, run the following command:

    IP=$(kubectl get gateway/GATEWAY_NAME -o jsonpath='{.status.addresses[0].value}')
    PORT=80
    

    Replace GATEWAY_NAME with the name of your Gateway resource.

  2. To send a request to the /v1/completions endpoint by using curl, run the following command. A filled-in example follows the parameter list below:

    curl -i -X POST http://${IP}:${PORT}/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "MODEL_NAME",
        "prompt": "PROMPT_TEXT",
        "max_tokens": MAX_TOKENS,
        "temperature": "TEMPERATURE"
    }'
    

    Replace the following:

    • MODEL_NAME: the name of the model or LoRA adapter to use.
    • PROMPT_TEXT: the input prompt for the model.
    • MAX_TOKENS: the maximum number of tokens to generate in the response.
    • TEMPERATURE: controls the randomness of the output. Use 0 for deterministic output, or a higher value for more creative output.
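
    For example, the following request targets the food-review LoRA adapter configured earlier in this tutorial. The prompt text and parameter values are illustrative only; adjust them for your own workload:

    curl -i -X POST http://${IP}:${PORT}/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "food-review",
        "prompt": "Write a short review of a neighborhood ramen shop.",
        "max_tokens": 100,
        "temperature": 0
    }'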

Be aware of the following behaviors:

  • Request body: the request body can include additional parameters such as stop and top_p. For a complete list of options, see the OpenAI API specification.
  • Error handling: implement proper error handling in your client code to handle potential errors in the response. For example, check the HTTP status code in the curl response; a non-200 status code generally indicates an error.
  • Authentication and authorization: for production deployments, secure your API endpoint with authentication and authorization mechanisms. Include the appropriate headers (for example, Authorization) in your requests.

Configure observability for your Inference Gateway

GKE Inference Gateway provides visibility into the health, performance, and behavior of your inference workloads. This helps you identify and resolve issues, optimize resource utilization, and ensure the reliability of your applications. You can view these observability metrics in Cloud Monitoring through Metrics Explorer.

To configure observability for GKE Inference Gateway, see Configure observability.

Delete the deployed resources

To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:

gcloud container clusters delete CLUSTER_NAME \
    --location=CONTROL_PLANE_LOCATION

Replace the following values:

  • CONTROL_PLANE_LOCATION: the Compute Engine region of your cluster's control plane.
  • CLUSTER_NAME: the name of your cluster.

What's next