Deploy a batch system using Kueue

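This tutorial shows you how to optimize available resources by scheduling Jobs on Google Kubernetes Engine (GKE) with Kueue. In this tutorial, you learn to use Kueue to manage and schedule batch jobs effectively, improve resource utilization, and simplify workload management. You set up a shared cluster for two tenant teams, where each team has its own namespace and the Jobs that each team creates share global resources. You also configure Kueue to schedule the Jobs based on resource quotas that you define.

This tutorial is for cloud architects and platform engineers who want to implement a batch system using GKE. To learn more about common roles and example tasks referenced in Google Cloud content, see Common GKE Enterprise user roles and tasks.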

Before reading this page, ensure that you're familiar with the following:

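Background

A Job is an application that runs to completion, such as machine learning, rendering, simulation, analytics, CI/CD, and similar workloads.

Kueue is a cloud-native Job scheduler that works with the default Kubernetes scheduler, the Job controller, and the cluster autoscaler to provide an end-to-end batch system. Kueue implements Job queueing: based on quotas and a hierarchy for sharing resources among teams, it decides when a Job should wait and when it should start.

Kueue has the following characteristics:

- It is optimized for cloud architectures, where resources are heterogeneous, interchangeable, and scalable.
- It provides a set of APIs to manage elastic quotas and Job queueing.
- It does not re-implement existing capabilities such as autoscaling, Pod scheduling, or Job lifecycle management.
- It has built-in support for the Kubernetes batch/v1.Job API.
- It can integrate with other job APIs.

Kueue refers to jobs defined with any API as Workloads, to avoid confusion with the specific Kubernetes Job API.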

Objectives

- Create a GKE cluster
- Create the ResourceFlavor
- Create the ClusterQueue
- Create the LocalQueue
- Create Jobs and observe the admitted workloads

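Costs

This tutorial uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

When you finish this tutorial, delete the resources that you created to avoid continued billing. For more information, see Clean up.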

Before you begin

Set up your project

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, click Create project to begin creating a new Google Cloud project.

Verify that billing is enabled for your Google Cloud project.

Enable the GKE API.

Set defaults for the Google Cloud CLI

In the Google Cloud console, start a Cloud Shell instance.

Download the source code for this sample app:

git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
cd kubernetes-engine-samples/batch/kueue-intro

Set the default project and region for the gcloud CLI:
```
gcloud config set project PROJECT_ID
gcloud config set compute/region COMPUTE_REGION
```
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the Compute Engine region.

Create a GKE cluster

Create a GKE Autopilot cluster named kueue-autopilot:
```
gcloud container clusters create-auto kueue-autopilot \
  --release-channel "rapid" --region COMPUTE_REGION
```
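Autopilot clusters are fully managed and have built-in autoscaling. For details, see the GKE Autopilot documentation.

Kueue also supports Standard GKE clusters with node auto-provisioning and regular autoscaled node pools.

Note: Creating an Autopilot cluster can take up to five minutes.
After the cluster is created, the output is similar to the following: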
```
  NAME: kueue-autopilot
  LOCATION: us-central1
  MASTER_VERSION: 1.26.2-gke.1000
  MASTER_IP: 35.193.173.228
  MACHINE_TYPE: e2-medium
  NODE_VERSION: 1.26.2-gke.1000
  NUM_NODES: 3
  STATUS: RUNNING
```
In this output, the STATUS of kueue-autopilot is RUNNING.

Get authentication credentials for the cluster:

gcloud container clusters get-credentials kueue-autopilot

Install Kueue on the cluster:

VERSION=VERSION
kubectl apply --server-side -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml

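Replace VERSION with the latest version of Kueue. For more information about Kueue versions, see Kueue releases.

Wait until the Kueue Pods are ready:

watch kubectl -n kueue-system get pods

Before you continue, the output should be similar to the following: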

NAME                                        READY   STATUS    RESTARTS   AGE
kueue-controller-manager-66d8bb946b-wr2l2   2/2     Running   0          3m36s

Create two new namespaces named team-a and team-b:

kubectl create namespace team-a
kubectl create namespace team-b

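Create the ResourceFlavor

A ResourceFlavor is an object that represents the variations among the nodes in your cluster by associating them with node labels and taints. For example, you can use ResourceFlavors to represent VMs with different provisioning guarantees (for example, Spot versus on-demand), architectures (for example, x86 versus ARM CPUs), or brands and models (for example, Nvidia A100 versus T4 GPUs).

In this tutorial, the kueue-autopilot cluster has homogeneous resources. As a result, create a single ResourceFlavor for CPU, memory, ephemeral storage, and GPUs, with no labels or taints.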

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor # This ResourceFlavor will be used for all the resources

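Deploy the ResourceFlavor:

kubectl apply -f flavors.yaml

Create the ClusterQueue

A ClusterQueue is a cluster-scoped object that manages a pool of resources such as CPU, memory, and GPUs. It manages the ResourceFlavors, limits their usage, and determines the order in which workloads are admitted.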

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {} # Available to all namespaces
  queueingStrategy: BestEffortFIFO # Default queueing strategy
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu", "ephemeral-storage"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 10
      - name: "memory"
        nominalQuota: 10Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 10
      - name: "ephemeral-storage"
        nominalQuota: 10Gi

Deploy the ClusterQueue:

kubectl apply -f cluster-queue.yaml

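The admission order is determined by .spec.queueingStrategy, which has two possible configurations:

BestEffortFIFO
- The default queueing strategy.
- Workload admission follows the first in, first out (FIFO) rule, but if there is not enough quota to admit the workload at the head of the queue, the next workload in the queue is tried.
StrictFIFO
- Guarantees FIFO semantics.
- The workload at the head of the queue can block the admission of the workloads behind it until it is admitted.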
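
The difference between the two strategies can be sketched with a toy model. The following Python snippet is only an illustration under simplifying assumptions: it tracks a single resource (CPU), ignores priorities and preemption, and the function name `admit` and the job names are hypothetical. It is not how Kueue is implemented.

```python
def admit(pending, free_cpu, strategy="BestEffortFIFO"):
    """Return the names admitted in one pass over a FIFO queue.

    pending:  list of (name, cpu_request) tuples in arrival order.
    free_cpu: CPU quota that is still unused.
    """
    admitted = []
    for name, cpu in pending:
        if cpu <= free_cpu:
            free_cpu -= cpu
            admitted.append(name)
        elif strategy == "StrictFIFO":
            # The workload at the head does not fit, so nothing behind it is tried.
            break
        # BestEffortFIFO: skip this workload and try the next one in the queue.
    return admitted


queue = [("job-1", 4), ("job-2", 8), ("job-3", 2)]
print(admit(queue, free_cpu=6, strategy="BestEffortFIFO"))  # ['job-1', 'job-3']
print(admit(queue, free_cpu=6, strategy="StrictFIFO"))      # ['job-1']
```

In this toy run, job-2 does not fit in the remaining quota. BestEffortFIFO moves on and admits job-3, while StrictFIFO holds job-3 back until job-2 can be admitted.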

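In cluster-queue.yaml, you create a new ClusterQueue named cluster-queue. This ClusterQueue manages four resources (cpu, memory, nvidia.com/gpu, and ephemeral-storage) with the flavor created in flavors.yaml. The quota is consumed by the requests in the workload Pod specs.

Each flavor includes usage limits represented as .spec.resourceGroups[].flavors[].resources[].nominalQuota. In this case, the ClusterQueue admits workloads if and only if all of the following are true:

- The sum of the CPU requests is less than or equal to 10
- The sum of the memory requests is less than or equal to 10Gi
- The sum of the GPU requests is less than or equal to 10
- The sum of the ephemeral storage requests is less than or equal to 10Gi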
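
As a rough illustration of these conditions, the following Python sketch checks whether a new workload still fits within the nominal quotas. The quota values mirror cluster-queue.yaml; the names `NOMINAL_QUOTA` and `fits` and the usage figures are hypothetical, and the check is a simplification of what Kueue actually evaluates.

```python
GI = 1024 ** 3
MI = 1024 ** 2

# Nominal quotas from cluster-queue.yaml, expressed as plain numbers.
NOMINAL_QUOTA = {
    "cpu": 10.0,                    # cores
    "memory": 10 * GI,              # bytes
    "nvidia.com/gpu": 10,           # devices
    "ephemeral-storage": 10 * GI,   # bytes
}


def fits(usage, request, quota=NOMINAL_QUOTA):
    """True if current usage plus the new request stays within every quota."""
    return all(
        usage.get(name, 0) + request.get(name, 0) <= limit
        for name, limit in quota.items()
    )


# Total request of one three-Pod Job from this tutorial.
job = {"cpu": 1.5, "memory": 1536 * MI, "nvidia.com/gpu": 3, "ephemeral-storage": 1536 * MI}

print(fits({}, job))                       # True
print(fits({"nvidia.com/gpu": 9}, job))    # False
```

The second call returns False because 9 GPUs already in use plus 3 requested exceeds the quota of 10, even though every other resource would fit.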

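Create the LocalQueue

A LocalQueue is a namespaced object that accepts workloads from users in the namespace. LocalQueues from different namespaces can point to the same ClusterQueue, where they share the resource quota. In this case, the LocalQueues in the team-a and team-b namespaces point to the same ClusterQueue, cluster-queue, under .spec.clusterQueue.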

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a # LocalQueue under team-a namespace
  name: lq-team-a
spec:
  clusterQueue: cluster-queue # Point to the ClusterQueue
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-b # LocalQueue under team-b namespace
  name: lq-team-b
spec:
  clusterQueue: cluster-queue # Point to the ClusterQueue

Each team sends its workloads to the LocalQueue in its own namespace. The ClusterQueue then allocates resources to those workloads.

Deploy the LocalQueue:

kubectl apply -f local-queue.yaml
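
The routing described in this section can be pictured as a lookup keyed by namespace and queue name. The following Python sketch is only a mental model; the dictionary `LOCAL_QUEUES` and the function `resolve_cluster_queue` are hypothetical names, not part of any Kueue API.

```python
# (namespace, LocalQueue name) -> ClusterQueue name, mirroring local-queue.yaml.
LOCAL_QUEUES = {
    ("team-a", "lq-team-a"): "cluster-queue",
    ("team-b", "lq-team-b"): "cluster-queue",
}


def resolve_cluster_queue(namespace, queue_name):
    """Return the ClusterQueue behind a LocalQueue, or None if no LocalQueue matches."""
    return LOCAL_QUEUES.get((namespace, queue_name))


print(resolve_cluster_queue("team-a", "lq-team-a"))  # cluster-queue
print(resolve_cluster_queue("team-b", "lq-team-b"))  # cluster-queue
print(resolve_cluster_queue("team-b", "lq-team-a"))  # None
```

Both teams resolve to the same ClusterQueue, so they draw from one shared quota. The last call returns None because a LocalQueue is only visible inside its own namespace.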

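Create Jobs and observe the admitted workloads

In this section, you create Kubernetes Jobs in the team-a namespace. A Job controller in Kubernetes creates one or more Pods and ensures that they successfully execute a specific task.

The Job in the team-a namespace has the following attributes:

- It points to the lq-team-a LocalQueue.
- It requests GPU resources by setting the nodeSelector field to nvidia-tesla-t4.
- It consists of three Pods that sleep for 10 seconds in parallel. The Job is cleaned up after 60 seconds, according to the value defined in the ttlSecondsAfterFinished field.
- It requests 1,500 milliCPU, 1,536 Mi of memory, 1,536 Mi of ephemeral storage, and three GPUs in total, because there are three Pods.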

apiVersion: batch/v1
kind: Job
metadata:
  namespace: team-a # Job under team-a namespace
  generateName: sample-job-team-a-
  annotations:
    kueue.x-k8s.io/queue-name: lq-team-a # Point to the LocalQueue
spec:
  ttlSecondsAfterFinished: 60 # Job will be deleted after 60 seconds
  parallelism: 3 # This Job will have 3 replicas running at the same time
  completions: 3 # This Job requires 3 completions
  suspend: true # Set to true to allow Kueue to control the Job when it starts
  template:
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-tesla-t4" # Specify the GPU hardware
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
        args: ["10s"] # Sleep for 10 seconds
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
            ephemeral-storage: "512Mi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "500m"
            memory: "512Mi"
            ephemeral-storage: "512Mi"
            nvidia.com/gpu: "1"
      restartPolicy: Never

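The file job-team-b.yaml defines a similar Job in the team-b namespace, with different requests to represent a team with different needs.

For more information, see Deploy GPU workloads in Autopilot.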
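
The Job totals quoted earlier (1,500 milliCPU, 1,536 Mi, three GPUs) follow from multiplying the per-Pod requests by the parallelism. The following Python sketch reproduces that arithmetic. The helpers `parse_quantity` and `job_totals` are hypothetical and handle only the small subset of Kubernetes quantity suffixes used in this tutorial.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_quantity(quantity):
    """Parse plain numbers, 'm' (milli) values, and Ki/Mi/Gi values."""
    quantity = str(quantity)
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return float(quantity)


def job_totals(per_pod_requests, parallelism):
    """Multiply each per-Pod request by the number of parallel Pods."""
    return {
        name: parse_quantity(value) * parallelism
        for name, value in per_pod_requests.items()
    }


# Per-Pod requests from job-team-a.yaml.
requests = {
    "cpu": "500m",
    "memory": "512Mi",
    "ephemeral-storage": "512Mi",
    "nvidia.com/gpu": "1",
}
totals = job_totals(requests, parallelism=3)

print(totals["cpu"])                   # 1.5
print(totals["memory"] // 1024 ** 2)   # 1536
print(totals["nvidia.com/gpu"])        # 3.0
```

These totals are what the Job counts against the ClusterQueue quota, so roughly three such Jobs can hold GPUs at the same time before the quota of 10 GPUs blocks the next one.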

In a new terminal, watch the status of the ClusterQueue, refreshed every two seconds:
```
watch -n 2 kubectl get clusterqueue cluster-queue -o wide
```
In a new terminal, watch the status of the nodes:
```
watch -n 2 kubectl get nodes -o wide
```
In a new terminal, create Jobs in the team-a and team-b namespaces every 10 seconds and submit them to their LocalQueues:
```
./create_jobs.sh job-team-a.yaml job-team-b.yaml 10
```
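Observe the Jobs being queued, the Jobs admitted in the ClusterQueue, and the nodes being brought up by GKE Autopilot.

Note: While the nodes scale up, it is normal to see Pod warnings with the message Unschedulable.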
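
The contents of create_jobs.sh are not shown on this page. As an assumption about what such a script does, the following Python sketch creates each manifest once per round and then sleeps. The function `run_rounds` and its `runner` and `sleep` parameters are hypothetical. It uses `kubectl create` rather than `kubectl apply` because the manifests set generateName, which `kubectl apply` does not support.

```python
import subprocess
import time


def run_rounds(manifests, interval_seconds, rounds,
               runner=subprocess.run, sleep=time.sleep):
    """Create every manifest once per round, pausing between rounds.

    runner and sleep are injectable so the loop can be exercised
    without a cluster.
    """
    for _ in range(rounds):
        for manifest in manifests:
            # Each call produces a new Job with a generated name suffix.
            runner(["kubectl", "create", "-f", manifest], check=True)
        sleep(interval_seconds)


# Example (requires kubectl and cluster credentials):
# run_rounds(["job-team-a.yaml", "job-team-b.yaml"], interval_seconds=10, rounds=5)
```

Unlike the shell script, which runs until you interrupt it, this sketch stops after a fixed number of rounds.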

Get the Jobs from the team-a namespace:

kubectl -n team-a get jobs

The output is similar to the following:

NAME                      COMPLETIONS   DURATION   AGE
sample-job-team-b-t6jnr   3/3           21s        3m27s
sample-job-team-a-tm7kc   0/3                      2m27s
sample-job-team-a-vjtnw   3/3           30s        3m50s
sample-job-team-b-vn6rp   0/3                      40s
sample-job-team-a-z86h2   0/3                      2m15s
sample-job-team-b-zfwj8   0/3                      28s
sample-job-team-a-zjkbj   0/3                      4s
sample-job-team-a-zzvjg   3/3           83s        4m50s
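
To pick out the Jobs that have not finished yet from a listing like this, you could parse the COMPLETIONS column. The following Python sketch assumes the default column layout shown above; the function name `unfinished_jobs` is hypothetical, and the sample rows are copied from the output above.

```python
def unfinished_jobs(table):
    """Return the names of Jobs whose COMPLETIONS column shows done < wanted."""
    names = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, completions = fields[0], fields[1]
        done, wanted = (int(part) for part in completions.split("/"))
        if done < wanted:
            names.append(name)
    return names


SAMPLE = """\
NAME                      COMPLETIONS   DURATION   AGE
sample-job-team-a-tm7kc   0/3                      2m27s
sample-job-team-a-vjtnw   3/3           30s        3m50s
sample-job-team-a-zzvjg   3/3           83s        4m50s
"""

print(unfinished_jobs(SAMPLE))  # ['sample-job-team-a-tm7kc']
```

A Job at 0/3 may be either still queued by Kueue or already running, so this only narrows down which Jobs to inspect. For scripting, the structured output of `kubectl get jobs -o json` is likely a more robust input than the table.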

Copy a Job name from the previous step and observe the admission status and events for the Job through the Workloads API:
```
kubectl -n team-a describe workload JOB_NAME
```
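When the number of pending Jobs in the ClusterQueue starts to grow, press CTRL + C in the terminal that runs the script to stop it.
After all Jobs are completed, notice the nodes scaling down.

Note: The scale-down process can take up to two minutes.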

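Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project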

In the Google Cloud console, go to the Manage resources page.
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

Delete the Kueue quota system:

kubectl delete -n team-a localqueue lq-team-a
kubectl delete -n team-b localqueue lq-team-b
kubectl delete clusterqueue cluster-queue
kubectl delete resourceflavor default-flavor

Delete the Kueue manifest:

VERSION=VERSION
kubectl delete -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml

Delete the cluster:

gcloud container clusters delete kueue-autopilot --region=COMPUTE_REGION