Implement a Job queuing system with quota sharing between namespaces on GKE

Standard

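This tutorial uses Kueue to show you how to implement a Job queueing system, configure workload resource and quota sharing between different namespaces on Google Kubernetes Engine (GKE), and maximize the utilization of your cluster.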

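Background

As an infrastructure engineer or cluster administrator, maximizing utilization across namespaces is very important. A batch of Jobs in one namespace might not use the full quota assigned to that namespace, while another namespace might have multiple pending Jobs. To use cluster resources efficiently among Jobs in different namespaces and to make quota management more flexible, you can configure cohorts in Kueue. A cohort is a group of ClusterQueues that can borrow unused quota from one another. A ClusterQueue governs a pool of resources such as CPU, memory, and hardware accelerators.

For more detailed definitions of all these concepts, see the Kueue documentation.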

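Objectives

This tutorial is for infrastructure engineers or cluster administrators who want to implement a Job queueing system on Kubernetes using Kueue with quota sharing.

This tutorial sets up two teams in two different namespaces. Each team has its own dedicated resources but can also borrow from the other. A third set of resources is used as spillover when Jobs accumulate.

Use the Prometheus operator to monitor Jobs and resource allocation in the different namespaces.

This tutorial covers the following steps:

1. Create a GKE cluster
2. Create the ResourceFlavors
3. For each team, create a ClusterQueue and a LocalQueue
4. Create Jobs and observe the admitted workloads
5. Borrow unused quota with cohorts
6. Add a spillover ClusterQueue that governs Spot VMs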

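Costs

This tutorial uses the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.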

Before you begin

Set up your project

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, click Create project to begin creating a new Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the GKE API.

Enable the API

Set defaults for the Google Cloud CLI

In the Google Cloud console, start a Cloud Shell instance:
Open Cloud Shell
Download the source code for this sample app:
```
git clone https://github.com/GoogleCloudPlatform/kubernetes-engine-samples
```
Set the default environment variables:
```
gcloud config set project PROJECT_ID
gcloud config set compute/region COMPUTE_REGION
```
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- COMPUTE_REGION: the Compute Engine region.

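Create a GKE cluster

Create a GKE cluster named kueue-cohort.

The cluster has six nodes in the default pool (two per zone) and no autoscaling. These are all the resources available to the teams at the start, so the teams have to compete for them.

You will see later how Kueue manages the workloads that both teams send to their respective queues.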
```
  gcloud container clusters create kueue-cohort --region COMPUTE_REGION \
  --release-channel rapid --machine-type e2-standard-4 --num-nodes 2
```
Note: This step can take up to 5 minutes to complete.
After the cluster is created, the output is similar to the following:
```
  kubeconfig entry generated for kueue-cohort.
  NAME: kueue-cohort
  LOCATION: us-central1
  MASTER_VERSION: 1.26.2-gke.1000
  MASTER_IP: 35.224.108.58
  MACHINE_TYPE: e2-standard-4
  NODE_VERSION: 1.26.2-gke.1000
  NUM_NODES: 6
  STATUS: RUNNING
```
Here, the STATUS of kueue-cohort is RUNNING.
Create a node pool named spot.

This node pool uses Spot VMs and has autoscaling enabled. It starts with 0 nodes, but later you make it available to the teams as overflow capacity.
```
gcloud container node-pools create spot --cluster=kueue-cohort --region COMPUTE_REGION  \
--spot --enable-autoscaling --max-nodes 20 --num-nodes 0 \
--machine-type e2-standard-4
```
Install a release version of Kueue on the cluster:
```
VERSION=VERSION
kubectl apply -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
```
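Replace VERSION with the letter v followed by the latest version of Kueue, for example v0.4.0. For more information about Kueue versions, see Kueue releases.

Note: The samples in the repository require v0.3.0 or later.
Wait until the Kueue controller is ready: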
```
watch kubectl -n kueue-system get pods
```
You can continue once the output is similar to the following:
```
NAME                                        READY   STATUS    RESTARTS   AGE
kueue-controller-manager-6cfcbb5dc5-rsf8k   2/2     Running   0          3m
```
Create two new namespaces called team-a and team-b:
```
kubectl create namespace team-a
kubectl create namespace team-b
```
Jobs will be generated in each namespace.

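Create the ResourceFlavors

A ResourceFlavor represents resource variations in your cluster nodes, such as different VMs (for example, Spot versus on-demand), architectures (for example, x86 versus ARM CPUs), and brands and models (for example, Nvidia A100 versus T4 GPUs).

ResourceFlavors use node labels and taints to match a set of nodes in the cluster.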

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: on-demand # This ResourceFlavor will be used for the CPU resource
spec:
  nodeLabels:
    cloud.google.com/gke-provisioning: standard # This label was applied automatically by GKE
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: spot # This ResourceFlavor will be used as added resource for the CPU resource
spec:
  nodeLabels:
    cloud.google.com/gke-provisioning: spot # This label was applied automatically by GKE
```

In this manifest:

- The ResourceFlavor on-demand has its label set to cloud.google.com/gke-provisioning: standard.
- The ResourceFlavor spot has its label set to cloud.google.com/gke-provisioning: spot.

When a workload is assigned a ResourceFlavor, Kueue assigns the Pods of the workload to nodes that match the node labels defined for the ResourceFlavor.
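
The label matching described above can be illustrated with a small Python sketch. This is a simplified model and not Kueue's implementation: it only compares node labels and ignores taints and tolerations. The helper name `matching_nodes` and the node names are hypothetical; the label key and values come from the manifest above.

```python
def matching_nodes(flavor_labels, nodes):
    """Return the names of nodes whose labels include every label of the flavor."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in flavor_labels.items())
    ]


# Hypothetical nodes; GKE applies the gke-provisioning label automatically.
nodes = {
    "default-pool-node-1": {"cloud.google.com/gke-provisioning": "standard"},
    "default-pool-node-2": {"cloud.google.com/gke-provisioning": "standard"},
    "spot-node-1": {"cloud.google.com/gke-provisioning": "spot"},
}

on_demand = {"cloud.google.com/gke-provisioning": "standard"}
spot = {"cloud.google.com/gke-provisioning": "spot"}

print(matching_nodes(on_demand, nodes))  # ['default-pool-node-1', 'default-pool-node-2']
print(matching_nodes(spot, nodes))       # ['spot-node-1']
```

A workload assigned the spot flavor would therefore only be placed on nodes from the spot node pool, which is what lets a ClusterQueue govern Spot capacity separately from on-demand capacity.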

Deploy the ResourceFlavors:

```
kubectl apply -f flavors.yaml
```

Create the ClusterQueue and LocalQueue

Set up two ClusterQueues, cq-team-a and cq-team-b, and their corresponding LocalQueues, lq-team-a and lq-team-b, namespaced to team-a and team-b respectively.

ClusterQueues are cluster-scoped objects that govern a pool of resources such as CPU, memory, and hardware accelerators. Batch administrators can restrict the visibility of these objects to batch users.

LocalQueues are namespaced objects that batch users can list. They point to ClusterQueues, from which resources are allocated to run the LocalQueue workloads.

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-team-a
spec:
  cohort: all-teams # cq-team-a and cq-team-b share the same cohort
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: team-a # Only team-a can submit jobs directly to this queue, but will be able to share it through the cohort
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: on-demand
      resources:
      - name: "cpu"
        nominalQuota: 10
        borrowingLimit: 5
      - name: "memory"
        nominalQuota: 10Gi
        borrowingLimit: 15Gi
    - name: spot # This ClusterQueue doesn't have nominalQuota for spot, but it can borrow from others
      resources:
      - name: "cpu"
        nominalQuota: 0
      - name: "memory"
        nominalQuota: 0
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a # LocalQueue under team-a namespace
  name: lq-team-a
spec:
  clusterQueue: cq-team-a # Point to the ClusterQueue cq-team-a
```

ClusterQueues allow resources to have multiple flavors. In this case, both ClusterQueues have two flavors, on-demand and spot, each providing cpu and memory resources. The quota of the ResourceFlavor spot is set to 0, so it is not used for now.

Both ClusterQueues share the same cohort, all-teams, defined in .spec.cohort. When two or more ClusterQueues share the same cohort, they can borrow unused quota from each other.
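
As a rough illustration of how nominalQuota and borrowingLimit interact, the following Python sketch computes an upper bound on what one ClusterQueue can use for a single resource of a single flavor. It is a simplified model under stated assumptions, not Kueue's admission logic: the function name `max_usable_quota` is hypothetical, and the example assumes that cq-team-b has the same nominal CPU quota as cq-team-a (its manifest is not shown here).

```python
from typing import Optional


def max_usable_quota(nominal: float, borrowing_limit: Optional[float],
                     unused_in_cohort: float) -> float:
    """Upper bound on usage for one resource of one flavor.

    Simplified model: the queue's own nominal quota plus whatever is unused
    elsewhere in the cohort, capped by borrowingLimit when one is set.
    """
    borrowable = unused_in_cohort
    if borrowing_limit is not None:
        borrowable = min(borrowable, borrowing_limit)
    return nominal + borrowable


# cq-team-a, on-demand flavor, cpu: nominalQuota 10, borrowingLimit 5 (from the manifest).
# Assumption: cq-team-b also has a nominal quota of 10 CPUs.
print(max_usable_quota(10, 5, unused_in_cohort=10))  # cq-team-b idle        -> 15
print(max_usable_quota(10, 5, unused_in_cohort=2))   # cq-team-b uses 8 CPUs -> 12
print(max_usable_quota(10, 5, unused_in_cohort=0))   # cq-team-b is full     -> 10
```

Under this model, cq-team-a never uses more than 15 CPUs of the on-demand flavor, even when more quota is idle in the cohort.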

To learn more about how cohorts work and about borrowing semantics, see the Kueue documentation.

Deploy the ClusterQueues and LocalQueues:

```
kubectl apply -f cq-team-a.yaml
kubectl apply -f cq-team-b.yaml
```

(Optional) Monitor workloads using kube-prometheus

You can use Prometheus to monitor active and pending Kueue workloads. To monitor the workloads being brought up and observe the load on each ClusterQueue, deploy kube-prometheus to the cluster under the namespace monitoring.

Download the source code for the Prometheus operator:
```
cd
git clone https://github.com/prometheus-operator/kube-prometheus.git
```
Create the CustomResourceDefinitions (CRDs):
```
kubectl create -f kube-prometheus/manifests/setup
```
Create the monitoring components:
```
kubectl create -f kube-prometheus/manifests
```
Allow the prometheus-operator to scrape metrics from Kueue components:
```
kubectl apply -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/prometheus.yaml
```

Change to the working directory:

```
cd kubernetes-engine-samples/batch/kueue-cohort
```

Set up port forwarding to the Prometheus service running in your GKE cluster:
```
kubectl --namespace monitoring port-forward svc/prometheus-k8s 9090
```
Open the Prometheus web UI at localhost:9090 in your browser.

In Cloud Shell:
1. Click Web Preview.
2. Click Change port and set the port number to 9090.
3. Click Change and Preview.
The Prometheus web UI appears.
In the Expression query box, enter the following query to create the first panel, which monitors the active workloads for the cq-team-a ClusterQueue:
```
kueue_pending_workloads{cluster_queue="cq-team-a", status="active"} or kueue_admitted_active_workloads{cluster_queue="cq-team-a"}
```
Click Add Panel.
In the Expression query box, enter the following query to create another panel that monitors the active workloads for the cq-team-b ClusterQueue:
```
kueue_pending_workloads{cluster_queue="cq-team-b", status="active"} or kueue_admitted_active_workloads{cluster_queue="cq-team-b"}
```
Click Add Panel.
In the Expression query box, enter the following query to create a panel that monitors the number of nodes in the cluster:
```
count(kube_node_info)
```

(Optional) Monitor workloads using Google Cloud Managed Service for Prometheus

You can use Google Cloud Managed Service for Prometheus to monitor active and pending Kueue workloads. For the full list of metrics, see the Kueue documentation.

Set up identity and RBAC for metrics access.

The following configuration creates four Kubernetes resources that provide metrics access for the Google Cloud Managed Service for Prometheus collectors:
- A ServiceAccount named kueue-metrics-reader in the kueue-system namespace, used to authenticate when accessing the Kueue metrics.
- A Secret associated with the kueue-metrics-reader service account, which stores the authentication token used by the collector. The token authenticates against the metrics endpoint exposed by the Kueue deployment.
- A Role named kueue-secret-reader in the kueue-system namespace, which allows reading the Secret that contains the service account token.
- A ClusterRoleBinding that grants the kueue-metrics-reader ClusterRole to the kueue-metrics-reader service account.
```
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kueue-metrics-reader
  namespace: kueue-system
---
apiVersion: v1
kind: Secret
metadata:
  name: kueue-metrics-reader-token
  namespace: kueue-system
  annotations:
    kubernetes.io/service-account.name: kueue-metrics-reader
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kueue-secret-reader
  namespace: kueue-system
rules:
- resources:
  - secrets
  apiGroups: [""]
  verbs: ["get", "list", "watch"]
  resourceNames: ["kueue-metrics-reader-token"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kueue-metrics-reader
subjects:
- kind: ServiceAccount
  name: kueue-metrics-reader
  namespace: kueue-system
roleRef:
  kind: ClusterRole
  name: kueue-metrics-reader
  apiGroup: rbac.authorization.k8s.io
```

Configure the RoleBinding for Google Cloud Managed Service for Prometheus.

Depending on whether you use an Autopilot or a Standard cluster, the RoleBinding must reference the collector service account in either the gke-gmp-system or the gmp-system namespace. This resource lets the collector service account access the kueue-metrics-reader-token Secret so that it can authenticate and scrape the Kueue metrics.

Autopilot

```
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gmp-system:collector:kueue-secret-reader
  namespace: kueue-system
roleRef:
  name: kueue-secret-reader
  kind: Role
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gke-gmp-system
  kind: ServiceAccount
```

Standard

```
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gmp-system:collector:kueue-secret-reader
  namespace: kueue-system
roleRef:
  name: kueue-secret-reader
  kind: Role
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gmp-system
  kind: ServiceAccount
```

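Configure the PodMonitoring resource.

The following resource configures monitoring for the Kueue deployment and specifies that metrics are exposed over HTTPS at the /metrics path. It uses the kueue-metrics-reader-token Secret to authenticate when scraping the metrics.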

```
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: kueue
  namespace: kueue-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
  - port: https
    interval: 30s
    path: /metrics
    scheme: https
    tls:
      insecureSkipVerify: true
    authorization:
      type: Bearer
      credentials:
        secret:
          name: kueue-metrics-reader-token
          key: token
```

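Query exported metrics

Example PromQL queries for monitoring Kueue-based systems

These PromQL queries let you monitor key Kueue metrics such as job throughput, resource utilization by queue, and workload wait times, so that you can understand system performance and identify potential bottlenecks.

Job throughput

This query calculates the per-second rate of admitted workloads over 5 minutes for each cluster_queue. Breaking the metric down by queue helps pinpoint bottlenecks, and summing it gives the overall system throughput.

Query:

```
sum(rate(kueue_admitted_workloads_total[5m])) by (cluster_queue)
```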
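
To make the meaning of the rate concrete, the following Python sketch approximates a per-second rate from counter samples. It is a deliberate simplification: it ignores counter resets and the extrapolation that Prometheus applies at the window boundaries, and it models a single time series per queue. The function name `simple_rate` and all sample values are hypothetical.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter over a window.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# Hypothetical samples of kueue_admitted_workloads_total for two queues,
# three samples each, spanning a 5-minute (300-second) window.
window = {
    "cq-team-a": [(0, 100), (150, 130), (300, 160)],
    "cq-team-b": [(0, 40), (150, 55), (300, 70)],
}

per_queue = {queue: simple_rate(samples) for queue, samples in window.items()}
print(per_queue)  # {'cq-team-a': 0.2, 'cq-team-b': 0.1}

# Summing the per-queue rates gives the overall admissions per second.
overall = sum(per_queue.values())
```

In this made-up data, cq-team-a admits 60 workloads in 300 seconds (0.2 per second), twice the rate of cq-team-b.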

Resource utilization

This query assumes that metrics.enableClusterQueueResources is enabled. It calculates the ratio of current CPU usage to the nominal CPU quota for each queue. A value close to 1 indicates high utilization. You can adapt it for memory or other resources by changing the resource label.

To install a custom-configured released version of Kueue in your cluster, see the Kueue documentation.

Query:

```
sum(kueue_cluster_queue_resource_usage{resource="cpu"}) by (cluster_queue) / sum(kueue_cluster_queue_nominal_quota{resource="cpu"}) by (cluster_queue)
```
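
The ratio can be sketched in Python as follows. The values are hypothetical, and the sketch assumes that the usage metric includes borrowed quota; if it does, a queue that borrows from its cohort can show a ratio above 1. The helper name `utilization` is also hypothetical, and skipping queues with zero nominal quota is a choice made for this sketch only.

```python
def utilization(usage, nominal_quota):
    """Ratio of usage to nominal quota per ClusterQueue.

    Queues with a nominal quota of 0 are skipped to avoid dividing by zero.
    """
    return {
        queue: usage.get(queue, 0) / quota
        for queue, quota in nominal_quota.items()
        if quota > 0
    }


# Hypothetical CPU values, already summed over all flavors of each queue.
usage = {"cq-team-a": 15, "cq-team-b": 5}
nominal = {"cq-team-a": 10, "cq-team-b": 10}

print(utilization(usage, nominal))  # {'cq-team-a': 1.5, 'cq-team-b': 0.5}
```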

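Queue wait times

This query provides the 90th percentile wait time for workloads in a specific queue. You can change the quantile value (for example, 0.5 for the median or 0.99 for the 99th percentile) to understand the wait time distribution.

Query:

```
histogram_quantile(0.9, kueue_admission_wait_time_seconds_bucket{cluster_queue="QUEUE_NAME"})
```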
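
The following Python sketch shows, in simplified form, how a quantile can be estimated from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. It is an approximation of the idea behind `histogram_quantile`, not Prometheus code, and the bucket boundaries and counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by
    upper bound, with math.inf as the last upper bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The open-ended bucket has no upper edge to interpolate to.
                return prev_bound
            # Linear interpolation within the bucket that holds the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical wait-time buckets: 10 workloads waited <= 1 s, 60 waited <= 5 s, and so on.
buckets = [(1.0, 10), (5.0, 60), (10.0, 90), (math.inf, 100)]

p90 = histogram_quantile(0.9, buckets)  # 10.0 seconds
p50 = histogram_quantile(0.5, buckets)  # about 4.2 seconds
print(p90, p50)
```

Because the estimate is interpolated, its precision depends on how finely the bucket boundaries are spaced around the quantile of interest.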

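Create Jobs and observe the admitted workloads

In this section, you create Kubernetes Jobs in the namespaces team-a and team-b. A Job controller in Kubernetes creates one or more Pods and ensures that they successfully execute a specific task.

Generate Jobs for both ClusterQueues. Each Job runs three Pods in parallel that sleep for 10 seconds, and completes after three completions. The Job is cleaned up 60 seconds after it finishes.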

```
apiVersion: batch/v1
kind: Job
metadata:
  namespace: team-a # Job under team-a namespace
  generateName: sample-job-team-a-
  labels:
    kueue.x-k8s.io/queue-name: lq-team-a # Point to the LocalQueue
spec:
  ttlSecondsAfterFinished: 60 # Job will be deleted after 60 seconds
  parallelism: 3 # This Job will have 3 replicas running at the same time
  completions: 3 # This Job requires 3 completions
  suspend: true # Set to true to allow Kueue to control the Job when it starts
  template:
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:latest
        args: ["10s"] # Sleep for 10 seconds
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
      restartPolicy: Never
```

job-team-a.yaml creates Jobs in the team-a namespace and points to the LocalQueue lq-team-a and, through it, the ClusterQueue cq-team-a.

Similarly, job-team-b.yaml creates Jobs in the team-b namespace and points to the LocalQueue lq-team-b and the ClusterQueue cq-team-b.
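
To get a feel for how many of these Jobs a queue can hold at once, the following Python sketch works through the arithmetic using the requests in the Job manifest and the quotas of cq-team-a. It is a back-of-the-envelope model that assumes each Job is admitted as a whole and counts only Pod requests; the helper names are hypothetical, and the parsers handle only the quantity formats used in this tutorial.

```python
def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as "500m" or "10" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_mib(quantity: str) -> int:
    """Parse a memory quantity with a Mi or Gi suffix into MiB."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported quantity: {quantity}")


PARALLELISM = 3  # from the Job manifest
job_cpu = PARALLELISM * cpu_millicores("500m")  # 1500 millicores per Job
job_mem = PARALLELISM * memory_mib("512Mi")     # 1536 MiB per Job


def jobs_that_fit(cpu_quota: str, mem_quota: str) -> int:
    """Number of whole Jobs that fit at once; the scarcer resource decides."""
    return min(cpu_millicores(cpu_quota) // job_cpu,
               memory_mib(mem_quota) // job_mem)


# Nominal quota of cq-team-a only: 10 CPUs and 10Gi.
print(jobs_that_fit("10", "10Gi"))  # 6
# Nominal quota plus borrowingLimit: 10 + 5 CPUs and 10Gi + 15Gi.
print(jobs_that_fit("15", "25Gi"))  # 10
```

Six Jobs use 9 CPUs and 9 Gi, and ten Jobs use 15 CPUs and 15 Gi, which lines up with the totals shown in the kubectl describe output later in this tutorial.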

Start a new terminal and run this script to generate a Job every second:
```
./create_jobs.sh job-team-a.yaml 1
```
Start another terminal and create Jobs for the team-b namespace:
```
./create_jobs.sh job-team-b.yaml 1
```
Observe the Jobs being queued up in Prometheus, or run the following command:
```
watch -n 2 kubectl get clusterqueues -o wide
```

The output is similar to the following:

```
NAME        COHORT      STRATEGY         PENDING WORKLOADS   ADMITTED WORKLOADS
cq-team-a   all-teams   BestEffortFIFO   0                   5
cq-team-b   all-teams   BestEffortFIFO   0                   4
```

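Borrow unused quota with cohorts

ClusterQueues might not be at full capacity at all times. Quota usage is not maximized when workloads are not spread evenly among the ClusterQueues. If ClusterQueues share the same cohort, they can borrow quota from one another to maximize quota utilization.

Once Jobs are queued up for both ClusterQueues cq-team-a and cq-team-b, stop the script for the team-b namespace by pressing CTRL+c in the corresponding terminal.
Once all the pending Jobs from the team-b namespace are processed, the Jobs from the team-a namespace can borrow the available resources in cq-team-b: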
```
kubectl describe clusterqueue cq-team-a
```
Because cq-team-a and cq-team-b share the same cohort called all-teams, these ClusterQueues can share resources that are not being used.
```
  Flavors Usage:
    Name:  on-demand
    Resources:
      Borrowed:  5
      Name:      cpu
      Total:     15
      Borrowed:  5Gi
      Name:      memory
      Total:     15Gi
```

Resume the script for the team-b namespace:

```
./create_jobs.sh job-team-b.yaml 3
```

Observe how the resources borrowed by cq-team-a go back to 0, while the resources of cq-team-b are used for its own workloads:

```
kubectl describe clusterqueue cq-team-a
```

```
  Flavors Usage:
    Name:  on-demand
    Resources:
      Borrowed:  0
      Name:      cpu
      Total:     9
      Borrowed:  0
      Name:      memory
      Total:     9Gi
```

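Increase quota with Spot VMs

When quota needs to be increased temporarily, for example to meet high demand from pending workloads, you can configure Kueue to accommodate the demand by adding more ClusterQueues to the cohort. ClusterQueues with unused resources can share those resources with other ClusterQueues that belong to the same cohort.

At the beginning of the tutorial, you created a node pool named spot using Spot VMs and a ResourceFlavor named spot with the label set to cloud.google.com/gke-provisioning: spot. Create a ClusterQueue to use this node pool and the ResourceFlavor that represents it.

Create a new ClusterQueue called cq-spot with the cohort set to all-teams: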

```
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cq-spot
spec:
  cohort: all-teams # Same cohort as cq-team-a and cq-team-b
  resourceGroups:
  - coveredResources: ["cpu", "memory"]
    flavors:
    - name: spot
      resources:
      - name: "cpu"
        nominalQuota: 40
      - name: "memory"
        nominalQuota: 144Gi
```

Because this ClusterQueue shares the same cohort with cq-team-a and cq-team-b, both cq-team-a and cq-team-b can borrow resources up to 15 CPU requests and 15 Gi of memory.

```
kubectl apply -f cq-spot.yaml
```

In Prometheus, observe how the admitted workloads spike for both cq-team-a and cq-team-b thanks to the quota added by cq-spot, which shares the same cohort. Alternatively, run the following command:
```
watch -n 2 kubectl get clusterqueues -o wide
```
In Prometheus, observe the number of nodes in the cluster. Alternatively, run the following command:
```
watch -n 2 kubectl get nodes -o wide
```
Stop both scripts by pressing CTRL+c in the terminals for the team-a and team-b namespaces.

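Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

Delete the project

Caution: Deleting a project has the following effects:

- Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work you've done in the project.
- Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.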

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the individual resources

Delete the Kueue quota system:

```
kubectl delete -n team-a localqueue lq-team-a
kubectl delete -n team-b localqueue lq-team-b
kubectl delete clusterqueue cq-team-a
kubectl delete clusterqueue cq-team-b
kubectl delete clusterqueue cq-spot
kubectl delete resourceflavor on-demand
kubectl delete resourceflavor spot
```

Delete the Kueue manifest:

```
VERSION=VERSION
kubectl delete -f \
  https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
```

Delete the cluster:

```
gcloud container clusters delete kueue-cohort --region=COMPUTE_REGION
```
