Schedule GKE workloads with Topology Aware Scheduling

This page shows you how to schedule workloads on Hypercompute Cluster with GKE using Topology Aware Scheduling (TAS) and Kueue.

AI and ML workloads require significant Pod-to-Pod communication. Because of this requirement, network bandwidth between Pods directly impacts workload execution time and cost. This bandwidth depends on the placement of virtual machines (VMs) within the data center.

What is Topology Aware Scheduling (TAS)?

TAS can significantly improve the efficiency of large language model (LLM) training. TAS strategically places workers on the network topology to minimize communication overhead during gradient aggregation, which requires workers to communicate in a specific rank order. By minimizing network hops between sequentially communicating workers, TAS reduces network contention and optimizes bandwidth utilization, leading to faster convergence and shorter training times. As LLMs grow larger, TAS becomes essential for maximizing the performance and scalability of distributed training.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.

Connect to your cluster to run kubectl commands

  1. Run the following command to connect to your cluster, replacing CLUSTER_NAME with the name of your cluster:

    gcloud container clusters get-credentials CLUSTER_NAME
    

Topology of GKE nodes on A3 Ultra VMs for Hypercompute Cluster

Hypercompute Clusters with GKE use A3 Ultra VMs. You can understand the physical topology of these GKE nodes by referring to the following node labels:

  • cloud.google.com/gce-topology-block: the organization-specific ID of the reserved block in which the VM is located. A block is a collection of sub-blocks that are connected by a layer of distributed network fabric.
  • cloud.google.com/gce-topology-subblock: the organization-specific ID of the sub-block in which the VM is located. A sub-block is a group of hosts and associated connectivity hardware. For A3 Ultra, these hosts are connected by a large-scale distributed network fabric called Jupiter, that offers low predictable latency and flat bandwidth across all the hosts.
  • cloud.google.com/gce-topology-host: the organization-specific ID of the host on which the VM is located. A host is a single physical server machine in the data center. Each GKE node is provisioned on a VM instance that is provisioned on top of a physical host.
  • kubernetes.io/hostname: the hostname of the Kubernetes node. This is typically also the GKE node name.

To learn more about the terms used with Hypercompute Cluster, see Terminology.

View the physical topology of nodes in your GKE cluster

Run the following command to get the node labels for your GKE cluster nodes in a specific node pool:

kubectl get nodes -l cloud.google.com/gke-nodepool=NODE_POOL_NAME \
    -ocustom-columns='NAME:.metadata.name,BLOCK:.metadata.labels.cloud\.google\.com/gce-topology-block,SUBBLOCK:.metadata.labels.cloud\.google\.com/gce-topology-subblock,HOST:.metadata.labels.cloud\.google\.com/gce-topology-host' | sort -k2,4

Replace NODE_POOL_NAME with the name of the node pool.

The output displays the block, sub-block, and host of each of the GKE nodes in the node pool.

You can use this topology information to optimize Pod placement for your AI workloads using TAS.
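
As a rough sketch of how this output can guide placement decisions, the following shell function counts nodes per block and sub-block. It assumes that the four-column table produced by the preceding command arrives on standard input. The function name count_nodes_per_subblock and the sample node, block, sub-block, and host IDs are hypothetical.

```shell
# Hypothetical helper: count nodes per block/sub-block pair.
# Reads the NAME BLOCK SUBBLOCK HOST table on stdin and skips the header row.
count_nodes_per_subblock() {
  awk '$1 != "NAME" && NF >= 3 { count[$2 "/" $3]++ }
       END { for (key in count) print key, count[key] }' | sort
}

# Sample input with made-up IDs, standing in for real kubectl output.
printf '%s\n' \
  'NAME    BLOCK    SUBBLOCK    HOST' \
  'node-a  block-1  subblock-1  host-1' \
  'node-b  block-1  subblock-1  host-2' \
  'node-c  block-1  subblock-2  host-3' |
  count_nodes_per_subblock
```

With real data, you would pipe the kubectl get nodes command into count_nodes_per_subblock instead of the sample lines. A sub-block with many nodes is a better candidate for a workload that requires sub-block placement.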

Prepare to schedule workloads with TAS using Kueue

We recommend using TAS with Kueue, a Kubernetes-native system that manages quotas and how jobs should consume them.

Install Kueue with TAS enabled

TAS requires Kueue 0.10.0 or later, and must be explicitly enabled.
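
If Kueue is already installed, the version requirement needs a numeric comparison, because a plain string comparison would rank 0.9.x above 0.10.0. The following is a minimal sketch of such a check. The function name version_at_least is hypothetical, and the sketch assumes version strings of the form v0.10.0 or 0.10.0.

```shell
# Hypothetical helper: succeed if ACTUAL is at least MINIMUM.
# Usage: version_at_least ACTUAL MINIMUM   (for example: v0.11.2 v0.10.0)
version_at_least() {
  awk -v actual="${1#v}" -v minimum="${2#v}" 'BEGIN {
    split(actual, a, ".")
    split(minimum, m, ".")
    # Compare major, minor, and patch as numbers, left to right.
    for (i = 1; i <= 3; i++) {
      if (a[i] + 0 > m[i] + 0) exit 0
      if (a[i] + 0 < m[i] + 0) exit 1
    }
    exit 0
  }'
}

# Example with an assumed installed version of v0.9.1.
if version_at_least "v0.9.1" "v0.10.0"; then
  echo "Kueue version supports TAS"
else
  echo "Kueue version is too old for TAS"
fi
```

One way to obtain the actual version is to read the image tag of the kueue-controller-manager Deployment. This assumes that the tag matches the release version.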

You can install Kueue and enable TAS by using either a Kueue manifest or the Kueue Helm chart:

Kueue manifest

  1. Install Kueue:

    kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.0/manifests.yaml
    
  2. Enable TAS in Kueue:

    kubectl -n kueue-system patch deployment kueue-controller-manager \
      --type json \
      -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--feature-gates=TopologyAwareScheduling=true"}]'
    

Helm chart

  1. Install Kueue with TAS enabled using a Helm chart:

    helm install kueue oci://us-central1-docker.pkg.dev/k8s-staging-images/charts/kueue --version="v0.10.0" --create-namespace --namespace=kueue-system --set "controllerManager.featureGates[0].name=TopologyAwareScheduling,controllerManager.featureGates[0].enabled=true"
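
To confirm that the feature gate reached the controller, you can inspect the container arguments of the kueue-controller-manager Deployment. The following sketch applies to the manifest-based approach, which appends a command-line argument. It checks a list of arguments on standard input for the feature gate. The helper name has_tas_feature_gate and the sample arguments are hypothetical.

```shell
# Hypothetical helper: succeed if the arguments on stdin enable the TAS feature gate.
has_tas_feature_gate() {
  grep -q -- '--feature-gates=.*TopologyAwareScheduling=true'
}

# Sample arguments standing in for the real container args.
printf '%s\n' \
  '--zap-log-level=2' \
  '--feature-gates=TopologyAwareScheduling=true' |
  has_tas_feature_gate && echo "TAS feature gate is set"
```

With a live cluster, the input could come from kubectl -n kueue-system get deployment kueue-controller-manager -o jsonpath='{.spec.template.spec.containers[0].args}'. This assumes that the manager is the first container, as in the patch command.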
    

Configure Kueue

After installation, you must configure Kueue to understand the infrastructure that it's managing. Typically, Kueue requires a ClusterQueue resource quota definition of either static infrastructure, or dynamic infrastructure with cluster autoscaling enabled. The ClusterQueue admits a Workload if and only if the resources requested by the Workload are less than or equal to the pool of resources defined in the ClusterQueue. After the quota check, Kueue's admission behavior depends on whether the workload uses TAS:

  • TAS workloads: Kueue checks both the topology of the physical infrastructure and its current usage.
  • Non-TAS workloads: Kueue doesn't check the topology of the physical infrastructure. Kueue manages the entire quota defined in the config and leaves node assignment to kube-scheduler.

Review the following examples to understand two ways that you can provide a ClusterQueue resource quota definition to Kueue:

  • Very high quota: Kueue practically never stops admission of a workload based on the requested resources. Based on the TAS definitions, Kueue may or may not admit workloads based on the infrastructure topology.
  • Realistic quota: Kueue will admit the Workload if and only if the resources requested by the Workload are within these resource quota limits. Based on the TAS definitions, Kueue will then check the infrastructure topology before admitting the Workload.

All references to resource quota in the following sections refer to ClusterQueue resource quota.

Very high resource quota

The following example uses a very high resource quota, so that Kueue practically never stops a workload based on the available resource quota. Instead, Kueue uses the topology information of available nodes to try to match the topology with the requirements of the workload:

  apiVersion: kueue.x-k8s.io/v1alpha1
  kind: Topology
  metadata:
    name: "gke-default"
  spec:
    levels:
    - nodeLabel: "cloud.google.com/gce-topology-block"
    - nodeLabel: "cloud.google.com/gce-topology-subblock"
    - nodeLabel: "cloud.google.com/gce-topology-host"
    - nodeLabel: "kubernetes.io/hostname"
  ---
  kind: ResourceFlavor
  apiVersion: kueue.x-k8s.io/v1beta1
  metadata:
    name: "tas-flavor"
  spec:
    nodeLabels:
      cloud.google.com/gke-nodepool: "NODE_POOL_NAME"
    topologyName: "gke-default"
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: NoSchedule
  ---
  apiVersion: kueue.x-k8s.io/v1beta1
  kind: ClusterQueue
  metadata:
    name: "tas-cluster-queue"
  spec:
    namespaceSelector: {}
    resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
      - name: "tas-flavor"
        resources:
        - name: "nvidia.com/gpu"
          nominalQuota: 10000000
  ---
  apiVersion: kueue.x-k8s.io/v1beta1
  kind: LocalQueue
  metadata:
    namespace: "default"
    name: "tas-user-queue"
  spec:
    clusterQueue: "tas-cluster-queue"

To use this resource quota definition, replace NODE_POOL_NAME, and save the YAML file as kueue-tas-config-very-high-quota.yaml.

Then, run the following command:

kubectl create -f kueue-tas-config-very-high-quota.yaml

Realistic resource quota

The previous example only configured GPU resources. However, Kueue can manage all Kubernetes-compatible resources.

The following example defines a more realistic resource quota that includes CPU, memory, and GPU, sized for 100 a3-ultragpu-8g machines. A single machine has 224 vCPUs, 2944 GB of memory, and 8 GPUs:

  apiVersion: kueue.x-k8s.io/v1alpha1
  kind: Topology
  metadata:
    name: "gke-default"
  spec:
    levels:
    - nodeLabel: "cloud.google.com/gce-topology-block"
    - nodeLabel: "cloud.google.com/gce-topology-subblock"
    - nodeLabel: "cloud.google.com/gce-topology-host"
    - nodeLabel: "kubernetes.io/hostname"
  ---
  kind: ResourceFlavor
  apiVersion: kueue.x-k8s.io/v1beta1
  metadata:
    name: "tas-flavor"
  spec:
    nodeLabels:
      cloud.google.com/gke-nodepool: "NODE_POOL_NAME" 
    topologyName: "gke-default"
    tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: NoSchedule
  ---
  apiVersion: kueue.x-k8s.io/v1beta1
  kind: ClusterQueue
  metadata:
    name: "tas-cluster-queue"
  spec:
    namespaceSelector: {} # match all
    resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
      - name: "tas-flavor"
        resources:
        # numbers below represent quota of 100 a3-ultragpu-8g machines
        - name: "cpu"
          nominalQuota: 22400
        - name: "memory"
          nominalQuota: 294400Gi
        - name: "nvidia.com/gpu"
          nominalQuota: 800
  ---
  apiVersion: kueue.x-k8s.io/v1beta1
  kind: LocalQueue
  metadata:
    namespace: "default"
    name: "tas-user-queue"
  spec:
    clusterQueue: "tas-cluster-queue"
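
The quota values in this example are the per-machine figures multiplied by the machine count. The following sketch reproduces that arithmetic for any number of machines. The function name a3_ultra_quota is hypothetical, and the per-machine constants are the ones stated for a3-ultragpu-8g in this section, with memory expressed in Gi to match the manifest.

```shell
# Hypothetical helper: print ClusterQueue nominalQuota values for N a3-ultragpu-8g machines.
# Per-machine figures from this section: 224 vCPUs, 2944 (Gi in the manifest) memory, 8 GPUs.
a3_ultra_quota() {
  machines="$1"
  echo "cpu: $((machines * 224))"
  echo "memory: $((machines * 2944))Gi"
  echo "nvidia.com/gpu: $((machines * 8))"
}

# 100 machines yields the values used in the example manifest:
# cpu 22400, memory 294400Gi, nvidia.com/gpu 800.
a3_ultra_quota 100
```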

To use this resource quota definition, replace NODE_POOL_NAME, and save the YAML file as kueue-tas-config-real-quota.yaml.

Then, run the following command:

kubectl create -f kueue-tas-config-real-quota.yaml

Verify successful application

If the application was successful, the output should look like the following:

    topology.kueue.x-k8s.io/gke-default created
    resourceflavor.kueue.x-k8s.io/tas-flavor created
    clusterqueue.kueue.x-k8s.io/tas-cluster-queue created
    localqueue.kueue.x-k8s.io/tas-user-queue created
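
If you script the setup, you might check this output automatically. The following is a minimal sketch that reads the kubectl create output on standard input and reports any of the four expected resources that is missing. The function name check_created is hypothetical.

```shell
# Hypothetical helper: verify that all four resources were reported as created.
# Reads the output of "kubectl create -f ..." on stdin; returns non-zero if any is missing.
check_created() {
  output="$(cat)"
  status=0
  for resource in \
    "topology.kueue.x-k8s.io/gke-default" \
    "resourceflavor.kueue.x-k8s.io/tas-flavor" \
    "clusterqueue.kueue.x-k8s.io/tas-cluster-queue" \
    "localqueue.kueue.x-k8s.io/tas-user-queue"; do
    if ! printf '%s\n' "$output" | grep -q -F -x -- "$resource created"; then
      echo "missing: $resource"
      status=1
    fi
  done
  return "$status"
}

# Example using the expected output shown in this section.
printf '%s\n' \
  'topology.kueue.x-k8s.io/gke-default created' \
  'resourceflavor.kueue.x-k8s.io/tas-flavor created' \
  'clusterqueue.kueue.x-k8s.io/tas-cluster-queue created' \
  'localqueue.kueue.x-k8s.io/tas-user-queue created' |
  check_created && echo "all Kueue resources created"
```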

Schedule workloads with TAS using Kueue

The following scenarios demonstrate how you can instruct Kueue and TAS to manage common workload and infrastructure combinations using topology request types and topology request levels:

  • The following are the available topology request types (preferred or required):

    • kueue.x-k8s.io/podset-preferred-topology: Kueue will prioritize scheduling the entire workload within a given topology level, but will still admit a workload that doesn't fit within this topology level. For a workload that might have fit in a single topology level, Kueue might schedule that workload across multiple instances of that topology level.
    • kueue.x-k8s.io/podset-required-topology: Kueue will continue trying to admit this workload until the entire workload can fit within the chosen topology level.
  • The following are the available topology request levels, letting you be more or less specific about the physical infrastructure where you prefer or require your Job to run:

    • cloud.google.com/gce-topology-block
    • cloud.google.com/gce-topology-subblock
    • cloud.google.com/gce-topology-host
    • kubernetes.io/hostname

To schedule workloads using these values, use the following Job YAML file:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: JOB_NAME
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
spec:
  parallelism: NUMBER_OF_REPLICAS
  completions: NUMBER_OF_REPLICAS
  completionMode: Indexed
  template:
    metadata:
      annotations:
        ANNOTATIONS_STRING
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
        args: ["60s"]
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
      restartPolicy: Never

Replace the following variables:

  • JOB_NAME: A name for the Job.
  • NUMBER_OF_REPLICAS: The number of Pods that are running in parallel.
  • ANNOTATIONS_STRING: See the following table:

| Requested topology type and level | Description | ANNOTATIONS_STRING |
|---|---|---|
| Preferred to run within a hostname (recommended) | This configuration will admit your workload as long as there are enough resources available to satisfy your workload's resource requirements, even if the capacity is fragmented. Kueue will schedule your Pods as compactly as possible. | kueue.x-k8s.io/podset-preferred-topology: "kubernetes.io/hostname" |
| Required to run within a host | This configuration will admit your workload if and only if there is a host available with enough resources to satisfy your workload's resource requirements. This is useful when there are multiple VMs per host (for example, smaller machine types) or multiple Pods can run on a single node. In such cases, if the workload is admitted, it will run on a single host. | kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-host" |
| Preferred to run within a host | This configuration will admit your workload as long as there are enough resources available to satisfy your workload's resource requirements, even if the capacity is fragmented. Kueue will try to schedule your Pods within a host and will use additional hosts if needed. | kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-host" |
| Required to run within a sub-block | This configuration will admit your workload if and only if there is a sub-block available with enough resources to satisfy your workload's resource requirements. | kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock" |
| Preferred to run within a sub-block | This configuration will admit your workload as long as there are enough resources available to satisfy your workload's resource requirements, even if the capacity is fragmented. Kueue will try to schedule your Pods within a sub-block and will use additional sub-blocks if needed. In this case, Kueue ranks a sub-block with more available capacity, even if that capacity is fragmented, higher than a sub-block with just enough capacity to satisfy the requirements. | kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock" |
| Required to run within a block | This configuration will admit your workload if and only if the resources available within a block satisfy your workload's resource requirements. If admitted, Kueue will minimize the number of sub-blocks and hosts to schedule the workload. This might result in fragmentation of your available capacity. | kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block" |
| Preferred to run within a block | This configuration will admit your workload as long as there are enough resources available to satisfy your workload's resource requirements, even if the capacity is fragmented. Kueue will try to schedule your Pods within a block and will use additional blocks if needed. | kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block" |
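
Each ANNOTATIONS_STRING value combines one topology request type with one topology request level. As an illustration, the following sketch builds the annotation line from those two choices. The function name tas_annotation and its short level names (block, subblock, host, hostname) are hypothetical conveniences, not part of Kueue.

```shell
# Hypothetical helper: print the annotation line for a topology request.
# Usage: tas_annotation TYPE LEVEL
#   TYPE:  preferred | required
#   LEVEL: block | subblock | host | hostname
tas_annotation() {
  case "$1" in
    preferred|required) tas_type="$1" ;;
    *) echo "unknown topology request type: $1" >&2; return 1 ;;
  esac
  case "$2" in
    block|subblock|host) tas_label="cloud.google.com/gce-topology-$2" ;;
    hostname) tas_label="kubernetes.io/hostname" ;;
    *) echo "unknown topology request level: $2" >&2; return 1 ;;
  esac
  printf 'kueue.x-k8s.io/podset-%s-topology: "%s"\n' "$tas_type" "$tas_label"
}

# Examples: the recommended setting, then a strict sub-block requirement.
tas_annotation preferred hostname
tas_annotation required subblock
```

The printed line can be pasted in place of ANNOTATIONS_STRING in the Job YAML file.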

Schedule workloads using PodGroup with TAS using Kueue

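When using PodGroups, you must specify three additional fields for every Pod in a PodGroup:

  • The kueue.x-k8s.io/pod-group-name label: the name of the PodGroup that the Pod belongs to.
  • The kueue.x-k8s.io/pod-group-pod-index label: the index of the Pod within the PodGroup.
  • The kueue.x-k8s.io/pod-group-total-count annotation: the total number of Pods in the PodGroup.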

Depending on the ML framework you use, a leader of a PodGroup can either require a GPU or not require a GPU. Because of a limitation of Kueue, these cases need to be handled differently. The following examples demonstrate how to create a PodGroup of three Pods with one leader and two workers.

Case 1: Leader is also a worker and requires a GPU

If the leader is one of the workers and also requires a GPU, then the leader can have any index within the PodGroup. For simplicity, in the following example the index of the leader is 0:

apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-leader-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
    kueue.x-k8s.io/pod-group-pod-index: "0"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  containers:
  - name: leader
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
  restartPolicy: Never
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-worker-1-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
    kueue.x-k8s.io/pod-group-pod-index: "1"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-worker-2-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
    kueue.x-k8s.io/pod-group-pod-index: "2"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
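
Because the Pods in this case differ only in their index, the manifests can be generated rather than written by hand. The following is a sketch, not part of Kueue. The function name render_podgroup is hypothetical, every Pod uses a container named worker, and the generateName prefix is derived from the group name instead of the leader and worker names used in the example.

```shell
# Hypothetical generator for a PodGroup in which every Pod requests one GPU.
# Usage: render_podgroup GROUP_NAME TOTAL_COUNT
render_podgroup() {
  group="$1"
  total="$2"
  index=0
  while [ "$index" -lt "$total" ]; do
    # Emit one Pod manifest per index, with the same group name and total count.
    cat <<EOF
---
apiVersion: v1
kind: Pod
metadata:
  generateName: ${group}-${index}-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "${group}"
    kueue.x-k8s.io/pod-group-pod-index: "${index}"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "${total}"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
EOF
    index=$((index + 1))
  done
}

# Print three Pod manifests with indexes 0, 1, and 2.
render_podgroup tas-podgroup-example-group 3
```

The output can be piped to kubectl create -f - to submit the group. Use create rather than apply, because the manifests use generateName.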

Case 2: Leader is not a worker and doesn't require a GPU

If the leader isn't one of the workers and doesn't require a GPU, then, because of the Kueue limitation, the leader must have the last index in the PodGroup. This requirement comes from how Kueue creates PodSets. If the leader doesn't have the last index and the first worker doesn't use the first index, Kueue won't apply rank assignments.

See the following example:

---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-leader-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
    kueue.x-k8s.io/pod-group-pod-index: "2"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  containers:
  - name: leader
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "1"
  restartPolicy: Never
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-worker-0-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
    kueue.x-k8s.io/pod-group-pod-index: "0"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
  generateName: tas-podgroup-worker-1-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
    kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
    kueue.x-k8s.io/pod-group-pod-index: "1"
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "3"
    kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
  restartPolicy: Never
  containers:
  - name: worker
    image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
    args: ["600s"]
    resources:
      requests:
        nvidia.com/gpu: "1"
      limits:
        nvidia.com/gpu: "1"
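
The index rule for this case can be summarized as: workers take indexes 0 through N-2, and the leader takes index N-1, where N is the total Pod count. The following sketch prints that assignment. The function name plan_podgroup_indices is hypothetical.

```shell
# Hypothetical helper: print the role-to-index assignment for a PodGroup
# whose leader is not a worker. Workers start at index 0; the leader is last.
# Usage: plan_podgroup_indices TOTAL_COUNT
plan_podgroup_indices() {
  total="$1"
  last=$((total - 1))
  index=0
  while [ "$index" -lt "$last" ]; do
    echo "worker ${index}"
    index=$((index + 1))
  done
  echo "leader ${last}"
}

# For the three-Pod example: worker 0, worker 1, leader 2.
plan_podgroup_indices 3
```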

What's next