This page shows you how to schedule workloads on Hypercompute Cluster with GKE (Preview) using Topology Aware Scheduling (TAS) and Kueue.
AI and ML workloads require significant Pod-to-Pod communication. Because of this requirement, network bandwidth between Pods directly impacts workload execution time and cost. This bandwidth depends on the placement of virtual machines (VMs) within the data center.
What is Topology Aware Scheduling (TAS)?
TAS can significantly improve the efficiency of large language model (LLM) training. TAS strategically places workers on the network topology to minimize communication overhead during gradient aggregation, which requires workers to communicate in a specific rank order. By minimizing network hops between sequentially communicating workers, TAS reduces network contention and optimizes bandwidth utilization, leading to faster convergence and shorter training times. As LLMs grow ever larger, TAS is essential for maximizing the performance and scalability of distributed training.
Before you begin
Before you start, make sure you have performed the following tasks:
- Enable the Google Kubernetes Engine API.
- If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
Connect to your cluster to run kubectl commands
To connect to your cluster, run the following command, replacing CLUSTER_NAME with the name of your cluster:
gcloud container clusters get-credentials CLUSTER_NAME
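If your default compute location isn't configured for the gcloud CLI, you might also need to pass the cluster's location explicitly. For example, with an illustrative cluster name and region:
gcloud container clusters get-credentials my-gke-cluster --location=us-central1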
Topology of GKE nodes on A3 Ultra VMs for Hypercompute Cluster
Hypercompute Clusters with GKE use A3 Ultra VMs. You can understand the physical topology of these GKE nodes by referring to the following node labels:
- cloud.google.com/gce-topology-block: the organization-specific ID of the reserved block in which the VM is located. A block is a collection of sub-blocks that are connected by a layer of distributed network fabric.
- cloud.google.com/gce-topology-subblock: the organization-specific ID of the sub-block in which the VM is located. A sub-block is a group of hosts and associated connectivity hardware. For A3 Ultra, these hosts are connected by a large-scale distributed network fabric called Jupiter, which offers low, predictable latency and flat bandwidth across all the hosts.
- cloud.google.com/gce-topology-host: the organization-specific ID of the host on which the VM is located. A host is a single physical server machine in the data center. Each GKE node runs on a VM instance that is provisioned on top of a physical host.
- kubernetes.io/hostname: the hostname of the Kubernetes node. This is typically also the GKE node name.
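To quickly inspect these labels on a single node, you can filter the node's YAML output. NODE_NAME here is a placeholder for one of your node names:
kubectl get node NODE_NAME -o yaml | grep gce-topology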
To learn more about the terms used with Hypercompute Cluster, see Terminology.
View the physical topology of nodes in your GKE cluster
Run the following command to get the node labels for your GKE cluster nodes in a specific node pool:
kubectl get nodes -l cloud.google.com/gke-nodepool=NODE_POOL_NAME \
  -o custom-columns='NAME:.metadata.name,BLOCK:.metadata.labels.cloud\.google\.com/gce-topology-block,SUBBLOCK:.metadata.labels.cloud\.google\.com/gce-topology-subblock,HOST:.metadata.labels.cloud\.google\.com/gce-topology-host' \
  | sort -k2,4
Replace NODE_POOL_NAME with the name of the node pool.
The output displays the block, sub-block, and host of each of the GKE nodes in the node pool.
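For example, the output is similar to the following, where the node names and topology IDs are illustrative placeholders:
NAME                                 BLOCK     SUBBLOCK      HOST
gke-mycluster-a3u-12345678-abcd      block-a   subblock-a1   host-001
gke-mycluster-a3u-12345678-efgh      block-a   subblock-a1   host-002
gke-mycluster-a3u-12345678-ijkl      block-a   subblock-a2   host-003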
You can use this topology information to optimize Pod placement for your AI workloads using TAS.
Prepare to schedule workloads with TAS using Kueue
We recommend using TAS with Kueue, a Kubernetes-native system that manages quotas and how jobs should consume them.
Install Kueue with TAS enabled
TAS requires Kueue 0.10.0 or later, and must be explicitly enabled.
You can install Kueue and enable TAS through a Kueue manifest or Kueue's Helm chart:
Kueue manifest
Install Kueue:
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.0/manifests.yaml
Enable TAS in Kueue:
kubectl -n kueue-system patch deployment kueue-controller-manager \
  --type json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--feature-gates=TopologyAwareScheduling=true"}]'
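Optionally, to confirm that the feature gate was added, inspect the container arguments of the patched deployment:
kubectl -n kueue-system get deployment kueue-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].args}'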
Helm chart
Install Kueue with TAS enabled using a Helm chart:
helm install kueue oci://us-central1-docker.pkg.dev/k8s-staging-images/charts/kueue \
  --version="v0.10.0" \
  --create-namespace \
  --namespace=kueue-system \
  --set "controllerManager.featureGates[0].name=TopologyAwareScheduling,controllerManager.featureGates[0].enabled=true"
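With either installation method, you can wait for the Kueue controller to become available before you continue:
kubectl -n kueue-system wait deployment/kueue-controller-manager \
  --for=condition=Available --timeout=5m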
Configure Kueue
After installation, you must configure Kueue to understand the infrastructure that it manages. Typically, Kueue requires a ClusterQueue resource quota definition, either for static infrastructure or for dynamic infrastructure with cluster autoscaling enabled. The ClusterQueue admits a Workload if and only if the resources requested by the workload are less than or equal to the pool of resources defined in the ClusterQueue. After this check, Kueue admits workloads using TAS in the following way:
- TAS workloads: Kueue checks both the topology of the physical infrastructure and its current usage.
- Non-TAS workloads: Kueue doesn't check the topology of the physical infrastructure. Kueue manages the entire quota defined in the config and leaves node assignment to kube-scheduler.
Review the following examples to understand two ways that you can provide a ClusterQueue resource quota definition to Kueue:
- Very high quota: Kueue practically never stops admission of a workload based on the requested resources. Whether a workload is admitted then depends on the infrastructure topology, according to the TAS definitions.
- Realistic quota: Kueue admits the Workload if and only if the resources requested by the Workload are within these resource quota limits. Based on the TAS definitions, Kueue then checks the infrastructure topology before admitting the Workload.
All references to resource quota in the following sections refer to ClusterQueue resource quota.
Very high resource quota
The following example uses a very high resource quota, such that Kueue practically never stops a workload based on the available resource quota. Instead, Kueue uses the topology information of available nodes to try to match the topology with the requirements of the workload:
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
name: "gke-default"
spec:
levels:
- nodeLabel: "cloud.google.com/gce-topology-block"
- nodeLabel: "cloud.google.com/gce-topology-subblock"
- nodeLabel: "cloud.google.com/gce-topology-host"
- nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
metadata:
name: "tas-flavor"
spec:
nodeLabels:
cloud.google.com/gke-nodepool: "NODE_POOL_NAME"
topologyName: "gke-default"
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "tas-cluster-queue"
spec:
namespaceSelector: {}
resourceGroups:
- coveredResources: ["nvidia.com/gpu"]
flavors:
- name: "tas-flavor"
resources:
- name: "nvidia.com/gpu"
nominalQuota: 10000000
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
namespace: "default"
name: "tas-user-queue"
spec:
clusterQueue: "tas-cluster-queue"
To use this resource quota definition, replace NODE_POOL_NAME with the name of your node pool, and save the YAML file as kueue-tas-config-very-high-quota.yaml.
Then, run the following command:
kubectl create -f kueue-tas-config-very-high-quota.yaml
Realistic resource quota
The previous example only configured GPU resources. However, Kueue can manage all Kubernetes-compatible resources. The following example defines a more realistic resource quota that includes CPU, memory, and GPU, sized for 100 a3-ultragpu-8g machines. A single machine has 224 vCPUs, 2944 GB of memory, and 8 GPUs:
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
name: "gke-default"
spec:
levels:
- nodeLabel: "cloud.google.com/gce-topology-block"
- nodeLabel: "cloud.google.com/gce-topology-subblock"
- nodeLabel: "cloud.google.com/gce-topology-host"
- nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
metadata:
name: "tas-flavor"
spec:
nodeLabels:
cloud.google.com/gke-nodepool: "NODE_POOL_NAME"
topologyName: "gke-default"
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
name: "tas-cluster-queue"
spec:
namespaceSelector: {} # match all
resourceGroups:
- coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
flavors:
- name: "tas-flavor"
resources:
# numbers below represent quota of 100 a3-ultragpu-8g machines
- name: "cpu"
nominalQuota: 22400
- name: "memory"
nominalQuota: 294400Gi
- name: "nvidia.com/gpu"
nominalQuota: 800
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
namespace: "default"
name: "tas-user-queue"
spec:
clusterQueue: "tas-cluster-queue"
To use this resource quota definition, replace NODE_POOL_NAME with the name of your node pool, and save the YAML file as kueue-tas-config-real-quota.yaml.
Then, run the following command:
kubectl create -f kueue-tas-config-real-quota.yaml
Verify successful application
If the resources were applied successfully, the output looks like the following:
topology.kueue.x-k8s.io/gke-default created
resourceflavor.kueue.x-k8s.io/tas-flavor created
clusterqueue.kueue.x-k8s.io/tas-cluster-queue created
localqueue.kueue.x-k8s.io/tas-user-queue created
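After the resources are created, you can inspect the queue state at any time. For example, the following standard kubectl commands show the ClusterQueue status, including quota usage, and the LocalQueue in the default namespace:
kubectl get clusterqueue tas-cluster-queue -o yaml
kubectl -n default get localqueue tas-user-queue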
Schedule workloads with TAS using Kueue
The following scenarios demonstrate how you can instruct Kueue and TAS to manage common workload and infrastructure combinations using topology request types and topology request levels:
The following are the available topology request types (preferred or required):
- kueue.x-k8s.io/podset-preferred-topology: Kueue prioritizes scheduling the entire workload within a given topology level, but still admits a workload that doesn't fit within this topology level. For a workload that might have fit in a single topology level, Kueue might schedule that workload across multiple instances of that topology level.
- kueue.x-k8s.io/podset-required-topology: Kueue keeps trying to admit this workload until the entire workload can fit within the chosen topology level.
The following are the available topology request levels, which let you be more or less specific about the physical infrastructure where you prefer or require your Job to run:
- cloud.google.com/gce-topology-block
- cloud.google.com/gce-topology-subblock
- cloud.google.com/gce-topology-host
- kubernetes.io/hostname
To schedule workloads using these values, use the following Job YAML file:
apiVersion: batch/v1
kind: Job
metadata:
generateName: JOB_NAME
labels:
kueue.x-k8s.io/queue-name: tas-user-queue
spec:
parallelism: NUMBER_OF_REPLICAS
completions: NUMBER_OF_REPLICAS
completionMode: Indexed
template:
metadata:
annotations:
ANNOTATIONS_STRING
spec:
containers:
- name: dummy-job
image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
args: ["60s"]
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
restartPolicy: Never
Replace the following variables:
- JOB_NAME: A name for the Job.
- NUMBER_OF_REPLICAS: The number of Pods that run in parallel.
- ANNOTATIONS_STRING: See the following table:
| Requested topology type and level | Description | ANNOTATIONS_STRING |
|---|---|---|
| Preferred to run within a hostname (recommended) | This configuration admits your workload as long as there are enough resources available to satisfy your workload's resource requirements, even if the capacity is fragmented. Kueue schedules your Pods as compactly as possible. | kueue.x-k8s.io/podset-preferred-topology: "kubernetes.io/hostname" |
| Required to run within a host | This configuration admits your workload if and only if there is a host available with enough resources to satisfy your workload's resource requirements. This is useful when there are multiple VMs per host (for example, smaller machine types) or multiple Pods can run on a single node. In such cases, if the workload is admitted, it runs on a single host. | kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-host" |
| Preferred to run within a host | This configuration admits your workload as long as there are enough resources available to satisfy your workload's resource requirements, even if the capacity is fragmented. Kueue tries to schedule your Pods within a host and uses additional hosts if needed. | kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-host" |
| Required to run within a sub-block | This configuration admits your workload if and only if there is a sub-block available with enough resources to satisfy your workload's resource requirements. | kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock" |
| Preferred to run within a sub-block | This configuration admits your workload as long as there are enough resources available to satisfy your workload's resource requirements, even if the capacity is fragmented. Kueue tries to schedule your Pods within a sub-block and uses additional sub-blocks if needed. In this case, Kueue ranks a sub-block with more available capacity higher, even if it is fragmented, compared to a sub-block with just enough capacity to satisfy the requirements. | kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-subblock" |
| Required to run within a block | This configuration admits your workload if and only if the resources available within a block satisfy your workload's resource requirements. If admitted, Kueue minimizes the number of sub-blocks and hosts used to schedule the workload, which might fragment your available capacity. | kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block" |
| Preferred to run within a block | This configuration admits your workload as long as there are enough resources available to satisfy your workload's resource requirements, even if the capacity is fragmented. Kueue tries to schedule your Pods within a block and uses additional blocks if needed. | kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block" |
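For example, the following hypothetical Job (the name and replica count are illustrative) runs 16 replicas and requires that all of them be scheduled within a single sub-block:
apiVersion: batch/v1
kind: Job
metadata:
  generateName: tas-subblock-job-
  labels:
    kueue.x-k8s.io/queue-name: tas-user-queue
spec:
  parallelism: 16
  completions: 16
  completionMode: Indexed
  template:
    metadata:
      annotations:
        # Require the whole PodSet to land in one sub-block.
        kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock"
    spec:
      containers:
      - name: dummy-job
        image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
        args: ["60s"]
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
      restartPolicy: Never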
Schedule workloads using PodGroup with TAS using Kueue
When using PodGroups, you must specify three additional fields for every Pod in a PodGroup:
- Labels:
- kueue.x-k8s.io/pod-group-name: the name of a PodGroup used for aggregation.
- kueue.x-k8s.io/pod-group-pod-index: the index of each individual Pod within the PodGroup.
- Annotations:
- kueue.x-k8s.io/pod-group-total-count: the total count of Pods within a PodGroup.
Depending on the ML framework you use, the leader of a PodGroup might or might not require a GPU. Because of a limitation in Kueue, these cases must be handled differently. The following examples demonstrate how to create a PodGroup of three Pods with one leader and two workers.
Case 1: Leader is also a worker and requires a GPU
If the leader is one of the workers and also requires a GPU, then the leader can have any index within the PodGroup. For simplicity, in the following example the index of the leader is 0:
apiVersion: v1
kind: Pod
metadata:
generateName: tas-podgroup-leader-
labels:
kueue.x-k8s.io/queue-name: tas-user-queue
kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
kueue.x-k8s.io/pod-group-pod-index: "0"
annotations:
kueue.x-k8s.io/pod-group-total-count: "3"
kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
containers:
- name: leader
image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
args: ["600s"]
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
restartPolicy: Never
---
apiVersion: v1
kind: Pod
metadata:
generateName: tas-podgroup-worker-1-
labels:
kueue.x-k8s.io/queue-name: tas-user-queue
kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
kueue.x-k8s.io/pod-group-pod-index: "1"
annotations:
kueue.x-k8s.io/pod-group-total-count: "3"
kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
restartPolicy: Never
containers:
- name: worker
image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
args: ["600s"]
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
generateName: tas-podgroup-worker-2-
labels:
kueue.x-k8s.io/queue-name: tas-user-queue
kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group"
kueue.x-k8s.io/pod-group-pod-index: "2"
annotations:
kueue.x-k8s.io/pod-group-total-count: "3"
kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
restartPolicy: Never
containers:
- name: worker
image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
args: ["600s"]
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
Case 2: Leader is not a worker and doesn't require a GPU
If the leader isn't one of the workers, a Kueue limitation requires the leader to have the last index in the PodGroup, because of how Kueue creates PodSets. If the leader doesn't have the last index and the first worker doesn't use the first index, Kueue won't apply rank assignments.
See the following example:
---
apiVersion: v1
kind: Pod
metadata:
generateName: tas-podgroup-leader-
labels:
kueue.x-k8s.io/queue-name: tas-user-queue
kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
kueue.x-k8s.io/pod-group-pod-index: "2"
annotations:
kueue.x-k8s.io/pod-group-total-count: "3"
kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
containers:
- name: leader
image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
args: ["600s"]
resources:
requests:
cpu: "1"
limits:
cpu: "1"
restartPolicy: Never
---
apiVersion: v1
kind: Pod
metadata:
generateName: tas-podgroup-worker-0-
labels:
kueue.x-k8s.io/queue-name: tas-user-queue
kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
kueue.x-k8s.io/pod-group-pod-index: "0"
annotations:
kueue.x-k8s.io/pod-group-total-count: "3"
kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
restartPolicy: Never
containers:
- name: worker
image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
args: ["600s"]
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Pod
metadata:
generateName: tas-podgroup-worker-1-
labels:
kueue.x-k8s.io/queue-name: tas-user-queue
kueue.x-k8s.io/pod-group-name: "tas-podgroup-example-group2"
kueue.x-k8s.io/pod-group-pod-index: "1"
annotations:
kueue.x-k8s.io/pod-group-total-count: "3"
kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-block"
spec:
restartPolicy: Never
containers:
- name: worker
image: gcr.io/k8s-staging-perf-tests/sleep:v0.1.0
args: ["600s"]
resources:
requests:
nvidia.com/gpu: "1"
limits:
nvidia.com/gpu: "1"
What's next
- To learn about managing common events relevant to GKE clusters and AI workloads, see Manage Hypercompute Clusters with GKE.
- To learn more about scheduling Jobs on GKE with Kueue, see Deploy a batch system using Kueue.