Create a custom Hypercompute Cluster with GKE

This page shows you how to create your own Hypercompute Cluster with Google Kubernetes Engine (GKE) to support your AI and ML workloads, using A3 Ultra GPUs. GKE is the open, portable, extensible, and highly scalable platform for Hypercompute Cluster. GKE provides a single platform surface to run a diverse set of workloads for your organization, including high performance distributed pre-training, model fine-tuning, model inference, application serving, and supporting services. GKE reduces the operational burden of managing multiple platforms.

The instructions on this page explain how to create a GKE cluster manually for maximum flexibility in configuring your cluster based on the needs of your workload. Alternatively, you can choose to use Cluster Toolkit to quickly deploy your cluster with default settings that reflect best practices for many use cases. For instructions on how to do this, see Create a Hypercompute Cluster with GKE with default configuration.

To create your cluster manually, choose one of the following cluster configuration options:

  • Create a GKE cluster without using GPUDirect RDMA, if you're not planning to run distributed AI workloads.
  • Create a GKE cluster with GPUDirect RDMA, if you're planning to run distributed AI workloads.

Before you begin

Before you start, make sure you have performed the following tasks:

  • Enable the Google Kubernetes Engine API.
  • If you want to use the Google Cloud CLI for this task, install and then initialize the gcloud CLI. If you previously installed the gcloud CLI, get the latest version by running gcloud components update.
  • Ensure that you have enough quota for A3 Ultra GPUs. To request more quota, follow the instructions in GPU quota. To ensure that your cluster has capacity, you can follow the instructions to reserve capacity. To check your current quota, see the sketch after this list.
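
To check your current GPU quota in the target region, you can inspect the region's quota metrics. This is an optional sketch; quota metric names vary by project, so scan the output for the A3 Ultra GPU entry:

  gcloud compute regions describe COMPUTE_REGION \
      --project=PROJECT_ID \
      --format="table(quotas:format='table(metric,limit,usage)')"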

Requirements

The following requirements apply to Hypercompute Cluster with GKE:

  • The H200 GPUs in A3 Ultra VMs require GPU driver version 550 or later, which is available in GKE 1.31 as the latest driver version. For A3 Ultra with GKE 1.31, you must set gpu-driver-version=latest. For GKE version 1.31.4-gke.1183000 or later, GKE automatically installs a 550 GPU driver version on A3 Ultra nodes by default.
  • To use GPUDirect RDMA, the following additional requirements apply:
    • Use GKE patch version 1.31.4-gke.1183000 or later.
    • The GKE nodes must use a Container-Optimized OS node image. Ubuntu and Windows node images are not supported.
    • Your GKE workload must use all available GPUs and your Pod must use all available secondary NICs on a single GKE node. Multiple Pods cannot share RDMA on a single GKE node.
    • This setup runs a NCCL test. To run this NCCL test, you must have quota for at least two VMs (16 GPUs if using a3-ultragpu-8g).

Create a cluster

Follow the instructions in this section to create a GKE cluster that meets the requirements for Hypercompute Cluster with GKE. You can choose between creating a cluster with or without GPUDirect RDMA.

For both options, replace the following placeholders with your own values:

  • PROJECT_ID: your Google Cloud project ID.
  • CLUSTER_NAME: the name of your new cluster.
  • NODE_POOL_NAME: the name of the node pool.
  • COMPUTE_REGION: the region of your new cluster. For dense reservations, you can replace this with --zone=COMPUTE_ZONE if you're creating a zonal cluster. When creating node pools in a regional cluster, you can use --node-locations to specify the zones for your GKE nodes.
  • CLUSTER_VERSION: the version of your new cluster, which must be 1.31.4-gke.1183000 or later to use GPUDirect RDMA.
  • GPU_TYPE: the type of GPU accelerator. For example, nvidia-h200-141gb for A3 Ultra virtual machines (VMs).
  • AMOUNT: the number of GPUs to attach to nodes in the node pool. For a3-ultragpu-8g, the amount is 8.
  • DRIVER_VERSION: the NVIDIA driver version to install. It can be one of the following values:

    • default: Install the default driver version for your GKE node version. The H200 GPUs in A3 Ultra VMs require driver version 550 or later. For GKE version 1.31.4-gke.1183000 or later, if you omit the gpu-driver-version flag, default is used and GKE automatically installs a 550 GPU driver version on A3 Ultra nodes.
    • latest: Install the latest available driver version for your GKE version. Available only for nodes that are using Container-Optimized OS.
    • disabled: Skip automatic driver installation. You must manually install a driver after you create the node pool.

    To see the default and latest GPU driver versions for GKE node versions, see Manually install NVIDIA GPU drivers.

  • MACHINE_TYPE: The Compute Engine machine type for the nodes. For example, a3-ultragpu-8g for A3 Ultra VMs.

  • NUM_NODES: The number of nodes for the node pool.

  • Reservation affinity: the --reservation-affinity flag can take the value specific or any. For high performance distributed AI workloads, we recommend using a specific reservation. When you use specific, you must also specify the --reservation flag, which takes the following value:

    • RESERVATION_NAME/reservationBlocks/BLOCK_NAME: RESERVATION_NAME is the name of your reservation, and BLOCK_NAME is the name of a specific block within the reservation. Both of these values can be obtained by querying your reservation, as shown in the sketch after this list. If using a shared reservation, you must also include the PROJECT_ID using the following format: projects/PROJECT_ID/reservations/RESERVATION_NAME/reservationBlocks/BLOCK_NAME.
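
To look up the reservation and block names, you can query your reservation with the gcloud CLI. This is a sketch; the blocks list command is in the beta component, and its availability can depend on your reservation type:

  # Describe the reservation to confirm its name and zone.
  gcloud compute reservations describe RESERVATION_NAME \
      --project=PROJECT_ID \
      --zone=COMPUTE_ZONE

  # List the blocks within the reservation (beta).
  gcloud beta compute reservations blocks list RESERVATION_NAME \
      --project=PROJECT_ID \
      --zone=COMPUTE_ZONE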

Create a cluster without GPUDirect RDMA

To create a cluster without GPUDirect RDMA, use one of the following node pool configuration options:

  • Use a CPU-based default node pool and add separate node pools with GPUs. We recommend this approach because it lets the default node pool run other services.
  • Use a GPU-based default node pool.

Use a separate node pool

  1. Create the cluster:

    gcloud container clusters create CLUSTER_NAME \
        --region=COMPUTE_REGION
    
  2. Create the GPU-based node pool:

    gcloud container node-pools create NODE_POOL_NAME \
        --region COMPUTE_REGION --cluster CLUSTER_NAME \
        --node-locations COMPUTE_ZONE \
        --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
        --machine-type MACHINE_TYPE \
        --num-nodes=NUM_NODES \
        --reservation-affinity=specific \
        --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME
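
    To confirm that the node pool was created with the expected machine type and accelerator, you can describe it. This is an optional check; the field paths in the format string reflect the current API and may differ across versions:

    gcloud container node-pools describe NODE_POOL_NAME \
        --cluster=CLUSTER_NAME \
        --region=COMPUTE_REGION \
        --format="value(config.machineType, config.accelerators)"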
    

Use a default node pool

Create the cluster with a GPU-based default node pool:

  gcloud container clusters create CLUSTER_NAME \
    --region=COMPUTE_REGION \
    --cluster-version=CLUSTER_VERSION \
    --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
    --machine-type=MACHINE_TYPE  \
    --num-nodes=NUM_NODES \
    --reservation-affinity=specific \
    --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME
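
After the cluster is created, you can verify that the GPU nodes registered. This assumes kubectl is installed locally; the accelerator node label value matches the GPU_TYPE you chose:

  gcloud container clusters get-credentials CLUSTER_NAME --region=COMPUTE_REGION
  kubectl get nodes -l cloud.google.com/gke-accelerator=GPU_TYPE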

Create a cluster with GPUDirect RDMA

For distributed AI workloads, multiple GPU nodes are often linked together to work as a single computer. The A3 Ultra VMs come with the Titanium ML network adapter, which is built on NVIDIA ConnectX-7 (CX7) network interface cards (NICs). A3 Ultra VMs deliver non-blocking 3.2 Tbps of inter-node GPU-to-GPU traffic using RDMA over Converged Ethernet (RoCE), enabling scaling and collaboration across multiple GPUs and delivering a high-performance cloud experience for AI workloads.

To create your GKE clusters manually using GPUDirect TCPX (A3 High VMs) or GPUDirect TCPXO (A3 Mega VMs), see Maximize GPU network bandwidth in Standard mode clusters.

To create your GKE clusters manually with GPUDirect RDMA, you'll complete the following steps, which are described in the next sections:

  1. Create VPCs and subnets
  2. Create the GKE cluster and GPU node pool with multi-networking
  3. Create the GKE network objects
  4. Install the RDMA binary and configure NCCL
  5. Deploy and run a NCCL test
  6. Configure your Pod manifests for GPUDirect RDMA

Create VPCs and subnets

A3 Ultra GPUs have the following configuration:

  • Eight NVIDIA H200 GPUs per virtual machine connected with NVLink
  • Two Intel Emerald Rapids CPUs
  • Eight 400 Gbps CX-7 network interface cards (NICs) for GPU-to-GPU networking
  • Two 200 Gbps Google Titanium network interface cards (NICs) for external services

AI and ML workloads, such as distributed training, require powerful acceleration to optimize performance by reducing job completion times. For such workloads that require high performance, high throughput, and low latency, GPUDirect RDMA reduces the overhead required to transfer payloads to and from GPUs, significantly improving throughput at scale compared to GPUs that don't use GPUDirect.

One of the Google Titanium NICs, which is associated with the CPU, uses the default network in GKE, so you don't have to create a new VPC for this NIC as long as you have enough IP ranges for the default network.

You can create one VPC for the second CPU Titanium NIC (gVNIC) and another VPC for the eight CX-7 RDMA NICs by using the following commands.

  1. Set environment variables to match your deployment:

    export MTU_SIZE=8896
    export REGION="COMPUTE_REGION"
    export ZONE="COMPUTE_ZONE"
    export PROJECT="PROJECT_ID"
    export GVNIC_NETWORK_PREFIX="a3ultra-gvnic"
    export RDMA_NETWORK_PREFIX="a3ultra-rdma"
    

    Replace the following variables:

    • COMPUTE_REGION: the region of your cluster.
    • COMPUTE_ZONE: the zone of your A3 Ultra node pool.
    • PROJECT_ID: your Google Cloud project ID.

    An MTU_SIZE of 8896 is recommended for RDMA networks.

  2. Create two VPC networks:

    # Create a VPC for the additional Google Titanium CPU NIC
    gcloud compute --project=${PROJECT?} \
      networks create \
      ${GVNIC_NETWORK_PREFIX?}-net \
      --subnet-mode=custom \
      --mtu=${MTU_SIZE?}
    
    gcloud compute --project=${PROJECT?} \
      networks subnets create \
      ${GVNIC_NETWORK_PREFIX?}-sub \
      --network=${GVNIC_NETWORK_PREFIX?}-net \
      --region=${REGION?} \
      --range=192.168.0.0/24
    
    gcloud compute --project=${PROJECT?} \
      firewall-rules create \
      ${GVNIC_NETWORK_PREFIX?}-internal \
      --network=${GVNIC_NETWORK_PREFIX?}-net \
      --action=ALLOW \
      --rules=tcp:0-65535,udp:0-65535,icmp \
      --source-ranges=192.168.0.0/16
    
    # Create HPC VPC for the RDMA NICs with 8 subnets.
    gcloud beta compute --project=${PROJECT?} \
      networks create ${RDMA_NETWORK_PREFIX?}-net \
      --network-profile=${ZONE?}-vpc-roce \
      --subnet-mode=custom \
      --mtu=${MTU_SIZE?}
    
    # Create subnets for the HPC VPC.
    for N in $(seq 0 7); do
      gcloud compute --project=${PROJECT?} \
        networks subnets create \
        ${RDMA_NETWORK_PREFIX?}-sub-$N \
        --network=${RDMA_NETWORK_PREFIX?}-net \
        --region=${REGION?} \
        --range=192.168.$((N+1)).0/24 &  # offset to avoid overlap with gvnics
    done
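
    # The subnet creations above run in the background; wait for them to
    # finish before moving on.
    wait

    # Optional check (a sketch): list the subnets created in the RDMA VPC.
    gcloud compute networks subnets list \
      --project=${PROJECT?} \
      --filter="network~${RDMA_NETWORK_PREFIX?}-net" \
      --regions=${REGION?}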
    

Create the GKE cluster and GPU node pool with multi-networking

  1. Create the cluster:

    gcloud container clusters create CLUSTER_NAME \
      --region=COMPUTE_REGION \
      --cluster-version=CLUSTER_VERSION \
      --enable-dataplane-v2 --enable-ip-alias --enable-multi-networking \
      [--services-ipv4-cidr=SERVICE_CIDR] \
      [--cluster-ipv4-cidr=POD_CIDR]
    

    Replace the following variables if you use the optional flags:

    • SERVICE_CIDR and POD_CIDR: Optionally, you can explicitly provide the secondary ranges for services and Pods. You must ensure that these ranges don't overlap with subnet ranges for additional node networks. For example, SERVICE_CIDR=10.65.0.0/19 and POD_CIDR=10.64.0.0/19.
  2. Create the node pool:

    gcloud container node-pools create NODE_POOL_NAME \
      --region COMPUTE_REGION --cluster CLUSTER_NAME \
      --node-locations COMPUTE_ZONE \
      --accelerator type=GPU_TYPE,count=AMOUNT,gpu-driver-version=DRIVER_VERSION \
      --machine-type MACHINE_TYPE \
      --num-nodes=NUM_NODES \
      --reservation-affinity=specific \
      --reservation=RESERVATION_NAME/reservationBlocks/BLOCK_NAME  \
      --additional-node-network network=${GVNIC_NETWORK_PREFIX}-net,subnetwork=${GVNIC_NETWORK_PREFIX}-sub \
      --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-0 \
      --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-1 \
      --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-2 \
      --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-3 \
      --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-4 \
      --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-5 \
      --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-6 \
      --additional-node-network network=${RDMA_NETWORK_PREFIX}-net,subnetwork=${RDMA_NETWORK_PREFIX}-sub-7
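
    Before you run the kubectl commands in the following sections, fetch cluster credentials so that kubectl targets the new cluster. This assumes the gcloud CLI and kubectl are installed locally:

    gcloud container clusters get-credentials CLUSTER_NAME \
      --region=COMPUTE_REGION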
    

Create the GKE network objects

The VPC networks created in the previous section need to be configured through GKE network parameter sets. Specifically, the second CPU Titanium NIC (gVNIC) needs to be configured in NetDevice mode, and each of the eight CX-7 RDMA NICs needs to be configured in RDMA mode.

This command uses the following names:

  • CPU Titanium NIC (gVNIC) VPC is named ${GVNIC_NETWORK_PREFIX?}-net with subnet named ${GVNIC_NETWORK_PREFIX?}-sub
  • CX-7 RDMA NICs VPC is named ${RDMA_NETWORK_PREFIX?}-net with subnets named ${RDMA_NETWORK_PREFIX?}-sub-[0…7]

Create the GKE network objects by running the following command:

  kubectl apply -f - <<EOF
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: gvnic-1
  spec:
    vpc: ${GVNIC_NETWORK_PREFIX}-net
    vpcSubnet: ${GVNIC_NETWORK_PREFIX}-sub
    deviceMode: NetDevice
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: gvnic-1
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: gvnic-1
  ---
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: rdma-0
  spec:
    vpc: ${RDMA_NETWORK_PREFIX}-net
    vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-0
    deviceMode: RDMA
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: rdma-0
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: rdma-0
  ---
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: rdma-1
  spec:
    vpc: ${RDMA_NETWORK_PREFIX}-net
    vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-1
    deviceMode: RDMA
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: rdma-1
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: rdma-1
  ---
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: rdma-2
  spec:
    vpc: ${RDMA_NETWORK_PREFIX}-net
    vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-2
    deviceMode: RDMA
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: rdma-2
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: rdma-2
  ---
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: rdma-3
  spec:
    vpc: ${RDMA_NETWORK_PREFIX}-net
    vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-3
    deviceMode: RDMA
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: rdma-3
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: rdma-3
  ---
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: rdma-4
  spec:
    vpc: ${RDMA_NETWORK_PREFIX}-net
    vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-4
    deviceMode: RDMA
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: rdma-4
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: rdma-4
  ---
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: rdma-5
  spec:
    vpc: ${RDMA_NETWORK_PREFIX}-net
    vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-5
    deviceMode: RDMA
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: rdma-5
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: rdma-5
  ---
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: rdma-6
  spec:
    vpc: ${RDMA_NETWORK_PREFIX}-net
    vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-6
    deviceMode: RDMA
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: rdma-6
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: rdma-6
  ---
  apiVersion: networking.gke.io/v1
  kind: GKENetworkParamSet
  metadata:
    name: rdma-7
  spec:
    vpc: ${RDMA_NETWORK_PREFIX}-net
    vpcSubnet: ${RDMA_NETWORK_PREFIX}-sub-7
    deviceMode: RDMA
  ---
  apiVersion: networking.gke.io/v1
  kind: Network
  metadata:
    name: rdma-7
  spec:
    type: "Device"
    parametersRef:
      group: networking.gke.io
      kind: GKENetworkParamSet
      name: rdma-7
  EOF
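
After you apply the manifest, you can optionally confirm that the objects were created. The fully qualified resource names below follow from the networking.gke.io/v1 kinds used above; if they don't resolve in your cluster version, run kubectl api-resources to find them:

  kubectl get gkenetworkparamsets.networking.gke.io
  kubectl get networks.networking.gke.io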

Install the RDMA binary and configure NCCL

Apply the following DaemonSet to install the RDMA binaries and the NCCL library on the nodes. The RDMA binaries are stored in the /home/kubernetes/bin/gib directory, and the NCCL library is stored in the /home/kubernetes/bin/nvidia/lib64 directory on the VM:

  kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-rdma-installer.yaml
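
You can optionally confirm that the installer DaemonSet is running before you continue. The name filter below is an assumption; the actual DaemonSet name comes from the manifest you applied:

  kubectl get daemonsets -n kube-system | grep -i rdma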

Run NCCL tests

To validate the functionality of the provisioned cluster, you can run a NCCL test. For instructions, see Deploy and run a NCCL test.

Configure your Pod manifests for GPUDirect RDMA

To run your workloads using GPUDirect RDMA, configure your Pod manifests with the following steps:

  1. To avoid setting hostNetwork: true, add the following annotations to the Pod metadata:

    metadata:
      annotations:
        networking.gke.io/default-interface: 'eth0'
        networking.gke.io/interfaces: |
          [
            {"interfaceName":"eth0","network":"default"},
            {"interfaceName":"eth1","network":"gvnic-1"},
            {"interfaceName":"eth2","network":"rdma-0"},
            {"interfaceName":"eth3","network":"rdma-1"},
            {"interfaceName":"eth4","network":"rdma-2"},
            {"interfaceName":"eth5","network":"rdma-3"},
            {"interfaceName":"eth6","network":"rdma-4"},
            {"interfaceName":"eth7","network":"rdma-5"},
            {"interfaceName":"eth8","network":"rdma-6"},
            {"interfaceName":"eth9","network":"rdma-7"}
          ]
    
  2. Add the following volumes to the Pod spec:

    spec:
      volumes:
        - name: library-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
        - name: gib
          hostPath:
            path: /home/kubernetes/bin/gib
    
  3. Add the following volume mounts, environment variables, and resources to the container that requests GPUs. Your workload container must request all 8 GPUs:

    containers:
      - name: my-container
        volumeMounts:
          - name: library-dir-host
            mountPath: /usr/local/nvidia
          - name: gib
            mountPath: /usr/local/gib
        env:
          - name: LD_LIBRARY_PATH
            value: /usr/local/nvidia/lib64
        resources:
          limits:
            nvidia.com/gpu: 8
    
  4. Set all of the recommended environment variables to configure NCCL by sourcing the following shell script from the workload container:

    source /usr/local/gib/scripts/set_nccl_env.sh
    

    You can also set the environment variables that configure NCCL manually. The recommended values are the following:

    "NCCL_SOCKET_IFNAME=eth0,eth1",
    "NCCL_CROSS_NIC=0",
    "NCCL_NET_GDR_LEVEL=PIX",
    "NCCL_P2P_NET_CHUNKSIZE=131072",
    "NCCL_P2P_PCI_CHUNKSIZE=131072",
    "NCCL_P2P_NVL_CHUNKSIZE=524288",
    "NCCL_NVLS_CHUNKSIZE=524288",
    "NCCL_IB_GID_INDEX=3",
    "NCCL_IB_ADAPTIVE_ROUTING=1",
    "NCCL_IB_QPS_PER_CONNECTION=4",
    "NCCL_IB_TC=52",
    "NCCL_IB_FIFO_TC=84",
    "NCCL_SHIMNET_GUEST_CONFIG_CHECKER_CONFIG_FILE=/usr/local/gib/configs/guest_config.txtpb",
    "NCCL_TUNER_CONFIG_PATH=/usr/local/gib/configs/tuner_config.txtpb"
    

    A completed Pod manifest should look similar to the following:

    apiVersion: v1
    kind: Pod
    metadata:
      name: my-pod
      labels:
        k8s-app: my-pod
      annotations:
        networking.gke.io/default-interface: 'eth0'
        networking.gke.io/interfaces: |
          [
            {"interfaceName":"eth0","network":"default"},
            {"interfaceName":"eth1","network":"gvnic-1"},
            {"interfaceName":"eth2","network":"rdma-0"},
            {"interfaceName":"eth3","network":"rdma-1"},
            {"interfaceName":"eth4","network":"rdma-2"},
            {"interfaceName":"eth5","network":"rdma-3"},
            {"interfaceName":"eth6","network":"rdma-4"},
            {"interfaceName":"eth7","network":"rdma-5"},
            {"interfaceName":"eth8","network":"rdma-6"},
            {"interfaceName":"eth9","network":"rdma-7"}
          ]
    spec:
      ...
      volumes:
        - name: library-dir-host
          hostPath:
            path: /home/kubernetes/bin/nvidia
        - name: gib
          hostPath:
            path: /home/kubernetes/bin/gib
      containers:
        - name: my-container
          volumeMounts:
            - name: library-dir-host
              mountPath: /usr/local/nvidia
            - name: gib
              mountPath: /usr/local/gib
          env:
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib64
          resources:
            limits:
              nvidia.com/gpu: 8
              ...
    

Deploy and run a NCCL test

To validate the functionality of the provisioned cluster, you can run a NCCL test. Run a basic two-node test, or, if you have a larger number of nodes, run the NCCL test with Topology Aware Scheduling (TAS).

Two-node test

Run the two-node test:

  1. To deploy a NCCL test workload of two test Pods running on two A3 Ultra nodes, apply the following manifest:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test.yaml
    
  2. Trigger a NCCL all-gather test for the A3 Ultra nodes:

    kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
    

    The output should be similar to the following:

    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            16     float    none      -1    56.00    0.02    0.02      0    55.59    0.02    0.02      0
            2048            32     float    none      -1    55.79    0.04    0.03      0    55.57    0.04    0.03      0
            4096            64     float    none      -1    56.29    0.07    0.07      0    57.35    0.07    0.07      0
            8192           128     float    none      -1    56.44    0.15    0.14      0    56.32    0.15    0.14      0
           16384           256     float    none      -1    57.57    0.28    0.27      0    57.60    0.28    0.27      0
           32768           512     float    none      -1    57.92    0.57    0.53      0    59.35    0.55    0.52      0
           65536          1024     float    none      -1    59.92    1.09    1.03      0    60.15    1.09    1.02      0
          131072          2048     float    none      -1    59.21    2.21    2.08      0    61.82    2.12    1.99      0
          262144          4096     float    none      -1    63.58    4.12    3.87      0    63.34    4.14    3.88      0
          524288          8192     float    none      -1    64.89    8.08    7.57      0    65.09    8.06    7.55      0
         1048576         16384     float    none      -1    80.90   12.96   12.15      0    77.49   13.53   12.69      0
         2097152         32768     float    none      -1    80.22   26.14   24.51      0    79.88   26.25   24.61      0
         4194304         65536     float    none      -1    82.86   50.62   47.45      0    82.47   50.86   47.68      0
         8388608        131072     float    none      -1    95.83   87.53   82.06      0    93.27   89.94   84.32      0
        16777216        262144     float    none      -1    122.8  136.58  128.04      0    121.7  137.86  129.24      0
        33554432        524288     float    none      -1    180.6  185.75  174.14      0    179.2  187.19  175.49      0
        67108864       1048576     float    none      -1    279.7  239.90  224.90      0    277.0  242.26  227.12      0
       134217728       2097152     float    none      -1    507.5  264.46  247.93      0    485.1  276.66  259.37      0
       268435456       4194304     float    none      -1    866.3  309.88  290.51      0    864.0  310.70  291.28      0
       536870912       8388608     float    none      -1   1576.1  340.62  319.33      0   1558.2  344.54  323.01      0
      1073741824      16777216     float    none      -1   3096.6  346.75  325.08      0   3047.5  352.33  330.31      0
      2147483648      33554432     float    none      -1   6148.0  349.30  327.47      0   6034.3  355.88  333.64      0
      4294967296      67108864     float    none      -1    12226  351.29  329.33      0    12000  357.92  335.55      0
      8589934592     134217728     float    none      -1    24391  352.17  330.16      0    23920  359.11  336.67      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 120.94
    

Test with Topology Aware Scheduling (TAS)

If you have more than two nodes, we recommend using the following test, which uses TAS. Follow the steps in the next sections to prepare and run the test on your cluster.

Set up your cluster with Jobset and TAS plugin

  1. Install JobSet.

  2. Install the TAS plugin:

    1. Clone the container-engine-accelerators git repository:

      cd ~
      git clone https://github.com/GoogleCloudPlatform/container-engine-accelerators.git
      
    2. Apply the TAS plugin:

      cd container-engine-accelerators/gke-topology-scheduler
      kubectl create configmap topology-scheduler-scripts --namespace kube-system --from-file=schedule-daemon.py=schedule-daemon.py --from-file=label-nodes-daemon.py=label-nodes-daemon.py
      kubectl apply -f service-account.yaml
      kubectl apply -f schedule-daemon.yaml
      kubectl apply -f label-nodes-daemon.yaml
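
      You can optionally confirm that the scheduler and node-labeler Pods are running before you deploy the test. The name filter is an assumption based on the manifest file names:

      kubectl get pods -n kube-system | grep -E 'schedule-daemon|label-nodes'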
      

Deploy a NCCL test workload with TAS

  1. Create the following nccl-jobset-test.yaml manifest, replacing NUM_NODES with the number of nodes in the node pool:

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: nccl-allgather
    spec:
      ttlSecondsAfterFinished: 1200
      suspend: False
      network:
        enableDNSHostnames: true
      replicatedJobs:
        - name: worker
          template:
            spec:
              parallelism: NUM_NODES
              completions: NUM_NODES
              template:
                metadata:
                  annotations:
                    networking.gke.io/default-interface: 'eth0'
                    networking.gke.io/interfaces: |
                      [
                        {"interfaceName":"eth0","network":"default"},
                        {"interfaceName":"eth1","network":"gvnic-1"},
                        {"interfaceName":"eth2","network":"rdma-0"},
                        {"interfaceName":"eth3","network":"rdma-1"},
                        {"interfaceName":"eth4","network":"rdma-2"},
                        {"interfaceName":"eth5","network":"rdma-3"},
                        {"interfaceName":"eth6","network":"rdma-4"},
                        {"interfaceName":"eth7","network":"rdma-5"},
                        {"interfaceName":"eth8","network":"rdma-6"},
                        {"interfaceName":"eth9","network":"rdma-7"}
                      ]
                spec:
                  activeDeadlineSeconds: 3600
                  restartPolicy: Never
                  nodeSelector:
                    cloud.google.com/gke-accelerator: nvidia-h200-141gb
                  tolerations:
                  - key: cloud.google.com/gke-queued
                    effect: NoSchedule
                    value: "true"
                  - key: "nvidia.com/gpu"
                    operator: "Exists"
                    effect: "NoSchedule"
                  setHostnameAsFQDN: true
                  volumes:
                  - name: gib
                    hostPath:
                      path: /home/kubernetes/bin/gib
                  - name: nvidia
                    hostPath:
                      path: /home/kubernetes/bin/nvidia
                  - name: lib64
                    hostPath:
                      path: /lib64
                  - name: shared-memory
                    emptyDir:
                      medium: "Memory"
                      sizeLimit: 250Gi
                  schedulingGates:
                  - name: "gke.io/topology-aware-auto-nccl-test"
                  containers:
                  - name: nccl-test
                    stdin: true
                    tty: true
                    image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic:v1.0.3
                    securityContext:
                      privileged: true
                    env:
                    - name: MY_NODE_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: spec.nodeName
                    - name: OMPI_ALLOW_RUN_AS_ROOT
                      value: "1"
                    - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                      value: "1"
                    - name: N_NODES
                      value: "NUM_NODES"
                    - name: LD_LIBRARY_PATH
                      value: /usr/local/nvidia/lib64
                    command:
                    - bash
                    - -c
                    - |
                      set -x
                      echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"
                      # Install ping
                      apt update -y
                      apt install -y iputils-ping
    
                      # Start sshd
                      /scripts/container_entry.sh daemon &
    
                      # Get helper variables to form all hostnames
                      export POSTFIX=$(hostname | cut -d . -f 2-)
                      export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                      export NODE_RANK=$JOB_COMPLETION_INDEX
    
                      # For every worker, wait till online and add to hostfile
                      for i in `seq 0 $(($N_NODES-1))`; do
                        OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                        until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                          echo Waiting for ${OTHER}...
                          sleep 10
                        done
                        echo ${OTHER} port=222 slots=8 | tee -a /tmp/hostfile;
                      done
    
                      cat /tmp/hostfile
    
                      # Launch from head node
                      if [[ "${NODE_RANK}" -eq "0" ]]; then
    
                          # World Level = 0x0, Rail Aligned = 0x7
                          export NCCL_TESTS_SPLIT_MASK="0x0";
    
                          # Force use of libnccl-gib
                          export NCCL_NET=gIB
    
                          # Set all the correct libnccl-gib environment variables
                          source /usr/local/gib/scripts/set_nccl_env.sh
    
                          # Get all relevant NCCL / env vars to pass to all workers
                          ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')
    
                          mpirun --hostfile /tmp/hostfile \
                            -x $ENV_VARS  \
                            -mca plm_rsh_no_tree_spawn 1 \
                            --mca orte_keep_fqdn_hostnames 1 \
                            --mca btl self,tcp \
                            --mca btl_tcp_if_include eth0 \
                            --bind-to none \
                            --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                            /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
    
                      else
                          while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                            sleep 5
                          done
                      fi
    
                      exit 0
                    volumeMounts:
                    - name: nvidia
                      mountPath: /usr/local/nvidia
                    - name: gib
                      mountPath: /usr/local/gib
                    - name: shared-memory
                      mountPath: /dev/shm
                    resources:
                      limits:
                        nvidia.com/gpu: 8
                      requests:
                        nvidia.com/gpu: 8
    

    Ensure that you understand the following about this manifest:

    • The JobSet creates a headless Service with the same name as the JobSet, in this case nccl-allgather.
    • The gke.io/topology-aware-auto-nccl-test scheduling gate ensures that the Pods are scheduled together on colocated nodes.
    • The parallelism and completions fields are both set to the number of nodes with which you want to run the NCCL test.
  2. Apply the manifest:

    kubectl apply -f nccl-jobset-test.yaml
    
  3. Confirm that the workload is admitted:

    kubectl get jobsets
    

    The output is similar to the following:

    NAME            RESTARTS   COMPLETED   AGE
    nccl-allgather                         3s
    
  4. Confirm that the workload is in the Completed state:

    kubectl get pods
    

    The output is similar to the following:

    NAME                          READY   STATUS      RESTARTS   AGE
    nccl-allgather-worker-0-0-n9s6j   0/1     Completed   0          9m34s
    nccl-allgather-worker-0-1-rsf7r   0/1     Completed   0          9m34s
    ...
    
  5. The logs of the Pod with the pattern nccl-allgather-worker-0-0-.* contain the results of the test.

    Fetch the logs for this Pod:

      kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep nccl-allgather-worker-0-0)
    

    The output should be similar to the following:

    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
            1024            16     float    none      -1    54.07    0.02    0.02      0    55.80    0.02    0.02      0
            2048            32     float    none      -1    55.46    0.04    0.03      0    55.31    0.04    0.03      0
            4096            64     float    none      -1    55.59    0.07    0.07      0    55.38    0.07    0.07      0
            8192           128     float    none      -1    56.05    0.15    0.14      0    55.92    0.15    0.14      0
           16384           256     float    none      -1    57.08    0.29    0.27      0    57.75    0.28    0.27      0
           32768           512     float    none      -1    57.49    0.57    0.53      0    57.22    0.57    0.54      0
           65536          1024     float    none      -1    59.20    1.11    1.04      0    59.20    1.11    1.04      0
          131072          2048     float    none      -1    59.58    2.20    2.06      0    63.57    2.06    1.93      0
          262144          4096     float    none      -1    63.87    4.10    3.85      0    63.61    4.12    3.86      0
          524288          8192     float    none      -1    64.83    8.09    7.58      0    64.40    8.14    7.63      0
         1048576         16384     float    none      -1    79.74   13.15   12.33      0    76.66   13.68   12.82      0
         2097152         32768     float    none      -1    78.41   26.74   25.07      0    79.05   26.53   24.87      0
         4194304         65536     float    none      -1    83.21   50.41   47.26      0    81.25   51.62   48.39      0
         8388608        131072     float    none      -1    94.35   88.91   83.35      0    99.07   84.68   79.38      0
        16777216        262144     float    none      -1    122.9  136.55  128.02      0    121.7  137.83  129.21      0
        33554432        524288     float    none      -1    184.2  182.19  170.80      0    178.1  188.38  176.60      0
        67108864       1048576     float    none      -1    294.7  227.75  213.51      0    277.7  241.62  226.52      0
       134217728       2097152     float    none      -1    495.4  270.94  254.00      0    488.8  274.60  257.43      0
       268435456       4194304     float    none      -1    877.5  305.92  286.80      0    861.3  311.65  292.17      0
       536870912       8388608     float    none      -1   1589.8  337.71  316.60      0   1576.2  340.61  319.33      0
      1073741824      16777216     float    none      -1   3105.7  345.74  324.13      0   3069.2  349.85  327.98      0
      2147483648      33554432     float    none      -1   6161.7  348.52  326.74      0   6070.7  353.75  331.64      0
      4294967296      67108864     float    none      -1    12305  349.03  327.22      0    12053  356.35  334.08      0
      8589934592     134217728     float    none      -1    24489  350.77  328.85      0    23991  358.05  335.67      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 120.248
    

What's next