This page describes how to run NCCL/gIB tests on provisioned clusters that use GPUDirect RDMA. It describes tests for the following scenarios:
- If you have nodes that are provisioned with flex-start (Preview), use a basic test on two nodes.
- If you have a larger number of nodes that are not provisioned with flex-start, use an NCCL test with Topology Aware Scheduling.
Test on two nodes
Connect to your cluster:
```
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=COMPUTE_REGION
```
Replace the following variables:
- CLUSTER_NAME: the name of your cluster. For clusters created with Cluster Toolkit, this name is based on the DEPLOYMENT_NAME.
- COMPUTE_REGION: the name of the compute region.
To deploy an NCCL test workload of two test Pods running on two A4X nodes, run the following command:
```
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/refs/heads/master/gpudirect-rdma/nccl-test-imex-a4x.yaml
```
Check that both Pods are running on nodes:
```
kubectl get pods nccl-test-host-1 nccl-test-host-2
```
If the two Pods show a Running status, you can proceed to the next step. Trigger an all-gather test for the A4X nodes:
```
kubectl exec nccl-test-host-1 -it -- /usr/local/gib/scripts/run_nccl_tests.sh -t all_gather -b 1K -e 8G nccl-host-1 nccl-host-2
```
The output is similar to the following:
```
#                                                             out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
        1024            32     float    none      -1    21.20    0.05    0.04       0    20.56    0.05    0.04       0
        2048            64     float    none      -1    21.03    0.10    0.09       0    20.82    0.10    0.09       0
        4096           128     float    none      -1    21.11    0.19    0.17       0    20.98    0.20    0.17       0
        8192           256     float    none      -1    21.51    0.38    0.33       0    21.15    0.39    0.34       0
       16384           512     float    none      -1    21.85    0.75    0.66       0    21.72    0.75    0.66       0
       32768          1024     float    none      -1    24.08    1.36    1.19       0    23.73    1.38    1.21       0
       65536          2048     float    none      -1    24.68    2.66    2.32       0    24.02    2.73    2.39       0
      131072          4096     float    none      -1    24.93    5.26    4.60       0    24.30    5.40    4.72       0
      262144          8192     float    none      -1    24.86   10.55    9.23       0    24.33   10.78    9.43       0
      524288         16384     float    none      -1    25.10   20.89   18.28       0    24.48   21.41   18.74       0
     1048576         32768     float    none      -1    25.43   41.24   36.09       0    24.82   42.25   36.97       0
     2097152         65536     float    none      -1    32.30   64.93   56.81       0    31.28   67.04   58.66       0
     4194304        131072     float    none      -1    45.92   91.34   79.92       0    44.22   94.84   82.99       0
     8388608        262144     float    none      -1    71.38  117.52  102.83       0    68.98  121.61  106.41       0
    16777216        524288     float    none      -1    74.17  226.20  197.93       0    72.37  231.83  202.85       0
    33554432       1048576     float    none      -1    116.6  287.84  251.86       0    112.7  297.75  260.54       0
    67108864       2097152     float    none      -1    188.9  355.27  310.86       0    184.0  364.71  319.12       0
   134217728       4194304     float    none      -1    309.6  433.56  379.36       0    299.7  447.83  391.85       0
   268435456       8388608     float    none      -1    559.0  480.23  420.20       0    540.3  496.85  434.75       0
   536870912      16777216     float    none      -1   1053.7  509.52  445.83       0   1021.4  525.64  459.93       0
  1073741824      33554432     float    none      -1   2087.4  514.39  450.10       0   2013.8  533.19  466.54       0
  2147483648      67108864     float    none      -1   4154.7  516.88  452.27       0   3987.4  538.57  471.25       0
  4294967296     134217728     float    none      -1   8289.2  518.14  453.37       0   7907.4  543.16  475.26       0
  8589934592     268435456     float    none      -1    16556  518.85  453.99       0    15726  546.24  477.96       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 175.233
#
```
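When judging a run at a glance, the summary line at the bottom of the output is usually enough. The following is a minimal sketch that pulls the average bus bandwidth out of saved output and compares it against a cutoff; the file name, the abbreviated sample, and the 150 GB/s threshold are all illustrative assumptions, not published targets:

```shell
# Stand-in for real saved output; in practice, capture it with
# `kubectl exec ... | tee /tmp/nccl_out.txt`.
cat > /tmp/nccl_out.txt <<'EOF'
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 175.233
EOF

# Extract the number after the colon on the summary line.
avg=$(awk -F: '/Avg bus bandwidth/ {gsub(/ /, "", $2); print $2}' /tmp/nccl_out.txt)
echo "avg_busbw_GBps=${avg}"

# 150 GB/s is an illustrative cutoff, not an official baseline.
awk -v v="$avg" 'BEGIN { exit !(v+0 > 150) }' && echo "PASS" || echo "BELOW THRESHOLD"
```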
Test with TAS
To validate the functionality of the provisioned cluster, you can run the following NCCL test with Topology Aware Scheduling (TAS).
Configure Kueue with TAS enabled
- Install Kueue with TAS enabled.
Configure Kueue with TAS enabled by creating the following file, which you name a4x-kueue-config.yaml:
```yaml
apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: "a4x-default"
spec:
  levels:
  - nodeLabel: "cloud.google.com/gce-topology-block"
  - nodeLabel: "cloud.google.com/gce-topology-subblock"
  - nodeLabel: "cloud.google.com/gke-nodepool"
  - nodeLabel: "cloud.google.com/gce-topology-host"
  - nodeLabel: "kubernetes.io/hostname"
---
kind: ResourceFlavor
apiVersion: kueue.x-k8s.io/v1beta1
metadata:
  name: "a4x"
spec:
  nodeLabels:
    cloud.google.com/gke-accelerator: nvidia-gb200
  topologyName: "a4x-default"
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: NoSchedule
  - key: "kubernetes.io/arch"
    operator: "Exists"
    effect: NoSchedule
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "a4x"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: "a4x"
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 1_000_000_000
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: "default"
  name: "a4x"
spec:
  clusterQueue: "a4x"
```
Apply the configuration:
```
kubectl apply -f a4x-kueue-config.yaml
```
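As a quick sanity check, the file should define exactly four objects: a Topology, a ResourceFlavor, a ClusterQueue, and a LocalQueue. A hedged sketch that counts them; the stand-in fragment below only mirrors the kind lines, so in practice you would run the grep against your real a4x-kueue-config.yaml:

```shell
# Stand-in fragment reproducing only the `kind:` lines of the config.
cat > /tmp/a4x-kueue-config.yaml <<'EOF'
kind: Topology
---
kind: ResourceFlavor
---
kind: ClusterQueue
---
kind: LocalQueue
EOF

# Expect 4: Topology, ResourceFlavor, ClusterQueue, LocalQueue.
grep -c '^kind:' /tmp/a4x-kueue-config.yaml
```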
Schedule a topology-aware NCCL test with Kueue with TAS enabled
The following workload must be placed within a single NVLink Domain sub-block.
- Install JobSet, a Kubernetes-native API for managing a group of Kubernetes Jobs as a unit. Ensure that your non-GPU node pools have enough resources to schedule the JobSet controllers.
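The manifest in the next step parameterizes the node count with a NUM_NODES placeholder that appears in several places, and every occurrence must get the same value before you apply it. One way to fill it in is with sed; in this sketch, the node count of 4 and the stand-in fragment are illustrative assumptions:

```shell
# Tiny stand-in fragment; run the same sed against nccl-tas-test.yaml.
cat > /tmp/nccl-tas-fragment.yaml <<'EOF'
spec:
  numNodes: NUM_NODES
  parallelism: NUM_NODES
  completions: NUM_NODES
EOF

# Replace every occurrence with the intended node count (up to 18).
sed -i 's/NUM_NODES/4/g' /tmp/nccl-tas-fragment.yaml
cat /tmp/nccl-tas-fragment.yaml
```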
Create the following file with the name nccl-tas-test.yaml. Replace NUM_NODES with the intended number of nodes to run the NCCL test, up to 18:
```yaml
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nccl-test-compute-domain
spec:
  numNodes: NUM_NODES
  channel:
    resourceClaimTemplate:
      name: nccl-test-compute-domain-channel
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: kueue-tas-nccl-all-gather
  labels:
    kueue.x-k8s.io/queue-name: a4x
spec:
  ttlSecondsAfterFinished: 1200
  network:
    enableDNSHostnames: true
  replicatedJobs:
  - name: worker
    template:
      spec:
        parallelism: NUM_NODES
        completions: NUM_NODES
        template:
          metadata:
            annotations:
              kueue.x-k8s.io/podset-required-topology: "cloud.google.com/gce-topology-subblock"
              networking.gke.io/default-interface: 'eth0'
              networking.gke.io/interfaces: |
                [
                  {"interfaceName":"eth0","network":"default"},
                  {"interfaceName":"eth2","network":"rdma-0"},
                  {"interfaceName":"eth3","network":"rdma-1"},
                  {"interfaceName":"eth4","network":"rdma-2"},
                  {"interfaceName":"eth5","network":"rdma-3"}
                ]
          spec:
            activeDeadlineSeconds: 3600
            restartPolicy: Never
            nodeSelector:
              cloud.google.com/gke-accelerator: nvidia-gb200
            tolerations:
            - key: nvidia.com/gpu
              operator: Equal
              value: present
              effect: NoSchedule
            - key: kubernetes.io/arch
              operator: Equal
              value: arm64
              effect: NoSchedule
            setHostnameAsFQDN: true
            volumes:
            - name: gib
              hostPath:
                path: /home/kubernetes/bin/gib
            - name: nvidia
              hostPath:
                path: /home/kubernetes/bin/nvidia
            - name: lib64
              hostPath:
                path: /lib64
            - name: shared-memory
              emptyDir:
                medium: "Memory"
                sizeLimit: 250Gi
            resourceClaims:
            - name: compute-domain-channel
              resourceClaimTemplateName: nccl-test-compute-domain-channel
            containers:
            - name: nccl-test
              stdin: true
              tty: true
              image: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.0.4
              env:
              - name: MY_NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: OMPI_ALLOW_RUN_AS_ROOT
                value: "1"
              - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                value: "1"
              - name: N_NODES
                value: "NUM_NODES"
              - name: LD_LIBRARY_PATH
                value: /usr/local/nvidia/lib64
              command:
              - bash
              - -c
              - |
                set -x
                echo "Starting workload container on ${MY_NODE_NAME} for $N_NODES benchmark"

                # Install ping
                apt update -y
                apt install -y iputils-ping

                # Start sshd
                /scripts/container_entry.sh daemon &

                # Get helper variables to form all hostnames
                export POSTFIX=$(hostname | cut -d . -f 2-)
                export WORKERS_BASENAME=$(hostname | cut -d . -f 1 | rev | cut -d - -f 2- | rev )
                export NODE_RANK=$JOB_COMPLETION_INDEX

                # For every worker, wait till online and add to hostfile
                for i in `seq 0 $(($N_NODES-1))`; do
                  OTHER=${WORKERS_BASENAME}-${i}.${POSTFIX}
                  until ssh -p 222 -o StrictHostKeyChecking=no $OTHER hostname; do
                    echo Waiting for ${OTHER}...
                    sleep 10
                  done
                  echo ${OTHER} port=222 slots=4 | tee -a /tmp/hostfile;
                done

                cat /tmp/hostfile

                # Launch from head node
                if [[ "${NODE_RANK}" -eq "0" ]]; then

                  # World Level = 0x0, Rail Aligned = 0x7
                  export NCCL_TESTS_SPLIT_MASK="0x0";

                  # Force use of libnccl-gib
                  export NCCL_NET=gIB

                  # Set all the correct libnccl-gib environment variables
                  source /usr/local/gib/scripts/set_nccl_env.sh

                  # Get all relevant NCCL / env vars to pass to all workers
                  ENV_VARS=$(echo ${!NCCL*} ${!OMPI*} LD_LIBRARY_PATH PATH | sed 's/ / -x /g')

                  mpirun --hostfile /tmp/hostfile \
                    -x $ENV_VARS \
                    -mca plm_rsh_no_tree_spawn 1 \
                    --mca orte_keep_fqdn_hostnames 1 \
                    --mca btl self,tcp \
                    --mca btl_tcp_if_include eth0 \
                    --bind-to none \
                    --mca plm_rsh_agent "ssh -q -o LogLevel=ERROR -o StrictHostKeyChecking=no -p 222" \
                    /third_party/nccl-tests/build/all_gather_perf -b 1K -e 8G -f 2 -g 1 -w 5 --iters 100 -c 1
                else
                  while ping -c 1 ${WORKERS_BASENAME}-0.${POSTFIX}; do
                    sleep 5
                  done
                fi
                exit 0
              volumeMounts:
              - name: nvidia
                mountPath: /usr/local/nvidia
              - name: gib
                mountPath: /usr/local/gib
              - name: shared-memory
                mountPath: /dev/shm
              resources:
                limits:
                  nvidia.com/gpu: 4
                requests:
                  nvidia.com/gpu: 4
                claims:
                - name: compute-domain-channel
```
Run the test:
```
kubectl apply -f nccl-tas-test.yaml
```
Check the test result by reviewing the logs:
```
kubectl logs $(kubectl get pods -o go-template='{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | grep kueue-tas-nccl-all-gather-worker-0-0)
```
The output should be similar to the following:
```
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
        1024             8     float    none      -1    56.72    0.02    0.02       0    56.12    0.02    0.02       0
        2048            16     float    none      -1    56.85    0.04    0.03       0    56.87    0.04    0.03       0
        4096            32     float    none      -1    57.53    0.07    0.07       0    57.47    0.07    0.07       0
        8192            64     float    none      -1    58.43    0.14    0.14       0    58.27    0.14    0.14       0
       16384           128     float    none      -1    59.29    0.28    0.27       0    58.87    0.28    0.27       0
       32768           256     float    none      -1    60.02    0.55    0.53       0    59.60    0.55    0.53       0
       65536           512     float    none      -1    61.83    1.06    1.03       0    61.64    1.06    1.03       0
      131072          1024     float    none      -1    70.99    1.85    1.79       0    70.82    1.85    1.79       0
      262144          2048     float    none      -1    71.56    3.66    3.55       0    71.07    3.69    3.57       0
      524288          4096     float    none      -1    72.62    7.22    6.99       0    71.90    7.29    7.06       0
     1048576          8192     float    none      -1    72.80   14.40   13.95       0    72.31   14.50   14.05       0
     2097152         16384     float    none      -1    73.40   28.57   27.68       0    72.96   28.74   27.85       0
     4194304         32768     float    none      -1    73.86   56.78   55.01       0    73.44   57.12   55.33       0
     8388608         65536     float    none      -1    102.5   81.86   79.30       0    101.4   82.69   80.11       0
    16777216        131072     float    none      -1    158.3  105.97  102.66       0    156.8  107.02  103.68       0
    33554432        262144     float    none      -1    158.4  211.89  205.26       0    157.5  212.99  206.33       0
    67108864        524288     float    none      -1    250.7  267.68  259.32       0    248.7  269.81  261.38       0
   134217728       1048576     float    none      -1    417.7  321.29  311.25       0    414.1  324.13  314.01       0
   268435456       2097152     float    none      -1    728.8  368.32  356.81       0    721.5  372.08  360.45       0
   536870912       4194304     float    none      -1   1226.5  437.72  424.04       0   1216.1  441.46  427.66       0
  1073741824       8388608     float    none      -1   2268.4  473.35  458.56       0   2247.0  477.86  462.93       0
  2147483648      16777216     float    none      -1   4330.6  495.88  480.39       0   4291.6  500.39  484.76       0
  4294967296      33554432     float    none      -1   8640.9  497.05  481.52       0   8544.0  502.69  486.98       0
  8589934592      67108864     float    none      -1    17258  497.75  482.19       0    17052  503.75  488.00       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 157.091
```
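As a sanity check on these columns, nccl-tests derives bus bandwidth from algorithm bandwidth; for all-gather the relation is busbw = algbw * (n-1)/n, where n is the total number of ranks (GPUs). A quick awk check using the largest message size above, assuming 32 ranks (8 nodes x 4 GPUs; the rank count here is inferred from the ratio in the sample, not stated in the output):

```shell
# busbw = algbw * (n-1)/n for all-gather; n = total GPU ranks.
# algbw is taken from the out-of-place 8 GiB row of the sample output above.
awk -v algbw=497.75 -v n=32 'BEGIN { printf "busbw ~= %.2f GB/s\n", algbw * (n - 1) / n }'
# prints: busbw ~= 482.20 GB/s
```

This is within rounding of the 482.19 GB/s the test itself reports, which confirms the columns are internally consistent.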
What's next
- See Collect and Understand NCCL Logs for Troubleshooting to understand the test outputs and troubleshoot issues.
- Learn about troubleshooting slow performance.