This page describes how to run NCCL/gIB tests on a Slurm cluster. Choose the steps for your machine type:
A4X and A4 machines
The following test uses Ramble, an open-source, multi-platform experimentation framework written in Python, to coordinate the running of NCCL tests. Ramble and its dependencies are compatible with the ARM64 architecture used by A4X machines.
The run scripts used for this test are staged in the
/opt/apps/system_benchmarks directory on the Slurm controller node and are
available to all nodes in the cluster. Running this test installs Ramble
to the /opt/apps/ramble directory.
From the login node, in the ${HOME} directory, run the following command. Because the test can take approximately 10 minutes, or longer if other jobs are in the queue, the command uses nohup and redirects stdout and stderr to a log file.
nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &
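As a minimal sketch of what the command above does, the following helper wraps the same pattern: nohup keeps the process running after you log out, and both output streams go to one log file. The function name run_detached is hypothetical and is not part of the staged scripts.

```shell
# Hypothetical helper, not part of the staged scripts.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
  log="$1"
  shift
  # "> file 2>&1" is the portable spelling of the bash shorthand ">& file":
  # stdout goes to the file, then stderr is pointed at the same place.
  nohup "$@" > "$log" 2>&1 &
}

# Example (on the login node):
#   run_detached nccl.log bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh
```

After the call, the shell variable $! holds the process ID of the background job, which you can pass to wait or kill.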
This command creates a folder called nccl-tests_$(date +%s) that stores all of the test results. The date tag ensures that a unique folder is created based on each current timestamp.
For example, if your cluster has 16 nodes, then NCCL tests are run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.
Review the results. The nccl.log file contains the logs from setting up and running the test. To view these logs, run the following:
tail -f nccl.log
You can use Ctrl+C to stop tailing the output at any time. At the end of nccl.log, your output should resemble the following:
    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload        n_nodes  msg_size    busbw
    all-gather      2        1073741824  ###.##
    all-gather      2        2147483648  ###.##
    all-gather      2        4294967296  ###.##
    all-gather      2        8589934592  ###.##
    ...
    all-reduce      2        1073741824  ###.##
    ...
    reduce-scatter  2        1073741824  ###.##
    ...
    -------- Benchmarking Complete -------
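If you prefer not to tail the whole log, the summary can be pulled out once the run finishes. The following is a sketch that assumes the log contains the two banner lines exactly as shown in the sample output above; the function names are hypothetical.

```shell
# Hypothetical helpers; they rely only on the banner text shown in the
# sample output ("---- SUMMARY" and "Benchmarking Complete").

# Succeeds once the completion banner has been written to the log.
benchmark_complete() {
  grep -q 'Benchmarking Complete' "$1"
}

# Prints everything from the SUMMARY banner through the completion banner.
print_summary() {
  awk '/---- SUMMARY/          { show = 1 }
       show                    { print }
       /Benchmarking Complete/ { show = 0 }' "$1"
}

# Example:
#   benchmark_complete nccl.log && print_summary nccl.log
```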
All of the Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in the nccl-tests_$(date +%s)/summary.tsv file.
Removing the nccl-tests_$(date +%s)/ directory removes all of the files generated during these tests.
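Because each run creates a new timestamped folder, it can help to locate the newest one and condense its summary. The sketch below assumes that summary.tsv is tab-separated, has a header row, and uses the same four columns as the log summary (workload, n_nodes, msg_size, busbw); check the file before relying on that layout. Both function names are hypothetical.

```shell
# Hypothetical helpers, not part of the staged scripts.

# Prints the nccl-tests_<timestamp> directory with the largest timestamp
# under $1 (default: current directory). Prints nothing if none exist.
latest_results_dir() {
  base="${1:-.}"
  for d in "$base"/nccl-tests_*/; do
    [ -d "$d" ] && printf '%s\n' "${d%/}"
  done |
    awk -F'nccl-tests_' '{ print $NF "\t" $0 }' |
    sort -n | tail -n 1 | cut -f2-
}

# Prints the highest busbw per (workload, n_nodes) pair.
# ASSUMPTION: tab-separated columns workload, n_nodes, msg_size, busbw
# with one header row.
best_busbw() {
  awk -F'\t' 'NR > 1 && NF >= 4 {
      key = $1 "\t" $2
      if (!(key in best) || $4 + 0 > best[key] + 0) best[key] = $4
    }
    END { for (k in best) printf "%s\t%s\n", k, best[k] }' "$1" | sort
}

# Example:
#   best_busbw "$(latest_results_dir "$HOME")/summary.tsv"
```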
A3 Ultra machines
Download the script needed to build the NCCL test by running the following command from the shared directory of the login node (this directory is usually ${HOME}):
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:
sbatch build-nccl-tests.sh
The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.
Verify that the NCCL test is built. To verify this, run the following command:
sacct -a
If successfully completed, the output is similar to the following:
           JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    1            build-ncc+    a3ultra                   112  COMPLETED      0:0
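The same check can be scripted instead of read by eye. The sketch below uses sacct's machine-readable mode (the --jobs, --noheader, --parsable2, and --format options) and a hypothetical helper named job_succeeded that inspects the resulting lines; JOBID is a placeholder for the ID that sbatch printed.

```shell
# Hypothetical helper. Reads "JobID|State|ExitCode" lines on stdin and
# succeeds only if at least one line was read and every line reports
# State COMPLETED with ExitCode 0:0 (job steps such as "1.batch" included).
job_succeeded() {
  awk -F'|' '
    { seen = 1 }
    $2 != "COMPLETED" || $3 != "0:0" { bad = 1 }
    END { exit (seen && !bad) ? 0 : 1 }'
}

# Example (on the cluster; replace JOBID with your job ID):
#   sacct --jobs JOBID --noheader --parsable2 --format=JobID,State,ExitCode \
#     | job_succeeded && echo "build finished cleanly"
```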
If the build is successful, you should also have a file named nvidia+pytorch+24.09-py3.sqsh in the directory where you ran the command, along with a directory named nccl-tests.
Check that the nccl-tests/build folder contains several binaries, including all_gather_perf.
Download the NCCL test script.
wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh
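Before submitting the run script, the build artifacts described earlier can be verified with a short check. This is a sketch with a hypothetical function name; it tests only the two items named on this page, the container image file and the all_gather_perf binary.

```shell
# Hypothetical helper. $1 is the directory where build-nccl-tests.sh was
# submitted (default: current directory).
check_build() {
  dir="${1:-.}"
  status=0
  if [ ! -f "$dir/nvidia+pytorch+24.09-py3.sqsh" ]; then
    echo "missing container image: $dir/nvidia+pytorch+24.09-py3.sqsh" >&2
    status=1
  fi
  if [ ! -x "$dir/nccl-tests/build/all_gather_perf" ]; then
    echo "missing binary: $dir/nccl-tests/build/all_gather_perf" >&2
    status=1
  fi
  return "$status"
}

# Example:
#   check_build "$PWD" && echo "build artifacts found"
```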
To run any job on an A3 Ultra cluster, several environment variables must be set to enable high performance networking with RDMA. Because you use enroot containers in this procedure to launch workloads, these variables must be set in the container environment as opposed to the host environment. These variables can be inspected in the run-nccl-tests.sh script that you just downloaded.
Run the NCCL test script. The test can take approximately 15 minutes, or longer.
sbatch run-nccl-tests.sh
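To know in advance which output file to watch, you can capture the job ID at submission time. The sketch below relies on sbatch's --parsable option, which prints the job ID (optionally followed by a semicolon and the cluster name), and assumes that the script keeps Slurm's default output file name, slurm-JOBID.out. The function name is hypothetical.

```shell
# Hypothetical helper. $1 is the text printed by `sbatch --parsable`,
# which is either "JOBID" or "JOBID;CLUSTER".
# ASSUMPTION: the job script does not override Slurm's default output
# file name (slurm-<jobid>.out).
slurm_log_for() {
  printf 'slurm-%s.out\n' "${1%%;*}"
}

# Example (on the cluster):
#   log=$(slurm_log_for "$(sbatch --parsable run-nccl-tests.sh)")
#   tail -f "$log"
```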
Review the results. The script outputs a slurm-XX.out file that contains the result of the NCCL all_gather_perf benchmark.
The output is similar to the following:
    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       268435456       4194304     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
       536870912       8388608     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
      1073741824      16777216     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
      2147483648      33554432     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
      4294967296      67108864     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
      8589934592     134217728     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : ###.##
    #
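To compare runs, the bandwidth figures can be extracted from the output file. The following is a sketch that assumes the file contains lines laid out like the sample above, with 13 whitespace-separated fields per data row and no extra prefix added to each line; the function names are hypothetical.

```shell
# Hypothetical helpers for a slurm-XX.out file shaped like the sample.

# Prints the value from the "# Avg bus bandwidth : <value>" line.
avg_busbw() {
  awk -F':' '/^# *Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Prints "<size in bytes><TAB><out-of-place busbw>" for each data row.
# Data rows start with a number; out-of-place busbw is field 8 and
# in-place busbw is field 12.
busbw_by_size() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 13 { print $1 "\t" $8 }' "$1"
}

# Example (replace XX with your job ID):
#   avg_busbw slurm-XX.out
#   busbw_by_size slurm-XX.out
```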
What's next
- Collect and Understand NCCL Logs for Troubleshooting to understand the test outputs and troubleshoot issues.
- Monitor VMs and Slurm clusters.
- Learn about troubleshooting slow performance.