Run NCCL on Slurm clusters

This page describes how to run NCCL/gIB tests on a Slurm cluster. Choose the steps for your machine type:

A4X and A4 machines

The following test uses Ramble, which is an open-source, multi-platform experimentation framework written in Python that is used to coordinate the running of NCCL tests. Ramble and its dependencies are compatible with the ARM64 architecture used by A4X machines.

The run scripts used for this test are staged in the /opt/apps/system_benchmarks directory on the Slurm controller node and are available to all nodes in the cluster. Running this test installs Ramble to the /opt/apps/ramble directory.

  1. From the login node in the ${HOME} directory, run the following command. Because the test can take approximately 10 minutes, or longer if other jobs are in the queue, the command uses nohup and redirects stdout and stderr to a log file.

    nohup bash /opt/apps/system_benchmarks/run-nccl-tests-via-ramble.sh >& nccl.log &

    This command creates a folder called nccl-tests_$(date +%s) that stores all of the test results. The $(date +%s) suffix is the current Unix timestamp, which ensures that each run gets a unique folder.

    For example, if your cluster has 16 nodes, then NCCL tests are run for all-gather, all-reduce, and reduce-scatter on 2, 4, 8, and 16 nodes.

  2. Review the results. The nccl.log file contains the logs from setting up and running the test. To view these logs, run the following:

    tail -f nccl.log

    You can also use Ctrl+C to stop tailing the output at any time. At the end of the nccl.log, your output should resemble the following:

    ...
    ---- SUMMARY for >1GB Message Sizes ----
    workload        n_nodes msg_size        busbw
    all-gather      2       1073741824      ###.##
    all-gather      2       2147483648      ###.##
    all-gather      2       4294967296      ###.##
    all-gather      2       8589934592      ###.##
    ...
    all-reduce      2       1073741824      ###.##
    ...
    reduce-scatter  2       1073741824      ###.##
    ...
    -------- Benchmarking Complete -------
    

    All of the Slurm job scripts and nccl-tests output logs are stored in the nccl-tests_$(date +%s)/experiments directory. A summary of the NCCL test performance is also stored in the nccl-tests_$(date +%s)/summary.tsv file.

    Removing the nccl-tests_$(date +%s)/ directory removes all of the files generated during these tests.
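The summary.tsv file can be sliced with standard tools. The following is a minimal sketch, assuming the four-column layout (workload, n_nodes, msg_size, busbw) shown in the sample summary above; the workload name passed to awk is just an example:

```shell
# Print the n_nodes, msg_size, and busbw columns for one workload
# from the NCCL summary file. The column layout is assumed from the
# summary printed at the end of nccl.log; NR > 1 skips the header row.
awk -v w="all-gather" 'NR > 1 && $1 == w { print $2, $3, $4 }' summary.tsv
```

Swap the value of w to all-reduce or reduce-scatter to inspect the other workloads.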

A3 Ultra machines

  1. Download the script needed to build the NCCL test by running the following command from the shared directory of the login node (this directory is usually ${HOME}):

    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/build-nccl-tests.sh
  2. After the script downloads, import a PyTorch image from the NVIDIA container registry and build the NCCL tests. To do this, run the following command:

    sbatch build-nccl-tests.sh

    The preceding script runs on one of your nodes. It uses the --container-mounts switch to mount your current directory, $PWD, into the /nccl directory within the container.

  3. Verify that the NCCL test is built. To verify this, run the following command:

    sacct -a

    If the job completed successfully, the output is similar to the following:

    JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
    ------------ ---------- ---------- ---------- ---------- ---------- --------
    1            build-ncc+    a3ultra                   112  COMPLETED      0:0
    

    If the build is successful, you should also have a file named nvidia+pytorch+24.09-py3.sqsh in the directory where you ran the command, along with a directory named nccl-tests.

  4. Check that the nccl-tests/build folder contains several binaries, including all_gather_perf.

  5. Download the NCCL test script.

    wget -np -nd https://raw.githubusercontent.com/GoogleCloudPlatform/cluster-toolkit/refs/heads/main/examples/machine-learning/a3-ultragpu-8g/nccl-tests/run-nccl-tests.sh

    To run any job on an A3 Ultra cluster, several environment variables must be set to enable high performance networking with RDMA. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment. You can inspect these variables in the run-nccl-tests.sh script that you just downloaded.

  6. Run the NCCL test script. The test can take approximately 15 minutes, or longer.

    sbatch run-nccl-tests.sh
  7. Review the results. The script outputs a slurm-XX.out file that contains the result of the NCCL all_gather_perf benchmark.

    The output is similar to the following:

    #
    #                                                              out-of-place                       in-place
    #        size         count     type     redop   root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #         (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
        268435456       4194304     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
        536870912       8388608     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
       1073741824      16777216     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
       2147483648      33554432     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
       4294967296      67108864     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
       8589934592     134217728     float    none      -1    #####  ###.##  ###.##    N/A   ######  ###.##  ###.##      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : ###.##
    #
    
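A quick way to confirm the outcome of steps 3 and 4 is to check for the perf binaries directly. This sketch assumes the standard nccl-tests binary names: only all_gather_perf is named above, and the other two are the usual names from the nccl-tests project:

```shell
# Verify that the NCCL test build produced executable benchmark binaries.
# Run this from the directory where you submitted build-nccl-tests.sh.
build_dir="nccl-tests/build"
for b in all_gather_perf all_reduce_perf reduce_scatter_perf; do
  if [ -x "${build_dir}/${b}" ]; then
    echo "ok: ${b}"
  else
    echo "missing: ${b}"
  fi
done
```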
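The pieces described in steps 2 and 5 can be sketched as a single sbatch script. This is illustrative only, not the contents of the real build-nccl-tests.sh or run-nccl-tests.sh: the partition name comes from the sacct sample above, the image and mount path come from steps 2 and 3, and the node count, NCCL variable, and benchmark flags are placeholder assumptions:

```shell
#!/bin/bash
#SBATCH --partition=a3ultra      # partition name taken from the sacct sample above
#SBATCH --nodes=2                # illustrative node count
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# Illustrative placeholder only; the actual RDMA-related variables and
# their values are set in the downloaded run-nccl-tests.sh script.
export NCCL_SOCKET_IFNAME=enp0s12

# Pyxis launches the squashed PyTorch image and mounts the current
# directory at /nccl inside the container, as described in step 2.
srun --container-image="${PWD}/nvidia+pytorch+24.09-py3.sqsh" \
     --container-mounts="${PWD}:/nccl" \
     /nccl/nccl-tests/build/all_gather_perf -b 256M -e 8G -f 2 -g 1
```

The -b 256M and -e 8G flags match the message-size range in the sample output below; -f 2 doubles the size at each step.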
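To pull just the average bus bandwidth out of the slurm-XX.out file, a one-liner like the following works; the line format is taken from the sample output above:

```shell
# Print the average bus bandwidth from an all_gather_perf log.
# Matches the "# Avg bus bandwidth    : <value>" line in the sample output;
# the field separator is a colon followed by any amount of whitespace.
awk -F': *' '/Avg bus bandwidth/ { print $2 }' slurm-XX.out
```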

What's next