Enable GPUDirect-TCPXO optimized NCCL communication

This document provides an overview of how to enable GPUDirect-TCPXO to optimize communication in multi-node workloads, such as ML training. It uses the NCCL tests to measure NCCL collective performance between two nodes of an A3 Mega (a3-megagpu-8g) Slurm cluster.

Before you begin

Ensure that you have created an A3 Mega Slurm cluster. To create the cluster, see Deploy an A3 Mega Slurm cluster for ML training.

Overview

To enable GPUDirect-TCPXO and test NCCL communication, complete the following steps:

  1. Create an enroot container. To perform ML training on Slurm clusters, we recommend using enroot and pyxis, which together let you create and run containers with Slurm.
  2. Build the NCCL test.
  3. Set the GPUDirect-TCPXO environment variables and run the NCCL test.

Connect to the A3 Mega Slurm cluster

To enable optimized NCCL communication tuning on your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.

Console

  1. Go to the Compute Engine > VM instances page.

    Go to VM instances

  2. Locate the login node. It should have a name similar to a3mega-login-001.

  3. From the Connect column of the login node, click SSH.

gcloud

To connect to the login node, use the gcloud compute ssh command. Replace ZONE with the zone of your login node.

gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") \
  --tunnel-through-iap \
  --zone ZONE

Create an enroot container

To import a PyTorch image from the NVIDIA container registry, run the following Slurm srun command from the login node:

srun -N 1 enroot import docker://nvcr.io#nvidia/pytorch:24.04-py3

This command runs on one of your a3-megagpu-8g nodes, which has more CPU and memory than the login node, so enroot can import the container more quickly. When the import completes, you should have a file named nvidia+pytorch+24.04-py3.sqsh in the directory where you ran the command.
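
To spot-check the imported image, you can run a short command inside the container. The following is a minimal sketch; it assumes the squashfs file is in your current directory and uses the same pyxis --container-image flag as the later steps:

srun -N 1 --container-image=./nvidia+pytorch+24.04-py3.sqsh \
  bash -c 'python -c "import torch; print(torch.__version__)"'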

Build NCCL test

Next, build the nccl-tests binaries by running the following commands from the same directory as the previous step:

CONTAINER_IMAGE=./nvidia+pytorch+24.04-py3.sqsh

git clone https://github.com/NVIDIA/nccl-tests.git

srun --partition a3mega \
      --ntasks-per-node=1 \
      --gpus-per-node=8 \
      --container-mounts="$PWD:/nccl" \
      --container-image=${CONTAINER_IMAGE} \
      bash -c "
          cd /nccl/nccl-tests/ &&
          MPI=1 CC=mpicc CXX=mpicxx make -j
        "

The git clone command creates a directory named nccl-tests. The srun command then uses --container-mounts to mount your current working directory, $PWD, into the /nccl directory inside the container and builds the test binaries there. After the srun command finishes, check that the nccl-tests/build folder contains several binaries, including all_gather_perf.
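
For example, listing the build directory should show the per-collective benchmarks; the names below come from the nccl-tests repository:

ls nccl-tests/build
# all_gather_perf  all_reduce_perf  alltoall_perf  broadcast_perf  gather_perf
# hypercube_perf  reduce_perf  reduce_scatter_perf  scatter_perf  sendrecv_perf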

Run NCCL test

As part of the cluster deployment process, a Slurm prolog and epilog are installed. These handle the automatic installation of a custom libnccl-net.so and run a sidecar process that enables GPUDirect-TCPXO optimized communication.

To run any job on an A3 Mega cluster, several environment variables must be set to enable high performance networking with GPUDirect-TCPXO. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment.
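
The scripts in the following sections do this by collecting every NCCL_* variable set on the host and passing the list to the container through the pyxis --container-env flag. As a minimal sketch of that mechanism (the variable values here are illustrative):

export NCCL_DEBUG=INFO
export NCCL_SOCKET_IFNAME=enp0s12
# ${!NCCL*} expands to the names of all shell variables that start with NCCL,
# separated by spaces; sed converts the spaces into the comma-separated list
# that --container-env expects.
HOST_VARS=$(sed 's/ \{1,\}/,/g' <<<"${!NCCL*}")
echo "$HOST_VARS"   # prints: NCCL_DEBUG,NCCL_SOCKET_IFNAME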

To run the NCCL test, select one of the following options based on whether or not the VMs in your cluster use a compact placement policy.

Without placement policy

If you don't use a compact placement policy, complete the following steps:

  1. Use a text editor to create a file named run-nccl-tests.sh and add the following content to the file:

    #!/bin/bash
    # Copyright 2024 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    #SBATCH --partition=a3mega
    #SBATCH --mem=0
    #SBATCH -N 2
    #SBATCH --gpus-per-node=8
    #SBATCH --ntasks-per-node=8
    
    # Usage: sbatch run-nccl-tests.sh
    
    set -x
    # This should be set to the squashfs file that you created for your application
    CONTAINER_IMAGE=./nvidia+pytorch+24.04-py3.sqsh
    
    # Set up NCCL Environment variables
    # The following two can be useful for debugging
    # export NCCL_DEBUG=INFO
    # export NCCL_DEBUG_SUBSYS=INIT,NET
    
    # These parameters should not be modified
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices
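    # For reference: NCCL_FASTRAK_CTRL_DEV and NCCL_SOCKET_IFNAME name the
    # control-plane NIC, NCCL_FASTRAK_IFNAME lists the eight data-path NICs
    # that carry GPU traffic, and NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY points
    # at the device directory used by LLCM.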
    
    # Here we grab all the environment variables that need to be
    # passed down into the container. Slurm would otherwise only pass these env vars
    # to the job environment on the host.
    # shellcheck disable=SC2001
    HOST_VARS=$(sed 's/ \{1,\}/,/g' <<<"${!NCCL*}")
    
    # Mount /var/tmp to allow the rest of the enroot container to be read-only, and
    # mount the current $PWD to /nccl for accessing the nccl-tests binaries
    CONTAINER_MOUNTS="/var/tmp:/var/tmp"
    
    # Mount PWD to /nccl in the enroot container
    CONTAINER_MOUNTS=${CONTAINER_MOUNTS},"$PWD:/nccl"
    
    # Mount required directories for GPUDirect-TCPXO functionality
    CONTAINER_MOUNTS=${CONTAINER_MOUNTS},"/var/lib/tcpxo/lib64/"
    
    # Run the workload
    srun -l \
    	-N "${SLURM_NNODES}" \
    	--mpi=pmi2 \
    	--ntasks-per-node=8 \
    	--container-image="${CONTAINER_IMAGE}" \
    	--container-env="${HOST_VARS}" \
    	--container-mounts="${CONTAINER_MOUNTS}" \
    	sh -c "
      export LD_LIBRARY_PATH=/var/lib/tcpxo/lib64:/usr/lib/x86_64-linux-gnu:\$LD_LIBRARY_PATH;
      /nccl/nccl-tests/build/all_gather_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200 -c 0;
      "
    

  2. Submit the script.

    sbatch run-nccl-tests.sh

    This results in a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.
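
    While the job runs, you can check its state with standard Slurm
    commands. For example, assuming sbatch reported job ID 42 (an
    illustrative value):

    squeue -u $USER        # check whether the job is pending or running
    tail -f slurm-42.out   # follow the benchmark output as it is written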

    The output is similar to the following:

    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     268435456       4194304     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
     536870912       8388608     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    1073741824      16777216     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    2147483648      33554432     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    4294967296      67108864     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    8589934592     134217728     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : XXX.XX
    #
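
    In this output, busbw (bus bandwidth) is the figure of merit. For
    all_gather, nccl-tests derives it from the algorithm bandwidth as
    busbw = algbw * (n - 1) / n, where n is the total number of ranks;
    with 2 nodes of 8 GPUs each, n = 16, so busbw = algbw * 15/16.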
    

With placement policy

If you use a compact placement policy, complete the following steps:

  1. Use a text editor to create a file named run-topological-nccl-tests.sh and add the following content to the file:

    #!/bin/bash
    # Copyright 2024 Google LLC
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    #     http://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS,
    # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    # See the License for the specific language governing permissions and
    # limitations under the License.
    
    # shellcheck disable=SC2016
    
    #SBATCH --exclusive
    #SBATCH --partition=a3mega
    #SBATCH --mem=0
    #SBATCH --gpus-per-node=8
    #SBATCH --ntasks-per-node=8
    #SBATCH --nodes 2
    
    # Usage: sbatch run-topological-nccl-tests.sh
    
    set -x
    # This should be set to the squashfs file that you created for your application
    CONTAINER_IMAGE=./nvidia+pytorch+24.04-py3.sqsh
    
    # Set up NCCL Environment variables
    # The following two can be useful for debugging
    # export NCCL_DEBUG=INFO
    # export NCCL_DEBUG_SUBSYS=INIT,NET
    
    # These parameters should not be modified
    # shellcheck source=/dev/null
    NCCL_LIB_DIR="/var/lib/tcpxo/lib64" source /var/lib/tcpxo/lib64/nccl-env-profile.sh
    export NCCL_FASTRAK_CTRL_DEV=enp0s12
    export NCCL_FASTRAK_IFNAME=enp6s0,enp7s0,enp13s0,enp14s0,enp134s0,enp135s0,enp141s0,enp142s0
    export NCCL_SOCKET_IFNAME=enp0s12
    export NCCL_FASTRAK_USE_SNAP=1
    export NCCL_FASTRAK_USE_LLCM=1
    export NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY=/dev/aperture_devices
    
    # Here we grab all the environment variables that need to be
    # passed down into the container. Slurm would otherwise only pass these env vars
    # to the job environment on the host.
    # shellcheck disable=SC2001
    HOST_VARS=$(sed 's/ \{1,\}/,/g' <<<"${!NCCL*}")
    
    # Mount /var/tmp to allow the rest of the enroot container to be read-only, and
    # mount the current $PWD to /nccl for accessing the nccl-tests binaries
    CONTAINER_MOUNTS="/var/tmp:/var/tmp"
    
    # Mount PWD to /nccl in the enroot container
    CONTAINER_MOUNTS=${CONTAINER_MOUNTS},"$PWD:/nccl"
    
    # Mount required directories for GPUDirect-TCPXO functionality
    CONTAINER_MOUNTS=${CONTAINER_MOUNTS},"/var/lib/tcpxo/lib64/"
    
    # Construct topology ordered hostfile
    # The -n, -N, --ntasks-per-node, etc, must match the way the workload is
    # launched in order to ensure proper placement.
    srun --mpi=pmi2 \
    	-n $((SLURM_NNODES * 8)) \
    	--ntasks-per-node=8 \
    	bash -c 'curl -s "http://metadata.google.internal/computeMetadata/v1/instance/attributes/physical_host" -H "Metadata-Flavor: Google"; echo /$SLURMD_NODENAME' |
    	sort -t / -s -k 1,4 |
    	awk -F "/" '{print $NF}' >/var/tmp/topo_sorted_hostfile
    export SLURM_HOSTFILE=/var/tmp/topo_sorted_hostfile
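    # The hostfile now lists each node name once per task (8 entries per node),
    # ordered so that nodes that are close in the physical network topology
    # (the leading fields of physical_host) are adjacent. SLURM_HOSTFILE makes
    # the srun below place ranks in that order.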
    
    # Run the workload
    srun -l \
    	--mpi=pmi2 \
    	--ntasks-per-node=8 \
    	--container-image="${CONTAINER_IMAGE}" \
    	--container-env="${HOST_VARS}" \
    	--container-mounts="${CONTAINER_MOUNTS}" \
    	sh -c "
      export LD_LIBRARY_PATH=/var/lib/tcpxo/lib64:/usr/lib/x86_64-linux-gnu:\$LD_LIBRARY_PATH;
      /nccl/nccl-tests/build/all_gather_perf -b 8M -e 8G -f 2 -g 1 -w 5 --iters 200 -c 0;
      "
    
  2. Submit the script.

    sbatch run-topological-nccl-tests.sh

    This results in a slurm-XX.out file that contains the results of the NCCL all_gather_perf benchmark.

    The output is similar to the following:

    #
    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     268435456       4194304     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
     536870912       8388608     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    1073741824      16777216     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    2147483648      33554432     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    4294967296      67108864     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    8589934592     134217728     float    none      -1    XXXXX  XXX.XX  XXX.XX    N/A   XXXXXX  XXX.XX  XXX.XX    N/A
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : XXX.XX
    #