This document provides an overview of how to enable GPUDirect-TCPXO to
optimize communication in multi-node workloads, such as ML training. It uses
the NCCL tests to measure NCCL collective performance between two nodes of an
A3 Mega (a3-megagpu-8g) Slurm cluster.
Before you begin
Ensure that you have created an A3 Mega Slurm cluster. To create the cluster, see Deploy an A3 Mega Slurm cluster for ML training.
Overview
To enable GPUDirect-TCPXO and test NCCL communication, complete the following steps:
- Create an enroot container. For ML training on Slurm clusters, we recommend using enroot and pyxis, which together let you create and run containers with Slurm.
- Build the NCCL test.
- Set the GPUDirect-TCPXO environment variables and run the NCCL test.
Connect to the A3 Mega Slurm cluster
To enable optimized NCCL communication tuning on your cluster, you must log in to the Slurm login node. To log in, you can use either the Google Cloud console or the Google Cloud CLI.
Console
Go to the Compute Engine > VM instances page.
Locate the login node. It should have a name similar to a3mega-login-001.
From the Connect column of the login node, click SSH.
gcloud
To connect to the login node, use the gcloud compute ssh command:
gcloud compute ssh $(gcloud compute instances list --filter "name ~ login" --format "value(name)") \
    --tunnel-through-iap \
    --zone ZONE
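Replace ZONE with the zone where the login node is deployed. If you aren't sure of the zone, you can look it up first; for example, the following listing includes a ZONE column for each instance:
gcloud compute instances list --filter "name ~ login"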
Create an enroot container
From the login node on your cluster, import a PyTorch image from the NVIDIA container registry.
To import the PyTorch image, run the following Slurm srun command from the login node:
srun -N 1 enroot import docker://nvcr.io#nvidia/pytorch:24.04-py3
This command runs on one of your a3-megagpu-8g nodes, which has more CPU and
memory than the login node, so enroot can import the container more quickly.
When the import completes, you should have a file named nvidia+pytorch+24.04-py3.sqsh in the directory where you ran the command.
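Optionally, you can verify the imported image before moving on. The following is a minimal sketch; it assumes the .sqsh file is in your current directory and that your cluster has a partition named a3mega, as used later in this guide:

# Confirm the squashfs image exists and has a nonzero size.
ls -lh nvidia+pytorch+24.04-py3.sqsh

# Start the container on one compute node and check that the GPUs are visible.
srun -N 1 --partition a3mega \
    --gpus-per-node=8 \
    --container-image=./nvidia+pytorch+24.04-py3.sqsh \
    nvidia-smi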
Build NCCL test
Next, build the NCCL-test binaries by running the following command from the same directory as the previous step:
CONTAINER_IMAGE=./nvidia+pytorch+24.04-py3.sqsh
git clone https://github.com/NVIDIA/nccl-tests.git
srun --partition a3mega \
    --ntasks-per-node=1 \
    --gpus-per-node=8 \
    --container-mounts="$PWD:/nccl" \
    --container-image=${CONTAINER_IMAGE} \
    bash -c "
    cd /nccl/nccl-tests/ &&
    MPI=1 CC=mpicc CXX=mpicxx make -j
    "
This creates a directory named nccl-tests. The preceding command uses --container-mounts to mount your current working directory $PWD into the /nccl directory inside the container.
After the srun command finishes, check that the nccl-tests/build folder contains several binaries, including all_gather_perf.
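For example, a listing along the following lines confirms the build; the exact set of binaries can vary with the nccl-tests version:

ls nccl-tests/build/
# Typical output includes binaries such as:
# all_gather_perf  all_reduce_perf  alltoall_perf  broadcast_perf
# reduce_perf  reduce_scatter_perf  sendrecv_perf  ...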
Run NCCL test
As part of the cluster deployment process, a Slurm prolog and epilog are
installed. These handle the automatic installation of a custom
libnccl-net.so and run a sidecar process that enables GPUDirect-TCPXO
optimized communication.
To run any job on an A3 Mega cluster, several environment variables must be set to enable high performance networking with GPUDirect-TCPXO. Because this procedure uses enroot containers to launch workloads, these variables must be set in the container environment rather than in the host environment.
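Because pyxis propagates the job's exported environment into the enroot container by default, a common pattern is to export the GPUDirect-TCPXO variables in the sbatch script before calling srun. The fragment below is a minimal sketch of that pattern only; the library path shown is an assumption about where the prolog installs the custom NCCL net plugin, and the full variable list comes from the GPUDirect-TCPXO documentation rather than from this guide:

# Hypothetical fragment of a job script. Variables exported here are passed
# into the enroot container by pyxis when srun launches the workload.
# NOTE: /var/lib/tcpxo/lib64 is an assumed install location for the custom
# libnccl-net.so; verify the actual path on your cluster.
export NCCL_LIB_DIR="/var/lib/tcpxo/lib64"
export LD_LIBRARY_PATH="${NCCL_LIB_DIR}:${LD_LIBRARY_PATH}"
# ...plus the NCCL_* tuning variables listed in the GPUDirect-TCPXO documentation.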
To run the NCCL test, select one of the following options based on whether or not the VMs in your cluster use a compact placement policy.
Without placement policy
If you don't use a compact placement policy, complete the following steps:
- Use a text editor to create a file named run-nccl-tests.sh and add the content of the NCCL test job script to the file. A hedged sketch of what such a script can look like is shown after the sample output below.
- Submit the script:
sbatch run-nccl-tests.sh
This results in a slurm-XX.out file that contains the result of the NCCL all_reduce_perf benchmark. The output is similar to the following:
#
#                                                     out-of-place                        in-place
#       size         count    type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                              (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
   268435456       4194304   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
   536870912       8388608   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
  1073741824      16777216   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
  2147483648      33554432   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
  4294967296      67108864   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
  8589934592     134217728   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : XXX.XX
#
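The exact contents of run-nccl-tests.sh depend on your deployment. The following is a minimal sketch only, assuming the a3mega partition, the nvidia+pytorch+24.04-py3.sqsh image and nccl-tests build from the previous steps, and the GPUDirect-TCPXO environment variables described earlier; treat it as an illustration to adapt, not the exact script:

#!/bin/bash
#SBATCH --partition=a3mega
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# Container image and nccl-tests build from the previous steps.
CONTAINER_IMAGE=./nvidia+pytorch+24.04-py3.sqsh

# Export the GPUDirect-TCPXO environment variables here so that pyxis
# propagates them into the container (see the discussion above). The values
# are cluster-specific and are not reproduced in this sketch.

# Run all_reduce_perf across both nodes (16 GPUs total), doubling the message
# size from 256 MiB to 8 GiB as in the sample output above. Depending on the
# cluster's Slurm MPI configuration, an explicit --mpi flag (for example,
# --mpi=pmi2) may also be needed.
srun --container-image=${CONTAINER_IMAGE} \
    --container-mounts="$PWD:/nccl" \
    /nccl/nccl-tests/build/all_reduce_perf \
        -b 256M -e 8G -f 2 -g 1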
With placement policy
If you use a compact placement policy, complete the following steps:
- Use a text editor to create a file named run-topological-nccl-tests.sh and add the content of the topology-aware NCCL test job script to the file. A note on how this script differs from the previous one follows the sample output below.
- Submit the script:
sbatch run-topological-nccl-tests.sh
This results in a slurm-XX.out file that contains the result of the NCCL all_reduce_perf benchmark. The output is similar to the following:
#
#                                                     out-of-place                        in-place
#       size         count    type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B)    (elements)                              (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
   268435456       4194304   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
   536870912       8388608   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
  1073741824      16777216   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
  2147483648      33554432   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
  4294967296      67108864   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
  8589934592     134217728   float    none      -1    XXXXX  XXX.XX  XXX.XX     N/A   XXXXXX  XXX.XX  XXX.XX     N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : XXX.XX
#
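The topology-aware script has the same overall shape as the sketch shown for run-nccl-tests.sh. The difference with a compact placement policy is that the job can lay ranks out in an order that matches the nodes' physical placement, which typically means building an ordered host list and handing it to srun. The fragment below illustrates that pattern only; the helper used to derive the ordering (sort_nodes_by_topology) is a hypothetical placeholder, not a tool this guide provides:

# Hypothetical fragment: lay tasks out according to a topologically ordered
# host list. With --distribution=arbitrary, srun places tasks in the order
# listed in the file referenced by SLURM_HOSTFILE (one line per task).
scontrol show hostnames "${SLURM_JOB_NODELIST}" > hosts.txt
sort_nodes_by_topology hosts.txt > ordered_hosts.txt                      # placeholder helper
awk '{for (i = 0; i < 8; i++) print}' ordered_hosts.txt > hostfile.txt    # 8 tasks per node
export SLURM_HOSTFILE="${PWD}/hostfile.txt"

srun --distribution=arbitrary \
    --container-image=${CONTAINER_IMAGE} \
    --container-mounts="$PWD:/nccl" \
    /nccl/nccl-tests/build/all_reduce_perf -b 256M -e 8G -f 2 -g 1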