This page describes how to install NCCL/gIB from either a Debian software package (.deb) or a Red Hat Package Manager package (.rpm). After installation, you can run NCCL tests on A3 Ultra, A4, and A4X VMs. The examples on this page are for 2-node tests.
If you use one of Google's first-party (1P) schedulers, such as GKE or Cluster Toolkit (with Slurm and GKE support), you don't need to follow the steps on this page. Instead, follow the instructions on the page that matches your scenario:
- Run NCCL on GKE clusters that use default configuration
- Run NCCL on custom GKE clusters that use A4X
- Run NCCL on custom GKE clusters that use A4 or A3 Ultra
- Run NCCL tests on Slurm clusters
Install nccl-gib
Depending on where you run your workloads, you install NCCL/gIB in either the guest VM or the container image.
The nccl-gib package is bundled with an unmodified NVIDIA NCCL library (libnccl2.so) and headers. All NCCL/gIB content is installed to the /usr/local/gib directory. Some dependencies are also fetched from the distribution's repository.
Debian 12+/Ubuntu 20.04+ (.deb package)
# If not using an image from Google, trust the GCP signing key
curl http://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/cloud.google.gpg
# Add gpudirect-gib-apt repo
echo 'deb https://packages.cloud.google.com/apt gpudirect-gib-apt main' | sudo tee /etc/apt/sources.list.d/nccl-gib.list
sudo apt update
sudo apt install nccl-gib
RockyLinux/CentOS/RHEL 9+ (.rpm package)
# Add gpudirect-gib-rpm repo
sudo tee -a /etc/yum.repos.d/nccl-gib.repo << EOL
[gpudirect-gib-rpm]
name=NCCL/gIB
baseurl=https://packages.cloud.google.com/yum/repos/gpudirect-gib-rpm
enabled=1
repo_gpgcheck=0
gpgcheck=0
EOL
sudo dnf makecache
sudo dnf install nccl-gib
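After either installation path, a quick sanity check is to confirm that the files this page refers to exist under /usr/local/gib. The following is a minimal sketch, not a tool shipped with the package: the function name check_gib_install is hypothetical, and it only tests for the paths mentioned on this page.

```shell
# Hypothetical helper: report whether the NCCL/gIB paths mentioned on this
# page exist under the install root (default: /usr/local/gib).
check_gib_install() {
  local root="${1:-/usr/local/gib}"
  local status=0
  local path
  for path in \
    "$root/lib64" \
    "$root/scripts/set_nccl_env.sh" \
    "$root/scripts/run_nccl_tests.sh"; do
    if [ -e "$path" ]; then
      echo "found:   $path"
    else
      echo "missing: $path"
      status=1
    fi
  done
  # Non-zero exit status if anything was missing.
  return "$status"
}
```

Running check_gib_install with no arguments checks the default location; pass a different root as the first argument when the package content is inside a container image mounted elsewhere.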
If you are using standard OS images, you must also install the latest NVIDIA DOCA-OFED driver. You don't need to install this driver if you are using Google's A* optimized images, such as Container OS or Guest Accelerator Ubuntu/RockyLinux OS Images.
To avoid VMs running different versions of the nccl-gib package, we recommend that you update nccl-gib before you run your NCCL workloads or disable unattended-upgrades.
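One way to act on this recommendation is to compare the installed nccl-gib version on every VM before a run. The sketch below is not part of the package. Both function names are hypothetical, and collect_gib_versions assumes Debian or Ubuntu VMs (it calls dpkg-query) that accept non-interactive SSH logins. On RPM-based systems, rpm -q --queryformat '%{VERSION}-%{RELEASE}' nccl-gib would play the same role.

```shell
# Hypothetical helper: print "host version" for each VM given as an argument.
# Assumes Debian/Ubuntu guests and key-based SSH access.
collect_gib_versions() {
  local host
  for host in "$@"; do
    printf '%s %s\n' "$host" \
      "$(ssh -n -o BatchMode=yes "$host" "dpkg-query -W -f='\${Version}' nccl-gib")"
  done
}

# Hypothetical helper: read "host version" lines on stdin, list the hosts per
# version, and succeed only if every host reported the same version.
gib_versions_match() {
  awk '
    NF == 1 { bad = 1; next }              # host with no version reported
    NF >= 2 { seen[$2] = seen[$2] " " $1 }
    END {
      n = 0
      for (v in seen) { n++; printf "%s:%s\n", v, seen[v] }
      exit ((n == 1 && !bad) ? 0 : 1)
    }
  '
}
```

A typical call would be collect_gib_versions a4-vm-1 a4-vm-2 | gib_versions_match, which exits with a non-zero status when the VMs disagree or when a VM reports no version.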
Use NCCL/gIB
To enable NCCL/gIB in your workloads, ensure the following:
- /usr/local/gib/scripts/set_nccl_env.sh is sourced in your runtime environment. The source file includes all the necessary environment variables for NCCL/gIB, and Google expects to update them in future NCCL/gIB releases.
- The /usr/local/gib/lib64 directory is in your LD_LIBRARY_PATH.
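The two requirements above can be sketched as a short snippet for a job script or shell profile. GIB_ROOT is a hypothetical override variable introduced here for illustration; the documented install location is /usr/local/gib.

```shell
# Assumption: GIB_ROOT is a hypothetical variable; it defaults to the
# documented install directory.
GIB_ROOT="${GIB_ROOT:-/usr/local/gib}"

# Source the environment script shipped with the package, if it is present.
if [ -f "$GIB_ROOT/scripts/set_nccl_env.sh" ]; then
  . "$GIB_ROOT/scripts/set_nccl_env.sh"
else
  echo "warning: $GIB_ROOT/scripts/set_nccl_env.sh not found" >&2
fi

# Prepend lib64 to LD_LIBRARY_PATH only if it is not already listed.
case ":${LD_LIBRARY_PATH:-}:" in
  *":$GIB_ROOT/lib64:"*) ;;
  *) export LD_LIBRARY_PATH="$GIB_ROOT/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
esac
```

Sourcing the script at run time, rather than copying its variables into your own configuration, means your workloads pick up any changes made in later NCCL/gIB releases.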
To verify that NCCL/gIB is enabled, check that the following NCCL INFO-level log entries are present:
# A sample log entry from NCCL core
vm-0:606:642 [6] NCCL INFO Using network gIB
# A sample log entry from the gIB network plugin
vm-0:606:642 [6] NCCL INFO NET/gIB : Initializing gIB v1.0.5
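If you capture the workload's output in a file (with NCCL_DEBUG=INFO set so that INFO-level entries are emitted), this check can be scripted. The following is a minimal sketch: the function name gib_enabled_in_log is hypothetical, and it only searches for the two message patterns shown above.

```shell
# Hypothetical helper: succeed only if the log file contains both the NCCL
# core entry and the gIB plugin entry shown on this page.
gib_enabled_in_log() {
  local log="$1"
  grep -q 'NCCL INFO Using network gIB' "$log" &&
    grep -q 'NCCL INFO NET/gIB' "$log"
}
```

For example, gib_enabled_in_log run.log && echo "gIB enabled" prints the message only when both entries are found, where run.log is whatever file you redirected the output to.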
Run NCCL tests
To learn how to run NCCL tests in a scheduled environment, see the following:
- Run NCCL on GKE clusters that use default configuration
- Run NCCL on custom GKE clusters that use A4X
- Run NCCL on custom GKE clusters that use A4 or A3 Ultra
- Run NCCL tests on Slurm clusters
We also publish a diagnostic container image with everything included at http://us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic:latest.
To run NCCL tests in a non-scheduled environment:
- Install cuda-12.8 (or newer) and openmpi
- Set up non-interactive ssh logins among the VMs
- Build nccl-tests with MPI enabled. When building nccl-tests, set NCCL_HOME=/usr/local/gib.
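The last two preparation steps can be sketched as follows. Both function names are hypothetical. The make variables (MPI=1, MPI_HOME, CUDA_HOME, NCCL_HOME) are the ones the nccl-tests Makefile accepts; the /usr/local/cuda default and the Open MPI location are assumptions that depend on how you installed CUDA and Open MPI.

```shell
# Hypothetical helper: succeed only if every host accepts a non-interactive
# (key-based) SSH login. BatchMode=yes makes ssh fail instead of prompting.
check_ssh_hosts() {
  local host status=0
  for host in "$@"; do
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
      echo "ok:     $host"
    else
      echo "failed: $host"
      status=1
    fi
  done
  return "$status"
}

# Hypothetical helper: print (not run) the nccl-tests build command.
#   $1: Open MPI install prefix (required; location varies by distribution)
#   $2: CUDA install prefix (assumed default: /usr/local/cuda)
nccl_tests_make_cmd() {
  local mpi_home="$1"
  local cuda_home="${2:-/usr/local/cuda}"
  printf 'make MPI=1 MPI_HOME=%s CUDA_HOME=%s NCCL_HOME=/usr/local/gib\n' \
    "$mpi_home" "$cuda_home"
}
```

Run the printed command from the root of your nccl-tests checkout. Printing the command instead of running it lets you review the paths first.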
To run the script shipped with the NCCL/gIB package:
# The script assumes binaries at /opt/nccl-tests/build/
$ /usr/local/gib/scripts/run_nccl_tests.sh -d /opt/nccl-tests/build/ -p 22 -t all_gather -m 0x0 -b 4K -e 16G a4-vm-1 a4-vm-2
Example output on two A4 VMs:
NCCL version 2.25.1+cuda12.8
#
#                                                     out-of-place                       in-place
#      size       count   type  redop  root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#       (B)  (elements)                         (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
       4096          64  float   none    -1    59.97    0.07    0.06       0    57.49    0.07    0.07       0
       8192         128  float   none    -1    58.17    0.14    0.13       0    58.36    0.14    0.13       0
      16384         256  float   none    -1    59.07    0.28    0.26       0    59.03    0.28    0.26       0
      32768         512  float   none    -1    60.93    0.54    0.50       0    60.79    0.54    0.51       0
      65536        1024  float   none    -1    61.93    1.06    0.99       0    62.17    1.05    0.99       0
     131072        2048  float   none    -1    64.62    2.03    1.90       0    64.48    2.03    1.91       0
     262144        4096  float   none    -1    66.50    3.94    3.70       0    67.05    3.91    3.67       0
     524288        8192  float   none    -1    69.37    7.56    7.09       0    67.83    7.73    7.25       0
    1048576       16384  float   none    -1    117.2    8.95    8.39       0    113.7    9.22    8.64       0
    2097152       32768  float   none    -1    118.8   17.65   16.55       0    118.1   17.75   16.64       0
    4194304       65536  float   none    -1    122.2   34.32   32.17       0    122.6   34.22   32.08       0
    8388608      131072  float   none    -1    132.2   63.44   59.48       0    130.7   64.20   60.18       0
   16777216      262144  float   none    -1    139.2  120.49  112.96       0    139.7  120.07  112.56       0
   33554432      524288  float   none    -1    152.0  220.81  207.01       0    152.1  220.59  206.81       0
   67108864     1048576  float   none    -1    227.6  294.87  276.44       0    225.9  297.08  278.51       0
  134217728     2097152  float   none    -1    431.7  310.87  291.44       0    438.0  306.41  287.26       0
  268435456     4194304  float   none    -1    728.6  368.44  345.41       0    735.9  364.79  341.99       0
  536870912     8388608  float   none    -1   1404.2  382.33  358.44       0   1418.4  378.51  354.85       0
 1073741824    16777216  float   none    -1   2795.8  384.06  360.05       0   2768.9  387.79  363.55       0
 2147483648    33554432  float   none    -1   5440.1  394.75  370.08       0   5418.7  396.31  371.54       0
 4294967296    67108864  float   none    -1    10754  399.40  374.43       0    10746  399.67  374.69       0
 8589934592   134217728  float   none    -1    21434  400.77  375.72       0    21421  401.01  375.95       0
17179869184   268435456  float   none    -1    42679  402.53  377.38       0    42792  401.48  376.38       0
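To compare runs, it can help to reduce a result table like the one above to its peak bus bandwidth. The following is a minimal sketch, assuming the 13-column layout shown in the example output (busbw is column 8 for out-of-place and column 12 for in-place); the function name peak_busbw is hypothetical.

```shell
# Hypothetical helper: print the peak out-of-place and in-place busbw (GB/s)
# from nccl-tests output. Reads the named files, or stdin if none are given.
# Only 13-field rows that start with a message size are considered, so
# header and comment lines are skipped.
peak_busbw() {
  awk '
    NF == 13 && $1 ~ /^[0-9]+$/ {
      if ($8 + 0 > oop + 0) oop = $8
      if ($12 + 0 > ip + 0) ip = $12
    }
    END { printf "out-of-place=%s in-place=%s\n", oop, ip }
  ' "$@"
}
```

Applied to the example output above, the peaks are the values in the last row: 377.38 GB/s out-of-place and 376.38 GB/s in-place.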
What's next
- See Collect and Understand NCCL Logs for Troubleshooting to understand the test outputs and troubleshoot issues.
- Monitor VMs and Slurm clusters.
- Learn about troubleshooting slow performance.