Optimize cluster networking by using NCCL/gIB

Modern machine learning frameworks often use the NVIDIA Collective Communications Library (NCCL) for GPU-to-GPU communication primitives.
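To make the collective primitives concrete, the following sketch shows only their *semantics* in plain Python; this is not how NCCL implements them (NCCL uses optimized GPU-to-GPU transports):

```python
# Illustrative sketch of collective semantics only. NCCL implements these
# primitives with optimized GPU-to-GPU communication, not like this.

def all_reduce(buffers):
    """Every rank ends with the elementwise sum of all ranks' buffers."""
    reduced = [sum(vals) for vals in zip(*buffers)]
    return [list(reduced) for _ in buffers]

def all_gather(buffers):
    """Every rank ends with the concatenation of all ranks' buffers."""
    gathered = [x for buf in buffers for x in buf]
    return [list(gathered) for _ in buffers]

# Example: 3 "ranks", each holding a 2-element gradient shard.
ranks = [[1, 2], [3, 4], [5, 6]]
print(all_reduce(ranks))  # [[9, 12], [9, 12], [9, 12]]
print(all_gather(ranks))  # [[1, 2, 3, 4, 5, 6]] on every rank
```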

Google's enhanced version of NCCL, called NCCL/gIB, is available on Google Cloud's A3 Ultra, A4, and A4X VMs. NCCL/gIB typically outperforms upstream NCCL on Google infrastructure. Because NCCL performance can directly affect overall workload performance, we recommend that you use NCCL/gIB.

NCCL/gIB contains Google-specific features and optimizations such as the following:

  • The gIB network plugin improves load balancing on Google's networks, leading to more consistently high throughput and low latency during collective operations.
  • The custom tuner plugin selects the best tuning options for Google Cloud VMs.
  • The CoMMA profiler plugin provides detailed performance metrics and diagnostic data for your workload.
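How NCCL/gIB is installed and enabled depends on your cluster setup, and Google's images typically configure it for you. As a rough sketch, the plugin libraries are made visible to NCCL through the dynamic loader, and standard NCCL debug variables let you verify which plugin actually loaded; the library path below is an illustrative assumption, not a canonical location:

```shell
# Sketch only: the plugin path is an assumption for illustration.
# Consult the documentation for your cluster's actual configuration.

# Make the gIB network plugin library discoverable by NCCL.
export LD_LIBRARY_PATH=/usr/local/gib/lib64:${LD_LIBRARY_PATH}

# Ask NCCL to log which network plugin and tuner it loaded at startup,
# so you can confirm that gIB (not the default transport) is in use.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
```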

NCCL/gIB architecture

NCCL/gIB interacts with your machine learning framework and the NVIDIA GPUs on your clusters to optimize performance and gather telemetry, as shown in this diagram:

The ML workload is managed with an ML framework that connects to both the NVIDIA GPUs and NCCL, while NCCL connects to various Google tools and plugins.

Benefits of using NCCL/gIB

You can use the upstream NVIDIA Collective Communications Library on Google Cloud VMs without stability problems. However, NCCL/gIB is better optimized for Google Cloud, and for certain communication patterns the performance gap can be significant, even with the same NCCL parameters.

For example, the following diagram compares NCCL/gIB with upstream NCCL on AllReduce performance. NCCL/gIB outperforms upstream NCCL by as much as 12x at certain message sizes.

A graph showing NCCL/gIB outperforms upstream NCCL at AllReduce tasks.

32-node NCCL AllReduce performance using A3 Ultra (H200) with no background traffic.

Similarly, as shown in the following image, in a comparison of AllGather performance with background traffic, NCCL/gIB outperforms upstream NCCL by approximately 50% at larger message sizes.

A graph showing NCCL/gIB outperforms upstream NCCL at AllGather tasks.

32-node NCCL AllGather performance using A3 Ultra (H200) on a shared fabric with a noisy background.

In addition, the CoMMA profiler plugin provides Google with detailed custom telemetry, which helps Google assist you if a workload-level issue arises.

Using NCCL/gIB

To run NCCL/gIB tests on your AI Hypercomputer cluster, choose the page that applies to your setup:
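The setup-specific pages walk you through cluster launch steps, but each ultimately runs benchmarks from NVIDIA's nccl-tests suite. As a hedged sketch only (the binary path, rank count, hostfile, and launcher wrapper below are illustrative assumptions that vary by cluster):

```shell
# Sketch: sweep AllReduce message sizes from 1 KiB to 8 GiB, doubling
# each step (-f 2), one GPU per rank (-g 1). Paths and the MPI launcher
# are assumptions; use your cluster's actual launch mechanism.
mpirun -np 256 --hostfile hosts.txt \
    ./build/all_reduce_perf -b 1K -e 8G -f 2 -g 1
```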

To learn how to address any issues with your cluster after you have run your tests, see Collect and understand NCCL/gIB logs for troubleshooting.