This document describes the Collective Communication Analyzer (CoMMA), a
library for collecting NCCL telemetry for Google Cloud services.
NCCL telemetry consists of the performance metrics and operational events
that NCCL generates during its execution. The
NVIDIA Collective Communication Library (NCCL)
accelerates high-performance communication between GPUs running in parallel and
distributed computing systems. This high-performance communication is especially
useful for deep learning and high performance computing (HPC).
For NCCL versions 2.23 and later, NVIDIA introduced the NCCL profiler plugin
API,
which lets developers register function callbacks to collect telemetry during
NCCL collective operations. Google provides the
Collective Communication Analyzer (CoMMA),
which is a library that uses NVIDIA's NCCL profiler plugin API to collect NCCL
telemetry for Google Cloud services. CoMMA is installed and enabled
automatically for some images, but you can also disable, re-enable, or manually
install and enable CoMMA to control data collection.
Images that have CoMMA enabled
For A4X, A4 High, and A3 Ultra machine types, CoMMA is installed and
automatically enabled when you use any image that packages
the NCCL Google Infrastructure Bundle (gIB) plugin. The following images contain
the NCCL gIB plugin:
Container-Optimized OS with containerd (cos_containerd)
node images: Google Kubernetes Engine (GKE) uses these images for creating
GKE Autopilot clusters. The CoMMA binaries are
available in the
/home/kubernetes/bin/gib directory.
Deep Learning Software Layer container images: you use these images to deploy
and configure AI and ML frameworks and libraries on GKE clusters.
If you use any of these images and want to disable CoMMA from
collecting NCCL telemetry, see Disable CoMMA.
However, CoMMA must be enabled for features such as
straggler detection to function.
If you don't use these images and want to enable CoMMA to
collect NCCL telemetry, see Install CoMMA.
Benefits
The NCCL telemetry that CoMMA collects helps identify
performance bottlenecks, specifically stragglers, in GPU communication. CoMMA
collects fine-grained data, such as latency histograms for collective
communication operations. A diagnostic service can then process and use this
data to pinpoint stragglers.
Using CoMMA to collect telemetry offers the following benefits:
Required for straggler detection: CoMMA collects the fine-grained
NCCL telemetry that's needed to identify performance bottlenecks, or
stragglers, in GPU-to-GPU communication. This detailed telemetry helps you
identify and resolve issues in large-scale AI and ML training workloads.
For example, CoMMA captures the algorithm used in NCCL operations. This
information is valuable for performance analysis and tuning because
different algorithms can have significantly varying performance
characteristics based on workload and system configuration.
CoMMA also helps you troubleshoot suboptimal performance and
errors. It traces errors that originate in lower-level transport layers, such
as TCP, RDMA, or switch fabrics, back to the specific NCCL collectives and
initiating nodes.
Low-overhead tracing: CoMMA uses minimal computational resources during
active NCCL telemetry collection, making it ideal for performance-sensitive
and long-running machine learning workloads like large language model (LLM)
training.
Broader NCCL telemetry scope: CoMMA uses the NCCL profiler plugin API,
which collects a broader scope of NCCL telemetry than
transport-based plugins do. Transport-based plugins primarily collect telemetry
about the underlying network transport, including data transfers over
network hardware and network protocols. The profiler plugin collects
telemetry for NCCL's communication operations, including the timing of
collective communications, proxy operations, and data transfers.
Understand how CoMMA works
During application runtime, NCCL automatically loads the CoMMA libraries
that are installed in the location specified by the LD_LIBRARY_PATH
environment variable. CoMMA then collects NCCL telemetry, which other Google
services can use. You can also optionally export this data to your local
file system.
What's next
Learn how to enable, disable, and configure CoMMA.
Learn how to troubleshoot issues with CoMMA.
Learn how to detect and resolve stragglers.
Last updated 2025-09-04 UTC.