Troubleshoot the Collective Communication Analyzer (CoMMA)
This page shows you how to resolve common issues that you might encounter when
using the Collective Communication Analyzer (CoMMA). CoMMA is a
library that collects telemetry data for Google Cloud services.
For more information, see Collective Communication Analyzer (CoMMA).
Troubleshoot CoMMA loading issues
CoMMA might not load correctly.
To verify that the binaries load correctly, complete these steps:
Enable NCCL debug logging. To enable logging, set the environment variable
NCCL_DEBUG=INFO. You might also use a more detailed debug level.
For options, see the
NCCL_DEBUG
section in the NVIDIA documentation.
Specify the INIT subsystem for debugging. To specify INIT, set
NCCL_DEBUG_SUBSYS=INIT. You might also add other subsystems.
For more subsystem options, see the NCCL_DEBUG_SUBSYS section.
Look for a line in the NCCL log that is similar to the following:
NCCL INFO PROFILER/Plugin: Plugin name set by env to PATH_TO_PROFILER_PLUGIN
If the NCCL_PROFILER_PLUGIN environment variable is unset, NCCL might
attempt to load the libnccl-profiler.so binary from the path specified in
the LD_LIBRARY_PATH environment variable.
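The verification steps above can be sketched in Python. This is a minimal sketch, not part of CoMMA: the helper names `debug_env` and `find_plugin_lines` are hypothetical, and the search relies only on the `PROFILER/Plugin` marker that appears in the sample log line.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with NCCL INIT debug logging enabled."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = "INIT"
    return env


def find_plugin_lines(log_text):
    """Return the NCCL log lines that mention the profiler plugin."""
    # Assumption: plugin-related lines contain the "PROFILER/Plugin" marker,
    # as in the sample line shown above.
    return [line for line in log_text.splitlines() if "PROFILER/Plugin" in line]
```

You could pass `debug_env()` as the `env` argument of `subprocess.run` when you launch your workload, and then pass the captured output to `find_plugin_lines`. An empty result suggests that NCCL did not report loading a profiler plugin.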
If the log line doesn't appear or the plugin fails to load, consider the following solutions:
Verify that the plugin shared library (libnccl-profiler.so) is correctly
named.
Check that the library is located in a directory listed in the LD_LIBRARY_PATH
environment variable. Alternatively, check that the NCCL_PROFILER_PLUGIN
environment variable points directly to the location of the libnccl-profiler.so
binary.
Check that your NCCL version is 2.23 or later, as the NCCL profiler API
requires this version.
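The checks above can be approximated with a short Python sketch. The function names `locate_plugin` and `nccl_version_ok` are hypothetical, and the sketch assumes that `NCCL_PROFILER_PLUGIN`, when set, holds a file path. The dynamic loader can also search locations outside `LD_LIBRARY_PATH`, so an empty result is a hint rather than proof that loading fails.

```python
import os

PLUGIN_NAME = "libnccl-profiler.so"


def locate_plugin(env):
    """Return plugin paths that exist on disk for the given environment mapping."""
    found = []
    # Case 1: NCCL_PROFILER_PLUGIN points directly to the binary.
    explicit = env.get("NCCL_PROFILER_PLUGIN")
    if explicit and os.path.isfile(explicit):
        found.append(explicit)
    # Case 2: the binary sits in a directory listed in LD_LIBRARY_PATH.
    for directory in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, PLUGIN_NAME)
        if directory and os.path.isfile(candidate):
            found.append(candidate)
    return found


def nccl_version_ok(version, minimum=(2, 23)):
    """Compare a dotted version string, such as '2.23.4', with the minimum."""
    # Compare as integers so that 2.9 sorts before 2.23.
    major_minor = tuple(int(part) for part in version.split(".")[:2])
    return major_minor >= minimum
```

For example, `locate_plugin(os.environ)` lists the candidate binaries that are visible to the current process. How you obtain the NCCL version string depends on your setup.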
Troubleshoot missing output files
If you configured your environment to send data collected by CoMMA
to a local file, but the output file is missing, check the NCCL logs
or application logs for messages that are similar to the following:
Failed to open file
Failed to log <telemetry type> to file
These errors indicate an underlying file system issue, such as a missing
directory or insufficient free space. After these errors occur, CoMMA stops
exporting telemetry to files.
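To search a captured log for these messages, you might use a sketch like the following. The function name `find_export_failures` is hypothetical, and the patterns are derived only from the two sample messages above, so the exact wording in your logs might differ.

```python
import re

# Patterns based on the two sample messages; the telemetry type varies.
FAILURE_PATTERNS = [
    re.compile(r"Failed to open file"),
    re.compile(r"Failed to log .+ to file"),
]


def find_export_failures(log_text):
    """Return log lines that match one of the file export failure messages."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in FAILURE_PATTERNS)
    ]
```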
To resolve this issue, consider these solutions:
Check that the NCCL_PROFILER_LATENCY_FILE or NCCL_PROFILER_SUMMARY_FILE
environment variables are set correctly. Provide a valid path and filename
template, such as /tmp/latency-%p.txt.
Check that the process has write permissions to the specified output
directory.
If you modified the NCCL_TELEMETRY_MODE environment variable, check that
you set it to a value that enables local file output (for example, 1 or 4).
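The path and permission checks above can be sketched as follows. The function name `check_output_targets` is hypothetical. The sketch inspects only the parent directory of each configured template and leaves placeholders such as `%p` untouched. Run it as the same user as the workload so that the permission check is meaningful.

```python
import os
import shutil

FILE_VARS = ("NCCL_PROFILER_LATENCY_FILE", "NCCL_PROFILER_SUMMARY_FILE")


def check_output_targets(env):
    """Return a list of problems found with the configured output paths."""
    problems = []
    for name in FILE_VARS:
        template = env.get(name)
        if not template:
            continue  # Variable is unset or empty; nothing to check.
        directory = os.path.dirname(template) or "."
        if not os.path.isdir(directory):
            problems.append(f"{name}: directory {directory} does not exist")
        elif not os.access(directory, os.W_OK | os.X_OK):
            problems.append(f"{name}: directory {directory} is not writable")
        elif shutil.disk_usage(directory).free == 0:
            problems.append(f"{name}: no free space in {directory}")
    return problems
```

For example, `check_output_targets(os.environ)` returns an empty list when every configured output directory exists, is writable, and has free space.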
Troubleshoot unexpected data or missing events
CoMMA might capture unexpected
data or miss expected events.
To resolve this issue, check that the required level of granularity is set.
Last updated 2025-09-04 UTC.