Monitor VMs and Slurm clusters

This document explains how to use the monitoring service that is available from the Cluster Director suite. For more information about Cluster Director, see Cluster Director overview.

By using the available metrics in this document, you can create or use prebuilt Cloud Monitoring dashboards to monitor the following:

VM health
GPU performance
Network transmission efficiency
Network efficiency among blocks and sub-blocks
Machine learning (ML) workload efficiency
Straggler detection

Monitoring this data helps you identify and troubleshoot performance bottlenecks, as well as ensure the health and stability of your workloads and infrastructure.

Before you begin

Before monitoring your workload, if you haven't already done so, complete the following steps:

Deploy a workload that you can monitor. To learn which workloads are supported, see Limitations in this document. To learn how to deploy a workload, see VM and cluster creation overview.
Learn about the Google Cloud services for monitoring workloads:
- The metrics in this document use Monitoring dashboards. Learn about Monitoring dashboards, Monitoring retention periods, and Monitoring pricing.
- Straggler detection also provides log entries in Cloud Logging. Learn about Logging interfaces, Logging retention periods, and Logging pricing.

When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

Limitations

The metrics in this document are only supported for Cluster Director workloads that run on VMs that meet all the following criteria:
- The VMs must have been created by using Compute Engine or a Slurm cluster of Compute Engine VMs.
- The VMs must use the A4 or A3 Ultra machine series.
  - However, straggler detection also supports VMs that use the A3 Mega machine series.
- The VMs must use the future reservations consumption option.
Caution: Although metrics might appear for other VMs, those metrics might be incorrect.
Known issues for infrastructure metrics:
- Some VMs might not display GPU Power Consumption metrics.
- Some VMs might not display GPU Temperature or GPU Thermal Margin metrics, or they might display NaN instead of metrics.
- Metrics might not appear in the Google Cloud console for up to seven minutes after they're collected.
To monitor ML workload metrics, you must set up monitoring for your workload.
Straggler detection metrics have the following additional limitations:
- For supported machine series other than A3 Mega, straggler detection only supports VMs that enable the Collective Communication Analyzer (CoMMA) library to export NCCL telemetry to Google Cloud services. For more information, see CoMMA overview.
- Caution: You might see false positives or false negatives when using straggler detection such as, but not limited to, the following:
  - For running VMs that don't support straggler detection, straggler detection logs report that the VMs aren't suspected stragglers.
  - Although straggler detection is accurate for many ML workloads, inaccuracies are more likely for workloads with complex communication patterns.
  Consequently, we recommend that you act on suspected stragglers only when other metrics indicate that the workload is experiencing issues. Otherwise, if the overall performance of the workload is satisfactory, then no action is recommended.
- Straggler detection typically takes up to 10 minutes to report a straggler.
- Unlike the other metrics in this document, you can't filter straggler detection metrics for your projects by cluster, block, sub-block, or VM. However, you can filter queries for straggler detection logs by the ID of one or more VMs that are suspected stragglers.

Required roles

To get the permissions that you need to monitor metrics for Cluster Director workloads, ask your administrator to grant you the following IAM roles:

To view metrics in Cloud Monitoring: Monitoring Editor (roles/monitoring.editor) on the project
To view straggler detection logs in Logging: Logs Viewer (roles/logging.viewer) on the project

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to monitor metrics for Cluster Director workloads. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to monitor metrics for Cluster Director workloads:

To view dashboards: monitoring.dashboards.get on the project
To create dashboards: monitoring.dashboards.create on the project
To view log entries: logging.logEntries.list on the project

You might also be able to get these permissions with custom roles or other predefined roles.

Available metrics

Depending on your use case, the following metrics are available for monitoring your VMs and Slurm clusters:

To monitor the health, performance, and network performance of the GPUs attached to your VMs, see Infrastructure metrics.
To monitor the efficiency of the GPUs in your ML workloads, see ML workload metrics.
To monitor suspected straggler VMs in ML workloads with slow performance, see Straggler detection metrics.

To learn how to view these metrics, see Visualize metrics in this document.

Infrastructure metrics

To monitor the health, performance, and network performance of your VMs with attached GPUs, use one or more of the following metrics:

To monitor the health of your VMs, use the following metrics:

Name	Metric type	Description
NVSwitch Status	`instance/gpu/nvswitch_status`	Whether an NVLink Switch on an NVIDIA GPU attached to a VM is encountering issues.
VM Infra Health	`instance/gpu/infra_health`	The health of the cluster, block, sub-block, and host on which your GPU VMs are running. If this metric shows that a VM's infrastructure is unhealthy, then the metric also outputs the reason.

To monitor the performance of your GPUs, use the following metrics:

Name	Metric type	Description
GPU Power Consumption	`instance/gpu/power_consumption`	The power in watts consumed on individual GPUs on the host formatted as a `double`. For VMs with multiple GPUs attached, the metric provides the power consumption separately for each GPU on the host.
SM Utilization	`instance/gpu/sm_utilization`	A non-zero value indicates that the streaming multiprocessors (SMs) on your GPUs are actively being used.
GPU Temperature	`instance/gpu/temperature`	The temperature in degrees Celsius of individual GPUs on the host formatted as a `double`. For VMs with multiple GPUs attached, the metric provides the temperature separately for each GPU on the host.
GPU Thermal Margin	`instance/gpu/tlimit`	The thermal headroom in degrees Celsius that individual GPUs have before they need to slow down due to high temperature. The value for this metric is formatted as a `double`. For VMs with multiple GPUs attached, the metric provides the thermal headroom separately for each GPU on the host.

To monitor the network performance across blocks and sub-blocks, use the following metrics:

Name	Metric type	Description
Network Traffic at Inter-Block	`instance/gpu/network/inter_block_tx`	The number of bytes of network traffic among blocks.
Network Traffic at Inter-Subblock	`instance/gpu/network/inter_subblock_tx`	The number of bytes of network traffic among sub-blocks.
Network Traffic at Intra-Subblock	`instance/gpu/network/intra_subblock_tx`	The number of bytes of network traffic within a single sub-block.

To monitor the network performance of your GPUs, use the following metrics:

Name	Metric type	Description
Link Carrier Changes	`instance/gpu/link_carrier_changes`	How often the network link carrier changes in a minute.
Network RTT	`instance/gpu/network_rtt`	The round-trip time, measured in microseconds, for network data to travel between a source and destination.
Throughput Rx Bytes	`instance/gpu/throughput_rx_bytes`	The number of bytes received from network traffic.
Throughput TX Bytes	`instance/gpu/throughput_tx_bytes`	The number of bytes transmitted as network traffic.

For an overview of available metrics in Compute Engine, see Google Cloud metrics.
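
When you query these metrics outside of a prebuilt dashboard, you typically select them with a Cloud Monitoring filter. The following is a minimal sketch of how such a filter string might be assembled. It assumes that the short metric types in the preceding tables are relative to the `compute.googleapis.com/` prefix and that the metrics are written against the `gce_instance` monitored resource. Neither assumption is stated in this document, so verify both before relying on them. The helper name `monitoring_filter` is hypothetical.

```python
# Assumption: the short metric types in the tables are relative to this prefix.
METRIC_PREFIX = "compute.googleapis.com/"


def monitoring_filter(short_metric_type, instance_id=None,
                      resource_type="gce_instance"):
    """Build a Cloud Monitoring filter string for one metric type.

    resource_type defaults to "gce_instance", which is an assumption.
    """
    clauses = [
        f'metric.type = "{METRIC_PREFIX}{short_metric_type}"',
        f'resource.type = "{resource_type}"',
    ]
    if instance_id is not None:
        # Narrow the filter to a single VM by its instance ID.
        clauses.append(f'resource.labels.instance_id = "{instance_id}"')
    return " AND ".join(clauses)


print(monitoring_filter("instance/gpu/temperature"))
```

Pass a VM ID as `instance_id` to restrict the filter to one VM, for example when checking the thermal margin of a single host.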

ML workload metrics

To monitor the productivity—specifically, the goodput—of your ML workloads, use the following metrics:

Name	Metric type	Description
Productive time	`workload/goodput_time`	The time, in seconds, the workload spends on goodput activities. These activities are core, useful tasks, such as a forward or backward pass during model training.
Non-productive time	`workload/badput_time`	The time, in seconds, the workload spends on badput activities. These activities are overhead tasks, such as loading or preprocessing data for training.

For an overview of available metrics in Compute Engine, see Google Cloud metrics.
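
The two ML workload metrics can be combined into a single goodput ratio: productive time divided by the sum of productive and non-productive time. The following sketch illustrates the arithmetic only. The function name and the sample values are hypothetical, and this document doesn't prescribe this exact calculation.

```python
def goodput_ratio(goodput_seconds, badput_seconds):
    """Return the fraction of measured time spent on goodput activities.

    Returns None when no time has been recorded, to avoid dividing by zero.
    """
    total = goodput_seconds + badput_seconds
    if total <= 0:
        return None
    return goodput_seconds / total


# Hypothetical sample: 5,400 s of productive time and 600 s of overhead.
ratio = goodput_ratio(5400.0, 600.0)
print(f"Goodput ratio: {ratio:.1%}")
```

A falling ratio over time suggests that the workload is spending more time on overhead tasks, such as loading or preprocessing data.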

Straggler detection metrics

Straggler detection metrics help you notice and pinpoint suspected stragglers. Stragglers are single-point, non-crashing failures that eventually slow down the entire workload.

To monitor straggler detection for your VMs, use the following metric:

Name	Metric type	Description
Suspected Stragglers	`instance/gpu/straggler_status`	Whether a VM is suspected as a straggler that is affecting the performance of the workload. We recommend that you act on suspected stragglers only when other metrics indicate that the workload is experiencing issues.

You can also view straggler detection metrics in log entries. For example, you can use the following queries:

Logs with suspected stragglers for specific VMs. Use this query to check if there are any suspected stragglers for a specific workload in your project.

    logName=~ "/logs/compute.googleapis.com%2Fworkload_diagnostic" AND jsonPayload.suspectedStragglersDetection.numNodes > 0 AND jsonPayload.suspectedStragglersDetection.nodes.instanceId="INSTANCE_ID"

Replace INSTANCE_ID with the ID of a VM. For each additional VM that you want to specify, add the following condition to the query:

    OR jsonPayload.suspectedStragglersDetection.nodes.instanceId="INSTANCE_ID"

All logs from straggler detection for your project. Use this query to verify if the straggler detection service is running when no suspected stragglers are detected. (Due to the limitations, you can't filter the logs without suspected stragglers by specific VMs.)


    logName=~ "/logs/compute.googleapis.com%2Fworkload_diagnostic"
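
If you need to filter by many VMs, assembling the query by hand can be error-prone. The following sketch builds both of the preceding queries from a list of VM IDs. The function name `straggler_log_query` is hypothetical. Unlike the query shown earlier, the sketch wraps the `OR` conditions in parentheses so that the grouping is explicit.

```python
LOG_NAME_CLAUSE = 'logName=~ "/logs/compute.googleapis.com%2Fworkload_diagnostic"'
DETECTION = "jsonPayload.suspectedStragglersDetection"


def straggler_log_query(instance_ids=()):
    """Build a Logs Explorer query for straggler detection logs.

    With no IDs, return the query for all straggler detection logs.
    With one or more IDs, return the query for logs that list any of
    those VMs as suspected stragglers.
    """
    if not instance_ids:
        return LOG_NAME_CLAUSE
    id_clauses = [f'{DETECTION}.nodes.instanceId="{i}"' for i in instance_ids]
    return (
        f"{LOG_NAME_CLAUSE} AND {DETECTION}.numNodes > 0 "
        f"AND ({' OR '.join(id_clauses)})"
    )


# INSTANCE_ID_1 and INSTANCE_ID_2 are placeholders for real VM IDs.
print(straggler_log_query(["INSTANCE_ID_1", "INSTANCE_ID_2"]))
```

Paste the printed string into the Query pane of the Logs Explorer.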

Straggler detection metrics are particularly helpful for large-scale ML workloads for the following reasons:

Large-scale ML workloads are very susceptible to stragglers. These workloads use synchronous, massively distributed computing: many highly interdependent components run simultaneously. This architecture makes them vulnerable to single-point failures such as stragglers.
Noticing and pinpointing stragglers in large-scale ML workloads is very difficult. For reference, consider that there are two types of single-point failures:
- Stopping failures: Failures that cause the entire system to halt, such as host errors and maintenance events. They are relatively straightforward to detect and resolve.
- Slow failures: Failures that cause severe performance degradation without crashes. They are very difficult to pinpoint and debug.
Due to their slow-failure nature, stragglers are inherently difficult to notice and pinpoint, especially in large-scale synchronous workloads.

View metrics

To view metrics for your VMs and Slurm clusters, use Monitoring dashboards as follows:

To view infrastructure metrics and straggler detection metrics, you can use the following methods:
- For a quick overview or to customize an existing dashboard, use prebuilt dashboards.
- For specific monitoring needs, create custom dashboards.
To view ML workload metrics, see the documentation for how to set up monitoring for your workload.
To view logs from straggler detection, view straggler detection logs.

Use prebuilt dashboards

You can use Monitoring dashboards that are prebuilt for Cluster Director to view metrics for your VMs and Slurm clusters. You can also create a copy of a prebuilt dashboard and modify it to fit your needs.

To use a prebuilt dashboard for Cluster Director, do the following:

In the Google Cloud console, go to the Dashboards page:
Go to Dashboards

If you use the search bar to find this page, then select the result whose subheading is Monitoring.
In the Name column, click the name of one of the following dashboards based on which metrics you want to view:
- To monitor VM health, GPU performance, and straggler detection, use the Cluster Director Health Monitoring dashboard.
  
  For more information about how to use these metrics to identify and analyze issues, also use the GCE Interactive Playbook - Cluster Director Health Monitoring playbook dashboard.
- To monitor network transmission efficiency, use the Cluster Director Transmission Efficiency dashboard.
- To monitor network efficiency among blocks and sub-blocks, use the Cluster Director Block Network dashboard.
  
  For more information about how to use these metrics to identify and analyze issues, also use the GCE Interactive Playbook - Cluster Director Block Network playbook dashboard.
The details page of your chosen dashboard opens. You can use the time-range selector in the toolbar to change the time range of the data.
Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.

Create custom dashboards

To create a custom Monitoring dashboard, do the following:

Choose the metrics to monitor. If you haven't already, then see Available metrics in this document.
Create the dashboard. For instructions, see Create and manage custom dashboards.

View straggler detection logs

To view straggler detection logs using the Logs Explorer, complete the following steps:

In the Google Cloud console, go to the Logs Explorer page:
Go to Logs Explorer

If you use the search bar to find this page, then select the result whose subheading is Logging.

The page queries all logs in your project by default. Click Stop query.
Use the time-range selector in the toolbar to select the time range that you want to analyze.
In the Query pane, enter a query for straggler detection logs.
Click Run Query.

The following is an example of a straggler detection log entry.

  {
    ...
    "jsonPayload": {
      ...
      "@type": "type.googleapis.com/ml.aitelemetry.performancedebugging.output.NetworkStragglersOutput",
      "suspectedStragglersDetection": {
        "numNodes": 4,
        "nodes": [
          {
            "latencyMs": 9,
            "instanceId": "INSTANCE_ID_1"
          },
          {
            "latencyMs": 9,
            "instanceId": "INSTANCE_ID_2"
          },
          {
            "instanceId": "INSTANCE_ID_3",
            "latencyMs": 4
          },
          {
            "instanceId": "INSTANCE_ID_4",
            "latencyMs": 0
          }
        ],
        "message": "Suspected stragglers detected."
      }
    },
    "resource": {
      "type": "project",
      "labels": {
        "project_id": "PROJECT_NUMBER"
      }
    },
    ...
    "severity": "INFO",
    "logName": "projects/PROJECT_ID/logs/compute.googleapis.com%2Fworkload_diagnostic",
    ...
  }

The log entry includes the following fields:

numNodes: The number of suspected straggler VMs that are detected in the project. In the example, four (4) suspected straggler VMs have been detected.
instanceId: The ID of a VM that was detected as a suspected straggler.
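
If you export log entries as JSON, you can extract the suspected stragglers programmatically. The following sketch reads the fields shown in the example entry and lists the VM IDs. Sorting by `latencyMs` in descending order is an assumption made for illustration, because this document doesn't define how to interpret that field. The function name and the shortened sample entry are hypothetical.

```python
import json

# Shortened sample that mirrors the structure of the example log entry.
SAMPLE_ENTRY = """
{
  "jsonPayload": {
    "suspectedStragglersDetection": {
      "numNodes": 2,
      "nodes": [
        {"instanceId": "INSTANCE_ID_3", "latencyMs": 4},
        {"latencyMs": 9, "instanceId": "INSTANCE_ID_1"}
      ],
      "message": "Suspected stragglers detected."
    }
  }
}
"""


def suspected_stragglers(entry):
    """Return (instanceId, latencyMs) pairs from one parsed log entry."""
    detection = entry.get("jsonPayload", {}).get(
        "suspectedStragglersDetection", {})
    nodes = detection.get("nodes", [])
    # Assumption: list the highest reported latency first.
    ranked = sorted(nodes, key=lambda n: n.get("latencyMs", 0), reverse=True)
    return [(n["instanceId"], n.get("latencyMs"))
            for n in ranked if "instanceId" in n]


for instance_id, latency_ms in suspected_stragglers(json.loads(SAMPLE_ENTRY)):
    print(instance_id, latency_ms)
```

Entries without a `suspectedStragglersDetection` payload produce an empty list, so the same function can process all straggler detection logs for a project.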

For instructions on how to use and act on straggler detection logs, see Troubleshoot slow performance.

Troubleshoot slow performance

For instructions on how to use metrics to troubleshoot workloads with slow performance, see Troubleshoot slow performance.