Monitor VMs and Slurm clusters

This document explains how to use the monitoring service that is available from the Cluster Director suite. For more information about Cluster Director, see Cluster Director overview.

By using the available metrics in this document, you can create or use prebuilt Cloud Monitoring dashboards to monitor the following:

  • VM health

  • GPU performance

  • Network transmission efficiency

  • Network efficiency among blocks and sub-blocks

  • Machine learning (ML) workload efficiency

  • Straggler detection

Monitoring this data helps you identify and troubleshoot performance bottlenecks, as well as ensure the health and stability of your workloads and infrastructure.

Before you begin

Before monitoring your workload, if you haven't already done so, set up authentication for your environment.

When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

Limitations

  • The metrics in this document are only supported for Cluster Director workloads that run on VMs that meet all the following criteria:

    • The VMs must have been created by using Compute Engine or a Slurm cluster of Compute Engine VMs.
    • The VMs must use the A4 or A3 Ultra machine series.
      • However, straggler detection also supports VMs that use the A3 Mega machine series.
    • The VMs must use the future reservations consumption option.
  • To monitor ML workload metrics, you must set up monitoring for your workload.

  • Straggler detection metrics have the following additional limitations:

    • For supported machine series other than A3 Mega, straggler detection only supports VMs that enable the Collective Communication Analyzer (CoMMA) library to export NCCL telemetry to Google Cloud services. For more information, see CoMMA overview.

    • Straggler detection typically takes up to 10 minutes to report a straggler.

    • Unlike the other metrics in this document, you can't filter straggler detection metrics for your projects by cluster, block, sub-block, or VM. However, you can filter queries for straggler detection logs by the ID of one or more VMs that are suspected stragglers.

Required roles

To get the permissions that you need to monitor metrics for Cluster Director workloads, ask your administrator to grant you the following IAM roles:

  • To view metrics in Cloud Monitoring: Monitoring Editor (roles/monitoring.editor) on the project
  • To view straggler detection logs in Logging: Logs Viewer (roles/logging.viewer) on the project

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to monitor metrics for Cluster Director workloads. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to monitor metrics for Cluster Director workloads:

  • To view dashboards: monitoring.dashboards.get on the project
  • To create dashboards: monitoring.dashboards.create on the project
  • To view log entries: logging.logEntries.list on the project

You might also be able to get these permissions with custom roles or other predefined roles.

Available metrics

Depending on your use case, the following types of metrics are available for monitoring your VMs and Slurm clusters: infrastructure metrics, ML workload metrics, and straggler detection metrics.

To learn how to view these metrics, see View metrics in this document.

Infrastructure metrics

To monitor the health, performance, and network performance of your VMs with attached GPUs, use one or more of the following metrics:

  • To monitor the health of your VMs, use the following metrics:

    | Name | Metric type | Description |
    | --- | --- | --- |
    | NVSwitch Status | instance/gpu/nvswitch_status | Whether an NVLink Switch on an NVIDIA GPU attached to a VM is encountering issues. |
    | VM Infra Health | instance/gpu/infra_health | The health of the cluster, block, sub-block, and host on which your GPU VMs are running. If this metric shows that a VM's infrastructure is unhealthy, then the metric also outputs the reason. |
  • To monitor the performance of your GPUs, use the following metrics:

    | Name | Metric type | Description |
    | --- | --- | --- |
    | GPU Power Consumption | instance/gpu/power_consumption | The power, in watts, consumed by individual GPUs on the host, formatted as a double. For VMs with multiple GPUs attached, the metric provides the power consumption separately for each GPU on the host. |
    | SM Utilization | instance/gpu/sm_utilization | A non-zero value indicates that the streaming multiprocessors (SMs) on your GPUs are actively being used. |
    | GPU Temperature | instance/gpu/temperature | The temperature, in degrees Celsius, of individual GPUs on the host, formatted as a double. For VMs with multiple GPUs attached, the metric provides the temperature separately for each GPU on the host. |
    | GPU Thermal Margin | instance/gpu/tlimit | The thermal headroom, in degrees Celsius, that individual GPUs have before they need to slow down due to high temperature. The value for this metric is formatted as a double. For VMs with multiple GPUs attached, the metric provides the thermal headroom separately for each GPU on the host. |
  • To monitor the network performance across blocks and sub-blocks, use the following metrics:

    | Name | Metric type | Description |
    | --- | --- | --- |
    | Network Traffic at Inter-Block | instance/gpu/network/inter_block_tx | The number of bytes of network traffic among blocks. |
    | Network Traffic at Inter-Subblock | instance/gpu/network/inter_subblock_tx | The number of bytes of network traffic among sub-blocks. |
    | Network Traffic at Intra-Subblock | instance/gpu/network/intra_subblock_tx | The number of bytes of network traffic within a single sub-block. |
  • To monitor the network performance of your GPUs, use the following metrics:

    | Name | Metric type | Description |
    | --- | --- | --- |
    | Link Carrier Changes | instance/gpu/link_carrier_changes | The number of times that the network link carrier changes per minute. |
    | Network RTT | instance/gpu/network_rtt | The round-trip time, measured in microseconds, for network data to travel between a source and destination. |
    | Throughput Rx Bytes | instance/gpu/throughput_rx_bytes | The number of bytes of network traffic received. |
    | Throughput Tx Bytes | instance/gpu/throughput_tx_bytes | The number of bytes of network traffic transmitted. |

For an overview of available metrics in Compute Engine, see Google Cloud metrics.
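
You can also read these metrics programmatically by using the Cloud Monitoring API. The following sketch reads recent values of the GPU Temperature metric with the Python client library (google-cloud-monitoring). The compute.googleapis.com/ metric prefix and the gce_instance resource labels are assumptions, not confirmed by this document; check the full metric type for your project in Metrics Explorer before relying on this example.

    # Minimal sketch: read recent GPU temperature samples with the Cloud Monitoring API.
    # Requires the google-cloud-monitoring package.
    # Assumptions (verify in Metrics Explorer): the metric's full type uses the
    # compute.googleapis.com/ prefix, and the time series carry gce_instance labels.
    import time

    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"  # Replace with your project ID.

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
    )

    results = client.list_time_series(
        request={
            "name": f"projects/{PROJECT_ID}",
            # Assumed full metric type for the GPU Temperature metric listed above.
            "filter": 'metric.type = "compute.googleapis.com/instance/gpu/temperature"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    # Each time series corresponds to one GPU on one VM; points are newest first,
    # and the documented value format is a double.
    for series in results:
        instance_id = series.resource.labels.get("instance_id", "unknown")
        latest = series.points[0].value.double_value if series.points else None
        print(f"instance={instance_id} gpu_labels={dict(series.metric.labels)} temperature_c={latest}")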

ML workload metrics

To monitor the productivity—specifically, the goodput—of your ML workloads, use the following metrics:

| Name | Metric type | Description |
| --- | --- | --- |
| Productive time | workload/goodput_time | The time, in seconds, the workload spends on goodput activities. These activities are core, useful tasks, such as a forward or backward pass during model training. |
| Non-productive time | workload/badput_time | The time, in seconds, the workload spends on badput activities. These activities are overhead tasks, such as loading or preprocessing data for training. |

For an overview of available metrics in Compute Engine, see Google Cloud metrics.
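
As a worked example of how these two metrics relate, the goodput fraction of a workload over a time window is goodput_time / (goodput_time + badput_time). The following sketch computes that fraction with the Cloud Monitoring API; the compute.googleapis.com/ metric prefix, the value type, and the sum-based aggregation are assumptions, so check the metric type and kind in Metrics Explorer and adjust the aggregation if needed.

    # Minimal sketch: compute a goodput fraction from the two workload metrics above.
    # Assumptions (verify in Metrics Explorer): the metric types use the
    # compute.googleapis.com/ prefix and ALIGN_SUM/REDUCE_SUM suits the metric kind.
    import time

    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"  # Replace with your project ID.

    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": now - 6 * 3600}, "end_time": {"seconds": now}}
    )
    aggregation = monitoring_v3.Aggregation(
        alignment_period={"seconds": 3600},
        per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
        cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
    )


    def total_seconds(metric_type: str) -> float:
        """Sums every point of the given metric over the interval, project-wide."""
        series_list = client.list_time_series(
            request={
                "name": f"projects/{PROJECT_ID}",
                "filter": f'metric.type = "{metric_type}"',
                "interval": interval,
                "aggregation": aggregation,
            }
        )
        # The metric might be reported as a double or an int64; handle both.
        return sum(
            point.value.double_value or point.value.int64_value
            for series in series_list
            for point in series.points
        )


    goodput = total_seconds("compute.googleapis.com/workload/goodput_time")
    badput = total_seconds("compute.googleapis.com/workload/badput_time")
    if goodput + badput > 0:
        print(f"Goodput fraction: {goodput / (goodput + badput):.2%}")
    else:
        print("No goodput or badput samples found in the selected interval.")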

Straggler detection metrics

Straggler detection metrics help you notice and pinpoint suspected stragglers. Stragglers are single-point, non-crashing failures that eventually slow down the entire workload.

To monitor straggler detection for your VMs, use the following metric:

| Name | Metric type | Description |
| --- | --- | --- |
| Suspected Stragglers | instance/gpu/straggler_status | Whether a VM is suspected as a straggler that is affecting the performance of the workload. We recommend that you act on suspected stragglers only when other metrics indicate that the workload is experiencing issues. |
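
If you want to be notified when this metric flags a VM, one option is to create an alerting policy on it by using the Cloud Monitoring API, as in the following sketch. The compute.googleapis.com/ metric prefix, the gce_instance resource type, and the greater-than-zero threshold are assumptions; verify the metric details in Metrics Explorer. As recommended above, treat such an alert as a prompt to check the other metrics rather than as confirmation of a straggler.

    # Minimal sketch: alert when a Suspected Stragglers time series stays above zero.
    # Assumptions (verify in Metrics Explorer): the compute.googleapis.com/ prefix,
    # the gce_instance resource type, and that a non-zero value means "suspected".
    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"  # Replace with your project ID.

    client = monitoring_v3.AlertPolicyServiceClient()

    policy = monitoring_v3.AlertPolicy(
        display_name="Suspected straggler VMs",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[
            monitoring_v3.AlertPolicy.Condition(
                display_name="straggler_status above zero for 10 minutes",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    filter=(
                        'metric.type = "compute.googleapis.com/instance/gpu/straggler_status" '
                        'AND resource.type = "gce_instance"'
                    ),
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=0,
                    # Matches the typical reporting delay of up to 10 minutes.
                    duration={"seconds": 600},
                    aggregations=[
                        monitoring_v3.Aggregation(
                            alignment_period={"seconds": 300},
                            per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MAX,
                        )
                    ],
                ),
            )
        ],
    )

    created = client.create_alert_policy(
        name=f"projects/{PROJECT_ID}", alert_policy=policy
    )
    print(f"Created alert policy: {created.name}")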

You can also view straggler detection metrics in log entries. For example, you can use the following queries:

  • Logs with suspected stragglers for specific VMs. Use this query to check whether there are any suspected stragglers for a specific workload in your project:

        logName=~ "/logs/compute.googleapis.com%2Fworkload_diagnostic" AND jsonPayload.suspectedStragglersDetection.numNodes > 0 AND jsonPayload.suspectedStragglersDetection.nodes.instanceId="INSTANCE_ID"

    Replace INSTANCE_ID with the ID of a VM. For each additional VM that you want to specify, add the following condition to the query:

        OR jsonPayload.suspectedStragglersDetection.nodes.instanceId="INSTANCE_ID"

  • All logs from straggler detection for your project. Use this query to verify that the straggler detection service is running when no suspected stragglers are detected. (Because of the limitations described earlier, logs without suspected stragglers can't be filtered by specific VMs.)

        logName=~ "/logs/compute.googleapis.com%2Fworkload_diagnostic"

Straggler detection metrics are particularly helpful for large-scale ML workloads for the following reasons:

  • Large-scale ML workloads are very susceptible to stragglers. These workloads use synchronous, massively distributed computing; in other words, they have many highly interdependent components that run simultaneously. This architecture makes them especially vulnerable to single-point failures like stragglers.

  • Noticing and pinpointing stragglers in large-scale ML workloads is very difficult. For reference, consider that there are two types of single-point failures:

    • Stopping failures: failures that cause the entire system to halt, such as host errors and maintenance events. They are relatively straightforward to detect and resolve.

    • Slow failures: failures that cause severe performance degradation without crashing. They are very difficult to pinpoint and debug.

    Due to their slow-failure nature, stragglers are inherently difficult to notice and pinpoint, especially in large-scale synchronous workloads.

View metrics

To view metrics for your VMs and Slurm clusters, use Monitoring dashboards as follows:

Use prebuilt dashboards

You can use Monitoring dashboards that are prebuilt for Cluster Director to view metrics for your VMs and Slurm clusters. You can also create a copy of a prebuilt dashboard and modify it to fit your needs.

To use a prebuilt dashboard for Cluster Director, do the following:

  1. In the Google Cloud console, go to the Dashboards page:

    Go to Dashboards

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  2. In the Name column, click the name of one of the following dashboards based on which metrics you want to view:

    • To monitor VM health, GPU performance, and straggler detection, use the Cluster Director Health Monitoring dashboard.

      For more information about how to use these metrics to identify and analyze issues, also use the GCE Interactive Playbook - Cluster Director Health Monitoring playbook dashboard.

    • To monitor network transmission efficiency, use the Cluster Director Transmission Efficiency dashboard.

    • To monitor network efficiency among blocks and sub-blocks, use the Cluster Director Block Network dashboard.

      For more information about how to use these metrics to identify and analyze issues, also use the GCE Interactive Playbook - Cluster Director Block Network playbook dashboard.

    The details page of your chosen dashboard opens. You can use the time-range selector in the toolbar to change the time range of the data.

  3. Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.

Create custom dashboards

To create a custom Monitoring dashboard, do the following:

  1. Choose the metrics that you want to monitor. If you haven't already done so, see Available metrics in this document.

  2. Create your dashboard. For instructions, see Create and manage custom dashboards. Alternatively, create the dashboard programmatically as sketched after these steps.
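
If you manage dashboards as code, the following sketch creates a one-chart custom dashboard by using the Cloud Monitoring dashboards API (the google-cloud-monitoring-dashboards Python client). The one-chart layout is only an illustration, and the compute.googleapis.com/ metric prefix is an assumption; verify the full metric type in Metrics Explorer.

    # Minimal sketch: create a one-chart custom dashboard with the Cloud Monitoring
    # dashboards API. The metric type below assumes the compute.googleapis.com/
    # prefix; verify the full type in Metrics Explorer before using it.
    from google.cloud import monitoring_dashboard_v1

    PROJECT_ID = "my-project"  # Replace with your project ID.

    client = monitoring_dashboard_v1.DashboardsServiceClient()

    dashboard = monitoring_dashboard_v1.Dashboard(
        display_name="Cluster Director custom: GPU temperature",
        grid_layout={
            "columns": 1,
            "widgets": [
                {
                    "title": "GPU temperature (mean)",
                    "xy_chart": {
                        "data_sets": [
                            {
                                "time_series_query": {
                                    "time_series_filter": {
                                        # Assumed full metric type for GPU Temperature.
                                        "filter": 'metric.type = "compute.googleapis.com/instance/gpu/temperature"',
                                        "aggregation": {
                                            "alignment_period": {"seconds": 300},
                                            "per_series_aligner": monitoring_dashboard_v1.Aggregation.Aligner.ALIGN_MEAN,
                                        },
                                    }
                                }
                            }
                        ]
                    },
                }
            ],
        },
    )

    created = client.create_dashboard(
        request={"parent": f"projects/{PROJECT_ID}", "dashboard": dashboard}
    )
    print(f"Created dashboard: {created.name}")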

View straggler detection logs

To view straggler detection logs using the Logs Explorer, complete the following steps:

  1. In the Google Cloud console, go to the Logs Explorer page:

    Go to Logs Explorer

    If you use the search bar to find this page, then select the result whose subheading is Logging.

    The page queries all logs in your project by default. Click Stop query.

  2. Use the time-range selector in the toolbar to select the time range that you want to analyze. Straggler detection typically takes up to 10 minutes to report a straggler.

  3. In the Query pane, enter one of the straggler detection queries from Straggler detection metrics in this document.

  4. Click Run Query.

The following is an example of a straggler detection log entry.

  {
    ...
    "jsonPayload": {
      ...
      "@type": "type.googleapis.com/ml.aitelemetry.performancedebugging.output.NetworkStragglersOutput",
      "suspectedStragglersDetection": {
        "numNodes": 4,
        "nodes": [
          {
            "latencyMs": 9,
            "instanceId": "INSTANCE_ID_1"
          },
          {
            "latencyMs": 9,
            "instanceId": "INSTANCE_ID_2"
          },
          {
            "instanceId": "INSTANCE_ID_3",
            "latencyMs": 4
          },
          {
            "instanceId": "INSTANCE_ID_4",
            "latencyMs": 0
          }
        ],
        "message": "Suspected stragglers detected."
      }
    },
    "resource": {
      "type": "project",
      "labels": {
        "project_id": "PROJECT_NUMBER"
      }
    },
    ...
    "severity": "INFO",
    "logName": "projects/PROJECT_ID/logs/compute.googleapis.com%2Fworkload_diagnostic",
    ...
  }
  

The log entry includes the following fields:

  • numNodes: The number of suspected straggler VMs that are detected in the project. In the example, four suspected straggler VMs have been detected.
  • instanceId: The ID of a VM that was detected as a suspected straggler.
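
If you want to consume these log entries programmatically instead of using the Logs Explorer, you can run the same queries with the Cloud Logging client library for Python, as in the following sketch. It lists recent entries that report suspected stragglers and prints the instanceId and latencyMs fields described above; the one-hour lookback and the project ID placeholder are illustrative choices.

    # Minimal sketch: list recent straggler detection log entries with the Cloud
    # Logging client library (google-cloud-logging) and print the fields above.
    from datetime import datetime, timedelta, timezone

    from google.cloud import logging

    PROJECT_ID = "my-project"  # Replace with your project ID.

    client = logging.Client(project=PROJECT_ID)

    # Same filter as the query for suspected stragglers, limited to the last hour.
    cutoff = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
    log_filter = (
        'logName=~"/logs/compute.googleapis.com%2Fworkload_diagnostic" '
        "AND jsonPayload.suspectedStragglersDetection.numNodes > 0 "
        f'AND timestamp >= "{cutoff}"'
    )

    for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
        detection = entry.payload.get("suspectedStragglersDetection", {})
        print(f"{entry.timestamp}: {detection.get('message')}")
        for node in detection.get("nodes", []):
            print(f"  instance={node.get('instanceId')} latency_ms={node.get('latencyMs')}")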

For instructions on how to use and act on straggler detection logs, see Troubleshoot slow performance.

Troubleshoot slow performance

For instructions on how to use metrics to troubleshoot workloads with slow performance, see Troubleshoot slow performance.

What's next