Monitor VMs and clusters

This document explains how to monitor A3 Ultra virtual machine (VM) instances and clusters that are deployed on Hypercompute Cluster. For more information about Hypercompute Cluster, see Hypercompute Cluster.

By using the available GPU metrics, you can create or use prebuilt Monitoring dashboards to monitor the following:

  • VM and GPU performance

  • Network transmission efficiency

  • Network efficiency among blocks and sub-blocks

Monitoring this data helps you identify and troubleshoot performance bottlenecks, as well as ensure the health and stability of your VMs and clusters. To learn more about Monitoring dashboards, see Dashboards overview.

Before you begin

  • When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

Required roles

To get the permissions that you need to view and create Monitoring dashboards, ask your administrator to grant you the Monitoring Editor (roles/monitoring.editor) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

This predefined role contains the permissions required to view and create Monitoring dashboards. The exact permissions that are required are listed in the following Required permissions section:

Required permissions

The following permissions are required to view and create Monitoring dashboards:

  • To view dashboards: monitoring.dashboards.get on the project
  • To create dashboards: monitoring.dashboards.create on the project

You might also be able to get these permissions with custom roles or other predefined roles.
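As a quick sanity check before you start, the following minimal sketch compares a list of permissions that you already hold against the two permissions named above. The function name and the example input are hypothetical; how you obtain the list of granted permissions for your principal is outside the scope of this sketch.

```python
# Permissions named in this document for viewing and creating dashboards.
REQUIRED_PERMISSIONS = {
    "view": "monitoring.dashboards.get",
    "create": "monitoring.dashboards.create",
}


def missing_dashboard_permissions(granted):
    """Return the required permissions that are absent from `granted`, sorted."""
    granted_set = set(granted)
    return sorted(p for p in REQUIRED_PERMISSIONS.values() if p not in granted_set)


# Hypothetical input: permissions collected for your own principal.
example_granted = ["monitoring.dashboards.get", "monitoring.timeSeries.list"]
print(missing_dashboard_permissions(example_granted))
```

An empty result means that both permissions are present. Any names that are returned are the ones to request from your administrator.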

Overview

To monitor your VMs and clusters, complete the following steps:

  1. Review available metrics: view the GPU metric data that you can access and use for monitoring. Reviewing the metrics helps you understand system behavior and, optionally, plan the custom dashboards that you create in the next step.

    For more information, see Available metrics in this document.

  2. Visualize metrics: based on your monitoring needs, visualize the available metrics using prebuilt dashboards or custom dashboards.

    For instructions, see Visualize metrics in this document.

Available metrics

This section lists the metrics that are available for monitoring the health and performance of GPUs. These metrics are available in prebuilt dashboards, and can also be used when creating a custom dashboard.

  • To monitor the health of your GPUs, use the following metrics:

    Name | Metric type | Description
    NVSwitch Status | instance/gpu/nvswitch_status | Whether an NVLink Switch on an NVIDIA GPU attached to a VM is encountering issues.
    VM Infra Health | instance/gpu/infra_health | The health of the cluster, block, sub-block, and host on which your GPU VMs are running. If this metric shows that a VM's infrastructure is unhealthy, then the metric also outputs the reason.
  • To monitor the performance of your GPUs, use the following metrics:

    Name | Metric type | Description
    GPU Power Consumption | instance/gpu/power_consumption | The power, in watts, consumed by individual GPUs on the host, reported as a double value. For VMs with multiple GPUs attached, the metric provides the power consumption separately for each GPU on the host.
    SM Utilization | instance/gpu/sm_utilization | A non-zero value indicates that the streaming multiprocessors (SMs) on your GPUs are actively being used.
    GPU Temperature | instance/gpu/temperature | The temperature, in degrees Celsius, of individual GPUs on the host, reported as a double value. For VMs with multiple GPUs attached, the metric provides the temperature separately for each GPU on the host.
    GPU Thermal Margin | instance/gpu/tlimit | The thermal headroom, in degrees Celsius, that individual GPUs have before they need to slow down due to high temperature, reported as a double value. For VMs with multiple GPUs attached, the metric provides the thermal headroom separately for each GPU on the host.
  • To monitor the network performance of your GPUs, use the following metrics:

    Name | Metric type | Description
    Network Traffic at Inter-Block | instance/gpu/network/inter_block_tx | The number of bytes of network traffic among blocks.
    Network Traffic at Inter-Subblock | instance/gpu/network/inter_subblock_tx | The number of bytes of network traffic among sub-blocks.
    Network Traffic at Intra-Subblock | instance/gpu/network/intra_subblock_tx | The number of bytes of network traffic within a single sub-block.

For an overview of available metrics in Compute Engine, see Google Cloud metrics.
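If you want to query these metrics outside of a dashboard, you typically need a Monitoring filter string for each metric type. The following minimal sketch builds such filters from the metric types listed in this section. It rests on two assumptions that this document doesn't state: that the metric types carry the compute.googleapis.com/ service prefix, and that they are written against the gce_instance monitored resource type. Verify both against the Google Cloud metrics list before relying on them. The helper name metric_filter is hypothetical.

```python
# Assumption: Compute Engine metric types carry this service prefix.
METRIC_PREFIX = "compute.googleapis.com/"

# Metric types copied from the tables in this section.
GPU_METRICS = {
    "nvswitch_status": "instance/gpu/nvswitch_status",
    "infra_health": "instance/gpu/infra_health",
    "power_consumption": "instance/gpu/power_consumption",
    "sm_utilization": "instance/gpu/sm_utilization",
    "temperature": "instance/gpu/temperature",
    "tlimit": "instance/gpu/tlimit",
    "inter_block_tx": "instance/gpu/network/inter_block_tx",
    "inter_subblock_tx": "instance/gpu/network/inter_subblock_tx",
    "intra_subblock_tx": "instance/gpu/network/intra_subblock_tx",
}


def metric_filter(short_name, resource_type="gce_instance"):
    """Build a Monitoring filter string for one of the GPU metrics.

    `resource_type` defaults to "gce_instance", which is an assumption.
    """
    metric_type = METRIC_PREFIX + GPU_METRICS[short_name]
    return f'metric.type = "{metric_type}" AND resource.type = "{resource_type}"'


print(metric_filter("temperature"))
```

Keeping the metric types in one dictionary means that a correction to the prefix or to a metric path only has to be made in one place.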

Visualize metrics

To monitor metrics data for your VMs and clusters with Monitoring dashboards, use one of the following methods:

  • Use prebuilt dashboards

  • Create custom dashboards

If you encounter issues when using a dashboard, see Troubleshoot slow performance or errors in this document.

Use prebuilt dashboards

You can monitor your VMs and clusters using prebuilt Monitoring dashboards for Hypercompute Cluster. You can also create a copy of a prebuilt dashboard and modify it to fit your needs.

To use a prebuilt Monitoring dashboard, do the following:

  1. In the Google Cloud console, go to the Dashboards page:

    Go to Dashboards

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  2. In the Categories pane, click GCP.

  3. In the Name column, click one of the following dashboards based on the metrics that you want to monitor:

    • To monitor VM and GPU performance, click Hypercompute Cluster Health Monitoring.

    • To monitor network transmission efficiency, click Hypercompute Cluster Transmission Efficiency.

    • To monitor network efficiency among blocks and sub-blocks, click Hypercompute Cluster Block Network.

    The details page of your chosen dashboard opens.

  4. Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.

Create custom dashboards

To create a custom Monitoring dashboard, do the following:

  1. Choose the metrics to monitor. If you haven't already, see Available metrics in this document.

  2. Create the dashboard. For instructions, see Create and manage custom dashboards.
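If you prefer to define a custom dashboard as JSON rather than build it in the console, the following minimal sketch assembles a dashboard definition with one line chart per metric. It is an illustration and not a definitive implementation: the field names follow the Cloud Monitoring dashboard JSON format as best understood here (displayName, gridLayout, xyChart, timeSeriesFilter), so check them against the dashboards API reference. The compute.googleapis.com/ prefix, the ALIGN_MEAN aligner, the 60s alignment period, and the function names are assumptions chosen for this example.

```python
import json

# Assumption: Compute Engine metric types carry this service prefix.
METRIC_PREFIX = "compute.googleapis.com/"


def xy_chart_widget(title, metric_path, alignment_period="60s"):
    """Return one line-chart widget that plots a single metric type."""
    return {
        "title": title,
        "xyChart": {
            "dataSets": [
                {
                    "timeSeriesQuery": {
                        "timeSeriesFilter": {
                            "filter": f'metric.type="{METRIC_PREFIX}{metric_path}"',
                            "aggregation": {
                                "alignmentPeriod": alignment_period,
                                "perSeriesAligner": "ALIGN_MEAN",
                            },
                        }
                    }
                }
            ]
        },
    }


def build_dashboard(display_name, charts):
    """Return a dashboard definition with one widget per (title, metric) pair."""
    return {
        "displayName": display_name,
        "gridLayout": {
            "columns": "2",
            "widgets": [xy_chart_widget(title, path) for title, path in charts],
        },
    }


dashboard = build_dashboard(
    "GPU thermal overview (example)",  # hypothetical dashboard name
    [
        ("GPU Temperature", "instance/gpu/temperature"),
        ("GPU Thermal Margin", "instance/gpu/tlimit"),
    ],
)
print(json.dumps(dashboard, indent=2))
```

The printed JSON can serve as a starting point for a dashboard definition. The sketch only builds the definition and makes no API calls.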

Troubleshoot slow performance or errors

If you experience slow performance or errors in your jobs or workloads, then you can troubleshoot them by doing the following:

  1. In the Google Cloud console, go to the Dashboards page:

    Go to Dashboards

    If you use the search bar to find this page, then select the result whose subheading is Monitoring.

  2. In the Categories pane, click GCP.

  3. To learn more about the metrics, do the following:

    • For network efficiency among blocks and sub-blocks, click GCE Interactive Playbook - Hypercompute Cluster Block Network.

    • For VM and GPU performance, and network transmission efficiency, click GCE Interactive Playbook - Hypercompute Cluster Health Monitoring.

  4. Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.
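When you read the playbook charts, it can help to apply the metric descriptions from this document as simple rules. The following minimal sketch does that for two metrics: an SM Utilization value of zero means that the SMs aren't in use, and a small GPU Thermal Margin means that the GPU has little headroom before it needs to slow down. The function name, the 5-degree threshold, and the sample values are hypothetical; choose a threshold that fits your own workloads.

```python
def triage_gpu(sm_utilization, thermal_margin_c, min_margin_c=5.0):
    """Return a list of findings for one GPU sample.

    sm_utilization: latest SM Utilization value (non-zero means SMs are in use).
    thermal_margin_c: latest GPU Thermal Margin value, in degrees Celsius.
    min_margin_c: hypothetical threshold below which headroom is flagged as low.
    """
    findings = []
    if sm_utilization == 0:
        findings.append("idle: SMs are not in use")
    if thermal_margin_c <= min_margin_c:
        findings.append("low thermal headroom: GPU may slow down")
    return findings


# Hypothetical samples: (sm_utilization, thermal_margin_c) per GPU.
samples = {"gpu0": (0.0, 20.0), "gpu1": (0.8, 3.0), "gpu2": (0.8, 20.0)}
for gpu, (sm, margin) in samples.items():
    print(gpu, triage_gpu(sm, margin))
```

An empty list for a GPU means that neither rule fired. It doesn't rule out other problems, such as the infrastructure health issues that the VM Infra Health metric reports.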

What's next