This document explains how to monitor A3 Ultra virtual machine (VM) instances and clusters that are deployed on Hypercompute Cluster. For more information about Hypercompute Cluster, see Hypercompute Cluster.
By using the available GPU metrics, you can create or use prebuilt Monitoring dashboards to monitor the following:
- VM and GPU performance
- Network transmission efficiency
- Network efficiency among blocks and sub-blocks
Monitoring this data helps you identify and troubleshoot performance bottlenecks, as well as ensure the health and stability of your VMs and clusters. To learn more about Monitoring dashboards, see Dashboards overview.
Before you begin
- When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
Required roles
To get the permissions that you need to view and create Monitoring dashboards, ask your administrator to grant you the Monitoring Editor (roles/monitoring.editor) IAM role on the project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to view and create Monitoring dashboards. The exact permissions are listed in the following Required permissions section.
Required permissions
The following permissions are required to view and create Monitoring dashboards:
- To view dashboards: monitoring.dashboards.get on the project
- To create dashboards: monitoring.dashboards.create on the project
You might also be able to get these permissions with custom roles or other predefined roles.
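As a quick self-check, the two permissions listed above can be compared against whatever permissions an account is known to hold. The following is a minimal sketch: the helper name `missing_dashboard_permissions` and the idea of passing in a plain list of granted permissions are illustrative assumptions, not part of any Google Cloud API.

```python
# Permissions that this document lists for viewing and creating dashboards.
REQUIRED_PERMISSIONS = {
    "view": "monitoring.dashboards.get",
    "create": "monitoring.dashboards.create",
}


def missing_dashboard_permissions(granted):
    """Return the required permissions that are absent from `granted`.

    `granted` is any iterable of permission strings (hypothetical input;
    how you obtain it depends on your own tooling).
    """
    granted = set(granted)
    return sorted(p for p in REQUIRED_PERMISSIONS.values() if p not in granted)


if __name__ == "__main__":
    # Example: an account that can view dashboards but not create them.
    print(missing_dashboard_permissions(["monitoring.dashboards.get"]))
```

An empty result means both permissions are present in the supplied list.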
Overview
To monitor your VMs and clusters, complete the following steps:
- Review available metrics: view the GPU metric data that you can access and use for monitoring. This is useful to understand system behavior and, optionally, create custom dashboards in the next step. For more information, see Available metrics in this document.
- Visualize metrics: based on your monitoring needs, visualize the available metrics using prebuilt dashboards or custom dashboards. For instructions, see Visualize metrics in this document.
Available metrics
This section lists the metrics that are available for monitoring the health and performance of GPUs. These metrics are available in prebuilt dashboards, and can also be used when creating a custom dashboard.
To monitor the health of your GPUs, use the following metrics:
| Name | Metric type | Description |
| --- | --- | --- |
| NVSwitch Status | instance/gpu/nvswitch_status | Whether an NVLink Switch on an NVIDIA GPU attached to a VM is encountering issues. |
| VM Infra Health | instance/gpu/infra_health | The health of the cluster, block, sub-block, and host on which your GPU VMs are running. If this metric shows that a VM's infrastructure is unhealthy, then the metric also outputs the reason. |

To monitor the performance of your GPUs, use the following metrics:

| Name | Metric type | Description |
| --- | --- | --- |
| GPU Power Consumption | instance/gpu/power_consumption | The power, in watts, consumed by individual GPUs on the host, as a double value. For VMs with multiple GPUs attached, the metric provides the power consumption separately for each GPU on the host. |
| SM Utilization | instance/gpu/sm_utilization | A non-zero value indicates that the streaming multiprocessors (SMs) on your GPUs are actively being used. |
| GPU Temperature | instance/gpu/temperature | The temperature, in degrees Celsius, of individual GPUs on the host, as a double value. For VMs with multiple GPUs attached, the metric provides the temperature separately for each GPU on the host. |
| GPU Thermal Margin | instance/gpu/tlimit | The thermal headroom, in degrees Celsius, that individual GPUs have before they need to slow down due to high temperature, as a double value. For VMs with multiple GPUs attached, the metric provides the thermal headroom separately for each GPU on the host. |

To monitor the network performance of your GPUs, use the following metrics:

| Name | Metric type | Description |
| --- | --- | --- |
| Network Traffic at Inter-Block | instance/gpu/network/inter_block_tx | The number of bytes of network traffic among blocks. |
| Network Traffic at Inter-Subblock | instance/gpu/network/inter_subblock_tx | The number of bytes of network traffic among sub-blocks. |
| Network Traffic at Intra-Subblock | instance/gpu/network/intra_subblock_tx | The number of bytes of network traffic within a single sub-block. |
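One way to read the three network metrics together is to express each as a share of the total bytes sent. The sketch below assumes that the three byte counts cover the same time window and do not overlap; this document does not state that, so treat the result as an approximation. The function name `traffic_shares` is illustrative.

```python
def traffic_shares(inter_block_bytes, inter_subblock_bytes, intra_subblock_bytes):
    """Return each locality level's fraction of the total transmitted bytes.

    ASSUMPTION: the three counters are disjoint and measured over the
    same interval. If all counters are zero, every share is 0.0.
    """
    counts = {
        "inter_block": inter_block_bytes,
        "inter_subblock": inter_subblock_bytes,
        "intra_subblock": intra_subblock_bytes,
    }
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


if __name__ == "__main__":
    # Made-up byte counts, for illustration only.
    for level, share in traffic_shares(1_000, 1_000, 2_000).items():
        print(f"{level}: {share:.0%}")
```

A high inter-block share relative to the intra-sub-block share may suggest that communicating VMs are placed far apart, which is the kind of pattern the block network dashboards are meant to surface.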
For an overview of available metrics in Compute Engine, see Google Cloud metrics.
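When you query one of these metrics outside of a dashboard, you typically need a Monitoring filter string that names the full metric type. The following sketch builds such a filter. It assumes that the short metric types listed in this document sit under the `compute.googleapis.com/` prefix and that the time series carry an `instance_id` resource label; verify both in Metrics Explorer before relying on them. The helper name `metric_filter` is illustrative.

```python
# ASSUMPTION: the short metric types in this document are published under
# the Compute Engine prefix. Confirm the full type in Metrics Explorer.
METRIC_PREFIX = "compute.googleapis.com/"


def metric_filter(short_type, instance_id=None):
    """Build a Monitoring filter string for one metric type.

    `instance_id` optionally narrows the filter to one VM.
    ASSUMPTION: the metric is written against a resource that has an
    `instance_id` label.
    """
    parts = [f'metric.type = "{METRIC_PREFIX}{short_type}"']
    if instance_id is not None:
        parts.append(f'resource.labels.instance_id = "{instance_id}"')
    return " AND ".join(parts)


if __name__ == "__main__":
    print(metric_filter("instance/gpu/temperature"))
    print(metric_filter("instance/gpu/tlimit", instance_id="1234567890"))
```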
Visualize metrics
To monitor metrics data of your VMs and clusters using Monitoring dashboards, use one of the following methods:
- For a quick overview of health and performance, or to customize an existing dashboard, use prebuilt dashboards.
- For specific monitoring needs, create custom dashboards.
If you encounter issues when using a dashboard, see Troubleshoot slow performance or errors in this document.
Use prebuilt dashboards
You can monitor your VMs and clusters using prebuilt Monitoring dashboards for Hypercompute Cluster. You can also create a copy of a prebuilt dashboard and modify it to fit your needs.
To use a prebuilt Monitoring dashboard, do the following:
- In the Google Cloud console, go to the Dashboards page. If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- In the Categories pane, click GCP.
- In the Name column, click one of the following dashboards based on the metrics that you want to monitor:
  - To monitor VM and GPU performance, click Hypercompute Cluster Health Monitoring.
  - To monitor network transmission efficiency, click Hypercompute Cluster Transmission Efficiency.
  - To monitor network efficiency among blocks and sub-blocks, click Hypercompute Cluster Block Network.

  The details page of your chosen dashboard opens.
- Optional: To create a copy of the dashboard and customize it to fit your needs, click Copy dashboard.
Create custom dashboards
To create a custom Monitoring dashboard, do the following:
Choose the metrics to monitor. If you haven't already, see Available metrics in this document.
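If you prefer to define a custom dashboard as a file rather than in the console, the chosen metrics can be assembled into a dashboard definition in JSON. The following is a minimal sketch, not the documented procedure for this product: it assumes the metric types sit under the `compute.googleapis.com/` prefix, uses a 60-second mean alignment as an arbitrary choice, and the helper names `xy_chart_widget` and `build_dashboard` are illustrative.

```python
import json

# ASSUMPTION: the short metric types in this document are published under
# the Compute Engine prefix. Confirm the full type in Metrics Explorer.
METRIC_PREFIX = "compute.googleapis.com/"


def xy_chart_widget(title, short_type):
    """Return a line-chart widget definition for one metric type."""
    return {
        "title": title,
        "xyChart": {
            "dataSets": [
                {
                    "timeSeriesQuery": {
                        "timeSeriesFilter": {
                            "filter": f'metric.type = "{METRIC_PREFIX}{short_type}"',
                            # Arbitrary choice: average each series per minute.
                            "aggregation": {
                                "alignmentPeriod": "60s",
                                "perSeriesAligner": "ALIGN_MEAN",
                            },
                        }
                    }
                }
            ]
        },
    }


def build_dashboard(display_name, charts):
    """Return a dashboard definition with one chart per (title, metric) pair."""
    return {
        "displayName": display_name,
        "gridLayout": {
            "columns": "2",
            "widgets": [xy_chart_widget(title, metric) for title, metric in charts],
        },
    }


dashboard = build_dashboard(
    "GPU performance (example)",
    [
        ("GPU Temperature", "instance/gpu/temperature"),
        ("GPU Thermal Margin", "instance/gpu/tlimit"),
        ("GPU Power Consumption", "instance/gpu/power_consumption"),
    ],
)
dashboard_json = json.dumps(dashboard, indent=2)

if __name__ == "__main__":
    print(dashboard_json)
```

The printed JSON could be saved to a file and supplied to a tool that accepts dashboard definitions, for example `gcloud monitoring dashboards create --config-from-file`. Check the result in the console, because the aligner that suits each metric may differ from the one assumed here.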
Troubleshoot slow performance or errors
If you experience slow performance or errors in your jobs or workloads, then you can troubleshoot them by doing the following:
- In the Google Cloud console, go to the Dashboards page. If you use the search bar to find this page, then select the result whose subheading is Monitoring.
- In the Categories pane, click GCP.
- To learn more about the metrics, do one of the following:
  - For network efficiency among blocks and sub-blocks, click GCE Interactive Playbook - Hypercompute Cluster Block Network.
  - For VM and GPU performance, and network transmission efficiency, click GCE Interactive Playbook - Hypercompute Cluster Health Monitoring.
- Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.