This document explains how to use the monitoring service that is available from the Cluster Director suite. For more information about Cluster Director, see Cluster Director overview.
By using the available metrics in this document, you can create or use prebuilt Cloud Monitoring dashboards to monitor the following:
VM health
GPU performance
Network transmission efficiency
Network efficiency among blocks and sub-blocks
Machine learning (ML) workload efficiency
Straggler detection
Monitoring this data helps you identify and troubleshoot performance bottlenecks, as well as ensure the health and stability of your workloads and infrastructure.
Before you begin
Before monitoring your workload, if you haven't already done so, complete the following steps:
Deploy a workload that you can monitor. To learn which workloads are supported, see the limitations in this document. To learn how to deploy a workload, see VM and cluster creation overview.
Learn about the Google Cloud services for monitoring workloads:
The metrics in this document use Monitoring dashboards. Learn about Monitoring dashboards, Monitoring retention periods, and Monitoring pricing.
Straggler detection also provides log entries in Cloud Logging. Learn about Logging interfaces, Logging retention periods, and Logging pricing.
When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
Limitations
The metrics in this document are only supported for Cluster Director workloads that run on VMs that meet all the following criteria:
- The VMs must have been created by using Compute Engine or a Slurm cluster of Compute Engine VMs.
- The VMs must use the A4 or A3 Ultra machine series. However, straggler detection also supports VMs that use the A3 Mega machine series.
- The VMs must use the future reservations consumption option.
To monitor ML workload metrics, you must set up monitoring for your workload.
Straggler detection metrics have the following additional limitations:
For supported machine series other than A3 Mega, straggler detection only supports VMs that enable the Collective Communication Analyzer (CoMMA) library to export NCCL telemetry to Google Cloud services. For more information, see CoMMA overview.
Straggler detection typically takes up to 10 minutes to report a straggler.
Unlike the other metrics in this document, you can't filter straggler detection metrics for your projects by cluster, block, sub-block, or VM. However, you can filter queries for straggler detection logs by the ID of one or more VMs that are suspected stragglers.
Required roles
To get the permissions that you need to monitor metrics for Cluster Director workloads, ask your administrator to grant you the following IAM roles:
- To view metrics in Cloud Monitoring: Monitoring Editor (roles/monitoring.editor) on the project
- To view straggler detection logs in Logging: Logs Viewer (roles/logging.viewer) on the project
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to monitor metrics for Cluster Director workloads. The following section lists the exact permissions that are required.
Required permissions
The following permissions are required to monitor metrics for Cluster Director workloads:
- To view dashboards: monitoring.dashboards.get on the project
- To create dashboards: monitoring.dashboards.create on the project
- To view log entries: logging.logEntries.list on the project
You might also be able to get these permissions with custom roles or other predefined roles.
Available metrics
Depending on your use case, the following metrics are available for monitoring your VMs and Slurm clusters:
To monitor the health, performance, and network performance of the GPUs attached to your VMs, see Infrastructure metrics.
To monitor the efficiency of the GPUs in your ML workloads, see ML workload metrics.
To monitor suspected straggler VMs in ML workloads with slow performance, see Straggler detection metrics.
To learn how to view these metrics, see View metrics in this document.
Infrastructure metrics
To monitor the health, performance, and network performance of your VMs with attached GPUs, use one or more of the following metrics:
To monitor the health of your VMs, use the following metrics:

| Name | Metric type | Description |
|---|---|---|
| NVSwitch Status | instance/gpu/nvswitch_status | Whether an NVLink Switch on an NVIDIA GPU attached to a VM is encountering issues. |
| VM Infra Health | instance/gpu/infra_health | The health of the cluster, block, sub-block, and host on which your GPU VMs are running. If this metric shows that a VM's infrastructure is unhealthy, then the metric also outputs the reason. |

To monitor the performance of your GPUs, use the following metrics:

| Name | Metric type | Description |
|---|---|---|
| GPU Power Consumption | instance/gpu/power_consumption | The power, in watts, consumed by individual GPUs on the host, formatted as a double. For VMs with multiple GPUs attached, the metric provides the power consumption separately for each GPU on the host. |
| SM Utilization | instance/gpu/sm_utilization | A non-zero value indicates that the streaming multiprocessors (SMs) on your GPUs are actively being used. |
| GPU Temperature | instance/gpu/temperature | The temperature, in degrees Celsius, of individual GPUs on the host, formatted as a double. For VMs with multiple GPUs attached, the metric provides the temperature separately for each GPU on the host. |
| GPU Thermal Margin | instance/gpu/tlimit | The thermal headroom, in degrees Celsius, that individual GPUs have before they need to slow down due to high temperature, formatted as a double. For VMs with multiple GPUs attached, the metric provides the thermal headroom separately for each GPU on the host. |

To monitor the network performance across blocks and sub-blocks, use the following metrics:

| Name | Metric type | Description |
|---|---|---|
| Network Traffic at Inter-Block | instance/gpu/network/inter_block_tx | The number of bytes of network traffic among blocks. |
| Network Traffic at Inter-Subblock | instance/gpu/network/inter_subblock_tx | The number of bytes of network traffic among sub-blocks. |
| Network Traffic at Intra-Subblock | instance/gpu/network/intra_subblock_tx | The number of bytes of network traffic within a single sub-block. |

To monitor the network performance of your GPUs, use the following metrics:

| Name | Metric type | Description |
|---|---|---|
| Link Carrier Changes | instance/gpu/link_carrier_changes | How often the network link carrier changes in a minute. |
| Network RTT | instance/gpu/network_rtt | The round-trip time, measured in microseconds, for network data to travel between a source and destination. |
| Throughput Rx Bytes | instance/gpu/throughput_rx_bytes | The number of bytes received as network traffic. |
| Throughput Tx Bytes | instance/gpu/throughput_tx_bytes | The number of bytes transmitted as network traffic. |
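If you prefer to read these metrics programmatically instead of in a dashboard, you can query them with the Cloud Monitoring API. The following is a minimal sketch that uses the Python client library to list recent GPU Power Consumption values. It assumes that the full metric type uses the compute.googleapis.com prefix and that the monitored resource exposes an instance_id label; verify both in Metrics Explorer for your project.

```python
# Minimal sketch: list recent GPU Power Consumption values with the
# Cloud Monitoring API. The metric type prefix (compute.googleapis.com/)
# and the instance_id resource label are assumptions; verify them in
# Metrics Explorer before relying on this query.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # replace with your project ID

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "compute.googleapis.com/instance/gpu/power_consumption"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    instance = dict(series.resource.labels).get("instance_id", "unknown")
    for point in series.points:
        print(f"{instance}: {point.value.double_value:.1f} W")
```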
For an overview of available metrics in Compute Engine, see Google Cloud metrics.
ML workload metrics
To monitor the productivity—specifically, the goodput—of your ML workloads, use the following metrics:
| Name | Metric type | Description |
|---|---|---|
| Productive time | workload/goodput_time | The time, in seconds, that the workload spends on goodput activities. These activities are core, useful tasks, such as a forward or backward pass during model training. |
| Non-productive time | workload/badput_time | The time, in seconds, that the workload spends on badput activities. These activities are overhead tasks, such as loading or preprocessing data for training. |
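Because the two metrics are complementary, one useful derived value is the share of total workload time that is productive. The following sketch only illustrates that arithmetic; the goodput fraction shown here is a value that you compute yourself, not a separate metric.

```python
# Illustration only: derive a goodput fraction from the two workload metrics
# above. The fraction is computed by you; it is not a separate metric.
def goodput_fraction(goodput_seconds: float, badput_seconds: float) -> float:
    """Return the share of total workload time spent on productive activities."""
    total = goodput_seconds + badput_seconds
    return goodput_seconds / total if total else 0.0

# Example: 5,400 s of productive time and 600 s of overhead -> 0.9 (90% goodput).
print(goodput_fraction(5400.0, 600.0))
```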
For an overview of available metrics in Compute Engine, see Google Cloud metrics.
Straggler detection metrics
Straggler detection metrics help you notice and pinpoint suspected stragglers. Stragglers are single-point, non-crashing failures that eventually slow down the entire workload.
To monitor straggler detection for your VMs, use the following metric:
| Name | Metric type | Description |
|---|---|---|
| Suspected Stragglers | instance/gpu/straggler_status | Whether a VM is suspected as a straggler that is affecting the performance of the workload. We recommend that you act on suspected stragglers only when other metrics indicate that the workload is experiencing issues. |
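If you want to check this metric outside the console, you can use the same Cloud Monitoring API pattern as the infrastructure sketch earlier. The metric type prefix and the assumption that a non-zero value flags a suspected straggler are mine; confirm both in Metrics Explorer before acting on the output.

```python
# Minimal sketch: print VMs whose Suspected Stragglers metric most recently
# reported a non-zero value. The compute.googleapis.com/ prefix and the
# "non-zero means suspected straggler" reading are assumptions; confirm them
# in Metrics Explorer for your project.
import time

from google.cloud import monitoring_v3

PROJECT_ID = "your-project-id"  # replace with your project ID

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 1800}}
)

series_list = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "compute.googleapis.com/instance/gpu/straggler_status"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in series_list:
    labels = dict(series.resource.labels)
    latest = series.points[0] if series.points else None  # points are newest first
    if latest and (latest.value.int64_value or latest.value.bool_value):
        print("Suspected straggler:", labels.get("instance_id", labels))
```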
You can also view straggler detection metrics in log entries. For example, you can use the following queries:
| Description | Query |
|---|---|
| Logs with suspected stragglers for specific VMs. Use this query to check if there are any suspected stragglers for a specific workload in your project. | logName=~"/logs/compute.googleapis.com%2Fworkload_diagnostic" AND jsonPayload.suspectedStragglersDetection.numNodes > 0 AND jsonPayload.suspectedStragglersDetection.nodes.instanceId="INSTANCE_ID" Replace INSTANCE_ID with the ID of a VM in the workload. To check additional VMs, append more conditions in the form OR jsonPayload.suspectedStragglersDetection.nodes.instanceId="INSTANCE_ID". |
| All logs from straggler detection for your project. Use this query to verify if the straggler detection service is running when no suspected stragglers are detected. (Due to the limitations, you can't filter the logs without suspected stragglers by specific VMs.) | logName=~"/logs/compute.googleapis.com%2Fworkload_diagnostic" |
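You can also run these queries with the Cloud Logging client library instead of the Logs Explorer. The following sketch reproduces the first query; the project ID and instance ID are placeholders for values from your own workload.

```python
# Minimal sketch: run the suspected-stragglers query with the Cloud Logging
# client library. PROJECT_ID and INSTANCE_ID are placeholders.
from google.cloud import logging as cloud_logging

PROJECT_ID = "your-project-id"
INSTANCE_ID = "1234567890123456789"  # ID of a VM in your workload

client = cloud_logging.Client(project=PROJECT_ID)
log_filter = (
    'logName=~"/logs/compute.googleapis.com%2Fworkload_diagnostic" '
    "AND jsonPayload.suspectedStragglersDetection.numNodes > 0 "
    f'AND jsonPayload.suspectedStragglersDetection.nodes.instanceId="{INSTANCE_ID}"'
)

for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
    detection = entry.payload.get("suspectedStragglersDetection", {})
    print(entry.timestamp, detection.get("numNodes"), detection.get("message"))
```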
Straggler detection metrics are particularly helpful for large-scale ML workloads for the following reasons:
Large-scale ML workloads are very susceptible to stragglers. These workloads use synchronous, massively distributed computing; that is, they have many highly interdependent components that run simultaneously. This architecture makes them especially vulnerable to single-point failures like stragglers.
Noticing and pinpointing stragglers in large-scale ML workloads is very difficult. For reference, there are two types of single-point failures:
- Stopping failures: Failures that cause the entire system to halt, for example, host errors and maintenance events. These failures are relatively straightforward to detect and resolve.
- Slow failures: Failures that cause severe performance degradation without crashes. These failures are very difficult to pinpoint and debug.
Due to their slow-failure nature, stragglers are inherently difficult to notice and pinpoint, especially in large-scale synchronous workloads.
View metrics
To view metrics for your VMs and Slurm clusters, use Monitoring dashboards as follows:
To view infrastructure metrics and straggler detection metrics, you can use the following methods:
For a quick overview or to customize an existing dashboard, use prebuilt dashboards.
For specific monitoring needs, create custom dashboards.
To view ML workload metrics, see the documentation for how to set up monitoring for your workload.
To view logs from straggler detection, view straggler detection logs.
Use prebuilt dashboards
You can use Monitoring dashboards that are prebuilt for Cluster Director to view metrics for your VMs and Slurm clusters. You can also create a copy of a prebuilt dashboard and modify it to fit your needs.
To use a prebuilt dashboard for Cluster Director, do the following:
1. In the Google Cloud console, go to the Dashboards page:

   If you use the search bar to find this page, then select the result whose subheading is Monitoring.

2. In the Name column, click the name of one of the following dashboards based on which metrics you want to view:

   - To monitor VM health, GPU performance, and straggler detection, use the Cluster Director Health Monitoring dashboard. For more information about how to use these metrics to identify and analyze issues, also use the GCE Interactive Playbook - Cluster Director Health Monitoring playbook dashboard.
   - To monitor network transmission efficiency, use the Cluster Director Transmission Efficiency dashboard.
   - To monitor network efficiency among blocks and sub-blocks, use the Cluster Director Block Network dashboard. For more information about how to use these metrics to identify and analyze issues, also use the GCE Interactive Playbook - Cluster Director Block Network playbook dashboard.

   The details page of your chosen dashboard opens. You can use the time-range selector in the toolbar to change the time range of the data.

3. Optional: To create a copy of a dashboard and customize it to fit your needs, click Copy dashboard.
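Prebuilt dashboards are provided in the console, but a copy that you create with Copy dashboard is stored as a custom dashboard in your project. As a minimal sketch, and assuming your copy keeps "Cluster Director" in its display name, you can find the copy later with the Cloud Monitoring dashboards API:

```python
# Minimal sketch: list dashboards in the project and keep those whose display
# name mentions Cluster Director. The name match is an assumption about how
# your copied dashboard is named.
from google.cloud import monitoring_dashboard_v1

PROJECT_ID = "your-project-id"  # replace with your project ID

client = monitoring_dashboard_v1.DashboardsServiceClient()
for dashboard in client.list_dashboards(request={"parent": f"projects/{PROJECT_ID}"}):
    if "Cluster Director" in dashboard.display_name:
        print(dashboard.display_name, dashboard.name)
```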
Create custom dashboards
To create a custom Monitoring dashboard, do the following:
Choose the metrics to monitor. If you haven't already, then see Available metrics in this document.
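After you choose your metrics, you can build the dashboard in the Google Cloud console or define it programmatically with the Cloud Monitoring dashboards API. The following is a minimal sketch of the programmatic approach; the metric type prefix (compute.googleapis.com/), the chart title, and the aggregation settings are illustrative assumptions that you should adapt to the metric that you chose.

```python
# Minimal sketch: create a one-chart custom dashboard for an assumed metric
# type. Adjust the filter, titles, and aggregation to the metric you chose.
import datetime

from google.cloud import monitoring_dashboard_v1

PROJECT_ID = "your-project-id"  # replace with your project ID

# One chart that plots the assumed GPU power consumption metric type.
chart = monitoring_dashboard_v1.XyChart(
    data_sets=[
        monitoring_dashboard_v1.XyChart.DataSet(
            time_series_query=monitoring_dashboard_v1.TimeSeriesQuery(
                time_series_filter=monitoring_dashboard_v1.TimeSeriesFilter(
                    filter='metric.type = "compute.googleapis.com/instance/gpu/power_consumption"',
                    aggregation=monitoring_dashboard_v1.Aggregation(
                        alignment_period=datetime.timedelta(seconds=60),
                        per_series_aligner=monitoring_dashboard_v1.Aggregation.Aligner.ALIGN_MEAN,
                    ),
                )
            )
        )
    ]
)

dashboard = monitoring_dashboard_v1.Dashboard(
    display_name="Cluster Director - GPU power (custom)",
    grid_layout=monitoring_dashboard_v1.GridLayout(
        columns=1,
        widgets=[monitoring_dashboard_v1.Widget(title="GPU power consumption", xy_chart=chart)],
    ),
)

client = monitoring_dashboard_v1.DashboardsServiceClient()
request = monitoring_dashboard_v1.CreateDashboardRequest(
    parent=f"projects/{PROJECT_ID}", dashboard=dashboard
)
created = client.create_dashboard(request=request)
print("Created dashboard:", created.name)
```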
View straggler detection logs
To view straggler detection logs using the Logs Explorer, complete the following steps:
1. In the Google Cloud console, go to the Logs Explorer page:

   If you use the search bar to find this page, then select the result whose subheading is Logging.

2. The page queries all logs in your project by default. Click Stop query.

3. Use the time-range selector in the toolbar to select the time range that you want to analyze. Straggler detection typically takes up to 10 minutes to report a straggler.

4. In the Query pane, enter a query for straggler detection logs.

5. Click Run Query.
The following is an example of a straggler detection log entry.
```json
{
  ...
  "jsonPayload": {
    ...
    "@type": "type.googleapis.com/ml.aitelemetry.performancedebugging.output.NetworkStragglersOutput",
    "suspectedStragglersDetection": {
      "numNodes": 4,
      "nodes": [
        {
          "latencyMs": 9,
          "instanceId": "INSTANCE_ID_1"
        },
        {
          "latencyMs": 9,
          "instanceId": "INSTANCE_ID_2"
        },
        {
          "instanceId": "INSTANCE_ID_3",
          "latencyMs": 4
        },
        {
          "instanceId": "INSTANCE_ID_4",
          "latencyMs": 0
        }
      ],
      "message": "Suspected stragglers detected."
    }
  },
  "resource": {
    "type": "project",
    "labels": {
      "project_id": "PROJECT_NUMBER"
    }
  },
  ...
  "severity": "INFO",
  "logName": "projects/PROJECT_ID/logs/compute.googleapis.com%2Fworkload_diagnostic",
  ...
}
```
The log entry includes the following fields:
- numNodes: The number of suspected straggler VMs that are detected in the project. In the example, four (4) suspected straggler VMs have been detected.
- instanceId: The ID of a VM that was detected as a suspected straggler.
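If you retrieve these entries with the Cloud Logging client library, the same fields are available in each entry's payload as a dictionary. The following sketch shows how you might pull out the suspected straggler VM IDs; the hard-coded payload stands in for one entry returned by the query shown earlier.

```python
# Minimal sketch: extract suspected straggler VM IDs from a log entry's
# jsonPayload. The payload below stands in for one entry returned by the
# query shown earlier (entry.payload when using the Cloud Logging library).
payload = {
    "suspectedStragglersDetection": {
        "numNodes": 4,
        "nodes": [
            {"instanceId": "INSTANCE_ID_1", "latencyMs": 9},
            {"instanceId": "INSTANCE_ID_2", "latencyMs": 9},
            {"instanceId": "INSTANCE_ID_3", "latencyMs": 4},
            {"instanceId": "INSTANCE_ID_4", "latencyMs": 0},
        ],
    }
}

detection = payload.get("suspectedStragglersDetection", {})
print(f"{detection.get('numNodes', 0)} suspected straggler VM(s)")
for node in detection.get("nodes", []):
    print(f"  instance {node['instanceId']}: latency {node.get('latencyMs', 0)} ms")
```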
For instructions on how to use and act on straggler detection logs, see Troubleshoot slow performance in this document.
Troubleshoot slow performance
For instructions on how to use metrics to troubleshoot workloads with slow performance, see Troubleshoot slow performance.
What's next
- Observe and monitor VMs
- Test clusters using cluster health scanner
- Customize dashboards for Google Cloud services