Monitor instances and operations

Cloud Monitoring automatically collects and stores information about your Managed Lustre instance.

This document provides a detailed overview of the metrics available for monitoring your Managed Lustre instances on Google Cloud. These metrics help you understand the performance, capacity, and health of your Managed Lustre file systems, so you can identify bottlenecks, troubleshoot issues, and optimize resource utilization.

You can use these metrics in Cloud Monitoring to create custom dashboards, set up alerts, and gain deeper insights into your Managed Lustre instance's behavior.

Cloud Monitoring is automatically enabled for Managed Lustre. There's no charge for collecting data or for viewing metrics in the Google Cloud console. API calls may incur charges; see Cloud Monitoring pricing for details.

Required IAM roles

The following roles are required:

Monitoring Viewer (roles/monitoring.viewer), or equivalent permissions, to view metrics in Cloud Monitoring.
Monitoring Editor (roles/monitoring.editor), or equivalent permissions, to configure alerts.

Learn how to grant an IAM role.

View metrics

Cloud Monitoring metrics are available from two locations in the Google Cloud console:

The Managed Lustre instance details page displays available metrics. In addition to the metrics listed below, it computes the bandwidth of bytes copied and the rate of objects copied.
The Cloud Monitoring page provides multiple chart options and customizations.

View metrics on the instance details page

To view a specific instance's metrics:

Go to the Instances page in the Google Cloud console.

Click the instance for which to view metrics. The Instance details page appears.
Click the Monitoring tab. The default dashboard is displayed.

View metrics in Cloud Monitoring

To view Managed Lustre metrics in Cloud Monitoring, do the following:

Go to the Metrics Explorer page in the Google Cloud console.

Follow the instructions in Create charts with Metrics Explorer to select and display your metrics.

Set up alerts

You can configure alerting policies in Cloud Monitoring to notify you when your Managed Lustre file system meets specific conditions, such as exceeding storage capacity or throughput limits.

Prerequisites

To create alerting policies, you must have the Monitoring Editor (roles/monitoring.editor) IAM role on the project.

Create an alerting policy

To set up an alert, define a condition using a metric or a PromQL query and configure notification channels.

In the Google Cloud console, go to the Alerting page.

Click + Create policy.
Select Builder and select your metric, or choose Code editor to enter a query with PromQL. In the metric picker, Managed Lustre metrics fall under the Lustre instance and Lustre location resources.
Configure your trigger logic and define your notification channels and notification settings.
Click Create policy.

For more information about creating triggers and other options, see the Cloud Monitoring documentation on alerting policies.

Example: Create a storage capacity alert

The following example demonstrates how to create an alert that triggers when your Managed Lustre instance exceeds 80% of its provisioned capacity.

In the Google Cloud console, go to the Alerting page.

Click + Create policy.
Select Code editor.

In the Query Editor, paste the following PromQL query:

(
  sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes)
  -
  sum by (instance_id, location) (lustre_googleapis_com:instance_available_bytes)
)
/
sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes)
> 0.8

This query calculates the usage ratio separately for each instance and location: (Total - Available) / Total. The threshold 0.8 means the condition is met when more than 80% of the provisioned bytes are in use. To alert at 90%, change this value to 0.9.

Click Run Query to verify the syntax and view a chart of the current usage ratio.
Click Next and configure the trigger to Any time series violates.

Click Next. In the Documentation section, add recommended actions for resolving the capacity issue. For example:

## Action Required: Lustre Capacity Warning
The Managed Lustre instance is exceeding 80% capacity usage.

**Metric:** Usage Ratio > 0.8
**Severity:** Warning

**Recommended Actions:**
1. Check the instance details in the Google Cloud console.
2. Verify if this is expected data growth or a runaway process.
3. If valid, consider expanding the storage capacity of the instance or deleting old data to free up space.
4. Failure to address this may result in "No Space Left on Device" errors for client applications.
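
As a quick check of the arithmetic behind the query in this example, the following Python sketch computes the same ratio, (Total - Available) / Total, for a single instance. The function name and the sample byte counts are illustrative assumptions, not values from a real instance.

```python
def usage_ratio(capacity_bytes: int, available_bytes: int) -> float:
    """Return the fraction of provisioned capacity in use: (total - available) / total."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return (capacity_bytes - available_bytes) / capacity_bytes


# Hypothetical instance: 18,000 GiB provisioned, 2,700 GiB still available.
GIB = 1024 ** 3
ratio = usage_ratio(18_000 * GIB, 2_700 * GIB)
should_alert = ratio > 0.8  # same threshold as the PromQL query
print(f"usage ratio = {ratio:.2f}, alert = {should_alert}")
```

With these sample numbers, 15,300 of 18,000 GiB are in use, so the ratio is above the 0.8 threshold and the condition would be met.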

Create an alerting policy with gcloud

You can create alerting policies using the Google Cloud CLI. Note that you must edit the alert in the Google Cloud console later to enable specific notification channels.

The following example creates an 80% capacity alert using gcloud:

gcloud alpha monitoring policies create \
  --policy-from-file=/dev/stdin <<EOF
{
  "displayName": "Lustre High Capacity Usage (>80%)",
  "severity": "WARNING",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Capacity Usage Ratio > 0.8",
      "conditionPrometheusQueryLanguage": {
        "query": "(sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes) - sum by (instance_id, location) (lustre_googleapis_com:instance_available_bytes)) / sum by (instance_id, location) (lustre_googleapis_com:instance_capacity_bytes) > 0.8",
        "duration": "300s",
        "evaluationInterval": "60s",
        "alertRule": "AlwaysOn"
      }
    }
  ],
  "documentation": {
    "content": "Action Required: The Managed Lustre instance is exceeding 80% capacity usage. Please verify if storage expansion is required.",
    "mimeType": "text/markdown"
  }
}
EOF
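
If you need the same policy at several thresholds, you could generate the JSON body instead of editing it by hand. The following Python sketch rebuilds the policy shown above for an arbitrary threshold. The function name `capacity_alert_policy` is hypothetical; the field names and values are copied from the example above.

```python
import json


def capacity_alert_policy(threshold: float) -> dict:
    """Build an alerting-policy body like the one above for a given usage threshold."""
    by = "sum by (instance_id, location)"
    capacity = f"{by} (lustre_googleapis_com:instance_capacity_bytes)"
    available = f"{by} (lustre_googleapis_com:instance_available_bytes)"
    percent = round(threshold * 100)
    return {
        "displayName": f"Lustre High Capacity Usage (>{percent}%)",
        "severity": "WARNING",
        "combiner": "OR",
        "conditions": [
            {
                "displayName": f"Capacity Usage Ratio > {threshold}",
                "conditionPrometheusQueryLanguage": {
                    "query": f"({capacity} - {available}) / {capacity} > {threshold}",
                    "duration": "300s",
                    "evaluationInterval": "60s",
                    "alertRule": "AlwaysOn",
                },
            }
        ],
        "documentation": {
            "content": (
                "Action Required: The Managed Lustre instance is exceeding "
                f"{percent}% capacity usage. Please verify if storage expansion is required."
            ),
            "mimeType": "text/markdown",
        },
    }


# Print a 90% variant of the policy as JSON.
policy = capacity_alert_policy(0.9)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

One way to use the output is to save it to a file (for example, a hypothetical policy.json) and pass that file to the --policy-from-file flag in place of /dev/stdin.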

Metric details

The following metrics are available for Managed Lustre instances. Each metric is identified by its type (e.g., lustre.googleapis.com/instance/available_bytes), has a display name, a description, and specific labels that provide additional context.

Data is sampled every 60 seconds. After sampling, data may not be visible for up to 180 seconds.
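
The metric type and the visibility delay both matter when you read these metrics programmatically. The following Python sketch builds a Cloud Monitoring filter for one metric and a query window that ends 180 seconds in the past. It assumes that every metric in this document shares the lustre.googleapis.com/instance/ prefix shown in the example type above, and that the google-cloud-monitoring client library is installed if you call `list_points`. The helper names and the default lookback are hypothetical. Ending the window early is a design choice to avoid incomplete recent data, not a requirement.

```python
import time

# Assumption: all metrics listed here share this prefix.
METRIC_PREFIX = "lustre.googleapis.com/instance/"
INGEST_DELAY_SECONDS = 180  # data may not be visible for up to 180 seconds


def metric_filter(metric_name: str) -> str:
    """Build a Cloud Monitoring filter for one Managed Lustre metric."""
    return f'metric.type = "{METRIC_PREFIX}{metric_name}"'


def query_window(now: int, lookback_seconds: int = 600) -> tuple[int, int]:
    """Return (start, end) epoch seconds, ending before the visibility delay."""
    end = now - INGEST_DELAY_SECONDS
    return end - lookback_seconds, end


def list_points(project_id: str, metric_name: str):
    """Fetch recent time series; requires the google-cloud-monitoring package."""
    from google.cloud import monitoring_v3  # imported lazily: optional dependency

    start, end = query_window(int(time.time()))
    client = monitoring_v3.MetricServiceClient()
    interval = monitoring_v3.TimeInterval(
        {"start_time": {"seconds": start}, "end_time": {"seconds": end}}
    )
    return client.list_time_series(
        request={
            "name": f"projects/{project_id}",
            "filter": metric_filter(metric_name),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )


print(metric_filter("available_bytes"))
print(query_window(1_700_000_000))
```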

Storage Capacity Metrics

Metrics related to the storage space available and provisioned on your Lustre file system.

For metric labels, the value of target uses the format <fsname>-<TYPE><HEXA> where <HEXA> is the zero-based index of the target in hexadecimal. For example, if your filesystem name is filesys, the 43rd OST is filesys-OST002a, and the 4th MDT is filesys-MDT0003.
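
The naming rule can be sketched in Python as follows. The four-digit hexadecimal width and the three-letter type are inferred from the examples above rather than stated as a rule, and both function names are hypothetical.

```python
def target_name(fsname: str, target_type: str, index: int) -> str:
    """Format a target label value as <fsname>-<TYPE><HEXA> with a zero-based hex index."""
    return f"{fsname}-{target_type}{index:04x}"


def parse_target(name: str) -> tuple[str, str, int]:
    """Split a target label value back into (fsname, type, zero-based index)."""
    fsname, suffix = name.rsplit("-", 1)
    return fsname, suffix[:3], int(suffix[3:], 16)


print(target_name("filesys", "OST", 42))  # the 43rd OST has zero-based index 42
print(parse_target("filesys-MDT0003"))    # the 4th MDT has zero-based index 3
```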

Metric	Description	Details
`available_bytes`	The number of bytes of storage space for a given Object Storage Target (OST) or Metadata Target (MDT) that is available to non-root users.	Display Name: Available bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: bytes. Labels: `component`: The target type: `ost`, `mdt`, or `mgt`. `target`: The name of the target.
`capacity_bytes`	The number of bytes provisioned for the given target. The total usable data or metadata space for an instance can be obtained by adding the capacity of all targets of a given type.	Display Name: Capacity bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: bytes. Labels: `component`: The target type: `ost`, `mdt`, or `mgt`. `target`: The name of the target.
`free_bytes`	The number of bytes of storage space for a given OST or MDT that is available to root users.	Display Name: Free bytes. Metric Kind: GAUGE. Value Type: INT64. Unit: bytes. Labels: `component`: The target type: `ost`, `mdt`, or `mgt`. `target`: The name of the target.
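
Because `capacity_bytes` is reported per target, finding the total for an instance means summing over all targets of one type. The following Python sketch shows that aggregation on a few made-up samples. The byte values and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical capacity_bytes samples: (component, target, bytes).
samples = [
    ("ost", "filesys-OST0000", 4_000),
    ("ost", "filesys-OST0001", 4_000),
    ("mdt", "filesys-MDT0000", 500),
]


def total_by_component(points):
    """Sum per-target values into one total per target type."""
    totals = defaultdict(int)
    for component, _target, value in points:
        totals[component] += value
    return dict(totals)


totals = total_by_component(samples)
print(totals)
```

Here the two OSTs add up to the usable data space, and the single MDT gives the usable metadata space.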

Inode (object) Metrics

Metrics related to the number of inodes (objects) available and the maximum capacity.

Metric	Description	Details
`inodes_free`	The number of inodes (objects) available on the given target.	Display Name: Free inodes. Metric Kind: GAUGE. Value Type: INT64. Unit: inodes. Labels: `component`: The target type. `target`: The name of the target.
`inodes_maximum`	The maximum number of inodes (objects) the target can hold.	Display Name: Maximum inodes. Metric Kind: GAUGE. Value Type: INT64. Unit: inodes. Labels: `component`: The target type. `target`: The name of the target.
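
A target can run out of inodes before it runs out of bytes, so it can be useful to compare the two inode metrics per target. The following Python sketch flags targets whose inode usage, (maximum - free) / maximum, exceeds a threshold. The sample values, the 0.9 default, and the function name are hypothetical.

```python
# Hypothetical per-target samples: target -> (inodes_maximum, inodes_free).
inode_samples = {
    "filesys-MDT0000": (1_000_000, 250_000),
    "filesys-MDT0001": (1_000_000, 50_000),
}


def nearly_full_targets(samples, threshold=0.9):
    """Return the targets whose inode usage ratio is above the threshold."""
    flagged = []
    for target, (maximum, free) in samples.items():
        if maximum > 0 and (maximum - free) / maximum > threshold:
            flagged.append(target)
    return sorted(flagged)


flagged = nearly_full_targets(inode_samples)
print(flagged)
```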

I/O Performance Metrics

Metrics providing insight into data transfer rates and operation latency.

Operation latency

Metric	Description	Details
`io_time_milliseconds_total`	The number of read or write operations whose latency is within the bucketed latency ranges.	Display Name: Operation latency. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `component`: The target type. `operation`: The operation type. `size`: The bucketed latency range. For example, 512 includes the count of operations that took between 512 and 1024 milliseconds. `target`: The name of the target.
`read_bytes_total`	The number of data bytes read from the given OST.	Display Name: Data read bytes. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: bytes. Labels: `component`: The target type: always `ost`. `operation`: The operation type: `read`. `target`: The name of the target.
`read_samples_total`	The number of read operations performed on the given OST.	Display Name: Data read operations. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `component`: The target type: always `ost`. `operation`: The operation type: `read`. `target`: The name of the target.
`write_bytes_total`	The number of data bytes written to the given OST.	Display Name: Data write bytes. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: bytes. Labels: `component`: The target type: always `ost`. `operation`: The operation type: `write`. `target`: The name of the target.
`write_samples_total`	The number of write operations performed on the given OST.	Display Name: Data write operations. Metric Kind: CUMULATIVE. Value Type: INT64. Unit: operations. Labels: `component`: The target type: always `ost`. `operation`: The operation type: `write`. `target`: The name of the target.
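
The byte and operation metrics are CUMULATIVE counters, so a single sample is not a rate. To get throughput or operations per second, take the difference between two samples and divide by the time between them. The following Python sketch shows that calculation on made-up samples taken 60 seconds apart. The values and the function name are hypothetical, and the sketch ignores counter resets.

```python
def per_second_rate(earlier: int, later: int, interval_seconds: float) -> float:
    """Convert two samples of a CUMULATIVE counter into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (later - earlier) / interval_seconds


# Hypothetical samples for one OST, taken 60 seconds apart.
read_bytes = (10_000_000_000, 16_000_000_000)  # read_bytes_total
read_ops = (2_000_000, 2_006_000)              # read_samples_total

throughput = per_second_rate(*read_bytes, 60)  # bytes per second
iops = per_second_rate(*read_ops, 60)          # read operations per second
avg_io_size = (read_bytes[1] - read_bytes[0]) / (read_ops[1] - read_ops[0])

print(f"{throughput:.0f} B/s, {iops:.0f} ops/s, {avg_io_size:.0f} B per read")
```

Dividing the byte delta by the operation delta gives the average I/O size over the interval, which can help distinguish many small reads from fewer large ones.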

Client Connection Metrics

Metrics specifically for understanding client connectivity.

Connected clients

Metric	Description	Details
`connected_clients`	The number of clients currently connected to the given MDT.	Display Name: Connected clients. Metric Kind: GAUGE. Value Type: INT64. Unit: clients. Labels: `component`: The target type. This is always `mdt`. `target`: The name of the MDT.