
Types of alerting policies

This page describes different types of alerting policies as they are represented by the Google Cloud Console or by the Cloud Monitoring API, and it provides JSON examples for these policies. If you are interested in alerting policies that are created by using Monitoring Query Language (MQL), see Alerting policies with MQL.

An alerting policy defines conditions, and these conditions are built on metrics. An alerting policy condition can monitor, for example, if a metric reaches a value, or if a metric starts to change quickly. Metrics are associated with resources and measure some characteristic of that resource, for example, average CPU utilization across a group of VMs. For more information about metrics, see Metrics, time series, and resources.

All conditions watch for three things: some metric, behaving in some way, for some period of time.

All conditions are implemented as one of two general types: a metric absence condition or a metric threshold condition.

Metric absence condition

A metric absence condition triggers if any time series in the metric has no data for a specific duration window.

Metric absence conditions require at least one successful measurement — one that retrieves data — within the maximum duration window after the policy was installed or modified. The maximum configurable duration window is 24 hours if you use the Google Cloud Console and 24.5 hours if you use the Cloud Monitoring API.

For example, suppose you set the duration window in a metric-absence policy to 30 minutes. The condition isn't met if the subsystem that writes metric data has never written a data point. The subsystem needs to output at least one data point and then fail to output additional data points for 30 minutes.
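The 30-minute example can be modeled with a short sketch. The function name `absence_condition_met` and its arguments are hypothetical, and the code is a simplified model of the behavior described above for a single time series, not Cloud Monitoring's implementation. It ignores the additional requirement that the first successful measurement fall within the maximum duration window after the policy is installed or modified.

```python
from datetime import datetime, timedelta


def absence_condition_met(point_times, now, duration):
    """Simplified model of a metric absence condition for one time series.

    point_times: datetimes at which data points were written (may be empty).
    now: the evaluation time.
    duration: the configured duration window as a timedelta.
    """
    # A series that has never written a data point cannot meet the condition.
    if not point_times:
        return False
    # The condition is met once the most recent point is at least
    # `duration` old, that is, no data arrived during the window.
    return now - max(point_times) >= duration


window = timedelta(minutes=30)
now = datetime(2021, 6, 1, 12, 0)

never_reported = absence_condition_met([], now, window)
went_silent = absence_condition_met([now - timedelta(minutes=45)], now, window)
still_reporting = absence_condition_met([now - timedelta(minutes=5)], now, window)
```

Under this model, only the series that wrote data and then stayed silent for the whole window (`went_silent`) meets the condition.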

If you are using the Monitoring API, you can create a condition that triggers only when all of the time series are absent. To do so, use aggregations to combine the time series into a single time series; see MetricAbsence in the API documentation.
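As a rough illustration, the following sketch builds such a condition as a Python dictionary and serializes it to JSON. The field names follow the `conditionAbsent` structure of the API's alerting policy representation; the display name, metric type, resource type, and durations are example values chosen for illustration, not taken from this page.

```python
import json

# Example values only: the display name, filter, and durations are
# placeholders chosen for illustration.
absence_condition = {
    "displayName": "No CPU utilization data from any VM",
    "conditionAbsent": {
        "filter": (
            'metric.type="compute.googleapis.com/instance/cpu/utilization" '
            'AND resource.type="gce_instance"'
        ),
        "duration": "1800s",  # 30-minute duration window
        "aggregations": [
            {
                "alignmentPeriod": "60s",
                "perSeriesAligner": "ALIGN_MEAN",
                # With no group-by fields, the reducer combines every
                # matching time series into one, so data is absent only
                # when all of the series stop reporting.
                "crossSeriesReducer": "REDUCE_MEAN",
            }
        ],
    },
}

condition_json = json.dumps(absence_condition, indent=2)
```

Without the `crossSeriesReducer`, each matching time series would be evaluated on its own, which is the default "any time series" behavior described earlier.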

Metric threshold condition

A metric threshold condition triggers if a metric rises above or falls below a value for a specific duration window.
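A minimal sketch of how such a condition might be expressed for the API is shown below, built as a Python dictionary and serialized to JSON. The helper `threshold_condition` is hypothetical, and the metric type, threshold, and durations are example values; the field names follow the `conditionThreshold` structure of the API's alerting policy representation.

```python
import json


def threshold_condition(display_name, metric_filter, comparison, threshold, duration_s):
    """Build a conditionThreshold entry as a plain dict (hypothetical helper)."""
    return {
        "displayName": display_name,
        "conditionThreshold": {
            "filter": metric_filter,
            # "COMPARISON_GT" for rising above, "COMPARISON_LT" for falling below.
            "comparison": comparison,
            "thresholdValue": threshold,
            # Duration window, written as a number of seconds.
            "duration": f"{duration_s}s",
            "aggregations": [
                {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
            ],
        },
    }


# Example policy: CPU utilization above 80% for 5 minutes.
policy = {
    "displayName": "High CPU utilization",
    "combiner": "OR",
    "conditions": [
        threshold_condition(
            "CPU utilization above 80%",
            'metric.type="compute.googleapis.com/instance/cpu/utilization" '
            'AND resource.type="gce_instance"',
            "COMPARISON_GT",
            0.8,
            300,
        )
    ],
}

policy_json = json.dumps(policy, indent=2)
```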

Within the class of metric-threshold conditions, there are patterns that fall into general sub-categories:

  • Metric rate (percent) of change: Triggers if a metric increases or decreases by a specific percent or more during a duration window.

    In this type of condition, a percent-of-change computation is applied to the time series before comparison to the threshold.

    The condition averages the values of the metric from the past 10 minutes, then compares the result with the 10-minute average that was measured just before the duration window. The 10-minute lookback window used by a metric rate of change condition is a fixed value; you can't change it. However, you do specify the duration window when you create a condition.

  • Group-aggregate threshold: Triggers if a metric measured across a resource group crosses a threshold.

  • Uptime-check health: Triggers if you've created an uptime check and the resource fails to successfully respond to a request sent from at least two geographic locations.

    The results of uptime checks are displayed in multiple places. In the Google Cloud Console, go to Monitoring and then select either Overview or Uptime Checks. Both windows list the uptime checks for the project, and they list the check status. To view details for a particular uptime check, select its name from the list. By creating an alerting policy on an uptime check, you can have uptime checks that indirectly open incidents and optionally send notifications when they fail.

  • Process health: These conditions count the number of processes running on a VM instance or on an instance group that match a naming convention. The condition triggers when this count rises above or falls below a specific number during a duration window.

    This condition type requires the Monitoring Agent to be running on the monitored resources.

  • Metric ratio: Triggers if the ratio of two metrics exceeds a threshold for a duration. This is a threshold condition using two related metrics, for example, the ratio of HTTP error responses to all HTTP responses.

    The metrics being compared must have the same MetricKind. For example, you can create a ratio-based alerting policy if both metrics are gauge metrics. For a list of metrics and their properties, see Metric lists.
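The arithmetic behind the rate-of-change and ratio sub-categories can be sketched as follows. The function names, sample values, and thresholds are hypothetical, and each call models a single evaluation; the real conditions must also hold for the configured duration window.

```python
from statistics import mean


def percent_change(previous_window, current_window):
    """Percent change between the averages of two lookback windows.

    Each argument holds the metric values from one 10-minute window.
    Assumes the previous average is non-zero.
    """
    before = mean(previous_window)
    after = mean(current_window)
    return (after - before) / before * 100.0


def metric_ratio(numerator, denominator):
    """Ratio of two related metrics, for example errors to all responses."""
    return numerator / denominator


# Rate of change: the average rises from 50 to 75, a 50% increase.
previous = [40.0, 50.0, 60.0]
current = [70.0, 75.0, 80.0]
change = percent_change(previous, current)
rate_condition_met = change >= 30.0  # example threshold: 30% increase

# Ratio: 25 HTTP error responses out of 500 responses in total.
error_ratio = metric_ratio(25, 500)
ratio_condition_met = error_ratio > 0.02  # example threshold: 2% errors
```

In both cases the derived value, not the raw metric, is what gets compared with the threshold.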

Examples

Examples of each of these types are available:

Condition type      JSON example
Metric threshold    View
Rate of change      View
Group aggregate     View
Uptime check        View
Process health      View
Metric ratio        View

What's next