Types of alerting policies

An alerting policy defines conditions, and these conditions are built on metrics. An alerting policy condition can monitor, for example, whether a metric reaches a value or whether a metric starts to change quickly. Each metric is associated with a resource and measures some characteristic of that resource, for example, average CPU utilization across a group of VMs. For more information about metrics, see Metrics, time series, and resources.

All conditions watch for three things: some metric, behaving in some way, for some period of time.

All conditions are implemented as one of two general types: a metric absence condition or a metric threshold condition.

Metric absence condition

A metric absence condition triggers if any time series in the metric has no data for a specific duration window. The duration window is the length of time a condition must evaluate as true before an incident is created.

Metric absence conditions require at least one successful measurement — one that retrieves data — since the policy was installed or within the maximum duration window (24 hours).

For example, suppose you set the duration window in a metric-absence policy to 30 minutes. The condition isn't met if the subsystem that writes metric data has never written a data point. The subsystem needs to output at least one data point and then fail to output additional data points for 30 minutes.
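The 30-minute example above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not how Cloud Monitoring evaluates conditions internally. The function name `absence_condition_met` and its parameters are hypothetical, and the sketch ignores the policy installation time and the 24-hour maximum.

```python
from datetime import datetime, timedelta
from typing import Optional


def absence_condition_met(
    last_point_time: Optional[datetime],
    now: datetime,
    duration: timedelta,
) -> bool:
    """Return True if a time series has been silent for at least `duration`.

    `last_point_time` is None when the series has never reported data.
    In that case the condition cannot be met, because at least one
    successful measurement is required first.
    """
    if last_point_time is None:
        return False
    return now - last_point_time >= duration


now = datetime(2024, 1, 1, 12, 0)
window = timedelta(minutes=30)

# Never wrote a data point: the condition is not met.
never_written = absence_condition_met(None, now, window)
# Wrote a point 45 minutes ago, then went silent: the condition is met.
silent_45_min = absence_condition_met(now - timedelta(minutes=45), now, window)
# Wrote a point 10 minutes ago: still inside the window, not met.
silent_10_min = absence_condition_met(now - timedelta(minutes=10), now, window)

print(never_written, silent_45_min, silent_10_min)
```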

Metric threshold condition

A metric threshold condition triggers if a metric rises above or falls below a value for a specific duration window.
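As a rough sketch of this behavior, the following Python function checks whether every sample in the most recent duration window violates the threshold. It assumes evenly spaced samples and expresses the duration window as a number of samples. The name `threshold_condition_met` and its parameters are hypothetical and aren't part of any Monitoring API.

```python
from typing import Sequence


def threshold_condition_met(
    values: Sequence[float],
    threshold: float,
    window_points: int,
    above: bool = True,
) -> bool:
    """Return True if the last `window_points` samples all violate the threshold.

    With above=True a sample violates the threshold when it is greater than
    it; with above=False, when it is less than it. If fewer samples exist
    than the window requires, the condition is not met.
    """
    if window_points <= 0 or len(values) < window_points:
        return False
    recent = values[-window_points:]
    if above:
        return all(v > threshold for v in recent)
    return all(v < threshold for v in recent)


# CPU utilization samples, oldest first; the window covers 3 samples.
sustained = threshold_condition_met([0.55, 0.72, 0.91, 0.93, 0.95], 0.9, 3)
# A dip to 0.60 inside the window means the violation was not sustained.
interrupted = threshold_condition_met([0.55, 0.95, 0.60, 0.92, 0.94], 0.9, 3)

print(sustained, interrupted)
```

A single sample that returns to the normal range inside the window keeps the condition from being met, which is why a longer duration window makes a policy less sensitive to short spikes.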

Within the class of metric-threshold conditions, there are patterns that fall into general sub-categories:

  • Metric rate (percent) of change: Triggers if a metric increases or decreases by a specific percent or more during a duration window.

    In this type of condition, a percent-of-change computation is applied to the time series before comparison to the threshold.

    The condition averages the values of the metric from the past 10 minutes, then compares the result with the 10-minute average that was measured just before the duration window. The 10-minute lookback window used by a metric rate of change condition is a fixed value; you can't change it. However, you do specify the duration window when you create a condition.

  • Group-aggregate threshold: Triggers if a metric measured across a resource group crosses a threshold.

  • Uptime-check health: Triggers if you've created an uptime check and the resource fails to successfully respond to a request sent from at least two geographic locations.

    The results of uptime checks are displayed in multiple places. In the Google Cloud Console, go to Monitoring and then select either Overview or Uptime Checks. Both windows list the uptime checks for the project, and they list the check status. To view details for a particular uptime check, select its name from the list. By creating an alerting policy on an uptime check, you can have uptime checks that indirectly open incidents and optionally send notifications when they fail.

  • Process health: These conditions count the number of processes running on a VM instance or on an instance group that match a naming convention. The condition triggers when this count rises above or falls below a specific number during a duration window.

    This condition type requires the Monitoring Agent to be running on the monitored resources.

  • Metric ratio: Triggers if the ratio of two metrics exceeds a threshold for a duration. This is a threshold condition using two related metrics, for example, the ratio of HTTP error responses to all HTTP responses.

    The metrics being compared must have the same MetricKind. For example, you can create a ratio-based alerting policy if both metrics are gauge metrics. For a list of metrics and their properties, see Metric lists.
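The arithmetic behind two of these sub-categories, metric rate of change and metric ratio, can be illustrated with a short Python sketch. The percent-change formula used here, (recent average - earlier average) / earlier average × 100, is an assumption about how the two 10-minute averages are compared; the text above doesn't state the exact formula. The function names and sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def percent_change(earlier: Sequence[float], recent: Sequence[float]) -> float:
    """Percent change between the averages of two lookback windows.

    `earlier` holds the samples from the 10 minutes just before the duration
    window, and `recent` holds the samples from the past 10 minutes.
    """
    baseline = mean(earlier)
    if baseline == 0:
        raise ValueError("baseline average is zero; percent change is undefined")
    return (mean(recent) - baseline) / baseline * 100.0


def metric_ratio(numerator: float, denominator: float) -> float:
    """Ratio of two related metrics, for example HTTP errors to all responses."""
    if denominator == 0:
        return 0.0  # assumption: treat "no traffic" as a ratio of zero
    return numerator / denominator


# Rate of change: the average moved from 100 to 150 requests per second.
change = percent_change([100.0, 100.0, 100.0], [150.0, 150.0, 150.0])
rate_condition_met = change >= 30.0  # hypothetical threshold: 30 percent

# Ratio: 25 error responses out of 500 total responses.
ratio = metric_ratio(25, 500)
ratio_condition_met = ratio > 0.02  # hypothetical threshold: 2 percent

print(change, rate_condition_met, ratio_condition_met)
```

In both cases the derived value, not the raw metric, is what gets compared with the threshold.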

Examples

Examples of each of these types are available:

Condition type      JSON example
Metric threshold    View
Rate of change      View
Group aggregate     View
Uptime check        View
Process health      View
Metric ratio        View

What's next