This document describes the different types of metric-based alerting policies and provides JSON examples for them. Alerting policies define conditions that watch for three things: a particular metric, behaving in a particular way, for a particular period of time. For example, an alerting policy might trigger when the value of a metric rises above a threshold, or when the value changes too quickly.
This content does not apply to log-based alerting policies. For information about log-based alerting policies, which notify you when a particular message appears in your logs, see Monitoring your logs.
For information about creating alerting policies, see the following documents:
- Alerting policies with Monitoring Query Language (MQL)
- Alerting policies in the Cloud Monitoring API
- Create metric-based alert policies with the Google Cloud console
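The JSON examples in this document show the condition portion of an alerting policy as it appears in the Cloud Monitoring API. For context, the following is a minimal sketch of the overall AlertPolicy structure; the display name, filter, threshold, and the PROJECT_ID and CHANNEL_ID placeholders are illustrative:

```json
{
  "displayName": "Example alerting policy",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Example condition",
      "conditionThreshold": {
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "300s"
      }
    }
  ],
  "notificationChannels": [
    "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
  ]
}
```

Each of the condition sketches in the following sections is an entry in the conditions array of a policy like this one.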
Metric-absence condition
A metric-absence condition triggers when a monitored time series has no data for a specific duration window.
Metric-absence conditions require at least one successful measurement — one that retrieves data — within the maximum duration window after the policy was installed or modified. The maximum configurable duration window is 24 hours if you use the Google Cloud console and 24.5 hours if you use the Cloud Monitoring API.
For example, suppose you set the duration window in a metric-absence policy to 30 minutes. The condition won't trigger when the subsystem that writes metric data has never written a data point. The subsystem needs to output at least one data point and then fail to output additional data points for 30 minutes.
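For reference, a metric-absence condition with this 30-minute duration window might look like the following sketch in the Cloud Monitoring API; the CPU-utilization filter and aggregation settings are illustrative:

```json
{
  "displayName": "No CPU utilization data for 30 minutes",
  "conditionAbsent": {
    "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
    "duration": "1800s",
    "aggregations": [
      {
        "alignmentPeriod": "300s",
        "perSeriesAligner": "ALIGN_MEAN"
      }
    ],
    "trigger": {
      "count": 1
    }
  }
}
```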
Metric-threshold condition
A metric-threshold condition triggers when the values of a metric are more than, or less than, the threshold for a specific duration window. For example, a metric-threshold condition might trigger when the CPU utilization is higher than 80% for at least 5 minutes.
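As a sketch, the CPU example might be expressed in the Cloud Monitoring API as the following condition; the display name and aggregation settings are illustrative. Because compute.googleapis.com/instance/cpu/utilization reports a fraction, 80% corresponds to a thresholdValue of 0.8:

```json
{
  "displayName": "CPU utilization above 80% for 5 minutes",
  "conditionThreshold": {
    "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0.8,
    "duration": "300s",
    "aggregations": [
      {
        "alignmentPeriod": "60s",
        "perSeriesAligner": "ALIGN_MEAN"
      }
    ]
  }
}
```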
Within the class of metric-threshold conditions, there are several general sub-categories:
Rate-of-change conditions trigger when the values in a time series increase or decrease by a specific percent or more during a duration window.
When you create this type of condition, a percent-of-change computation is applied to the time series before comparison to the threshold.
The condition averages the values of the metric over the most recent 10 minutes, then compares the result with the 10-minute average that was measured just before the duration window. The 10-minute lookback window used by a rate-of-change condition is fixed; you can't change it. However, you do specify the duration window when you create the condition.
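In the API, a rate-of-change condition is a metric-threshold condition whose aggregation uses the ALIGN_PERCENT_CHANGE aligner, which performs the percent-of-change computation described above. The following is a sketch; the metric, 50% threshold, and duration are illustrative:

```json
{
  "displayName": "CPU utilization increases by more than 50%",
  "conditionThreshold": {
    "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 50,
    "duration": "180s",
    "aggregations": [
      {
        "alignmentPeriod": "60s",
        "perSeriesAligner": "ALIGN_PERCENT_CHANGE"
      }
    ]
  }
}
```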
Group-aggregate conditions trigger when a metric measured across a resource group crosses a threshold for a duration window.
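A group-aggregate condition typically restricts the filter to a resource group and adds a crossSeriesReducer so that the members of the group are evaluated as a single time series. A sketch, with GROUP_ID as a placeholder and an illustrative metric and threshold:

```json
{
  "displayName": "Mean CPU utilization across the group above 80%",
  "conditionThreshold": {
    "filter": "group.id=\"GROUP_ID\" AND metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0.8,
    "duration": "300s",
    "aggregations": [
      {
        "alignmentPeriod": "60s",
        "perSeriesAligner": "ALIGN_MEAN",
        "crossSeriesReducer": "REDUCE_MEAN"
      }
    ]
  }
}
```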
Uptime-check conditions trigger when an uptime check fails to get a successful response to requests sent from at least two geographic locations.
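An uptime-check condition is commonly written as a threshold condition on the monitoring.googleapis.com/uptime_check/check_passed metric, filtered to a specific check. The following is a sketch in which CHECK_ID is a placeholder and the durations are illustrative; REDUCE_COUNT_FALSE counts the checker locations that report a failure, so a threshold of 1 with COMPARISON_GT corresponds to failures from at least two locations:

```json
{
  "displayName": "Uptime check failure",
  "conditionThreshold": {
    "filter": "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\" AND metric.label.check_id=\"CHECK_ID\" AND resource.type=\"uptime_url\"",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 1,
    "duration": "600s",
    "aggregations": [
      {
        "alignmentPeriod": "1200s",
        "perSeriesAligner": "ALIGN_NEXT_OLDER",
        "crossSeriesReducer": "REDUCE_COUNT_FALSE",
        "groupByFields": ["resource.label.host"]
      }
    ]
  }
}
```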
Process-health conditions trigger when the number of processes running on a VM instance is more than, or less than, a threshold. You can also configure these conditions to monitor a group of instances that match a naming convention.
This condition type requires the Ops Agent or the Monitoring agent to be running on the monitored resources. For more information about the agents, see Google Cloud Operations suite agents.
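Process-health conditions are expressed with the select_process_count Monitoring filter function, which counts the processes whose command line matches a pattern. The following is a sketch, assuming an nginx process on Compute Engine VM instances; the pattern, threshold, and duration are illustrative:

```json
{
  "displayName": "Fewer than one nginx process for 5 minutes",
  "conditionThreshold": {
    "filter": "select_process_count(\"has_substring(\\\"nginx\\\")\") AND resource.type=\"gce_instance\"",
    "comparison": "COMPARISON_LT",
    "thresholdValue": 1,
    "duration": "300s",
    "trigger": {
      "count": 1
    }
  }
}
```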
Metric-ratio conditions trigger when the ratio of two metrics exceeds a threshold for a duration window. These conditions compute the ratio of two metrics, for example, the ratio of HTTP error responses to all HTTP responses.
For more information about ratio-based policies, see Conditions for alerting on ratios.
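In the API, a ratio is a metric-threshold condition whose MetricThreshold sets a denominatorFilter, and usually denominatorAggregations, in addition to the regular filter. The following sketch computes the ratio of HTTP 5xx responses to all responses for an external HTTP(S) load balancer; the metric, labels, and 5% threshold are illustrative:

```json
{
  "displayName": "HTTP 5xx responses exceed 5% of all responses",
  "conditionThreshold": {
    "filter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\" AND metric.label.response_code_class=\"500\" AND resource.type=\"https_lb_rule\"",
    "denominatorFilter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\" AND resource.type=\"https_lb_rule\"",
    "aggregations": [
      {
        "alignmentPeriod": "300s",
        "perSeriesAligner": "ALIGN_DELTA",
        "crossSeriesReducer": "REDUCE_SUM"
      }
    ],
    "denominatorAggregations": [
      {
        "alignmentPeriod": "300s",
        "perSeriesAligner": "ALIGN_DELTA",
        "crossSeriesReducer": "REDUCE_SUM"
      }
    ],
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0.05,
    "duration": "300s"
  }
}
```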
Forecast condition
A forecast condition triggers when it generates a forecast that the threshold will be violated within the upcoming forecast window. A forecast predicts whether or not a time series will violate a threshold within a forecast window, which is a time period in the future. The forecast window can range from 1 hour (3,600 seconds) to 7 days (604,800 seconds).
You can use forecasting when monitoring most metrics. Forecast conditions are particularly useful when you monitor a constrained resource, such as quota, disk space, or memory: the condition can notify you before the threshold is violated, which gives you more time to respond to how the constrained resource is being consumed.
For each time series that a forecast condition monitors, the condition instantiates a decision algorithm. After that algorithm is trained, it generates a forecast each time the condition is evaluated. Each forecast is a prediction about whether its time series will violate the threshold within the forecast window. If a monitored time series has a regular periodicity, then the decision algorithm for that time series incorporates the periodic behavior into its forecasts.
A forecast condition can trigger when either, or both, of the following occur:
- All values of a time series during a specific duration window violate the threshold.
- All forecasts for a specific time series that are made in a duration window predict that the time series will violate the threshold within the forecast window.
The initial training time for a decision algorithm is twice the length of the forecast window. For example, if the forecast window is one hour, then two hours of training time are required. The decision algorithm for each time series is trained independently. While a decision algorithm is being trained, its time series can trigger the condition only when the values of the time series violate the threshold for the specified duration window.
After the initial training is completed, each decision algorithm is continually trained using data that spans up to six times the length of the forecast window. For example, when the forecast window is one hour, the most recent six hours of data are used during continual training.
If you configure a forecast condition and data then stops arriving for more than 10 minutes, forecasting is disabled and the condition operates as a metric-threshold condition.
Incidents for forecast conditions are created and managed in the same way as for metric-threshold and metric-absence conditions. An incident is automatically closed when the forecast predicts that the time series won't violate the threshold within the forecast window.
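In the API, a forecast condition is a metric-threshold condition that also sets forecastOptions; the forecastHorizon value is the forecast window. The following is a minimal sketch with a one-hour forecast window; the disk-utilization metric, 90% threshold, and aggregation settings are illustrative:

```json
{
  "displayName": "Disk utilization forecast to exceed 90% within 1 hour",
  "conditionThreshold": {
    "filter": "metric.type=\"agent.googleapis.com/disk/percent_used\" AND metric.label.state=\"used\" AND resource.type=\"gce_instance\"",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 90,
    "duration": "300s",
    "forecastOptions": {
      "forecastHorizon": "3600s"
    },
    "aggregations": [
      {
        "alignmentPeriod": "60s",
        "perSeriesAligner": "ALIGN_MEAN"
      }
    ]
  }
}
```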
Restrictions
- You must configure the condition by using Monitoring filters. If you use the menu-driven interface of the Google Cloud console, your selections are converted into a Monitoring filter.
- You must configure an alerting policy by using the Cloud Monitoring API when you want to monitor a ratio of metrics. For more information, see Alerting policies in the Cloud Monitoring API and Metric ratio.
- You can't configure the condition by using Monitoring Query Language or PromQL.
- All metrics that have a value type of double or int64 are supported, except those from Amazon VM instances.
Examples
Examples of each of these policy types are available:
| Condition type | JSON example | Google Cloud console |
|---|---|---|
| Metric threshold | View | Instructions |
| Rate of change | View | Instructions |
| Group aggregate | View | Instructions |
| Uptime check | View | Instructions |
| Process health | View | Instructions |
| Metric ratio | View | Instructions |
| Forecast | View | Instructions |
What's next
- To understand variables that impact alerting, see Behavior of metric-based alerting policies.
- For an assortment of alerting policies, see Sample policies.