This page provides a detailed conceptual overview of Stackdriver Monitoring alerting policies. Understanding this material will help you to use these alerting policies effectively.
For information on creating and managing alerting policies, see:
- Managing alerting policies by UI for information about using the graphical user interface.
- Managing alerting policies by API for information about using the command line and API.
An alerting policy defines conditions under which a service is considered unhealthy. When the conditions are met, the policy triggers and opens a new incident. The triggered policy also sends notifications if you have configured it to do so.
A policy belongs to an individual Workspace, and each Workspace can contain up to 500 policies.
A condition determines when an alerting policy triggers. To describe a condition, you specify:
- A metric to measure.
- A test for determining when that metric reaches a state you want to know about.
Depending on what you want to monitor, you can create a condition that refers to the following:
Resources to monitor, such as individual VM instances or instance groups, third-party applications, databases, load balancers, URL endpoints, and so on.
Predefined metrics that measure resource performance.
Types of conditions
Conditions are built on metrics and can monitor, for example, if a metric reaches a value, or if a metric starts to change quickly. Metrics are associated with resources and measure some characteristic of that resource, for example, average CPU utilization across a group of VMs. For more information about metrics, see Metrics, time series, and resources.
All conditions watch for the same three things: some metric behaving in some way for some period of time.
All conditions are implemented as one of two general types:
A metric absence condition, which triggers if any time series in the metric has no data for a specific duration window.
Metric absence conditions require at least one successful measurement (one that retrieves data) since the policy was installed or within the maximum duration window (24 hours).
For example, suppose you set the duration window in a metric-absence policy to 30 minutes. The condition isn't met if the subsystem that writes metric data has never written a data point. The subsystem needs to output at least one data point and then fail to output additional data points for 30 minutes.
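A metric-absence condition like the one above can be sketched as a plain dictionary whose shape follows the Monitoring API's `conditionAbsent` structure. The display name and filter string here are illustrative placeholders, not values from this document:

```python
# Sketch of a metric-absence condition as it might appear in an
# alerting policy sent to the Monitoring API. The display name and
# filter are illustrative placeholders.
absence_condition = {
    "displayName": "Metric absence on instance CPU",
    "conditionAbsent": {
        # Which time series to watch; this filter is an example only.
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        # Duration window: no data for 30 minutes (1800 seconds).
        "duration": "1800s",
    },
}
```

With this condition, the policy triggers only after at least one data point has been written and then no further points arrive for 30 minutes.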
A metric threshold condition, which triggers if a metric rises above or falls below a value for a specific duration window.
Within the class of metric-threshold conditions, there are patterns that fall into general sub-categories:
Metric rate (percent) of change: Triggers if a metric increases or decreases by a specific percent or more during a duration window.
In this type of condition, a percent-of-change computation is applied to the time series before comparison to the threshold.
The condition averages the values of the metric from the past 10 minutes, then compares the result with the 10-minute average that was measured just before the duration window. The 10-minute lookback window used by a metric rate of change condition is a fixed value; you can't change it. However, you do specify the duration window when you create a condition.
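The comparison described above can be modeled in a few lines. This is an illustrative sketch of the computation, not the service's actual implementation; the 10-minute lookback is the fixed value documented above:

```python
def percent_of_change(values, lookback=10):
    """Compare the average of the most recent `lookback` per-minute
    samples with the average of the `lookback` samples immediately
    before them. `values` is oldest-first. Returns the percent change
    of the recent window relative to the prior one."""
    if len(values) < 2 * lookback:
        raise ValueError("need at least two full windows of data")
    prior = values[-2 * lookback:-lookback]
    recent = values[-lookback:]
    prior_avg = sum(prior) / lookback
    recent_avg = sum(recent) / lookback
    return 100.0 * (recent_avg - prior_avg) / prior_avg

# A 50% jump: the prior window averages 10, the recent window averages 15.
change = percent_of_change([10.0] * 10 + [15.0] * 10)  # 50.0
```

The condition would then compare `change` against the percent threshold you configured, for each evaluation point in the duration window.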
Group-aggregate threshold: Triggers if a metric measured across a resource group crosses a threshold.
Uptime-check health: Triggers if you've created an uptime check and the resource fails to successfully respond to a request sent from at least two geographic locations.
The results of uptime checks appear only on the Uptime Checks Overview page of the Stackdriver Monitoring console. By creating an alerting policy on an uptime check, you can have uptime checks that indirectly open incidents and optionally send notifications when they fail.
Process health: Triggers if the number of processes (1) running on a VM instance or instance group and (2) matching some string, rises above or falls below a specific number during a duration window.
This condition type requires the Monitoring Agent to be running on the monitored resources.
Metric ratio: Triggers if the ratio of two metrics exceeds a threshold for a duration. This is a threshold condition using two related metrics, for example, the ratio of HTTP error responses to all HTTP responses. You can't create this kind of policy in the UI; you can create ratio-based policies by using the API. See Metric ratio for a sample policy.
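A ratio condition can be sketched as a threshold condition that carries both a numerator filter and a denominator filter, following the Monitoring API's `conditionThreshold` structure. The filters, metric type, and threshold below are illustrative placeholders:

```python
# Sketch of a ratio-based threshold condition. In the Monitoring API a
# ratio is a threshold condition with a numerator filter and a
# denominator filter; the filters and values here are examples only.
ratio_condition = {
    "displayName": "HTTP error-response ratio",
    "conditionThreshold": {
        # Numerator: error responses only (illustrative filter).
        "filter": ('metric.type = "appengine.googleapis.com/http/server/response_count" '
                   'AND metric.labels.response_code >= 500'),
        # Denominator: all responses.
        "denominatorFilter": 'metric.type = "appengine.googleapis.com/http/server/response_count"',
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.05,   # trigger above a 5% error ratio
        "duration": "300s",       # for five consecutive minutes
    },
}
```

Because there is no UI support for this condition type, a dictionary like this would be part of a policy created through the API.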
Examples of each of these types are available:
| Condition type | JSON example |
| --- | --- |
| Rate of change | View |
A policy can contain up to 6 conditions. If you are using multiple conditions, there are three options for specifying the combinations that violate a policy:
OR: any condition is met.
AND: each condition is violated by at least one resource, even if a different resource violates each condition.
AND_WITH_RESOURCES: all conditions are violated by the same resource. (This option is available only to conditions created using the Monitoring API.)
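The three combining options can be modeled as follows. The function, condition names, and resource names are illustrative, not part of the API; `violations` maps each condition to the set of resources currently violating it:

```python
# Illustrative model of how the three combiner options evaluate a
# multi-condition policy. Names here are made up for the example.
def policy_triggers(violations, combiner):
    if combiner == "OR":
        # Any condition violated by any resource.
        return any(violations.values())
    if combiner == "AND":
        # Every condition violated, possibly by different resources.
        return all(violations.values())
    if combiner == "AND_WITH_RESOURCES":
        # Every condition violated by at least one common resource.
        common = set.intersection(*violations.values()) if violations else set()
        return bool(common)
    raise ValueError(f"unknown combiner: {combiner}")

v = {"high_cpu": {"vm-1", "vm-2"}, "high_latency": {"vm-2"}}
# OR and AND both trigger; AND_WITH_RESOURCES also triggers,
# because vm-2 violates both conditions.
```

If instead each condition were violated only by a different resource, AND would still trigger but AND_WITH_RESOURCES would not.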
A condition includes a duration window, the length of time a condition must evaluate as true before triggering. Duration windows on conditions keep the alerting policy from overreacting. You want to reduce false positives, because an alerting policy that sends out a steady stream of notifications will eventually be ignored.
As a rule of thumb, the more highly available the service or the bigger the penalty for not detecting issues, the shorter the duration window you may want to specify.
Because of normal fluctuations in performance, you don't want a policy to trigger if a single measurement matches a condition. Instead, you usually want several consecutive measurements to meet a condition before you consider your application to be in an unhealthy state.
A condition resets its duration window each time a measurement does not satisfy the condition.
The following condition specifies a five-minute duration window:
HTTP latency above two seconds for five consecutive minutes.
The following sequence illustrates how the duration window affects the evaluation of the condition:
- For three consecutive minutes, HTTP latency is above two seconds.
- In the next measurement, latency falls below two seconds, so the condition resets the duration window.
- For the next five consecutive minutes, HTTP latency is above two seconds, so the policy is triggered.
Duration windows should be long enough to reduce false positives, but short enough to ensure that incidents are opened in a timely manner.
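The reset behavior in the sequence above can be simulated in a few lines. The function, threshold, and latency trace are illustrative, not how the service is implemented:

```python
# Sketch of duration-window evaluation: the condition must hold for
# `window` consecutive per-minute measurements before the policy
# triggers, and any non-matching measurement resets the count.
def first_trigger_minute(latencies, threshold=2.0, window=5):
    """Return the 1-based minute at which the policy triggers,
    or None if it never does."""
    consecutive = 0
    for minute, latency in enumerate(latencies, start=1):
        if latency > threshold:
            consecutive += 1
            if consecutive >= window:
                return minute
        else:
            consecutive = 0  # measurement fell back below: reset the window
    return None

# Three high minutes, one low minute (reset), then five high minutes:
trace = [3, 3, 3, 1, 3, 3, 3, 3, 3]
minute = first_trigger_minute(trace)  # triggers at minute 9
```

This mirrors the sequence above: the dip at minute 4 resets the count, so the policy triggers only after five more consecutive high measurements.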
Alerting policies exist in dynamic and complex environments, so using them well requires an understanding of some of the variables that can affect their behavior. The metrics and resources monitored by conditions, the duration windows for conditions, and the notification channels can each have an effect.
This section provides some additional information to help you understand the behavior of your alerting policies.
Disabled alerting policies
Alerting policies can be temporarily paused and restarted by disabling and enabling the policy. For example, if you have an alerting policy that notifies you when a process is down for more than 5 minutes, you can disable the alerting policy when you take the process down for upgrade or other maintenance.
Disabling an alerting policy prevents the policy from triggering (or resolving) incidents, but it doesn't stop Stackdriver from evaluating the policy conditions and recording the results.
Suppose the monitored process is down for 20 minutes for maintenance. If you restart the process and immediately re-enable the alerting policy, the policy sees that the process wasn't up during the last 5 minutes and opens an incident.
When a disabled policy is re-enabled, Stackdriver examines the values of all conditions over the most recent duration window, which might include data taken before, during, and after the paused interval. Policies can trigger immediately after resuming them, even with large duration windows.
Alerting policies, particularly those with metric-absence or “less than” threshold conditions, can appear on the Alerting > Incidents page to have triggered prematurely or incorrectly.
This occurs when there is a gap in the data, but it isn't always easy to identify such gaps. Sometimes the gap is obscured, and sometimes it is corrected.
In charts, for example, gaps in the data are interpolated. Several minutes of data may be missing, but the chart connects missing points for visual contiguity. Such a gap in the underlying data may be enough to trigger an alerting policy.
Points in logs-based metrics can arrive late and be backfilled, for up to 10 minutes in the past. This effectively corrects the gap; the gap is filled in when the data finally arrives. Thus, a gap in a logs-based metric that can no longer be seen could have caused an alerting policy to trigger.
Metric-absence and “less than” threshold conditions are evaluated in real time, with a small query delay. The status of the condition can change between the time it is evaluated and the time the corresponding incident is visible in the Stackdriver Monitoring console.
Partial metric data
If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions.
- A condition specifies HTTP latency of two seconds or higher for five consecutive minutes.
- For three consecutive minutes, HTTP latency is three seconds.
- For two consecutive minutes, there are no HTTP requests. In this case, a condition carries forward the last measurement (three seconds) for these two minutes.
- After a total of five minutes the policy triggers, even though there has been no data for the last two minutes.
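The carry-forward behavior in this example can be sketched as follows; the function, threshold, and samples are illustrative (`None` stands in for a minute with no measurements):

```python
# Sketch of the carry-forward behavior: when a measurement is missing
# (None), the condition reuses the last recorded value.
def evaluate_with_carry_forward(samples, threshold=2.0, window=5):
    """Return True if the condition (value >= threshold) holds for
    `window` consecutive minutes, filling gaps with the last value."""
    last = None
    consecutive = 0
    for value in samples:
        if value is None:
            value = last  # no new data: carry the last measurement forward
        if value is not None and value >= threshold:
            consecutive += 1
            if consecutive >= window:
                return True
        else:
            consecutive = 0
        last = value
    return False

# Three minutes at 3s latency, then two minutes with no requests:
triggered = evaluate_with_carry_forward([3, 3, 3, None, None])  # True
```

The two missing minutes inherit the last recorded value of three seconds, so the condition holds for five consecutive minutes and the policy triggers.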
Missing or delayed metric data can result in policies not alerting and incidents not closing. Delays in data arriving from third-party cloud providers can be as high as 30 minutes, with 5-15 minute delays being the most common. A lengthy delay, longer than the duration window, can cause conditions to enter an "unknown" state. When the data finally arrives, Stackdriver Monitoring might have lost some of the recent history of the conditions. Later inspection of the time-series data might not reveal this problem because there is no evidence of delays once the data arrives.
You can minimize these problems by doing any of the following:
- Contact your third-party cloud provider to see if there is a way to reduce metric collection latency.
- Use longer duration windows in your conditions. This has the disadvantage of making your alerting policies less responsive.
- Choose metrics that have a lower collection delay:
  - Monitoring agent metrics, especially when the agent is running on VM instances in third-party clouds.
  - Custom metrics, when you write their data directly to Stackdriver Monitoring.
  - Logs-based metrics, if logs collection is not delayed.
Incidents per policy
An alerting policy can apply to many resources, and a problem affecting all resources can trigger the policy and open incidents for each resource. To prevent overloading the system, the number of incidents that a single policy can open simultaneously is limited to 5000.
For example, if a policy applies to 2000 (or 20,000) Compute Engine instances, and something causes each to violate the alerting conditions, then only 5000 incidents will be opened. The remaining violations are ignored until some of the open incidents for that policy resolve.
Notifications per incident
The number of notifications sent per incident varies with the conditions in the policy.
If a policy contains only one condition, then only one notification is sent when the incident initially opens, even if subsequent measurements continue to meet the condition.
If a policy contains multiple conditions, it may send multiple notifications depending on how you set up the policy:
If a policy triggers only when all conditions are met, then the policy sends a notification only when the incident initially opens.
If a policy triggers when any condition is met, then the policy sends a notification each time a new combination of conditions is met. For example:
- ConditionA is met; an incident opens and a notification is sent.
- The incident is still open when a subsequent measurement meets both ConditionA and ConditionB. In this case, the incident remains open and another notification is sent.
Notification latency is the delay from the time a problem first starts until the time a policy is triggered.
The following events and settings contribute to the overall notification latency:
Metric collection delay: The time Stackdriver Monitoring needs to collect metric values. For GCP metrics, this is typically negligible. For AWS CloudWatch metrics, this can be several minutes. For uptime checks, this can be an average of 4 minutes, up to a maximum of 5 minutes and 30 seconds (from the end of the duration window).
Duration window: The window configured for the condition. A condition is met only if it is true throughout the duration window. For example, if you specify a five-minute window, the notification is delayed at least five minutes from when the event first occurs.
Time for notification to arrive: Notification channels such as email and SMS may themselves experience network or other latencies (unrelated to what's being delivered), sometimes approaching minutes. On some channels, such as SMS and Slack, there is no guarantee that the messages will ever be delivered.
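A rough lower-bound estimate of end-to-end notification latency is the sum of these contributions. The function and the example numbers below are illustrative back-of-envelope values, not measured figures:

```python
# Back-of-envelope estimate of notification latency in minutes, as the
# sum of the three contributions described above. Values are examples.
def estimated_latency_minutes(collection_delay, duration_window, delivery_delay):
    """Lower bound: the problem must persist for the whole duration
    window, on top of collection and delivery delays."""
    return collection_delay + duration_window + delivery_delay

# E.g. an AWS CloudWatch metric (assume a 4-minute collection delay),
# a 5-minute duration window, and an assumed 1-minute delivery delay:
total = estimated_latency_minutes(4, 5, 1)  # 10 minutes
```

This is why shortening the duration window (within reason) is often the most effective lever for reducing time-to-notify.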