Alerting gives timely awareness to problems in your cloud applications so you can resolve the problems quickly.
This page provides an overview of Stackdriver Monitoring alerting. For a more hands-on introduction, follow the steps in one of the quickstarts.
How does alerting work?
You use the Stackdriver Monitoring Console to set up alerting policies. Each policy specifies the following:
Conditions that identify an unhealthy state for a resource or a group of resources.
Optional notifications sent through email, SMS, or other channels to let your support team know a resource is unhealthy.
Optional documentation that can be included in some types of notifications to help your support team resolve the issue.
When events trigger conditions in one of your alerting policies, Stackdriver Monitoring creates and displays an incident in the Stackdriver Monitoring Console. If you set up notifications, Stackdriver Monitoring also sends notifications to people or third-party notification services. Responders can acknowledge receipt of the notification, but the incident remains open until resources are no longer in an unhealthy state.
You deploy a web application onto a Compute Engine VM instance that's running a LAMP stack. While you know that HTTP response latency may fluctuate as normal demand rises and falls, if your users start to experience high latency for a significant period of time, you want to take action.
To be notified when your users experience high latency, create the following alerting policy:
If HTTP response latency is higher than two seconds,
and if this condition lasts longer than five minutes,
open an incident and send email to your support team.
Your web app turns out to be more popular than you expected and the response latency grows beyond two seconds. Here's how your alerting policy responds:
Stackdriver Monitoring opens an incident and sends email after five consecutive minutes of HTTP latency higher than two seconds.
The support team receives the email, signs into the Stackdriver Monitoring Console, and acknowledges receipt of the notification.
Following the documentation in the notification email, the team addresses the cause of the latency. Within a few minutes, HTTP responses drop back below two seconds.
As soon as Stackdriver Monitoring measures HTTP latency below two seconds, the policy's condition is no longer true (even a single measurement of lower latency breaks the "consecutive five minutes" requirement).
Stackdriver Monitoring closes the incident and resets the five-minute timer. If latency again stays above two seconds for five consecutive minutes, the policy opens a new incident.
Here's a quick overview of the concepts you should understand before creating alerts:
Alerting policy
Defines conditions under which a service is considered unhealthy. When the conditions are met, the policy triggers and opens a new incident. The triggered policy also sends notifications if you've configured it to do so.
A policy belongs to an individual Stackdriver Monitoring account, and each account can contain up to 500 policies.
Condition
Determines when an alerting policy triggers, opens an incident, and optionally sends a notification. You can create the following types of conditions:
- Metric threshold: Opens an incident if a metric rises above or falls below a value for a specific duration window.
- Metric absence: Opens an incident if a metric is unavailable (has no data) for a specific duration window.
- Metric rate of change: Opens an incident if a metric increases or decreases by a specific percent or more during a duration window.
- Uptime check health: Opens an incident if you've created an uptime check and the resource fails to successfully respond to a request sent from at least two geographic locations.
- Process health: Opens an incident if the number of processes running on a VM instance or instance group rises above or falls below a specific number during a duration window.
Some condition types are available only if your account is in the Premium Tier of service.
Depending on its type, each condition specifies the following:
Resources to monitor, such as individual VM instances or instance groups, third-party applications, databases, load balancers, URL endpoints, and so on.
Predefined metrics that measure resource performance and typically reflect the health of a process.
A policy can contain up to 6 conditions. You can specify whether a policy triggers when any condition is met (logical OR) or when all conditions are met (logical AND).
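The AND/OR trigger logic can be sketched as follows. This is a simplified illustration, not the Stackdriver API; the function and the sample condition results are hypothetical.

```python
# Simplified sketch of how a policy combines its condition results.
# Real policies are configured in the Stackdriver Monitoring Console;
# this only models the "any condition" / "all conditions" trigger.

def policy_triggers(condition_results, trigger="any"):
    """condition_results: list of booleans, one per condition.
    trigger: "any" (logical OR) or "all" (logical AND)."""
    if trigger == "all":
        return all(condition_results)
    return any(condition_results)

# A policy with two conditions: high latency is met, low disk space is not.
results = [True, False]
print(policy_triggers(results, trigger="any"))  # True: OR trigger fires
print(policy_triggers(results, trigger="all"))  # False: AND trigger does not
```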
For more details about conditions, see Condition details.
Incident
A record of a service being in an unhealthy state. When events trigger a policy, Stackdriver opens an incident. Incidents appear in the Stackdriver Monitoring Console and can be in one of the following states:
Open: A policy's condition is currently being met. Continuing with the policy described in the previous example, Stackdriver opens an incident after HTTP latency has been above two seconds for five consecutive minutes.
If a policy specifies multiple conditions, incidents open depending on the policy trigger you set:
All conditions: An incident opens only if all conditions are true.
Any condition: An incident opens if any condition is true. If one condition causes an incident to open, and later a second condition is met, then the policy opens a new incident only if the prior incident has been resolved. A policy cannot create multiple open incidents.
Resolved: The policy's conditions are no longer met. For example, Stackdriver Monitoring measured HTTP latency above two seconds for 10 consecutive minutes, but in its next measurement, latency is equal to or below two seconds. So the policy resolves the incident and resets the duration window.
A person cannot change the state of an incident to Resolved. The policy resolves the incident when its conditions are no longer met (either because updated measurements no longer meet the conditions or because someone changes the policy's conditions).
Acknowledging an incident
After you view an open incident, you can mark it as acknowledged to indicate that you're aware of the issue. In the Stackdriver Monitoring Console, acknowledged incidents that are still open appear on a separate tab. This lets other people who might be notified know that someone is looking at the issue.
In the following example, the Stackdriver Monitoring Console shows:
One open incident that hasn't been acknowledged.
One open incident that has been acknowledged.
Five resolved incidents (once an incident resolves, the Stackdriver Monitoring Console moves it to the Resolved tab whether or not it was acknowledged).
Notification
A notification is an optional communication sent when a policy's conditions are triggered. Notifications are sent over one or more notification channels, such as email, SMS, and others.
Depending on the channel, you can include documentation in the notification to help others when resolving the issue.
For more details about notifications, see Notification details.
Metric threshold
A metric threshold condition opens an incident if a metric rises above or falls below a value for a specific duration window.
Here's the UI panel for creating a metric threshold condition:
Metric absence
A metric absence condition opens an incident if a metric is unavailable (has no data) for a specific duration window.
Following is the UI panel for creating a metric absence condition that opens an incident if the "daily sales" custom metric is absent for 30 consecutive minutes:
Metric absence conditions require at least one successful measurement in the prior 24 hours. In the example above, the condition would not be met if the subsystem that writes metric data for the custom metric has never written a data point. The subsystem would need to output at least one data point, and then fail to output additional data points for 30 minutes.
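The behavior described above can be sketched as follows. This is an illustration only: timestamps are minutes, the 30-minute window comes from the "daily sales" example, and the function is hypothetical rather than part of any Stackdriver API.

```python
# Sketch of metric-absence evaluation: the condition is met only if the
# metric was written at least once in the prior 24 hours AND no data
# point has arrived within the last `window` minutes.

def absence_condition_met(sample_minutes, now, window=30, lookback=24 * 60):
    """sample_minutes: timestamps (in minutes) at which data arrived."""
    recent = [t for t in sample_minutes if now - lookback <= t <= now]
    if not recent:
        return False  # the metric has never been written: no incident
    return max(recent) < now - window

print(absence_condition_met([], now=100))    # False: no data point ever
print(absence_condition_met([40], now=100))  # True: last sample 60 minutes ago
print(absence_condition_met([90], now=100))  # False: sample 10 minutes ago
```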
Metric rate of change (Premium)
A metric rate of change condition opens an incident if a metric increases or decreases by at least a specific percent during a duration window.
The condition averages the values of the metric from the past 10 minutes, then compares the result with the 10-minute average that was measured just before the duration window. The 10-minute lookback window used by a metric rate of change condition is a fixed value (you can't make it longer or shorter). However, the duration window is a value that you specify when you create a condition.
A custom metric value increases by 30% in a single hour.
The condition in this example averages the ten-minute period that started 1 hour and 10 minutes ago and compares it to the average from the ten-minute period that started 10 minutes ago. If the latter is higher by 30% or more, the policy triggers.
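The comparison above can be sketched numerically. The sample data and the helper function are hypothetical; the real condition operates on Stackdriver's stored time series.

```python
# Sketch of the rate-of-change comparison: average two 10-minute windows
# of metric samples and compute the percent change between them.

def percent_change(old_window, new_window):
    old_avg = sum(old_window) / len(old_window)
    new_avg = sum(new_window) / len(new_window)
    return (new_avg - old_avg) * 100 / old_avg

# Ten-minute window ending just before the 1-hour duration window,
# and the most recent ten-minute window.
before = [100, 102, 98, 100]   # average 100
recent = [130, 132, 128, 130]  # average 130

change = percent_change(before, recent)
print(change)        # 30.0
print(change >= 30)  # True: the condition would be met
```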
Uptime check health
An uptime check health condition monitors an uptime check that you've created in your account.
An uptime check periodically makes an automated request to a URL, a VM instance, or another resource. If the resource fails to respond successfully to requests sent from at least two geographic locations, the uptime check fails.
By default, the results of uptime checks appear only on the Uptime Checks Overview page of the Stackdriver Monitoring Console, but you can create a policy that opens an incident and optionally sends an alert when an uptime check fails.
Here's the UI panel for creating a condition that monitors an uptime check named "lamp intro Uptime check" (which checks a specific URL on a recurring basis):
Process health (Premium)
A process health condition opens an incident if the number of processes that match a specific pattern rises above or falls below a threshold during a duration window.
This condition is available only if your account is in the Premium Tier of service, and requires the Stackdriver Monitoring Agent to be running on the monitored resource.
Within your instance group, the number of processes that you use to push notifications to your customers falls to zero. Here's the UI panel for creating this example process health condition:
Duration window
A duration window is the length of time a condition must evaluate as true before triggering a policy. Because of normal fluctuations in performance, you don't want a policy to trigger if a single measurement matches a condition. Instead, you usually want several consecutive measurements to meet a condition before you consider your application to be in an unhealthy state.
A condition resets its duration window each time a measurement does not satisfy the condition.
The following condition specifies a five-minute duration window:
HTTP latency above two seconds for five consecutive minutes.
The following sequence illustrates how the duration window affects the evaluation of the condition:
- For three consecutive minutes, HTTP latency is above two seconds.
- In the next measurement, latency falls below two seconds, so the condition resets the duration window.
- For the next five consecutive minutes, HTTP latency is above two seconds, so the policy is triggered.
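The reset behavior in the sequence above can be sketched as follows. The two-second threshold and the per-minute samples are hypothetical; only the reset logic reflects the description.

```python
# Sketch of duration-window evaluation: a single sample that does not
# satisfy the condition resets the count of consecutive minutes.

def first_trigger_minute(latencies, threshold=2.0, window=5):
    """Return the 1-based minute at which the condition has held for
    `window` consecutive samples, or None if it never triggers."""
    consecutive = 0
    for minute, latency in enumerate(latencies, start=1):
        if latency > threshold:
            consecutive += 1
        else:
            consecutive = 0  # one good sample resets the duration window
        if consecutive >= window:
            return minute
    return None

# Three high samples, one good sample (reset), then five high samples.
samples = [2.5, 2.6, 2.4, 1.8, 2.7, 2.5, 2.9, 2.6, 2.8]
print(first_trigger_minute(samples))  # 9: triggers at the ninth minute
```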
Duration windows should be long enough to reduce false positives, but short enough to ensure that incidents are opened in a timely manner. As a rule of thumb, the more highly available the service, or the bigger the penalty for not detecting issues, the shorter the duration window you may want to specify.
Partial metric data
If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions.
- A condition specifies HTTP latency of two seconds or higher for five consecutive minutes.
- For three consecutive minutes, HTTP latency is three seconds.
- For two consecutive minutes, there are no HTTP requests. In this case, a condition would carry forward the last measurement (three seconds) for these two minutes.
- After a total of five minutes the policy triggers, even though there has been no data for the last two minutes.
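The carry-forward behavior above can be sketched as follows. The sample values are hypothetical; `None` marks minutes with no measurements.

```python
# Sketch of carrying the last recorded value forward over gaps in the
# metric data, as the policy does when measurements are missing.

def fill_forward(samples):
    filled, last = [], None
    for value in samples:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Three minutes at 3-second latency, then two minutes with no requests.
samples = [3.0, 3.0, 3.0, None, None]
filled = fill_forward(samples)
print(filled)  # [3.0, 3.0, 3.0, 3.0, 3.0]
# All five values meet "2 seconds or higher", so the policy triggers.
print(all(v >= 2.0 for v in filled))  # True
```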
Missing or delayed metric data can result in policies not alerting and incidents not closing. Delays in data arriving from third-party cloud providers can be as high as 30 minutes, with 5-15 minute delays being the most common. A lengthy delay—longer than the duration window—can cause conditions to enter an "unknown" state. When the data finally arrives, Stackdriver Monitoring might have lost some of the recent history of the conditions. Later inspection of the time-series data might not reveal this problem because there is no evidence of delays once the data arrives.
You can minimize these problems by doing any of the following:
- Contact your third-party cloud provider to see if there is a way to reduce metric collection latency.
- Use longer duration windows in your conditions. This has the disadvantage of making your alerting policies less responsive.
Prefer metrics that have a lower collection delay:
- Stackdriver Monitoring agent metrics, especially when the agent is running on VM instances in third-party clouds.
- Custom metrics, when you write their data directly to Stackdriver Monitoring.
- Logs-based metrics, if logs collection is not delayed.
Notification details
A notification is an optional communication sent when a policy's conditions are triggered. Depending on the channel you use for sending a notification, you can include documentation to help your support team resolve the issue.
Number of notifications sent per incident
If a policy contains only one condition, then only one notification is sent when the incident initially opens, even if subsequent measurements continue to meet the condition.
If a policy contains multiple conditions, it may send multiple notifications depending on how you set up the policy:
If a policy triggers only when all conditions are met, then the policy sends a notification only when the incident initially opens.
If a policy triggers when any condition is met, then the policy sends a notification each time a new combination of conditions is met. For example:
- ConditionA is met; an incident opens and a notification is sent.
- The incident is still open when a subsequent measurement meets both ConditionA and ConditionB. In this case, the incident remains open and another notification is sent.
You can send notifications over one or more channels, such as email, SMS, and third-party applications including PagerDuty, HipChat, Campfire, Slack, and others. For more information, see Notification Options.
Notification latency is the delay from the time a problem first starts until the time you receive a notification.
The following events and settings contribute to the overall notification latency:
Metric collection delay: The time Stackdriver Monitoring needs to collect metric values. For GCP metrics, this delay is typically negligible. For AWS CloudWatch metrics, it can be several minutes. For uptime checks, the delay averages about four minutes and can be up to 5 minutes and 30 seconds from the end of the duration window.
Duration window: The window configured for the condition. Note that conditions are only met if a condition is true throughout the duration window. For example, if you specify a five-minute window, the notification will be delayed at least five minutes from when the event first occurs.
Time for notification to arrive: Notification channels such as email and SMS themselves may experience network or other latencies (unrelated to what's being delivered), sometimes on the order of minutes. On some channels—such as SMS and Slack—there is no guarantee that the messages will ever be delivered.
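A back-of-the-envelope total for the contributors above can be sketched as follows. The sample numbers are hypothetical, chosen only for illustration.

```python
# Rough total notification latency from the three contributors.
# All values are example figures, not guarantees.

collection_delay_min = 4   # e.g., an AWS CloudWatch metric
duration_window_min = 5    # window configured on the condition
channel_delay_min = 1      # e.g., email delivery

total = collection_delay_min + duration_window_min + channel_delay_min
print(total)  # 10: roughly ten minutes from onset to notification
```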
Pricing and quotas
The types of resources you can monitor, as well as the types of conditions you can create, depend on your Stackdriver account's tier of service:
Basic Tier: You can create policies that monitor GCP resources and that contain metric threshold, metric absence, and uptime check health conditions.
Premium Tier: You can create policies that monitor additional types of resources, such as AWS resources, and that also contain metric rate of change and process health conditions.
The following quotas apply to all service tiers:
A Stackdriver Monitoring account can have up to 500 alerting policies.
Each policy can specify up to 6 conditions and 16 notification channels.