Alerting overview

This document describes how you can get notified when your application fails or when the performance of an application doesn't meet defined criteria.

How alerting works

The Cloud Monitoring alerting process contains three parts:

  • An alerting policy, which describes the circumstances under which you want to be alerted and how you want to be notified about an incident. The alerting policy can monitor time-series data stored by Cloud Monitoring or logs stored by Cloud Logging. When that data meets the alerting policy condition, Cloud Monitoring creates an incident and sends the notifications.

  • An incident, which is a record of the type of data that was monitored and when the conditions were met. This information can help you troubleshoot the issues that caused the incident.

  • A notification channel, which defines how you receive notifications when Cloud Monitoring creates an incident. For example, you can configure a notification channel to send an email and to post a Slack message to the channel #my-support-team. An alerting policy can contain one or more notification channels.

Alerting policies can evaluate two types of data:

  • Time-series data, also called metric data, which is stored by Monitoring. These types of policies are called metric-based alerting policies.

    To learn how to set up a metric-based alerting policy, try the Quickstart for Compute Engine.

  • Log data stored by Cloud Logging. These types of policies are called log-based alerting policies. Log-based alerting policies notify you when a particular message appears in your logs.

    This document focuses on metric-based alerting policies, with general information about log-based alerting policies when relevant. For detailed information about log-based alerting policies, see Monitor your logs.

The alerting process helps you respond to issues when the performance of an application fails to meet acceptable values. For example, suppose that you deploy a web application onto a Compute Engine virtual machine (VM) instance. While you expect the HTTP response latency to fluctuate, you want your support team to respond when the application has high latency for a significant time period. You could create a metric-based alerting policy that monitors the application's HTTP response latency metric. If the response latency is higher than two seconds for at least five minutes, then Monitoring creates an incident and sends email notifications to your support team.
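The "higher than two seconds for at least five minutes" rule in the example above can be illustrated with a small sketch. This is not how Cloud Monitoring evaluates conditions internally; it is a simplified model that assumes one latency sample per minute, and the function name threshold_met is hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, latency in seconds)


def threshold_met(samples: List[Sample], threshold: float, duration: timedelta) -> bool:
    """Return True if the value stays above `threshold` for a run of
    consecutive samples that spans at least `duration`."""
    run_start = None
    for timestamp, value in sorted(samples):
        if value > threshold:
            if run_start is None:
                run_start = timestamp  # a new run of violating samples begins
            if timestamp - run_start >= duration:
                return True
        else:
            run_start = None  # any sample at or below the threshold resets the run
    return False


start = datetime(2024, 1, 1, 12, 0)
# Six samples, one per minute, all at 2.5 s: the run spans five minutes.
high = [(start + timedelta(minutes=i), 2.5) for i in range(6)]
# The same series with a dip to 1.0 s at minute 3, which resets the run.
dipped = [(t, 1.0 if i == 3 else v) for i, (t, v) in enumerate(high)]

print(threshold_met(high, threshold=2.0, duration=timedelta(minutes=5)))    # True
print(threshold_met(dipped, threshold=2.0, duration=timedelta(minutes=5)))  # False
```

The second call shows why the duration matters: a brief recovery resets the run, so short latency spikes don't create incidents.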

How to create an alerting policy

There are multiple ways to create an alerting policy. For example, you can use pre-configured alerting policies by enabling recommended alerts from integrations or from certain pages in the Google Cloud console. You can also configure a new alerting policy by using the Google Cloud console, the Cloud Monitoring API, the Google Cloud CLI, or Terraform.

Use integrations and recommended alerts

Cloud Monitoring provides pre-built packages that let you create alerting policies for your Google Cloud services and third-party integrations. The packages include recommended alerting policies, sample dashboards, and key metrics for the service. These packages are available for Google Cloud services such as Google Kubernetes Engine, Compute Engine, and Cloud SQL, and for common third-party integrations such as MongoDB, Kafka, and Elasticsearch.

When you install a package, you can enable the package's recommended alerts. When you enable an alert, you provide your notification channels and either use the alert's default configuration or adjust the configuration as needed. The alerting policy begins monitoring its target immediately, with no additional user input required.

Recommended alerting policies are helpful when you've deployed a new service and want to alert on important metrics. For example, the Cloud SQL integration package comes with recommended alerts for failed instances and slow transactions:

Figure: Two of the recommended alerts for the Cloud SQL integration package.

For more information on alerting integrations, see Monitoring third-party applications.

Use Cloud Monitoring

If you want to create an alerting policy and choose its condition type along with other components such as metric type and time series, then use Cloud Monitoring. The following table lists the different types of conditions that you can use when you create an alerting policy.

Condition type: Metric-threshold condition
Description: Metric-threshold conditions trigger when the values of a metric are more than, or less than, a threshold for a specific duration window. For more information, see Create metric-threshold alerting policies and Create alerting policies by using the API.
Example: You want an alerting policy that sends an alert when resource latency is 500ms or higher for five consecutive uptime checks over 10 minutes.

Condition type: Metric-absence condition
Description: Metric-absence conditions trigger when a monitored time series has no data for a specific duration window. The duration window is up to 24 hours if you create your condition in Google Cloud console or 24.5 hours in the Cloud Monitoring API. For more information, see Create metric-absence alerting policies and Create alerting policies by using the API.
Example: You want an alerting policy that opens an incident with your support team when a resource doesn't respond to any HTTP requests over the course of five minutes.

Condition type: Forecasted metric-value condition
Description: Forecasted metric-value conditions trigger when the alerting policy predicts that the threshold will be violated within the upcoming forecast window. The forecast window can range from 1 hour to 7 days. For more information, see Create forecasted metric-value alerting policies and Create alerting policies by using the API.
Example: You want an alerting policy that opens an incident with your support team when a resource is likely to reach 80% disk space usage within the next 24 hours.

Condition type: Log-based condition
Description: Log-based alert conditions trigger when the alerting policy detects that a log-based metric matches the alerting policy criteria. Log-based metrics derive metric data from the content of log entries. For example, you can use a log-based metric to count the number of log entries that contain a particular message or to extract latency information recorded in log entries. For more information, see Configure log based alerts and Create a log-based alert by using the Monitoring API.
Example: You want an alerting policy that opens an incident with your support team when your project has at least 50 log entries with a message that contains product_ids=['tier_1_support', 'tier_2_support']
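To make the metric-absence and forecasted metric-value conditions more concrete, the following sketch models each one in a few lines. Both functions are hypothetical illustrations. In particular, the forecast uses a simple least-squares line, which is an assumption chosen for clarity and is not a description of the forecasting method that Cloud Monitoring uses.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, value)


def is_absent(samples: List[Sample], now: datetime, window: timedelta) -> bool:
    """Metric absence: True if no sample arrived in the `window` ending at `now`."""
    return not any(now - window <= t <= now for t, _ in samples)


def forecast_crosses(samples: List[Sample], threshold: float, horizon: timedelta) -> bool:
    """Forecast: fit a straight line to the samples (assumption: linear trend)
    and check whether it is at or above `threshold` at the latest sample or
    at the end of `horizon`."""
    if len(samples) < 2:
        return False
    t0 = min(t for t, _ in samples)
    xs = [(t - t0).total_seconds() for t, _ in samples]
    ys = [v for _, v in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return False  # all samples share one timestamp, so no trend can be fitted
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    intercept = mean_y - slope * mean_x
    latest_x = max(xs)
    at_latest = intercept + slope * latest_x
    at_horizon = intercept + slope * (latest_x + horizon.total_seconds())
    # A line reaches its maximum over an interval at one of the two endpoints.
    return max(at_latest, at_horizon) >= threshold


start = datetime(2024, 1, 1, 0, 0)
# Disk usage in percent, sampled hourly, rising by 2 points per hour from 50%.
usage = [(start + timedelta(hours=i), 50.0 + 2.0 * i) for i in range(7)]

print(forecast_crosses(usage, threshold=80.0, horizon=timedelta(hours=24)))  # True
print(forecast_crosses(usage, threshold=80.0, horizon=timedelta(hours=1)))   # False

last_seen = usage[-1][0]
print(is_absent(usage, now=last_seen + timedelta(minutes=4), window=timedelta(minutes=5)))   # False
print(is_absent(usage, now=last_seen + timedelta(minutes=10), window=timedelta(minutes=5)))  # True
```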

Alerting policy components

Each alerting policy has the following components:

  • A condition that describes when a resource, or a group of resources, is in a state that requires you to respond. The condition includes the data source, a static or dynamic threshold, and data aggregation settings such as lookback windows, filters, and group-by fields. Your conditions can monitor a single metric, multiple metrics, or a ratio of metrics. You can also use query languages such as PromQL and Monitoring Query Language (MQL) to include complex expressions such as dynamic thresholds and conditional logic.

    If you use an integration to enable a recommended alerting policy, then the alerting policy condition is pre-populated.

  • A list of notification channels that describe who to notify when action is required. For more information, see Create and manage notification channels.

  • Documentation that appears in notifications and incident pages. Use the documentation section to provide your alert responders with remediation steps and information about the incident. For example, you can include links to internal playbooks, and Google Cloud pages such as custom dashboards, the Logs Explorer, and resource pages.

    For more information, including an example, see Annotate alerts with user-defined documentation.
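The three components map onto the JSON body of an alerting policy in the Cloud Monitoring API. The following sketch builds such a body as a plain Python dictionary, so it runs without credentials or network access. The field names follow the API's AlertPolicy resource as I understand it; the metric type, the channel name, the playbook URL, and the function name build_latency_policy are hypothetical placeholders, so check the result against the API reference before you use it.

```python
import json
from typing import Dict, List


def build_latency_policy(channel_names: List[str], playbook_url: str) -> Dict:
    """Assemble an alerting policy body with a condition, notification
    channels, and documentation."""
    return {
        "displayName": "High HTTP response latency",
        "combiner": "OR",
        # Component 1: the condition (data source, threshold, aggregation).
        "conditions": [
            {
                "displayName": "Latency above 2 s for 5 minutes",
                "conditionThreshold": {
                    # Hypothetical custom metric; replace with your own metric type.
                    "filter": (
                        'metric.type="custom.googleapis.com/http/response_latency" '
                        'AND resource.type="gce_instance"'
                    ),
                    "comparison": "COMPARISON_GT",
                    "thresholdValue": 2.0,
                    "duration": "300s",
                    "aggregations": [
                        {"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}
                    ],
                },
            }
        ],
        # Component 2: who to notify.
        "notificationChannels": channel_names,
        # Component 3: documentation shown in notifications and incident pages.
        "documentation": {
            "content": f"Latency is high. Follow the playbook: {playbook_url}",
            "mimeType": "text/markdown",
        },
    }


policy = build_latency_policy(
    channel_names=["projects/PROJECT_ID/notificationChannels/CHANNEL_ID"],
    playbook_url="https://example.com/playbooks/latency",
)
print(json.dumps(policy, indent=2))
```

Keeping the policy as data makes it straightforward to review in version control before you submit it through the API, the Google Cloud CLI, or Terraform.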

Query languages

Use query languages and filters in your alerting policies to take greater control over your metric evaluation. Cloud Monitoring supports the following query types:

  • PromQL alerting lets you configure alerting policies to use the Prometheus Query Language. Your PromQL queries can use any kind of valid Prometheus Query Language expression, such as metric combinations, ratios, and scaling thresholds. PromQL alerting also allows for fully Google Cloud-based alert execution, which removes dependencies on external alerting infrastructure. For more information, see PromQL in Cloud Monitoring and Alerting policies with PromQL.

  • Monitoring Query Language (MQL) is an expressive, text-based interface that lets you retrieve, filter, and manipulate time-series data. You can create alerting policies with conditions that include a Monitoring Query Language alerting operation. For more information, see Monitoring Query Language overview and Alerting policies with MQL.

  • Monitoring filters let you configure alerting policies to use filter-based metric ratios. Filter-based alerting policies can't be viewed or modified in the Google Cloud console. For an example of a policy that uses Monitoring filters, see Metric ratio.
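As a rough illustration of a PromQL-based condition, the following sketch builds the condition portion of a policy body. It assumes that the API expresses PromQL conditions through a conditionPrometheusQueryLanguage object with query, duration, and evaluationInterval fields, and that Cloud Monitoring metric types map to PromQL names by replacing the first "/" with ":" and other special characters with "_". Both points reflect my understanding of the documentation and should be verified there. The helper names and the 0.8 threshold are hypothetical.

```python
import re
from typing import Dict


def to_promql_name(metric_type: str) -> str:
    """Convert a Cloud Monitoring metric type to a PromQL metric name
    (assumed convention: first '/' becomes ':', other special characters '_')."""
    domain, _, path = metric_type.partition("/")

    def sanitize(text: str) -> str:
        return re.sub(r"[^A-Za-z0-9_]", "_", text)

    return f"{sanitize(domain)}:{sanitize(path)}"


def build_promql_condition(metric_type: str, threshold: float) -> Dict:
    """Build a condition that fires when the 5-minute average exceeds `threshold`."""
    name = to_promql_name(metric_type)
    return {
        "displayName": f"{name} above {threshold}",
        "conditionPrometheusQueryLanguage": {
            "query": f"avg_over_time({name}[5m]) > {threshold}",
            "duration": "300s",
            "evaluationInterval": "60s",
        },
    }


condition = build_promql_condition(
    "compute.googleapis.com/instance/cpu/utilization", threshold=0.8
)
print(condition["conditionPrometheusQueryLanguage"]["query"])
```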

Manage alerting policies and incidents

After an alerting policy is enabled, Cloud Monitoring continuously monitors the conditions of that policy. You can't configure the alerting policy to monitor conditions only for certain time periods. If you want to disable the alerting policy for a certain time period, then create a snooze.
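A snooze is defined by the policies it applies to and a time interval. The following sketch builds a snooze body as a Python dictionary for a two-hour maintenance window. The field names (displayName, criteria.policies, interval.startTime, interval.endTime) follow the API's Snooze resource as I understand it; the policy name and the function name build_snooze are hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def build_snooze(policy_name: str, start: datetime, length: timedelta, display_name: str) -> Dict:
    """Build a snooze body that pauses `policy_name` from `start` for `length`."""

    def to_rfc3339(moment: datetime) -> str:
        # Timestamps are written in UTC with a trailing "Z".
        return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    return {
        "displayName": display_name,
        "criteria": {"policies": [policy_name]},
        "interval": {
            "startTime": to_rfc3339(start),
            "endTime": to_rfc3339(start + length),
        },
    }


snooze = build_snooze(
    policy_name="projects/PROJECT_ID/alertPolicies/POLICY_ID",
    start=datetime(2024, 1, 6, 2, 0, tzinfo=timezone.utc),
    length=timedelta(hours=2),
    display_name="Weekend maintenance window",
)
print(snooze["interval"])
# {'startTime': '2024-01-06T02:00:00Z', 'endTime': '2024-01-06T04:00:00Z'}
```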

If an incident is open and Monitoring determines that the conditions of the metric-based policy are no longer met, then Monitoring automatically closes the incident and sends a notification about the closure.

Costs associated with alerting policies

For pricing information, see Pricing for Google Cloud's operations suite.

What's next