Create metric-based alert policies


This document describes how to use the Google Cloud console to create an alerting policy that monitors a metric. For example, an alerting policy that monitors the CPU utilization of a virtual machine (VM) might notify an on-call team when the policy is triggered. Alternatively, a policy that monitors an uptime check might notify on-call and development teams.

This content does not apply to log-based alerting policies. For information about log-based alerting policies, which notify you when a particular message appears in your logs, see Monitoring your logs.

This document doesn't describe the following:

Before you begin

  1. Ensure that your Identity and Access Management role includes the permissions in the role roles/monitoring.alertPolicyEditor. For more information about roles, see Access control.

  2. Ensure that you're familiar with the general concepts of alerting policies. For information about these topics, see Introduction to alerting.

  3. Configure the notification channels that you want to use to receive any alerts. For information about these steps, see Manage notification channels.

    For redundancy purposes, we recommend that you create multiple types of notification channels. For more information, see Manage notification channels.

Create alerting policies

This section describes how to create an alerting policy. By default, when you begin the create alert flow with the Google Cloud console, you are presented with a menu-driven interface. You use these menus to select the metric type that you want to monitor and to configure the policy. The metric-selection menu lists all metric types generated by Google Cloud services and the custom metric types that you defined, provided there is data for the metric type.

There are three types of conditions. These conditions trigger based on the value of a metric crossing a threshold, the absence of metric data, or the forecasted value of a metric crossing a threshold. For information about how to configure these conditions, see the following sections of this document:

To create an alert for something other than a metric type generated by a Google Cloud service or custom metric types that you defined, use one of the specialized create-alert flows. For example, the Services page in the Google Cloud console contains a guided create-alert flow that is specific to monitoring service-level objectives (SLO). For information about the specialized types of alerting policies that might be of interest to you, see the following:

Alert on metric value

This section describes how to create an alerting policy that monitors a built-in metric type or a custom metric type that you create, and compares the value of that metric to a static threshold. To create a policy that compares the value of a time series to a dynamic threshold, you must use MQL. For more information, see Create dynamic severity levels using MQL.

To create an alerting policy that compares the value of a metric to a static threshold, do the following:

  1. In the Google Cloud console, select Monitoring or click the following button:
    Go to Monitoring

  2. In the navigation pane, select Alerting and then click Create policy.

  3. Select the time series to be monitored:

    1. Click Select a metric and enter into the filter bar the name of the metric type or resource type that is of interest. For example, if you enter "VM instance" on the filter bar, then only metric types for VM instances are listed. If you enter "CPU", then the menus only display metric types that contain "CPU" in their name.

    2. Navigate through the menus to select a metric, and then click Apply.

      If the metric type you want to monitor isn't listed, then disable Show only active resources & metrics in the Select a metric menu. For more information, see Troubleshoot: Metric not listed in menu.

    3. Optional: To monitor a subset of the time series that match the metric and resource types you selected in the previous step, click Add filter. In the filter dialog, select the label by which to filter, a comparator, and then the filter value. For example, the filter zone =~ ^us.*.a$ uses a regular expression to match all time-series data whose zone name starts with us and ends with a. For more information, see Filter the selected data.

    4. Optional: To change how the points in a time series are aligned, set the Rolling window and the Rolling window function in the Transform data section.

      These fields specify how the points that are recorded in a window are combined. For example, assume that the window is 15 minutes and the window function is max. The aligned point is the maximum value of all samples recorded in the most recent 15 minutes. For more information, see Align time series.

    5. Optional: Combine time series when you want to reduce the number of time series monitored by a policy, or when you want to monitor only a collection of time series. For example, you might want to monitor the CPU utilization of your VM instances averaged by zone. By default, time series aren't combined.

      To combine all time series, do the following:

      1. Click Expand in the Across time series header.
      2. Set the Time series aggregation field to a value other than none. For example, when you select mean, each point in the displayed time series is the average of points from the individual time series.
      3. Ensure that the Time series group by field is empty.

      To combine, or group, time series by label values, do the following:

      1. Click Expand in the Across time series header.
      2. Set the Time series aggregation field to a value other than none.
      3. In the Time series group by field, select one or more labels by which to group.

      For example, if you group by zone and then set the aggregation field to mean, then the chart displays one time series for each zone. The time series shown for a specific zone is the average of all time series with that zone.

      The Secondary data transform fields are disabled by default. When enabled, these operations are applied after the primary data transformation.

      For more information, see Combine time series.

    6. Click Next.
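The menu selections above map to a time-series filter and an aggregation in the Cloud Monitoring API's condition definition. The following is a minimal sketch of that mapping; the metric type, zone regular expression, and window values are illustrative assumptions, not prescribed values:

```python
import json

# Sketch of the API equivalent of the console selections: a time-series
# filter plus an aggregation. All specific values here are illustrative.
time_series_selection = {
    # Metric and resource type chosen in the "Select a metric" menu,
    # plus an optional label filter.
    "filter": (
        'metric.type="compute.googleapis.com/instance/cpu/utilization" '
        'AND resource.type="gce_instance" '
        'AND resource.labels.zone=monitoring.regex.full_match("^us.*a$")'
    ),
    "aggregations": [
        {
            "alignmentPeriod": "900s",            # Rolling window: 15 minutes
            "perSeriesAligner": "ALIGN_MAX",      # Rolling window function: max
            "crossSeriesReducer": "REDUCE_MEAN",  # Time series aggregation: mean
            "groupByFields": ["resource.labels.zone"],  # Group by: zone
        }
    ],
}

print(json.dumps(time_series_selection, indent=2))
```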

  4. Configure the condition trigger:

    1. Leave the Condition type field at the default value of Threshold.

    2. Select a value for the Alert trigger menu. This menu lets you specify the subset of time series that must violate the threshold before the condition is triggered.

    3. Enter when the value of a metric violates the threshold by using the Threshold position and Threshold value fields. For example, if you set these values to Above threshold and 0.3, then any measurement higher than 0.3 violates the threshold.

    4. Optional: To select how long measurements must violate the threshold before alerting generates an incident, expand Advanced options and then use the Retest window menu.

      The default value is No retest. With this setting, a single measurement can result in a notification. For more information and an example, see The alignment period and the duration.

    5. Optional: To specify how Monitoring evaluates the condition when data stops arriving, expand Advanced options and then use the Evaluation missing data menu.

      The Evaluation missing data menu is disabled when the value of the Retest window is No retest.

      The Evaluation of missing data menu offers three options:

      • Missing data empty: Open incidents stay open; new incidents aren't opened.

        For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives, the auto-close timer starts after a delay of at least 15 minutes. If the timer expires, then the incident is closed.

        For conditions that aren't met, the condition continues to not be met when data stops arriving.

      • Missing data points treated as values that violate the policy condition: Open incidents stay open; new incidents can be opened.

        For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives for the auto-close duration plus 24 hours, the incident is closed.

        For conditions that aren't met, this setting causes the metric-threshold condition to behave like a metric-absence condition. If data doesn't arrive in the time specified by the retest window, then the condition is evaluated as met. For an alerting policy with one condition, the condition being met results in an incident being opened.

      • Missing data points treated as values that don't violate the policy condition: Open incidents are closed; new incidents aren't opened.

        For conditions that are met, the condition stops being met when data stops arriving. If an incident is open for this condition, then the incident is closed.

        For conditions that aren't met, the condition continues to not be met when data stops arriving.

    6. Click Next.

  5. Optional: Create an alerting policy with multiple conditions.

    Most policies monitor a single metric type. For example, a policy might monitor the number of bytes written to a VM instance. When you want to monitor multiple metric types, create a policy with multiple conditions. Each condition monitors one metric type. After you create the conditions, you specify how the conditions are combined. For information, see Policies with multiple conditions.

    To create an alerting policy with multiple conditions, do the following:

    1. For each additional condition, click Add alert condition and then configure that condition.
    2. Click Next and configure how conditions are combined.
    3. Click Next to advance to the notifications and documentation set up.
  6. Configure the notifications and documentation:

    1. Expand the Notification channels menu and select your notification channels. For redundancy purposes, we recommend that you add multiple types of notification channels to an alerting policy. For more information, see Manage notification channels.

    2. Optional: To be notified when an incident is closed, select Notify on incident closure. By default, when you create an alerting policy with the Google Cloud console, a notification is sent only when an incident is created.

    3. Optional: To change how long Monitoring waits before closing an incident after data stops arriving, select an option from the Incident autoclose duration menu. By default, when data stops arriving, Monitoring waits seven days before closing an open incident.

    4. Optional: To add custom labels to the alerting policy, in the Policy user labels section, do the following:

      1. Click Add label, and in the Key field enter a name for the label. Label names must start with a lowercase letter, and they can contain lowercase letters, numerals, underscores, and dashes. For example, enter severity.
      2. Click Value and enter a value for your label. Label values can contain lowercase letters, numerals, underscores, and dashes. For example, enter critical.

      For information about how you can use policy labels to help you manage your alerts, see Add severity levels to an alerting policy.

    5. Optional: To include custom documentation with a notification, enter that content in the Documentation section. To format your documentation, you can use Markdown. To pull information out of the policy itself to tailor the content of your documentation, you can use variables. For example, documentation might include a title such as Addressing High CPU Usage and details that identify the project:

      ## Addressing High CPU Usage
      
      This note contains information about high CPU Usage.
      
      You can include variables in the documentation. For example:
      
      This alert originated from the project ${project}, using
      the variable $${project}.
      

      When notifications are created, Monitoring replaces the variables with their values. The values replace the variables only in notifications; the preview pane and other places in the Google Cloud console show only the Markdown formatting:

      Example writing a documentation note using markdown.

      For more information, see Using Markdown and variables in documentation templates and Using channel controls.

    6. Click Alert name and enter a name for the alerting policy.

  7. Click Create policy.
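If you later want to reproduce this console flow programmatically, the finished policy corresponds to a single AlertPolicy resource in the Cloud Monitoring API, which you can supply to the alertPolicies.create REST method. The following sketch mirrors the steps above; the metric type, threshold, notification channel ID, and label values are illustrative assumptions:

```python
import json

# Illustrative AlertPolicy resource mirroring the console steps above.
# The metric, threshold, channel ID, and label values are assumptions.
alert_policy = {
    "displayName": "High CPU utilization",   # Step: Alert name
    "combiner": "OR",                        # How multiple conditions combine
    "conditions": [
        {
            "displayName": "CPU above 30%",
            "conditionThreshold": {
                "filter": (
                    'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                    'AND resource.type="gce_instance"'
                ),
                "comparison": "COMPARISON_GT",  # Threshold position: Above threshold
                "thresholdValue": 0.3,          # Threshold value
                "duration": "300s",             # Retest window: 5 minutes
                "trigger": {"count": 1},        # Alert trigger: any time series
            },
        }
    ],
    # Step: Notification channels (hypothetical channel resource name).
    "notificationChannels": [
        "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
    ],
    # Step: Documentation, rendered as Markdown in notifications.
    "documentation": {
        "content": "## Addressing High CPU Usage\n\nProject: ${project}",
        "mimeType": "text/markdown",
    },
    "userLabels": {"severity": "critical"},    # Step: Policy user labels
    "alertStrategy": {"autoClose": "604800s"},  # Incident autoclose: 7 days
}

print(json.dumps(alert_policy, indent=2))
```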

Alert on metric-absence

To be notified when you stop receiving metric data for a specified duration window, create an alerting policy with a metric-absence condition. Metric-absence conditions require at least one successful measurement — one that retrieves data — within the maximum duration window after the policy was installed or modified. The maximum configurable duration window is 24 hours if you use the Google Cloud console and 24.5 hours if you use the Cloud Monitoring API.

To create an alerting policy with a metric-absence condition, do the following:

  1. In the Google Cloud console, select Monitoring or click the following button:
    Go to Monitoring
  2. In the navigation pane, select Alerting and then click Create policy.
  3. Select the metric to be monitored, add filters, and specify how the data is transformed. These steps are the same for all types of conditions. For details on these steps, see Alert on metric value: Select time series.
  4. Configure the condition trigger:

    1. Select Metric absence for the type of condition.
    2. Select a value for the Alert trigger menu. This menu lets you specify the subset of time series that must not have data before the condition is triggered.
    3. Specify how long metric data must be absent before alerting notifies you by using the Trigger absence time field.

    Monitoring always evaluates metric-absence conditions with the rolling window set to 24 hours. The console displays a message that indicates the value you entered is being overridden.

  5. Configure the notification channels, documentation, and name for your alerting policy. For more information, see Alert on metric value: Notifications and documentation.

  6. Review your alerting policy and then click Create policy.
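In the Cloud Monitoring API, the same configuration uses a conditionAbsent block instead of a conditionThreshold. A minimal sketch, with an illustrative metric type and duration:

```python
import json

# Illustrative metric-absence condition for the Cloud Monitoring API.
# The metric type and duration are assumptions for the sketch.
absence_condition = {
    "displayName": "VM metric data is absent",
    "conditionAbsent": {
        "filter": (
            'metric.type="compute.googleapis.com/instance/cpu/utilization" '
            'AND resource.type="gce_instance"'
        ),
        # Trigger absence time: how long data must be missing before the
        # condition is met. The API accepts up to 24.5 hours (88200s).
        "duration": "3600s",
        "trigger": {"count": 1},
    },
}

print(json.dumps(absence_condition, indent=2))
```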

Alert on the forecasted value of a metric

To be notified when the alerting policy forecasts that the threshold will be violated within a forecast window, create a forecast condition. Forecast conditions are designed to monitor constraint metrics. Constraint metrics include those that record quota, memory, and storage usage.

To create an alerting policy that creates an alert based on a forecast, do the following:

  1. In the Google Cloud console, select Monitoring or click the following button:
    Go to Monitoring
  2. In the navigation pane, select Alerting and then click Create policy.
  3. Select the metric to be monitored, add filters, and specify how the data is transformed. These steps are the same for all types of conditions. For details on these steps, see Alert on metric value: Select time series.

    Select a constraint metric that has a value type of double or int64, and don't select a metric for an Amazon VM instance. When more than 64 time series are monitored, Monitoring makes forecasts for the 64 time series whose values are closest to the threshold, or that already violate the threshold. The values of the remaining time series are compared directly to the threshold.

  4. Configure the condition trigger:

    1. Select Forecast for the type of condition.

    2. Select a value for the Alert trigger menu. This menu lets you specify the subset of time series that must violate the threshold before the condition is triggered.

    3. Select a value for the Forecast window. The value that you select is the amount of time in the future for the forecast. You must set this value to at least 1 hour (3,600 seconds) and to at most 7 days (604,800 seconds).

    4. Enter when the predicted value of the selected metric violates the threshold by using the Threshold position and Threshold value fields. For example, if you set these values to Above threshold and 10, then any predicted value higher than 10 violates the threshold.

    5. Optional: Expand Advanced options and set the value of the Retest window. The default value of this field is No retest. We recommend that you set this field to at least 10 minutes.

      For example, suppose you configure the forecast condition such that any time series can cause the condition to trigger. Also assume that the Retest window is set to 15 minutes, the Forecast window is set to 1 hour, and a violation occurs when the value of the time series is higher than the Threshold, which is set to 10. The condition triggers if either of the following occurs:

      • All values of a time series become higher than 10 and stay there for at least 15 minutes.
      • In a 15-minute interval, every forecast for one time series predicts that its value will rise higher than the threshold of 10 sometime within the next hour.
    6. Optional: To specify how Monitoring evaluates the condition when data stops arriving, expand Advanced options and then use the Evaluation missing data menu.

      The Evaluation missing data menu is disabled when the value of the Retest window is No retest.

      When data is missing for more than 10 minutes, a forecast condition stops making forecasts and instead uses the value of the Evaluation missing data field to determine how to manage incidents. When observations restart, forecasting is restarted.

      The Evaluation of missing data menu offers three options:

      • Missing data empty: Open incidents stay open; new incidents aren't opened.

        For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives, the auto-close timer starts after a delay of at least 15 minutes. If the timer expires, then the incident is closed.

        For conditions that aren't met, the condition continues to not be met when data stops arriving.

      • Missing data points treated as values that violate the policy condition: Open incidents stay open; new incidents can be opened.

        For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives for the auto-close duration plus 24 hours, the incident is closed.

        For conditions that aren't met, this setting causes the metric-threshold condition to behave like a metric-absence condition. If data doesn't arrive in the time specified by the retest window, then the condition is evaluated as met. For an alerting policy with one condition, the condition being met results in an incident being opened.

      • Missing data points treated as values that don't violate the policy condition: Open incidents are closed; new incidents aren't opened.

        For conditions that are met, the condition stops being met when data stops arriving. If an incident is open for this condition, then the incident is closed.

        For conditions that aren't met, the condition continues to not be met when data stops arriving.

    7. Click Next.

  5. Configure the notification channels, documentation, and name for your alerting policy. For more information, see Alert on metric value: Notifications and documentation.

  6. Review your alerting policy and then click Create policy.
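In the Cloud Monitoring API, a forecast condition is a threshold condition with a forecastOptions field. A minimal sketch, assuming a quota-usage metric and illustrative threshold and window values:

```python
import json

# Illustrative forecast condition: a conditionThreshold with forecastOptions.
# The metric, threshold, and windows are assumptions for the sketch.
forecast_condition = {
    "displayName": "Quota usage forecast to exceed limit",
    "conditionThreshold": {
        "filter": (
            'metric.type="serviceruntime.googleapis.com/quota/allocation/usage" '
            'AND resource.type="consumer_quota"'
        ),
        "comparison": "COMPARISON_GT",   # Threshold position: Above threshold
        "thresholdValue": 10,            # Threshold value
        "duration": "900s",              # Retest window: 15 minutes
        # Forecast window: at least 1 hour (3600s), at most 7 days (604800s).
        "forecastOptions": {"forecastHorizon": "3600s"},
        "trigger": {"count": 1},
    },
}

print(json.dumps(forecast_condition, indent=2))
```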

Alert on the rate-of-change of a metric

To be notified when the rate of change of a metric exceeds a threshold, create a rate-of-change alerting policy. For example, to be notified when the CPU utilization rises too quickly, create this type of policy.

To create this type of policy, follow the steps described in Alert on metric value. However, ensure that you set the Rolling window function field to percent change.

When you select the percent change function, Monitoring does the following:

  1. If the time series has a DELTA or CUMULATIVE metric kind, the time series is converted to one that has a GAUGE metric kind. For information about the conversion, see Kinds, types, and conversions.
  2. Computes the percent change by comparing the average value in the most recent 10-minute window to the average value from the 10-minute window before the retest window.

    The 10-minute lookback window is a fixed value; you can't change it. However, you do specify the retest window when you create a condition.
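In the Cloud Monitoring API, the percent change selection corresponds to the ALIGN_PERCENT_CHANGE aligner applied over the fixed 10-minute window. A minimal sketch of that aggregation:

```python
import json

# Illustrative aggregation using the percent-change aligner. The aligner
# name comes from the API's Aligner enum; the alignment period reflects
# the fixed 10-minute lookback window described above.
percent_change_aggregation = {
    "alignmentPeriod": "600s",                  # Fixed 10-minute window
    "perSeriesAligner": "ALIGN_PERCENT_CHANGE",
}

print(json.dumps(percent_change_aggregation, indent=2))
```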

Alert on the count of processes running on a VM

To monitor the number of processes running on your VMs that meet conditions you specify, create a process-health alerting policy. For example, you can count the number of processes started by the root user. You can also count the number of processes whose invocation command contained a specific string. An alerting policy can notify you when the number of processes is more than, or less than, a threshold. For information about which processes can be monitored, see Processes that are monitored.

Process-health metrics are available when the Ops Agent or the Monitoring agent is running on the monitored resources. For more information about the agents, see Google Cloud Operations suite agents.

To monitor the count of processes running on a VM, do the following:

  1. In the Google Cloud console, select Monitoring or click the following button:
    Go to Monitoring

  2. In the navigation pane, select Alerting and then click Create policy.

  3. Click ? (Help) on the Select a metric section header and then select Direct filter mode in the tooltip.

  4. Enter a Monitoring filter.

    For example, to count the number of processes that are running on Compute Engine VM instances whose name includes nginx, enter the following:

    select_process_count("monitoring.regex.full_match(\".*nginx.*\")")
    resource.type="gce_instance"
    

    For syntax information, see the following resources:

  5. Complete the alerting policy dialog. These steps are only outlined in this section. For complete details, refer to Alert on metric value:

    1. Optional: Review and update the data transformation settings.
    2. Click Next and configure the condition trigger.
    3. Click Next and complete the notification and documentation steps.
    4. Click Alert name and enter a name for the alerting policy.
    5. Click Create policy.

Processes that are monitored

Not all processes running in your system can be monitored by a process-health condition. This condition selects processes to be monitored by using a regular expression that is applied to the command line that invoked the process. When the command line field isn't available, the process can't be monitored.

One way to determine if a process can be monitored by a process-health condition is to look at the active processes. For example, on a Linux system, you can use the ps command:

    ps aux | grep nfs
    USER      PID  %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root      1598  0.0  0.0      0     0 ?        S<   Oct25   0:00 [nfsd4]
    root      1639  0.0  0.0      0     0 ?        S    Oct25   2:33 [nfsd]
    root      1640  0.0  0.0      0     0 ?        S    Oct25   2:36 [nfsd]

When a COMMAND entry is wrapped with square brackets, for example [nfsd], the command-line information for the process isn't available. In this situation, you can't use Cloud Monitoring to monitor the process.

Alert when SLO violated

To be notified when a system is in danger of violating a defined service-level objective (SLO), create an alerting policy. For example, an SLO for some system might be that it has 99% availability over a calendar week. A different SLO might specify that the latency can exceed 300 ms in only 5 percent of the requests over a rolling 30-day period.

For information about how to create an alert for an SLO, see the following documents:

To create an SLO alerting policy when you use the Cloud Monitoring API, the data you provide to the API includes a time-series selector. For information about these selectors, see Retrieving SLO data.

You can create an SLO alerting policy by using the alerting interface in the Google Cloud console. To do so, follow the steps described in Create a process-health alerting policy. However, when you reach the step to enter a Monitoring filter, enter a time-series selector instead of a process-health expression.
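For example, a burn-rate selector has the following shape; the project, service, and SLO identifiers below are placeholders, and the second argument is the lookback period:

```
select_slo_burn_rate("projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives/SLO_ID", "3600s")
```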

Alert when uptime-check fails

We recommend that you create an alerting policy to notify you when an uptime check fails. The uptime-check infrastructure includes a guided alert-creation flow. For details on these steps, see Alerting on uptime checks.

Restrict condition to a resource-group

If you want to monitor a collection of resources, where membership in the group is defined by some criteria, then create a resource group and monitor the group. For example, you might define a resource group for the Compute Engine VM instances that you use for production. After you create that group, you can then create an alerting policy that monitors only that group of instances. When you add a VM that matches the group criteria, the alerting policy automatically monitors that VM.

You can create a resource-group alerting policy by using the Google Cloud console. To do so, follow the steps described in Create a process-health alerting policy. However, after you select the metric, add a filter that restricts the time series to those that match the group criteria.

To create an alerting policy that monitors a resource-group, do the following:

  1. In the Google Cloud console, select Monitoring or click the following button:
    Go to Monitoring

  2. In the navigation pane, select Alerting and then click Create policy.

  3. Select the time series to be monitored:

    1. Click Select a metric and enter into the filter bar the name of the metric type or resource type that is of interest. For example, if you enter "VM instance" on the filter bar, then only metric types for VM instances are listed. If you enter "CPU", then the menus only display metric types that contain "CPU" in their name.

    2. Navigate through the menus to select a metric, and then click Apply.

      If the metric type you want to monitor isn't listed, then disable Show only active resources & metrics in the Select a metric menu. For more information, see Troubleshoot: Metric not listed in menu.

    3. Click Add Filter and select Group.

    4. Expand Value and select the group name.

    5. Click Done.

  4. Complete the steps to configure the alerting policy as described in Alert on metric value: Configure trigger.
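For reference, restricting a condition to a group adds a group.id clause to the condition's Monitoring filter. A sketch with an illustrative metric type and a placeholder group ID:

```
metric.type="compute.googleapis.com/instance/cpu/utilization"
resource.type="gce_instance"
group.id="GROUP_ID"
```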

Troubleshoot: Metric not listed in menu

By default, the Select a metric menus list every metric type for which there's data. For example, if you don't use Pub/Sub, then these menus don't list any Pub/Sub metrics.

You can configure an alert even when the data you want the alert to monitor doesn't yet exist:

  • To create an alert that monitors a Google Cloud metric, follow the steps described in Alert on metric value. However, in the step where you select a metric, disable Show only active resources & metrics in the Select a metric menu. When disabled, the menu lists all metrics for Google Cloud services, and all metrics with data.

  • To configure an alert for a custom metric type before that metric type generates data, follow the steps described in Create a process-health alerting policy. When you reach the step to enter a Monitoring filter, enter a filter that specifies the metric type and resource. The following is an example of a Monitoring filter that specifies a metric type:

    metric.type="compute.googleapis.com/instance/disk/write_bytes_count"
    resource.type="gce_instance"