Create metric-based alert policies

This document describes how to use the Google Cloud console to create an alerting policy that monitors a metric. For example, an alerting policy that monitors the CPU utilization of a virtual machine (VM) might notify an on-call team when the policy is triggered. Alternatively, a policy that monitors an uptime check might notify on-call and development teams.

This content does not apply to log-based alerting policies. For information about log-based alerting policies, which notify you when a particular message appears in your logs, see Monitoring your logs.

This document doesn't describe the following:

Before you begin

  1. Ensure that your Identity and Access Management role includes the permissions in the role roles/monitoring.alertPolicyEditor. For more information about roles, see Access control.

  2. Ensure that you're familiar with the general concepts of alerting policies. For information about these topics, see Introduction to alerting.

  3. Configure the notification channels that you want to use to receive any alerts. For information about these steps, see Manage notification channels.

    For redundancy purposes, we recommend that you create multiple types of notification channels. For more information, see Manage notification channels.

Create alerting policies

This section describes how to create an alerting policy.

By default, when you begin the create-alert flow in the Google Cloud console, you are presented with a menu-driven interface. You use these menus to select the metric type that you want to monitor and to configure the policy. The metric-selection menu lists all metric types generated by Google Cloud services and the custom metric types that you defined, provided there is data for the metric type. For information about the steps in the default dialog, see Default create alerting policy flow.

To create an alert for something other than a metric type generated by a Google Cloud service or a custom metric type that you defined, use one of the specialized create-alert flows. For example, the Services page in the Google Cloud console contains a guided create-alert flow that is specific to monitoring service-level objectives (SLOs). For information about the specialized types of alerting policies that might be of interest to you, see the following:

Default create alerting policy flow

This section describes how to create an alerting policy that monitors a built-in metric type or a custom metric type that you create. The policies that this section describes notify you when a metric is absent or when a metric is more than, or less than, a static threshold. To create a policy that compares the value of a time series to a dynamic threshold, you must use MQL.

This content does not apply to log-based alerting policies. For information about log-based alerting policies, which notify you when a particular message appears in your logs, see Monitoring your logs.

To create an alerting policy that monitors a metric, do the following:

  1. In the Google Cloud console, select Monitoring or click the following button:
    Go to Monitoring

  2. In the navigation pane, select Alerting and then click Create policy.

  3. Select the time series to be monitored:

    1. Click Select a metric and enter the name of the metric type or resource type of interest into the filter bar. For example, if you enter "VM instance" in the filter bar, then only metric types for VM instances are listed. If you enter "CPU", then the menus display only metric types that contain "CPU" in their name.

    2. Navigate through the menus to select a metric, and then click Apply.

      If the metric type you want to monitor isn't listed, then disable Show only active resources & metrics in the Select a metric menu. For more information, see Troubleshoot: Metric not listed in menu.

    3. Optional: To monitor a subset of the time series that match the metric and resource types you selected in the previous step, add a filter. In the filter dialog, select the label by which to filter, a comparator, and then the filter value. For example, the filter zone =~ ^us.*.a$ uses a regular expression to match all time-series data whose zone name starts with us and ends with a. For more information, see Filter the selected data.

  4. Optional: To change how the points in a time series are aligned, set the Rolling window and the Rolling window function in the Transform data section.

    These fields specify how the points that are recorded in a window are combined. For example, assume that the window is 15 minutes and the window function is max. The aligned point is the maximum value of all samples recorded in the most recent 15 minutes. For more information, see Align time series.

  5. Optional: Combine time series when you want to reduce the number of time series monitored by a policy, or when you want to monitor only a collection of time series. For example, you might want to monitor the CPU utilization of your VM instances averaged by zone.

    To combine time series, click Expand in the Across time series header. By default, time series aren't combined.

    To combine all time series, do the following:

    1. Set the Time series aggregation field to a value other than none. For example, when you select mean, each point in the displayed time series is the average of points from the individual time series.

    2. Ensure that the Time series group by field is empty.

    To combine, or group, time series by label values, do the following:

    1. Set the Time series aggregation field to a value other than none.
    2. In the Time series group by field, select one or more labels by which to group.

    For example, if you group by zone and then set the aggregation field to mean, then the chart displays one time series for each zone. The time series shown for a specific zone is the average of all time series with that zone.

    The Secondary data transform fields are disabled by default. When enabled, these operations are applied after the primary data transformation.

    For more information, see Combine time series.

  6. Click Next and configure the condition trigger:

    1. Leave the Condition type field at the default value of Threshold unless you want to be notified when data stops arriving. In that case, select Metric absence. The default setting compares the value of a metric to a threshold.

    2. For Metric absence conditions, do the following:

      1. Select a value for the Alert trigger menu. This menu lets you specify the subset of time series that must satisfy the condition before it triggers.
      2. Specify how long metric data must be absent before alerting notifies you by using the Trigger absence time field.
    3. For Threshold conditions, do the following:

      1. Select a value for the Alert trigger menu. This menu lets you specify the subset of time series that must satisfy the condition before it triggers.

      2. Use the Threshold position and Threshold value fields to specify when the value of a metric violates the threshold. For example, if you set these values to Above threshold and 0.3, then any measurement higher than 0.3 violates the threshold.

      3. Optional: To select how long measurements must violate the threshold before alerting generates an incident, expand Advanced options and then use the Retest window menu.

        The default value is No retest. With this setting, a single measurement can result in a notification. For more information and an example, see The alignment period and the duration.

      4. Optional: To specify how Monitoring evaluates the condition when data stops arriving, expand the Advanced options and then use the Evaluation missing data menu. To enable this menu, you must set Retest window to a value other than No retest.

        The Evaluation missing data menu provides the following options:

        Missing data empty

        Summary: Open incidents stay open. New incidents aren't opened.

        Details: For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives, the auto-close timer starts after a delay of at least 15 minutes. If the timer expires, then the incident is closed. For conditions that aren't met, the condition continues to not be met when data stops arriving.

        Missing data points treated as values that violate the policy condition

        Summary: Open incidents stay open. New incidents can be opened.

        Details: For conditions that are met, the condition continues to be met when data stops arriving. If an incident is open for this condition, then the incident stays open. When an incident is open and no data arrives for the auto-close duration plus 24 hours, the incident is closed. For conditions that aren't met, this setting causes the metric-threshold condition to behave like a metric-absence condition. If data doesn't arrive in the time specified by the retest window, then the condition is evaluated as met. For an alerting policy with one condition, the condition being met results in an incident being opened.

        Missing data points treated as values that don't violate the policy condition

        Summary: Open incidents are closed. New incidents aren't opened.

        Details: For conditions that are met, the condition stops being met when data stops arriving. If an incident is open for this condition, then the incident is closed. For conditions that aren't met, the condition continues to not be met when data stops arriving.

  7. Optional: Create an alerting policy with multiple conditions.

    Most policies monitor a single metric type. For example, a policy might monitor the number of bytes written to a VM instance. When you want to monitor multiple metric types, create a policy with multiple conditions. Each condition monitors one metric type. After you create the conditions, you specify how the conditions are combined. For information, see Policies with multiple conditions.

    To create an alerting policy with multiple conditions, do the following:

    1. For each additional condition, click Add condition and then configure that condition by using the previous steps.
    2. After you've added all conditions, select how these conditions are combined in the Multi-condition trigger step.
  8. Click Next to advance to the Notifications and name page.

  9. Expand the Notification channels menu and select your notification channels.

    For redundancy purposes, we recommend that you add multiple types of notification channels to an alerting policy. For more information about these recommendations, see Manage notification channels.

  10. Optional: To be notified when an incident is closed, select Notify on incident closure.

    By default, when you create an alerting policy with the Google Cloud console, a notification is sent only when an incident is created.

  11. Optional: To change how long Monitoring waits before closing an incident after data stops arriving, select an option from the Incident autoclose duration menu.

    By default, when data stops arriving, Monitoring waits seven days before closing an open incident.

  12. Optional: To add custom labels to the alerting policy, in the Policy user labels section, do the following:

    1. Click Add label, and in the Key field enter a name for the label. Label names must start with a lowercase letter, and they can contain lowercase letters, numerals, underscores, and dashes. For example, enter severity.
    2. Click Value and enter a value for your label. Label values can contain lowercase letters, numerals, underscores, and dashes. For example, enter critical.

    For information about how you can use policy labels to help you manage your alerts, see Add severity levels to an alerting policy.

  13. Optional: To include custom documentation with a notification, enter that content in the Documentation section.

    To format your documentation, you can use Markdown. To pull information out of the policy itself to tailor the content of your documentation, you can use variables. For example, documentation might include a title such as Addressing High CPU Usage and details that identify the project:

    ## Addressing High CPU Usage
    
    This note contains information about high CPU Usage.
    
    You can include variables in the documentation. For example:
    
    This alert originated from the project ${project}, using
    the variable $${project}.
    

    When notifications are created, Monitoring replaces the variables with their values. The variables are replaced only in notifications; the preview pane and other places in the Google Cloud console show only the Markdown formatting.

    For information about Markdown and variables, see Using Markdown and variables in documentation templates.

    For information about how to include channel-specific tagging to control notifications, see Using channel controls.

  14. Click Alert name and enter a name for the alerting policy.

  15. Click Create policy.
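
If you prefer to automate policy creation, the settings in this procedure map onto fields of the AlertPolicy resource in the Cloud Monitoring API. The following is a minimal sketch that uses the Python client library (google-cloud-monitoring) and assumes a CPU-utilization metric; the project ID, notification channel ID, threshold, and label values shown here are placeholders, not recommendations:

    # Minimal sketch: create a threshold alerting policy with the Cloud
    # Monitoring API Python client. PROJECT_ID and CHANNEL_ID are placeholders.
    from google.cloud import monitoring_v3
    from google.protobuf import duration_pb2

    client = monitoring_v3.AlertPolicyServiceClient()
    project_name = "projects/PROJECT_ID"

    condition = monitoring_v3.AlertPolicy.Condition(
        display_name="CPU utilization above 80%",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=(
                'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                'AND resource.type="gce_instance"'
            ),
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=0.8,
            # Roughly corresponds to the Retest window in the console.
            duration=duration_pb2.Duration(seconds=300),
            aggregations=[
                monitoring_v3.Aggregation(
                    # Rolling window and rolling window function.
                    alignment_period=duration_pb2.Duration(seconds=900),
                    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MAX,
                    # Time series aggregation and group-by settings.
                    cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_MEAN,
                    group_by_fields=["resource.label.zone"],
                )
            ],
        ),
    )

    policy = monitoring_v3.AlertPolicy(
        display_name="High CPU utilization",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[condition],
        notification_channels=[
            "projects/PROJECT_ID/notificationChannels/CHANNEL_ID"
        ],
        documentation=monitoring_v3.AlertPolicy.Documentation(
            content="## Addressing High CPU Usage\nThis alert originated from ${project}.",
            mime_type="text/markdown",
        ),
        user_labels={"severity": "critical"},
    )

    created = client.create_alert_policy(name=project_name, alert_policy=policy)
    print("Created policy:", created.name)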

Create a rate-of-change alerting policy

To be notified when the rate of change of a metric exceeds a threshold, create a rate-of-change alerting policy. For example, to be notified when the CPU utilization rises too quickly, create this type of policy.

To create this type of policy, follow the steps described in Default create alerting policy flow. However, ensure that you set the Rolling window function field to percent change.

When you select the percent change function, Monitoring does the following:

  1. Converts time series with a DELTA or CUMULATIVE metric kind to time series with a GAUGE metric kind. For information about the conversion, see Kinds, types, and conversions.
  2. Computes the percent change by comparing the average value in the most recent 10-minute window to the average value from the 10-minute window before the retest window.

    The 10-minute lookback window is a fixed value; you can't change it. However, you do specify the retest window when you create a condition.
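
If you build this kind of condition with the Cloud Monitoring API instead of the console, the percent change window function corresponds to the ALIGN_PERCENT_CHANGE aligner. The following fragment is a sketch that reuses the Python client and placeholder names from the earlier example:

    # Sketch: a threshold condition whose aligner computes percent change.
    # Assumes the monitoring_v3 and duration_pb2 imports shown earlier.
    rate_condition = monitoring_v3.AlertPolicy.Condition(
        display_name="CPU utilization rising too quickly",
        condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
            filter=(
                'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                'AND resource.type="gce_instance"'
            ),
            comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
            threshold_value=50,  # interpreted as a percent change, not a raw value
            duration=duration_pb2.Duration(seconds=300),  # retest window
            aggregations=[
                monitoring_v3.Aggregation(
                    alignment_period=duration_pb2.Duration(seconds=600),
                    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_PERCENT_CHANGE,
                )
            ],
        ),
    )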

Create a process-health alerting policy

To monitor the number of processes running on your VMs that meet conditions you specify, create a process-health alerting policy. For example, you can count the number of processes started by the root user. You can also count the number of processes whose invocation command contained a specific string. An alerting policy can notify you when the number of processes is more than, or less than, a threshold. For information about which processes can be monitored, see Processes that are monitored.

Process-health metrics are available when the Ops Agent or the Monitoring agent is running on the monitored resources. For more information about the agents, see Google Cloud Operations suite agents.

To monitor the count of processes running on a VM, do the following:

  1. In the Google Cloud console, select Monitoring or click the following button:
    Go to Monitoring

  2. In the navigation pane, select Alerting and then click Create policy.

  3. Select the help icon (?) on the Select a metric section header, and then select Direct filter mode in the tooltip.

  4. Enter a Monitoring filter.

    For example, to count the number of processes that are running on Compute Engine VM instances whose name includes nginx, enter the following:

    select_process_count("monitoring.regex.full_match(\".*nginx.*\")")
    resource.type="gce_instance"
    

    For syntax information, see the following resources:

  5. Complete the alerting policy dialog. These steps are only outlined in this section. For complete details, refer to Default create alerting policy flow:

    1. Optional: Review and update the data transformation settings.
    2. Click Next and configure the condition trigger.
    3. Click Next and complete the notification and documentation steps.
    4. Click Alert name and enter a name for the alerting policy.
    5. Click Create policy.

Processes that are monitored

Not all processes running in your system can be monitored by a process-health condition. This condition selects processes to be monitored by using a regular expression that is applied to the command line that invoked the process. When the command line field isn't available, the process can't be monitored.

One way to determine if a process can be monitored by a process-health condition is to look at the active processes. For example, on a Linux system, you can use the ps command:

    ps aux | grep nfs
    USER      PID  %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root      1598  0.0  0.0      0     0 ?        S<   Oct25   0:00 [nfsd4]
    root      1639  0.0  0.0      0     0 ?        S    Oct25   2:33 [nfsd]
    root      1640  0.0  0.0      0     0 ?        S    Oct25   2:36 [nfsd]

When a COMMAND entry is wrapped with square brackets, for example [nfsd], the command-line information for the process isn't available. In this situation, you can't use Cloud Monitoring to monitor the process.
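
On Linux, the bracketed COMMAND entries are typically kernel threads, whose /proc/<pid>/cmdline file is empty. If you prefer to script the check instead of scanning ps output, the following sketch (illustrative only, not part of Cloud Monitoring) lists the processes whose command line can't be read and that therefore can't be matched by a process-health condition:

    # Illustrative only: list processes with an empty /proc/<pid>/cmdline.
    # These correspond to the bracketed entries in ps output.
    import os

    def processes_without_cmdline():
        """Yield (pid, name) pairs for processes with no readable command line."""
        for entry in os.listdir("/proc"):
            if not entry.isdigit():
                continue
            try:
                with open(f"/proc/{entry}/cmdline", "rb") as f:
                    cmdline = f.read()
                with open(f"/proc/{entry}/comm") as f:
                    name = f.read().strip()
            except OSError:
                continue  # the process exited while we were reading
            if not cmdline:
                yield int(entry), name

    if __name__ == "__main__":
        for pid, name in processes_without_cmdline():
            print(pid, name)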

Create an SLO alerting policy

To be notified when a system is in danger of violating a defined service-level objective (SLO), create an alerting policy. For example, an SLO for some system might be that it has 99% availability over a calendar week. A different SLO might specify that the latency can exceed 300 ms in only 5 percent of the requests over a rolling 30-day period.

For information about how to create an alert for an SLO, see the following documents:

When you use the Cloud Monitoring API to create an SLO alerting policy, the data that you provide to the API includes a time-series selector. For information about these selectors, see Retrieving SLO data.

You can create an SLO alerting policy by using the alerting interface in the Google Cloud console. To do so, follow the steps described in Create a process-health alerting policy. However, when you reach the step to enter a Monitoring filter, enter a time-series selector instead of a process-health expression.
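
For example, a time-series selector for the error-budget burn rate of an SLO might look like the following, where the project, service, and SLO identifiers and the lookback duration are placeholders that you replace with your own values:

    select_slo_burn_rate("projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives/SLO_ID", "3600s")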

Create a resource-group alerting policy

If you want to monitor a collection of resources, where membership in the group is defined by some criteria, then create a resource group and monitor the group. For example, you might define a resource group for the Compute Engine VM instances that you use for production. After you create that group, you can then create an alerting policy that monitors only that group of instances. When you add a VM that matches the group criteria, the alerting policy automatically monitors that VM.

You can create a resource-group alerting policy by using the Google Cloud console. To do so, follow the steps described in Create a process-health alerting policy. However, after you select the metric, add a filter that restricts the time series to those that match the group criteria.

To filter by group, do the following:

  1. Follow the steps described in Create a process-health alerting policy until you reach the step where you can enter a filter.
  2. Click Filter and select Group.
  3. Expand Value and select the group name.
  4. Click Done.
  5. Complete the steps to configure the alerting policy.
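
If you create the policy by using the Cloud Monitoring API instead, the group restriction is expressed in the condition's filter. The following is a sketch of such a filter, assuming a CPU-utilization metric and a placeholder group identifier:

    metric.type="compute.googleapis.com/instance/cpu/utilization"
    resource.type="gce_instance"
    group.id="GROUP_ID"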

Create an uptime-check alerting policy

We recommend that you create an alerting policy to notify you when an uptime check fails. The uptime-check infrastructure includes a guided alert-creation flow. For details on these steps, see Alerting on uptime checks.

Troubleshoot: Metric not listed in menu

By default, the Select a metric menus list every metric type for which there's data. For example, if you don't use Pub/Sub, then these menus don't list any Pub/Sub metrics.

You can configure an alert even when the data you want the alert to monitor doesn't yet exist:

  • To create an alert that monitors a Google Cloud metric, follow the steps described in Default create alerting policy flow. However, in the step where you select a metric, disable Show only active resources & metrics in the Select a metric menu. When disabled, the menu lists all metrics for Google Cloud services, and all metrics with data.

  • To configure an alert for a custom metric type before that metric type generates data, follow the steps described in Create a process-health alerting policy. When you reach the step to enter a Monitoring filter, enter a filter that specifies the metric type and resource. The following is an example of a Monitoring filter that specifies a metric type:

    metric.type="compute.googleapis.com/instance/disk/write_bytes_count"
    resource.type="gce_instance"
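
If you create this kind of policy by using the Cloud Monitoring API rather than the console, the same filter becomes the filter of a condition. The following is a minimal sketch of a metric-absence policy built from that filter with the Python client library (google-cloud-monitoring); the project ID, display names, and duration are placeholders:

    # Sketch: a metric-absence policy for a metric that has no data yet.
    # PROJECT_ID, the display names, and the duration are placeholders.
    from google.cloud import monitoring_v3
    from google.protobuf import duration_pb2

    client = monitoring_v3.AlertPolicyServiceClient()

    absence_condition = monitoring_v3.AlertPolicy.Condition(
        display_name="Disk write bytes stopped reporting",
        condition_absent=monitoring_v3.AlertPolicy.Condition.MetricAbsence(
            filter=(
                'metric.type="compute.googleapis.com/instance/disk/write_bytes_count" '
                'AND resource.type="gce_instance"'
            ),
            # How long the data must be absent before the condition is met.
            duration=duration_pb2.Duration(seconds=1800),
        ),
    )

    policy = monitoring_v3.AlertPolicy(
        display_name="Missing disk write metrics",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[absence_condition],
    )

    created = client.create_alert_policy(
        name="projects/PROJECT_ID", alert_policy=policy
    )
    print("Created policy:", created.name)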