Alerting policies with MQL

You can create Monitoring alerting policies whose condition includes an MQL query. MQL queries for alert conditions are like other MQL queries, except that they also include an MQL alerting operation.

This page introduces the MQL alerting operations and describes how to create an alerting policy that uses them. For general information on Monitoring alerting policies, see Behavior of metric-based alerting policies.

MQL alerting operations

You can create both threshold and absence alerting policies with MQL.

You create an MQL-based alerting policy by using one of the following MQL alerting operations in your query:

  • The condition operation, for a threshold alert.
  • The absent_for operation, for an absence alert.

Your query must end with one of these operations. For detailed information, see the Alerting section in the MQL reference.

Your query must not include an explicit time-range specification, that is, a within operation.

When using MQL to create an alerting policy, you build an MQL query with fetch, filter, group_by, and so on, to identify the target time series. This part of the query is the same as a query used for retrieving time series data for a chart. For example, the following query fetches the CPU utilization for all Compute Engine VM instances in any US central region:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| filter zone =~ 'us-central.*'

This query generates an output table. To create an alert, you pipe the output table into an alerting operation. The alerting operation computes a boolean value for each data value in the output table generated by the query that precedes it.

The alerting operation specifies an expression for evaluating the data in the input table. For a threshold condition, the expression tests each point against a threshold like "is the value less than 0.5?"

The Monitoring alerting facility uses the results of the alerting operation to determine if and when the alerting policy is triggered. Alert configuration describes how the decision is made.

Threshold alerts

For threshold alerts, use the condition operation. The condition operation takes an expression that evaluates a value against a threshold, like "the value is greater than 15 percent", and returns a boolean.

The condition operation requires that the input table be aligned with an explicit alignment window. To align the input table with an explicit window, specify an alignment window to an align operation—for example, align delta_gauge(5m)—or use a temporal group_by with a sliding time window. The following example illustrates using group_by with a sliding operation:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| filter zone =~ 'us-central.*'
| group_by sliding(5m), mean(val())
| condition val() > .15 '10^2.%'

Because the default group_by window setting is sliding, the group_by expression in the previous query is equivalent to group_by 5m, mean(val()).

The condition operation tests each data point in the aligned input table to determine whether the utilization value exceeds the threshold value of 15%. The resulting table has two value columns: a boolean column that records the result of the threshold evaluation, and a second column that contains a copy of the utilization value column from the input table.
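
As a rough illustration of that two-column output, the following Python sketch mimics what the condition operation does to a single aligned time series. It is not MQL and not part of any Monitoring library; the function name apply_condition and the sample utilization values are assumptions made for this example.

```python
from typing import List, Tuple

THRESHOLD = 0.15  # fractional utilization, that is, 15%


def apply_condition(values: List[float],
                    threshold: float = THRESHOLD) -> List[Tuple[bool, float]]:
    """Return one (boolean result, copied value) row per aligned input point."""
    return [(value > threshold, value) for value in values]


# Hypothetical 5-minute mean utilization values for one VM (assumed data).
rows = apply_condition([0.10, 0.15, 0.20])
for is_over, utilization in rows:
    print(is_over, utilization)
```

Only the last sample is strictly greater than the threshold, so only its boolean column is true; a value exactly equal to the threshold does not satisfy a greater-than test.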

MQL lets you create user-defined labels and attach them to incidents. For examples, see Add severity levels to an alerting policy.

The CPU-utilization value is stored as fractional utilization; the values range from 0.0 to 1.0. The metric descriptor specifies the unit for these values as 10^2.%, which the chart displays as a percentage. The units for the threshold must be compatible, so the threshold is expressed as .15 '10^2.%'.

Units for metric types are listed in the relevant table of metric types; for the metric type compute.googleapis.com/instance/cpu/utilization, see the compute table.

For more information on units in MQL, see Units of measure.

Absence alerts

For absence alerts, use the absent_for operation, which takes a duration for which data must be missing. For example, the following query tests whether data has been missing from the US central zones for eight hours:

fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| filter zone =~ 'us-central.*'
| absent_for 8h

The absent_for operation takes only a duration argument, which indicates for how long data must be absent to satisfy the condition.

Data is considered absent if data has appeared in the last 24-hour period but not within the duration. In this example, the duration is the most recent eight hours.

An absent_for query creates an output table with aligned values. The alignment is either the default alignment or the one specified by an every operation that follows the absent_for operation.

The output table has two columns.

  • The first is the active column, which records the boolean results for data absence. A true value means there was an input point within the last 24 hours and none within the duration period.

  • The second column is the signal column. If the input table has value columns, then the signal column contains the value from the first value column of the most recent input point. If the input table has no value columns, then the signal column contains the number of minutes since the last input point was recorded. You can easily force this case, as shown in the following example:

    fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
    | filter zone =~ 'us-central.*'
    | value []
    | absent_for 8h
    

    In the preceding example, the value [] operation removes the value columns from its input table, so the signal column in the table created by the absent_for operation contains the number of minutes since the last input point was recorded.
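
The rules for the active and signal columns can be sketched in Python for the case where the input table has no value columns. This is only a model of the behavior described above, not how Monitoring implements it; the function name absent_for, the inclusive or exclusive handling of the window boundaries, and the sample timestamps are all assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

LOOKBACK = timedelta(hours=24)  # window in which data must have appeared


def absent_for(points: List[datetime], now: datetime,
               duration: timedelta) -> Tuple[bool, Optional[int]]:
    """Return (active, signal) for one time series with no value columns.

    active: a point exists in the last 24 hours, but none within `duration`.
    signal: whole minutes since the most recent point, or None if no points.
    """
    past = [p for p in points if p <= now]
    if not past:
        return False, None
    age = now - max(past)  # time since the most recent input point
    active = duration < age <= LOOKBACK
    return active, int(age.total_seconds() // 60)


now = datetime(2024, 1, 1, 12, 0)  # assumed evaluation time
print(absent_for([now - timedelta(hours=9)], now, timedelta(hours=8)))
print(absent_for([now - timedelta(hours=1)], now, timedelta(hours=8)))
print(absent_for([now - timedelta(hours=30)], now, timedelta(hours=8)))
```

In this model, a series whose last point is nine hours old is active for an eight-hour duration, a series that reported an hour ago is not, and a series that has been silent for 30 hours is also not active because no data appeared in the last 24 hours.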

Alert configuration

In addition to the MQL query, an alerting-policy condition includes two other values:

  • The number of input time series that must satisfy the condition. The value can be any of the following:
    • A single time series.
    • A specific number of time series.
    • A percentage of time series.
    • All time series.
  • Duration of the alert state, that is, how long the alert condition must continuously evaluate to true.

When the alerting query continuously evaluates to true for the specified duration for a particular time series, then that time series is considered active. When the specified number of time series are active, then the alerting policy is triggered and an alert is generated for each active time series. For more information about how alerting policies are evaluated, see Behavior of metric-based alerting policies.

If you use MQL in a condition, that condition must be the only condition in the policy. You can't use multiple conditions in MQL-based alerting policies.
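
The interaction between the duration and the number of time series can be modeled with a short Python sketch. It simplifies the real evaluation: the duration is represented as a number of consecutive evaluations rather than as a length of time, and the names is_active, policy_triggered, and the sample series are hypothetical. In this model, "a single time series" corresponds to count=1 and "all time series" to percent=100.

```python
from typing import Dict, List, Optional


def is_active(results: List[bool], required_consecutive: int) -> bool:
    """True if the most recent `required_consecutive` evaluations were all true."""
    if required_consecutive <= 0 or len(results) < required_consecutive:
        return False
    return all(results[-required_consecutive:])


def policy_triggered(series: Dict[str, List[bool]], required_consecutive: int,
                     count: Optional[int] = None,
                     percent: Optional[float] = None) -> bool:
    """Apply a count-based or percent-based trigger to the per-series results."""
    active = [name for name, results in series.items()
              if is_active(results, required_consecutive)]
    if count is not None:
        return len(active) >= count
    if percent is not None:
        return bool(series) and 100 * len(active) / len(series) >= percent
    raise ValueError("specify either count or percent")


# Hypothetical boolean results from the alerting operation, oldest first.
evaluations = {
    "vm-a": [True, True, True],
    "vm-b": [False, True, True],
    "vm-c": [True, True, False],
}
print(policy_triggered(evaluations, required_consecutive=2, count=2))
print(policy_triggered(evaluations, required_consecutive=3, percent=50))
```

With two required evaluations, vm-a and vm-b are active, which satisfies a count of 2. With three required evaluations, only vm-a is active, which is below 50 percent of the three series.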

Creating MQL alerting policies (console)

To create an MQL-based alerting policy from the Google Cloud console, do the following:

  1. In the navigation panel of the Google Cloud console, select Monitoring, and then select Alerting:

    Go to Alerting

  2. To add notification channels or to update notification channels, click Edit notification channels. Add your notification channels and then return to the Alerting page.

    For details about your choices of notification channels, see Create and manage notification channels.

  3. On the Alerting page, click Create Policy.

  4. On the toolbar, select MQL.

    The code editor opens.

  5. Enter the query that selects the data you want to monitor in the code editor. The following query fetches the time series and aligns them over a five-minute window:

    fetch gce_instance
    | metric 'compute.googleapis.com/instance/cpu/utilization'
    | group_by 5m, mean(val())
    

    If you click Run Query at this point, then you see a chart. For one project, this query produced the following result:

    Chart from an alerting condition before specifying the alert.

  6. Add an alert clause to the query by using one of the following operations:

    • The condition operation, for a threshold alert.
    • The absent_for operation, for an absence alert.

    For more information about these alerting operations, see Alerting in the MQL reference.

    The following example uses the condition operation to specify a threshold:

    fetch gce_instance
    | metric 'compute.googleapis.com/instance/cpu/utilization'
    | group_by 5m, mean(val())
    | condition val() > .05
    

    If you click Run Query at this point, then the chart adds a threshold line for the condition, as shown in the following screenshot:

    Chart from an alerting condition after specifying the alert.

  7. If you haven't run your query yet, then click Run Query.

  8. Click Next and configure the alert trigger:

    1. Use the alert trigger to specify how many time series returned by the query must satisfy the alerting operation before the alerting policy can be triggered. You can select from the following criteria:

      • A single time series.
      • A specific number of time series.
      • A percentage of the time series.
      • All of the time series.
    2. Optional: Expand the Advanced options menu and select the Retest window. This field defines how long the condition must be satisfied before the alerting policy is triggered. The Retest window isn't the same as the alignment window used in the MQL query. For more information on the relationship between these values, see The alignment period and the duration.

    3. Enter a name for the condition and click Next.

  9. Optional: Configure notifications, add policy labels, and add documentation.

  10. Click Alert name and enter a name for the alerting policy.

  11. Click Create policy.

    Queries for conditions in alerting policies aren't converted to strict form.

For complete steps, see Managing alerting policies.

Creating MQL alerting policies (API)

If you're using the API, create a condition of the type MonitoringQueryLanguageCondition when you set up the policy. For more information, see Creating conditions for alerting policies.

Then pass the policy to alertPolicies.create as usual.
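
For orientation, the following Python sketch assembles a policy body with a single MQL condition and prints it as JSON. The field names (displayName, combiner, conditions, conditionMonitoringQueryLanguage, query, duration, trigger) reflect the AlertPolicy JSON representation in the Cloud Monitoring API v3 as understood here; verify them against the API reference before use. The display names, the 600s duration, and the trigger count are illustrative assumptions, and the sketch doesn't call the API.

```python
import json

# Threshold query from the "Threshold alerts" section; it ends with the
# alerting operation and contains no within operation.
QUERY = """fetch gce_instance::compute.googleapis.com/instance/cpu/utilization
| filter zone =~ 'us-central.*'
| group_by sliding(5m), mean(val())
| condition val() > .15 '10^2.%'"""

policy = {
    "displayName": "High CPU utilization (example)",  # hypothetical name
    "combiner": "OR",
    # An MQL condition must be the only condition in the policy.
    "conditions": [
        {
            "displayName": "Mean CPU above 15%",  # hypothetical name
            "conditionMonitoringQueryLanguage": {
                "query": QUERY,
                "duration": "600s",       # assumed: must hold for 10 minutes
                "trigger": {"count": 1},  # assumed: a single time series
            },
        }
    ],
}

print(json.dumps(policy, indent=2))
```

The printed JSON is the kind of body that would be sent to alertPolicies.create, along with the project that owns the policy.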