Troubleshoot MQL alerting policies

This page explains why some alerting policies with Monitoring Query Language (MQL)-based conditions might behave differently than intended, and offers possible remedies for those situations.

Data gaps

You created an alerting policy with an MQL-based condition, and the MQL query results show an unexpected gap in the reported data.

Gaps appear in aligned data when a calculation results in a null value at a given timestamp. For example, the following table shows the output of a query with a 30-second period:

Table A1

Timestamp Value
00:00:00 1
00:00:30 2
00:01:30 3
00:02:00 4

Because the query has a 30-second period, you would expect to see a value at 00:01:00, but the table has none. Gaps like this can occur for several reasons.

Gaps due to alignment

Overly narrow aligner windows can cause data gaps. For example, the following table shows unaligned raw metric data that is written approximately every 30 seconds.

Table B1

Timestamp Value
00:00:01 1
00:00:28 2
00:01:01 3
00:01:32 4

If you run a query at 00:02:00 that aligns your data using a next_older(30s) operation, then you receive the following output, which has a data gap at 00:01:00:

Table B2

Timestamp Value
00:00:30 2
00:01:30 3
00:02:00 4

This data gap occurs because no point in the raw data falls in the 30-second window that ends at 00:01:00. To avoid a gap like this, use a larger window. For example, a next_older(1m) operation produces a table without data gaps:

Table B3

Timestamp Value
00:00:30 2
00:01:00 2
00:01:30 3
00:02:00 4

In general, if your data is written every S seconds, then use an alignment window that is larger than S. This way, you can account for uneven distribution of data points over time.
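As a minimal sketch of this guideline, assume a hypothetical custom metric, custom.googleapis.com/my_metric, that is written about every 30 seconds. The following query aligns with a 1-minute window while still producing a point every 30 seconds. The metric name and the window size are illustrative assumptions:

```
fetch gce_instance
| metric 'custom.googleapis.com/my_metric'   # hypothetical metric, written about every 30s
| align next_older(1m)                       # window wider than the write interval
| every 30s                                  # output period stays at 30 seconds
```

With a window twice the write interval, each 30-second output point can still find a raw point even when a write arrives a few seconds late.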

Gaps due to table operations

Some table operations can produce unexpected gaps. For example, the join operation produces output only at timestamps that have a value in all of the input tables. Suppose that you join the following two aligned tables:

Table C1

Timestamp Value
00:00:30 2
00:01:30 3
00:02:00 4

Table C2

Timestamp Value
00:00:30 4
00:01:00 3
00:01:30 2
00:02:00 1

You then receive the following output:

Table C3

Timestamp Value A Value B
00:00:30 2 4
00:01:30 3 2
00:02:00 4 1

This table has no value at 00:01:00 due to the absence of a value at 00:01:00 in Table C1.
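If a default value is acceptable where one input has no point, then an outer join is one way to keep that timestamp in the output. The following is a minimal sketch, not a definitive query. The two custom metrics are hypothetical names, and using 0 as the default for each side is an assumption that only makes sense if 0 is a meaningful stand-in for your data. Check the outer_join entry in the MQL reference for the exact default-value semantics:

```
{
  fetch gce_instance::custom.googleapis.com/metric_a   # hypothetical metric (left input)
;
  fetch gce_instance::custom.googleapis.com/metric_b   # hypothetical metric (right input)
}
| outer_join 0, 0   # assumed defaults for a missing left or right value
```

Under these assumptions, a timestamp that is present in only one input appears in the output with the default in place of the missing value, instead of being dropped.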

Gaps due to missing values

Some functions produce gaps when their output can't be converted or is undefined. For example, you apply value.string_to_int64 to the following table of string values:

Table D1

Timestamp Value
00:00:30 '4'
00:01:00 '3'
00:01:30 'init'
00:02:00 '1'

Your resulting table contains a gap at 00:01:30 because MQL can't convert 'init' to an integer:

Table D2

Timestamp Value
00:00:30 4
00:01:00 3
00:01:30 null
00:02:00 1

To avoid gaps in data due to bad or missing values, use the has_value or or_else functions to handle those values.

has_value returns false if its argument evaluates to null. Otherwise, it returns true. For example, if you apply value has_value(string_to_int64(val())) to Table D1, then your results don't have gaps:

Table D3

Timestamp Value
00:00:30 true
00:01:00 true
00:01:30 false
00:02:00 true
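The or_else function offers a similar remedy: it substitutes a fallback for a value that can't be computed. As a sketch, assume a hypothetical string-valued custom metric and assume that -1 is an acceptable placeholder for strings that can't be parsed:

```
fetch gce_instance
| metric 'custom.googleapis.com/state'   # hypothetical string-valued metric
| align next_older(1m)
| every 30s
| value [parsed: or_else(string_to_int64(val()), -1)]   # -1 is an assumed placeholder
```

Applied to data like Table D1, a query of this shape would report -1 at 00:01:30 instead of leaving a gap.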

Threshold alert fires when MQL chart shows threshold hasn't been crossed

You want to be notified if a virtual machine (VM) has large fluctuations in its CPU utilization, so you create an alerting policy that monitors the metric compute.googleapis.com/instance/cpu/utilization. You configure the condition to generate an incident when the change in CPU utilization over a six-hour period is greater than a threshold of 50%. Your condition uses the following query:

fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by 5m, [value_utilization_mean: mean(value.utilization)]
| align delta_gauge(6h)
| condition val() > 0.5

You receive an alert after 30 seconds. However, your MQL chart shows that the utilization delta hasn't become greater than the threshold.

Alerting policies have a 30-second output window. You can't override this period by leaving the period undefined or by defining a different period in your query. For example, the following queries still use a 30-second output window:

fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by 5m, [value_utilization_mean: mean(value.utilization)]
| align delta_gauge(6h) # period not 30 seconds
| condition val() > 0.5

fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by 5m, [value_utilization_mean: mean(value.utilization)]
| align delta_gauge() # undefined period
| condition val() > 0.5

Your metric threshold was crossed in the first 30 seconds of evaluation, so Cloud Monitoring sent an alert. To avoid this problem, add | every 30s to your query, before the condition operation, so that the chart uses the same 30-second output window as the alerting policy. You can then verify that the query produces the intended results. For example:

fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by 5m, [value_utilization_mean: mean(value.utilization)]
| align delta_gauge(6h)
| every 30s # explicit 30 second output window
| condition val() > 0.5

Error: Unable to save alerting policy. Request contains an invalid argument.

You created an alerting policy with an MQL-based condition. When you save the alerting policy, you receive the following error message:

Error: Unable to save alerting policy. Request contains an invalid argument.

Some MQL table operations, such as group_by, require their inputs to be aligned. If your query doesn't align its inputs, then MQL automatically aligns the data. However, this automatic alignment sometimes results in invalid arguments.

To avoid this problem, if your query uses a table operation, then ensure that your query includes data alignment. For a list of data alignment functions, see the aligning section in the MQL reference documentation.
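For instance, the following sketch aligns the data explicitly before a group_by table operation. The mean_aligner function, the 5-minute window, and grouping by zone are illustrative assumptions, not requirements:

```
fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| align mean_aligner(5m)   # explicit alignment before the table operation
| every 30s                # matches the alerting policy's output window
| group_by [resource.zone], [value_utilization_mean: mean(value.utilization)]
| condition val() > 0.5
```

Choose the aligner and window that fit your metric kind and write interval. The point of the sketch is only that the alignment step is stated in the query rather than left to MQL.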

Threshold line doesn't appear on MQL chart

You created a metric-threshold alerting policy with an MQL-based condition. However, the threshold line doesn't appear on the MQL chart.

Cloud Monitoring draws the threshold line only when your query contains a boolean expression that compares two values, where one value is a column and one value is a literal. For example, the following expression charts a threshold line:

val() > 5'GBy'

However, the following expressions don't chart a threshold line:

val(0) > val(1) # one of the values must be a literal
5 > 4 # one of the values must be a column
val() # the expression must be a comparison