Best practices for MQL alerting policies

This page lists best practices for alerting policies that have a Monitoring Query Language (MQL)-based condition. Use it as a quick reference when you configure an alerting policy with an MQL-based condition.

This page doesn't describe the basics of how to use MQL queries in your alerting policies. If you're a new user, then see Alerting policies with MQL.

Recommended operations for MQL queries in alerting policies

Alerting policy evaluation involves multiple internal services. Due to the way these services interact with MQL, we strongly recommend using certain MQL operations instead of others. For example, if you use delta instead of delta_gauge, then alerts may trigger at incorrect times.

The following table shows a list of operations to avoid and recommended operations to use instead.

Avoid	Recommended
`adjacent_rate`	`rate`
`adjacent_delta`	`delta_gauge`
`delta`	`delta_gauge`
`window`	`sliding`
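
As a rough illustration of how the table above might be applied, the following Python sketch scans the text of a query for the discouraged operation names and reports the recommended replacement. The function name `discouraged_operations` is hypothetical, and the scan is a naive text match rather than an MQL parser, so it can report a false positive if one of these words appears elsewhere in the query, such as in a label value.

```python
import re

# Replacements taken from the table above.
RECOMMENDED = {
    "adjacent_rate": "rate",
    "adjacent_delta": "delta_gauge",
    "delta": "delta_gauge",
    "window": "sliding",
}

# \b keeps `delta` from matching inside `delta_gauge` or `adjacent_delta`,
# because `_` counts as a word character.
_PATTERN = re.compile(r"\b(" + "|".join(RECOMMENDED) + r")\b")


def discouraged_operations(query):
    """Return (found, recommended) pairs in order of first appearance."""
    pairs = []
    for match in _PATTERN.finditer(query):
        name = match.group(1)
        if name not in [found for found, _ in pairs]:
            pairs.append((name, RECOMMENDED[name]))
    return pairs
```

For instance, a query text that contains `group_by window(1h), .sum` yields `[("window", "sliding")]`, while a query that already uses `delta_gauge` yields an empty list.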

Use the `every 30s` statement with alerting policies

Alerting policies evaluate their condition every 30 seconds. This range of time is called an output window. We recommend that your conditions include an explicit every 30s operation as a visual reminder of this period.

For example, the following alerting policy MQL query includes an explicit every 30s statement:

fetch gce_instance :: compute.googleapis.com/instance/disk/write_bytes_count
| group_by sliding(1h)
| every 30s

If you save an alerting policy with an MQL query that uses a different value for the every operation, then Cloud Monitoring still uses a value of 30 seconds when the alerting policy is active. For example, an alerting policy with the following query still has a 30-second output window:

fetch gce_instance :: compute.googleapis.com/instance/disk/write_bytes_count
| group_by sliding(1h)
| every 90s
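
Because a saved `every` value other than 30 seconds is silently overridden, it can help to flag such queries before saving them. The following Python sketch is one possible check, not part of Cloud Monitoring. The helper names are hypothetical, and it assumes the `every` value is a simple literal in seconds, minutes, or hours.

```python
import re

ALERT_PERIOD_SECONDS = 30  # output window stated in the text above
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}  # only simple literals are handled


def declared_every_seconds(query):
    """Return the period of the first `every` operation in seconds, or None."""
    match = re.search(r"\|\s*every\s+(\d+)([smh])\b", query)
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def every_mismatch(query):
    """Return True when the query declares an `every` value other than 30s."""
    declared = declared_every_seconds(query)
    return declared is not None and declared != ALERT_PERIOD_SECONDS
```

With the two queries above, `every_mismatch` returns `False` for the `every 30s` query and `True` for the `every 90s` query.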

Improve query efficiency

Queries can run slowly when they process large volumes of data. To improve query efficiency, try reducing the amount of data that the query processes. The following sections describe several ways to reduce the volume of data that your query evaluates.

Place filters earlier in your query

When you place filters earlier in your query, they can filter out unnecessary data before your query runs operations on your data. For example, consider the following query:

fetch gce_instance :: compute.googleapis.com/instance/disk/write_bytes_count
| group_by [resource.zone], .sum
| filter zone = 'us-west1-b'
| condition val() > 5'GBy'

The query might run faster if you move the filter operation before the group_by operation:

fetch gce_instance :: compute.googleapis.com/instance/disk/write_bytes_count
| filter zone = 'us-west1-b'
| group_by [resource.zone], .sum
| condition val() > 5'GBy'
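
The effect of the reordering can be sketched with a toy model in Python. This is not how Cloud Monitoring executes queries. The rows, values, and function names are invented for illustration, and the sketch only shows that both orderings give the same total while the filter-first version aggregates fewer rows.

```python
from collections import defaultdict

# Hypothetical (zone, bytes_written) rows standing in for fetched points.
ROWS = [
    ("us-west1-b", 4),
    ("us-west1-b", 3),
    ("us-central1-f", 5),
    ("us-east1-c", 2),
    ("us-central1-f", 1),
]


def sum_by_zone(rows):
    """Sum values per zone and report how many rows were aggregated."""
    totals = defaultdict(int)
    for zone, value in rows:
        totals[zone] += value
    return dict(totals), len(rows)


def group_then_filter(rows, zone):
    """Aggregate every row, then keep only the requested zone."""
    totals, processed = sum_by_zone(rows)
    return {z: t for z, t in totals.items() if z == zone}, processed


def filter_then_group(rows, zone):
    """Drop unneeded rows first, then aggregate what is left."""
    kept = [row for row in rows if row[0] == zone]
    return sum_by_zone(kept)
```

For zone `us-west1-b`, both functions return a total of 7, but `group_then_filter` aggregates all 5 rows while `filter_then_group` aggregates only 2.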

Decrease your query alignment window

When a query uses the align operation, a smaller alignment window represents a smaller range of time that Cloud Monitoring evaluates for each point in the time series. As a result, you can try improving your query efficiency by reducing the value of your align operation. For example, the following query has a two-hour alignment window:

fetch gce_instance :: compute.googleapis.com/instance/disk/read_bytes_count
| group_by window(1h), .sum
| align next_older(2h)
| every 30s
| condition val() > 2'GBy'

However, if you need to see data for only a one-hour window, then you can reduce the alignment window to one hour:

fetch gce_instance :: compute.googleapis.com/instance/disk/read_bytes_count
| group_by window(1h), .sum
| align next_older(1h)
| every 30s
| condition val() > 2'GBy'

For more information, see Alignment.

Decrease your alerting policy duration window

The alerting policy duration window represents the time period during which each measurement must violate the condition before an alert is sent. If you reduce the duration window of your alerting policy without increasing your query alignment window, then Cloud Monitoring has fewer points to evaluate for your alerting policy condition.

For more information, see Duration window.

Assign default values to null metadata

If a metadata value is null, then your queries might produce unexpected results. You can avoid unexpected results by using the or_else function to assign a default value to metadata that would otherwise have a null value.

For example, suppose that you run a query that aggregates all of your data together:

fetch k8s_pod :: networking.googleapis.com/pod_flow/egress_bytes_count
| group_by [], 24h, sum(egress_bytes_count)
| condition val() > 10'MBy'

The query produces a result of 10MBy. Next, you run a query to verify how the 10MBy is distributed across node zones:

fetch k8s_pod :: networking.googleapis.com/pod_flow/egress_bytes_count
| group_by [metadata.system.node_zone], 24h, sum(egress_bytes_count)
| condition val() > 10'MBy'

The distribution query returns the following results:

`node_zone`	`egress_bytes_count`
us-central1-f	7.3MBy
us-west1-b	2.5MBy

These results show a total of 9.8MBy rather than the expected 10MBy. This discrepancy can occur if one of the aggregated metadata labels has a null value, such as in the following dataset:

value	metadata value
7.3MBy	us-central1-f
2.5MBy	us-west1-b
0.2MBy	(null)

To avoid problems from null metadata, you can wrap your metadata reference in the or_else function, which lets you specify a default value to use when a metadata column has no value. For example, the following query uses or_else to assign the value 'no zone' to any metadata column without a value:

fetch k8s_pod :: networking.googleapis.com/pod_flow/egress_bytes_count
| group_by [or_else(metadata.system.node_zone, 'no zone')], 24h, sum(egress_bytes_count)
| condition val() > 10'MBy'

This new query produces the following results, which sum to 10MBy:

`node_zone`	`egress_bytes_count`
us-central1-f	7.3MBy
us-west1-b	2.5MBy
no zone	0.2MBy
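
The arithmetic in this example can be reproduced with a small Python model. This is a sketch of the grouping behavior described above, not MQL itself. It assumes that a row whose grouping key is null drops out of the grouped result, and the Python `or_else` helper only mimics the MQL function of the same name.

```python
from collections import defaultdict
from decimal import Decimal

# Dataset from the example above, in MBy; None stands for the null node_zone.
POINTS = [
    (Decimal("7.3"), "us-central1-f"),
    (Decimal("2.5"), "us-west1-b"),
    (Decimal("0.2"), None),
]


def or_else(value, default):
    """Return default when value is null, mimicking the MQL function."""
    return default if value is None else value


def sum_by_zone(points, default=None):
    """Sum values per zone; rows whose key is still null are dropped."""
    totals = defaultdict(Decimal)
    for value, zone in points:
        key = or_else(zone, default)
        if key is None:
            continue  # assumption: a null grouping key loses the row
        totals[key] += value
    return dict(totals)
```

Calling `sum_by_zone(POINTS)` gives per-zone totals that sum to 9.8, while `sum_by_zone(POINTS, "no zone")` adds a `no zone` entry of 0.2 and the totals sum to 10.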