About ratios of metrics

This document helps you choose the best approach to chart or monitor a ratio of metric data. It also includes links to examples, identifies when you can compute ratios, and describes anomalies that you might see when charting a ratio of two different metrics. These anomalies are due to differences in the sampling rate or alignment parameters.

Ratios let you transform your metric data into a different, and potentially more useful, form. For example, consider a metric type that counts the number of HTTP responses by response code. The metric data reports the number of errors, but not the proportion of requests that failed. However, performance requirements are often specified as a percentage, like "The error rate must be less than 0.1%". To determine the error rate by using the metric data, you compute the ratio of the requests that failed to the total number of requests.
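As a minimal sketch of that computation, the following Python function derives an error rate from per-response-code counts. The `error_rate` function, the sample counts, and the choice to treat only 5xx codes as failures are assumptions made for illustration; they aren't part of Cloud Monitoring.

```python
def error_rate(responses_by_code: dict[int, int]) -> float:
    """Return the fraction of responses that failed.

    Assumption for this sketch: only 5xx response codes count as failures.
    """
    total = sum(responses_by_code.values())
    if total == 0:
        # No traffic: report 0.0 instead of dividing by zero.
        return 0.0
    failed = sum(n for code, n in responses_by_code.items() if 500 <= code <= 599)
    return failed / total


# Hypothetical counts of HTTP responses, keyed by response code.
counts = {200: 9990, 404: 5, 500: 3, 503: 2}
rate = error_rate(counts)
print(f"{rate:.4%}")  # 0.0500%
```

With these sample counts, 5 of 10,000 requests failed, so the error rate is 0.05%, which meets a requirement of "less than 0.1%".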

Best practices

To monitor or chart a ratio of metric data, we recommend that you use PromQL. You can use PromQL with the Cloud Monitoring API and with the Google Cloud console. The Google Cloud console includes a code editor that provides suggestions, error detection, and other support for creating valid PromQL queries.

To create an alerting policy that monitors a ratio of metrics when you aren't familiar with PromQL, use the Cloud Monitoring API and include a time-series filter. For an example, see Metric ratio.

To chart a ratio of metric data when you aren't familiar with PromQL, we recommend that you use the menu-driven interface in the Google Cloud console. For detailed instructions, see Chart a ratio of metrics and Add charts and tables to a custom dashboard.

Restrictions with ratios

When you configure a ratio, the following restrictions apply:

After aggregation, the labels in the denominator time series must be the same as, or a subset of, the labels in the numerator time series.

We recommend that you select aggregation options such that after aggregation, the numerator and the denominator time series have the same labels.

Consider a configuration where the numerator time series has method, quota_metric, and project_id labels. The denominator time series has limit_name, quota_metric, and project_id labels. The valid choices for the denominator grouping depend on the selections for the numerator:
- Numerator grouped by the method label: Combine the denominator time series into a single time series. No other grouping results in the labels for the denominator time series being a subset of the labels for the numerator time series.
- Numerator grouped by the quota_metric label: Group the denominator by that label or combine all time series in the denominator into a single time series.
- Numerator grouped by the quota_metric and project_id labels: Group the denominator by both labels, by one label, or combine the denominator time series into a single time series.
The valid denominator aggregation options always eliminate the limit_name label from the grouped time series because that label isn't present in the numerator time series.

The alignment period must be the same for the numerator and the denominator when configuring a chart by using the Google Cloud console; however, these fields can be different when using the Cloud Monitoring API.

We recommend that you use the same alignment period for the numerator and the denominator regardless of the tool you use to create the chart.

The numerator and denominator must have the same value type. For example, when the numerator is of type DOUBLE, the denominator must also be of type DOUBLE.

Ratios require that the numerator and denominator metrics have a value type of DOUBLE or INT64.

The aligned time series for the numerator and the denominator must have the same metric kind. When the two metrics have different kinds, you must use aligners to convert them to the same kind.

Consider a configuration where a DELTA metric is selected for the numerator and a GAUGE metric is selected for the denominator. In this situation, use the rate aligner, ALIGN_RATE, to convert the DELTA metric to a GAUGE metric. For an example, see Ratio alerting policies on usage of rate quota for one limit.

For ratios that aren't defined with PromQL, the monitored resource type must be the same for the numerator and the denominator.

For example, if the resource for the numerator metric is Compute Engine instances, then the resource for the denominator metric must also be Compute Engine instances.
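The label restriction described earlier in this section amounts to a subset test on the labels that remain after aggregation. The following Python sketch applies that test to the method, quota_metric, project_id, and limit_name example. The `denominator_grouping_is_valid` function is a hypothetical helper written for illustration; it isn't part of the Cloud Monitoring API.

```python
def denominator_grouping_is_valid(numerator_labels, denominator_labels):
    """Return True when the denominator's labels, after aggregation, are the
    same as, or a subset of, the numerator's labels after aggregation.

    An empty denominator grouping means that all denominator time series
    were combined into a single time series.
    """
    return set(denominator_labels) <= set(numerator_labels)


# Numerator grouped by method: only a single combined denominator is valid.
print(denominator_grouping_is_valid(["method"], []))                # True
print(denominator_grouping_is_valid(["method"], ["quota_metric"]))  # False

# Numerator grouped by quota_metric and project_id.
grouped = ["quota_metric", "project_id"]
print(denominator_grouping_is_valid(grouped, ["quota_metric", "project_id"]))  # True
print(denominator_grouping_is_valid(grouped, ["project_id"]))                  # True
print(denominator_grouping_is_valid(grouped, ["limit_name"]))                  # False
```

Any denominator grouping that retains limit_name fails the test, because that label never appears in the numerator time series.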

Anomalies due to sampling and alignment mismatches

In general, it is best to compute ratios based on time series collected for a single metric type, by using label values. A ratio computed over two different metric types is subject to anomalies due to different sampling periods and alignment windows.

For example, suppose that you have two different metric types, an RPC total count and an RPC error count, and you want to compute the ratio of unsuccessful RPCs to total RPCs. The unsuccessful RPCs are counted in the time series of both metric types. Therefore, there is a chance that, when you align the time series, an unsuccessful RPC doesn't appear in the same alignment interval for both time series. This difference can happen for several reasons, including the following:

- Because there are two different time series recording the same event, there are two underlying counter values implementing the collection, and they aren't updated atomically.
- The sampling rates might differ. When the time series are aligned to a common period, the counts for a single event might appear in adjacent alignment intervals in the time series for the different metrics.

The difference in the number of values in corresponding alignment intervals can lead to nonsensical error/total ratio values like 1/0 or 2/1.

Ratios of larger numbers are less likely to result in nonsensical values. You can get larger numbers by aggregation, either by using an alignment window that is longer than the sampling period, or by grouping data for certain labels. These techniques minimize the effect of small differences in the number of points in a given interval. That is, a two-point disparity is more significant when the expected number of points in an interval is 3 than when the expected number is 300.
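The following Python sketch illustrates the effect of the alignment window on a ratio of two separately recorded counts. It is a simplified simulation, not a model of how Cloud Monitoring stores data: the traffic pattern (one RPC per second, every tenth RPC failing), the one-second delay on the error counter, and the `align_counts` and `ratios` helpers are all assumptions chosen for illustration.

```python
from collections import Counter


def align_counts(timestamps, window):
    """Count events per alignment interval, keyed by interval index."""
    return Counter(int(t // window) for t in timestamps)


def ratios(error_ts, total_ts, window):
    """Return error/total per alignment interval; None when the total is 0."""
    errors = align_counts(error_ts, window)
    totals = align_counts(total_ts, window)
    result = {}
    for idx in sorted(set(errors) | set(totals)):
        result[idx] = errors[idx] / totals[idx] if totals[idx] else None
    return result


# Assumed traffic: one RPC per second for 600 seconds; every tenth RPC fails.
total_ts = list(range(600))
# Assumed skew: the error counter records each failure one second late.
error_ts = [t + 1 for t in total_ts if t % 10 == 9]

short = ratios(error_ts, total_ts, window=1)    # window equals the sampling period
long = ratios(error_ts, total_ts, window=100)   # window much longer than the sampling period

print(short[9], short[10], short[600])  # 0.0 1.0 None
print(long[0], long[1], long[6])        # 0.09 0.1 None
```

With a one-second window, every interval reports 0%, 100%, or an undefined 1/0 ratio, even though the true failure rate is 10%. With a 100-second window, the intervals that contain traffic report 9% or 10%. The trailing interval still has a zero denominator, so a longer window reduces the anomalies but doesn't eliminate them.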

If you are using built-in metric types, then you might have no choice but to compute ratios across metric types to get the value you need.

If you are designing custom metrics that might count the same thing—like RPCs returning error status—in two different metrics, consider instead a single metric, which includes each count only once. For example, suppose that you are counting RPCs and you want to track the ratio of unsuccessful RPCs to all RPCs. To solve this problem, create a single metric type to count RPCs, and use a label to record the status of the invocation, including the "OK" status. Then each status value, error or "OK", is recorded by updating a single counter for that case.
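A minimal Python sketch of that single-metric design follows. The `RpcCounter` class and the status strings are hypothetical and stand in for whatever instrumentation library you use; the point is that each RPC updates exactly one counter, so the numerator and the denominator of the ratio come from the same data.

```python
from collections import Counter


class RpcCounter:
    """Hypothetical single metric with one counter per status label value."""

    def __init__(self):
        self._counts = Counter()

    def record(self, status: str) -> None:
        # Each RPC increments exactly one counter, including the "OK" case.
        self._counts[status] += 1

    def error_ratio(self):
        """Return unsuccessful RPCs / all RPCs, or None when nothing was recorded."""
        total = sum(self._counts.values())
        if total == 0:
            return None
        return (total - self._counts["OK"]) / total


rpcs = RpcCounter()
for status in ["OK"] * 8 + ["UNAVAILABLE", "DEADLINE_EXCEEDED"]:
    rpcs.record(status)

print(rpcs.error_ratio())  # 0.2
```

Because the total is derived by summing over the status label, the number of unsuccessful RPCs can never exceed the total, which rules out values like 2/1.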

What's next

For information about using PromQL to configure alerting policies, see PromQL alerting overview.
For information about creating charts, see the following documents:
- To create temporary charts, see Metrics Explorer.
- To add charts to a dashboard by using the Google Cloud console, see Add charts and tables to a custom dashboard.
- To manage charts by using the Cloud Monitoring API, see Create and manage dashboards by API.