Charting distribution metrics

This page describes how you can create and interpret a chart to display metric data that has a Distribution value type. This value type is used by services when the individual measurements are too numerous to collect, but statistical information, such as averages or percentiles, about those measurements is valuable.

For example, consider a service that measures the HTTP latency of requests. The designers choose to report the latency data for completed HTTP requests by using a distribution value type. The data is to be reported every minute. The service defines a collection of buckets where each bucket defines a range of latency values. When an HTTP request completes, the service increments the count in the bucket whose range includes the request's latency value. These per-bucket counts create a histogram of values for that minute.

If the buckets are [0, 4), [4, 8), [8, 12), and [12, 16), and if the latencies of requests in a one-minute interval are 5, 1, 3, 5, 6, 10, and 15, then the histogram of this data is [2, 3, 1, 1]:

Bucket     Latency measurements   Number of values in the bucket
[0, 4)     1, 3                   2
[4, 8)     5, 5, 6                3
[8, 12)    10                     1
[12, 16)   15                     1

When this data is written to the time series, a Point object is created. For metrics with a distribution value, that object includes the histogram of values. For this sampling period, the Point would contain [2, 3, 1, 1]. The individual measurements aren't written to the time series.
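
The bucketing step can be sketched as follows. This is an illustration only; the bucket boundaries are the ones from the example, and the latency list is an illustrative set of measurements that yields the histogram [2, 3, 1, 1]:

```python
# Bucket edges, so bucket i spans [bounds[i], bounds[i+1]):
# [0, 4), [4, 8), [8, 12), and [12, 16).
bounds = [0, 4, 8, 12, 16]

def bucket_counts(values, bounds):
    """Count how many values fall into each [bounds[i], bounds[i+1]) bucket."""
    counts = [0] * (len(bounds) - 1)
    for v in values:
        for i in range(len(counts)):
            if bounds[i] <= v < bounds[i + 1]:
                counts[i] += 1
                break
    return counts

# One minute of latency measurements; values at or above the last
# boundary would fall into an overflow bucket (not modeled here).
latencies = [5, 1, 3, 5, 6, 10, 15]
print(bucket_counts(latencies, bounds))  # [2, 3, 1, 1]
```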

Because this HTTP data is reported every minute, a time series might include the following histograms:

Bucket     Histogram for 1:00   Histogram for 1:01   Histogram for 1:02   Histogram for 1:03
[0, 4)     2                    6                    10                   3
[4, 8)     3                    1                    1                    8
[8, 12)    1                    0                    2                    2
[12, 16)   1                    6                    0                    1

Heatmap charts

Heatmap charts are designed to display a single time series with distribution values. For these charts, the X-axis represents time, the Y-axis represents the buckets, and color represents the bucket count: brighter colors indicate higher values. For example, dark areas of the heatmap indicate lower bucket counts than yellow or white areas.

The next figure is one representation of a heatmap for the example. In this example, the sums range from 0 to 10. When the value is zero, the color is black. When the value is 10, the color is yellow. For an intermediate value, such as six, the color is orange.

Heatmap chart for the example.

Because heatmap charts can represent only one time series, you must set the aggregation options to combine all time series into a single time series:

  • Ensure that the Group by field is empty.
  • Select sum for the group-by function.

Line and bar charts

Line charts, stacked bar charts, and stacked line charts can't display distribution values. If you have a metric with a distribution value type and want to display it by using one of these chart types, then you must convert the histogram for each Point, for example, [2, 3, 1, 1], into a numerical value. There are many ways to perform this conversion; for example, you can compute the sum of the bucket counts or select a percentile.

Consider the example data that reports the latencies of HTTP requests by using a distribution metric. The following table illustrates the histogram of the time series as a function of time. The last column displays the sum of the histogram counts. The values in this column can be plotted by using an x-y plot:

Time   Histogram1      Sum of histogram values
1:00   [2, 3, 1, 1]    7
1:01   [6, 1, 0, 6]    13
1:02   [10, 1, 2, 0]   13
1:03   [3, 8, 2, 1]    14

1 The histogram buckets are [0, 4), [4, 8), [8, 12), and [12, 16). For example, the histogram [2, 3, 1, 1] indicates 2 samples in the bucket [0, 4).
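
The sums in the last column can be computed directly from the histograms; a minimal sketch:

```python
# Histograms for the four sampling times, as in the table above.
histograms = {
    "1:00": [2, 3, 1, 1],
    "1:01": [6, 1, 0, 6],
    "1:02": [10, 1, 2, 0],
    "1:03": [3, 8, 2, 1],
}

# Converting each distribution value to a number by summing its bucket
# counts yields a numeric series that a line chart can display.
sums = {t: sum(h) for t, h in histograms.items()}
print(sums)  # {'1:00': 7, '1:01': 13, '1:02': 13, '1:03': 14}
```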

In this example, the sum is a meaningful measure, as it can be thought of as the rate of HTTP request completion:

Line chart for the example.

Aggregation and distribution metrics

Aggregation, the process of regularizing points within a time series and of combining multiple time series, is the same for distribution-valued metrics as it is for metrics with an integer or double value type. However, the chart type imposes requirements on the alignment and grouping functions you can choose.

Heatmap charts

The alignment function and the group-by function must be selected such that the result is a single time series with a distribution value.

For the alignment function, you can select either sum or delta. These functions combine, at the bucket level, all samples for a single time series that are in the same alignment period, and the result is a distribution value. For example, if two adjacent samples of a time series are [2, 3, 1, 1] and [2, 5, 4, 1], then the sum alignment function produces [4, 8, 5, 2].

For the group-by function, the only option is sum. This function adds together the values of the same buckets for the different time series, and the result is a distribution value. For example, if at time 1:00 the timeseries-A histogram is [2, 3, 1, 1] and the timeseries-B histogram is [1, 5, 2, 2], then summing these histograms results in [3, 8, 3, 3].
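
Both operations reduce to a bucket-by-bucket sum. A minimal sketch, using the histograms from the two examples above:

```python
def sum_histograms(a, b):
    """Add two histograms bucket by bucket; both must use the same bucket model."""
    return [x + y for x, y in zip(a, b)]

# Alignment: two adjacent samples of one time series combined into one.
print(sum_histograms([2, 3, 1, 1], [2, 5, 4, 1]))  # [4, 8, 5, 2]

# Grouping: timeseries-A and timeseries-B combined at time 1:00.
print(sum_histograms([2, 3, 1, 1], [1, 5, 2, 2]))  # [3, 8, 3, 3]
```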

Line charts

The alignment function and the group-by function must be selected such that, after aggregation is complete, the distribution values are converted into numerical values. You can convert a distribution value into a numeric value with the alignment function or with the group-by function.

  • If you select a percentile for the alignment function, then during the alignment stage of aggregation, each distribution value is converted into a numerical value. Grouping time series is optional.

    For example, to display the 99th percentile of every time series, set the alignment function to 99th percentile, ensure that the group-by options are empty, and set the group-by function to none. With this configuration, the chart can display multiple lines, one for each time series.

  • If you select sum or delta as the alignment function, then at the end of the alignment phase, each sample contains a distribution value. With these alignment functions, you must select a group-by function. The group-by function converts the distribution value into a numeric value.

    For example, to display the 99th percentile of all time series, set the alignment function to sum, ensure the group-by options are empty, and set the group-by function to 99th percentile. Because the group-by options are empty, the group-by function combines the distribution values for all time series into a new distribution value, and then selects the 99th percentile. With this configuration, the chart displays a single line.

Understanding distribution percentiles

When you chart a distribution-valued metric on a heatmap, you have the option to overlay the 50th percentile, the 95th percentile, and the 99th percentile. If you display a distribution-valued metric on a line chart, then you must convert it to a numeric value, and one way to do that is to select a percentile. What might not be clear is how these percentiles are generated or how to interpret them.

The percentile value is a computed value. The computation takes into account the number of buckets, the width of the buckets, and the total count of samples. Because the individual measured values aren't recorded, the computation can't rely on them.

Example with synthetic data

Consider an Exponential bucket model with a scale of one and a growth factor of two. This results in a series of buckets where the (i+1)th bucket is twice as wide as the ith bucket.

Case 1: The total number of samples is 1.

Assume that the histogram of measurements is as shown in the following table:

Bucket       Count
[0, 1)       0
[1, 2)       0
[2, 4)       0
[4, 8)       0
[8, 16)      0
[16, 32)     0
[32, 64)     0
[64, 128)    0
[128, 256)   1

To compute the 50th percentile:

  1. You use the bucket counts to determine that the [128, 256) bucket contains the 50th percentile.
  2. You assume that the measured values within the selected bucket are uniformly distributed and therefore the best estimate of the 50th percentile is the bucket midpoint.

By using the same logic, you can compute any percentile: first identify the bucket that contains the percentile, and then compute the value within that bucket:

Percentile   Bucket       Value
50th         [128, 256)   192
95th         [128, 256)   249.6
99th         [128, 256)   254.7

When there is a single measurement, the three percentile values differ, but they only show the 50th, 95th, and 99th percentile positions within the same bucket. The error between the estimate and the actual measurement can't be determined because the measurement isn't known. All that is known is that there was a single reading whose value was in the interval [128, 256). This value might have been 128 or it might have been 255.
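
The two steps above can be sketched as follows. This is an illustration only: the within-bucket rule (take the point p% of the way through the bucket that contains the percentile rank) is inferred from the worked values in this example, not taken from the actual Cloud Monitoring implementation:

```python
def estimate_percentile(bounds, counts, p):
    """Estimate the p-th percentile of a histogram.

    bounds: bucket edges, so bucket i spans [bounds[i], bounds[i+1]).
    counts: number of samples in each bucket.
    The cumulative counts select the bucket that contains the percentile
    rank; the estimate is the point p% of the way through that bucket.
    """
    rank = p / 100 * sum(counts)
    cumulative = 0
    for i, c in enumerate(counts):
        cumulative += c
        if cumulative >= rank and c > 0:
            return bounds[i] + p / 100 * (bounds[i + 1] - bounds[i])
    raise ValueError("empty histogram")

bounds = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]
counts = [0, 0, 0, 0, 0, 0, 0, 0, 1]   # the single sample in [128, 256)
for p in (50, 95, 99):
    print(p, round(estimate_percentile(bounds, counts, p), 2))
```

For a single sample, every percentile lands in the [128, 256) bucket, and the printed values reproduce the table: 192.0, 249.6, and 254.72 (rounded to 254.7 in the table).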

Case 2: The total number of samples is 10.

Assume that the histogram of measurements is as shown in the following table:

Bucket       Count
[0, 1)       4
[1, 2)       2
[2, 4)       1
[4, 8)       1
[8, 16)      1
[16, 32)     0
[32, 64)     0
[64, 128)    0
[128, 256)   1

By using the process described previously, the percentiles can be computed and are shown in the following table:

Percentile   Bucket       Value   Max error
50th         [1, 2)       1.5     0.5
95th         [128, 256)   249.6   121.6
99th         [128, 256)   254.7   126.7

As illustrated by this example, when there are 10 samples, the 50th percentile might be in a different bucket than the 95th and 99th percentiles. However, there still aren't enough measurements to allow the 95th and 99th percentiles to be in different buckets.
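
The Case 2 rows can be reproduced with the same assumed rule as before (a sketch, not the actual implementation), with the maximum error taken as the distance from the estimate to the farther edge of its bucket:

```python
bounds = [0, 1, 2, 4, 8, 16, 32, 64, 128, 256]
counts = [4, 2, 1, 1, 1, 0, 0, 0, 1]      # ten samples in total

def percentile_and_error(bounds, counts, p):
    """Return (estimate, max error) for the p-th percentile of the histogram."""
    rank = p / 100 * sum(counts)
    cumulative = 0
    for i, c in enumerate(counts):
        cumulative += c
        if cumulative >= rank and c > 0:
            lo, hi = bounds[i], bounds[i + 1]
            value = lo + p / 100 * (hi - lo)   # p% of the way through the bucket
            # Worst case: the underlying sample sat at the far bucket edge.
            return value, max(value - lo, hi - value)
    raise ValueError("empty histogram")

for p in (50, 95, 99):
    value, err = percentile_and_error(bounds, counts, p)
    print(p, round(value, 2), round(err, 2))
```

The 50th percentile rank (the 5th of 10 samples) now falls in the [1, 2) bucket, giving 1.5 with a maximum error of 0.5, while the 95th and 99th percentiles remain in [128, 256).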

Example with real data

This section contains a detailed example that illustrates how you can determine the bucket model used by a particular metric and how you can evaluate the potential error in the computed percentile values.

Determining the bucket model

To determine the buckets used for this metric, use the Cloud Monitoring API's projects.timeSeries/list method.

For this example, the data displayed was pulled from an existing Google Cloud project by using API Explorer with the following settings:

  • Metric filter: metric.type="networking.googleapis.com/google-service/backend_latencies" resource.type="google_service_gce_client" resource.label."zone"="us-central1-f"
  • End time: 2020-11-03T10:06:36-05:00
  • Start time: 2020-11-03T10:04:00-05:00

The response to this query includes the bucket model and the number of samples in the distribution:

{
  "timeSeries": [
    {
      "metric": {
        "labels": {
          "service_name": "monitoring.googleapis.com",
          "protocol": "HTTP/2.0",
          "response_code_class": "200",
          "service_region": "us-central1"
        },
        "type": "networking.googleapis.com/google-service/backend_latencies"
      },
      "resource": {
          [redacted]
        }
      },
      "metricKind": "DELTA",
      "valueType": "DISTRIBUTION",
      "points": [
        {
          "interval": {
            "startTime": "2020-11-03T15:05:00Z",
            "endTime": "2020-11-03T15:06:00Z"
          },
          "value": {
            "distributionValue": {
              "count": "3",
              "mean": 25.889,
              "bucketOptions": {
                "exponentialBuckets": {
                  "numFiniteBuckets": 66,
                  "growthFactor": 1.4,
                  "scale": 1
                }
              },
              "bucketCounts": [
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "0",
                "3"
              ]
            }
          }
        },

This result shows that Service Networking writes the backend latency data by using an exponential bucket model with 66 finite buckets, a scale of 1, and a growth factor of 1.4. The API response also shows that for the specified one-minute interval, there were 3 measurements with a mean value of 25.889.

The buckets for this metric, along with the midpoint of each bucket and the percentile contribution are shown in the following table:

ith interval   Lower bound1   Upper bound2   Midpoint   Per-percentile contribution
0              -infinity      0              n/a        n/a
1              0              1.4            0.7        0.014
2              1.4            1.96           1.58       0.0056
...
9              14.75          20.66          17.7       0.0591
10             20.66          28.93          24.79      0.0827
11             28.93          40.5           34.7       0.116
...

1 Lower bound = scale * (growth factor)^(i-1)
2 Upper bound = scale * (growth factor)^i

Based on the earlier analysis, we expect that if the 50th percentile is in bucket number 10, then the 50th percentile value is about 24.79.
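
The bounds in this table follow directly from the footnote formulas; a minimal sketch for bucket 10:

```python
scale, growth = 1, 1.4   # from the exponentialBuckets options in the API response

def bucket_bounds(i):
    """Lower and upper bound of the ith bucket: scale * growth^(i-1) and scale * growth^i."""
    return scale * growth ** (i - 1), scale * growth ** i

lo, hi = bucket_bounds(10)
midpoint = (lo + hi) / 2
print(round(lo, 2), round(hi, 2), round(midpoint, 6))
# 20.66 28.93 24.793256 -- the midpoint matches the ALIGN_50 value the API returns
```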

Verifying the percentile computations

To retrieve the 50th, 95th, and 99th percentile values, you can use the API method projects.timeSeries/list, and include an alignment period and aligner. In this example, the following settings were selected:

  • Aligner: ALIGN_PERCENTILE_50, ALIGN_PERCENTILE_95, or ALIGN_PERCENTILE_99
  • Alignment Period: 60s

For the ALIGN_PERCENTILE_50 selection, the following data was returned by the API method:

{
  "timeSeries": [
    {
      "metric": {
         [redacted]
      },
      "resource": {
         [redacted]
         }
      },
      "metricKind": "GAUGE",
      "valueType": "DOUBLE",
      "points": [
        {
          "interval": {
            "startTime": "2020-11-03T15:06:36Z",
            "endTime": "2020-11-03T15:06:36Z"
          },
          "value": {
            "doubleValue": 24.793256140799986
          }
        },
        {
          "interval": {
            "startTime": "2020-11-03T15:05:36Z",
            "endTime": "2020-11-03T15:05:36Z"
          },
          "value": {
            "doubleValue": 34.710558597119977
          }
        },
        {
          "interval": {
            "startTime": "2020-11-03T15:04:36Z",
            "endTime": "2020-11-03T15:04:36Z"
          },
          "value": {
            "doubleValue": 24.793256140799986
          }
        }
      ]
    },

After running the queries for the 95th and 99th percentile aligners, we have the following data:

Statistic         Sample @ 15:06   Sample @ 15:05   Sample @ 15:04
mean1             25.889           33.7435          Not available
50th percentile   24.79            34.71            24.79
95th percentile   28.51            39.91            28.51
99th percentile   28.84            40.37            28.84

1 The mean is reported with the bucket details.

The percentile values match what is expected. The 50th percentile values are all midpoints of an interval. Similarly, if you know that the 99th percentile is in bucket 10, then the 99th percentile value should be about 20.66 + (99 × 0.0827), or 28.84. This matches the reported value.
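
This arithmetic can be checked directly; each percentile contributes 1/100 of the bucket width:

```python
scale, growth, i = 1, 1.4, 10       # bucket 10 of the exponential model
lo = scale * growth ** (i - 1)      # lower bound, about 20.66
width = scale * growth ** i - lo    # about 8.27, so each percentile contributes about 0.0827

for p in (50, 95, 99):
    print(p, round(lo + p * width / 100, 2))
# 50 24.79, 95 28.51, 99 28.84 -- the values reported for the 15:06 sample
```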

In this data, the mean values for the data reported at times 15:05 and 15:06 are very similar to the 50th percentile values. That provides confidence that when the number of samples in the histogram is large enough, the 50th percentile is a reasonable estimate of the population mean.

Summary

This page provides an introduction to charting metrics with distribution values. For these metrics, each Point contains a time interval and a Distribution. The Distribution defines buckets and includes a histogram of the counts associated with each bucket.

You can plot metrics with distribution values by using a heatmap chart. If you want to use a line chart, then you must convert the histogram into a numerical value. One way to perform this conversion is to plot a specific percentile of the distribution.

The percentile values for distribution metrics are computed, and the algorithm depends on the bucket counts, the bucket widths, and the shape of the histogram:

  • The 50th, 95th, and 99th percentile values are always different; however, they might represent different percentile positions within the same bucket.
  • The percentiles aren't generated from measurements because these values aren't available.
  • The width of the bucket determines the maximum error between the computed percentile and the measurements.
  • The number of samples in a histogram is important. For example, if this number is less than 20, then the 95th and 99th percentiles are always in the same bucket.
  • For any distribution metric, you can use the Cloud Monitoring API to identify the bucket model used for that metric. Because the bucket model is stored with each data point, a service can change its bucket model over time.