User-defined metrics overview

User-defined metrics are all metrics that aren't defined by Google Cloud. These include metrics that you might define, and they include metrics that a third-party application defines. User-defined metrics let you capture application-specific data or client-side system data. The built-in metrics collected by Cloud Monitoring can give you information on backend latency or disk usage, but they can't tell you, for example, how many background routines your application spawned.

You can also create metrics that are based on the content of log entries. Log-based metrics are a class of user-defined metric, but you must create them from Cloud Logging. For more information about log-based metrics, see Log-based metrics overview.

User-defined metrics are sometimes called custom metrics or application-specific metrics. These metrics let you, or a third-party application, define and collect information the built-in Cloud Monitoring metrics cannot. You capture such metrics by using an API provided by a library to instrument your code, and then you send the metrics to a backend application like Cloud Monitoring.

You can create user-defined metrics, except log-based metrics, by using the Cloud Monitoring API directly. However, we recommend that you use OpenTelemetry. For information about how to create user-defined metrics, see the following documents:

  • Collect OTLP metrics and traces describes how to use the Ops Agent and the agent's OpenTelemetry Protocol (OTLP) receiver to collect metrics and traces from applications instrumented by using OpenTelemetry and running on Compute Engine.

  • Google Cloud Managed Service for Prometheus describes how to collect Prometheus metrics from applications running on Google Kubernetes Engine and Kubernetes.

  • Collect Prometheus metrics describes how to use the Ops Agent to collect Prometheus metrics from applications running on Compute Engine.

  • Create user-defined metrics with the API describes how to create metrics by using the Cloud Monitoring API and how to add metric data to those metrics. This document illustrates how to use the Monitoring API with examples that use the APIs Explorer and the C#, Go, Java, Node.js, PHP, Python, and Ruby programming languages.

  • Create custom metrics on Cloud Run shows how to use the OpenTelemetry Collector as a sidecar agent in Cloud Run deployments.

Cloud Monitoring treats user-defined metrics like built-in metrics. You can chart them, set alerts on them, read them, and otherwise monitor them. For information about reading metric data, see the following documents:

  • List metric and resource types explains how to list and examine your user-defined and built-in metric types. For example, you can use the information in that document to list all user-defined metric descriptors in your project.
  • Retrieve time-series data explains how to retrieve time series data from metrics using the Monitoring API. For example, this document describes how you can use the API to get the CPU utilization for virtual machine (VM) instances in your Google Cloud project.

The Google Cloud console provides a dedicated page to help you view your usage of user-defined metrics. For information about the contents of this page, see View and manage metric usage.

Metric descriptors for user-defined metrics

Each metric type must have a metric descriptor that defines how the metric data is organized. The metric descriptor also defines the labels for the metric and the name of the metric. For example, the metrics lists show the metric descriptors for all built-in metric types.

Cloud Monitoring can create the metric descriptor for you, by using the metric data you write, or you can explicitly create the metric descriptor, and then write metric data. In either case, you must decide how you want to organize your metric data.
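
As a rough illustration of what "organizing your metric data" means, the following sketch builds a plain Python dictionary shaped like the JSON body of a metric descriptor. The metric type custom.googleapis.com/background_routines, the label key routine_kind, and the helper make_metric_descriptor are hypothetical names chosen for this example, and the choice of a GAUGE metric with INT64 values is an assumption, not a recommendation. The sketch doesn't call the Monitoring API.

```python
def make_metric_descriptor(project_id, metric_type, label_keys):
    """Return a dict shaped like a Monitoring API metric descriptor.

    Hypothetical helper for illustration; it only builds the dict.
    """
    return {
        # Resource name of the metric type within the project.
        "name": f"projects/{project_id}/metricDescriptors/{metric_type}",
        # String identifier of the metric type.
        "type": metric_type,
        # Assumption: the value is measured at a point in time.
        "metricKind": "GAUGE",
        # Assumption: the value is an integer count.
        "valueType": "INT64",
        "description": "Number of background routines currently running.",
        # One entry per label that distinguishes time series of this metric.
        "labels": [{"key": key, "valueType": "STRING"} for key in label_keys],
    }


descriptor = make_metric_descriptor(
    "my-project-id",
    "custom.googleapis.com/background_routines",  # hypothetical metric type
    ["routine_kind"],  # hypothetical label key
)
```

Whether you create the descriptor explicitly or let Monitoring derive it from the data you write, the same decisions apply: the kind of metric, the type of its values, and its labels.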

Design example

Suppose you have a program that runs on a single machine, and that this program calls auxiliary programs A and B. You want to count how often programs A and B are called. You also want to know when program A is called more than 10 times per minute and when program B is called more than 5 times per minute. Lastly, assume that you have a single Google Cloud project and you plan to write the data against the global monitored resource.

This example describes a few different designs that you could use for your user-defined metrics:

  • You use two metrics: Metric-type-A counts calls to program A and Metric-type-B counts calls to program B. In this case, Metric-type-A contains 1 time series, and Metric-type-B contains 1 time series.

    With this data model, you can create a single alerting policy with two conditions, or you can create two alerting policies, each with one condition. An alerting policy can support multiple conditions, but it has a single configuration for the notification channels.

    This model might be appropriate when you aren't interested in similarities in the data between the activities being monitored. In this example, the activities are the rate of calls to programs A and B.

  • You use a single metric and use a label to store a program identifier. For example, the label might store the value A or B. Monitoring creates a time series for each unique combination of labels. Therefore, there is a time series whose label value is A and another time series whose label value is B.

    As with the previous model, you can create a single alerting policy or two alerting policies. However, the conditions for the alerting policy are more complicated. A condition that generates an incident when the rate of calls for program A exceeds a threshold must use a filter that includes only data points whose label value is A.

    One advantage of this model is that it is simple to compute ratios. For example, you can determine how much of the total is due to calls to A.

  • You use a single metric to count the number of calls, but you don't use a label to record which program was called. In this model, there is a single time series that combines the data for the two programs. However, you can't create an alerting policy that meets your objectives because the data for the two programs can't be separated.

The first two designs let you meet your data analysis requirements; however, the last design doesn't.
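
The difference between the second and third designs can be sketched with a small local simulation. The code below doesn't call Cloud Monitoring; it only mimics the rule that each unique combination of label values yields its own time series. The metric type custom.googleapis.com/program_calls, the label key program, and the helper names are hypothetical, and the filter string is an assumed form of a Monitoring filter for a condition on program A.

```python
from collections import defaultdict

# Hypothetical metric type used only for this illustration.
METRIC_TYPE = "custom.googleapis.com/program_calls"


def group_into_time_series(points):
    """Group (labels, value) points so each unique label set is one series."""
    series = defaultdict(list)
    for labels, value in points:
        key = tuple(sorted(labels.items()))
        series[key].append(value)
    return dict(series)


def condition_filter(metric_type, program):
    """Build a filter that selects only the series for one program."""
    return f'metric.type = "{metric_type}" AND metric.labels.program = "{program}"'


# Second design: one metric, with a label that identifies the program.
labeled_points = [({"program": "A"}, 1), ({"program": "B"}, 1), ({"program": "A"}, 1)]
# Third design: one metric, no label.
unlabeled_points = [({}, 1), ({}, 1), ({}, 1)]

labeled_series = group_into_time_series(labeled_points)
unlabeled_series = group_into_time_series(unlabeled_points)

# With the label, the share of calls due to program A is easy to compute.
calls_to_a = sum(labeled_series[(("program", "A"),)])
total_calls = sum(sum(values) for values in labeled_series.values())
share_of_a = calls_to_a / total_calls

filter_for_a = condition_filter(METRIC_TYPE, "A")
```

In this simulation, the labeled points fall into two series and the unlabeled points collapse into one, which is why the third design can't support separate thresholds for programs A and B.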

For more information, see Create a user-defined metric.

Names of user-defined metrics

When you create a user-defined metric, you define a string identifier that represents the metric type. This string must be unique among the user-defined metrics in your Google Cloud project and it must use a prefix that marks the metric as a user-defined metric. For Monitoring, the allowable prefixes are custom.googleapis.com/, workload.googleapis.com/, external.googleapis.com/user, and external.googleapis.com/prometheus. The prefix is followed by a name that describes what you are collecting. For details on the recommended way to name a metric, see Metric naming conventions. Here are examples of the two kinds of identifiers for metric types:

    custom.googleapis.com/cpu_utilization
    custom.googleapis.com/instance/cpu/utilization

In the previous example, the prefix custom.googleapis.com indicates that both metrics are user-defined metrics. Both examples are for metrics that measure the CPU utilization; however, they use different organizational models. When you anticipate having a large number of user-defined metrics, we recommend that you use a hierarchical naming structure like that used by the second example.

All metric types have globally unique identifiers called resource names. The structure of a resource name for a metric type is:

projects/PROJECT_ID/metricDescriptors/METRIC_TYPE

where METRIC_TYPE is the string identifier of the metric type. If the previous metric examples are created in the project my-project-id, then their resource names are the following:

    projects/my-project-id/metricDescriptors/custom.googleapis.com/cpu_utilization
    projects/my-project-id/metricDescriptors/custom.googleapis.com/instance/cpu/utilization

Name or type? In the metric descriptor, the name field stores the metric type's resource name and the type field stores the METRIC_TYPE string.
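
The relationship between the two fields can be sketched as follows. This is a minimal illustration in plain Python, not part of any client library: the helper names metric_resource_name and descriptor_identity are hypothetical, and the prefix check simply mirrors the list of allowable prefixes given earlier in this section.

```python
# Prefixes that mark a metric type as user-defined, as listed in this section.
USER_DEFINED_PREFIXES = (
    "custom.googleapis.com/",
    "workload.googleapis.com/",
    "external.googleapis.com/user",
    "external.googleapis.com/prometheus",
)


def metric_resource_name(project_id, metric_type):
    """Return the resource name for a user-defined metric type."""
    if not metric_type.startswith(USER_DEFINED_PREFIXES):
        raise ValueError(f"not a user-defined metric type: {metric_type}")
    return f"projects/{project_id}/metricDescriptors/{metric_type}"


def descriptor_identity(project_id, metric_type):
    """Return the name and type fields as they appear in a metric descriptor."""
    return {
        "name": metric_resource_name(project_id, metric_type),
        "type": metric_type,
    }


identity = descriptor_identity("my-project-id", "custom.googleapis.com/cpu_utilization")
```

Here, identity["type"] holds the METRIC_TYPE string and identity["name"] holds the full resource name, which includes the project.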

Monitored-resource types for user-defined metrics

When you write your data to a time series, you must indicate where the data came from. To specify the source of the data, you choose a monitored-resource type that represents where your data comes from, and then use that to describe the specific origin. The monitored resource isn't part of the metric type. Instead, the time series to which you write data includes a reference to the metric type and a reference to the monitored resource. The metric type describes the data while the monitored resource describes where the data originated.

Consider the monitored resource before creating your metric descriptor. The monitored-resource type you use affects which labels you need to include in the metric descriptor. For example, the Compute Engine VM resource contains labels for the project ID, the instance ID, and the instance zone. Therefore, if you plan to write your metric against a Compute Engine VM resource, then the resource labels include the instance ID so you don't need a label for the instance ID in the metric descriptor.

Each of your metric's data points must be associated with a monitored resource object. Points from different monitored-resource objects are held in different time series.
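
To make the split between metric and monitored resource concrete, the following sketch builds a dictionary shaped like the JSON of a single time series written against a Compute Engine VM. It is an illustration under assumptions: the helper build_time_series, the metric type, the label values, and the timestamp are hypothetical, and the code only constructs the data without sending it anywhere.

```python
def build_time_series(metric_type, metric_labels, resource_type, resource_labels,
                      value, end_time):
    """Return a dict shaped like one time series with a single data point."""
    return {
        # What the data describes.
        "metric": {"type": metric_type, "labels": metric_labels},
        # Where the data came from.
        "resource": {"type": resource_type, "labels": resource_labels},
        "points": [
            {"interval": {"endTime": end_time}, "value": {"doubleValue": value}}
        ],
    }


series = build_time_series(
    metric_type="custom.googleapis.com/instance/cpu/utilization",
    metric_labels={},  # no instance label needed on the metric itself
    resource_type="gce_instance",
    resource_labels={
        "project_id": "my-project-id",
        "instance_id": "1234567890123456789",  # hypothetical instance ID
        "zone": "us-central1-a",               # hypothetical zone
    },
    value=0.42,
    end_time="2024-01-01T00:00:00Z",  # hypothetical timestamp
)
```

Because the instance ID travels with the monitored resource, a second VM that writes the same metric type produces a separate time series without any extra metric label.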

You must use one of the following monitored resource types with user-defined metrics:

A common practice is to use the monitored resource objects that represent the physical resources where your application code is running. This approach has several advantages:

  • You get better performance compared with using a single resource type.
  • You avoid out-of-order data caused by multiple processes writing to the same time series.
  • You can group your user-defined-metric data with other metric data from the same resources.

global and generic resources

The generic_task and generic_node resource types are useful in situations where none of the more specific resource types are appropriate. The generic_task type is useful for defining task-like resources such as applications. The generic_node type is useful for defining node-like resources such as virtual machines. Both generic_* types have several common labels you can use to define unique resource objects, making it easy to use them in metric filters for aggregations and reductions.

In contrast, the global resource type has only the project_id label. When you have many sources of metrics within a project, using the same global resource object can cause collisions and overwrites of your metric data.
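
The collision risk can be illustrated by comparing resource objects directly. In this sketch, two workers in the same project describe themselves first with the global type and then with generic_task. The generic_task label keys shown (project_id, location, namespace, job, task_id) reflect our understanding of that resource type and should be checked against the monitored-resource list; the label values and the helper resource_identity are hypothetical.

```python
def resource_identity(resource):
    """Return a hashable identity for a monitored-resource object."""
    return (resource["type"], tuple(sorted(resource["labels"].items())))


def generic_task(task_id):
    """Build a generic_task resource for one worker (hypothetical values)."""
    return {
        "type": "generic_task",
        "labels": {
            "project_id": "my-project-id",
            "location": "us-central1",
            "namespace": "batch",
            "job": "worker",
            "task_id": task_id,
        },
    }


# Both workers end up with the same global resource object.
global_worker_0 = {"type": "global", "labels": {"project_id": "my-project-id"}}
global_worker_1 = {"type": "global", "labels": {"project_id": "my-project-id"}}

# With generic_task, the task_id label keeps the two workers apart.
task_worker_0 = generic_task("worker-0")
task_worker_1 = generic_task("worker-1")

global_collides = resource_identity(global_worker_0) == resource_identity(global_worker_1)
tasks_collide = resource_identity(task_worker_0) == resource_identity(task_worker_1)
```

With the global type, both workers would write to the same time series for a given metric and label set; with generic_task, each worker gets its own.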

API methods that support user-defined metrics

The following table shows which methods in the Monitoring API support user-defined metrics and which methods support built-in metrics:

Monitoring API method                Use with user-defined metrics    Use with built-in metrics
monitoredResourceDescriptors.get     yes                              yes
monitoredResourceDescriptors.list    yes                              yes
metricDescriptors.get                yes                              yes
metricDescriptors.list               yes                              yes
timeSeries.list                      yes                              yes
timeSeries.create                    yes                              no
metricDescriptors.create             yes                              no
metricDescriptors.delete             yes                              no

Limits and latencies

For limits related to user-defined metrics and data retention, see Quotas and limits.

To keep your metric data beyond the retention period, you must manually copy the data to another location, such as Cloud Storage or BigQuery.

For information about latencies associated with writing data to user-defined metrics, see Latency of metric data.

What's next