Cloud Monitoring overview

This document provides an overview of the services that Cloud Monitoring provides. These services can help you to understand the behavior, health, and performance of your applications and of other Google Cloud services. Cloud Monitoring automatically collects and stores performance information for most Google Cloud services. You can collect Prometheus metrics by using Google Cloud Managed Service for Prometheus. If you install the Ops Agent on your Compute Engine virtual machines (VMs), then you can collect metrics and logs from your applications and from third-party applications.

The alerting, testing, and visualization services provided by Cloud Monitoring help you answer important questions like the following:

  • What is the load on my service?
  • Is my website responding correctly?
  • Is my service performing well?

Cloud Monitoring provides both Google Cloud console and API support for most of its services. Some services also support the Google Cloud CLI or Terraform. The Cloud Monitoring API reference pages, such as the page alertPolicies.list, let you experiment with API calls directly from the reference page.

Cloud Monitoring services

Cloud Monitoring provides different services that you can use to understand the health and performance of your applications, and of the other Google Cloud services that you use:

Alerts and notifications

To be notified when the value of a performance metric meets criteria that you define, create an alerting policy. The alerting policy includes the list of people or groups who are to receive notifications. Monitoring supports common notification channels, including email, Cloud Mobile App, and services such as PagerDuty or Slack. For example, you might create an alerting policy so that you are notified when the CPU utilization of a VM exceeds 80%.

Each notification includes relevant information about a failure, and it includes a link to an incident. An incident is a persistent record that stores information that you can use to troubleshoot the failure. Typically, a record lists the status of the incident, links to logs, a chart of the recorded metric data, labels, and duration.

The alerting service is integrated with many Google Cloud services. When these integrations exist, you might see a panel that lists recommended alerts, or you might see a button on a chart that lets you create an alerting policy. In both cases, the alerting policies are preconfigured; you only specify the list of people or groups to be notified.

You can create and manage alerting policies by using the Google Cloud console, the Cloud Monitoring API, the Google Cloud CLI, or Terraform.
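
As a rough illustration of the CPU utilization example above, the following Python sketch assembles a request body shaped like the AlertPolicy resource of the Cloud Monitoring API. The display names, the 60-second duration, and the notification channel name are hypothetical placeholders. The sketch only builds and prints JSON; it doesn't call the API.

```python
import json


def build_cpu_alert_policy(threshold, channel_names):
    """Return a dict shaped like an AlertPolicy request body.

    The display names and the 60s duration are illustrative assumptions.
    """
    condition = {
        "displayName": "VM CPU utilization above threshold",
        "conditionThreshold": {
            # Monitoring filter that selects the CPU utilization metric of VMs.
            "filter": (
                'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
                'AND resource.type = "gce_instance"'
            ),
            "comparison": "COMPARISON_GT",
            # This metric is reported as a fraction, so 80% is written as 0.8.
            "thresholdValue": threshold,
            "duration": "60s",
        },
    }
    return {
        "displayName": "High CPU utilization (example)",
        "combiner": "OR",
        "conditions": [condition],
        "notificationChannels": list(channel_names),
    }


policy = build_cpu_alert_policy(
    0.8,
    # Hypothetical channel resource name; replace it with a real one.
    ["projects/sampleproject/notificationChannels/CHANNEL_ID"],
)
print(json.dumps(policy, indent=2))
```

A body like this could be sent with any of the supported interfaces; check the field names against the API reference before you rely on them.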

Proactive monitoring and validation

To test the availability, consistency, and performance of your services, applications, web pages, and APIs, create synthetic monitors. For example, you can probe HTTP, HTTPS, and TCP endpoints for responsiveness with uptime checks, and then get notified when an endpoint fails to respond. You can also create a broken-link checker to crawl a web page and then notify you when broken links are detected.

You can create and manage synthetic monitors by using the Google Cloud console, the Cloud Monitoring API, the Google Cloud CLI, or Terraform.
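
For orientation, the following Python sketch builds a dictionary shaped like an UptimeCheckConfig for an HTTPS endpoint. The host name, path, project ID, check period, and timeout are assumptions chosen for illustration. The sketch doesn't contact Cloud Monitoring or the endpoint.

```python
import json


def build_uptime_check(host, path="/", project_id="sampleproject"):
    """Return a dict shaped like an UptimeCheckConfig for an HTTPS endpoint."""
    return {
        "displayName": f"HTTPS check for {host}",
        # The uptime_url monitored resource identifies the endpoint to probe.
        "monitoredResource": {
            "type": "uptime_url",
            "labels": {"project_id": project_id, "host": host},
        },
        "httpCheck": {"path": path, "port": 443, "useSsl": True},
        # Assumed values: probe every 60 seconds, fail after 10 seconds.
        "period": "60s",
        "timeout": "10s",
    }


check = build_uptime_check("www.example.com", path="/healthz")
print(json.dumps(check, indent=2))
```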

Data visualization

To see trends, identify outliers, and view other details about your data, use the dashboard and charting services:

  • The dashboard service automatically creates a Google Cloud-managed dashboard when you add a resource to your Google Cloud project. For example, a dashboard is created when you create a Compute Engine instance, an alerting policy, or a synthetic monitor. You can use these dashboards to view performance and configuration information, such as the disk usage or the IP address, and to view events.

    To control what data you view and the display format for that data, create a custom dashboard. For example, you might import a Grafana dashboard or install a dashboard from a template. Your custom dashboards can display charts, tables, logs and error groups, alerting policies and incidents, and text. You can also share custom dashboards with people or groups in your organization, and configure these dashboards to display events.

    You can create and manage dashboards by using the Google Cloud console or the API.

  • The chart service, Metrics Explorer, lets you quickly visualize and explore time-series data. The chart settings let you compare current data to previous data, display outliers and percentiles, and display multiple metrics. You can also save charts to a custom dashboard.

Data collection and storage

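Cloud Monitoring collects and stores several types of metric data. These include the performance metrics that it collects automatically for most Google Cloud services, Prometheus metrics collected by Google Cloud Managed Service for Prometheus, and metrics collected by the Ops Agent on your Compute Engine VMs.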

Query languages

When you create an alerting policy or a chart, you must provide a query that describes the data that you want to monitor or chart:

  • Google Cloud console: You can build your query by making selections from menus, or you can write a query. Query editors are available for the Prometheus Query Language (PromQL) and the Monitoring Query Language (MQL). The query editors provide syntax checks and suggestions. You can also write a Monitoring filter expression.

  • Cloud Monitoring API: The API supports PromQL queries, MQL queries, and Monitoring filter expressions.
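
To make the idea of a Monitoring filter expression concrete, here is a minimal Python sketch that assembles one from its parts. The helper name build_monitoring_filter is hypothetical and isn't part of any client library; it only joins comparisons with AND.

```python
def build_monitoring_filter(metric_type, resource_type=None, resource_labels=None):
    """Join metric, resource type, and resource label comparisons with AND."""
    parts = [f'metric.type = "{metric_type}"']
    if resource_type:
        parts.append(f'resource.type = "{resource_type}"')
    # Sort the labels so that the generated filter is deterministic.
    for key, value in sorted((resource_labels or {}).items()):
        parts.append(f'resource.labels.{key} = "{value}"')
    return " AND ".join(parts)


cpu_filter = build_monitoring_filter(
    "compute.googleapis.com/instance/cpu/utilization",
    resource_type="gce_instance",
    resource_labels={"zone": "us-east1-b"},
)
print(cpu_filter)
```

The resulting string selects the CPU utilization time series of the Compute Engine instances in one zone.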

Monitor large systems

This section describes how you can manage resources as a collection and how you can monitor metrics that are stored in multiple Google Cloud projects.

Manage resources as a collection

To manage your resources as a collection instead of individually, create a resource group. A resource group is a dynamic collection of resources that satisfy some criteria that you provide. As you add and remove resources, for example by adding Compute Engine VM instances to your Google Cloud project, the membership in the group automatically changes. The following are examples of resource groups:

  • Compute Engine instances whose names start with the string prod-.
  • Resources with the tag test-cluster.
  • Amazon EC2 instances in region A or region B.

After you define a resource group, you can monitor the group as if it were a single resource. For example, you can configure an uptime check to monitor a resource group. For charts and alerting policies, you can also filter based on the group name.
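
The dynamic membership described above can be pictured with a small Python simulation. This is a conceptual sketch only: the Resource class, the inventory list, and the helper functions are hypothetical and don't reflect how Cloud Monitoring stores or evaluates groups.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    tags: set = field(default_factory=set)


def group_members(resources, criterion):
    """Return the names of the resources that currently satisfy the criterion."""
    return sorted(r.name for r in resources if criterion(r))


def is_prod(resource):
    """Criterion: the resource name starts with the string prod-."""
    return resource.name.startswith("prod-")


inventory = [Resource("prod-web-1"), Resource("test-db-1", {"test-cluster"})]
print(group_members(inventory, is_prod))

# Adding a matching resource changes the membership without redefining the group.
inventory.append(Resource("prod-web-2"))
print(group_members(inventory, is_prod))
```

Because membership is recomputed from the criterion, nothing about the group itself has to change when resources are added or removed.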

For more information, see Configure resource groups.

Monitor metrics for multiple Google Cloud projects

To view and monitor the time-series data for multiple Google Cloud projects and AWS accounts through a single interface, configure a multi-project metrics scope.

By default, Cloud Monitoring pages in the Google Cloud console provide access only to the time series stored in the scoping project. The scoping project is the project that you selected with the Google Cloud console project picker. The scoping project stores the alerts, synthetic monitors, dashboards, and monitoring groups that you configure.

The scoping project also hosts a metrics scope. The metrics scope defines the projects and accounts whose metrics are visible to the scoping project. You can configure the metrics scope to include time-series data from other Google Cloud projects and from AWS accounts. For information about how to modify a metrics scope, see Configure a metrics scope for multiple projects.

Cloud Monitoring data model

This section introduces the Cloud Monitoring data model:

  • A metric describes something that is measured. Examples of metrics include a VM's CPU utilization and the percentage of a disk that is used.

  • A time series is a data structure that contains time-stamped measurements of a metric and information about the source and meaning of those measurements.

For example, the following illustrates a time series:

  "timeSeries": [
    {
      "points": [
        {
          "interval": {
            "startTime": "2020-07-27T20:20:21.597143Z",
            "endTime": "2020-07-27T20:20:21.597143Z"
          },
          "value": {
            "doubleValue": 0.473005
          }
        },
        {
          "interval": {
            "startTime": "2020-07-27T20:19:21.597239Z",
            "endTime": "2020-07-27T20:19:21.597239Z"
          },
          "value": {
            "doubleValue": 0.473025
          }
        }
      ],
      "resource": {
        "type": "gce_instance",
        "labels": {
          "instance_id": "2708613220420473591",
          "zone": "us-east1-b",
          "project_id": "sampleproject"
        }
      },
      "metric": {
        "labels": {
          "device": "sda1",
          "state": "free"
        },
        "type": "agent.googleapis.com/disk/percent_used"
      },
      "metricKind": "GAUGE",
      "valueType": "DOUBLE"
    }
  ]

Here are some details about what a time series contains:

  • The points array contains the time-stamped measurements.

    In the previous example, the points array contains two values:

      "points": [
        {
          "interval": {
            "startTime": "2020-07-27T20:20:21.597143Z",
            "endTime": "2020-07-27T20:20:21.597143Z"
          },
          "value": {
            "doubleValue": 0.473005
          }
        },
        {
          "interval": {
            "startTime": "2020-07-27T20:19:21.597239Z",
            "endTime": "2020-07-27T20:19:21.597239Z"
          },
          "value": {
            "doubleValue": 0.473025
          }
        }
      ],
    

    To understand the meaning of a value, you need to refer to the other data included in the time series and to the definitions of that data.

  • The resource field describes the hardware or software component that is being monitored. In Cloud Monitoring, the hardware or software component is referred to as the monitored resource. Examples of monitored resources include Compute Engine instances and App Engine applications. For a list of monitored resources, see the Monitored resource list.

    In the previous example, the resource field is as shown:

      "resource": {
        "type": "gce_instance",
        "labels": {
          "instance_id": "2708613220420473591",
          "zone": "us-east1-b",
          "project_id": "sampleproject"
        }
      },
    
    • The type field lists the monitored resource as a gce_instance, which indicates that these measurements are taken on a Compute Engine VM instance.

    • The labels field contains key-value pairs that provide additional information about the monitored resource. For a gce_instance type, the labels identify the VM instance that is being monitored.

  • The metric field describes what is being measured.

    In the previous example, the metric field is as shown:

      "metric": {
        "labels": {
          "device": "sda1",
          "state": "free"
        },
        "type": "agent.googleapis.com/disk/percent_used"
      },
    
    • For Google services, the type field specifies the service and what is being monitored. In this example, the Cloud Monitoring agent is the service, and it's measuring the percentage of the disk that is used. When the type field begins with custom or external, the metric is either a custom metric or one defined by a third party.

    • The labels field contains key-value pairs that provide additional information about the measurement. These labels are defined as part of the MetricDescriptor, which is a data structure that defines the attributes of the measured data. The MetricDescriptor for the metric agent.googleapis.com/disk/percent_used includes the labels device and state.

  • The metricKind field describes the relationship between adjacent measurements within a time series:

    • GAUGE metrics store the value of the thing being measured at a given moment in time—for example, an hourly temperature record.

    • CUMULATIVE metrics store the accumulated value of the thing being measured at a given moment in time—for example, an odometer in a vehicle.

    • DELTA metrics store the change in the value of the thing being measured over a specified period—for example, a stock summary that shows the stock's gains or losses.

  • The valueType field describes the data type for the measurement: INT64, DOUBLE, BOOL, STRING, or DISTRIBUTION.
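
The relationship between CUMULATIVE and DELTA values can be shown with the odometer analogy. The following Python sketch is an illustration with made-up readings, not how Cloud Monitoring converts metric kinds internally; for simplicity, it ignores counter resets.

```python
def cumulative_to_delta(readings):
    """Convert CUMULATIVE-style readings into DELTA-style changes.

    Each result is the difference between two adjacent readings.
    """
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


# Hypothetical odometer readings, in kilometers, taken at regular intervals.
odometer_km = [1200, 1250, 1250, 1310]
print(cumulative_to_delta(odometer_km))  # prints [50, 0, 60]
```

A GAUGE value, by contrast, stands on its own: each hourly temperature reading is meaningful without reference to the readings around it.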

Cloud Monitoring writes one time series for each combination of resource and metric label values. You can use these labels to group and to filter time series. For example, when a Google Cloud project contains multiple Compute Engine VM instances, the CPU utilization for each VM instance is a unique time series. Here are a few of the ways that you can display this data:

  • You can show the CPU utilization of every VM instance.
  • You can show the CPU utilization for a specific VM instance by filtering the time series for a single value of the instance_id label.
  • You can group the VM instances by the machine_type label, and then display the average CPU utilization. The following screenshot illustrates a chart with this configuration:

    Average CPU utilization grouped by machine type.
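
The filtering and grouping options in the preceding list can be sketched in Python. The instance names, machine types, and utilization values below are invented, and each time series is reduced to a single value to keep the example short.

```python
from collections import defaultdict
from statistics import mean

# One simplified entry per VM instance: its labels and its latest CPU utilization.
series = [
    {"labels": {"instance_id": "vm-1", "machine_type": "e2-medium"}, "value": 0.25},
    {"labels": {"instance_id": "vm-2", "machine_type": "e2-medium"}, "value": 0.75},
    {"labels": {"instance_id": "vm-3", "machine_type": "n1-standard-4"}, "value": 0.5},
]


def filter_by_label(entries, key, wanted):
    """Keep only the entries whose label `key` equals `wanted`."""
    return [e for e in entries if e["labels"].get(key) == wanted]


def average_by_label(entries, key):
    """Group the entries by the label `key` and average each group's values."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["labels"][key]].append(entry["value"])
    return {label: mean(values) for label, values in groups.items()}


print(filter_by_label(series, "instance_id", "vm-2"))
print(average_by_label(series, "machine_type"))
```

Filtering narrows the display to one time series, while grouping combines several time series into one line per label value.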

Pricing

In general, Cloud Monitoring system metrics are free, and metrics from external systems, agents, or applications are not. Billable metrics are billed by either the number of bytes or the number of samples ingested.

For more information, see the Cloud Monitoring pricing documentation.
