Introduction to Cloud Monitoring

This page provides a brief overview of Cloud Monitoring's tools and data model. By using Cloud Monitoring, you can answer important questions like the following:

  • Is my service healthy?
  • What is the load on my service?
  • Is my website accessible and responding correctly?
  • Is my service performing well?

This page is intended for anyone who is new to Cloud Monitoring and who needs to be able to monitor the performance of a service or system.

Overview of Cloud Monitoring

Cloud Monitoring collects measurements of your service and of the Google Cloud resources that you use. This section provides an overview of the Cloud Monitoring tools that you can use to visualize and monitor these measurements.

Alerting policies and uptime checks

If you want to be notified when a service isn't healthy or when its performance metrics don't meet criteria you define, then create an alerting policy. For example, you can create an alerting policy that notifies your on-call team if the 90th percentile of the latency of HTTP 200 responses from your service exceeds 100ms.
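As a rough illustration of the condition in that example, the following Python sketch computes a 90th percentile over a list of latency samples and compares it with a 100 ms threshold. The function names, the nearest-rank percentile method, and the sample values are assumptions made for illustration. Cloud Monitoring evaluates alerting policies itself, and this isn't its implementation.

```python
import math


def percentile(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_condition_met(latencies_ms, threshold_ms=100.0, pct=90):
    """Return True if the chosen percentile of the latencies exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms


# Hypothetical latencies, in milliseconds, of HTTP 200 responses.
samples = [42, 55, 61, 70, 73, 80, 88, 95, 120, 240]

print(percentile(samples, 90))         # 120
print(latency_condition_met(samples))  # True
```

With these samples, the ninth of the ten sorted values is the 90th percentile, so the condition is met even though most responses are faster than 100 ms.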

If you want to be notified if a deployed service isn't accessible or if it isn't responding correctly, then configure an uptime check and attach an alerting policy:

  • The uptime check periodically probes your service and stores the success and latency of that probe as metric data.
  • The alerting policy monitors the success status of the uptime check and notifies you if a probe fails.

Charts and dashboards

If you are interested in understanding the current load on a service, or if you want to view the performance data of your service for the past month, then use the charts and dashboards tools. Cloud Monitoring populates dashboards for you based on the services and resources your service uses; however, you can also create custom dashboards to chart data, display indicators, or display text.

You can chart and monitor any (numeric) metric data that is collected by your Google Cloud project:

  • System metrics generated by Google Cloud services. These metrics provide information about how the service is operating. For example, Compute Engine reports more than 25 unique metrics for each virtual machine (VM) instance. For a complete list of metrics, see Google Cloud metrics.

  • System and application metrics the Cloud Monitoring agent gathers. These metrics provide additional information about system resources and applications running on Compute Engine instances and on Amazon Elastic Compute Cloud (Amazon EC2) instances. Optionally, you can configure the agent to collect metrics from third-party plugins. This information might include metrics about your Apache or Nginx web servers, or metrics about your MongoDB or PostgreSQL databases.

  • Custom metrics that your service writes by using the Cloud Monitoring API or by using a library like OpenCensus.

  • Logs-based metrics, which collect numeric information about the logs written to Cloud Logging, are defined by you or by Google. Google-defined logs-based metrics include counts of errors that your service detects and the total number of log entries received by your Google Cloud project. You can also define custom logs-based metrics. For example, you might count the number of log entries that match a given query.
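To make the idea of a custom logs-based metric concrete, here is a minimal Python sketch that counts the log entries matching a query. The entry fields (severity, message) and the predicate that stands in for the query are hypothetical. Real entries in Cloud Logging carry more fields, and real queries use the Logging query language.

```python
def count_matching(entries, matches):
    """Count the log entries for which the predicate `matches` returns True."""
    return sum(1 for entry in entries if matches(entry))


# Hypothetical log entries, reduced to two fields for illustration.
log_entries = [
    {"severity": "INFO", "message": "request served"},
    {"severity": "ERROR", "message": "upstream timeout"},
    {"severity": "ERROR", "message": "disk full"},
    {"severity": "WARNING", "message": "slow response"},
]

# The predicate plays the role of the query that defines the metric.
error_count = count_matching(log_entries, lambda e: e["severity"] == "ERROR")
print(error_count)  # 2
```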

Understanding metrics and time series

This section introduces the Cloud Monitoring data model:

  • A metric describes something that is measured. Examples of metrics include a VM's CPU utilization and the percentage of a disk that is used.

  • A time series is a data structure that contains time-stamped measurements of a metric and information about the source and meaning of those measurements.

For example, the following illustrates a time series:

  "timeSeries": [
    {
      "points": [
        {
          "interval": {
            "startTime": "2020-07-27T20:20:21.597143Z",
            "endTime": "2020-07-27T20:20:21.597143Z"
          },
          "value": {
            "doubleValue": 0.473005
          }
        },
        {
          "interval": {
            "startTime": "2020-07-27T20:19:21.597239Z",
            "endTime": "2020-07-27T20:19:21.597239Z"
          },
          "value": {
            "doubleValue": 0.473025
          }
        }
      ],
      "resource": {
        "type": "gce_instance",
        "labels": {
          "instance_id": "2708613220420473591",
          "zone": "us-east1-b",
          "project_id": "sampleproject"
        }
      },
      "metric": {
        "labels": {
          "device": "sda1",
          "state": "free"
        },
        "type": "agent.googleapis.com/disk/percent_used"
      },
      "metricKind": "GAUGE",
      "valueType": "DOUBLE"
    }
  ]

Here are some details about what a time series contains:

  • The points array contains the timestamped measurements.

    In the previous example, the points array contains two values:

      "points": [
        {
          "interval": {
            "startTime": "2020-07-27T20:20:21.597143Z",
            "endTime": "2020-07-27T20:20:21.597143Z"
          },
          "value": {
            "doubleValue": 0.473005
          }
        },
        {
          "interval": {
            "startTime": "2020-07-27T20:19:21.597239Z",
            "endTime": "2020-07-27T20:19:21.597239Z"
          },
          "value": {
            "doubleValue": 0.473025
          }
        }
      ],
    

    To understand the meaning of a value, you need to refer to the other data included in the time series and to the definitions of that data.

  • The resource field describes the hardware or software component that is being monitored. In Cloud Monitoring, the hardware or software component is referred to as the monitored resource. Examples of monitored resources include Compute Engine instances and App Engine applications. For a complete list of monitored resources, see the Monitored resource list.

    In the previous example, the resource field is as shown:

      "resource": {
        "type": "gce_instance",
        "labels": {
          "instance_id": "2708613220420473591",
          "zone": "us-east1-b",
          "project_id": "sampleproject"
        }
      },
    

    The type sub-field lists the monitored resource as a gce_instance, which indicates that these are measurements taken on a Compute Engine VM instance. The labels sub-field contains key-value pairs that provide additional information about the monitored resource. For a gce_instance, the labels identify the specific VM instance being monitored.

  • The metric field describes what is being measured.

    In the previous example, the metric field is as shown:

      "metric": {
        "labels": {
          "device": "sda1",
          "state": "free"
        },
        "type": "agent.googleapis.com/disk/percent_used"
      },
    

    For Google services, the type field specifies the service and what is being monitored. In this example, the Cloud Monitoring agent is the service, and it's measuring the percentage of the disk that is used. If the type field begins with custom or external, then the metric is either a custom metric or one defined by a third party.

    The labels field contains key-value pairs that provide additional information about the measurement. These labels are defined as part of the MetricDescriptor, which is a data structure that defines the attributes of the measured data. The MetricDescriptor for the metric agent.googleapis.com/disk/percent_used includes two labels:

    • device: The device name, which is "sda1" for the example.
    • state: The type of usage, which must be one of "free", "used", or "reserved". In the example, the value is "free".

  • The metricKind field describes the relationship between adjacent measurements within a time series:

    • GAUGE metrics store the value of the thing being measured at a given moment in time. An analogy is an hourly temperature record.

    • CUMULATIVE metrics store the accumulated value of the thing being measured at a given moment in time. An analogy is an odometer in a vehicle.

    • DELTA metrics store the change in the value of the thing being measured over a specified period of time. An example is a daily stock summary which shows the gains, or the losses, of a stock.

  • The valueType field describes the data type for the measurement: INT64, DOUBLE, BOOL, STRING, or DISTRIBUTION.
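The following Python sketch shows one way to read the fields described above from the example time series. It also shows how CUMULATIVE readings relate to DELTA values, using the odometer analogy. The helper names (parse_time, summarize, cumulative_to_delta) and the odometer readings are hypothetical. The JSON is the example from this page written as a single valid object.

```python
import json
from datetime import datetime

# The example time series from this page, written as valid JSON.
RAW = """
{
  "points": [
    {"interval": {"startTime": "2020-07-27T20:20:21.597143Z",
                  "endTime": "2020-07-27T20:20:21.597143Z"},
     "value": {"doubleValue": 0.473005}},
    {"interval": {"startTime": "2020-07-27T20:19:21.597239Z",
                  "endTime": "2020-07-27T20:19:21.597239Z"},
     "value": {"doubleValue": 0.473025}}
  ],
  "resource": {"type": "gce_instance",
               "labels": {"instance_id": "2708613220420473591",
                          "zone": "us-east1-b",
                          "project_id": "sampleproject"}},
  "metric": {"labels": {"device": "sda1", "state": "free"},
             "type": "agent.googleapis.com/disk/percent_used"},
  "metricKind": "GAUGE",
  "valueType": "DOUBLE"
}
"""


def parse_time(text):
    """Parse a UTC timestamp such as 2020-07-27T20:20:21.597143Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%fZ")


def summarize(series):
    """Return (metric type, resource type, merged labels, points oldest first)."""
    points = sorted(
        (parse_time(p["interval"]["endTime"]), p["value"]["doubleValue"])
        for p in series["points"]
    )
    labels = {**series["resource"]["labels"], **series["metric"]["labels"]}
    return series["metric"]["type"], series["resource"]["type"], labels, points


def cumulative_to_delta(readings):
    """Turn CUMULATIVE readings (an odometer) into DELTA values (trip lengths)."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


series = json.loads(RAW)
metric_type, resource_type, labels, points = summarize(series)

print(metric_type, resource_type, labels["device"], labels["state"])
for timestamp, value in points:
    print(timestamp.isoformat(), value)

# Hypothetical odometer readings: the deltas are the distances between readings.
print(cumulative_to_delta([100, 130, 180]))  # [30, 50]
```

Because this series is a GAUGE, each point stands alone and no conversion is needed. The conversion helper applies only to CUMULATIVE data.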

Cloud Monitoring writes one time series for each combination of resource and metric label values. You can use these labels to group and to filter time series. For example, if you have a Google Cloud project with multiple Compute Engine VM instances, then the disk utilization for each VM instance is a unique time series. Here are a few of the ways that you can display this data:

  • You can show the disk utilization of every VM instance.
  • You can group the VM instances by the state label, and then display the average disk utilization. The following screenshot illustrates a chart with this configuration:

    Average disk usage grouped by state.

  • You can show the disk utilization for a specific VM instance by filtering the time series for a single value of the instance_id label. The following screenshot illustrates a chart with this configuration:

    Percent used of the disk for a specific disk.
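Grouping and filtering by label can be sketched in a few lines of Python. In this sketch, each dictionary stands for one time series reduced to two labels and its latest value. The instance names, the values, and the helper names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: one entry per combination of instance_id and state.
series_list = [
    {"instance_id": "vm-1", "state": "free", "value": 40.0},
    {"instance_id": "vm-1", "state": "used", "value": 60.0},
    {"instance_id": "vm-2", "state": "free", "value": 20.0},
    {"instance_id": "vm-2", "state": "used", "value": 80.0},
]


def filter_by(series, label, wanted):
    """Keep only the time series whose label has the wanted value."""
    return [s for s in series if s[label] == wanted]


def group_mean(series, label):
    """Group the time series by a label and average the values in each group."""
    groups = defaultdict(list)
    for s in series:
        groups[s[label]].append(s["value"])
    return {key: mean(values) for key, values in groups.items()}


print(group_mean(series_list, "state"))                # {'free': 30.0, 'used': 70.0}
print(filter_by(series_list, "instance_id", "vm-2"))   # the two series for vm-2
```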

Viewing time series data with charts and dashboards

Cloud Monitoring provides you with multiple ways to visualize your time-series data:

  • Metrics Explorer is a stand-alone charting tool designed to let you quickly chart and explore time-series data. By default, the charts you create with this tool aren't saved; however, you can save these charts to a custom dashboard. Metrics Explorer is only available in the Google Cloud Console.

  • Predefined dashboards are automatically populated by Cloud Monitoring. There are dashboards for your VM instances, disks, and Pub/Sub instances. The following screenshot illustrates the list of predefined dashboards available to one Google Cloud project:

    List of predefined dashboards.

    For example, by using the VM instances dashboard, you can view details such as memory and disk usage, identify IP addresses, and identify which VMs are dropping network packets. This dashboard also displays information about your usage of the Cloud Monitoring agent and provides suggestions for instrumentation.

  • Custom dashboards let you define what data you want to view and how to view that data. On your dashboards, you can use widgets that display charts, text, and scorecards that tell you visually whether the most recent measurement is in a danger zone, warning zone, or good zone. For examples of all dashboard widgets, see Dashboard widgets.

    You can create custom dashboards with the Dashboards API or with the Cloud Console.
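As an illustration of how a scorecard might map the most recent measurement to a zone, consider the following Python sketch. The thresholds, the function name, and the assumption that higher values are worse are all hypothetical. Actual scorecard thresholds are whatever you configure on the widget.

```python
def zone(value, warning_at, danger_at):
    """Classify a measurement, assuming that higher values are worse."""
    if value >= danger_at:
        return "danger"
    if value >= warning_at:
        return "warning"
    return "good"


# Hypothetical thresholds for CPU utilization, in percent.
for latest in (35.0, 72.5, 96.0):
    print(latest, zone(latest, warning_at=70.0, danger_at=90.0))
```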

When you create a chart, you select the monitored resource and the metric type whose time-series data you want to view. After you make these selections, you can apply filters to select time series that match certain label values, and you can group data by label.

The chart settings let you compare current data to previous data and create charts that display time-series data for multiple metrics. For example, the following screenshot shows a chart that displays the number of bytes both read and written by a single VM:

Metrics Explorer displaying disk read and write bytes.

For more information about viewing time-series data, see Using dashboards and charts.

Configuring alerts

To be notified when time series meet certain conditions, create an alerting policy. Alerting policies can be simple or complex:

  • "Notify me if any uptime check to the domain example.com fails for more than 3 minutes."

  • "Notify the on-call team if the 90th percentile of HTTP 200 responses from 3 or more web servers in 2 distinct Google Cloud locations exceeds a response latency of 100ms, as long as there are fewer than 15 QPS on the server."

  • "Notify me if the CPU Utilization for 75% of the VM instances in my Google Cloud project is above a threshold of 75."

    The following screenshot illustrates this alerting policy:

    Alerting policy that monitors the CPU utilization.
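The last policy in the list combines a threshold with a fraction of the monitored resources. A minimal Python sketch of that logic follows. It assumes that "75% of the VM instances" means at least 75%. The VM names, the values, and the function names are hypothetical.

```python
def fraction_above(values, threshold):
    """Return the fraction of values that are strictly above the threshold."""
    if not values:
        return 0.0
    return sum(1 for v in values if v > threshold) / len(values)


def cpu_condition_met(cpu_by_vm, threshold=75.0, required_fraction=0.75):
    """True if at least required_fraction of the VMs exceed the threshold."""
    return fraction_above(list(cpu_by_vm.values()), threshold) >= required_fraction


# Hypothetical CPU utilization, in percent, for four VM instances.
busy = {"vm-a": 91.0, "vm-b": 82.0, "vm-c": 77.0, "vm-d": 40.0}
calm = {"vm-a": 91.0, "vm-b": 82.0, "vm-c": 30.0, "vm-d": 40.0}

print(cpu_condition_met(busy))  # True: 3 of 4 VMs are above 75
print(cpu_condition_met(calm))  # False: only 2 of 4 VMs are above 75
```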

You can create alerting policies by using the Cloud Monitoring API and by using the Google Cloud Console. In both cases, you can manage and view your policies in the Google Cloud Console by using the Alerting page.

Conditions are the core component of an alerting policy. A condition describes a potential problem with your system that you want Cloud Monitoring to watch for. For example, you might describe conditions like the following:

  • Any uptime check to the domain example.com fails for more than 3 minutes.
  • The free space of any monitored VM instance is less than 10%.

In the first example, if every uptime check to the domain example.com is successful, then the condition isn't met. In the second example, if VM-A and VM-B are being monitored and if the free space of VM-A is 8%, then this condition is met.
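Both example conditions can be sketched in Python. The first function checks whether an unbroken run of failed probes lasts longer than a limit. The second returns the VMs whose free space is below a minimum. The data layout, the function names, and the 35% value for VM-B are assumptions for illustration only.

```python
def failed_longer_than(results, limit_seconds=180):
    """results: (timestamp_seconds, passed) pairs ordered by time.

    Return True if some unbroken run of failures spans more than limit_seconds.
    """
    run_start = None
    for timestamp, passed in results:
        if passed:
            run_start = None  # a success ends the current run of failures
            continue
        if run_start is None:
            run_start = timestamp
        if timestamp - run_start > limit_seconds:
            return True
    return False


def low_free_space(free_pct_by_vm, minimum_pct=10.0):
    """Return the VMs below minimum_pct; a non-empty list means the condition is met."""
    return sorted(vm for vm, pct in free_pct_by_vm.items() if pct < minimum_pct)


# Hypothetical probes, one per minute: failures run from t=60 through t=300.
probes = [(0, True), (60, False), (120, False), (180, False), (240, False), (300, False)]
print(failed_longer_than(probes))  # True: the run spans 240 seconds

# VM-A has 8% free space, as in the text; the value for VM-B is made up.
print(low_free_space({"VM-A": 8.0, "VM-B": 35.0}))  # ['VM-A']
```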

When the conditions of an alerting policy are met, Cloud Monitoring opens an incident and issues notifications.

  • An incident is a persistent record that stores information about the monitored resources when the condition was met. When the condition stops being met, the incident is automatically closed. You can view all incidents, open and closed, by using the alerting dashboard.
  • You specify who is to be notified when you configure an alerting policy. Monitoring provides support for common notification channels, including email, Cloud Mobile App, and services such as PagerDuty or Slack. For a full list of notification channels, see Notification options.
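The incident lifecycle described above (open when the condition is met, close when it stops being met) can be modeled with a short Python sketch. The record layout and the function name are hypothetical. Real incidents store more information about the monitored resources.

```python
def incidents_from(samples):
    """samples: (timestamp, condition_met) pairs ordered by time.

    Return one record per incident, with "closed" set to None while it is open.
    """
    incidents = []
    current = None
    for timestamp, met in samples:
        if met and current is None:
            # The condition has just become true: open a new incident.
            current = {"opened": timestamp, "closed": None}
            incidents.append(current)
        elif not met and current is not None:
            # The condition stopped being met: close the open incident.
            current["closed"] = timestamp
            current = None
    return incidents


# Hypothetical evaluation results at times 0 through 4.
history = [(0, False), (1, True), (2, True), (3, False), (4, True)]
print(incidents_from(history))
# [{'opened': 1, 'closed': 3}, {'opened': 4, 'closed': None}]
```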

For more information about alerting policies, see Introduction to alerting.

Verifying your service is accessible

You can configure Cloud Monitoring to periodically probe your service in a way that mimics how users access the service. When you configure an uptime check, servers in at least three different locations periodically probe your service and then record the success and latency of the probe. If you want to be notified when your uptime check fails, then you can create an alerting policy to monitor the uptime_check/check_passed metric, which records the results of uptime checks.
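To show the shape of the data an uptime check produces, the following Python sketch runs one probe per location and records whether it passed and how long it took. The probe is a stand-in function rather than a real HTTP request. The location names, the record fields, and the function names are hypothetical.

```python
import time


def run_uptime_check(probe, locations):
    """Call probe(location) once per location and record success and latency."""
    results = []
    for location in locations:
        start = time.perf_counter()
        try:
            passed = bool(probe(location))
        except Exception:
            passed = False  # any error counts as a failed probe
        latency_ms = (time.perf_counter() - start) * 1000
        results.append(
            {"location": location, "check_passed": passed, "latency_ms": latency_ms}
        )
    return results


def fake_probe(location):
    """Stand-in for a request to the service; fails from one location."""
    if location == "asia-pacific":
        raise TimeoutError("no response")
    return True


LOCATIONS = ["usa", "europe", "asia-pacific"]  # hypothetical location names
results = run_uptime_check(fake_probe, LOCATIONS)

for result in results:
    print(result["location"], result["check_passed"])
```

An alerting policy on the check would then watch the recorded success values and notify you when a probe fails.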

Cloud Monitoring provides an Uptime checks page that displays a summary of your uptime checks. You can filter the display and you can use the embedded links to view the details of a specific uptime check. The detail view for an uptime check displays the success or failure of the response and the latency of the response, along with details about the uptime check:

Sample detail view of an uptime check.

For more information about this topic, see Managing uptime checks.

Monitoring large systems

This section describes features that are designed to help you monitor large systems.

Utilizing resource groups

A resource group is a dynamic collection of Google Cloud or Amazon resources that satisfy some criteria that you provide. The following are examples of groups:

  • Compute Engine instances whose names start with the string "prod-".
  • Resources with the tag "test-cluster".
  • Amazon EC2 instances in region A or region B.

After you define a resource group, you can monitor the group as if it were a single resource. For example, you can configure an uptime check to monitor a resource group. For charts and alerting policies, you can filter based on the group name because the group name is treated the same way as a label in a time series.

As you add and remove resources, the membership in the group automatically changes.
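Dynamic membership can be pictured as re-evaluating the group's criteria against the current set of resources. The following Python sketch does that for two of the example criteria. The resource records, their fields, and the function names are hypothetical.

```python
def group_members(resources, criterion):
    """Return the names of the resources that currently satisfy the criterion."""
    return sorted(r["name"] for r in resources if criterion(r))


def is_prod_instance(resource):
    """Compute Engine instances whose names start with the string "prod-"."""
    return resource["kind"] == "gce_instance" and resource["name"].startswith("prod-")


def has_test_cluster_tag(resource):
    """Resources with the tag "test-cluster"."""
    return "test-cluster" in resource["tags"]


# Hypothetical inventory of resources.
resources = [
    {"name": "prod-web-1", "kind": "gce_instance", "tags": []},
    {"name": "staging-web-1", "kind": "gce_instance", "tags": ["test-cluster"]},
    {"name": "prod-db-1", "kind": "gce_instance", "tags": []},
]

before = group_members(resources, is_prod_instance)

# Adding a matching resource changes the membership without redefining the group.
resources.append({"name": "prod-web-2", "kind": "gce_instance", "tags": []})
after = group_members(resources, is_prod_instance)

print(before)  # ['prod-db-1', 'prod-web-1']
print(after)   # ['prod-db-1', 'prod-web-1', 'prod-web-2']
```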

For more information about this topic, see Using resource groups.

Viewing metrics for multiple Google Cloud projects

Cloud Monitoring lets you view and manage the time-series data for multiple Google Cloud projects and AWS accounts.

By default, Cloud Monitoring pages in the Google Cloud Console provide access only to the time-series data stored in the scoping project, that is, the project you selected with the Cloud Console project picker. The scoping project stores the alerts, uptime checks, dashboards, and monitoring groups that you configure.

The scoping project also hosts a metrics scope. The metrics scope defines the projects and accounts whose metrics are visible to the scoping project. You can configure the metrics scope to include time-series data from other Google Cloud projects and from AWS accounts. For information about how to modify a metrics scope, see Modifying your project's Cloud Monitoring configuration.
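One way to think about a metrics scope is as a set of project IDs that determines which time series are visible. The Python sketch below models only that idea. The project names, the record layout, and the function name are hypothetical, and this isn't how you configure a metrics scope.

```python
def visible_series(all_series, metrics_scope):
    """Return the time series whose project is included in the metrics scope."""
    return [s for s in all_series if s["project_id"] in metrics_scope]


# Hypothetical time series from three projects.
all_series = [
    {"project_id": "scoping-project", "metric": "cpu", "value": 0.31},
    {"project_id": "team-a-project", "metric": "cpu", "value": 0.72},
    {"project_id": "team-b-project", "metric": "cpu", "value": 0.18},
]

# By default, only the scoping project's own data is visible.
default_scope = {"scoping-project"}
# Adding a project to the scope makes its time series visible too.
expanded_scope = default_scope | {"team-a-project"}

print(len(visible_series(all_series, default_scope)))   # 1
print(len(visible_series(all_series, expanded_scope)))  # 2
```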

Using programmatic and graphical interfaces

You can use the Google Cloud Console to view your metric data, and to create and manage alerting policies, dashboards, and uptime checks.

You can also directly use the Cloud Monitoring API to write custom metric data and to create and manage alerting policies, dashboards, and uptime checks. Cloud Monitoring API reference pages, such as the page alertPolicies.list, let you experiment with API calls directly from the reference page.
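As a hedged sketch of what a direct API call involves, the following Python code builds the URL for a request that lists time series for one metric type. It reflects our understanding of the v3 REST interface (the timeSeries collection under a project, with filter and interval query parameters). Verify the details against the API reference before relying on them. The function name and the argument values are hypothetical, and authentication and the request itself are omitted.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://monitoring.googleapis.com/v3"


def time_series_list_url(project_id, metric_type, start_time, end_time):
    """Build a URL that lists time series of one metric type in a time interval."""
    params = {
        "filter": f'metric.type = "{metric_type}"',
        "interval.startTime": start_time,
        "interval.endTime": end_time,
    }
    return f"{BASE_URL}/projects/{project_id}/timeSeries?{urlencode(params)}"


url = time_series_list_url(
    "sampleproject",
    "agent.googleapis.com/disk/percent_used",
    "2020-07-27T20:00:00Z",
    "2020-07-27T21:00:00Z",
)

# Decode the query string again to check what would be sent.
query = parse_qs(urlsplit(url).query)
print(urlsplit(url).path)
print(query["filter"][0])
```

In practice, the client libraries handle URL construction and authentication for you, so a hand-built URL like this is mainly useful for understanding what the reference pages show.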

In addition to the API reference pages, documentation for a Cloud Monitoring topic typically includes examples that illustrate how to use the Cloud Monitoring API, the client libraries, and the Google Cloud Console.

What's next

To explore Cloud Monitoring, try the Quickstart for monitoring a Compute Engine instance. The quickstart guides you through creating an uptime check, an alerting policy, and a custom dashboard by using the Cloud Monitoring console.

For more information about Monitoring, see the following resources:

  • For information about the Cloud Monitoring API, see APIs and reference.
  • For lists of metrics and monitored resources, see Google Cloud metrics and the Monitored resource list.
  • For information about pricing, quotas, and limits, see Resources.