Introduction to Cloud Monitoring

Cloud Monitoring is a rich collection of tools that let you answer important questions:

  • "Is my service healthy?"
  • "What is the load on my service?"
  • "Is my website up and responding correctly?"
  • "Is my service performing well?"

Cloud Monitoring measures key aspects of your services, gives you the ability to graph the measurements, and notifies you when the measurements don't have acceptable values. This document provides a brief overview of these capabilities.

Monitoring techniques

Cloud Monitoring provides you with four kinds of monitoring:

  1. Black-box monitoring enables you to probe your service in the same way that a user would use it: by requesting a web page, connecting to a TCP port, or making a REST API call. This type of monitoring provides no information about the internals of your service; your service is treated as an opaque entity. Cloud Monitoring provides this type of monitoring with uptime checks. For more information, see Monitoring a service with uptime checks.

  2. White-box monitoring enables you to monitor aspects of your service that are important to you. You can instrument your service to write time-stamped data by using a library like OpenCensus, or you can write custom time-series data by using the Cloud Monitoring API; a short sketch of writing a data point through the API appears after this list. For more information, see Using custom metrics.

  3. Grey-box monitoring collects information about the state of the environment in which your services are running. This type of monitoring is provided by a combination of Google Cloud products and third-party partners of Cloud Monitoring, such as Blue Medora. For example:

    • Google Cloud services generate metrics that provide information about how the service is operating. For example, Compute Engine reports the CPU usage and CPU utilization of each VM instance; it also reports the count of bytes and packets dropped by the firewall. For a complete list, see Google Cloud metrics.
    • The Cloud Monitoring agent gathers system and application metrics. When installed on Compute Engine VM instances, this agent collects disk, CPU, network, and process metrics. When installed on Linux, the agent can also be configured to collect metrics from third-party plugins.
    • Third-party plugins provide service-level data on your Linux VMs. This information might include metrics about your Apache or Nginx web servers, or metrics about your MongoDB or PostgreSQL databases.
  4. Logs-based metrics are metrics collected from the content of logs written to Cloud Logging. The predefined logs-based metrics include, for example, errors that your service detects or the total number of log entries received. You can also define custom logs-based metrics. For example, you could count the number of log entries that match a given query, or keep track of particular values within matching log entries.
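
The following is a minimal sketch of white-box monitoring through the Cloud Monitoring API, using the google-cloud-monitoring Python client. The project ID, instance labels, and the metric type custom.googleapis.com/checkout/queue_depth are placeholders invented for illustration.

import time

from google.cloud import monitoring_v3

# Placeholder project and resource identifiers.
project_id = "my-project"
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

# Describe the time series being written: a hypothetical user-defined metric
# attached to a Compute Engine VM instance.
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/checkout/queue_depth"
series.resource.type = "gce_instance"
series.resource.labels["instance_id"] = "1234567890123456789"
series.resource.labels["zone"] = "us-central1-a"

# One time-stamped measurement; a GAUGE point uses a point-in-time interval.
now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": seconds, "nanos": nanos}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

# Each call to create_time_series can write at most one point per time series.
client.create_time_series(name=project_name, time_series=[series])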

Monitored resources, metric descriptors, and time series

This section introduces the terminology Cloud Monitoring uses. For a more detailed discussion of the concepts presented in this section, see Structure of time series.

Monitored resources

A monitored resource is the hardware or software component that is being monitored. Examples of monitored resources include Compute Engine disks and instances, and App Engine applications and instances. There are about 100 types of monitored resources available. For the current list, see Monitored resource list.
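
One way to see which monitored-resource types are available is to list them through the Cloud Monitoring API. The following is a minimal sketch using the google-cloud-monitoring Python client; the project ID is a placeholder.

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder

# Each descriptor names a resource type (such as gce_instance) and its labels.
for descriptor in client.list_monitored_resource_descriptors(name=project_name):
    print(descriptor.type, [label.key for label in descriptor.labels])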

Each type of monitored resource is formally described in a data structure called a MonitoredResourceDescriptor. For example, here is the monitored-resource descriptor for the gce_instance resource:

{
  "type": "gce_instance",
  "displayName": "G​C​E VM Instance",
  "description": "A virtual machine instance hosted in Compute Engine (G​C​E).",
  "name": "projects/PROJECT_ID/monitoredResourceDescriptors/gce_instance"
  "labels": [
    {
      "key": "project_id",
      "description": "The identifier of the Google Cloud project associated with this resource, such as \"my-project\"."
    },
    {
      "key": "instance_id",
      "description": "The numeric VM instance identifier assigned by Compute Engine."
    },
    {
      "key": "zone",
      "description": "The Compute Engine zone in which the VM is running."
    }
  ]
}

When a service writes data to Cloud Monitoring, the data being written always refers to a monitored resource. If you view that data, then you can use these labels, which are represented as key-value pairs, to identify what generated the data. For example, if data refers to the gce_instance monitored resource, then you can identify the specific VM instance by viewing the value of the label instance_id.

In Cloud Monitoring, when you are creating charts or alerting policies, you can filter and group your data based on the values of the labels in a time series. For example, if you have a Google Cloud project with multiple Compute Engine VM instances, you could create a chart that displays the time series of the CPU utilization of each instance. If you add a filter for the instance_id, then you can display only the CPU utilization of a specific VM instance.
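
The same label-based filtering is available when you read data programmatically. The following rough sketch, using the google-cloud-monitoring Python client, reads the last hour of CPU utilization for a single VM instance; the project and instance IDs are placeholders.

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # placeholder

# Read the most recent hour of data.
now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now) - 3600},
    }
)

# Filter on the metric type and on the instance_id resource label.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
            'AND resource.labels.instance_id = "1234567890123456789"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)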

Metric descriptors

All data sent to Cloud Monitoring is described by a metric descriptor. A metric descriptor is a definition that describes the attributes of the data. Cloud Monitoring has approximately 1,500 built-in metric descriptors; see the Metrics list for details.

The following is an example of a metric descriptor:

    Metric type: agent.googleapis.com/disk/percent_used
    Display name: Disk utilization
    Metric kind: GAUGE
    Value type: DOUBLE
    Units: % (this symbol indicates a percentage, which is a value between 0.0 and 100.0)
    Labels: device, state
    Monitored resource: gce_instance (this value refers to a Compute Engine VM instance)
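
The attribute values shown above can also be read programmatically. The following minimal sketch assumes the google-cloud-monitoring Python client, a placeholder project ID, and that the agent metric descriptors are available in your project; it retrieves the same descriptor and prints its key attributes.

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_id = "my-project"  # placeholder
name = (
    f"projects/{project_id}/metricDescriptors/"
    "agent.googleapis.com/disk/percent_used"
)

descriptor = client.get_metric_descriptor(name=name)
print(descriptor.type)          # the metric type
print(descriptor.metric_kind)   # the metric kind (GAUGE)
print(descriptor.value_type)    # the value type (DOUBLE)
print(descriptor.unit)          # the unit (%)
print([label.key for label in descriptor.labels])  # ['device', 'state']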

The remainder of this section describes some of the key attributes of a metric descriptor. For a complete description, see MetricDescriptor:

  • Metric type

    The Metric type looks similar to a URL. For Google Cloud services and for certain third-party integrations, the first part of the metric type identifies the source of the time series, and the remainder describes what is being monitored. In the example metric descriptor, agent.googleapis.com identifies the source as the Cloud Monitoring agent, and disk/percent_used indicates that the metric measures the percentage of disk space used. For user-defined (custom) metrics, the metric type is prefixed with custom.googleapis.com or external.googleapis.com.

    Metric types are globally unique.

    Because the metric type is globally unique, the terms metric descriptor and metric type are often interchanged. In the Google Cloud Console, the term metric is often used in place of metric type.

  • Display name

    The Display name is a short, descriptive name for the metric descriptor. In the example, the display name is "Disk utilization". The display name, which might not be unique, is used in the Google Cloud Console to simplify the data display.

  • Metric kind

    The Metric kind describes the relationship between adjacent measured values within a time series:

    • GAUGE metrics store the value of the thing being measured at a given moment in time. An analogy is the speedometer in your car, which records your current speed.

    • CUMULATIVE metrics store the accumulated value of the thing being measured at a given moment in time. An analogy is the odometer in your car, which records the total distance you have traveled.

    • DELTA metrics store the change in the value of the thing being measured over a specified period of time. An analogy is the trip odometer that you reset every day, which measures the total distance you have traveled during that day, or since it was last reset. Another example is a stock summary that tells you how much money you made or lost in the market today.

    For more information, see MetricKind.

  • Value type

    The Value type describes the data type for the measurement. Numeric measurements are INT64 or DOUBLE. Metrics can also have values of type BOOL, STRING, or DISTRIBUTION. All data points within a time series have the same value type.

    For more information about value types, see ValueType.

  • Metric unit

    The Metric unit describes the unit of measurement in which a data point is reported.

    For example, By is the standard notation for "bytes", and kBy is kilobytes, or thousands of bytes. To record that 1126 bytes were written when the unit is kBy, you would write the value 1.126. The available units also include binary units suited to digital information; for example, KiBy is kibibytes, or multiples of 1024 bytes. To record the same 1126 bytes when the unit is KiBy, you would write the value 1.099. A short arithmetic check of these values appears after this list.

    For more information about metric units, see Units.

  • Labels

    Some metric descriptors specify labels to augment the labels defined in the monitored resource. You can filter and group the data by the label value when you create charts or alerting policies.

    The example metric descriptor includes the labels device and state. The label device refers to the disk identifier, and the label state identifies whether the time series contains values for free disk space, used disk space, or reserved disk space. If you view data that shows the disk usage of a Compute Engine VM instance, then you are viewing data for the metric type agent.googleapis.com/disk/percent_used written against a Compute Engine VM instance. The data contains five labels: the three specified in the monitored resource descriptor and the two defined in the metric descriptor. By using filters, you can, for example, view only the free disk space, or only the free disk space for a specific VM.
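
As a quick check of the unit arithmetic described above, the following snippet reproduces the byte-count example in Python; the values are taken directly from the text.

# Reporting 1126 bytes in decimal kilobytes (kBy) and binary kibibytes (KiBy).
bytes_written = 1126

value_in_kBy = bytes_written / 1000   # 1.126
value_in_KiBy = bytes_written / 1024  # 1.099609..., shown truncated as 1.099 above

print(value_in_kBy, value_in_KiBy)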

Time series

A time series is a collection of measurements and metadata about those measurements. The following illustrates part of a time series:

{
  "timeSeries": [
    {
      "metric": {
        "labels": {
          "device": "sda1",
          "state": "free"
        },
        "type": "agent.googleapis.com/disk/percent_used"
      },
      "resource": {
        "type": "gce_instance",
        "labels": {
          "instance_id": "2708613220420473591",
          "zone": "us-east1-b",
          "project_id": "sampleproject"
        }
      },
      "metricKind": "GAUGE",
      "valueType": "DOUBLE",
      "points": [
        {
          "interval": {
            "startTime": "2020-07-27T20:20:21.597143Z",
            "endTime": "2020-07-27T20:20:21.597143Z"
          },
          "value": {
            "doubleValue": 0.473005
          }
        },
        {
          "interval": {
            "startTime": "2020-07-27T20:19:21.597239Z",
            "endTime": "2020-07-27T20:19:21.597239Z"
          },
          "value": {
            "doubleValue": 0.473025
          }
        }
      ]
    }
  ]
}

Each time series has a unique collection of label values. As illustrated in this example, the time series contains metric labels and resource labels. The values of the labels show that the time series is for the free space of the disk "sda1", which is part of the VM instance with the ID 2708613220420473591. A different time series exists for the disk's used space, and yet another for the disk's reserved space. The example contains two measurements, where each measurement is a point that is represented by a time interval and a value.

Viewing time-series data

To view time-series data, you can use Metrics Explorer or you can view a chart on a dashboard.

Metrics Explorer

Metrics Explorer provides a menu-driven interface where you select the monitored resource type and the metric type whose time-series data you want to view. After you make these selections, you can apply filters to display only specific time series. To let you manage complex configurations, Metrics Explorer provides a set of aggregation options. For more information about these options, see Filtering and aggregation.

For example, the following screenshot illustrates the average free, used, and reserved disk utilization for the disks of all VM instances that are located in the "us-central1-a" zone:

Metrics Explorer displaying the disk utilization.

If you are interested in examining trends, you can configure Metrics Explorer to compare current time-series data to previous data.

If you want to perform comparative analysis of different data, you can create charts that display time-series data for multiple metric descriptors. For example, you can display "Traffic", "Latency", and "Sales" data on the same chart.

For more information, see Metrics Explorer.

Charts and dashboards

To display information about a collection of resources, you use dashboards. Cloud Monitoring provides support for two different types of dashboards:

  • Preconfigured dashboards are automatically created by Cloud Monitoring when a resource is in use by your service. For example, if your service is built using an Apache web server on Google Cloud, then dashboards are automatically created for the web server, for each of the Compute Engine disks, for the firewalls, and for Compute Engine VM instances. If the disks are backed up by snapshots, additional dashboards are automatically created.

    Preconfigured dashboards are designed to display the information most commonly viewed. For example, the dashboard for a Compute Engine VM instance includes information about the zone, public and private IP addresses, and charts that display CPU usage and other interesting data.

  • Custom dashboards let you create a collection of charts that you want to view. To add a chart to a dashboard, you can use the add-chart capability within a dashboard or you can create a chart with Metrics Explorer and save it to the dashboard. If you use the Google Cloud Console to create a dashboard, the charts are displayed in a grid pattern; however, you can create more complex configurations by using the Dashboards API. A rough sketch of creating a dashboard with the API follows this list.
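
As a rough illustration of the Dashboards API mentioned above, the following sketch creates a one-chart custom dashboard with the google-cloud-monitoring-dashboards Python client. The project ID, display names, and filter string are placeholders, and the aggregation shown is only one reasonable choice.

from google.cloud import monitoring_dashboard_v1

client = monitoring_dashboard_v1.DashboardsServiceClient()
project_name = "projects/my-project"  # placeholder

# One grid-layout dashboard containing a single xy-chart widget.
dashboard = monitoring_dashboard_v1.Dashboard(
    {
        "display_name": "Free disk space (sketch)",
        "grid_layout": {
            "columns": 1,
            "widgets": [
                {
                    "title": "Free disk space per device",
                    "xy_chart": {
                        "data_sets": [
                            {
                                "time_series_query": {
                                    "time_series_filter": {
                                        "filter": (
                                            'metric.type = "agent.googleapis.com/disk/percent_used" '
                                            'AND resource.type = "gce_instance" '
                                            'AND metric.labels.state = "free"'
                                        ),
                                        "aggregation": {
                                            "alignment_period": {"seconds": 60},
                                            "per_series_aligner": "ALIGN_MEAN",
                                        },
                                    }
                                }
                            }
                        ]
                    },
                }
            ],
        },
    }
)

created = client.create_dashboard(
    request={"parent": project_name, "dashboard": dashboard}
)
print(created.name)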

For more information about charts and dashboards, see Using dashboards and charts.

Monitoring time-series data

To be notified when time series meet certain conditions, create an alerting policy. You can create both simple and complex alerting policies, for example:

  • "Notify me if an uptime check has failed for more than 3 minutes in any location."

  • "Notify the on-call team if the 90th percentile of HTTP 200 responses from 3 or more web servers in 2 distinct Google Cloud locations exceeds a response latency of 100ms, as long as there are fewer than 15 QPS on the server."

To manage and view your policies, Cloud Monitoring provides an alerting dashboard.

This section provides a brief overview of alerting. For more information, see Introduction to alerting.

Alert policy components

In Cloud Monitoring, an alerting policy consists of four components:

  • A name that is displayed on the alerting dashboard and is included in notifications that are sent.
  • A list of notification channels that specify who is notified. Monitoring provides support for common notification channels. You can configure a notification to be sent through email, to a mobile device, or to a service such as PagerDuty or Slack. For a full list, see Notification options.

  • Custom documentation to be included in the notification. For example, you can configure this content to describe what actions should be taken by a human operator. This field supports the use of parameterized variables; for more information, see Variables in documentation templates.

  • One or more conditions that the alerting policy evaluates. Each condition specifies a monitored resource, a metric type, and when that condition is met. For example, a condition might monitor the disk utilization of a VM instance and be met if the free space is less than 10% for at least 5 minutes.

When the conditions of an alerting policy are met, an incident is generated and notifications are issued. When the conditions are no longer met, the incident is resolved automatically and another notification is sent to the specified notification channels.

Example: Alerting policy for free disk space

Suppose that you want to be notified if the free disk space is below 35% for any disks named tmpfs on the VM instances in the zone "us-central1-a".

You decide to create an alerting policy, so you do the following:

  1. In the Google Cloud Console, you go to Cloud Monitoring and then click Alerting.

  2. You click Create policy to create a new alerting policy. In the dialog that opens, you enter a name and then select Add condition.

  3. In the condition dialog, you select the monitored resource and the metric type, and you apply filters:

    Select the resource type.

    For this alert, you take the following actions:

    1. For Resource type, you select VM instance.
    2. For Metric, you select Disk Utilization.
    3. You add filters for the zone, state, and disk:

      • zone = "us-central1-a"
      • state = "free"
      • disk = "tmpfs"

    With these options, the interactive chart displays the free disk space for each "tmpfs" disk in the "us-central1-a" zone:

    Alert policy displaying the disk utilization.

    The screenshot shows that in one project, there are two "tmpfs" disks in the "us-central1-a" zone.

  4. You want the condition to be met when "the free disk space for any disk is below 35%". You enter this information into the Configuration pane:

    Condition displaying the configuration.

    • You set Condition triggers if to Any time series violates.

      You chose this value because you want to be notified if any of the time series has a value below 35%. There are other options available. For example, you could also have set this field to be all time series, a specific number of time series, or a percentage of time series.

    • You set the Condition to is below and the Threshold to 35%.

      You chose these settings because you want to compare the value of the time series to 35%, and you want to be notified if the value is below that number. Other options include being above, being absent, or how fast the value is changing.

    • You set the For field to most recent value.

      You chose the value of most recent value because you want to be notified immediately when the free disk space value is less than 35%. The For field defines a duration. If this field is set to five minutes, then the free disk space would need to be below 35% for five minutes before the notification occurs. In this case, the default value of the duration field was set to one minute, which is long enough to ensure that a single measurement doesn't cause an incident to be created.

  5. You complete the policy by saving the condition, adding your notification channels, and adding the documentation.

In this example, when the condition is met, an incident is created and notifications are sent.
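
For comparison, a similar policy could be created programmatically. The following is a rough sketch with the google-cloud-monitoring Python client; the project ID and display names are placeholders, the filter is written with the device and state labels from the metric descriptor, and the exact filter syntax is illustrative rather than copied from the console.

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project"  # placeholder

policy = monitoring_v3.AlertPolicy(
    {
        "display_name": "Low free space on tmpfs (sketch)",
        "combiner": "OR",
        "conditions": [
            {
                "display_name": "Free disk space below 35%",
                "condition_threshold": {
                    # Mirror of the console filters: metric type, disk, state, zone.
                    "filter": (
                        'metric.type = "agent.googleapis.com/disk/percent_used" '
                        'AND resource.type = "gce_instance" '
                        'AND metric.labels.device = "tmpfs" '
                        'AND metric.labels.state = "free" '
                        'AND resource.labels.zone = "us-central1-a"'
                    ),
                    "comparison": "COMPARISON_LT",
                    "threshold_value": 35.0,
                    # A zero-length duration corresponds to "most recent value".
                    "duration": {"seconds": 0},
                    "trigger": {"count": 1},
                },
            }
        ],
    }
)

created = client.create_alert_policy(name=project_name, alert_policy=policy)
print(created.name)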

Monitoring a service with uptime checks

Cloud Monitoring provides black-box monitoring with uptime checks. If you configure an uptime check for a service, then your service is periodically probed for responsiveness by servers located in at least three different locations.

The uptime check records the success or failure of the response and the latency of the response as time series. Cloud Monitoring creates a dashboard for each uptime check in your Google Cloud project. From an uptime checks dashboard, you can view the latency of the responses, the response history, and detailed information about the check. You can also view this data by creating a chart with Metrics Explorer or by adding a chart to a custom dashboard. For chart settings, see Creating an uptime-latency chart.

You can also configure your uptime check to be associated with an alerting policy. With this configuration, the alerting policy notifies you if the uptime check fails. For more information, see Uptime checks.

Cloud Monitoring provides an uptime dashboard that displays a summary of your uptime checks. You can filter the display, as shown in the following screenshot, and you can use the embedded links to view the details of a specific uptime check:

Sample uptime checks overview with filters.
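
The following is a minimal sketch of creating an HTTPS uptime check through the API, assuming the google-cloud-monitoring Python client; the host name and project ID are placeholders.

from google.cloud import monitoring_v3

client = monitoring_v3.UptimeCheckServiceClient()
project_name = "projects/my-project"  # placeholder

config = monitoring_v3.UptimeCheckConfig(
    {
        "display_name": "Example homepage check (sketch)",
        # The resource being probed: a public URL.
        "monitored_resource": {
            "type": "uptime_url",
            "labels": {"host": "www.example.com"},
        },
        "http_check": {"path": "/", "port": 443, "use_ssl": True},
        "timeout": {"seconds": 10},
        "period": {"seconds": 300},
    }
)

created = client.create_uptime_check_config(
    request={"parent": project_name, "uptime_check_config": config}
)
print(created.name)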

Monitoring groups

A Cloud Monitoring group is a collection of Google Cloud or AWS resources that possess specific attributes that you specify. Examples of groups include "every Compute Engine instance whose name starts with a specific string", "every resource with certain tags", and "every AWS compute resource in region A or region B". As you add and remove resources, the membership in the group automatically changes.
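
For example, a group like "every Compute Engine instance whose name starts with a specific string" could also be defined through the API. The following is a rough sketch with the google-cloud-monitoring Python client; the project ID and display name are placeholders, and the group filter expression is illustrative.

from google.cloud import monitoring_v3

client = monitoring_v3.GroupServiceClient()
project_name = "projects/my-project"  # placeholder

group = monitoring_v3.Group(
    {
        "display_name": "Frontend instances (sketch)",
        # Membership filter: resources whose name starts with "frontend-".
        "filter": 'resource.metadata.name = starts_with("frontend-")',
    }
)

created = client.create_group(request={"name": project_name, "group": group})
print(created.name)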

If you configure an uptime check to monitor a group, then the uptime check probes every member in the group when the probe is issued.

If resources are added or removed, the uptime check automatically adjusts what it probes. In the dashboard for that uptime check, you can view the latency and check-passed time series for each member of the group. You can also configure these types of uptime checks to be associated with an alerting policy.

When you are creating a chart or an alerting policy, you can filter the time series by Cloud Monitoring group name, and you can group time series by the group name.

For more information, see Using resource groups.

Monitoring Workspaces

Cloud Monitoring uses Workspaces as a mechanism to let you view and manage time-series data stored in multiple Google Cloud projects from a single place. The Workspace stores the charts, dashboards, uptime checks, and other configuration actions you perform.

Workspaces are designed to be transparent to most users. In the simplest and most common scenario, when you access Monitoring in the Google Cloud Console for the first time, a Workspace is automatically created for your project. In this simplest case, the Workspace monitors a single Google Cloud project.

After you have created a Workspace, from the Workspace's Settings page, you can add other Google Cloud projects to the Workspace. When you add a project, you enable Cloud Monitoring to read the time-series data stored within that project.

For a conceptual overview, see Workspaces.

To monitor your services the way Google manages its services, see Concepts in service monitoring. With service monitoring, you define your service's performance objectives, how to measure the performance, and an error budget. You can create alerting policies that notify you if your error budget is being consumed at a faster than desired rate.

If you are using Google Kubernetes Engine, see Overview of Google Cloud's operations suite for GKE, which describes how to observe your GKE clusters by using Cloud Monitoring and Cloud Logging.

Getting started with Monitoring

This page has provided a brief overview of the main components of Cloud Monitoring.

To explore the capabilities of Cloud Monitoring, try the Cloud Monitoring quickstart for Compute Engine. The quickstart guides you through creating an uptime check, an alerting policy, and a custom dashboard by using the Cloud Monitoring console. The Cloud Monitoring console provides a menu-driven interface and, in some situations, supports Monitoring Query Language (MQL). MQL is an expressive, text-based interface for querying time-series data.

You can also use Cloud Monitoring's programmatic interface, the Cloud Monitoring API, to create and manage charts and dashboards, uptime checks, alerting policies, and groups. The API supports the traditional filter-based language and MQL. For more information, see Creating a chart with Monitoring Query Language and Creating a dashboard by using the API.
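
For instance, the free-disk-space data used earlier might be queried with MQL through the API. The following sketch assumes the google-cloud-monitoring Python client and a placeholder project ID; the MQL string is illustrative.

from google.cloud import monitoring_v3

client = monitoring_v3.QueryServiceClient()
project_name = "projects/my-project"  # placeholder

# Illustrative MQL: free disk space, over the last hour.
mql = """
fetch gce_instance :: agent.googleapis.com/disk/percent_used
| filter metric.state == 'free'
| within 1h
"""

for series_data in client.query_time_series(
    request={"name": project_name, "query": mql}
):
    print(series_data)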

For more information about Monitoring, see the following resources:

  • For information about pricing, quotas, and limits, see Resources.
  • For a list of monitored resources, see Monitored resource list.
  • For a list of supported metrics, see Metrics list.
  • For information about the Cloud Monitoring API, see the following: