Capturing Cloud Bigtable tracing and metrics using Stackdriver and OpenCensus

This tutorial shows how to implement client-side tracing and metrics recording in your Cloud Bigtable workloads by using OpenCensus and Stackdriver. Although Cloud Bigtable surfaces several helpful server-side metrics using Stackdriver, apps can realize added benefits by implementing client-side tracing, instrumenting, and app-defined metrics. For example, server-side metrics don't give you a window into the round-trip latency of calls made to your Cloud Bigtable endpoint; that latency can be surfaced only through client-side tracing.

OpenCensus is an open source library that you can use to provide observability in your apps. The library is vendor agnostic and integrates with several backends, such as Prometheus and Zipkin. In this tutorial, you use Stackdriver as the backend for tracing and metrics.

To complete the steps in this tutorial, you should be familiar with the Linux command line. Though it's not required, knowledge of the Java programming language helps you understand the example code.

Objectives

  • Deploy a Cloud Bigtable instance.
  • Deploy a Compute Engine virtual machine (VM) for running a sample OpenCensus-instrumented Java client.
  • Download, deploy, and run the instrumented Java client app.
  • View OpenCensus traces in Stackdriver Trace.
  • View OpenCensus metrics in Stackdriver Metrics Explorer.

Costs

This tutorial uses the following billable components of Google Cloud Platform:

  • Compute Engine
  • Cloud Bigtable
  • Stackdriver

To generate a cost estimate based on your projected usage, use the pricing calculator. New GCP users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud Platform project. Learn how to enable billing.

  4. Enable the Compute Engine, Cloud Bigtable, and Stackdriver Logging APIs.

    Enable the APIs

  5. Install and initialize the Cloud SDK.

Reference architecture

For simplicity, in this tutorial you implement all of your client-side logic in a Java console app. For the datastore tier, you use Cloud Bigtable, which allows you to focus on the key aspects of client-side tracing and metrics without having to worry about things like database deployments and related configuration.

The following architecture diagram shows a Java console app and a datastore tier.

Reference architecture showing a Java console app and a datastore tier.

Creating a Cloud Bigtable instance

In this section, you create a Cloud Bigtable instance that your Java app uses later in the tutorial.

  • In Cloud Shell, create a Cloud Bigtable development instance:

    gcloud bigtable instances create cbt-oc \
      --cluster=cbt-oc \
      --cluster-zone=us-central1-c \
      --display-name=cbt-oc \
      --instance-type=DEVELOPMENT
    

This command might take a few minutes to complete.

Creating and configuring a Compute Engine VM

  • In Cloud Shell, create a Compute Engine VM with the security scopes necessary for OAuth 2.0:

    gcloud compute instances create trace-client \
        --zone=us-central1-c \
        --scopes="https://www.googleapis.com/auth/bigtable.admin.table,\
    https://www.googleapis.com/auth/bigtable.data,\
    https://www.googleapis.com/auth/logging.write,\
    https://www.googleapis.com/auth/monitoring.write,\
    https://www.googleapis.com/auth/trace.append"
    

Sample Java app

This section uses a sample Java app that generates transactions to demonstrate the tracing capabilities of OpenCensus and Stackdriver.

App flow

The sample Java app running on the Compute Engine VM does the following:

  1. Creates a table in the Cloud Bigtable instance.
  2. For a series of 10,000 transaction sets, does the following:
    1. Writes a small set of rows.
    2. Reads a single row.
    3. Performs a table scan for those rows.
  3. Deletes the table.
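The transaction-set loop described above can be sketched as follows. This is a simplified control-flow sketch, not the tutorial's actual code: writeRows, readRow, and scanRows here are hypothetical stubs standing in for the real Cloud Bigtable calls.

```java
import java.util.UUID;

public class AppFlowSketch {
    static int writes = 0, reads = 0, scans = 0;

    // Hypothetical stand-ins for the app's Cloud Bigtable operations.
    static void writeRows(String rowKeyPrefix) { writes += 3; } // writes a small set of rows
    static void readRow(String rowKeyPrefix)   { reads++; }     // reads a single row back
    static void scanRows(String rowKeyPrefix)  { scans++; }     // scans the rows just written

    public static void main(String[] args) {
        int transactionSets = 10_000;
        for (int i = 0; i < transactionSets; i++) {
            // Each transaction set uses a unique row-key prefix.
            String prefix = UUID.randomUUID().toString();
            writeRows(prefix);
            readRow(prefix);
            scanRows(prefix);
        }
        System.out.println(writes + " " + reads + " " + scans);
    }
}
```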

Deploying the sample app

In this section, you download the Java app containing the instrumented code, modify it to reflect your environment, and then run it.

  1. In the GCP Console, go to the VM instances page:

    GO TO THE VM INSTANCES PAGE

  2. Use SSH to connect to the VM by clicking the SSH button (highlighted in the following screenshot):

    Using SSH to connect to the VM.

  3. In the VM instance, install Git, the Java 8 JDK, and Maven:

    sudo apt-get install git openjdk-8-jdk maven -y
    
  4. In the instance, clone the source repository for this tutorial:

    git clone https://github.com/GoogleCloudPlatform/community.git
    

    You can now update the Java app with some configuration settings specific to your project.

  5. Navigate to the folder containing the Java source:

    cd community/tutorials/bigtable-oc/java/
    
  6. Configure the app code to use the cbt-oc Cloud Bigtable instance:

    export INSTANCE_ID=cbt-oc
    
  7. Now run the Maven commands to build and run the program:

    mvn package -DskipTests --quiet
    mvn exec:java -Dexec.mainClass=com.example.bigtable.App --quiet
    

    The output looks similar to the following:

    ...
    2019-05-13 23:31:54 INFO  BigtableSession:89 - Opening connection for projectId your-project, instanceId cbt-oc, on data host bigtable.googleapis.com, admin host bigtableadmin.googleapis.com.
    2019-05-13 23:31:54 INFO  BigtableSession:89 - Bigtable options: {......}
    2019-05-13 23:31:54 INFO  OAuthCredentialsCache:89 - Refreshing the OAuth token
    2019-05-13 23:31:55 INFO  App:170 - Create table Hello-Bigtable
    2019-05-13 23:35:36 INFO  App:209 - Delete the table
    2019-05-13 23:35:36 WARN  BigtableAdmin:116 - Table Hello-Bigtable was disabled in memory only.
    

Sample app code highlights

In the following code segment, the Cloud Bigtable instance is provided to the Java runtime using the environment variable INSTANCE_ID that you set previously.

private static final String PROJECT_ID = ServiceOptions.getDefaultProjectId();
private static final String INSTANCE_ID = System.getenv("INSTANCE_ID");

The code segment below shows how to define a manually labeled tracing scope:

try (Scope ss = tracer.spanBuilder("opencensus.Bigtable.Tutorial").startScopedSpan()) {

    // generate unique UUID
    UUID uuid = UUID.randomUUID();
    String randomUUIDString = uuid.toString();

    startWrite = System.currentTimeMillis();
    // write to Bigtable
    writeRows(table, randomUUIDString);
    endWrite = System.currentTimeMillis();

    startRead = System.currentTimeMillis();
    // read from Bigtable
    readRows(table, randomUUIDString);
    endRead = System.currentTimeMillis();

Notice the try block with the call to spanBuilder. This illustrates how the program uses OpenCensus to perform tracing. The call chain that performs the table writes and reads in the doBigTableOperations function is instrumented in this way.

The program also configures Stackdriver Trace as the tracing backend:

private static void configureOpenCensusExporters(Sampler sampler) throws IOException {
    TraceConfig traceConfig = Tracing.getTraceConfig();

    // For demo purposes, let's always sample.
    traceConfig.updateActiveTraceParams(
        traceConfig.getActiveTraceParams().toBuilder().setSampler(sampler).build());

    // Create the Stackdriver trace exporter
    StackdriverTraceExporter.createAndRegister(
        StackdriverTraceConfiguration.builder()
            .setProjectId(PROJECT_ID)
            .build());

    // Create the Stackdriver monitoring (stats) exporter
    StackdriverStatsExporter.createAndRegister();

Viewing traces in Stackdriver Trace UI

The sample program performs 10,000 transaction sets: three writes and a range read per set. The sampler is configured to record, on average, one trace for every 1,000 transaction sets. As a result, approximately 10 traces are captured over the run of the program; the exact number varies from run to run because sampling is probabilistic.
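As a sanity check on that number, the expected trace count follows directly from the sampling probability (a sketch, not part of the sample app):

```java
public class SamplingExpectation {
    public static void main(String[] args) {
        // Matches Samplers.probabilitySampler(1/1000.0) in the sample app.
        double samplingProbability = 1 / 1000.0;
        int transactionSets = 10_000;
        // Expected number of traces captured over the run.
        double expectedTraces = samplingProbability * transactionSets;
        System.out.println(Math.round(expectedTraces));
    }
}
```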

After the program has been running for a short time, do the following:

  1. Navigate to the Stackdriver Trace console under Stackdriver:

    Stackdriver Trace console.

  2. Click Trace List.

    On the right, you should see a table similar to the following:

    Trace list table.

    The sampling rate for the traces is set to record one trace for every 1,000 transactions.

    The tracing label opencensus.Bigtable.Tutorial in the Timeline is the name of the outermost tracing scope that is defined in the following code snippet.

    // sample every 1000 transactions
    configureOpenCensusExporters(Samplers.probabilitySampler(1/1000.0));
  3. Select opencensus.Bigtable.Tutorial. This opens a drill-down view that shows more information about the call chain, along with other useful information such as tracing scopes for discrete API calls instrumented in the client library and operation-level call latencies.

    For instance, each series of row writes and reads is encapsulated by the lower-level, user-defined WriteRows and ReadRows tracing spans.

    Below ReadRows, you can see the get operation, followed by the table scan operations.

    Get operation and table scan operations.

    The other items included in the trace list, such as Operation.google.bigtable.admin.v2.BigtableTableAdmin.CreateTable, occurred outside of the manually defined tracing scope. As a result, these items are included as separate operations in the list.

Viewing metrics in Stackdriver Metrics Explorer UI

The app code shows how to measure and record latency and transaction counts.

For metrics, there is no sampling: every recorded value is included in the metric representations. Each metric is defined by the type of measurement to be performed. In this example, write latency is recorded in milliseconds:

// The write latency in milliseconds
private static final MeasureDouble M_WRITE_LATENCY_MS = MeasureDouble.create("btapp/write_latency", "The latency in milliseconds for write", "ms");

The distribution is aggregated and stored using these buckets: 0–5 ms, 5–10 ms, 10–25 ms, 25–100 ms, and so on.

Aggregation latencyDistribution = Distribution.create(BucketBoundaries.create(
        Arrays.asList(
            0.0, 5.0, 10.0, 25.0, 100.0, 200.0, 400.0, 800.0, 10000.0)));

View.create(Name.create("btappmetrics/write_latency"),
            "The distribution of the write latencies",
            M_WRITE_LATENCY_MS,
            latencyDistribution,
            Collections.singletonList(KEY_LATENCY));
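To see which bucket a given latency sample lands in, you can walk the boundary list. The following sketch (not part of the tutorial code) mirrors how a distribution aggregation with these boundaries assigns a value to a bucket:

```java
import java.util.Arrays;
import java.util.List;

public class BucketSketch {
    // Same boundaries as the latencyDistribution above.
    static final List<Double> BOUNDARIES =
        Arrays.asList(0.0, 5.0, 10.0, 25.0, 100.0, 200.0, 400.0, 800.0, 10000.0);

    // Returns the index of the bucket that contains the value:
    // bucket i covers [BOUNDARIES[i-1], BOUNDARIES[i]); values at or above
    // the last boundary fall into the final overflow bucket.
    static int bucketIndex(double valueMs) {
        for (int i = 0; i < BOUNDARIES.size(); i++) {
            if (valueMs < BOUNDARIES.get(i)) {
                return i;
            }
        }
        return BOUNDARIES.size(); // overflow bucket
    }

    public static void main(String[] args) {
        // A 7.5 ms write latency falls into the 5-10 ms bucket (index 2).
        System.out.println(bucketIndex(7.5));
        // A 150 ms latency falls into the 100-200 ms bucket (index 5).
        System.out.println(bucketIndex(150.0));
    }
}
```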

All three metrics—write latency, read latency, and transaction counts—are recorded using a single call to the recorder:

// record read, write latency metrics and count
STATS_RECORDER.newMeasureMap()
              .put(M_READ_LATENCY_MS, endRead - startRead)
              .put(M_WRITE_LATENCY_MS, endWrite - startWrite)
              .put(M_TRANSACTION_SETS, 1)
              .record();
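The endRead - startRead and endWrite - startWrite values are plain wall-clock deltas in milliseconds. In isolation, the measurement pattern looks like this (a sketch, with Thread.sleep standing in for a Bigtable call):

```java
public class LatencyMeasurement {
    public static void main(String[] args) throws InterruptedException {
        // Measure the wall-clock latency of an operation in milliseconds,
        // as the sample app does around writeRows and readRows.
        long start = System.currentTimeMillis();
        Thread.sleep(20); // stand-in for a Bigtable write or read
        long end = System.currentTimeMillis();
        long latencyMs = end - start;
        // The measured latency should be at least roughly the sleep duration.
        System.out.println(latencyMs >= 10);
    }
}
```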

You can review the captured metrics:

  1. In the GCP Console, go to the Metrics Explorer page:

    Go to Metrics Explorer

  2. In the left navigation pane, click Dashboards/Create Dashboard.

  3. In the dashboard title, replace Untitled Dashboard with Cloud Bigtable Metrics.

  4. In the left navigation pane, click Resources/Metrics Explorer.

  5. In Metrics Explorer, search for the metric by using Find resource type and metric. Enter opencensus/btappmetrics/write_latency, and then press Enter.

  6. From the list at the top, select Heatmap.

    The distribution Heatmap graph is shown in the right pane.

  7. Click Save Chart.

  8. Under Select Dashboard, select Cloud Bigtable Metrics.

  9. Click Save.

  10. In the left pane, next to the Metric label, click the X next to the selected metric to remove it.

  11. Repeat the steps for the metrics: opencensus/btappmetrics/read_latency and opencensus/btappmetrics/transaction_set_count.

  12. In the left navigation pane, select Dashboards/Cloud Bigtable Metrics.

    The three metrics charts are displayed.

  13. Zoom in by selecting a time range, clicking in one of the charts, and dragging the cursor to the right edge of the graph.

    For the heatmaps, you can see more details about the distribution by holding the cursor over the various colored blocks.

    Distribution details for heatmaps.

    Metrics Explorer shows a heatmap of the write latency. As the image shows, 2108 metrics samples fall into the 5–10 ms bucket.

Cleaning up

  1. In the GCP Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next
