Build observability into your infrastructure and applications

Last reviewed 2023-08-21 UTC

This document in the Google Cloud Architecture Framework provides best practices for adding observability to your services so that you can better understand service performance and quickly identify issues. Observability includes monitoring, logging, tracing, profiling, debugging, and similar systems.

Monitoring is at the base of the service reliability hierarchy in the Google SRE Handbook. Without proper monitoring, you can't tell whether an application works correctly.

Instrument your code to maximize observability

A well-designed system builds in the right amount of observability, starting in its development phase. Don't wait until an application is in production before you start to observe it. Instrument your code and consider the following guidance:

To debug and troubleshoot efficiently, think about what log and trace entries to write out, and what metrics to monitor and export. Prioritize by the most likely or frequent failure modes of the system.
Periodically audit and prune your monitoring. Delete unused or useless dashboards, graphs, alerts, tracing, and logging to eliminate clutter.
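
As a rough illustration of the first point, the following Python sketch wraps one operation so that it emits a structured log entry and updates success and error counters. It uses only the standard library. The names `observed`, `METRICS`, `charge_card`, and the `checkout` logger are hypothetical, and the in-memory counter is only a stand-in for whatever metrics exporter your monitoring system provides.

```python
import json
import logging
import time
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-process metric store; a real service would export these
# values to its monitoring backend instead of keeping them in memory.
METRICS = Counter()

logger = logging.getLogger("checkout")  # hypothetical service name


@contextmanager
def observed(operation):
    """Count the outcome of one operation and log its latency as JSON."""
    start = time.monotonic()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "error"
        raise  # re-raise so callers still see the failure
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        METRICS[f"{operation}.{outcome}"] += 1
        logger.info(json.dumps({
            "operation": operation,
            "outcome": outcome,
            "latency_ms": round(elapsed_ms, 2),
        }))


def charge_card(amount):
    """Hypothetical business function with one likely failure mode."""
    with observed("charge_card"):
        if amount <= 0:
            raise ValueError("amount must be positive")
        return "charged"
```

Wrapping the most failure-prone operations first keeps the instrumentation focused on the failure modes that matter, rather than logging everything.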

Google Cloud Observability provides real-time monitoring, hybrid multi-cloud monitoring and logging (such as for AWS and Azure), plus tracing, profiling, and debugging. Google Cloud Observability can also auto-discover and monitor microservices running on App Engine or in a service mesh like Istio.

If you generate large amounts of application data, you can optimize large-scale ingestion of analytics event logs with BigQuery. BigQuery is also suitable for persisting and analyzing high-cardinality time-series data from your monitoring framework. This approach is useful because it lets you run arbitrary queries at a lower cost instead of trying to design your monitoring perfectly from the start, and because it decouples reporting from monitoring. You can create reports from the data using Looker Studio or Looker.
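
To make the idea of decoupling reporting from monitoring concrete, the following sketch stores raw events and answers a question that was only chosen after the data was collected. It is not BigQuery code: Python's built-in `sqlite3` module stands in for the data warehouse so that the example runs locally, and the table layout, event values, and the `error_rate_by_endpoint` helper are all hypothetical.

```python
import sqlite3

# sqlite3 is only a local stand-in for a data warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "ts TEXT, service TEXT, endpoint TEXT, status INTEGER, latency_ms REAL)"
)

# Hypothetical raw events, stored as-is without pre-aggregation.
events = [
    ("2023-08-21T10:00:00Z", "checkout", "/pay", 200, 120.0),
    ("2023-08-21T10:00:01Z", "checkout", "/pay", 500, 950.0),
    ("2023-08-21T10:00:02Z", "catalog", "/items", 200, 35.0),
    ("2023-08-21T10:00:03Z", "checkout", "/cart", 200, 80.0),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", events)


def error_rate_by_endpoint(connection):
    """Compute the share of 5xx responses per endpoint from the raw events."""
    rows = connection.execute(
        """
        SELECT endpoint,
               SUM(CASE WHEN status >= 500 THEN 1 ELSE 0 END) * 1.0 / COUNT(*)
                   AS error_rate
        FROM events
        GROUP BY endpoint
        ORDER BY endpoint
        """
    ).fetchall()
    return dict(rows)
```

Because the events are kept in raw form, a new report (for example, latency by service) needs only a new query, not a change to what the application emits.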

Recommendations

To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:

Implement monitoring early, such as before you initiate a migration or before you deploy a new application to a production environment.
Distinguish application issues from underlying cloud issues. To do so, use the Monitoring API or other Cloud Monitoring products together with the Google Cloud Status Dashboard.
Define an observability strategy beyond monitoring that includes tracing, profiling, and debugging.
Regularly clean up observability artifacts that you don't use or that don't provide value, such as unactionable alerts.
If you generate large amounts of observability data, send application events to a data warehouse system such as BigQuery.
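
The cleanup recommendation above can be supported by a periodic audit script. The following sketch flags alerts for human review under two assumed criteria that are not part of the framework: the alert hasn't fired within a chosen idle window (90 days here), or it has no runbook and is therefore likely unactionable. The `AlertRecord` type and the `prune_candidates` function are hypothetical, and the inventory would come from wherever you track your alert definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class AlertRecord:
    """Hypothetical inventory entry for one alerting rule."""
    name: str
    last_fired: Optional[datetime]  # None if the alert has never fired
    has_runbook: bool


def prune_candidates(
    alerts: List[AlertRecord],
    now: datetime,
    max_idle: timedelta = timedelta(days=90),  # assumed threshold
) -> List[str]:
    """Return names of alerts that merit review, in inventory order."""
    candidates = []
    for alert in alerts:
        idle = alert.last_fired is None or now - alert.last_fired > max_idle
        if idle or not alert.has_runbook:
            candidates.append(alert.name)
    return candidates
```

The output is a review list, not a deletion list: an alert that rarely fires can still guard against a critical failure, so a person should confirm each removal.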

What's next

Design for scale and high availability (next document in this series)

Explore other categories in the Architecture Framework such as system design, operational excellence, security, privacy, and compliance.