Set up monitoring, alerting, and logging

Last reviewed 2023-08-08 UTC

This document in the Google Cloud Architecture Framework shows you how to set up monitoring, alerting, and logging so that you can act based on the behavior of your system. This includes identifying meaningful metrics to track and building dashboards to make it easier to view information about your systems.

The DevOps Resource and Assessment (DORA) research program defines monitoring as:

"The process of collecting, analyzing, and using information to track applications and infrastructure in order to guide business decisions. Monitoring is a key capability because it gives you insight into your systems and your work."

Monitoring enables service owners to:

  • Make informed decisions when changes to the service affect performance
  • Apply a scientific approach to incident response
  • Measure your service's alignment with business goals

With monitoring, logging, and alerting in place, you can do the following:

  • Analyze long-term trends
  • Compare your experiments over time
  • Define alerting on critical metrics
  • Build relevant real-time dashboards
  • Perform retrospective analysis
  • Monitor both business-driven metrics and system health metrics
    • Business-driven metrics help you understand how well your systems support your business. For example, use metrics to monitor the following:
      • The cost for an application to serve a user
      • The volume change in site traffic following a redesign
      • How long it takes a customer to purchase a product on your site
    • System health metrics help you understand whether your systems are operating correctly and within acceptable performance levels.

Use the following four golden signals to monitor your system:

  • Latency. The time it takes to service a request.
  • Traffic. How much demand is being placed on your system.
  • Errors. The rate of requests that fail. Failure can be explicit (for example, HTTP 500s), implicit (for example, an HTTP 200 success response coupled with the wrong content), or by policy (for example, if you commit to one-second response times, any request over one second is an error).
  • Saturation. How full your service is. Saturation measures how much of your system's capacity is in use, emphasizing the resources that are most constrained (for example, in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).
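
As an illustration of the four signals above, the following minimal Python sketch derives them from a window of request records. The `Request` record shape, the function names, and the one-second latency target are assumptions made for this example; they are not part of any Google Cloud API.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request. The record shape is hypothetical, for illustration."""
    latency_s: float          # time taken to service the request
    status: int               # HTTP status code returned
    content_ok: bool = True   # False when a 2xx response carried wrong content


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, used_capacity, total_capacity,
                   latency_target_s=1.0):
    """Summarize the four golden signals for one observation window."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = [r.latency_s for r in requests]
    # Errors: explicit (HTTP 5xx), implicit (wrong content), or by policy
    # (slower than the committed response time).
    errors = sum(
        1 for r in requests
        if r.status >= 500 or not r.content_ok or r.latency_s > latency_target_s
    )
    return {
        "latency_p50_s": percentile(latencies, 50),
        "latency_p99_s": percentile(latencies, 99),
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        # Report the most constrained resource, for example memory.
        "saturation": used_capacity / total_capacity,
    }
```

In this sketch, a request counts as an error if it fails in any of the three ways described above, so the error rate can be higher than the HTTP 5xx rate alone.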

Create a monitoring plan

Create a monitoring plan that aligns with your organization's mission and its operations strategy. Include monitoring and observability planning during application development. Including a monitoring plan early in application development can drive an organization toward operational excellence.

Include the following details in your monitoring plan:

  • Include all your systems, including on-premises resources and cloud resources.
  • Include monitoring of your cloud costs to help make sure that scaling events don't cause usage to cross your budget thresholds.
  • Build different monitoring strategies for measuring infrastructure performance, user experience, and business key performance indicators (KPIs). For example, static thresholds might work well to measure infrastructure performance but don't truly reflect the user's experience.
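
The cost-monitoring item above might be sketched as follows. This is a minimal Python example, assuming that you already export spend figures from your billing data; the function names, the threshold fractions, and the linear projection are illustrative assumptions, not a billing API.

```python
def crossed_thresholds(spend, budget, thresholds=(0.5, 0.9, 1.0)):
    """Return the budget fractions that the given spend has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = spend / budget
    return [t for t in sorted(thresholds) if used >= t]


def projected_spend(spend, days_elapsed, days_in_period):
    """Linear projection of end-of-period spend from spend so far."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    return spend / days_elapsed * days_in_period


# Example: 400 spent after 10 days of a 30-day period, with a budget of 1000.
forecast = projected_spend(400, days_elapsed=10, days_in_period=30)
warnings = crossed_thresholds(forecast, budget=1000)
```

Checking the projected spend as well as the current spend can surface a scaling event before the budget is actually exceeded. A linear projection is a simplification; real usage is often uneven across a billing period.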

Update the plan as your monitoring strategies mature. Iterate on the plan to improve the health of your systems.

Define metrics that measure all aspects of your organization

Define the metrics that are required to measure how your deployment behaves. To do so:

  • Define your business objectives.
  • Identify the metrics and KPIs that can provide you with quantifiable information to measure performance. Make sure your metric definitions translate to all aspects of your organization, from business needs—including cloud costs—to technical components.
  • Use these metrics to create service level indicators (SLIs) for your applications. For more information, see Choose appropriate SLIs.
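
An SLI is often expressed as the ratio of good events to valid events. The following Python sketch shows that calculation for an availability SLI and a latency SLI. The function names and the one-second threshold are assumptions for illustration only.

```python
def ratio_sli(good_events, valid_events):
    """Fraction of valid events that were good, or None if there is no data."""
    if valid_events == 0:
        return None
    return good_events / valid_events


def latency_sli(latencies_s, threshold_s):
    """Fraction of requests served within the latency threshold."""
    good = sum(1 for latency in latencies_s if latency <= threshold_s)
    return ratio_sli(good, len(latencies_s))


# Example: 995 successful requests out of 1000 valid requests.
availability = ratio_sli(995, 1000)
# Example: three of four requests finished within one second.
fast_enough = latency_sli([0.1, 0.2, 0.9, 1.4], threshold_s=1.0)
```

Returning `None` when there are no valid events keeps a quiet period from being reported as either perfect or failing.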

Common metrics for various components

Metrics are generated at all levels of your service, from infrastructure and networking to business logic. For example:

  • Infrastructure metrics:
    • Virtual machine statistics, including instances, CPU, memory, utilization, and counts
    • Container-based statistics, including cluster utilization, cluster capacity, pod level utilization, and counts
    • Networking statistics, including ingress/egress, bandwidth between components, latency, and throughput
    • Requests per second, as measured by the load balancer
    • Total disk blocks read, per disk
    • Packets sent over a given network interface
    • Memory heap size for a given process
    • Distribution of response latencies
    • Number of invalid queries rejected by a database instance
  • Application metrics:
    • Application-specific behavior, including queries per second, writes per second, and messages sent per second
  • Managed services metrics:
    • QPS, throughput, latency, and utilization for Google-managed services (APIs or products such as BigQuery, App Engine, and Bigtable)
  • Network connectivity metrics:
    • VPN/interconnect-related statistics about connecting to on-premises systems or systems that are external to Google Cloud
  • SLIs:
    • Metrics associated with the overall health of the system.

Set up monitoring

Set up monitoring for both on-premises resources and cloud resources.

Choose a monitoring solution that:

  • Is platform independent
  • Provides uniform capabilities for monitoring of on-premises, hybrid, and multi-cloud environments

Using a single platform to consolidate the monitoring data that comes in from different sources lets you build uniform metrics and visualization dashboards.

As you set up monitoring, automate monitoring tasks where possible.

Monitoring with Google Cloud

Using a monitoring service, such as Cloud Monitoring, is easier than building a monitoring service yourself. Monitoring a complex application is a substantial engineering endeavor by itself. Even with existing infrastructure for instrumentation, data collection and display, and alerting in place, it is a full-time job for someone to build and maintain.

Consider using Cloud Monitoring to obtain visibility into the performance, availability, and health of your applications and infrastructure for both on-premises and cloud resources.

Cloud Monitoring is a managed service that is part of Google Cloud Observability. You can use Cloud Monitoring to monitor Google Cloud services and custom metrics. Cloud Monitoring provides an API for integration with third-party monitoring tools.

Cloud Monitoring aggregates metrics, logs, and events from your system's cloud-based infrastructure. That data gives developers and operators a rich set of observable signals that can speed root-cause analysis and reduce mean time to resolution. You can use Cloud Monitoring to define alerts and custom metrics that meet your business objectives and help you aggregate, visualize, and monitor system health.

Cloud Monitoring provides default dashboards for cloud and open source application services. Using the metrics model, you can define custom dashboards with powerful visualization tools and configure charts in Metrics Explorer.

Set up alerting

A good alerting system improves your ability to release features. It helps you compare performance over time to determine the velocity of feature releases or the need to roll back a feature release. For information about rollbacks, see Restore previous releases seamlessly.

As you set up alerting, map alerts directly to critical metrics. These critical metrics include:

  • The four golden signals:
    • Latency
    • Traffic
    • Errors
    • Saturation
  • System health
  • Service usage
  • Security events
  • User experience

Make alerts actionable to minimize the time to resolution. To do so, for each alert:

  • Include a clear description, including stating what is monitored and its business impact.
  • Provide all the information necessary to act immediately. If understanding an alert takes several clicks and page navigation, it is difficult for the on-call person to act.
  • Define priority levels for various alerts.
  • Clearly identify the person or team responsible for responding to the alert.
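
One way to enforce the checklist above is to validate alert definitions before you deploy them. The following Python sketch is a minimal example; the `AlertDefinition` fields, the `Priority` levels, and the `missing_details` helper are hypothetical names chosen for this illustration, not part of Cloud Monitoring.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Optional


class Priority(Enum):
    """Example priority levels; define levels that fit your organization."""
    P1 = "page immediately"
    P2 = "respond within business hours"
    P3 = "review in weekly triage"


@dataclass
class AlertDefinition:
    name: str
    description: str            # what is monitored
    business_impact: str        # why it matters to the business
    next_steps: str             # what the responder needs to act immediately
    priority: Optional[Priority]
    owner: str                  # person or team responsible for responding


def missing_details(alert):
    """Return the names of fields that are empty, in declaration order."""
    missing = []
    for field in fields(alert):
        value = getattr(alert, field.name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field.name)
    return missing
```

A check like this could run in a review or deployment pipeline so that an alert without an owner or a priority never reaches the on-call rotation.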

For critical applications and services, build self-healing actions into the alerts that are triggered by common fault conditions, such as service health failures, configuration changes, or throughput spikes.

As you set up alerts, try to eliminate toil. For example, eliminate frequent errors, or automate fixes for those errors so that an alert might not need to be triggered at all. Eliminating toil lets those on call focus on making your application's operational components reliable. For more information, see Create a culture of automation.

Build monitoring and alerting dashboards

Once monitoring is in place, build relevant, uncomplicated dashboards that include information from your monitoring and alerting systems.

It can be difficult to choose dashboard visualizations that tie into your reliability goals. Create dashboards to visualize both:

  • Short-term and real-time analysis
  • Long-term analysis

For more information about implementing visual management, see the capability article Visual management.

Enable logging for critical applications

Logging services are critical to monitoring your systems. While metrics form the basis of specific items to monitor, logs contain valuable information that you need for debugging, security-related analysis, and compliance requirements.

Logging the data your systems generate helps you ensure an effective security posture. For more information about logging and security, see Implement logging and detective controls in the security category of the Architecture Framework.

Cloud Logging is an integrated logging service you can use to store, search, analyze, monitor, and alert on log data and events. Logging automatically collects logs from the services of Google Cloud and other cloud providers. You can use these logs to build metrics for monitoring and to create logging exports to external services such as Cloud Storage, BigQuery, and Pub/Sub.
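
Logs are easier to search and to turn into metrics when each entry is structured. The following Python sketch uses only the standard `logging` module to write one JSON object per line. The `JsonFormatter` class, the `build_logger` helper, and the `fields` convention for extra context are assumptions made for this example. The `severity` and `message` field names follow a convention that Cloud Logging recognizes in structured log payloads in several environments; verify how your environment ingests standard output before you rely on it.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        entry = {
            "severity": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        # Optional context passed as extra={"fields": {...}} (a convention
        # invented for this sketch).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(name, stream=sys.stdout):
    """Create a logger that writes structured entries to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

For example, `build_logger("checkout").error("payment failed", extra={"fields": {"order_id": "A-17"}})` writes a single JSON line that carries the severity, the message, and the order identifier as separate fields.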

Set up an audit trail

To help answer questions like "who did what, where, and when" in your Google Cloud projects, use Cloud Audit Logs.

Cloud Audit Logs captures several types of activity, such as the following:

  • Admin Activity logs contain log entries for API calls or other administrative actions that modify the configuration or metadata of resources. Admin Activity logs are always enabled.
  • Data Access audit logs record API calls that create, modify, or read user-provided data. Data Access audit logs are disabled by default because they can be quite large. You can configure which Google Cloud services produce Data Access audit logs.

For a list of Google Cloud services that write audit logs, see Google services with audit logs. Use Identity and Access Management (IAM) controls to limit who has access to view audit logs.
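
To show how an audit log entry answers "who did what, where, and when," the following Python sketch summarizes an entry that has already been exported as a dictionary. The helper names are hypothetical and the sample values are invented for illustration. The field paths (`protoPayload.authenticationInfo.principalEmail`, `protoPayload.methodName`, `protoPayload.resourceName`, and `timestamp`) and the log name suffixes follow the audit log entry format as understood here; confirm them against the entries in your own projects.

```python
AUDIT_LOG_TYPES = {
    "cloudaudit.googleapis.com%2Factivity": "Admin Activity",
    "cloudaudit.googleapis.com%2Fdata_access": "Data Access",
}


def audit_log_type(log_name):
    """Classify an entry by the last segment of its log name."""
    log_id = log_name.rsplit("/", 1)[-1]
    return AUDIT_LOG_TYPES.get(log_id, "other")


def summarize_audit_entry(entry):
    """Extract who, what, where, and when from an exported audit log entry."""
    payload = entry.get("protoPayload", {})
    return {
        "who": payload.get("authenticationInfo", {}).get("principalEmail"),
        "what": payload.get("methodName"),
        "where": payload.get("resourceName"),
        "when": entry.get("timestamp"),
        "type": audit_log_type(entry.get("logName", "")),
    }


# Invented sample entry; real entries contain many more fields.
SAMPLE_ENTRY = {
    "logName": "projects/example-project/logs/cloudaudit.googleapis.com%2Factivity",
    "timestamp": "2023-08-08T12:00:00Z",
    "protoPayload": {
        "authenticationInfo": {"principalEmail": "admin@example.com"},
        "methodName": "example.service.UpdateThing",
        "resourceName": "projects/example-project/things/thing-1",
    },
}
summary = summarize_audit_entry(SAMPLE_ENTRY)
```

Because missing fields resolve to `None` instead of raising an error, the same helper can process entries from services that populate only some of these fields.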

What's next