Operational excellence

This section of the architecture framework explores how operational excellence results from efficiently running, managing, and monitoring systems that deliver business value.


Operational excellence helps you build a foundation for another critical principle, reliability. (See the Reliability section for related technical and procedural requirements for architecting and operating reliable services on Google Cloud.)


Strategies

Use these strategies to achieve operational excellence.

Automate build, test, and deployment. Use continuous integration and continuous deployment (CI/CD) pipelines to build automated testing into your releases, and to perform integration testing and deployment automatically.

Monitor business objective metrics. Define, measure, and alert on relevant business metrics.

Conduct disaster recovery testing. Don't wait for a disaster to strike. Instead, periodically test your disaster recovery procedures to verify that they work.

Best practices

Follow these practices to achieve operational excellence.

  • Increase software development and release velocity.
  • Monitor for system health and business health.
  • Plan and design for failures.

The following sections cover the best practices in detail.

Increase development and release velocity

Use a CI/CD approach to increase velocity. A CI/CD approach makes your software development team more productive by automating integration testing in the build process, and by automating deployment after your build meets your testing criteria. Your developers can then make smaller, more frequent changes that are thoroughly tested and that take less time to deploy.

This section describes elements of a CI/CD approach: release engineering, automation, central code repositories, build pipelines, testing, and deployment.

Release engineering

Release engineering is a job function that oversees how software is built and delivered. Release engineering is guided by four practices:

  • Self-service model. Establish guidelines that help software engineers avoid common mistakes, and enforce the guidelines through automated processes.
  • Frequent releases. High velocity helps troubleshooting and makes fixing issues easier. Frequent releases rely on automated unit tests.
  • Hermetic builds. Ensure consistency in your build tools: building the same source version now should produce the same result as building it one month ago, so version the compilers and other build tools that you use.
  • Policy enforcement. Require code review for all changes, ideally with a set of guidelines and policies that enforce security. Policy enforcement improves code review, troubleshooting, and the testing of new releases.


Automation

Automate your build and release pipeline to scan for any known issues and perform rapid testing. You can also use automation to eliminate repetitive tasks.

Central code repositories

Store your code in a central repository, and version and label it (for example, test, dev, prod) as needed. Taking these steps helps ensure that your build pipeline produces consistent results. In Google Cloud, you can store and version your code in Cloud Source Repositories and integrate it with various products.

Build pipelines

Version your build configuration to make sure that all your builds are consistent, and that you can roll back to the last known good configuration if necessary. In Google Cloud, Cloud Build helps you define dependencies and versions for building an application package. You can use Cloud Functions to trigger a build process periodically, or to trigger builds on specific events, such as when new code is checked in. You can also use Cloud Functions to trigger testing and to automate the entire pipeline.
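Rolling back to the last known good configuration can be sketched as follows. This is a minimal illustration, assuming a hypothetical `BuildConfig` record; in a real pipeline the configuration history would come from version control or your build system's build history.

```python
# Minimal sketch of selecting the last known good build configuration
# for a rollback. The BuildConfig record and the history list are
# hypothetical; real data would come from version control or the
# build system's history.
from dataclasses import dataclass

@dataclass
class BuildConfig:
    version: str      # tag of the versioned build configuration
    succeeded: bool   # did the build and its tests pass?

def last_known_good(history: list[BuildConfig]) -> BuildConfig:
    """Return the most recent configuration whose build succeeded."""
    for config in reversed(history):  # newest entries are last
        if config.succeeded:
            return config
    raise LookupError("no successful build configuration to roll back to")

history = [
    BuildConfig("v1.2.0", succeeded=True),
    BuildConfig("v1.3.0", succeeded=True),
    BuildConfig("v1.4.0", succeeded=False),  # the broken release
]
print(last_known_good(history).version)  # → v1.3.0
```

Because the configuration is versioned alongside the outcome of each build, the rollback target is a lookup rather than a guess.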


Testing

Testing is a critical part of a successful launch. Testing examples include:

  • Unit testing. Unit tests are fast and help you perform rapid deployments.
  • Integration testing. These tests can get complex when you test for integration with interconnected services.
  • System testing. System tests are time consuming and complex, but they help you identify edge cases and fix issues before deployment.

You can perform other tests, such as static analysis, load testing, and security testing, before you deploy your application in production. After you automate testing, you can update and add new tests to improve and maintain the operational health of your deployment.
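As a minimal illustration of the unit-testing layer above, the following sketch shows a test that a CI pipeline could run on every build. The `apply_discount` function and its rules are hypothetical examples, not part of any Google Cloud API.

```python
# Minimal sketch of an automated unit test that a CI pipeline runs on
# every build. The apply_discount function and its rules are
# hypothetical illustrations.
import unittest

def apply_discount(price: float, percent: float) -> float:
    """Apply a percentage discount, clamped to the 0-100% range."""
    percent = max(0.0, min(100.0, percent))
    return round(price * (1 - percent / 100), 2)

class ApplyDiscountTest(unittest.TestCase):
    def test_regular_discount(self):
        self.assertEqual(apply_discount(100.0, 25), 75.0)

    def test_discount_is_clamped(self):
        self.assertEqual(apply_discount(100.0, 150), 0.0)
        self.assertEqual(apply_discount(100.0, -10), 100.0)
```

A pipeline would run such tests with a command like `python -m unittest` and fail the build if any assertion fails, which is what makes unit tests fast gates for rapid deployments.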


Deployment

You can choose how your application is rolled out. It's a best practice to do canary testing and observe your system for errors, which is easier if you have a robust monitoring and alerting system. In Google Cloud, you can use managed instance groups (MIGs) to do A/B or canary testing, and to perform a slow rollout or a rollback if required.
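The decision step in a canary rollout can be sketched as comparing the canary's error rate against the stable baseline. This is a minimal sketch under assumed thresholds and request counts; in practice, the numbers would come from your monitoring system.

```python
# Minimal sketch of the decision step in a canary rollout: compare the
# canary's error rate against the stable baseline and roll back if it
# regresses beyond a tolerance. The tolerance and request counts are
# hypothetical.
def canary_verdict(baseline_errors: int, baseline_total: int,
                   canary_errors: int, canary_total: int,
                   tolerance: float = 0.01) -> str:
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    if canary_rate > baseline_rate + tolerance:
        return "rollback"
    return "promote"

print(canary_verdict(50, 10_000, 4, 500))   # 0.5% vs 0.8% → promote
print(canary_verdict(50, 10_000, 12, 500))  # 0.5% vs 2.4% → rollback
```

The same comparison generalizes to latency or any other golden signal; the key design choice is that the rollback decision is automated and driven by monitoring data rather than by intuition.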

Design questions

  • How does your development team manage build and release?
  • What integration and security testing does your development team employ?
  • How do you roll back?


Recommendations

  • Make the CI/CD pipeline the only way to deploy to production.
  • Isolate and secure your CI/CD environment.
  • Build only once and promote the result through the pipeline.
  • Keep your CI/CD pipelines fast.
  • Minimize branching in your version control system.

Key services

Cloud Source Repositories is a fully featured, private Git repository service hosted on Google Cloud. You can use Cloud Source Repositories for collaborative development of any application or service.

Container Registry is a single place for your team to manage Docker images, perform vulnerability analysis, and decide who can access what with fine-grained access control. Existing CI/CD integrations allow you to set up fully automated Docker pipelines to get fast feedback.

Cloud Build is a service that executes your builds on the Google Cloud infrastructure. Cloud Build can import source code from GitHub, Bitbucket, Cloud Storage, or Cloud Source Repositories, execute a build to your specifications, and produce artifacts such as Docker containers or Java archives.

Monitor system health and business health

The DevOps Resource and Assessment (DORA) project defines monitoring as follows:

Monitoring is the process of collecting, analyzing, and using information to track applications and infrastructure in order to guide business decisions. Monitoring is a key capability because it gives you insight into your systems and your work.

Through monitoring, you can make decisions about the impact of changes to your service, apply the scientific method to incident response, and measure your service's alignment with your business goals. With monitoring in place, you can do the following:

  • Analyze long-term trends.
  • Compare your experiments over time.
  • Define alerting on critical metrics.
  • Build relevant real-time dashboards.
  • Perform retrospective analysis.

Monitor both business-driven metrics and system health metrics. Business-driven metrics help you understand how well your systems support your business. For example, you could monitor the cost to serve a user in an application, the change in volume of traffic to your site following a redesign, or how long it takes a customer to purchase a product on your site. System health metrics help you understand whether your systems are operating correctly and within acceptable performance levels.

Use the following four golden signals to monitor your system:

  • Latency. The time it takes to service a request.
  • Traffic. How much demand is being placed on your system.
  • Errors. The rate of requests that fail. Requests can fail explicitly (for example, HTTP 500s), implicitly (for example, an HTTP 200 success response, but with the wrong content), or by policy (for example, if you committed to one-second response times, any request that takes more than one second is an error).
  • Saturation. How full your service is. A measure of your most constrained resources. (That is, in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).
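The four signals above can be derived from raw request data. The following is a minimal sketch; the record format, the 60-second window, the one-second latency policy, and the memory figures are hypothetical illustrations, not a prescribed schema.

```python
# Minimal sketch that computes the four golden signals over a window of
# request records. The record format, the 60-second window, the
# one-second latency policy, and the memory numbers are hypothetical.
from statistics import quantiles

WINDOW_S = 60          # length of the observation window in seconds
LATENCY_SLO_S = 1.0    # requests slower than this count as errors by policy

# (latency in seconds, HTTP status) for each request in the window
requests = [(0.12, 200), (0.30, 200), (1.40, 200), (0.08, 500), (0.25, 200)]

latencies = [lat for lat, _ in requests]
p99_latency = quantiles(latencies, n=100)[98]          # latency signal
traffic = len(requests) / WINDOW_S                     # requests per second
errors = sum(1 for lat, status in requests
             if status >= 500 or lat > LATENCY_SLO_S)  # explicit + by policy
error_rate = errors / len(requests)
saturation = 7.2 / 8.0  # e.g. GiB of memory used / memory limit

print(f"p99={p99_latency:.2f}s traffic={traffic:.2f}/s "
      f"errors={error_rate:.0%} saturation={saturation:.0%}")
```

Note that the error count includes both the explicit HTTP 500 and the HTTP 200 response that violated the one-second policy, matching the definition of errors given above.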


Logging

Logging services are critical to monitoring your systems. While metrics form the basis of specific items to monitor, logs contain valuable information that you need for debugging, security-related analysis, and compliance requirements. Google Cloud includes Cloud Logging, an integrated logging service that you can use to store, search, analyze, monitor, and alert on log data and events from Google Cloud. Cloud Logging automatically collects logs from Google Cloud services. You can use these logs to build metrics for monitoring and to create logging exports to external services such as Cloud Storage, BigQuery, and Pub/Sub.


Metrics

Define metrics to measure how your deployment behaves. Make sure that your metric definitions always translate to business needs, and consider promoting or combining some metrics to form service level indicators (SLIs). For details, see Reliability.
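Promoting a raw metric to an SLI usually means expressing it as a ratio of good events to total events. A minimal sketch, with hypothetical request counts:

```python
# Minimal sketch of promoting a raw metric to a service level indicator
# (SLI): the ratio of good events to valid events. The request counts
# are hypothetical; real values would come from your monitoring system.
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI = good events / valid events, expressed as a ratio."""
    return good_requests / total_requests

sli = availability_sli(good_requests=999_543, total_requests=1_000_000)
print(f"{sli:.4%}")  # → 99.9543%
```

Expressing the SLI as a ratio makes it directly comparable to an SLO target such as 99.9%, regardless of how much traffic the service receives.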

All levels of your service generate metrics, from infrastructure and networking to business logic. Examples include the following:

  • Requests per second, as measured by the load balancer.
  • Total disk blocks read, per disk.
  • Packets sent over a given network interface.
  • Memory heap size for a given process.
  • Distribution of response latencies.
  • Number of invalid queries rejected by a database instance.


Monitoring a complex application is a significant engineering endeavor in and of itself. Google Cloud provides Cloud Monitoring, a managed service that is part of the Google Cloud Operations Suite. You can use Cloud Monitoring to monitor Google Cloud services and custom metrics, and Cloud Monitoring provides an API for integration with third-party monitoring tools.

Cloud Monitoring aggregates metrics, logs, and events from your infrastructure, giving you a rich set of observable signals that help you speed up root-cause analysis and reduce mean time to resolution (MTTR). You can define alerts and custom metrics that meet your business objectives and that help you aggregate, visualize, and monitor your system health.

Cloud Monitoring provides default dashboards for cloud and open source application services. Using the metrics model, you can define custom dashboards with powerful visualization tools and configure charts in Metrics Explorer.


Dashboards

After you have monitoring in place, build dashboards that are relevant to you so that you can take action. Make your dashboards simple and easy to read. Perform both short-term (real-time) and long-term analyses, and visualize the results. For details, see Reliability.


Alerts

Make sure that your alerting system maps directly to the four golden signals of monitoring your system, so that you can compare performance over time to determine feature velocity or to decide whether to roll back changes.

Make alerts actionable. When you send alerts, include a description and all the information necessary for the on-call person to take action immediately. It shouldn't take several clicks and extra navigation to understand how to act on an alert.
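An actionable alert can be sketched as a notification payload that carries the context the on-call person needs. All field names, services, and messages below are hypothetical illustrations, not a Cloud Monitoring schema.

```python
# Minimal sketch of an actionable alert payload: the notification
# carries everything the on-call person needs to act immediately.
# Every field name, service name, and message here is hypothetical.
def build_alert(metric: str, value: float, threshold: float) -> dict:
    return {
        "summary": f"{metric} breached: {value} (threshold {threshold})",
        "description": "Checkout error rate exceeded the alerting threshold.",
        "impact": "Customers may be unable to complete purchases.",
        "suggested_action": "Roll back the latest release if it correlates "
                            "with the spike; otherwise follow the runbook.",
        "dashboard": "checkout-service overview dashboard",
    }

alert = build_alert("checkout_error_rate", 0.052, 0.01)
print(alert["summary"])
```

The design point is that impact and a suggested first action travel with the alert itself, rather than requiring the responder to reconstruct that context under pressure.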

Always try to eliminate toil, for example, by eliminating or automating fixes for errors you see frequently. Enable the on-call person to focus on making the operational components reliable. For details, see Reliability.

Escalation path

A well-defined escalation path is key to reducing the effort you spend in getting support for Google Cloud products. This path includes learning how to work with Google's support team, finding architecture docs tuned for support engineers, defining how to communicate during an outage, and setting up monitoring and logging for diagnosing issues.

You can start defining an escalation path by making sure that security, network, and system administrators are set up to receive critical email and alerts from Google Cloud. Receiving these alerts helps administrators make informed decisions and potentially fix issues early. Similarly, make sure that project owners have email-routable usernames so that they receive critical emails.


Recommendations

  • Choose relevant metrics that map to your business needs.
  • Use Cloud Monitoring and deploy monitoring agents for custom metrics if necessary.
  • Ensure that Cloud Logging is configured for all your log entries.
  • Design well-defined alerts, such as percentage success or failure.
  • Alert with information to take action.
  • Consider purchasing a role-based or enterprise support package.
  • Define an escalation path and provide useful indicators such as time, product, and location while working with Cloud Support.

Key services

Cloud Monitoring provides metrics collection, aggregation, and dashboards, as well as an alerting framework and endpoint checks to web applications and other internet-accessible services.

Cloud Logging lets you filter, search, view, and export to BigQuery, Cloud Storage, or Pub/Sub logs from your cloud and open source application services. You can define metrics based on log contents that are incorporated into dashboards and alerts.

Cloud Debugger connects your application's production data to your source code by inspecting the state of your application at any code location in production without stopping or slowing down your application requests.

Error Reporting analyzes and aggregates the errors in your cloud applications and notifies you when new errors are detected.

Cloud Trace provides latency sampling and reporting for App Engine, including per-URL statistics and latency distributions.

Cloud Profiler provides continuous profiling of resource consumption in your production applications to help you identify and eliminate performance issues.



Design for disaster recovery

Designing your system to anticipate and handle failure scenarios helps ensure that if there is a catastrophe, the impact on your systems is minimized. To anticipate failures, make sure you have a well-defined and regularly tested disaster recovery (DR) plan to back up and restore services and data.

Service-interrupting events can happen at any time. Your network could have an outage, your latest application push might introduce a critical bug, or you might have to contend with a natural disaster. When things go awry, it's important to have a robust, targeted, and well-tested DR plan.


Basics of DR planning

DR is a subset of business continuity planning. DR planning begins with a business impact analysis that defines two key metrics:

  • A recovery time objective (RTO), which is the maximum acceptable length of time that your application can be offline. This value is usually defined as part of a larger service level agreement (SLA).

  • A recovery point objective (RPO), which is the maximum acceptable length of time during which data might be lost from your application due to a major incident. This metric varies based on the ways the data is used. For example, user data that's frequently modified could have an RPO of just a few minutes. Less critical, infrequently modified data could have an RPO of several hours. This metric describes only the length of time; it doesn't address the amount or quality of the data that's lost.
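The RPO translates directly into operational requirements such as backup frequency: with periodic backups, the worst-case data loss is one full backup interval. A minimal sketch of that relationship, with hypothetical values:

```python
# Minimal sketch relating a backup schedule to an RPO: with periodic
# backups, the worst-case data loss is one full backup interval, so
# the interval must not exceed the RPO. The values are hypothetical.
def meets_rpo(backup_interval_min: float, rpo_min: float) -> bool:
    worst_case_loss_min = backup_interval_min  # data written since last backup
    return worst_case_loss_min <= rpo_min

print(meets_rpo(backup_interval_min=60, rpo_min=240))    # hourly backups, 4h RPO → True
print(meets_rpo(backup_interval_min=1440, rpo_min=240))  # daily backups, 4h RPO → False
```

This is also why an RPO of a few minutes typically forces continuous replication rather than scheduled backups, which drives up cost, as the next section discusses.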

Typically, the smaller your RTO and RPO values (that is, the faster your application must recover from an interruption), the more your application costs to run. The following graph shows the ratio of cost to RTO/RPO:

Figure: Ratio of cost to RTO/RPO, showing that the faster your application must recover, the more the application costs to run.

Because smaller RTO and RPO values often mean greater complexity, administrative overhead follows a similar curve. For example, a high-availability (HA) application might require you to manage distribution between two physically separated data centers, manage replication, and more.

RTO and RPO values typically roll up into another metric: the service level objective (SLO), which is a key measurable element of an SLA.

  • An SLA is the entire agreement that specifies what service is to be provided, how it is supported, times, locations, costs, performance, penalties, and responsibilities of the parties involved.
  • SLOs are specific, measurable characteristics of the SLA, such as availability, throughput, frequency, response time, or quality.

A single SLA can contain many SLOs. RTOs and RPOs are measurable and should be considered SLOs.
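As a worked example of how these metrics connect, an availability SLO implies a downtime budget that you can compare against your RTO. A minimal sketch:

```python
# Minimal sketch of turning an availability SLO into a downtime budget
# that can be compared against an RTO. A 99.9% SLO over a 30-day month
# permits roughly 43 minutes of downtime.
def downtime_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, implied by an availability SLO."""
    return (1 - slo) * period_days * 24 * 60

print(f"{downtime_budget_minutes(0.999):.1f} min")   # ≈ 43.2 min
print(f"{downtime_budget_minutes(0.9999):.1f} min")  # ≈ 4.3 min
```

If your RTO alone exceeds this budget, the SLO is unachievable after even a single incident, so the two values must be designed together.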

Infrastructure requirements

In DR, it's a best practice to account for a number of requirements, including the following:

  • Capacity: securing enough resources to scale as needed.
  • Security: providing physical security to protect assets.
  • Network infrastructure: including software components such as firewalls and load balancers.
  • Support: making available skilled technicians to perform maintenance and to address issues.
  • Bandwidth: planning suitable bandwidth for peak load.
  • Facilities: ensuring physical infrastructure, including equipment and power.

Disaster recovery on Google Cloud

Google Cloud can help you reduce the cost of fulfilling RTO and RPO requirements compared to fulfilling them on-premises. Google Cloud helps you bypass most or all of the complicating factors related to physical hardware, removing many business costs in the process. In addition, Google Cloud's focus on administrative simplicity is designed to help reduce the costs of managing a complex application.

Google Cloud offers several features that are relevant to DR planning:

Global network. Google has one of the largest and most advanced computer networks in the world. The Google backbone network uses advanced software-defined networking and has edge-caching services to deliver fast, consistent, and scalable performance.

Redundancy. Multiple points of presence (PoPs) across the globe ensure strong redundancy. Your data is mirrored automatically across storage devices in multiple locations.

Scalability. Google Cloud is designed to scale like other Google products (for example, Search and Gmail), even when you experience a huge traffic spike. Managed services such as App Engine, Compute Engine autoscalers, and Datastore provide automatic scaling so your application can grow and shrink as needed.

Security. The Google security model is built on over 15 years of experience with helping to keep customers safe on Google applications like Gmail and Google Workspace. In addition, the site reliability engineering teams at Google help ensure high availability and help prevent abuse of platform resources.

Compliance. Google undergoes regular independent third-party audits to verify that Google Cloud is in alignment with security, privacy, and compliance regulations and best practices. Google Cloud supports compliance with certifications such as ISO 27001, SOC 2/3, and PCI DSS 3.2.1.


Recommendations

  • Define your RTO and RPO objectives.
  • Design your DR plan based on the solutions for data and applications.
  • Test your DR plan manually at least once a year.
  • Evaluate implementing controlled fault injection to catch regressions early.
  • Leverage chaos engineering to find areas of risk.

Key services

Persistent Disk snapshots provide incremental backups of Compute Engine virtual machine (VM) disks that you can copy across regions and use to recreate persistent disks in the event of a disaster.

Live Migration keeps your VM instances running even when a host system event occurs, such as a software or hardware update.

Cloud Storage is an object store that provides storage classes, such as Nearline and Coldline, that are suited for specific use cases, such as backup.

Cloud DNS provides a programmatic way to manage your DNS entries as part of an automated recovery process. Cloud DNS uses the Google global network of Anycast name servers to serve your DNS zones from redundant locations around the world, providing high availability and lower latency for your users.