This document in the Google Cloud Architecture Framework provides operational principles to run your service in a reliable manner, such as how to deploy updates, run services in production environments, and test for failures. Architecting for reliability should cover the whole lifecycle of your service, not just software design.
Choose good names for applications and services
Avoid using internal code names in production configuration files, because they can be confusing, particularly to newer employees, potentially increasing time to mitigate (TTM) for outages. As much as possible, choose good names for all of your applications, services, and critical system resources such as VMs, clusters, and database instances, subject to their respective limits on name length. A good name describes the entity's purpose; is accurate, specific, and distinctive; and is meaningful to anybody who sees it. A good name avoids acronyms, code names, abbreviations, and potentially offensive terminology, and would not create a negative public response even if published externally.
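As a hypothetical illustration, compare names that describe an entity's purpose with names that only insiders can decode (all of these names are made up):

```
checkout-payment-api-prod     # clear: describes purpose and environment
user-profile-db-eu-west1      # clear: describes contents and location
project-bluebird-svc          # unclear: internal code name
tmp-vm-2                      # unclear: vague and nondescriptive
```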
Implement progressive rollouts with canary testing
Instantaneous global changes to service binaries or configuration are inherently risky. Roll out new versions of executables and configuration changes incrementally. Start with a small scope, such as a few VM instances in a zone, and gradually expand the scope. Roll back rapidly if the change doesn't perform as you expect, or negatively impacts users at any stage of the rollout. Your goal is to identify and address bugs when they only affect a small portion of user traffic, before you roll out the change globally.
Set up a canary testing system that's aware of service changes and does an A/B comparison of the metrics from the changed servers against those from the remaining servers. The system should flag unexpected or anomalous behavior, and automatically halt the rollout if the change doesn't perform as you expect. Problems can be obvious, such as user-visible errors, or subtle, such as increased CPU usage or memory bloat.
It's better to stop and roll back at the first hint of trouble and diagnose issues without the time pressure of an outage. After the change passes canary testing, propagate it to larger scopes gradually, such as to a full zone, then to a second zone. Allow time for the changed system to handle progressively larger volumes of user traffic to expose any latent bugs.
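The following sketch illustrates this kind of staged rollout with an automated canary check. The stage names, watched metrics, thresholds, and the `deploy`, `get_metrics`, and `rollback` helpers are hypothetical placeholders for your own deployment and monitoring tooling, not a specific Google Cloud API.

```python
import time

# Hypothetical rollout stages, from a small canary scope to global.
STAGES = ["canary-vms", "zone-a", "zone-b", "region-1", "global"]

# Metrics to A/B-compare between changed servers and the remaining servers.
WATCHED_METRICS = ["error_rate", "p99_latency_ms", "cpu_utilization"]
MAX_RELATIVE_REGRESSION = 0.05   # Halt the rollout if a metric degrades by >5%.
SOAK_TIME_SECONDS = 600          # Let each stage handle real traffic for a while.


def canary_is_healthy(changed: dict, baseline: dict) -> bool:
    """Compare metrics of changed servers against the unchanged servers."""
    for metric in WATCHED_METRICS:
        if changed[metric] > baseline[metric] * (1 + MAX_RELATIVE_REGRESSION):
            print(f"Anomaly in {metric}: {changed[metric]:.3f} "
                  f"vs baseline {baseline[metric]:.3f}")
            return False
    return True


def progressive_rollout(new_version: str, deploy, get_metrics, rollback) -> bool:
    """Roll out new_version stage by stage; halt and roll back on anomalies."""
    for stage in STAGES:
        deploy(stage, new_version)
        time.sleep(SOAK_TIME_SECONDS)
        changed = get_metrics(scope=stage, version=new_version)
        baseline = get_metrics(scope=stage, version="previous")
        if not canary_is_healthy(changed, baseline):
            rollback(stage, new_version)   # Stop at the first hint of trouble.
            return False
    return True
```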
Spread out traffic for timed promotions and launches
You might have promotional events, such as sales that start at a precise time and encourage many users to connect to the service simultaneously. If so, design client code to spread the traffic over a few seconds by adding a random delay before each client initiates its requests.
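For example, client code might add random jitter before its first request at the start of the event. This is a minimal sketch; the delay window and the request function are assumptions.

```python
import random
import time
import urllib.request

MAX_JITTER_SECONDS = 10  # Spread the initial requests over this window.

def fetch_promotion(url: str) -> bytes:
    """Wait a random interval before the first request so that clients
    that start at the same instant don't all hit the service at once."""
    time.sleep(random.uniform(0, MAX_JITTER_SECONDS))
    with urllib.request.urlopen(url) as response:
        return response.read()
```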
You can also pre-warm the system by sending it the user traffic that you anticipate ahead of time, to verify that it performs as you expect. This approach prevents instantaneous traffic spikes that could crash your servers at the scheduled start time.
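Pre-warming can be as simple as generating the load that you expect, ramped up ahead of the event. The following sketch is only a hypothetical illustration, not a replacement for a proper load-testing tool; the target URL and rates are placeholders.

```python
import concurrent.futures
import time
import urllib.request

def send_request(url: str) -> int:
    with urllib.request.urlopen(url) as response:
        return response.status

def prewarm(url: str, target_qps: int, ramp_steps: int = 5, step_seconds: int = 60):
    """Gradually ramp synthetic traffic up to the expected launch-time load."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
        for step in range(1, ramp_steps + 1):
            qps = target_qps * step // ramp_steps   # increase load each step
            deadline = time.time() + step_seconds
            while time.time() < deadline:
                for _ in range(qps):
                    pool.submit(send_request, url)
                time.sleep(1)
```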
Automate build, test, and deployment
Eliminate manual effort from your release process by using continuous integration and continuous delivery (CI/CD) pipelines. Perform automated integration testing and deployment. For example, create a modern CI/CD process with GKE.
For more information, see continuous integration, continuous delivery, test automation, and deployment automation.
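As a minimal illustration of chaining build, test, and deploy steps without manual intervention, the following sketch shells out to common tools. In practice you would run these steps from a managed CI/CD system such as Cloud Build rather than a script, and the image name, deployment name, and test path shown here are placeholders.

```python
import subprocess

# Hypothetical image and deployment names; substitute your own values.
IMAGE = "gcr.io/my-project/my-service:candidate"
DEPLOYMENT = "deployment/my-service"

def run(cmd: list[str]) -> None:
    """Run one pipeline step and stop the pipeline if it fails."""
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def pipeline() -> None:
    run(["pytest", "tests/"])                                  # automated tests
    run(["gcloud", "builds", "submit", "--tag", IMAGE, "."])   # build the image
    run(["kubectl", "set", "image", DEPLOYMENT,
         f"my-service={IMAGE}"])                               # deploy to GKE
```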
Design for safety
Design your operational tools to reject potentially invalid configurations. Detect and alert when a configuration version is empty, partial or truncated, corrupt, logically incorrect or unexpected, or not received within the expected time. Tools should also reject configuration versions that differ too much from the previous version.
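A configuration-loading tool might apply checks like the following before it accepts a new version. This is a sketch under assumptions: the required keys, the change threshold, and the JSON format are hypothetical, not part of any particular tool.

```python
import json

MAX_CHANGED_FRACTION = 0.5   # Reject versions that differ too much from the last one.
REQUIRED_KEYS = {"version", "backends", "timeouts"}   # hypothetical schema

def validate_config(raw: str, previous: dict) -> dict:
    """Reject empty, truncated, corrupt, partial, or wildly different configs."""
    if not raw or not raw.strip():
        raise ValueError("Configuration is empty")
    try:
        config = json.loads(raw)          # corrupt or truncated files fail here
    except json.JSONDecodeError as e:
        raise ValueError(f"Configuration is corrupt or truncated: {e}")
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"Configuration is partial; missing keys: {missing}")
    changed = sum(1 for key in previous if config.get(key) != previous[key])
    if previous and changed / len(previous) > MAX_CHANGED_FRACTION:
        raise ValueError("Configuration differs too much from the previous version")
    return config
```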
Disallow potentially destructive changes or commands that have too broad a scope, such as "Revoke permissions for all users", "Restart all VMs in this region", or "Reformat all disks in this zone". Apply such changes only if the operator adds emergency override command-line flags or option settings when they deploy the configuration.
Tools must display the breadth of impact of risky commands, such as the number of VMs that the change affects, and require explicit operator acknowledgment before the tool proceeds. You can also use features such as Cloud Storage retention policy locks to lock critical resources and prevent their accidental or unauthorized deletion.
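A minimal sketch of such a guard, assuming a hypothetical VM-count threshold and an emergency override flag:

```python
MAX_VMS_WITHOUT_OVERRIDE = 10   # Hypothetical threshold for "too broad a scope".

def confirm_risky_command(description: str, affected_vms: list[str],
                          emergency_override: bool = False) -> bool:
    """Display the breadth of impact and require explicit acknowledgment."""
    print(f"Command: {description}")
    print(f"This change affects {len(affected_vms)} VM(s):")
    for vm in affected_vms[:10]:
        print(f"  - {vm}")
    if len(affected_vms) > MAX_VMS_WITHOUT_OVERRIDE and not emergency_override:
        print("Refusing: scope is too broad. Rerun with an emergency override flag.")
        return False
    return input("Type 'yes' to proceed: ").strip().lower() == "yes"
```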
Test failure recovery
Regularly test your operational procedures to recover from failures in your service. Without regular tests, your procedures might not work when you need them during a real failure. Items to test periodically include regional failover, how to roll back a release, and how to restore data from backups.
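For example, a scheduled job might restore the most recent backup into a scratch environment and verify its contents. The sketch below only outlines the idea; the `restore_backup` and `run_query` helpers and the table name are hypothetical placeholders for your own tooling.

```python
import datetime

def test_backup_restore(restore_backup, run_query) -> bool:
    """Restore the latest backup into a scratch instance and verify it."""
    scratch = f"restore-test-{datetime.date.today():%Y%m%d}"
    restore_backup(target_instance=scratch)
    row_count = run_query(scratch, "SELECT COUNT(*) FROM orders")  # hypothetical table
    if row_count == 0:
        raise AssertionError("Restored database is empty; backup may be unusable")
    return True
```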
Conduct disaster recovery tests
As with failure recovery tests, don't wait for a disaster to strike. Periodically test and verify your disaster recovery procedures and processes.
You might create a system architecture to provide high availability (HA). This architecture doesn't entirely overlap with disaster recovery (DR), but it's often necessary to take HA into account when you think about recovery time objective (RTO) and recovery point objective (RPO) values.
HA helps you to meet or exceed an agreed level of operational performance, such as uptime. When you run production workloads on Google Cloud, you might deploy a passive or active standby instance in a second region. With this architecture, the application continues to provide service from the unaffected region if there's a disaster in the primary region. For more information, see Architecting disaster recovery for cloud outages.
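As a simplified illustration of the failover decision in such an architecture, the following sketch serves from the standby region when the primary region's health check fails. The endpoints and region names are hypothetical, and in practice failover is typically driven by load balancing and DNS rather than application code.

```python
import urllib.request

PRIMARY_HEALTH = "https://primary.example.com/healthz"   # hypothetical endpoints
STANDBY_HEALTH = "https://standby.example.com/healthz"

def healthy(url: str, timeout_s: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_serving_region() -> str:
    """Continue serving from the unaffected region during a regional outage."""
    return "primary-region" if healthy(PRIMARY_HEALTH) else "standby-region"
```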
Practice chaos engineering
Consider the use of chaos engineering in your test practices. Introduce actual failures into different components of production systems under load in a safe environment. This approach helps to verify that your service handles failures correctly at each level, so that a localized failure doesn't have an overall system impact.
Failures you inject into the system can include crashing tasks, errors and timeouts on RPCs, or reductions in resource availability. Use random fault injection to test intermittent failures (flapping) in service dependencies. These behaviors are hard to detect and mitigate in production.
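A fault-injection layer for RPC clients might look like the following sketch. The failure rates, injected latency, and exception type are illustrative assumptions, not the API of a specific chaos-engineering tool.

```python
import random
import time

class InjectedFault(Exception):
    """Raised when a fault is deliberately injected."""

def with_fault_injection(rpc_call, error_rate=0.01, timeout_rate=0.01,
                         injected_latency_s=5.0):
    """Wrap an RPC call so it randomly fails or stalls, to test how callers
    handle flapping dependencies."""
    def wrapped(*args, **kwargs):
        roll = random.random()
        if roll < error_rate:
            raise InjectedFault("Injected RPC error")
        if roll < error_rate + timeout_rate:
            time.sleep(injected_latency_s)      # Simulate a slow or hung backend.
            raise InjectedFault("Injected RPC timeout")
        return rpc_call(*args, **kwargs)
    return wrapped
```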
Chaos engineering ensures that the fallout from such experiments is minimized and contained. Treat such tests as practice for actual outages and use all of the information collected to improve your outage response.
What's next
- Build efficient alerts (next document in this series)
- Explore other categories in the Architecture Framework such as system design, operational excellence, and security, privacy, and compliance.