This pillar of the Google Cloud Architecture Framework provides a high-level overview of the design principles that are required to architect and operate reliable services on a cloud platform.
The Architecture Framework describes best practices, provides implementation recommendations, and explains some of the available products and services. The framework aims to help you design your Google Cloud deployment so that it best matches your business needs.
For reliability principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Reliability in the Architecture Framework.
To run a reliable service, your architecture must include the following:
- Measurable reliability goals, with prompt correction whenever you deviate from them
- Design patterns for the following:
- Scalability
- High availability
- Disaster recovery
- Automated change management
- Components that self-heal (remediate issues without manual intervention)
- Code that includes instrumentation for observability
- Hands-free operation: the service runs with minimal manual work and operator cognitive load, and failures are detected and mitigated rapidly
The entire engineering organization is responsible for the reliability of the service, including development, product management, operations, and site reliability engineering (SRE) teams. Teams must understand their application's reliability targets, risks, and error budgets, and be held accountable to these requirements. Conflicts between reliability and product feature development must be prioritized and escalated accordingly.
Core reliability principles
This section explores the core principles of a reliable service and sets the foundation for the more detailed documents that follow. As you read further about this topic, you'll learn that Google's approach to reliability is based on the following principles.
Reliability is your top feature
Engineering teams sometimes prioritize new product development. While users anticipate new and exciting updates to their favorite applications, product updates are only a short-term goal for them. Your customers always expect service reliability, even if they don't realize it. An expanded set of tools or flashy graphics in your application won't matter if users can't access your service or if it performs poorly; unreliable performance quickly makes those new features irrelevant.
Reliability is defined by the user
In short, your service is reliable when your customers are happy. Users aren't always predictable, and you may overestimate what it takes to satisfy them.
By today's standard, a web page should load in about two seconds. Page abandonment is roughly 53% when load time is delayed by an additional second, and dramatically increases to 87% when load time is delayed by three seconds. However, striving for a site that delivers pages in a second is probably not the best investment. To determine the right level of service reliability for your customers, you need to measure the following:
- User-facing workload: Measure user experience. For example, measure the success ratio of user requests, not just server metrics like CPU usage.
- Batch and streaming workloads: Measure key performance indicators (KPIs) for data throughput, such as rows scanned per time window. This approach is more informative than a server metric like disk usage. Throughput KPIs help ensure that user-requested processing finishes on time.
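For example, the following sketch shows one way to compute a user-facing availability SLI as a request success ratio. The function name and counts are illustrative assumptions, not part of the framework.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of user requests that succeeded (an availability SLI).

    Measuring this ratio at the point closest to the user (for example, from
    load-balancer logs) reflects user experience better than a server metric
    such as CPU usage.
    """
    if total_requests == 0:
        return 1.0  # No traffic: treat the SLI as met.
    return successful_requests / total_requests


# Example: 999,234 successful responses out of 1,000,000 requests.
sli = availability_sli(999_234, 1_000_000)
print(f"Availability SLI: {sli:.4%}")  # 99.9234%
```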
100% reliability is the wrong target
This principle is an extension of the previous one. Your systems are reliable enough when users are happy. Typically, users don't need 100% reliability to be happy. Thus, define service level objectives (SLOs) that set the reliability threshold to the percentage needed to make users happy, and then use error budgets to manage the appropriate rate of change.
Apply the design and operational principles in this framework to a product only if the SLO for that product or application justifies the cost.
Reliability and rapid innovation are complementary
Use error budgets to achieve a balance between system stability and developer agility. The following guidance helps you determine when to focus more on stability or on development:
- When the error budget is diminished, slow down and focus on reliability features.
- When an adequate error budget is available, you can innovate rapidly and improve the product or add product features.
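As an illustration of this guidance, the following sketch computes how much of a request-based error budget remains. The SLO target, request counts, and the 20% threshold are hypothetical values chosen for the example, not recommendations from the framework.

```python
def error_budget_report(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for a request-based SLO.

    slo_target is the fraction of requests that must succeed (for example, 0.999).
    The error budget is the allowed fraction of failures (1 - slo_target).
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures:
        budget_remaining = 1 - (failed_requests / allowed_failures)
    else:
        budget_remaining = 0.0  # A 100% SLO leaves no error budget at all.
    return {
        "allowed_failures": allowed_failures,
        "failed_requests": failed_requests,
        "budget_remaining": budget_remaining,  # Negative means the SLO is violated.
    }


# Example: a 99.9% SLO over 1,000,000 requests allows 1,000 failed requests.
report = error_budget_report(0.999, 1_000_000, 400)
if report["budget_remaining"] < 0.2:
    print("Low error budget: slow down releases and focus on reliability.")
else:
    print("Healthy error budget: safe to keep shipping features.")
```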
Design and operational principles
The remaining documents in the reliability pillar of the Architecture Framework provide design and operational principles that help you maximize system reliability. The following sections provide a summary of the design and operational principles that you'll find in each document in this series.
Establish your reliability goals
Remember, user happiness defines reliability and your reliability goals are represented by the SLOs you set. When setting your SLOs, consider the following:
- Choose appropriate service level indicators (SLIs).
- Set SLOs based on the user experience.
- Iteratively improve SLOs.
- Use strict internal SLOs.
- Use error budgets to manage development velocity.
For more information, see Components of service level objectives.
Build observability into your infrastructure and applications
Instrument your code to maximize observability. For more information, see Build observability into your infrastructure and applications.
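As one illustrative approach, the following sketch instruments a hypothetical request handler with the Prometheus Python client; the metric names and handler are invented for the example, and any metrics library (such as OpenTelemetry or Cloud Monitoring) can serve the same purpose.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metric names are hypothetical; export whatever reflects your SLIs.
REQUESTS = Counter("checkout_requests_total", "Checkout requests", ["status"])
LATENCY = Histogram("checkout_latency_seconds", "Checkout request latency")


def handle_checkout():
    """Hypothetical request handler that records its own outcome and latency."""
    start = time.monotonic()
    try:
        time.sleep(random.uniform(0.01, 0.1))  # Stand-in for real work.
        REQUESTS.labels(status="ok").inc()
    except Exception:
        REQUESTS.labels(status="error").inc()
        raise
    finally:
        LATENCY.observe(time.monotonic() - start)


if __name__ == "__main__":
    start_http_server(8000)  # Expose /metrics for scraping.
    while True:
        handle_checkout()
```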
Design for scale and high availability
When it comes to scale and high availability (HA), consider the following principles:
- Create redundancy for HA
- Replicate data across regions for disaster recovery (DR)
- Design a multi-region architecture for resilience to regional outages
- Degrade service levels gracefully when overloaded
- Fail safe in a way that preserves system functionality
- Design API calls and operational commands to be retryable (see the retry sketch after this list)
- Consider dependencies:
- Identify and manage system dependencies
- Minimize critical dependencies
- Ensure every change can be rolled back
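To illustrate the retryable-calls principle, here is a minimal sketch of a retry wrapper with exponential backoff and jitter. The function and parameter names are hypothetical, and the sketch assumes the wrapped operation is idempotent.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      retryable=(ConnectionError, TimeoutError)):
    """Retry an idempotent operation with exponential backoff and jitter.

    Only safe for calls designed to be retryable; non-idempotent operations
    need request IDs or another form of deduplication first.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # Jitter avoids synchronized retries.


# Usage with a hypothetical client function:
# result = call_with_retries(lambda: client.get_order("order-123"))
```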
Additionally, the following activities help the reliability of your service:
- Eliminate scalability bottlenecks
- Prevent and mitigate traffic spikes (see the load-shedding sketch after this list)
- Sanitize and validate inputs
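As an illustrative sketch of mitigating traffic spikes, the following token-bucket limiter sheds excess load instead of letting a spike overload the backend. The rates, class name, and handler are hypothetical.

```python
import time


class TokenBucket:
    """Token-bucket limiter: absorb short bursts up to `capacity`, then shed
    excess load so that a traffic spike can't overload the backend."""

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s          # Sustained requests per second.
        self.capacity = capacity        # Maximum burst size.
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                    # Caller should reject, for example with HTTP 429.


limiter = TokenBucket(rate_per_s=100, capacity=200)


def handle_request(request):
    if not limiter.allow():
        return "429 Too Many Requests"  # Shed load; well-behaved clients back off and retry.
    return "200 OK"                     # Stand-in for real request processing.
```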
For more information, see Design for scale and high availability.
Create reliable tools and operational processes
Build reliability into tools and operations processes by doing the following:
- Choose logical, self-defining names for applications and services
- Use canary testing to implement progressive rollouts of changes
- Time your promotions and launches so that they spread out traffic and reduce system overload
- Develop programmatic build, test, and deployment processes
- Defend against human-caused incidents, intentional or not
- Develop, test, and document failure response activities
- Develop and test disaster recovery steps on a regular basis
- Chaos engineering: Routinely inject faults into the system to test your service's fault tolerance and resilience
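As an illustrative sketch of chaos-style fault injection (intended for a test or staging environment, not production), the following wrapper randomly adds failures and latency to calls to a downstream dependency so you can observe how the service degrades. The class and client names are hypothetical.

```python
import random
import time


class FaultInjectingClient:
    """Test-only wrapper that randomly injects failures and latency into calls
    to a downstream dependency."""

    def __init__(self, wrapped, error_rate=0.1, extra_latency_s=0.5):
        self._wrapped = wrapped
        self._error_rate = error_rate
        self._extra_latency_s = extra_latency_s

    def call(self, *args, **kwargs):
        if random.random() < self._error_rate:
            raise TimeoutError("Injected fault: simulated dependency timeout")
        time.sleep(self._extra_latency_s)  # Injected latency.
        return self._wrapped.call(*args, **kwargs)


# In a staging experiment, wrap the real client and verify that SLOs still hold:
# flaky_backend = FaultInjectingClient(real_backend, error_rate=0.2)
```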
For more information, see Create reliable operational processes and tools.
Build efficient alerts
When creating your alerts, we recommend that you do the following:
- Optimize alerts for appropriate delays
- Alert on symptoms, not causes
- Alert on outliers, not averages
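To see why outliers matter more than averages, the following sketch compares the mean and the 99th percentile of a hypothetical latency sample; alerting on the percentile catches the slow tail that the mean hides. The values are invented for illustration.

```python
import statistics


def percentile(values, pct):
    """Return the pct-th percentile (0-100) of a list of samples."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]


# 100 requests: most are fast, a few are very slow.
latencies_ms = [80] * 95 + [2_000] * 5

print(f"mean = {statistics.mean(latencies_ms):.0f} ms")  # ~176 ms looks fine
print(f"p99  = {percentile(latencies_ms, 99):.0f} ms")   # 2000 ms reveals the problem
```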
For more information, see Build efficient alerts in the Architecture Framework reliability pillar.
Build a collaborative incident management process
Incident response and management (IRM) is essential for service recovery and minimizing damage. Effective IRM includes:
- Ownership: Assign clear service owners.
- Well-tuned alerts: Improve incident response (IR) and reduce time to detect (TTD) with carefully designed alerts.
- IRM plans and training: Reduce time to mitigate (TTM) with comprehensive plans, documentation, and training.
- Dashboards: Design dashboard layouts and content so that issues surface quickly, which helps minimize TTM.
- Documentation: Create and maintain clear, concise content for all aspects of service support, including diagnostic procedures and mitigation steps for outage scenarios.
- Blameless culture:
- Cultivate a blameless environment in your organization.
- Establish a postmortem process that focuses on what, not who.
- Learn from your outages by investigating properly and identifying areas to improve and prevent recurrences.
For more information, see Build a collaborative incident management process in the Architecture Framework reliability pillar.
What's next
- Learn about Components of SLOs.
- Explore recommendations in other pillars of the Architecture Framework.
- For more reference architectures, diagrams, and best practices, explore the Cloud Architecture Center.