Google Cloud Architecture Framework: Reliability

This section of the Google Cloud Architecture Framework shows you how to architect and operate reliable services on a cloud platform. You also learn about some of the Google Cloud products and features that support reliability.

The Architecture Framework describes best practices, provides implementation recommendations, and explains some of the available products and services. The framework aims to help you design your Google Cloud deployment so that it best matches your business needs.

To run a reliable service, your architecture must include the following:

  • Measurable reliability goals, with deviations that you promptly correct.
  • Design patterns for scalability, high availability, disaster recovery, and automated change management.
  • Components that self-heal where possible, and code that includes instrumentation for observability.
  • Operational procedures that run the service with minimal manual work and cognitive load on operators, and that let you rapidly detect and mitigate failures.

Reliability is the responsibility of everyone in engineering, including the development, product management, operations, and site reliability engineering (SRE) teams. Everyone must be accountable for their application's reliability and understand its reliability targets, risks, and error budgets. Teams should be able to prioritize work appropriately and escalate priority conflicts between reliability and product feature development.
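
To make the idea of an error budget concrete, the following is a minimal sketch of the usual arithmetic, assuming a request-based reliability target. An error budget is commonly taken to be the share of events that may fail while the target is still met, that is, 1 minus the target. The function names, the 99.9% target, and the request counts are illustrative assumptions, not values from the framework.

```python
def allowed_failures(slo_target: float, total_events: int) -> float:
    """Return how many events may fail in the window without missing the target.

    slo_target is a fraction strictly between 0 and 1 (for example, 0.999).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events < 0:
        raise ValueError("total_events must not be negative")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the target has been missed.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    budget = allowed_failures(slo_target, total_events)
    return 1.0 - failed_events / budget


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target over 1,000,000 requests.
    print(allowed_failures(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

A team could use a value like the one returned by budget_remaining to decide when to favor reliability work over feature work, for example by slowing releases as the remaining budget approaches zero.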

In the reliability section of the Architecture Framework, you learn to do the following: