Google Cloud Architecture Framework: Operational excellence

Last reviewed 2024-10-31 UTC

The operational excellence pillar in the Google Cloud Architecture Framework provides recommendations to operate workloads efficiently on Google Cloud. Operational excellence in the cloud involves designing, implementing, and managing cloud solutions that provide value, performance, security, and reliability. The recommendations in this pillar help you to continuously improve and adapt workloads as business and technical requirements in the cloud change.

The operational excellence pillar is relevant to the following audiences:

Managers and leaders: A framework to establish and maintain operational excellence in the cloud and to ensure that cloud investments deliver value and support business objectives.
Cloud operations teams: Guidance to manage incidents and problems, plan capacity, optimize performance, and manage change.
Site reliability engineers (SREs): Best practices that help you to achieve high levels of service reliability, including monitoring, incident response, and automation.
Cloud architects and engineers: Operational requirements and best practices for the design and implementation phases, to help ensure that solutions are designed for operational efficiency and scalability.
DevOps teams: Guidance about automation, CI/CD pipelines, and change management, to help enable faster and more reliable software delivery.

To achieve operational excellence, you should embrace automation, orchestration, and data-driven insights. Automation helps to eliminate toil. It also streamlines repetitive tasks and builds guardrails around them. Orchestration helps to coordinate complex processes. Data-driven insights enable evidence-based decision-making. By using these practices, you can optimize cloud operations, reduce costs, improve service availability, and enhance security.

Operational excellence in the cloud goes beyond technical proficiency in cloud operations. It includes a cultural shift that encourages continuous learning and experimentation. Teams must be empowered to innovate, iterate, and adopt a growth mindset. A culture of operational excellence fosters a collaborative environment where individuals are encouraged to share ideas, challenge assumptions, and drive improvement.

For operational excellence principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Operational excellence in the Architecture Framework.

Core principles

The recommendations in the operational excellence pillar of the Architecture Framework are mapped to the following core principles:

Ensure operational readiness and performance using CloudOps: Ensure that cloud solutions meet operational and performance requirements by defining service level objectives (SLOs) and by performing comprehensive monitoring, performance testing, and capacity planning.
Manage incidents and problems: Minimize the impact of cloud incidents and prevent recurrence through comprehensive observability, clear incident response procedures, thorough retrospectives, and preventive measures.
Manage and optimize cloud resources: Optimize and manage cloud resources through strategies like right-sizing and autoscaling, and by using effective cost monitoring tools.
Automate and manage change: Automate processes, streamline change management, and alleviate the burden of manual labor.
Continuously improve and innovate: Focus on ongoing enhancements and the introduction of new solutions to stay competitive.
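
As an illustration of the SLO-based approach named in the first principle, the following minimal Python sketch shows how an availability SLO translates into an error budget. It is not part of the Architecture Framework. The function name error_budget, its parameters, and the request counts are hypothetical, and the sketch assumes a simple request-based SLO measured over a single window.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based availability SLO.

    slo_target is the fraction of requests that must succeed (for example 0.999).
    All names and values here are illustrative assumptions.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if failed_requests < 0:
        raise ValueError("failed_requests must not be negative")

    # The error budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests

    return {
        "allowed_failures": allowed_failures,
        # Fraction of the budget already spent; values above 1.0 mean the SLO is missed.
        "consumed_fraction": failed_requests / allowed_failures,
        "budget_exhausted": failed_requests >= allowed_failures,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% availability target.
    report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(report)
```

With these hypothetical inputs, about a quarter of the budget is consumed. A team might use a figure like this to decide whether to continue releasing changes or to prioritize reliability work.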

Contributors

Authors:

Ryan Cox | Principal Architect
Hadrian Knotz | Enterprise Architect

Other contributors:

Daniel Lees | Cloud Security Architect
Filipe Gracio, PhD | Customer Engineer
Gary Harmson | Customer Engineer
Jose Andrade | Enterprise Infrastructure Customer Engineer
Kumar Dhanagopal | Cross-Product Solution Developer
Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
Radhika Kanakam | Senior Program Manager, Cloud GTM
Zach Seils | Networking Specialist
Wade Holmes | Global Solutions Director

Ensure operational readiness and performance using CloudOps