The Google Cloud Architecture Framework provides recommendations to help architects, developers, administrators, and other cloud practitioners design and operate a cloud topology that's secure, efficient, resilient, high-performing, and cost-effective. The Google Cloud Architecture Framework is our version of a well-architected framework.
A cross-functional team of experts at Google validates the recommendations in the Architecture Framework. The team curates the Architecture Framework to reflect the expanding capabilities of Google Cloud, industry best practices, community knowledge, and feedback from you. For a summary of the significant changes to the Architecture Framework, see What's new.
The Architecture Framework is relevant to applications built for the cloud and for workloads migrated from on-premises to Google Cloud, hybrid cloud deployments, and multi-cloud environments.
Architecture Framework pillars and perspectives
The Google Cloud Architecture Framework is organized into five pillars, as shown in the following diagram. We also provide cross-pillar perspectives that focus on recommendations for selected domains, industries, and technologies like AI and machine learning (ML).
Pillars
- Operational excellence
- Efficiently deploy, operate, monitor, and manage your cloud workloads.
- Security, privacy, and compliance
- Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.
- Reliability
- Design and operate resilient and highly available workloads in the cloud.
- Cost optimization
- Maximize the business value of your investment in Google Cloud.
- Performance optimization
- Design and tune your cloud resources for optimal performance.
Perspectives
- AI and ML
- A cross-pillar view of recommendations that are specific to AI and ML workloads.
Core principles
Before you explore the recommendations in each pillar of the Architecture Framework, review the following core principles:
Design for change
No system is static. The needs of its users, the goals of the team that builds the system, and the system itself are constantly changing. With the need for change in mind, build a development and production process that enables teams to regularly deliver small changes and get fast feedback on those changes. Consistently demonstrating the ability to deploy changes helps to build trust with stakeholders, including the teams responsible for the system, and the users of the system. Using DORA's software delivery metrics can help your team monitor the speed, ease, and safety of making changes to the system.
Document your architecture
When you start to move your workloads to the cloud or build your applications, lack of documentation about the system can be a major obstacle. Documentation is especially important for correctly visualizing the architecture of your current deployments.
Quality documentation isn't a matter of producing a specific amount of documentation, but of how clear the content is, how useful it is, and how well it's maintained as the system changes.
A properly documented cloud architecture establishes a common language and standards, which enable cross-functional teams to communicate and collaborate effectively. The documentation also provides the information that's necessary to identify and guide future design decisions. Documentation should be written with your use cases in mind, to provide context for the design decisions.
Over time, your design decisions will evolve and change. The change history provides the context that your teams require to align initiatives, avoid duplication, and measure performance changes effectively over time. Change logs are particularly valuable when you onboard a new cloud architect who is not yet familiar with your current design, strategy, or history.
Analysis by DORA has found a clear link between documentation quality and organizational performance, that is, the organization's ability to meet its performance and profitability goals.
Simplify your design and use fully managed services
Simplicity is crucial for design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.
If you're already running your workloads in production, test with managed services to see how they might help to reduce operational complexities. If you're developing new workloads, then start simple, establish a minimum viable product (MVP), and resist the urge to over-engineer. You can identify exceptional use cases, iterate, and improve your systems incrementally over time.
Decouple your architecture
Research from DORA shows that architecture is an important predictor for achieving continuous delivery. Decoupling is a technique for separating an application and its services into smaller components that can operate independently. For example, you might separate a monolithic application stack into individual service components. In a loosely coupled architecture, each component can run its functions independently of changes in the components that it depends on.
A decoupled architecture gives you increased flexibility to do the following:
- Apply independent upgrades.
- Enforce specific security controls.
- Establish reliability goals for each subsystem.
- Monitor health.
- Granularly control performance and cost parameters.
You can start the decoupling process early in your design phase or incorporate it as part of your system upgrades as you scale.
Use a stateless architecture
A stateless architecture can increase both the reliability and scalability of your applications.
Stateful applications rely on local dependencies, such as locally cached data, to perform tasks. They often require additional mechanisms to capture progress and restart gracefully. Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimal boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.
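The difference is easiest to see in code. The following minimal sketch contrasts a stateful handler that keeps session data in process memory with a stateless one that externalizes it to a shared store. The Redis endpoint is an assumed placeholder; any shared storage service would serve the same role.

```python
import redis  # shared store for illustration; any shared storage works

# Stateful anti-pattern: session data lives in process memory, so it's lost
# on a hard restart and invisible to other instances behind a load balancer.
local_sessions = {}

def stateful_handler(session_id: str, item: str) -> list:
    cart = local_sessions.setdefault(session_id, [])
    cart.append(item)
    return cart

# Stateless pattern: each request reads and writes state in a shared store,
# so any instance can serve any request and restarts are safe.
shared_store = redis.Redis(host="cache.internal", port=6379)  # assumed endpoint

def stateless_handler(session_id: str, item: str) -> list:
    shared_store.rpush(f"cart:{session_id}", item)
    return shared_store.lrange(f"cart:{session_id}", 0, -1)
```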
Google Cloud Architecture Framework: Operational excellence
The operational excellence pillar in the Google Cloud Architecture Framework provides recommendations to operate workloads efficiently on Google Cloud. Operational excellence in the cloud involves designing, implementing, and managing cloud solutions that provide value, performance, security, and reliability. The recommendations in this pillar help you to continuously improve and adapt workloads to meet the dynamic and ever-evolving needs in the cloud.
The operational excellence pillar is relevant to the following audiences:
- Managers and leaders: A framework to establish and maintain operational excellence in the cloud and to ensure that cloud investments deliver value and support business objectives.
- Cloud operations teams: Guidance to manage incidents and problems, plan capacity, optimize performance, and manage change.
- Site reliability engineers (SREs): Best practices that help you to achieve high levels of service reliability, including monitoring, incident response, and automation.
- Cloud architects and engineers: Operational requirements and best practices for the design and implementation phases, to help ensure that solutions are designed for operational efficiency and scalability.
- DevOps teams: Guidance about automation, CI/CD pipelines, and change management, to help enable faster and more reliable software delivery.
To achieve operational excellence, you should embrace automation, orchestration, and data-driven insights. Automation helps to eliminate toil. It also streamlines and builds guardrails around repetitive tasks. Orchestration helps to coordinate complex processes. Data-driven insights enable evidence-based decision-making. By using these practices, you can optimize cloud operations, reduce costs, improve service availability, and enhance security.
Operational excellence in the cloud goes beyond technical proficiency in cloud operations. It includes a cultural shift that encourages continuous learning and experimentation. Teams must be empowered to innovate, iterate, and adopt a growth mindset. A culture of operational excellence fosters a collaborative environment where individuals are encouraged to share ideas, challenge assumptions, and drive improvement.
For operational excellence principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Operational excellence in the Architecture Framework.
Core principles
The recommendations in the operational excellence pillar of the Architecture Framework are mapped to the following core principles:
- Ensure operational readiness and performance using CloudOps: Ensure that cloud solutions meet operational and performance requirements by defining service level objectives (SLOs) and by performing comprehensive monitoring, performance testing, and capacity planning.
- Manage incidents and problems: Minimize the impact of cloud incidents and prevent recurrence through comprehensive observability, clear incident response procedures, thorough retrospectives, and preventive measures.
- Manage and optimize cloud resources: Optimize and manage cloud resources through strategies like right-sizing, autoscaling, and by using effective cost monitoring tools.
- Automate and manage change: Automate processes, streamline change management, and alleviate the burden of manual labor.
- Continuously improve and innovate: Focus on ongoing enhancements and the introduction of new solutions to stay competitive.
Contributors
Authors:
- Ryan Cox | Principal Architect
- Hadrian Knotz | Enterprise Architect
Other contributors:
- Daniel Lees | Cloud Security Architect
- Filipe Gracio, PhD | Customer Engineer
- Gary Harmson | Customer Engineer
- Jose Andrade | Enterprise Infrastructure Customer Engineer
- Kumar Dhanagopal | Cross-Product Solution Developer
- Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
- Radhika Kanakam | Senior Program Manager, Cloud GTM
- Zach Seils | Networking Specialist
- Wade Holmes | Global Solutions Director
Ensure operational readiness and performance using CloudOps
This principle in the operational excellence pillar of the Google Cloud Architecture Framework helps you to ensure operational readiness and performance of your cloud workloads. It emphasizes establishing clear expectations and commitments for service performance, implementing robust monitoring and alerting, conducting performance testing, and proactively planning for capacity needs.
Principle overview
Different organizations might interpret operational readiness differently. Operational readiness is how your organization prepares to successfully operate workloads on Google Cloud. Preparing to operate a complex, multilayered cloud workload requires careful planning for both go-live and day-2 operations. These operations are often called CloudOps.
Focus areas of operational readiness
Operational readiness consists of four focus areas. Each focus area consists of a set of activities and components that are necessary to prepare to operate a complex application or environment in Google Cloud. The following table lists the components and activities of each focus area:
| Focus area of operational readiness | Activities and components |
|---|---|
| Workforce | |
| Processes | |
| Tooling | Tools that are required to support CloudOps processes. |
| Governance | |
Recommendations
To ensure operational readiness and performance by using CloudOps, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Define SLOs and SLAs
A core responsibility of the cloud operations team is to define service level objectives (SLOs) and service level agreements (SLAs) for all of the critical workloads. This recommendation is relevant to the governance focus area of operational readiness.
SLOs must be specific, measurable, achievable, relevant, and time-bound (SMART), and they must reflect the level of service and performance that you want.
- Specific: Clearly articulates the required level of service and performance.
- Measurable: Quantifiable and trackable.
- Achievable: Attainable within the limits of your organization's capabilities and resources.
- Relevant: Aligned with business goals and priorities.
- Time-bound: Has a defined timeframe for measurement and evaluation.
For example, an SLO for a web application might be "99.9% availability" or "average response time less than 200 ms." Such SLOs clearly define the required level of service and performance for the web application, and the SLOs can be measured and tracked over time.
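To make the example concrete, the following minimal sketch computes the availability SLI and error-budget consumption for such an SLO. The request counts are illustrative placeholders; in practice, these values would come from your monitoring system, such as Cloud Monitoring time series.

```python
# Availability SLI and error-budget math for a "99.9% availability" SLO.
SLO_TARGET = 0.999
total_requests = 1_250_000  # requests served in the SLO window (illustrative)
failed_requests = 980       # failed responses in the same window (illustrative)

sli = (total_requests - failed_requests) / total_requests
error_budget = 1 - SLO_TARGET                           # 0.1% may fail
budget_consumed = (failed_requests / total_requests) / error_budget

print(f"Availability SLI: {sli:.4%}")                   # 99.9216%
print(f"Error budget consumed: {budget_consumed:.1%}")  # 78.4%
```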
SLAs outline the commitments to customers regarding service availability, performance, and support. SLAs must include specific details about the services that are provided, the level of service that can be expected, the responsibilities of both the service provider and the customer, and any penalties or remedies for noncompliance. SLAs serve as a contractual agreement between the two parties, ensuring that both have a clear understanding of the expectations and obligations that are associated with the cloud service.
Google Cloud provides tools like Cloud Monitoring and service level indicators (SLIs) to help you define and track SLOs. Cloud Monitoring provides comprehensive monitoring and observability capabilities that enable your organization to collect and analyze metrics that are related to the availability, performance, and latency of cloud-based applications and services. SLIs are specific metrics that you can use to measure and track SLOs over time. By using these tools, you can effectively monitor and manage cloud services, and ensure that they meet the SLOs and SLAs.
Clearly defining and communicating SLOs and SLAs for all of your critical cloud services helps to ensure reliability and performance of your deployed applications and services.
Implement comprehensive observability
To get real-time visibility into the health and performance of your cloud environment, we recommend that you use a combination of Google Cloud Observability tools and third-party solutions. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
Implementing a combination of observability solutions provides you with a comprehensive observability strategy that covers various aspects of your cloud infrastructure and applications. Google Cloud Observability is a unified platform for collecting, analyzing, and visualizing metrics, logs, and traces from various Google Cloud services, applications, and external sources. By using Cloud Monitoring, you can gain insights into resource utilization, performance characteristics, and overall health of your resources.
To ensure comprehensive monitoring, monitor important metrics that align with system health indicators such as CPU utilization, memory usage, network traffic, disk I/O, and application response times. You must also consider business-specific metrics. By tracking these metrics, you can identify potential bottlenecks, performance issues, and resource constraints. Additionally, you can set up alerts to notify relevant teams proactively about potential issues or anomalies.
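As a simplified illustration of that alerting logic, the following sketch compares recent metric samples against thresholds and produces alert messages. The metric names and threshold values are assumptions for illustration; in production, you would define alerting policies in Cloud Monitoring rather than hand-roll this check.

```python
# Illustrative thresholds for the system health indicators named above.
THRESHOLDS = {
    "cpu_utilization": 0.85,      # fraction of capacity
    "memory_usage": 0.90,         # fraction of capacity
    "app_response_time_ms": 500,  # milliseconds
}

def evaluate_alerts(samples: dict[str, list[float]]) -> list[str]:
    """Return alert messages for metrics whose average breaches its threshold."""
    alerts = []
    for metric, threshold in THRESHOLDS.items():
        values = samples.get(metric, [])
        if values and sum(values) / len(values) > threshold:
            avg = sum(values) / len(values)
            alerts.append(f"{metric} averaged {avg:.2f}, above {threshold}")
    return alerts

print(evaluate_alerts({"cpu_utilization": [0.91, 0.88, 0.93]}))
```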
To enhance your monitoring capabilities further, you can integrate third-party solutions with Google Cloud Observability. These solutions can provide additional functionality, such as advanced analytics, machine learning-powered anomaly detection, and incident management capabilities. This combination of Google Cloud Observability tools and third-party solutions lets you create a robust and customizable monitoring ecosystem that's tailored to your specific needs. By using this combination approach, you can proactively identify and address issues, optimize resource utilization, and ensure the overall reliability and availability of your cloud applications and services.
Implement performance and load testing
Performing regular performance testing helps you to ensure that your cloud-based applications and infrastructure can handle peak loads and maintain optimal performance. Load testing simulates realistic traffic patterns. Stress testing pushes the system to its limits to identify potential bottlenecks and performance limitations. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
Tools like Cloud Load Balancing and load testing services can help you to simulate real-world traffic patterns and stress-test your applications. These tools provide valuable insights into how your system behaves under various load conditions, and can help you to identify areas that require optimization.
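To get a feel for the mechanics, the following minimal sketch drives concurrent requests at an endpoint and reports latency percentiles. The target URL and load shape are placeholders; a real test would use a dedicated load-testing service and realistic traffic patterns.

```python
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://example.com/healthz"  # assumed endpoint
CONCURRENCY = 20
TOTAL_REQUESTS = 200

def timed_request(_) -> float:
    """Issue one request and return its latency in milliseconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(TARGET_URL, timeout=10) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_request, range(TOTAL_REQUESTS)))

print(f"median: {statistics.median(latencies):.0f} ms")
print(f"p95:    {latencies[int(len(latencies) * 0.95)]:.0f} ms")
```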
Based on the results of performance testing, you can make decisions to optimize your cloud infrastructure and applications for optimal performance and scalability. This optimization might involve adjusting resource allocation, tuning configurations, or implementing caching mechanisms.
For example, if you find that your application is experiencing slowdowns during periods of high traffic, you might need to increase the number of virtual machines or containers that are allocated to the application. Alternatively, you might need to adjust the configuration of your web server or database to improve performance.
By regularly conducting performance testing and implementing the necessary optimizations, you can ensure that your cloud-based applications and infrastructure always run at peak performance, and deliver a seamless and responsive experience for your users. Doing so can help you to maintain a competitive advantage and build trust with your customers.
Plan and manage capacity
Proactively planning for future capacity needs, whether organic or inorganic, helps you to ensure the smooth operation and scalability of your cloud-based systems. This recommendation is relevant to the processes focus area of operational readiness.
Planning for future capacity includes understanding and managing quotas for various resources like compute instances, storage, and API requests. By analyzing historical usage patterns, growth projections, and business requirements, you can accurately anticipate future capacity requirements. You can use tools like Cloud Monitoring and BigQuery to collect and analyze usage data, identify trends, and forecast future demand.
Historical usage patterns provide valuable insights into resource utilization over time. By examining metrics like CPU utilization, memory usage, and network traffic, you can identify periods of high demand and potential bottlenecks. Additionally, you can help to estimate future capacity needs by making growth projections based on factors like growth in the user base, new products and features, and marketing campaigns. When you assess capacity needs, you should also consider business requirements like SLAs and performance targets.
When you determine the resource sizing for a workload, consider factors that can affect utilization of resources. Seasonal variations like holiday shopping periods or end-of-quarter sales can lead to temporary spikes in demand. Planned events like product launches or marketing campaigns can also significantly increase traffic. To make sure that your primary and disaster recovery (DR) system can handle unexpected surges in demand, plan for capacity that can support graceful failover during disruptions like natural disasters and cyberattacks.
Autoscaling is an important strategy for dynamically adjusting your cloud resources based on workload fluctuations. By using autoscaling policies, you can automatically scale compute instances, storage, and other resources in response to changing demand. This ensures optimal performance during peak periods while minimizing costs when resource utilization is low. Autoscaling algorithms use metrics like CPU utilization, memory usage, and queue depth to determine when to scale resources.
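The following sketch shows the target-tracking calculation at the heart of many autoscalers; Kubernetes' Horizontal Pod Autoscaler applies essentially this formula. The replica bounds are illustrative.

```python
import math

def desired_replicas(current_replicas: int,
                     observed_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Scale replicas in proportion to how far the metric is from its target."""
    desired = math.ceil(current_replicas * observed_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))

# 4 instances at 90% average CPU with a 60% target -> scale out to 6.
print(desired_replicas(current_replicas=4, observed_metric=0.9,
                       target_metric=0.6))
```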
Continuously monitor and optimize
To manage and optimize cloud workloads, you must establish a process for continuously monitoring and analyzing performance metrics. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
To establish a process for continuous monitoring and analysis, you track, collect, and evaluate data that's related to various aspects of your cloud environment. By using this data, you can proactively identify areas for improvement, optimize resource utilization, and ensure that your cloud infrastructure consistently meets or exceeds your performance expectations.
An important aspect of performance monitoring is regularly reviewing logs and traces. Logs provide valuable insights into system events, errors, and warnings. Traces provide detailed information about the flow of requests through your application. By analyzing logs and traces, you can identify potential issues, identify the root causes of problems, and get a better understanding of how your applications behave under different conditions. Metrics like the round-trip time between services can help you to identify and understand bottlenecks that are in your workloads.
Further, you can use performance-tuning techniques to significantly enhance application response times and overall efficiency. The following are examples of techniques that you can use:
- Caching: Store frequently accessed data in memory to reduce the need for repeated database queries or API calls (see the sketch after this list).
- Database optimization: Use techniques like indexing and query optimization to improve the performance of database operations.
- Code profiling: Identify areas of your code that consume excessive resources or cause performance issues.
By applying these techniques, you can optimize your applications and ensure that they run efficiently in the cloud.
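As a minimal illustration of the caching technique, the following sketch memoizes an expensive lookup with Python's built-in functools.lru_cache. The fetch_from_database function is a placeholder for a real query.

```python
import time
from functools import lru_cache

def fetch_from_database(product_id: str) -> dict:
    time.sleep(0.05)  # stands in for the latency of a real database query
    return {"id": product_id, "name": f"Product {product_id}"}

@lru_cache(maxsize=1024)
def get_product(product_id: str) -> dict:
    # Executed only on cache misses; repeat calls are served from memory.
    return fetch_from_database(product_id)

get_product("sku-42")  # miss: hits the database
get_product("sku-42")  # hit: served from the in-memory cache
print(get_product.cache_info())  # CacheInfo(hits=1, misses=1, ...)
```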
Manage incidents and problems
This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you manage incidents and problems related to your cloud workloads. It involves implementing comprehensive monitoring and observability, establishing clear incident response procedures, conducting thorough root cause analysis, and implementing preventive measures. Many of the topics that are discussed in this principle are covered in detail in the Reliability pillar.
Principle overview
Incident management and problem management are important components of a functional operations environment. How you respond to, categorize, and solve incidents of differing severity can significantly affect your operations. You must also proactively and continuously make adjustments to optimize reliability and performance. An efficient process for incident and problem management relies on the following foundational elements:
- Continuous monitoring: Identify and resolve issues quickly.
- Automation: Streamline tasks and improve efficiency.
- Orchestration: Coordinate and manage cloud resources effectively.
- Data-driven insights: Optimize cloud operations and make informed decisions.
These elements help you to build a resilient cloud environment that can handle a wide range of challenges and disruptions. These elements can also help to reduce the risk of costly incidents and downtime, and they can help you to achieve greater business agility and success. These foundational elements are spread across the four focus areas of operational readiness: workforce, processes, tooling, and governance.
Recommendations
To manage incidents and problems effectively, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Establish clear incident response procedures
Clear roles and responsibilities are essential to ensure effective and coordinated response to incidents. Additionally, clear communication protocols and escalation paths help to ensure that information is shared promptly and effectively during an incident. This recommendation is relevant to these focus areas of operational readiness: workforce, processes, and tooling.
To establish incident response procedures, you need to define the roles and expectations of each team member, such as incident commanders, investigators, communicators, and technical experts. Establishing communication and escalation paths includes identifying important contacts, setting up communication channels, and defining the process for escalating incidents to higher levels of management when necessary. Regular training and preparation helps to ensure that teams are equipped with the knowledge and skills to respond to incidents effectively.
By documenting incident response procedures in a runbook or playbook, you can provide a standardized reference guide for teams to follow during an incident. The runbook must outline the steps to be taken at each stage of the incident response process, including communication, triage, investigation, and resolution. It must also include information about relevant tools and resources and contact information for important personnel. You must regularly review and update the runbook to ensure that it remains current and effective.
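One lightweight way to make escalation paths actionable is to encode them as data that on-call tooling can query. The following sketch is purely illustrative; the severity levels, roles, channels, and acknowledgment deadlines are placeholders to adapt to your organization.

```python
# Hypothetical escalation policy, keyed by incident severity.
ESCALATION_POLICY = {
    "SEV1": {"notify": ["incident-commander", "oncall-sre", "exec-sponsor"],
             "channel": "#incident-war-room", "ack_deadline_minutes": 5},
    "SEV2": {"notify": ["oncall-sre", "service-owner"],
             "channel": "#oncall-alerts", "ack_deadline_minutes": 15},
    "SEV3": {"notify": ["service-owner"],
             "channel": "#team-queue", "ack_deadline_minutes": 60},
}

def escalate(severity: str) -> dict:
    """Return who to page, where to coordinate, and how fast to acknowledge."""
    return ESCALATION_POLICY.get(severity, ESCALATION_POLICY["SEV3"])

print(escalate("SEV1"))
```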
Centralize incident management
For effective tracking and management throughout the incident lifecycle, consider using a centralized incident management system. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
A centralized incident management system provides the following advantages:
- Improved visibility: By consolidating all incident-related data in a single location, you eliminate the need for teams to search in various channels or systems for context. This approach saves time and reduces confusion, and it gives stakeholders a comprehensive view of the incident, including its status, impact, and progress.
- Better coordination and collaboration: A centralized system provides a unified platform for communication and task management. It promotes seamless collaboration between the different departments and functions that are involved in incident response. This approach ensures that everyone has access to up-to-date information and it reduces the risk of miscommunication and misalignment.
- Enhanced accountability and ownership: A centralized incident management system enables your organization to allocate tasks to specific individuals or teams and it ensures that responsibilities are clearly defined and tracked. This approach promotes accountability and encourages proactive problem-solving because team members can easily monitor their progress and contributions.
A centralized incident management system must offer robust features for incident tracking, task assignment, and communication management. These features let you customize workflows, set priorities, and integrate with other systems, such as monitoring tools and ticketing systems.
By implementing a centralized incident management system, you can optimize your organization's incident response processes, improve collaboration, and enhance visibility. Doing so leads to faster incident resolution times, reduced downtime, and improved customer satisfaction. It also helps foster a culture of continuous improvement because you can learn from past incidents and identify areas for improvement.
Conduct thorough post-incident reviews
After an incident occurs, you must conduct a detailed post-incident review (PIR), which is also known as a postmortem, to identify the root cause, contributing factors, and lessons learned. This thorough review helps you to prevent similar incidents in the future. This recommendation is relevant to these focus areas of operational readiness: processes and governance.
The PIR process must involve a multidisciplinary team that has expertise in various aspects of the incident. The team must gather all of the relevant information through interviews, documentation review, and site inspections. A timeline of events must be created to establish the sequence of actions that led up to the incident.
After the team gathers the required information, they must conduct a root cause analysis to determine the factors that led to the incident. This analysis must identify both the immediate cause and the systemic issues that contributed to the incident.
Along with identifying the root cause, the PIR team must identify any other contributing factors that might have caused the incident. These factors could include human error, equipment failure, or organizational factors like communication breakdowns and lack of training.
The PIR report must document the findings of the investigation, including the timeline of events, root cause analysis, and recommended actions. The report is a valuable resource for implementing corrective actions and preventing recurrence. The report must be shared with all of the relevant stakeholders and it must be used to develop safety training and procedures.
To ensure a successful PIR process, your organization must foster a blameless culture that focuses on learning and improvement rather than assigning blame. This culture encourages individuals to report incidents without fear of retribution, and it lets you address systemic issues and make meaningful improvements.
By conducting thorough PIRs and implementing corrective measures based on the findings, you can significantly reduce the risk of similar incidents occurring in the future. This proactive approach to incident investigation and prevention helps to create a safer and more efficient work environment for everyone involved.
Maintain a knowledge base
A knowledge base of known issues, solutions, and troubleshooting guides is essential for incident management and resolution. Team members can use the knowledge base to quickly identify and address common problems. Implementing a knowledge base helps to reduce the need for escalation and it improves overall efficiency. This recommendation is relevant to these focus areas of operational readiness: workforce and processes.
A primary benefit of a knowledge base is that it lets teams learn from past experiences and avoid repeating mistakes. By capturing and sharing solutions to known issues, teams can build a collective understanding of how to resolve common problems and best practices for incident management. Use of a knowledge base saves time and effort, and helps to standardize processes and ensure consistency in incident resolution.
Along with helping to improve incident resolution times, a knowledge base promotes knowledge sharing and collaboration across teams. With a central repository of information, teams can easily access and contribute to the knowledge base, which promotes a culture of continuous learning and improvement. This culture encourages teams to share their expertise and experiences, leading to a more comprehensive and valuable knowledge base.
To create and manage a knowledge base effectively, use appropriate tools and technologies. Collaboration platforms like Google Workspace are well-suited for this purpose because they let you easily create, edit, and share documents collaboratively. These tools also support version control and change tracking, which ensures that the knowledge base remains up-to-date and accurate.
Make the knowledge base easily accessible to all relevant teams. You can achieve this by integrating the knowledge base with existing incident management systems or by providing a dedicated portal or intranet site. A knowledge base that's readily available lets teams quickly access the information that they need to resolve incidents efficiently. This availability helps to reduce downtime and minimize the impact on business operations.
Regularly review and update the knowledge base to ensure that it remains relevant and useful. Monitor incident reports, identify common issues and trends, and incorporate new solutions and troubleshooting guides into the knowledge base. An up-to-date knowledge base helps your teams resolve incidents faster and more effectively.
Automate incident response
Automation helps to streamline your incident response and remediation processes. It lets you address security breaches and system failures promptly and efficiently. By using Google Cloud products like Cloud Run functions or Cloud Run, you can automate various tasks that are typically manual and time-consuming. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
Automated incident response provides the following benefits:
- Reduction in incident detection and resolution times: Automated tools can continuously monitor systems and applications, detect suspicious or anomalous activities in real time, and notify stakeholders or respond without intervention. This automation lets you identify potential threats or issues before they escalate into major incidents. When an incident is detected, automated tools can trigger predefined remediation actions, such as isolating affected systems, quarantining malicious files, or rolling back changes to restore the system to a known good state.
- Reduced burden on security and operations teams: Automated incident response lets the security and operations teams focus on more strategic tasks. By automating routine and repetitive tasks, such as collecting diagnostic information or triggering alerts, your organization can free up personnel to handle more complex and critical incidents. This automation can lead to improved overall incident response effectiveness and efficiency.
- Enhanced consistency and accuracy of the remediation process: Automated tools can ensure that remediation actions are applied uniformly across all affected systems, minimizing the risk of human error or inconsistency. This standardization of the remediation process helps to minimize the impact of incidents on users and the business.
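The following hedged sketch shows the shape of such an automation: an event-driven Python function, deployable to Cloud Run functions, that receives a monitoring alert through Pub/Sub and triggers a predefined remediation. The alert payload fields and the quarantine_instance helper are hypothetical.

```python
import base64
import json

import functions_framework

@functions_framework.cloud_event
def remediate(cloud_event):
    # Pub/Sub delivers the alert JSON base64-encoded in the event payload.
    payload = json.loads(
        base64.b64decode(cloud_event.data["message"]["data"]))
    # "policy_name" and "resource_id" are hypothetical payload fields.
    if payload.get("policy_name") == "suspicious-activity":
        quarantine_instance(payload["resource_id"])

def quarantine_instance(resource_id: str) -> None:
    # Placeholder remediation: a real implementation might update network
    # tags or firewall rules to isolate the affected VM.
    print(f"Isolating {resource_id}")
```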
Manage and optimize cloud resources
This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you manage and optimize the resources that are used by your cloud workloads. It involves right-sizing resources based on actual usage and demand, using autoscaling for dynamic resource allocation, implementing cost optimization strategies, and regularly reviewing resource utilization and costs. Many of the topics that are discussed in this principle are covered in detail in the Cost optimization pillar.
Principle overview
Cloud resource management and optimization play a vital role in optimizing cloud spending, resource usage, and infrastructure efficiency. It includes various strategies and best practices aimed at maximizing the value and return from your cloud spending.
This pillar's focus on optimization extends beyond cost reduction. It emphasizes the following goals:
- Efficiency: Using automation and data analytics to achieve peak performance and cost savings.
- Performance: Scaling resources effortlessly to meet fluctuating demands and deliver optimal results.
- Scalability: Adapting infrastructure and processes to accommodate rapid growth and diverse workloads.
By focusing on these goals, you achieve a balance between cost and functionality. You can make informed decisions regarding resource provisioning, scaling, and migration. Additionally, you gain valuable insights into resource consumption patterns, which lets you proactively identify and address potential issues before they escalate.
Recommendations
To manage and optimize resources, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Right-size resources
Continuously monitoring resource utilization and adjusting resource allocation to match actual demand are essential for efficient cloud resource management. Over-provisioning resources can lead to unnecessary costs, and under-provisioning can cause performance bottlenecks that affect application performance and user experience. To achieve an optimal balance, you must adopt a proactive approach to right-sizing cloud resources. This recommendation is relevant to the governance focus area of operational readiness.
Cloud Monitoring and Recommender can help you to identify opportunities for right-sizing. Cloud Monitoring provides real-time visibility into resource utilization metrics. This visibility lets you track resource usage patterns and identify potential inefficiencies. Recommender analyzes resource utilization data to make intelligent recommendations for optimizing resource allocation. By using these tools, you can gain insights into resource usage and make informed decisions about right-sizing the resources.
In addition to Cloud Monitoring and Recommender, consider using custom metrics to trigger automated right-sizing actions. Custom metrics let you track specific resource utilization metrics that are relevant to your applications and workloads. You can also configure alerts to notify administrators when predefined thresholds are met. The administrators can then take necessary actions to adjust resource allocation. This proactive approach ensures that resources are scaled in a timely manner, which helps to optimize cloud costs and prevent performance issues.
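To illustrate the kind of rule such an automation might apply, the following sketch steps a VM up or down a ladder of machine types based on sustained CPU utilization. The thresholds and the ladder are illustrative, not Recommender's algorithm.

```python
MACHINE_LADDER = ["e2-small", "e2-medium", "e2-standard-2", "e2-standard-4"]

def rightsize(machine_type: str, avg_cpu_7d: float) -> str:
    """Suggest one step up or down the ladder based on 7-day average CPU."""
    i = MACHINE_LADDER.index(machine_type)
    if avg_cpu_7d < 0.20 and i > 0:
        return MACHINE_LADDER[i - 1]   # sustained low usage: downsize
    if avg_cpu_7d > 0.80 and i < len(MACHINE_LADDER) - 1:
        return MACHINE_LADDER[i + 1]   # sustained high usage: upsize
    return machine_type                # already right-sized

print(rightsize("e2-standard-2", avg_cpu_7d=0.12))  # -> e2-medium
```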
Use autoscaling
Autoscaling compute and other resources helps to ensure optimal performance and cost efficiency of your cloud-based applications. Autoscaling lets you dynamically adjust the capacity of your resources based on workload fluctuations, so that you have the resources that you need when you need them and you can avoid over-provisioning and unnecessary costs. This recommendation is relevant to the processes focus area of operational readiness.
To meet the diverse needs of different applications and workloads, Google Cloud offers various autoscaling options, including the following:
- Compute Engine managed instance groups (MIGs) are groups of VMs that are managed and scaled as a single entity. With MIGs, you can define autoscaling policies that specify the minimum and maximum number of VMs to maintain in the group, and the conditions that trigger autoscaling. For example, you can configure a policy to add VMs in a MIG when the CPU utilization reaches a certain threshold and to remove VMs when the utilization drops below a different threshold.
- Google Kubernetes Engine (GKE) autoscaling dynamically adjusts your cluster resources to match your application's needs. It offers the following tools:
  - Cluster Autoscaler adds or removes nodes based on Pod resource demands.
  - Horizontal Pod Autoscaler changes the number of Pod replicas based on CPU, memory, or custom metrics.
  - Vertical Pod Autoscaler fine-tunes Pod resource requests and limits based on usage patterns.
  - Node Auto-Provisioning automatically creates optimized node pools for your workloads.
  These tools work together to optimize resource utilization, ensure application performance, and simplify cluster management.
- Cloud Run is a serverless platform that lets you run code without having to manage infrastructure. Cloud Run offers built-in autoscaling, which automatically adjusts the number of instances based on the incoming traffic. When the volume of traffic increases, Cloud Run scales up the number of instances to handle the load. When traffic decreases, Cloud Run scales down the number of instances to reduce costs.
By using these autoscaling options, you can ensure that your cloud-based applications have the resources that they need to handle varying workloads, while avoiding overprovisioning and unnecessary costs. Using autoscaling can lead to improved performance, cost savings, and more efficient use of cloud resources.
Leverage cost optimization strategies
Optimizing cloud spending helps you to effectively manage your organization's IT budgets. This recommendation is relevant to the governance focus area of operational readiness.
Google Cloud offers several tools and techniques to help you optimize cloud costs. By using these tools and techniques, you can get the best value from your cloud spending. These tools and techniques help you to identify areas where costs can be reduced, such as identifying underutilized resources or recommending more cost-effective instance types. Google Cloud options to help optimize cloud costs include the following:
- Committed use discounts (CUDs) are discounts for committing to a certain level of usage over a period of time.
- Sustained use discounts in Compute Engine provide discounts for consistent usage of a service.
- Spot VMs provide access to unused VM capacity at a lower cost compared to regular VMs.
Pricing models might change over time, and new features might be introduced that offer better performance or lower cost compared to existing options. Therefore, you should regularly review pricing models and consider alternative features. By staying informed about the latest pricing models and features, you can make informed decisions about your cloud architecture to minimize costs.
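To see how these pricing options compare, the following sketch computes monthly VM costs under different discount assumptions. The hourly rate and discount percentages are placeholders, not published Google Cloud prices; actual rates vary by machine type, region, and commitment term, so check current pricing.

```python
ON_DEMAND_HOURLY = 0.10   # assumed on-demand rate, USD/hour (illustrative)
HOURS_PER_MONTH = 730

# Placeholder discount rates relative to on-demand pricing.
DISCOUNTS = {
    "on_demand": 0.00,
    "1yr_commitment": 0.37,
    "3yr_commitment": 0.55,
    "spot": 0.70,
}

for option, discount in DISCOUNTS.items():
    monthly = ON_DEMAND_HOURLY * (1 - discount) * HOURS_PER_MONTH
    print(f"{option:>16}: ${monthly:7.2f}/month per VM")
```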
Google Cloud's Cost Management tools, such as budgets and alerts, provide valuable insights into cloud spending. Budgets let you define spending limits, and alerts notify stakeholders when spending reaches defined thresholds of those limits. These tools help you track cloud spending and identify areas where costs can be reduced.
Track resource usage and costs
You can use tagging and labeling to track resource usage and costs. By assigning tags and labels to your cloud resources based on dimensions like project, department, or environment, you can categorize and organize the resources. This categorization lets you monitor and analyze spending patterns for specific resources and identify areas of high usage or potential cost savings. This recommendation is relevant to these focus areas of operational readiness: governance and tooling.
Tools like Cloud Billing and Cost Management help you to get a comprehensive understanding of your spending patterns. These tools provide detailed insights into your cloud usage and they let you identify trends, forecast costs, and make informed decisions. By analyzing historical data and current spending patterns, you can identify the focus areas for your cost-optimization efforts.
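The following sketch shows label-based cost aggregation over rows shaped like those in a Cloud Billing BigQuery export. The rows are inline samples for illustration; a real analysis would query the export table instead.

```python
from collections import defaultdict

# Sample rows in the spirit of a billing export (illustrative values).
billing_rows = [
    {"cost": 120.50, "labels": {"team": "checkout", "env": "prod"}},
    {"cost": 48.75,  "labels": {"team": "checkout", "env": "dev"}},
    {"cost": 310.00, "labels": {"team": "search",   "env": "prod"}},
]

def cost_by_label(rows: list[dict], key: str) -> dict:
    """Sum costs per value of the given label, bucketing unlabeled rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["labels"].get(key, "unlabeled")] += row["cost"]
    return dict(totals)

print(cost_by_label(billing_rows, "team"))
# {'checkout': 169.25, 'search': 310.0}
```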
Custom dashboards and reports help you to visualize cost data and gain deeper insights into spending trends. By customizing dashboards with relevant metrics and dimensions, you can monitor key performance indicators (KPIs) and track progress towards your cost optimization goals. Reports offer deeper analyses of cost data. Reports let you filter the data by specific time periods or resource types to understand the underlying factors that contribute to your cloud spending.
Regularly review and update your tags, labels, and cost analysis tools to ensure that you have the most up-to-date information on your cloud usage and costs. By staying informed and conducting cost postmortems or proactive cost reviews, you can promptly identify any unexpected increases in spending. Doing so lets you make proactive decisions to optimize cloud resources and control costs.
Establish cost allocation and budgeting
Accountability and transparency in cloud cost management are crucial for optimizing resource utilization and ensuring financial control. This recommendation is relevant to the governance focus area of operational readiness.
To ensure accountability and transparency, you need to have clear mechanisms for cost allocation and chargeback. By allocating costs to specific teams, projects, or individuals, your organization can ensure that each of these entities is responsible for its cloud usage. This practice fosters a sense of ownership and encourages responsible resource management. Additionally, chargeback mechanisms enable your organization to recover cloud costs from internal customers, align incentives with performance, and promote fiscal discipline.
Establishing budgets for different teams or projects is another essential aspect of cloud cost management. Budgets enable your organization to define spending limits and track actual expenses against those limits. This approach lets you make proactive decisions to prevent uncontrolled spending. By setting realistic and achievable budgets, you can ensure that cloud resources are used efficiently and aligned with business objectives. Regular monitoring of actual spending against budgets helps you to identify variances and address potential overruns promptly.
To monitor budgets, you can use tools like Cloud Billing budgets and alerts. These tools provide real-time insights into cloud spending and they notify stakeholders of potential overruns. By using these capabilities, you can track cloud costs and take corrective actions before significant deviations occur. This proactive approach helps to prevent financial surprises and ensures that cloud resources are used responsibly.
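The following sketch captures the threshold logic behind budget alerts. Cloud Billing budgets provide this behavior natively; the figures and thresholds here are illustrative.

```python
def budget_status(spend_to_date: float, budget: float,
                  alert_thresholds: tuple = (0.5, 0.9, 1.0)) -> list[str]:
    """Return the alert thresholds that the current spend has crossed."""
    used = spend_to_date / budget
    return [f"{int(t * 100)}% of budget reached"
            for t in alert_thresholds if used >= t]

print(budget_status(spend_to_date=9_400, budget=10_000))
# ['50% of budget reached', '90% of budget reached']
```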
Automate and manage change
This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you automate and manage change for your cloud workloads. It involves implementing infrastructure as code (IaC), establishing standard operating procedures, implementing a structured change management process, and using automation and orchestration.
Principle overview
Change management and automation play a crucial role in ensuring smooth and controlled transitions within cloud environments. For effective change management, you need to use strategies and best practices that minimize disruptions and ensure that changes are integrated seamlessly with existing systems.
Effective change management and automation include the following foundational elements:
- Change governance: Establish clear policies and procedures for change management, including approval processes and communication plans.
- Risk assessment: Identify potential risks associated with changes and mitigate them through risk management techniques.
- Testing and validation: Thoroughly test changes to ensure that they meet functional and performance requirements and mitigate potential regressions.
- Controlled deployment: Implement changes in a controlled manner, ensuring that users are transitioned smoothly to the new environment, with mechanisms to roll back seamlessly if needed.
These foundational elements help to minimize the impact of changes and ensure that changes have a positive effect on business operations. These elements are represented by the processes, tooling, and governance focus areas of operational readiness.
Recommendations
To automate and manage change, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Adopt IaC
Infrastructure as code (IaC) is a transformative approach for managing cloud infrastructure. You can define and manage cloud infrastructure declaratively by using tools like Terraform. IaC helps you achieve consistency, repeatability, and simplified change management. It also enables faster and more reliable deployments. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
The following are the main benefits of adopting the IaC approach for your cloud deployments:
- Human-readable resource configurations: With the IaC approach, you can declare your cloud infrastructure resources in a human-readable format, like JSON or YAML. Infrastructure administrators and operators can easily understand and modify the infrastructure and collaborate with others.
- Consistency and repeatability: IaC enables consistency and repeatability in your infrastructure deployments. You can ensure that your infrastructure is provisioned and configured the same way every time, regardless of who is performing the deployment. This approach helps to reduce errors and ensures that your infrastructure is always in a known state.
- Accountability and simplified troubleshooting: The IaC approach helps to improve accountability and makes it easier to troubleshoot issues. By storing your IaC code in a version control system, you can track changes, and identify when changes were made and by whom. If necessary, you can easily roll back to previous versions.
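The following toy sketch illustrates the core IaC idea of declaring infrastructure as data and computing a plan to reconcile actual state toward it, the plan/apply cycle that tools like Terraform perform for real resources. The resource model is deliberately simplified.

```python
# Desired state, as it might be declared in version-controlled configuration.
desired_state = {
    "vm-frontend": {"machine_type": "e2-medium", "zone": "us-central1-a"},
    "vm-backend":  {"machine_type": "e2-standard-2", "zone": "us-central1-b"},
}

# Actual state, as an IaC tool might discover it from the cloud provider.
actual_state = {
    "vm-frontend": {"machine_type": "e2-small", "zone": "us-central1-a"},
    "vm-legacy":   {"machine_type": "n1-standard-1", "zone": "us-east1-b"},
}

def plan(desired: dict, actual: dict) -> list[str]:
    """Compute the changes needed to make actual state match the declaration."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"CREATE {name} {spec}")
        elif actual[name] != spec:
            actions.append(f"UPDATE {name} -> {spec}")
    actions += [f"DELETE {name}" for name in actual if name not in desired]
    return actions

for action in plan(desired_state, actual_state):
    print(action)
```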
Implement version control
A version control system like Git is a key component of the IaC process. It provides robust change-management and risk-mitigation capabilities, and it's widely available through self-hosted deployments or SaaS solutions. This recommendation is relevant to these focus areas of operational readiness: governance and tooling.
By tracking changes to IaC code and configurations, version control provides visibility into the evolution of the code, making it easier to understand the impact of changes and identify potential issues. This enhanced visibility fosters collaboration among team members who work on the same IaC project.
Most version control systems let you easily roll back changes if needed. This capability helps to mitigate the risk of unintended consequences or errors. By using tools like Git in your IaC workflow, you can significantly improve change management processes, foster collaboration, and mitigate risks, which leads to a more efficient and reliable IaC implementation.
Build CI/CD pipelines
Continuous integration and continuous delivery (CI/CD) pipelines streamline the process of developing and deploying cloud applications. CI/CD pipelines automate the building, testing, and deployment stages, which enables faster and more frequent releases with improved quality control. This recommendation is relevant to the tooling focus area of operational readiness.
CI/CD pipelines ensure that code changes are continuously integrated into a central repository, typically a version control system like Git. Continuous integration facilitates early detection and resolution of issues, and it reduces the likelihood of bugs or compatibility problems.
To create and manage CI/CD pipelines for cloud applications, you can use tools like Cloud Build and Cloud Deploy.
- Cloud Build is a fully managed build service that lets developers define and execute build steps in a declarative manner. It integrates seamlessly with popular source-code management platforms and it can be triggered by events like code pushes and pull requests.
- Cloud Deploy is a serverless deployment service that automates the process of deploying applications to various environments, such as testing, staging, and production. It provides features like blue-green deployments, traffic splitting, and rollback capabilities, making it easier to manage and monitor application deployments.
Integrating CI/CD pipelines with version control systems and testing frameworks helps to ensure the quality and reliability of your cloud applications. By running automated tests as part of the CI/CD process, development teams can quickly identify and fix any issues before the code is deployed to the production environment. This integration helps to improve the overall stability and performance of your cloud applications.
Use configuration management tools
Tools like Puppet, Chef, Ansible, and VM Manager help you to automate the configuration and management of cloud resources. Using these tools, you can ensure resource consistency and compliance across your cloud environments. This recommendation is relevant to the tooling focus area of operational readiness.
Automating the configuration and management of cloud resources provides the following benefits:
- Significant reduction in the risk of manual errors: When manual processes are involved, there is a higher likelihood of mistakes due to human error. Configuration management tools reduce this risk by automating processes, so that configurations are applied consistently and accurately across all cloud resources. This automation can lead to improved reliability and stability of the cloud environment.
- Improvement in operational efficiency: By automating repetitive tasks, your organization can free up IT staff to focus on more strategic initiatives. This automation can lead to increased productivity and cost savings and improved responsiveness to changing business needs.
- Simplified management of complex cloud infrastructure: As cloud environments grow in size and complexity, managing the resources can become increasingly difficult. Configuration management tools provide a centralized platform for managing cloud resources. The tools make it easier to track configurations, identify issues, and implement changes. Using these tools can lead to improved visibility, control, and security of your cloud environment.
Automate testing
Integrating automated testing into your CI/CD pipelines helps to ensure the quality and reliability of your cloud applications. By validating changes before deployment, you can significantly reduce the risk of errors and regressions, which leads to a more stable and robust software system. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.
The following are the main benefits of incorporating automated testing into your CI/CD pipelines:
- Early detection of bugs and defects: Automated testing helps to detect bugs and defects early in the development process, before they can cause major problems in production. This capability saves time and resources by preventing the need for costly rework and bug fixes at later stages in the development process.
- High quality and standards-based code: Automated testing can help improve the overall quality of your code by ensuring that the code meets certain standards and best practices. This capability leads to more maintainable and reliable applications that are less prone to errors.
You can use various testing techniques in CI/CD pipelines. Each test type serves a specific purpose, as the unit-testing sketch after this list illustrates.
- Unit testing focuses on testing individual units of code, such as functions or methods, to ensure that they work as expected.
- Integration testing tests the interactions between different components or modules of your application to verify that they work properly together.
- End-to-end testing is often used along with unit and integration testing. End-to-end testing simulates real-world scenarios to test the application as a whole, and helps to ensure that the application meets the requirements of your end users.
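The following minimal sketch shows a unit test written with Python's built-in unittest module. The function under test is a placeholder for real application code; in a CI/CD pipeline, a failing run would block the deployment stage.

```python
import unittest

def apply_discount(price: float, percent: float) -> float:
    """Return the price after a percentage discount (placeholder logic)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)

class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 15), 170.0)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)

if __name__ == "__main__":
    unittest.main()
```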
To effectively integrate automated testing into your CI/CD pipelines, you must choose appropriate testing tools and frameworks. There are many different options, each with its own strengths and weaknesses. You must also establish a clear testing strategy that outlines the types of tests to be performed, the frequency of testing, and the criteria for passing or failing a test. By following these recommendations, you can ensure that your automated testing process is efficient and effective. Such a process provides valuable insights into the quality and reliability of your cloud applications.
Continuously improve and innovate
This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you continuously optimize cloud operations and drive innovation.
Principle overview
To continuously improve and innovate in the cloud, you need to focus on continuous learning, experimentation, and adaptation. This focus helps you to explore new technologies and optimize existing processes, and it promotes a culture of excellence that enables your organization to achieve and maintain industry leadership.
Through continuous improvement and innovation, you can achieve the following goals:
- Accelerate innovation: Explore new technologies and services to enhance capabilities and drive differentiation.
- Reduce costs: Identify and eliminate inefficiencies through process-improvement initiatives.
- Enhance agility: Adapt rapidly to changing market demands and customer needs.
- Improve decision making: Gain valuable insights from data and analytics to make data-driven decisions.
Organizations that embrace the continuous improvement and innovation principle can unlock the full potential of the cloud environment and achieve sustainable growth. This principle maps primarily to the workforce focus area of operational readiness. A culture of innovation lets teams experiment with new tools and technologies to expand capabilities and reduce costs.
Recommendations
To continuously improve and innovate your cloud workloads, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.
Foster a culture of learning
Encourage teams to experiment, share knowledge, and learn continuously. Adopt a blameless culture where failures are viewed as opportunities for growth and improvement. This recommendation is relevant to the workforce focus area of operational readiness.
When you foster a culture of learning, teams can learn from mistakes and iterate quickly. This approach encourages team members to take risks, experiment with new ideas, and expand the boundaries of their work. It also creates a psychologically safe environment where individuals feel comfortable sharing failures and learning from them. Sharing in this way leads to a more open and collaborative environment.
To facilitate knowledge sharing and continuous learning, create opportunities for teams to share knowledge and learn from each other. You can do this through informal and formal learning sessions and conferences.
By fostering a culture of experimentation, knowledge sharing, and continuous learning, you can create an environment where teams are empowered to take risks, innovate, and grow. This environment can lead to increased productivity, improved problem-solving, and a more engaged and motivated workforce. Further, by promoting a blameless culture, you can create a safe space for employees to learn from mistakes and contribute to the collective knowledge of the team. This culture ultimately leads to a more resilient and adaptable workforce that is better equipped to handle challenges and drive success in the long run.
Conduct regular retrospectives
Retrospectives give teams an opportunity to reflect on their experiences, identify what went well, and determine what can be improved. By conducting retrospectives after projects or major incidents, teams can learn from successes and failures, and continuously improve their processes and practices. This recommendation is relevant to these focus areas of operational readiness: processes and governance.
An effective way to structure a retrospective is to use the Start-Stop-Continue model:
- Start: In the Start phase of the retrospective, team members identify new practices, processes, and behaviors that they believe can enhance their work. They discuss why the changes are needed and how they can be implemented.
- Stop: In the Stop phase, team members identify and eliminate practices, processes, and behaviors that are no longer effective or that hinder progress. They discuss why these changes are necessary and how they can be implemented.
- Continue: In the Continue phase, team members identify practices, processes, and behaviors that work well and must be continued. They discuss why these elements are important and how they can be reinforced.
By using a structured format like the Start-Stop-Continue model, teams can ensure that retrospectives are productive and focused. This model helps to facilitate discussion, identify the main takeaways, and identify actionable steps for future enhancements.
Stay up-to-date with cloud technologies
To maximize the potential of Google Cloud services, you must keep up with the latest advancements, features, and best practices. This recommendation is relevant to the workforce focus area of operational readiness.
Participating in relevant conferences, webinars, and training sessions is a valuable way to expand your knowledge. These events provide opportunities to learn from Google Cloud experts, understand new capabilities, and engage with industry peers who might face similar challenges. By attending these sessions, you can gain insights into how to use new features effectively, optimize your cloud operations, and drive innovation within your organization.
To ensure that your team members keep up with cloud technologies, encourage them to obtain certifications and attend training courses. Google Cloud offers a wide range of certifications that validate skills and knowledge in specific cloud domains. Earning these certifications demonstrates commitment to excellence and provides tangible evidence of proficiency in cloud technologies. The training courses that are offered by Google Cloud and our partners delve deeper into specific topics. They provide direct experience and practical skills that can be immediately applied to real-world projects. By investing in the professional development of your team, you can foster a culture of continuous learning and ensure that everyone has the necessary skills to succeed in the cloud.
Actively seek and incorporate feedback
Collect feedback from users, stakeholders, and team members. Use the feedback to identify opportunities to improve your cloud solutions. This recommendation is relevant to the workforce focus area of operational readiness.
The feedback that you collect can help you to understand the evolving needs, issues, and expectations of the users of your solutions. This feedback serves as a valuable input to drive improvements and prioritize future enhancements. You can use various mechanisms to collect feedback:
- Surveys are an effective way to gather quantitative data from a large number of users and stakeholders.
- User interviews provide an opportunity for in-depth qualitative data collection. Interviews let you understand the specific challenges and experiences of individual users.
- Feedback forms that are placed within the cloud solutions offer a convenient way for users to provide immediate feedback on their experience.
- Regular meetings with team members can facilitate the collection of feedback on technical aspects and implementation challenges.
The feedback that you collect through these mechanisms must be analyzed and synthesized to identify common themes and patterns. This analysis can help you prioritize future enhancements based on the impact and feasibility of the suggested improvements. By addressing the needs and issues that are identified through feedback, you can ensure that your cloud solutions continue to meet the evolving requirements of your users and stakeholders.
Measure and track progress
Key performance indicators (KPIs) and metrics are crucial for tracking progress and measuring the effectiveness of your cloud operations. KPIs are quantifiable measurements that reflect overall performance. Metrics are specific data points that contribute to the calculation of KPIs. Review the metrics regularly and use them to identify opportunities for improvement and measure progress. Doing so helps you to continuously improve and optimize your cloud environment. This recommendation is relevant to these focus areas of operational readiness: governance and processes.
A primary benefit of using KPIs and metrics is that they enable your organization to adopt a data-driven approach to cloud operations. By tracking and analyzing operational data, you can make informed decisions about how to improve the cloud environment. This data-driven approach helps you to identify trends, patterns, and anomalies that might not be visible without the use of systematic metrics.
To collect and analyze operational data, you can use tools like Cloud Monitoring and BigQuery. Cloud Monitoring enables real-time monitoring of cloud resources and services. BigQuery lets you store and analyze the data that you gather through monitoring. Using these tools together, you can create custom dashboards to visualize important metrics and trends.
Operational dashboards can provide a centralized view of the most important metrics, which lets you quickly identify any areas that need attention. For example, a dashboard might include metrics like CPU utilization, memory usage, network traffic, and latency for a particular application or service. By monitoring these metrics, you can quickly identify any potential issues and take steps to resolve them.
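As a sketch of this kind of data-driven tracking, the following example reads one hour of CPU-utilization data with the Cloud Monitoring API. It assumes the `google-cloud-monitoring` Python client library is installed and that `PROJECT_ID` is a placeholder for your project; the metric filter shown is just one of many you might chart on a dashboard.

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# Read the last hour of per-VM CPU utilization, a common dashboard metric.
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    instance = series.resource.labels.get("instance_id", "unknown")
    latest = series.points[0].value.double_value  # points are newest-first
    print(f"instance {instance}: CPU utilization {latest:.1%}")
```

The same data can be exported to BigQuery for longer-term KPI analysis and custom dashboarding.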
Google Cloud Architecture Framework: Security, privacy, and compliance
The Security, privacy, and compliance pillar in the Google Cloud Architecture Framework provides recommendations to help you design, deploy, and operate cloud workloads that meet your requirements for security, privacy, and compliance.
This document addresses the needs of a range of security professionals and engineers. The following table describes the intended audiences for this document:
| Audience | What this document provides |
| --- | --- |
| Chief information security officers (CISOs), business unit leaders, and IT managers | A general framework to establish and maintain security excellence in the cloud and to ensure a comprehensive view of security areas to make informed decisions about security investments. |
| Security architects and engineers | Key security practices for the design and operational phases to help ensure that solutions are designed for security, efficiency, and scalability. |
| DevSecOps teams | Guidance to incorporate overarching security controls to plan automation that enables secure and reliable infrastructure. |
| Compliance officers and risk managers | Key security recommendations to follow a structured approach to risk management with safeguards that help to meet compliance obligations. |
To ensure that your Google Cloud workloads meet your security, privacy, and compliance requirements, all of the stakeholders in your organization must adopt a collaborative approach. In addition, you must recognize that cloud security is a shared responsibility between you and Google. For more information, see Shared responsibilities and shared fate on Google Cloud.
The recommendations in this pillar are grouped into core security principles. Each principle-based recommendation is mapped to one or more of the key deployment focus areas of cloud security that might be critical to your organization. Each recommendation highlights guidance about the use and configuration of Google Cloud products and capabilities to help improve your organization's security posture.
Core principles
The recommendations in this pillar are grouped within the following core principles of security. Every principle in this pillar is important. Depending on the requirements of your organization and workload, you might choose to prioritize certain principles.
- Implement security by design: Integrate cloud security and network security considerations starting from the initial design phase of your applications and infrastructure. Google Cloud provides architecture blueprints and recommendations to help you apply this principle.
- Implement zero trust: Use a never trust, always verify approach, where access to resources is granted based on continuous verification of trust. Google Cloud supports this principle through products like Chrome Enterprise Premium and Identity-Aware Proxy (IAP).
- Implement shift-left security: Implement security controls early in the software development lifecycle. Avoid security defects before system changes are made. Detect and fix security bugs early, fast, and reliably after the system changes are committed. Google Cloud supports this principle through products like Cloud Build, Binary Authorization, and Artifact Registry.
- Implement preemptive cyber defense: Adopt a proactive approach to security by implementing robust fundamental measures like threat intelligence. This approach helps you build a foundation for more effective threat detection and response. Google Cloud's approach to layered security controls aligns with this principle.
- Use AI securely and responsibly: Develop and deploy AI systems in a responsible and secure manner. The recommendations for this principle are aligned with guidance in the AI and ML perspective of the Architecture Framework and in Google's Secure AI Framework (SAIF).
- Use AI for security: Use AI capabilities to improve your existing security systems and processes through Gemini in Security and overall platform-security capabilities. Use AI as a tool to increase the automation of remedial work and ensure security hygiene to make other systems more secure.
- Meet regulatory, compliance, and privacy needs: Adhere to industry-specific regulations, compliance standards, and privacy requirements. Google Cloud helps you meet these obligations through products like Assured Workloads, Organization Policy Service, and our compliance resource center.
Organizational security mindset
A security-focused organizational mindset is crucial for successful cloud adoption and operation. This mindset should be deeply ingrained in your organization's culture and reflected in its practices, which are guided by core security principles as described earlier.
An organizational security mindset emphasizes that you think about security during system design, assume zero trust, and integrate security features throughout your development process. In this mindset, you also think proactively about cyber-defense measures, use AI securely and for security, and consider your regulatory, privacy, and compliance requirements. By embracing these principles, your organization can cultivate a security-first culture that proactively addresses threats, protects valuable assets, and helps to ensure responsible technology usage.
Focus areas of cloud security
This section describes the areas for you to focus on when you plan, implement, and manage security for your applications, systems, and data. The recommendations in each principle of this pillar are relevant to one or more of these focus areas. Throughout the rest of this document, the recommendations specify the corresponding security focus areas to provide further clarity and context.
The focus areas of cloud security are the following:
- Infrastructure security
- Identity and access management
- Data security
- AI and ML security
- Security operations (SecOps)
- Application security
- Cloud governance, risk, and compliance
- Logging, auditing, and monitoring
Contributors
Authors:
- Wade Holmes | Global Solutions Director
- Hector Diaz | Cloud Security Architect
- Carlos Leonardo Rosario | Google Cloud Security Specialist
- John Bacon | Partner Solutions Architect
- Sachin Kalra | Global Security Solution Manager
Other contributors:
- Anton Chuvakin | Security Advisor, Office of the CISO
- Daniel Lees | Cloud Security Architect
- Filipe Gracio, PhD | Customer Engineer
- Gary Harmson | Customer Engineer
- Gino Pelliccia | Principal Architect
- Jose Andrade | Enterprise Infrastructure Customer Engineer
- Kumar Dhanagopal | Cross-Product Solution Developer
- Laura Hyatt | Enterprise Cloud Architect
- Marwan Al Shawi | Partner Customer Engineer
- Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
- Noah McDonald | Cloud Security Consultant
- Osvaldo Costa | Networking Specialist Customer Engineer
- Radhika Kanakam | Senior Program Manager, Cloud GTM
- Susan Wu | Outbound Product Manager
Implement security by design
This principle in the security pillar of the Google Cloud Architecture Framework provides recommendations to incorporate robust security features, controls, and practices into the design of your cloud applications, services, and platforms. From ideation to operations, security is more effective when it's embedded as an integral part of every stage of your design process.
Principle overview
As explained in An Overview of Google's Commitment to Secure by Design, secure by default and secure by design are often used interchangeably, but they represent distinct approaches to building secure systems. Both approaches aim to minimize vulnerabilities and enhance security, but they differ in scope and implementation:
- Secure by default: focuses on ensuring that a system's default settings are set to a secure mode, minimizing the need for users or administrators to take actions to secure the system. This approach aims to provide a baseline level of security for all users.
- Secure by design: emphasizes proactively incorporating security considerations throughout a system's development lifecycle. This approach is about anticipating potential threats and vulnerabilities early and making design choices that mitigate risks. This approach involves using secure coding practices, conducting security reviews, and embedding security throughout the design process. The secure-by-design approach is an overarching philosophy that guides the development process and helps to ensure that security isn't an afterthought but is an integral part of a system's design.
Recommendations
To implement the secure by design principle for your cloud workloads, consider the recommendations in the following sections:
- Choose system components that help to secure your workloads
- Build a layered security approach
- Use hardened and attested infrastructure and services
- Encrypt data at rest and in transit
Choose system components that help to secure your workloads
This recommendation is relevant to all of the focus areas.
A fundamental decision for effective security is the selection of robust system components—including both hardware and software components—that constitute your platform, solution, or service. To reduce the security attack surface and limit potential damage, you must also carefully consider the deployment patterns of these components and their configurations.
In your application code, we recommend that you use straightforward, safe, and reliable libraries, abstractions, and application frameworks in order to eliminate classes of vulnerabilities. To scan for vulnerabilities in software libraries, you can use third-party tools. You can also use Assured Open Source Software, which helps to reduce risks to your software supply chain by using open source software (OSS) packages that Google uses and secures.
Your infrastructure must use networking, storage, and compute options that support safe operation and align with your security requirements and risk acceptance levels. Infrastructure security is important for both internet-facing and internal workloads.
For information about other Google solutions that support this recommendation, see Implement shift-left security.
Build a layered security approach
This recommendation is relevant to the following focus areas:
- AI and ML security
- Infrastructure security
- Identity and access management
- Data security
We recommend that you implement security at each layer of your application and infrastructure stack by applying a defense-in-depth approach.
Use the security features in each component of your platform. To limit access and identify the boundaries of the potential impact (that is, the blast radius) in the event of a security incident, do the following:
- Simplify your system's design to accommodate flexibility where possible.
- Document the security requirements of each component.
- Incorporate a robust, secure mechanism to address resiliency and recovery requirements.
When you design the security layers, perform a risk assessment to determine the security features that you need in order to meet internal security requirements and external regulatory requirements. We recommend that you use an industry-standard risk assessment framework that applies to cloud environments and that is relevant to your regulatory requirements. For example, the Cloud Security Alliance (CSA) provides the Cloud Controls Matrix (CCM). Your risk assessment provides you with a catalog of risks and corresponding security controls to mitigate them.
When you perform the risk assessment, remember that you have a shared responsibility arrangement with your cloud provider. Therefore, your risks in a cloud environment differ from your risks in an on-premises environment. For example, in an on-premises environment, you need to mitigate vulnerabilities to your hardware stack. In contrast, in a cloud environment, the cloud provider bears these risks. Also, remember that the boundaries of shared responsibilities differ between IaaS, PaaS, and SaaS services for each cloud provider.
After you identify potential risks, you must design and create a mitigation plan that uses technical, administrative, and operational controls, as well as contractual protections and third-party attestations. In addition, a threat modeling method, such as the OWASP application threat modeling method, helps you to identify potential gaps and suggest actions to address the gaps.
Use hardened and attested infrastructure and services
This recommendation is relevant to all of the focus areas.
A mature security program mitigates new vulnerabilities as described in security bulletins. The security program should also provide remediation to fix vulnerabilities in existing deployments and secure your VM and container images. You can use hardening guides that are specific to the OS and applications of your images, as well as benchmarks like those provided by the Center for Internet Security (CIS).
If you use custom images for your Compute Engine VMs, you need to patch the images yourself. Alternatively, you can use Google-provided curated OS images, which are patched regularly. To run containers on Compute Engine VMs, use Google-curated Container-Optimized OS images. Google regularly patches and updates these images.
If you use GKE, we recommend that you enable node auto-upgrades so that Google updates your cluster nodes with the latest patches. Google manages GKE control planes, which are automatically updated and patched. To further reduce the attack surface of your containers, you can use distroless images. Distroless images are ideal for security-sensitive applications, microservices, and situations where minimizing the image size and attack surface is paramount.
For sensitive workloads, use Shielded VM, which prevents malicious code from being loaded during the VM boot cycle. Shielded VM instances provide boot security and integrity monitoring, and they use the virtual Trusted Platform Module (vTPM).
To help secure SSH access, OS Login lets your employees connect to your VMs by using Identity and Access Management (IAM) permissions as the source of truth instead of relying on SSH keys. Therefore, you don't need to manage SSH keys throughout your organization. OS Login ties an administrator's access to their employee lifecycle, so when employees change roles or leave your organization, their access is revoked with their account. OS Login also supports Google two-factor authentication, which adds an extra layer of security against account takeover attacks.
In GKE, application instances run within Docker containers. To enable a defined risk profile and to restrict employees from making changes to containers, ensure that your containers are stateless and immutable. The immutability principle means that your employees don't modify the container or access it interactively. If the container must be changed, you build a new image and redeploy that image. Enable SSH access to the underlying containers only in specific debugging scenarios.
To help globally secure configurations across your environment, you can use organization policies to set constraints or guardrails on resources that affect the behavior of your cloud assets. For example, you can define the following organization policies and apply them either globally across a Google Cloud organization or selectively at the level of a folder or project:
- Disable external IP address allocation to VMs.
- Restrict resource creation to specific geographical locations.
- Disable the creation of service accounts or service account keys.
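As a rough sketch of how such a constraint might be set programmatically, the following example uses the Organization Policy Python client (`google-cloud-org-policy`) to enforce the boolean constraint that disables service account key creation at the project level. Treat the request shape and field names as assumptions to verify against the current client library; `my-project` is a placeholder.

```python
from google.cloud import orgpolicy_v2

PROJECT = "projects/my-project"  # placeholder project

client = orgpolicy_v2.OrgPolicyClient()

# Boolean constraint: block creation of new service account keys.
policy = orgpolicy_v2.Policy(
    name=f"{PROJECT}/policies/iam.disableServiceAccountKeyCreation",
    spec=orgpolicy_v2.PolicySpec(
        rules=[orgpolicy_v2.PolicySpec.PolicyRule(enforce=True)]
    ),
)
client.create_policy(request={"parent": PROJECT, "policy": policy})
```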
Encrypt data at rest and in transit
This recommendation is relevant to the following focus areas:
- Infrastructure security
- Data security
Data encryption is a foundational control to protect sensitive information, and it's a key part of data governance. An effective data protection strategy includes access control, data segmentation and geographical residency, auditing, and encryption implementation that's based on a careful assessment of requirements.
By default, Google Cloud encrypts customer data that's stored at rest, with no action required from you. In addition to default encryption, Google Cloud provides options for envelope encryption and encryption key management. You must identify the solutions that best fit your requirements for key generation, storage, and rotation, whether you're choosing the keys for your storage, for compute, or for big data workloads. For example, Customer-managed encryption keys (CMEKs) can be created in Cloud Key Management Service (Cloud KMS). The CMEKs can be either software-based or HSM-protected to meet your regulatory or compliance requirements, such as the need to rotate encryption keys regularly. Cloud KMS Autokey lets you automate the provisioning and assignment of CMEKs. In addition, you can bring your own keys that are sourced from a third-party key management system by using Cloud External Key Manager (Cloud EKM).
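To make the key-management flow tangible, the following sketch encrypts and decrypts a small payload with a Cloud KMS symmetric key. It assumes the `google-cloud-kms` library and an existing key ring and key; the project, location, key ring, and key names are placeholders.

```python
from google.cloud import kms

client = kms.KeyManagementServiceClient()

# Placeholders: substitute your own project, location, key ring, and key.
key_name = client.crypto_key_path(
    "my-project", "us-central1", "my-key-ring", "my-key"
)

# Encrypt a small payload with the symmetric key.
plaintext = b"sensitive configuration value"
encrypt_response = client.encrypt(
    request={"name": key_name, "plaintext": plaintext}
)

# Decrypt it again; Cloud KMS never exposes the key material itself.
decrypt_response = client.decrypt(
    request={"name": key_name, "ciphertext": encrypt_response.ciphertext}
)
assert decrypt_response.plaintext == plaintext
```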
We strongly recommend that data be encrypted in transit. Google encrypts and authenticates data in transit at one or more network layers when data moves outside physical boundaries that aren't controlled by Google or on behalf of Google. All VM-to-VM traffic within a VPC network and between peered VPC networks is encrypted. You can use MACsec to encrypt traffic over Cloud Interconnect connections, and IPsec to encrypt traffic over Cloud VPN connections. You can protect application-to-application traffic in the cloud by using security features like TLS and mTLS configurations in Apigee and Cloud Service Mesh for containerized applications.
By default, Google Cloud encrypts data at rest and data in transit across the network. However, data isn't encrypted by default while it's in use in memory. If your organization handles confidential data, you need to mitigate any threats that undermine the confidentiality and integrity of either the application or the data in system memory. To mitigate these threats, you can use Confidential Computing, which provides a trusted execution environment for your compute workloads. For more information, see Confidential VM overview.
Implement zero trust
This principle in the security pillar of the Google Cloud Architecture Framework helps you ensure comprehensive security across your cloud workloads. The principle of zero trust emphasizes the following practices:
- Eliminating implicit trust
- Applying the principle of least privilege to access control
- Enforcing explicit validation of all access requests
- Adopting an assume-breach mindset to enable continuous verification and security posture monitoring
Principle overview
The zero-trust model shifts the security focus from perimeter-based security to an approach where no user or device is considered to be inherently trustworthy. Instead, every access request must be verified, regardless of its origin. This approach involves authenticating and authorizing every user and device, validating their context (location and device posture), and granting least privilege access to only the necessary resources.
Implementing the zero-trust model helps your organization enhance its security posture by minimizing the impact of potential breaches and protecting sensitive data and applications against unauthorized access. The zero-trust model helps you ensure confidentiality, integrity, and availability of data and resources in the cloud.
Recommendations
To implement the zero-trust model for your cloud workloads, consider the recommendations in the following sections:
- Secure your network
- Verify every access attempt explicitly
- Monitor and maintain your network
Secure your network
This recommendation is relevant to the following focus area: Infrastructure security.
Transitioning from conventional perimeter-based security to a zero-trust model requires multiple steps. Your organization might have already integrated certain zero-trust controls into its security posture. However, a zero-trust model isn't a singular product or solution. Instead, it's a holistic integration of multiple security layers and best practices. This section describes recommendations and techniques to implement zero trust for network security.
- Access control: Enforce access controls based on user identity and context by using solutions like Chrome Enterprise Premium and Identity-Aware Proxy (IAP). By doing this, you shift security from the network perimeter to individual users and devices. This approach enables granular access control and reduces the attack surface.
- Network security: Secure network connections between your on-premises, Google Cloud, and multicloud environments.
- Use the private connectivity methods from Cloud Interconnect and IPsec VPNs.
- To help secure access to Google Cloud services and APIs, use Private Service Connect.
- To help secure outbound access from workloads deployed on GKE Enterprise, use Cloud Service Mesh egress gateways.
- Network design: Prevent potential security risks by deleting default networks in existing projects and disabling the creation of default networks in new projects.
- To avoid conflicts, plan your network and IP address allocation carefully.
- To enforce effective access control, limit the number of Virtual Private Cloud (VPC) networks per project.
- Segmentation: Isolate workloads but maintain centralized network management.
- To segment your network, use Shared VPC.
- Define firewall policies and rules at the organization, folder, and VPC network levels.
- To prevent data exfiltration, establish secure perimeters around sensitive data and services by using VPC Service Controls.
- Perimeter security: Protect against DDoS attacks and web application threats.
- To protect against threats, use Google Cloud Armor.
- Configure security policies to allow, deny, or redirect traffic at the Google Cloud edge.
- Automation: Automate infrastructure provisioning by embracing infrastructure as code (IaC) principles and by using tools like Terraform, Jenkins, and Cloud Build. IaC helps to ensure consistent security configurations, simplified deployments, and rapid rollbacks in case of issues.
- Secure foundation: Establish a secure application environment by using the Enterprise foundations blueprint. This blueprint provides prescriptive guidance and automation scripts to help you implement security best practices and configure your Google Cloud resources securely.
Verify every access attempt explicitly
This recommendation is relevant to the following focus areas:
- Identity and access management
- Security operations (SecOps)
- Logging, auditing, and monitoring
Implement strong authentication and authorization mechanisms for any user, device, or service that attempts to access your cloud resources. Don't rely on location or network perimeter as a security control. Don't automatically trust any user, device, or service, even if they are already inside the network. Instead, every attempt to access resources must be rigorously authenticated and authorized. You must implement strong identity verification measures, such as multi-factor authentication (MFA). You must also ensure that access decisions are based on granular policies that consider various contextual factors like user role, device posture, and location.
To implement this recommendation, use the following methods, tools, and technologies:
- Unified identity management: Ensure consistent identity management across your organization by using a single identity provider (IdP).
- Google Cloud supports federation with most IdPs, including on-premises Active Directory. Federation lets you extend your existing identity management infrastructure to Google Cloud and enable single sign-on (SSO) for users.
- If you don't have an existing IdP, consider using Cloud Identity Premium or Google Workspace.
- Limited service account permissions: Use service accounts carefully, and adhere to the principle of least privilege.
- Grant only the necessary permissions required for each service account to perform its designated tasks.
- Use Workload Identity Federation for applications that run on Google Kubernetes Engine (GKE) or run outside Google Cloud to access resources securely.
- Robust processes: Update your identity processes to align with cloud security best practices.
- To help ensure compliance with regulatory requirements, implement identity governance to track access, risks, and policy violations.
- Review and update your existing processes for granting and auditing access-control roles and permissions.
- Strong authentication: Implement SSO for user authentication and implement MFA for privileged accounts.
- Google Cloud supports various MFA methods, including Titan Security Keys, for enhanced security.
- For workload authentication, use OAuth 2.0 or signed JSON Web Tokens (JWTs), as shown in the sketch after this list.
- Least privilege: Minimize the risk of unauthorized access and data breaches by enforcing the principles of least privilege and separation of duties.
- Avoid overprovisioning user access.
- Consider implementing just-in-time privileged access for sensitive operations.
- Logging: Enable audit logging for administrator and data access activities.
- For analysis and threat detection, scan the logs by using Security Command Center Enterprise or Google Security Operations.
- Configure appropriate log retention policies to balance security needs with storage costs.
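The following minimal sketch shows the JWT-verification side of workload authentication, using the `google-auth` Python library to validate a Google-signed ID token. The token value and expected audience are placeholders; this illustrates the pattern rather than a complete service.

```python
from google.auth.transport import requests as google_requests
from google.oauth2 import id_token

EXPECTED_AUDIENCE = "https://my-service.example.com"  # placeholder audience


def verify_caller(token: str) -> dict:
    """Verify a Google-signed ID token and return its claims.

    Raises ValueError if the token is expired, malformed, or issued
    for a different audience.
    """
    request = google_requests.Request()
    claims = id_token.verify_oauth2_token(
        token, request, audience=EXPECTED_AUDIENCE
    )
    # Signature, expiry, and audience are now verified; apply your own
    # authorization logic to the verified claims.
    return claims
```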
Monitor and maintain your network
This recommendation is relevant to the following focus areas:
- Logging, auditing, and monitoring
- Application security
- Security operations (SecOps)
- Infrastructure security
When you plan and implement security measures, assume that an attacker is already inside your environment. This proactive approach involves using multiple tools and techniques to provide visibility into your network:
- Centralized logging and monitoring: Collect and analyze security logs from all of your cloud resources.
  - Establish baselines for normal network behavior, detect anomalies, and identify potential threats. A minimal baseline-check sketch follows this list.
  - Continuously analyze network traffic flows to identify suspicious patterns and potential attacks.
- Insights into network performance and security: Use tools like Network Analyzer. Monitor traffic for unusual protocols, unexpected connections, or sudden spikes in data transfer, which could indicate malicious activity.
- Vulnerability scanning and remediation: Regularly scan your network and applications for vulnerabilities.
  - Use Web Security Scanner, which can automatically identify vulnerabilities in your Compute Engine instances, containers, and GKE clusters.
  - Prioritize remediation based on the severity of vulnerabilities and their potential impact on your systems.
- Intrusion detection: Monitor network traffic for malicious activity, and automatically block or get alerts for suspicious events by using Cloud IDS and the Cloud NGFW intrusion prevention service.
- Security analysis: Consider implementing Google SecOps to correlate security events from various sources, provide real-time analysis of security alerts, and facilitate incident response.
- Consistent configurations: Ensure that you have consistent security configurations across your network by using configuration management tools.
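The baseline-and-anomaly idea in the first item can be reduced to a simple statistical check. The following sketch, which assumes only the Python standard library, flags a sample that deviates from a baseline by more than a chosen number of standard deviations; real deployments would instead use Cloud Monitoring alerting policies or Google SecOps detections.

```python
import statistics


def is_anomalous(baseline: list[float], sample: float, z_threshold: float = 3.0) -> bool:
    """Return True if the sample deviates from the baseline mean by more
    than z_threshold standard deviations (a simple illustrative heuristic)."""
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        return sample != mean
    return abs(sample - mean) / stdev > z_threshold


# Example: egress megabytes per minute over the last hour, compared
# against the newest sample.
egress_baseline = [120.0, 118.5, 125.2, 119.9, 122.3, 121.0]
print(is_anomalous(egress_baseline, 480.0))  # True: worth investigating
```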
Implement shift-left security
This principle in the security pillar of the Google Cloud Architecture Framework helps you identify practical controls that you can implement early in the software development lifecycle to improve your security posture. It provides recommendations that help you implement preventive security guardrails and post-deployment security controls.
Principle overview
Shift-left security means adopting security practices early in the software development lifecycle. This principle has the following goals:
- Avoid security defects before system changes are made. Implement preventive security guardrails and adopt practices such as infrastructure as code (IaC), policy as code, and security checks in the CI/CD pipeline. You can also use other platform-specific capabilities like Organization Policy Service and hardened GKE clusters in Google Cloud.
- Detect and fix security bugs early, fast, and reliably after any system changes are committed. Adopt practices like code reviews, post-deployment vulnerability scanning, and security testing.
The security-by-design and shift-left security principles are related, but they differ in scope. The security-by-design principle helps you to avoid fundamental design flaws that would require re-architecting the entire system. For example, a threat-modeling exercise might reveal that the current design doesn't include an authorization policy, so all users would have the same level of access. Shift-left security helps you to avoid implementation defects (bugs and misconfigurations) before changes are applied, and it enables fast, reliable fixes after deployment.
Recommendations
To implement the shift-left security principle for your cloud workloads, consider the recommendations in the following sections:
- Adopt preventive security controls
- Automate provisioning and management of cloud resources
- Automate secure application releases
- Ensure that application deployments follow approved processes
- Scan for known vulnerabilities before application deployment
- Monitor your application code for known vulnerabilities
Adopt preventive security controls
This recommendation is relevant to the following focus areas:
- Identity and access management
- Cloud governance, risk, and compliance
Preventive security controls are crucial for maintaining a strong security posture in the cloud. These controls help you proactively mitigate risks. You can prevent misconfigurations and unauthorized access to resources, enable developers to work efficiently, and help ensure compliance with industry standards and internal policies.
Preventive security controls are more effective when they're implemented by using infrastructure as code (IaC). With IaC, preventive security controls can include more customized checks on the infrastructure code before changes are deployed. When combined with automation, preventive security controls can run as part of your CI/CD pipeline's automatic checks.
The following products and Google Cloud capabilities can help you implement preventive controls in your environment:
- Organization Policy Service constraints: configure predefined and custom constraints with centralized control.
- VPC Service Controls: create perimeters around your Google Cloud services.
- Identity and Access Management (IAM), Privileged Access Manager, and principal access boundary policies: restrict access to resources.
- Policy Controller and Open Policy Agent (OPA): enforce IaC constraints in your CI/CD pipeline and avoid cloud misconfigurations.
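To illustrate the policy-as-code idea, the following sketch is a hypothetical CI step that inspects a Terraform plan (exported with `terraform show -json plan.out > plan.json`) and fails the build if any Compute Engine instance requests an external IP address. The check is deliberately simple; production pipelines would typically use Policy Controller or OPA instead.

```python
#!/usr/bin/env python3
"""Illustrative pre-deployment check: fail the CI job if a Terraform plan
assigns an external IP to any Compute Engine instance."""
import json
import sys


def has_external_ip(resource: dict) -> bool:
    # In Terraform plan JSON, google_compute_instance resources expose
    # network_interface[].access_config[]; a non-empty access_config
    # means an external IP will be allocated.
    if resource.get("type") != "google_compute_instance":
        return False
    after = resource.get("change", {}).get("after") or {}
    for nic in after.get("network_interface", []):
        if nic.get("access_config"):
            return True
    return False


def main(plan_path: str) -> int:
    with open(plan_path) as f:
        plan = json.load(f)
    offenders = [
        r["address"]
        for r in plan.get("resource_changes", [])
        if has_external_ip(r)
    ]
    if offenders:
        print("Blocked: external IPs requested by:", ", ".join(offenders))
        return 1
    print("OK: no external IPs in this plan.")
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1] if len(sys.argv) > 1 else "plan.json"))
```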
IAM lets you authorize who can act on specific resources based on permissions. For more information, see Access control for organization resources with IAM.
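For example, the following sketch grants a read-only role on a Cloud Storage bucket by using the IAM methods in the `google-cloud-storage` Python library. The bucket name and group address are placeholders.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-example-bucket")  # placeholder bucket name

# Read-modify-write of the bucket's IAM policy.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"group:data-readers@example.com"},  # placeholder principal
    }
)
bucket.set_iam_policy(policy)
print(f"Granted objectViewer on {bucket.name}")
```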
Organization Policy Service lets you set restrictions on resources to specify how they can be configured. For example, you can use an organization policy to do the following:
- Limit resource sharing based on domain.
- Limit the use of service accounts.
- Restrict the physical location of newly created resources.
In addition to using organization policies, you can restrict access to resources by using the following methods:
- Tags with IAM: assign a tag to a set of resources and then set the access definition for the tag itself, rather than defining the access permissions on each resource.
- IAM Conditions: define conditional, attribute-based access control for resources.
- Defense in depth: use VPC Service Controls to further restrict access to resources.
For more information about resource management, see Decide a resource hierarchy for your Google Cloud landing zone.
Automate provisioning and management of cloud resources
This recommendation is relevant to the following focus areas:
- Application security
- Cloud governance, risk, and compliance
Automating the provisioning and management of cloud resources and workloads is more effective when you also adopt declarative IaC, as opposed to imperative scripting. IaC isn't a security tool or practice on its own, but it helps you to improve the security of your platform. Adopting IaC lets you create repeatable infrastructure and provides your operations team with a known good state. IaC also makes rollbacks, change audits, and troubleshooting more efficient.
When combined with CI/CD pipelines and automation, IaC also gives you the ability to adopt practices such as policy as code with tools like OPA. You can audit infrastructure changes over time and run automatic checks on the infrastructure code before changes are deployed.
To automate the infrastructure deployment, you can use tools like Config Controller, Terraform, Jenkins, and Cloud Build. To help you build a secure application environment using IaC and automation, Google Cloud provides the enterprise foundations blueprint. This blueprint is Google's opinionated design that follows all of our recommended practices and configurations. The blueprint provides step-by-step instructions to configure and deploy your Google Cloud topology by using Terraform and Cloud Build.
You can modify the scripts of the enterprise foundations blueprint to configure an environment that follows Google recommendations and meets your own security requirements. You can further build on the blueprint with additional blueprints or design your own automation. The Google Cloud Architecture Center provides other blueprints that can be implemented on top of the enterprise foundations blueprint. The following are a few examples of these blueprints:
- Deploy an enterprise developer platform on Google Cloud
- Deploy a secured serverless architecture using Cloud Run
- Build and deploy generative AI and machine learning models in an enterprise
- Import data from Google Cloud into a secured BigQuery data warehouse
- Deploy network monitoring and telemetry capabilities in Google Cloud
Automate secure application releases
This recommendation is relevant to the following focus area: Application security.
Without automated tools, it can be difficult to deploy, update, and patch complex application environments to meet consistent security requirements. We recommend that you build automated CI/CD pipelines for your software development lifecycle (SDLC). Automated CI/CD pipelines help you to remove manual errors, provide standardized development feedback loops, and enable efficient product iterations. Continuous delivery is one of the best practices that the DORA framework recommends.
Automating application releases by using CI/CD pipelines helps to improve your ability to detect and fix security bugs early, fast, and reliably. For example, you can scan for security vulnerabilities automatically when artifacts are created, narrow the scope of security reviews, and roll back to a known and safe version. You can also define policies for different environments (such as development, test, or production environments) so that only verified artifacts are deployed.
To help you automate application releases and embed security checks in your CI/CD pipeline, Google Cloud provides multiple tools including Cloud Build, Cloud Deploy, Web Security Scanner, and Binary Authorization.
To establish a process that verifies multiple security requirements in your SDLC, use the Supply-chain Levels for Software Artifacts (SLSA) framework, which has been defined by Google. SLSA requires security checks for source code, build process, and code provenance. Many of these requirements can be included in an automated CI/CD pipeline. To understand how Google applies these practices internally, see Google Cloud's approach to change.
Ensure that application deployments follow approved processes
This recommendation is relevant to the following focus area: Application security.
If an attacker compromises your CI/CD pipeline, your entire application stack can be affected. To help secure the pipeline, you should enforce an established approval process before you deploy the code into production.
If you use Google Kubernetes Engine (GKE), GKE Enterprise, or Cloud Run, you can establish an approval process by using Binary Authorization. Binary Authorization attaches configurable signatures to container images. These signatures (also called attestations) help to validate the image. At deployment time, Binary Authorization uses these attestations to determine whether a process was completed. For example, you can use Binary Authorization to do the following:
- Verify that a specific build system or CI pipeline created a container image.
- Validate that a container image is compliant with a vulnerability signing policy.
- Verify that a container image passes the criteria for promotion to the next deployment environment, such as from development to QA.
By using Binary Authorization, you can enforce that only trusted code runs on your target platforms.
Scan for known vulnerabilities before application deployment
This recommendation is relevant to the following focus area: Application security.
We recommend that you use automated tools that can continuously perform vulnerability scans on application artifacts before they're deployed to production.
For containerized applications, use Artifact Analysis to automatically run vulnerability scans for container images. Artifact Analysis scans new images when they're uploaded to Artifact Registry. The scan extracts information about the system packages in the container. After the initial scan, Artifact Analysis continuously monitors the metadata of scanned images in Artifact Registry for new vulnerabilities. When Artifact Analysis receives new and updated vulnerability information from vulnerability sources, it does the following:
- Updates the metadata of the scanned images to keep them up to date.
- Creates new vulnerability occurrences for new notes.
- Deletes vulnerability occurrences that are no longer valid.
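As a sketch of how you might surface these scan results in a pipeline, the following example lists the vulnerability occurrences that Artifact Analysis has recorded for one image, using the `google-cloud-containeranalysis` Python library. The project ID and image URL are placeholders, and you should verify the request shape against the current client library.

```python
from google.cloud.devtools import containeranalysis_v1


def list_image_vulnerabilities(project_id: str, resource_url: str) -> None:
    """Print the vulnerability occurrences recorded for one container image.

    resource_url example (placeholder):
    "https://us-docker.pkg.dev/PROJECT/REPO/IMAGE@sha256:..."
    """
    client = containeranalysis_v1.ContainerAnalysisClient()
    grafeas_client = client.get_grafeas_client()
    occurrences = grafeas_client.list_occurrences(
        request={
            "parent": f"projects/{project_id}",
            "filter": f'kind = "VULNERABILITY" AND resourceUrl = "{resource_url}"',
        }
    )
    for occ in occurrences:
        vuln = occ.vulnerability
        # Severity and a short description are enough to gate a deployment.
        print(occ.name, vuln.severity, vuln.short_description)
```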
Monitor your application code for known vulnerabilities
This recommendation is relevant to the following focus area: Application security.
Use automated tools to constantly monitor your application code for known vulnerabilities such as the OWASP Top 10. For more information about Google Cloud products and features that support OWASP Top 10 mitigation techniques, see OWASP Top 10 mitigation options on Google Cloud.
Use Web Security Scanner to help identify security vulnerabilities in your App Engine, Compute Engine, and GKE web applications. The scanner crawls your application, follows all of the links within the scope of your starting URLs, and attempts to exercise as many user inputs and event handlers as possible. It can automatically scan for and detect common vulnerabilities, including cross-site scripting, code injection, mixed content, and outdated or insecure libraries. Web Security Scanner provides early identification of these types of vulnerabilities while keeping false positives low.
In addition, if you use GKE Enterprise to manage fleets of Kubernetes clusters, the security posture dashboard shows opinionated, actionable recommendations to help improve your fleet's security posture.
Implement preemptive cyber defense
This principle in the security pillar of the Google Cloud Architecture Framework provides recommendations to build robust cyber-defense programs as part of your overall security strategy.
This principle emphasizes the use of threat intelligence to proactively guide your efforts across the core cyber-defense functions, as defined in The Defender's Advantage: A guide to activating cyber defense.
Principle overview
When you defend your system against cyber attacks, you have a significant, underutilized advantage against attackers. As the founder of Mandiant states, "You should know more about your business, your systems, your topology, your infrastructure than any attacker does. This is an incredible advantage." To help you use this inherent advantage, this document provides recommendations about proactive and strategic cyber-defense practices that are mapped to the Defender's Advantage framework.
Recommendations
To implement preemptive cyber defense for your cloud workloads, consider the recommendations in the following sections:
- Integrate the functions of cyber defense
- Use the Intelligence function in all aspects of cyber defense
- Understand and capitalize on your defender's advantage
- Validate and improve your defenses continuously
- Manage and coordinate cyber-defense efforts
Integrate the functions of cyber defense
This recommendation is relevant to all of the focus areas.
The Defender's Advantage framework identifies six critical functions of cyber defense: Intelligence, Detect, Respond, Validate, Hunt, and Mission Control. Each function focuses on a unique part of the cyber-defense mission, but these functions must be well-coordinated and work together to provide an effective defense. Focus on building a robust and integrated system where each function supports the others. If you need a phased approach for adoption, consider the following suggested order. Depending on your current cloud maturity, resource topology, and specific threat landscape, you might want to prioritize certain functions.
- Intelligence: The Intelligence function guides all the other functions. Understanding the threat landscape—including the most likely attackers, their tactics, techniques, and procedures (TTPs), and the potential impact—is critical to prioritizing actions across the entire program. The Intelligence function is responsible for stakeholder identification, definition of intelligence requirements, data collection, analysis and dissemination, automation, and the creation of a cyber threat profile.
- Detect and Respond: These functions make up the core of active defense, which involves identifying and addressing malicious activity. These functions are necessary to act on the intelligence that's gathered by the Intelligence function. The Detect function requires a methodical approach that aligns detections to attacker TTPs and ensures robust logging. The Respond function must focus on initial triage, data collection, and incident remediation.
- Validate: The Validate function is a continuous process that provides assurance that your security control ecosystem is up-to-date and operating as designed. This function ensures that your organization understands the attack surface, knows where vulnerabilities exist, and measures the effectiveness of controls. Security validation is also an important component of the detection engineering lifecycle and must be used to identify detection gaps and create new detections.
- Hunt: The Hunt function involves proactively searching for active threats within an environment. This function must be implemented when your organization has a baseline level of maturity in the Detect and Respond functions. The Hunt function expands the detection capabilities and helps to identify gaps and weaknesses in controls. The Hunt function must be based on specific threats. This advanced function benefits from a foundation of robust intelligence, detection, and response capabilities.
- Mission Control: The Mission Control function acts as the central hub that connects all of the other functions. This function is responsible for strategy, communication, and decisive action across your cyber-defense program. It ensures that all of the functions are working together and that they're aligned with your organization's business goals. You must focus on establishing a clear understanding of the purpose of the Mission Control function before you use it to connect the other functions.
Use the Intelligence function in all aspects of cyber defense
This recommendation is relevant to all of the focus areas.
This recommendation highlights the Intelligence function as a core part of a strong cyber-defense program. Threat intelligence provides knowledge about threat actors, their TTPs, and indicators of compromise (IOCs). This knowledge should inform and prioritize actions across all cyber-defense functions. An intelligence-driven approach helps you align defenses to meet the threats that are most likely to affect your organization. This approach also helps with efficient allocation and prioritization of resources.
The following Google Cloud products and features help you take advantage of threat intelligence to guide your security operations. Use these features to identify and prioritize potential threats, vulnerabilities, and risks, and then plan and implement appropriate actions.
Google Security Operations (Google SecOps) helps you store and analyze security data centrally. Use Google SecOps to map logs into a common model, enrich the logs, and link the logs to timelines for a comprehensive view of attacks. You can also create detection rules, set up IOC matching, and perform threat-hunting activities. The platform also provides curated detections, which are predefined and managed rules to help identify threats. Google SecOps also integrates industry-leading AI with threat intelligence from Mandiant and VirusTotal. This integration is critical for threat evaluation and for understanding who is targeting your organization and the potential impact.
Security Command Center Enterprise, which is powered by Google AI, enables security professionals to efficiently assess, investigate, and respond to security issues across multiple cloud environments. The security professionals who can benefit from Security Command Center include security operations center (SOC) analysts, vulnerability and posture analysts, and compliance managers. Security Command Center Enterprise enriches security data, assesses risk, and prioritizes vulnerabilities. This solution provides teams with the information that they need to address high-risk vulnerabilities and to remediate active threats.
Chrome Enterprise Premium offers threat and data protection, which helps to protect users from exfiltration risks and prevents malware from getting onto enterprise-managed devices. Chrome Enterprise Premium also provides visibility into unsafe or potentially unsafe activity that can happen within the browser.
Network monitoring, through tools like Network Intelligence Center, provides visibility into network performance. Network monitoring can also help you detect unusual traffic patterns or detect data transfer amounts that might indicate an attack or data exfiltration attempt.
Understand and capitalize on your defender's advantage
This recommendation is relevant to all of the focus areas.
As mentioned earlier, you have an advantage over attackers when you have a thorough understanding of your business, systems, topology, and infrastructure. To capitalize on this knowledge advantage, use this data about your environments during cyber-defense planning.
Google Cloud provides the following features to help you proactively gain visibility to identify threats, understand risks, and respond in a timely manner to mitigate potential damage:
Chrome Enterprise Premium helps you enhance security for enterprise devices by protecting users from exfiltration risks. It extends Sensitive Data Protection services into the browser, and it offers protection against malware and phishing to help prevent exposure to unsafe content. In addition, it gives you control over the installation of extensions to help prevent unsafe or unvetted extensions. These capabilities help you establish a secure foundation for your operations.
Security Command Center Enterprise provides a continuous risk engine that offers comprehensive and ongoing risk analysis and management. The risk engine feature enriches security data, assesses risk, and prioritizes vulnerabilities to help fix issues quickly. Security Command Center enables your organization to proactively identify weaknesses and implement mitigations.
Google SecOps centralizes security data and provides enriched logs with timelines. This enables defenders to proactively identify active compromises and adapt defenses based on attackers' behavior.
Network monitoring helps you identify irregular network activity that might indicate an attack, and it provides early indicators that you can use to take action. To help proactively protect your data from theft, continuously monitor for data exfiltration by using these tools.
Validate and improve your defenses continuously
This recommendation is relevant to all of the focus areas.
This recommendation emphasizes the importance of targeted testing and continuous validation of controls to understand strengths and weaknesses across the entire attack surface. This includes validating the effectiveness of controls, operations, and staff through methods like the following:
- Penetration tests
- Red team, blue team, and purple team exercises
- Tabletop exercises
You must also actively search for threats and use the results to improve detection and visibility. Use the following tools to continuously test and validate your defenses against real-world threats:
Security Command Center Enterprise provides a continuous risk engine to evaluate vulnerabilities and prioritize remediation, which enables ongoing evaluation of your overall security posture. By prioritizing issues, Security Command Center Enterprise helps you to ensure that resources are used effectively.
Google SecOps offers threat-hunting and curated detections that let you proactively identify weaknesses in your controls. This capability enables continuous testing and improvement of your ability to detect threats.
Chrome Enterprise Premium provides threat and data protection features that can help you to address new and evolving threats, and continuously update your defenses against exfiltration risks and malware.
Cloud Next Generation Firewall (Cloud NGFW) provides network monitoring and data-exfiltration monitoring. These capabilities can help you to validate the effectiveness of your current security posture and identify potential weaknesses. Data-exfiltration monitoring helps you to validate the strength of your organization's data protection mechanisms and make proactive adjustments where necessary. When you integrate threat findings from Cloud NGFW with Security Command Center and Google SecOps, you can optimize network-based threat detection, optimize threat response, and automate playbooks. For more information about this integration, see Unifying Your Cloud Defenses: Security Command Center & Cloud NGFW Enterprise.
Manage and coordinate cyber-defense efforts
This recommendation is relevant to all of the focus areas.
As described earlier in Integrate the functions of cyber defense, the Mission Control function interconnects the other functions of the cyber-defense program. This function enables coordination and unified management across the program. It also helps you coordinate with other teams that don't work on cybersecurity. The Mission Control function promotes empowerment and accountability, facilitates agility and expertise, and drives responsibility and transparency.
The following products and features can help you implement the Mission Control function:
- Security Command Center Enterprise acts as a central hub for coordinating and managing your cyber-defense operations. It brings tools, teams, and data together, along with the built-in Google SecOps response capabilities. Security Command Center provides clear visibility into your organization's security state and enables the identification of security misconfigurations across different resources.
- Google SecOps provides a platform for teams to respond to threats by mapping logs and creating timelines. You can also define detection rules and search for threats.
- Google Workspace and Chrome Enterprise Premium help you to manage and control end-user access to sensitive resources. You can define granular access controls based on user identity and the context of a request.
- Network monitoring provides insights into the performance of network resources. You can import network monitoring insights into Security Command Center and Google SecOps for centralized monitoring and correlation against other timeline-based data points. This integration helps you to detect and respond to potential network usage changes caused by nefarious activity.
- Data-exfiltration monitoring helps to identify possible data loss incidents. With this feature, you can efficiently mobilize an incident response team, assess damages, and limit further data exfiltration. You can also improve current policies and controls to ensure data protection.
Product summary
The following table lists the products and features that are described in this document and maps them to the associated recommendations and security capabilities.
Google Cloud product | Applicable recommendations |
---|---|
Google SecOps | Use the Intelligence function in all aspects of cyber defense: Enables threat hunting and IoC matching, and integrates with Mandiant for comprehensive threat evaluation. Understand and capitalize on your defender's advantage: Provides curated detections and centralizes security data for proactive compromise identification. Validate and improve your defenses continuously: Enables continuous testing and improvement of threat-detection capabilities. Manage and coordinate cyber-defense efforts through Mission Control: Provides a platform for threat response, log analysis, and timeline creation. |
Security Command Center Enterprise | Use the Intelligence function in all aspects of cyber defense: Uses AI to assess risk, prioritize vulnerabilities, and provide actionable insights for remediation. Understand and capitalize on your defender's advantage: Offers comprehensive risk analysis, vulnerability prioritization, and proactive identification of weaknesses. Validate and improve your defenses continuously: Provides ongoing security posture evaluation and resource prioritization. Manage and coordinate cyber-defense efforts through Mission Control: Acts as a central hub for managing and coordinating cyber-defense operations. |
Chrome Enterprise Premium | Use the Intelligence function in all aspects of cyber defense: Protects users from exfiltration risks, prevents malware, and provides visibility into unsafe browser activity. Understand and capitalize on your defender's advantage: Enhances security for enterprise devices through data protection, malware prevention, and control over extensions. Validate and improve your defenses continuously: Addresses new and evolving threats through continuous updates to defenses against exfiltration risks and malware. Manage and coordinate cyber-defense efforts through Mission Control: Manages and controls end-user access to sensitive resources, including granular access controls. |
Google Workspace | Manage and coordinate cyber-defense efforts through Mission Control: Manages and controls end-user access to sensitive resources, including granular access controls. |
Network Intelligence Center | Use the Intelligence function in all aspects of cyber defense: Provides visibility into network performance and detects unusual traffic patterns or data transfers. |
Cloud NGFW | Validate and improve your defenses continuously: Optimizes network-based threat detection and response through integration with Security Command Center and Google SecOps. |
Use AI securely and responsibly
This principle in the security pillar of the Google Cloud Architecture Framework provides recommendations to help you secure your AI systems. These recommendations are aligned with Google's Secure AI Framework (SAIF), which provides a practical approach to address the security and risk concerns of AI systems. SAIF is a conceptual framework that aims to provide industry-wide standards for building and deploying AI responsibly.
Principle overview
To help ensure that your AI systems meet your security, privacy, and compliance requirements, you must adopt a holistic strategy that starts with the initial design and extends to deployment and operations. You can implement this holistic strategy by applying the six core elements of SAIF.
Google uses AI to enhance security measures, such as identifying threats, automating security tasks, and improving detection capabilities, while keeping humans in the loop for critical decisions.
Google emphasizes a collaborative approach to advancing AI security. This approach involves partnering with customers, industries, and governments to enhance the SAIF guidelines and offer practical, actionable resources.
The recommendations to implement this principle are grouped within the following sections:
Recommendations to use AI securely
To use AI securely, you need both foundational security controls and AI-specific security controls. This section provides an overview of recommendations to help ensure that your AI and ML deployments meet the security, privacy, and compliance requirements of your organization. For an overview of architectural principles and recommendations that are specific to AI and ML workloads in Google Cloud, see the AI and ML perspective in the Architecture Framework.
Define clear goals and requirements for AI usage
This recommendation is relevant to the following focus areas:
- Cloud governance, risk, and compliance
- AI and ML security
This recommendation aligns with the SAIF element about contextualizing AI system risks in the surrounding business processes. When you design and evolve AI systems, it's important to understand your specific business goals, risks, and compliance requirements.
Keep data secure and prevent loss or mishandling
This recommendation is relevant to the following focus areas:
- Infrastructure security
- Identity and access management
- Data security
- Application security
- AI and ML security
This recommendation aligns with the following SAIF elements:
- Expand strong security foundations to the AI ecosystem. This element includes data collection, storage, access control, and protection against data poisoning.
- Contextualize AI system risks. Emphasize data security to support business objectives and compliance.
Keep AI pipelines secure and robust against tampering
This recommendation is relevant to the following focus areas:
- Infrastructure security
- Identity and access management
- Data security
- Application security
- AI and ML security
This recommendation aligns with the following SAIF elements:
- Expand strong security foundations to the AI ecosystem. As a key element of establishing a secure AI system, secure your code and model artifacts.
- Adapt controls for faster feedback loops. Because it's important for mitigation and incident response, track your assets and pipeline runs.
Deploy apps on secure systems using secure tools and artifacts
This recommendation is relevant to the following focus areas:
- Infrastructure security
- Identity and access management
- Data security
- Application security
- AI and ML security
Using secure systems and validated tools and artifacts in AI-based applications aligns with the SAIF element about expanding strong security foundations to the AI ecosystem and supply chain. This recommendation can be addressed through the following steps:
- Implement a secure environment for ML training and deployment
- Use validated container images
- Apply Supply-chain Levels for Software Artifacts (SLSA) guidelines
Protect and monitor inputs
This recommendation is relevant to the following focus areas:
- Logging, auditing, and monitoring
- Security operations
- AI and ML security
This recommendation aligns with the SAIF element about extending detection and response to bring AI into an organization's threat universe. To prevent issues, it's critical to manage prompts for generative AI systems, monitor inputs, and control user access.
Recommendations for AI governance
All of the recommendations in this section are relevant to the following focus area: Cloud governance, risk, and compliance.
Google Cloud offers a robust set of tools and services that you can use to build responsible and ethical AI systems. We also offer a framework of policies, procedures, and ethical considerations that can guide the development, deployment, and use of AI systems.
As reflected in our recommendations, Google's approach for AI governance is guided by the following principles:
- Fairness
- Transparency
- Accountability
- Privacy
- Security
Use fairness indicators
Vertex AI can detect bias during the data collection or post-training evaluation process. Vertex AI provides model evaluation metrics like data bias and model bias to help you evaluate your model for bias.
These metrics are related to fairness across different categories like race, gender, and class. However, interpreting statistical deviations isn't a straightforward exercise, because differences across categories might not be a result of bias or a signal of harm.
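As a minimal sketch, the following example retrieves model evaluation metrics, including any bias metrics that were computed, by using the google-cloud-aiplatform Python SDK. The project, region, and model IDs are placeholders, and the exact metric keys depend on your evaluation configuration.

```python
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")

model = aiplatform.Model("MODEL_ID")

# Each evaluation exposes a metrics dictionary; bias-related metrics
# appear here when they were computed for the evaluation job.
for evaluation in model.list_model_evaluations():
    print(evaluation.resource_name)
    print(evaluation.metrics)
```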
Use Vertex Explainable AI
To understand how the AI models make decisions, use Vertex Explainable AI. This feature helps you to identify potential biases that might be hidden in the model's logic.
This explainability feature is integrated with BigQuery ML and Vertex AI, which provide feature-based explanations. You can either perform explainability in BigQuery ML or register your model in Vertex AI and perform explainability in Vertex AI.
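The following is a minimal sketch of requesting feature attributions from a deployed endpoint by using the google-cloud-aiplatform SDK. It assumes that the model was deployed with an explanation_spec; the endpoint ID and the instance payload are placeholders.

```python
from google.cloud import aiplatform

aiplatform.init(project="PROJECT_ID", location="us-central1")

endpoint = aiplatform.Endpoint("ENDPOINT_ID")
response = endpoint.explain(instances=[{"age": 42, "income": 55000}])

# Each attribution maps input features to their contribution to the
# prediction, which can surface bias hidden in the model's logic.
for explanation in response.explanations:
    for attribution in explanation.attributions:
        print(attribution.feature_attributions)
```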
Track data lineage
Track the origin and transformation of data that's used in your AI systems. This tracking helps you understand the data's journey and identify potential sources of bias or error.
Data lineage is a Dataplex feature that lets you track how data moves through your systems: where it comes from, where it's passed to, and what transformations are applied to it.
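As a hedged sketch, the following example searches lineage links for a BigQuery table, assuming the google-cloud-datacatalog-lineage client library (datacatalog_lineage_v1). The project, location, and fully qualified table name are placeholders.

```python
from google.cloud import datacatalog_lineage_v1

client = datacatalog_lineage_v1.LineageClient()

request = datacatalog_lineage_v1.SearchLinksRequest(
    parent="projects/PROJECT_ID/locations/us",
    target=datacatalog_lineage_v1.EntityReference(
        fully_qualified_name="bigquery:PROJECT_ID.dataset.table"
    ),
)

# Each link records a source-to-target data movement; walking the links
# shows where the data that feeds your AI system originated.
for link in client.search_links(request=request):
    print(link.source.fully_qualified_name, "->", link.target.fully_qualified_name)
```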
Establish accountability
Establish clear responsibility for the development, deployment, and outcomes of your AI systems.
Use Cloud Logging to log key events and decisions made by your AI systems. The logs provide an audit trail to help you understand how the system is performing and identify areas for improvement.
Use Error Reporting to systematically analyze errors made by the AI systems. This analysis can reveal patterns that point to underlying biases or areas where the model needs further refinement.
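The following minimal sketch shows how these two services might be combined into an audit trail for AI decisions. It assumes the google-cloud-logging and google-cloud-error-reporting Python client libraries; the log name ai-decision-audit and the payload fields are hypothetical.

```python
import google.cloud.logging
from google.cloud import error_reporting

log_client = google.cloud.logging.Client()
logger = log_client.logger("ai-decision-audit")

# Log a structured record of each model decision so that it can be
# audited later.
logger.log_struct({
    "model": "credit-scoring-v3",
    "decision": "deny",
    "confidence": 0.87,
    "request_id": "abc-123",
})

# Report model errors so that recurring failure patterns surface
# for analysis.
error_client = error_reporting.Client()
try:
    raise ValueError("model returned an out-of-range score")
except ValueError:
    error_client.report_exception()
```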
Implement differential privacy
During model training, add noise to the data in order to make it difficult to identify individual data points but still enable the model to learn effectively. With SQL in BigQuery, you can transform the results of a query with differentially private aggregations.
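The following is a minimal sketch of a differentially private aggregation that uses BigQuery's SELECT WITH DIFFERENTIAL_PRIVACY clause through the google-cloud-bigquery client library. The dataset, table, column names, and privacy parameters are illustrative placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT WITH DIFFERENTIAL_PRIVACY
  OPTIONS(epsilon = 1.0, delta = 1e-5, privacy_unit_column = user_id)
  item,
  AVG(quantity) AS avg_quantity
FROM `PROJECT_ID.dataset.orders`
GROUP BY item
"""

# The noisy aggregates protect individual users while preserving
# overall trends in the data.
for row in client.query(query).result():
    print(row.item, row.avg_quantity)
```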
Use AI for security
This principle in the security pillar of the Google Cloud Architecture Framework provides recommendations to use AI to help you improve the security of your cloud workloads.
Because of the increasing number and sophistication of cyber attacks, it's important to take advantage of AI's potential to improve security. AI can help reduce the number of threats and the manual effort required by security professionals, and it can help compensate for the scarcity of experts in the cybersecurity domain.
Principle overview
Use AI capabilities to improve your existing security systems and processes. You can use Gemini in Security as well as the intrinsic AI capabilities that are built into Google Cloud services.
These AI capabilities can transform security by providing assistance across every stage of the security lifecycle. For example, you can use AI to do the following:
- Analyze and explain potentially malicious code without reverse engineering.
- Reduce repetitive work for cyber-security practitioners.
- Use natural language to generate queries and interact with security event data.
- Surface contextual information.
- Offer recommendations for quick responses.
- Aid in the remediation of events.
- Summarize high-priority alerts for misconfigurations and vulnerabilities, highlight potential impacts, and recommend mitigations.
Levels of security autonomy
AI and automation can help you achieve better security outcomes when you're dealing with ever-evolving cybersecurity threats. By using AI for security, you can achieve greater levels of autonomy to detect and prevent threats and improve your overall security posture. Google defines four levels of autonomy for the use of AI in security. These levels outline the increasing role of AI in assisting and eventually leading security tasks:
- Manual: Humans run all of the security tasks (prevent, detect, prioritize, and respond) across the entire security lifecycle.
- Assisted: AI tools, like Gemini, boost human productivity by summarizing information, generating insights, and making recommendations.
- Semi-autonomous: AI takes primary responsibility for many security tasks and delegates to humans only when required.
- Autonomous: AI acts as a trusted assistant that drives the security lifecycle based on your organization's goals and preferences, with minimal human intervention.
Recommendations
The following sections describe the recommendations for using AI for security. The sections also indicate how the recommendations align with Google's Secure AI Framework (SAIF) core elements and how they're relevant to the levels of security autonomy.
- Enhance threat detection and response with AI
- Simplify security for experts and non-experts
- Automate time-consuming security tasks with AI
- Incorporate AI into risk management and governance processes
- Implement secure development practices for AI systems
Enhance threat detection and response with AI
This recommendation is relevant to the following focus areas:
- Security operations (SecOps)
- Logging, auditing, and monitoring
AI can analyze large volumes of security data, offer insights into threat actor behavior, and automate the analysis of potentially malicious code. This recommendation is aligned with the following SAIF elements:
- Extend detection and response to bring AI into your organization's threat universe.
- Automate defenses to keep pace with existing and new threats.
Depending on your implementation, this recommendation can be relevant to the following levels of autonomy:
- Assisted: AI helps with threat analysis and detection.
- Semi-autonomous: AI takes on more responsibility for the security task.
Google Threat Intelligence, which uses AI to analyze threat actor behavior and malicious code, can help you implement this recommendation.
Simplify security for experts and non-experts
This recommendation is relevant to the following focus areas:
- Security operations (SecOps)
- Cloud governance, risk, and compliance
AI-powered tools can summarize alerts and recommend mitigations, and these capabilities can make security more accessible to a wider range of personnel. This recommendation is aligned with the following SAIF elements:
- Automate defenses to keep pace with existing and new threats.
- Harmonize platform-level controls to ensure consistent security across the organization.
Depending on your implementation, this recommendation can be relevant to the following levels of autonomy:
- Assisted: AI helps you to improve the accessibility of security information.
- Semi-autonomous: AI helps to make security practices more effective for all users.
Gemini in Security Command Center can provide summaries of alerts for misconfigurations and vulnerabilities.
Automate time-consuming security tasks with AI
This recommendation is relevant to the following focus areas:
- Infrastructure security
- Security operations (SecOps)
- Application security
AI can automate tasks such as analyzing malware, generating security rules, and identifying misconfigurations. These capabilities can help to reduce the workload on security teams and accelerate response times. This recommendation is aligned with the SAIF element about automating defenses to keep pace with existing and new threats.
Depending on your implementation, this recommendation can be relevant to the following levels of autonomy:
- Assisted: AI helps you to automate tasks.
- Semi-autonomous: AI takes primary responsibility for security tasks, and only requests human assistance when needed.
Gemini in Google SecOps can help to automate high-toil tasks by assisting analysts, retrieving relevant context, and making recommendations for next steps.
Incorporate AI into risk management and governance processes
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
You can use AI to build a model inventory and risk profiles. You can also use AI to implement policies for data privacy, cyber risk, and third-party risk. This recommendation is aligned with the SAIF element about contextualizing AI system risks in surrounding business processes.
Depending on your implementation, this recommendation can be relevant to the semi-autonomous level of autonomy. At this level, AI can orchestrate security agents that run processes to achieve your custom security goals.
Implement secure development practices for AI systems
This recommendation is relevant to the following focus areas:
- Application security
- AI and ML security
You can use AI for secure coding, cleaning training data, and validating tools and artifacts. This recommendation is aligned with the SAIF element about expanding strong security foundations to the AI ecosystem.
This recommendation can be relevant to all levels of security autonomy, because a secure AI system needs to be in place before AI can be used effectively for security. The recommendation is most relevant to the assisted level, where security practices are augmented by AI.
To implement this recommendation, follow the Supply-chain Levels for Software Artifacts (SLSA) guidelines for AI artifacts and use validated container images.
Meet regulatory, compliance, and privacy needs
This principle in the security pillar of the Google Cloud Architecture Framework helps you identify and meet regulatory, compliance, and privacy requirements for cloud deployments. These requirements influence many of the decisions that you need to make about the security controls that must be used for your workloads in Google Cloud.
Principle overview
Meeting regulatory, compliance, and privacy needs is an unavoidable challenge for all businesses. Cloud regulatory requirements depend on several factors, including the following:
- The laws and regulations that apply to your organization's physical locations
- The laws and regulations that apply to your customers' physical locations
- Your industry's regulatory requirements
Privacy regulations define how you can obtain, process, store, and manage your users' data. You own your data, including the data that you receive from your users. Therefore, many privacy controls are your responsibility, including controls for cookies, session management, and obtaining user permission.
The recommendations to implement this principle are grouped within the following sections:
- Recommendations to address organizational risks
- Recommendations to address regulatory and compliance obligations
- Recommendations to manage your sovereignty
- Recommendations to address privacy requirements
Recommendations to address organizational risks
This section provides recommendations to help you identify and address risks to your organization.
Identify risks to your organization
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
Before you create and deploy resources on Google Cloud, complete a risk assessment. This assessment should determine the security features that you need to meet your internal security requirements and external regulatory requirements.
Your risk assessment provides you with a catalog of organization-specific risks, and informs you about your organization's capability to detect and counteract security threats. You must perform a risk analysis immediately after deployment and whenever there are changes in your business needs, regulatory requirements, or threats to your organization.
As mentioned in the Implement security by design principle, your security risks in a cloud environment differ from on-premises risks. This difference is due to the shared responsibility model in the cloud, which varies by service (IaaS, PaaS, or SaaS) and your usage. Use a cloud-specific risk assessment framework like the Cloud Controls Matrix (CCM). Use threat modeling, like OWASP application threat modeling, to identify and address vulnerabilities. For expert help with risk assessments, contact your Google account representative or consult Google Cloud's partner directory.
After you catalog your risks, you must determine how to address them—that is, whether you want to accept, avoid, transfer, or mitigate the risks. For mitigation controls that you can implement, see the next section about mitigating your risks.
Mitigate your risks
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
When you adopt new public cloud services, you can mitigate risks by using technical controls, contractual protections, and third-party verifications or attestations.
Technical controls are features and technologies that you use to protect your environment. These include built-in cloud security controls like firewalls and logging. Technical controls can also include using third-party tools to reinforce or support your security strategy. There are two categories of technical controls:
- You can implement Google Cloud's security controls to help you mitigate the risks that apply to your environment. For example, you can secure the connection between your on-premises networks and your cloud networks by using Cloud VPN and Cloud Interconnect.
- Google has robust internal controls and auditing to protect against insider access to customer data. Our audit logs provide you with near real-time logs of Google administrator access on Google Cloud.
Contractual protections refer to the legal commitments made by us regarding Google Cloud services. Google is committed to maintaining and expanding our compliance portfolio. The Cloud Data Processing Addendum (CDPA) describes our commitments with regard to the processing and security of your data. The CDPA also outlines the access controls that limit Google support engineers' access to customers' environments, and it describes our rigorous logging and approval process. We recommend that you review Google Cloud's contractual controls with your legal and regulatory experts, and verify that they meet your requirements. If you need more information, contact your technical account representative.
Third-party verifications or attestations refer to having a third-party vendor audit the cloud provider to ensure that the provider meets compliance requirements. For example, to learn about Google Cloud attestations with regard to the ISO/IEC 27017 guidelines, see ISO/IEC 27017 - Compliance. To view the current Google Cloud certifications and letters of attestation, see Compliance resource center.
Recommendations to address regulatory and compliance obligations
A typical compliance journey has three stages: assessment, gap remediation, and continual monitoring. This section provides recommendations that you can use during each of these stages.
Assess your compliance needs
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
Compliance assessment starts with a thorough review of all of your regulatory obligations and how your business is implementing them. To help you with your assessment of Google Cloud services, use the Compliance resource center. This site provides information about the following:
- Service support for various regulations
- Google Cloud certifications and attestations
To better understand the compliance lifecycle at Google and how your requirements can be met, you can contact sales to request help from a Google compliance specialist. Or, you can contact your Google Cloud account manager to request a compliance workshop.
For more information about tools and resources that you can use to manage security and compliance for Google Cloud workloads, see Assuring Compliance in the Cloud.
Automate implementation of compliance requirements
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
To help you stay in compliance with changing regulations, determine whether you can automate how you implement compliance requirements. You can use both compliance-focused capabilities that Google Cloud provides and blueprints that use recommended configurations for a particular compliance regime.
Assured Workloads builds on the controls within Google Cloud to help you meet your compliance obligations. Assured Workloads lets you do the following:
- Select your compliance regime. Then, the tool automatically sets the baseline personnel access controls for the selected regime.
- Set the location for your data by using organization policies so that your data at rest and your resources remain only in that region.
- Select the key-management option (such as the key rotation period) that best meets your security and compliance requirements.
- Select the access criteria for Google support personnel to meet certain regulatory requirements such as FedRAMP Moderate. For example, you can select whether Google support personnel have completed the appropriate background checks.
- Use Google-owned and Google-managed encryption keys that are FIPS 140-2 compliant and support FedRAMP Moderate compliance. For an added layer of control and for the separation of duties, you can use customer-managed encryption keys (CMEK). For more information about keys, see Encrypt data at rest and in transit.
In addition to Assured Workloads, you can use Google Cloud blueprints that are relevant to your compliance regime. You can modify these blueprints to incorporate your security policies into your infrastructure deployments.
To help you build an environment that supports your compliance requirements, Google's blueprints and solution guides include recommended configurations and provide Terraform modules. The following table lists blueprints that address security and alignment with compliance requirements.
Requirement | Blueprints and solution guides |
---|---|
FedRAMP | |
HIPAA | |
Monitor your compliance
This recommendation is relevant to the following focus areas:
- Cloud governance, risk, and compliance
- Logging, auditing, and monitoring
Most regulations require that you monitor particular activities, which include access-related activities. To help with your monitoring, you can use the following:
- Access Transparency: View near real-time logs when Google Cloud administrators access your content.
- Firewall Rules Logging: Record TCP and UDP connections inside a VPC network for any rules that you create. These logs can be useful for auditing network access or for providing early warning that the network is being used in an unapproved manner.
- VPC Flow Logs: Record network traffic flows that are sent or received by VM instances. For an example of how you might review these logs programmatically, see the sketch after this list.
- Security Command Center Premium: Monitor for compliance with various standards.
- OSSEC (or another open source tool): Log the activity of individuals who have administrator access to your environment.
- Key Access Justifications: View the reasons for a key-access request.
- Security Command Center notifications: Get alerts when noncompliance issues occur. For example, get alerts when users disable two-step verification or when service accounts are over-privileged. You can also set up automatic remediation for specific notifications.
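As a minimal sketch, the following example audits VPC Flow Logs entries through Cloud Logging, assuming the google-cloud-logging Python client library. The filter narrows results to flow-log entries from the last hour.

```python
from datetime import datetime, timedelta, timezone

import google.cloud.logging

client = google.cloud.logging.Client()

# Filter for VPC Flow Logs entries from the last hour.
start = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
log_filter = (
    'resource.type="gce_subnetwork" '
    'logName:"logs/compute.googleapis.com%2Fvpc_flows" '
    f'timestamp>="{start}"'
)

# Each entry describes a sampled network flow; reviewing these entries
# can support regulations that require monitoring of network access.
for entry in client.list_entries(filter_=log_filter, max_results=10):
    print(entry.timestamp, entry.payload)
```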
Recommendations to manage your sovereignty
Manage your data sovereignty
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
Data sovereignty provides you with a mechanism to prevent Google from accessing your data. You approve access only for provider behaviors that you agree are necessary. For example, you can manage your data sovereignty in the following ways:
- Store and manage encryption keys outside the cloud.
- Grant access to these keys based on detailed access justifications.
- Protect data in use by using Confidential Computing.
Manage your operational sovereignty
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
Operational sovereignty provides you with assurances that Google personnel can't compromise your workloads. For example, you can manage operational sovereignty in the following ways:
- Restrict the deployment of new resources to specific provider regions.
- Limit Google personnel access based on predefined attributes such as their citizenship or geographic location.
Manage software sovereignty
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
Software sovereignty provides you with assurances that you can control the availability of your workloads and run them wherever you want, without being dependent on or locked in to a single cloud provider. Software sovereignty includes the ability to survive events that require you to quickly change where your workloads are deployed and what level of outside connection is allowed.
For example, to help you manage your software sovereignty, Google Cloud supports hybrid and multicloud deployments. In addition, GKE Enterprise lets you manage and deploy your applications in both cloud environments and on-premises environments. If you choose on-premises deployments for data sovereignty reasons, Google Distributed Cloud is a combination of hardware and software that brings Google Cloud into your data center.
Recommendations to address privacy requirements
Google Cloud includes the following controls that promote privacy:
- Default encryption of all data when it's at rest, when it's in transit, and while it's being processed.
- Safeguards against insider access.
- Support for numerous privacy regulations.
The following recommendations address additional controls that you can implement. For more information, see Privacy Resource Center.
Control data residency
This recommendation is relevant to the following focus area: Cloud governance, risk, and compliance.
Data residency describes where your data is stored at rest. Data residency requirements vary based on system design objectives, industry regulatory concerns, national law, tax implications, and even culture.
Controlling data residency starts with the following:
- Understand your data type and its location.
- Determine what risks exist for your data and which laws and regulations apply.
- Control where your data is stored or where it goes.
To help you comply with data residency requirements, Google Cloud lets you control where your data is stored, how it's accessed, and how it's processed. You can use resource location policies to restrict where resources are created and to limit where data is replicated between regions. You can use the location property of a resource to identify where the service is deployed and who maintains it. For more information, see Resource locations supported services.
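As one illustration, the following sketch sets the gcp.resourceLocations organization policy constraint on a project so that new resources can be created only in EU locations. It assumes the google-cloud-org-policy client library (orgpolicy_v2); the project ID is a placeholder, and in practice this policy is usually set at the organization or folder level.

```python
from google.cloud import orgpolicy_v2

client = orgpolicy_v2.OrgPolicyClient()
parent = "projects/PROJECT_ID"

policy = orgpolicy_v2.Policy(
    name=f"{parent}/policies/gcp.resourceLocations",
    spec=orgpolicy_v2.PolicySpec(
        rules=[
            orgpolicy_v2.PolicySpec.PolicyRule(
                values=orgpolicy_v2.PolicySpec.PolicyRule.StringValues(
                    # "in:eu-locations" is a predefined value group that
                    # matches EU regions and multi-regions.
                    allowed_values=["in:eu-locations"]
                )
            )
        ]
    ),
)

client.create_policy(parent=parent, policy=policy)
```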
Classify your confidential data
This recommendation is relevant to the following focus area: Data security.
You must define what data is confidential, and then ensure that the confidential data is properly protected. Confidential data can include credit card numbers, addresses, phone numbers, and other personally identifiable information (PII). Using Sensitive Data Protection, you can set up appropriate classifications. You can then tag and tokenize your data before you store it in Google Cloud. Additionally, Dataplex offers a catalog service that provides a platform for storing, managing, and accessing your metadata. For more information and an example of data classification and de-identification, see De-identification and re-identification of PII using Sensitive Data Protection.
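The following minimal sketch classifies a snippet of text with Sensitive Data Protection (the DLP API), assuming the google-cloud-dlp client library. The infoTypes shown are examples of common PII detectors, and the project ID and sample text are placeholders.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/PROJECT_ID",
        "inspect_config": {
            "info_types": [
                {"name": "CREDIT_CARD_NUMBER"},
                {"name": "PHONE_NUMBER"},
                {"name": "EMAIL_ADDRESS"},
            ],
            "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        },
        "item": {"value": "Call 555-0100 or email test@example.com"},
    }
)

# Each finding names the detected infoType so that you can tag or
# tokenize the data before storing it.
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```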
Lock down access to sensitive data
This recommendation is relevant to the following focus areas:
- Data security
- Identity and access management
Place sensitive data in its own service perimeter by using VPC Service Controls. VPC Service Controls improves your ability to mitigate the risk of unauthorized copying or transferring of data (data exfiltration) from Google-managed services. With VPC Service Controls, you configure security perimeters around the resources of your Google-managed services to control the movement of data across the perimeter. Set Google Identity and Access Management (IAM) access controls for that data. Configure multifactor authentication (MFA) for all users who require access to sensitive data.
Shared responsibilities and shared fate on Google Cloud
This document describes the differences between the shared responsibility model and shared fate in Google Cloud. It discusses the challenges and nuances of the shared responsibility model, explains what shared fate is, and describes how we partner with our customers to address cloud security challenges.
Understanding the shared responsibility model is important when determining how to best protect your data and workloads on Google Cloud. The shared responsibility model describes the security tasks that you're responsible for in the cloud, and how these tasks differ from those of the cloud provider.
Understanding shared responsibility, however, can be challenging. The model requires an in-depth understanding of each service that you use, the configuration options that each service provides, and what Google Cloud does to secure the service. Every service has a different configuration profile, and it can be difficult to determine the best security configuration. Google believes that the shared responsibility model stops short of helping cloud customers achieve better security outcomes. Instead of shared responsibility, we believe in shared fate.
Shared fate includes us building and operating a trusted cloud platform for your workloads. We provide best practice guidance and secured, attested infrastructure code that you can use to deploy your workloads in a secure way. We release solutions that combine various Google Cloud services to solve complex security problems and we offer innovative insurance options to help you measure and mitigate the risks that you must accept. Shared fate involves us more closely interacting with you as you secure your resources on Google Cloud.
Shared responsibility
You're the expert in knowing the security and regulatory requirements for your business, and knowing the requirements for protecting your confidential data and resources. When you run your workloads on Google Cloud, you must identify the security controls that you need to configure in Google Cloud to help protect your confidential data and each workload. To decide which security controls to implement, you must consider the following factors:
- Your regulatory compliance obligations
- Your organization's security standards and risk management plan
- Security requirements of your customers and your vendors
Defined by workloads
Traditionally, responsibilities are defined based on the type of workload that you're running and the cloud services that you require. Cloud services include the following categories:
Cloud service | Description |
---|---|
Infrastructure as a service (IaaS) | IaaS services include Compute Engine, Cloud Storage, and networking services such as Cloud VPN, Cloud Load Balancing, and Cloud DNS. IaaS provides compute, storage, and network services on demand with pay-as-you-go pricing. You can use IaaS if you plan to migrate an existing on-premises workload to the cloud using lift-and-shift, or if you want to run your application on particular VMs with specific databases or network configurations. In IaaS, the bulk of the security responsibilities are yours, and our responsibilities focus on the underlying infrastructure and physical security. |
Platform as a service (PaaS) | PaaS services include App Engine, Google Kubernetes Engine (GKE), and BigQuery. PaaS provides the runtime environment in which you can develop and run your applications. You can use PaaS if you're building an application (such as a website) and want to focus on development, not on the underlying infrastructure. In PaaS, we're responsible for more controls than in IaaS; the exact division varies by the services and features that you use. You share responsibility with us for application-level controls and IAM management. You remain responsible for your data security and client protection. |
Software as a service (SaaS) | SaaS applications include Google Workspace, Google Security Operations, and third-party SaaS applications that are available in Google Cloud Marketplace. SaaS provides online applications that you can subscribe to or pay for in some way. You can use SaaS applications when your enterprise doesn't have the internal expertise or business requirement to build the application itself, but does require the ability to process workloads. In SaaS, we own the bulk of the security responsibilities. You remain responsible for your access controls and the data that you choose to store in the application. |
Function as a service (FaaS) or serverless | FaaS provides the platform for developers to run small, single-purpose code (called functions) in response to particular events. Use FaaS when you want specific actions to occur based on a particular event. For example, you might create a function that classifies data whenever it's uploaded to Cloud Storage. FaaS has a shared-responsibility division that's similar to SaaS. Cloud Run functions is Google Cloud's FaaS offering. |
The following diagram shows the cloud services and defines how responsibilities are shared between the cloud provider and customer.
As the diagram shows, the cloud provider always remains responsible for the underlying network and infrastructure, and customers always remain responsible for their access policies and data.
Defined by industry and regulatory framework
Various industries have regulatory frameworks that define the security controls that must be in place. When you move your workloads to the cloud, you must understand the following:
- Which security controls are your responsibility
- Which security controls are available as part of the cloud offering
- Which default security controls are inherited
Inherited security controls (such as our default encryption and infrastructure controls) are controls that you can provide as part of your evidence of your security posture to auditors and regulators. For example, the Payment Card Industry Data Security Standard (PCI DSS) defines regulations for payment processors. When you move your business to the cloud, these regulations are shared between you and your cloud service provider (CSP). To understand how PCI DSS responsibilities are shared between you and Google Cloud, see Google Cloud: PCI DSS Shared Responsibility Matrix.
As another example, in the United States, the Health Insurance Portability and Accountability Act (HIPAA) has set standards for handling electronic personal health information (PHI). These responsibilities are also shared between the CSP and you. For more information on how Google Cloud meets our responsibilities under HIPAA, see HIPAA - Compliance.
Other industries (for example, finance or manufacturing) also have regulations that define how data can be gathered, processed, and stored. For more information about shared responsibility related to these, and how Google Cloud meets our responsibilities, see Compliance resource center.
Defined by location
Depending on your business scenario, you might need to consider your responsibilities based on the location of your business offices, your customers, and your data. Different countries and regions have created regulations that inform how you can process and store your customers' data. For example, if your business has customers who reside in the European Union, your business might need to abide by the requirements that are described in the General Data Protection Regulation (GDPR), and you might be obligated to keep your customer data in the EU itself. In this circumstance, you are responsible for ensuring that the data that you collect remains in the Google Cloud regions in the EU. For more information about how we meet our GDPR obligations, see GDPR and Google Cloud.
For information about the requirements related to your region, see Compliance offerings. If your scenario is particularly complicated, we recommend speaking with our sales team or one of our partners to help you evaluate your security responsibilities.
Challenges for shared responsibility
Though shared responsibility helps define the security roles that you and the cloud provider have, relying on shared responsibility can still create challenges. Consider the following scenarios:
- Most cloud security breaches are the direct result of misconfiguration (listed as number 3 in the Cloud Security Alliance's Pandemic 11 Report) and this trend is expected to increase. Cloud products are constantly changing, and new ones are constantly being launched. Keeping up with constant change can seem overwhelming. Customers need cloud providers to offer opinionated best practices to help them keep up with the change, starting with secure-by-default settings and a baseline secure configuration.
- Though dividing items by cloud services is helpful, many enterprises have workloads that require multiple cloud service types. In this circumstance, you must consider how the various security controls for these services interact, including whether they overlap between and across services. For example, you might have an on-premises application that you're migrating to Compute Engine, use Google Workspace for corporate email, and also run BigQuery to analyze data to improve your products.
- Your business and markets are constantly changing: regulations change, you enter new markets, and you acquire other companies. Your new markets might have different requirements, and your new acquisition might host its workloads on another cloud. To manage the constant changes, you must continually reassess your risk profile and be able to implement new controls quickly.
- How and where to manage your data encryption keys is an important decision that ties into your responsibility to protect your data. The option that you choose depends on your regulatory requirements, whether you're running a hybrid cloud environment or still have an on-premises environment, and the sensitivity of the data that you're processing and storing.
- Incident management is an important, and often overlooked, area where your responsibilities and the cloud provider's responsibilities aren't easily defined. Many incidents require close collaboration and support from the cloud provider to help investigate and mitigate them. Other incidents can result from poorly configured cloud resources or stolen credentials, and ensuring that you follow the best practices for securing your resources and accounts can be quite challenging.
- Advanced persistent threats (APTs) and new vulnerabilities can impact your workloads in ways that you might not consider when you start your cloud transformation. Staying up to date on the changing landscape and knowing who is responsible for threat mitigation is difficult, particularly if your business doesn't have a large security team.
Shared fate
We developed shared fate in Google Cloud to start addressing the challenges that the shared responsibility model doesn't address. Shared fate focuses on how all parties can better interact to continuously improve security. Shared fate builds on the shared responsibility model because it views the relationship between cloud provider and customer as an ongoing partnership to improve security.
Shared fate is about us taking responsibility for making Google Cloud more secure. Shared fate includes helping you get started with a secured landing zone and being clear, opinionated, and transparent about recommended security controls, settings, and associated best practices. It includes helping you better quantify and manage your risk with cyber insurance, using our Risk Protection Program. Using shared fate, we want to evolve from the standard shared responsibility framework to a better model that helps you secure your business and build trust in Google Cloud.
The following sections describe various components of shared fate.
Help getting started
A key component of shared fate is the resources that we provide to help you get started with a secure configuration in Google Cloud. Starting with a secure configuration helps reduce misconfigurations, which are the root cause of most security breaches.
Our resources include the following:
- Enterprise foundations blueprint that discusses top security concerns and our top recommendations.
- Secure blueprints that let you deploy and maintain secure solutions using infrastructure as code (IaC). Blueprints have our security recommendations enabled by default. Many blueprints are created by Google security teams and managed as products. This support means that they're updated regularly, go through a rigorous testing process, and receive attestations from third-party testing groups. Blueprints include the enterprise foundations blueprint and the secured data warehouse blueprint.
- Architecture Framework best practices that address the top recommendations for building security into your designs. The Architecture Framework includes a security section and a community zone that you can use to connect with experts and peers.
- Landing zone navigation guides that step you through the top decisions that you need to make to build a secure foundation for your workloads, including resource hierarchy, identity onboarding, security and key management, and network structure.
Risk Protection Program
Shared fate also includes the Risk Protection Program (currently in preview), which helps you use the power of Google Cloud as a platform to manage risk, rather than just seeing cloud workloads as another source of risk that you need to manage. The Risk Protection Program is a collaboration between Google Cloud and two leading cyber insurance companies, Munich Re and Allianz Global Corporate & Specialty.
The Risk Protection Program includes Risk Manager, which provides data-driven insights that you can use to better understand your cloud security posture. If you're looking for cyber insurance coverage, you can share these insights from Risk Manager directly with our insurance partners to obtain a quote. For more information, see Google Cloud Risk Protection Program now in Preview.
Help with deployment and governance
Shared fate also helps with your continued governance of your environment. For example, we focus efforts on products such as the following:
- Assured Workloads, which helps you meet your compliance obligations.
- Security Command Center Premium, which uses threat intelligence, threat detection, web scanning, and other advanced methods to monitor and detect threats. It also provides a way to resolve many of these threats quickly and automatically.
- Organization policies and resource settings that let you configure policies throughout your hierarchy of folders and projects.
- Policy Intelligence tools that provide you with insights on access to accounts and resources.
- Confidential Computing, which allows you to encrypt data in use.
- Sovereign Controls by Partners, which is available in certain countries and helps enforce data residency requirements.
Putting shared responsibility and shared fate into practice
As part of your planning process, consider the following actions to help you understand and implement appropriate security controls:
- Create a list of the types of workloads that you will host in Google Cloud, and whether they require IaaS, PaaS, or SaaS services. You can use the shared responsibility diagram as a checklist to ensure that you know the security controls that you need to consider.
- Create a list of regulatory requirements that you must comply with, and access resources in the Compliance resource center that relate to those requirements.
- Review the list of available blueprints and architectures in the Architecture Center for the security controls that you require for your particular workloads. The blueprints provide a list of recommended controls and the IaC code that you require to deploy that architecture.
- Use the landing zone documentation and the recommendations in the enterprise foundations guide to design a resource hierarchy and network architecture that meets your requirements. You can use the opinionated workload blueprints, like the secured data warehouse, to accelerate your development process.
- After you deploy your workloads, verify that you're meeting your security responsibilities by using services such as Risk Manager, Assured Workloads, Policy Intelligence tools, and Security Command Center Premium.
For more information, see the CISO's Guide to Cloud Transformation paper.
What's next
- Review the core security principles.
- Keep up to date with shared fate resources.
- Familiarize yourself with available blueprints, including the security foundations blueprint and workload examples like the secured data warehouse.
- Read more about shared fate.
- Read about our underlying secure infrastructure in the Google infrastructure security design overview.
- Read how to implement NIST Cybersecurity Framework best practices in Google Cloud (PDF).
Google Cloud Architecture Framework: Reliability
The reliability pillar in the Google Cloud Architecture Framework provides principles and recommendations to help you design, deploy, and manage reliable workloads in Google Cloud.
This document is intended for cloud architects, developers, platform engineers, administrators, and site reliability engineers.
Reliability is a system's ability to consistently perform its intended functions within the defined conditions and maintain uninterrupted service. Best practices for reliability include redundancy, fault-tolerant design, monitoring, and automated recovery processes.
As a part of reliability, resilience is the system's ability to withstand and recover from failures or unexpected disruptions, while maintaining performance. Google Cloud features, like multi-regional deployments, automated backups, and disaster recovery solutions, can help you improve your system's resilience.
Reliability is important to your cloud strategy for many reasons, including the following:
- Minimal downtime: Downtime can lead to lost revenue, decreased productivity, and damage to reputation. Resilient architectures can help ensure that systems can continue to function during failures or recover efficiently from failures.
- Enhanced user experience: Users expect seamless interactions with technology. Resilient systems can help maintain consistent performance and availability, and they provide reliable service even during high demand or unexpected issues.
- Data integrity: Failures can cause data loss or data corruption. Resilient systems implement mechanisms such as backups, redundancy, and replication to protect data and ensure that it remains accurate and accessible.
- Business continuity: Your business relies on technology for critical operations. Resilient architectures can help ensure continuity after a catastrophic failure, which enables business functions to continue without significant interruptions and supports a swift recovery.
- Compliance: Many industries have regulatory requirements for system availability and data protection. Resilient architectures can help you to meet these standards by ensuring systems remain operational and secure.
- Lower long-term costs: Resilient architectures require upfront investment, but resiliency can help to reduce costs over time by preventing expensive downtime, avoiding reactive fixes, and enabling more efficient resource use.
Organizational mindset
To make your systems reliable, you need a plan and an established strategy. This strategy must include education and the authority to prioritize reliability alongside other initiatives.
Set a clear expectation that the entire organization is responsible for reliability, including development, product management, operations, platform engineering, and site reliability engineering (SRE). Even the business-focused groups, like marketing and sales, can influence reliability.
Every team must understand the reliability targets and risks of their applications, and the teams must be accountable for meeting those requirements. When conflicts arise between reliability work and regular product feature development, the work must be prioritized and escalated accordingly.
Plan and manage reliability holistically, across all your functions and teams. Consider setting up a Cloud Center of Excellence (CCoE) that includes a reliability pillar. For more information, see Optimize your organization's cloud journey with a Cloud Center of Excellence.
Focus areas for reliability
The activities that you perform to design, deploy, and manage a reliable system can be categorized in the following focus areas. Each of the reliability principles and recommendations in this pillar is relevant to one of these focus areas.
- Scoping: To understand your system, conduct a detailed analysis of its architecture. You need to understand the components, how they work and interact, how data and actions flow through the system, and what could go wrong. Identifying potential failures, bottlenecks, and risks helps you take action to mitigate those issues.
- Observation: To help prevent system failures, implement comprehensive and continuous observation and monitoring. Through this observation, you can understand trends and identify potential problems proactively.
- Response: Even with planning and controls, failures can still occur. To reduce their impact, respond appropriately and recover efficiently; automated responses can further reduce the impact.
- Learning: To help prevent failures from recurring, learn from each experience, and take appropriate actions.
Core principles
The recommendations in the reliability pillar of the Architecture Framework are mapped to the following core principles:
- Define reliability based on user-experience goals
- Set realistic targets for reliability
- Build highly available systems through resource redundancy
- Take advantage of horizontal scalability
- Detect potential failures by using observability
- Design for graceful degradation
- Perform testing for recovery from failures
- Perform testing for recovery from data loss
- Conduct thorough postmortems
Contributors
Authors:
- Laura Hyatt | Enterprise Cloud Architect
- Jose Andrade | Enterprise Infrastructure Customer Engineer
- Gino Pelliccia | Principal Architect
Other contributors:
- Andrés-Leonardo Martínez-Ortiz | Technical Program Manager
- Brian Kudzia | Enterprise Infrastructure Customer Engineer
- Daniel Lees | Cloud Security Architect
- Filipe Gracio, PhD | Customer Engineer
- Gary Harmson | Customer Engineer
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
- Radhika Kanakam | Senior Program Manager, Cloud GTM
- Ryan Cox | Principal Architect
- Wade Holmes | Global Solutions Director
- Zach Seils | Networking Specialist
Define reliability based on user-experience goals
This principle in the reliability pillar of the Google Cloud Architecture Framework helps you to assess your users' experience, and then map the findings to reliability goals and metrics.
This principle is relevant to the scoping focus area of reliability.
Principle overview
Observability tools provide large amounts of data, but not all of that data directly relates to user impact. For example, you might observe high CPU usage, slow server operations, or even crashed tasks. However, if these issues don't affect the user experience, then they don't constitute an outage.
To measure the user experience, you need to distinguish between internal system behavior and user-facing problems. Focus on metrics like the success ratio of user requests. Don't rely solely on server-centric metrics, like CPU usage, which can lead to misleading conclusions about your service's reliability. True reliability means that users can consistently and effectively use your application or service.
Recommendations
To help you measure user experience effectively, consider the recommendations in the following sections.
Measure user experience
To truly understand your service's reliability, prioritize metrics that reflect your users' actual experience. For example, measure the users' query success ratio, application latency, and error rates.
Ideally, collect this data directly from the user's device or browser. If this direct data collection isn't feasible, shift your measurement point progressively further away from the user in the system. For example, you can use the load balancer or frontend service as the measurement point. This approach helps you identify and address issues before those issues can significantly impact your users.
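For example, a simple success-ratio SLI can be computed from request counts that are measured at the load balancer. The following minimal sketch (plain Python, with illustrative numbers) shows the calculation:

```python
def request_success_ratio(total_requests: int, failed_requests: int) -> float:
    """Compute the user-facing success ratio, a common reliability SLI."""
    if total_requests == 0:
        return 1.0  # No traffic: conventionally treated as fully successful.
    return (total_requests - failed_requests) / total_requests

# Illustrative numbers: 1,000,000 requests measured at the load balancer,
# 250 of which returned errors to users.
sli = request_success_ratio(1_000_000, 250)
print(f"Success ratio: {sli:.5%}")  # 99.97500%
```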
Analyze user journeys
To understand how users interact with your system, you can use tracing tools like Cloud Trace. By following a user's journey through your application, you can find bottlenecks and latency issues that might degrade the user's experience. Cloud Trace captures detailed performance data for each hop in your service architecture. This data helps you identify and address performance issues more efficiently, which can lead to a more reliable and satisfying user experience.
Set realistic targets for reliability
This principle in the reliability pillar of the Google Cloud Architecture Framework helps you define reliability goals that are technically feasible for your workloads in Google Cloud.
This principle is relevant to the scoping focus area of reliability.
Principle overview
Design your systems to be just reliable enough for user happiness. It might seem counterintuitive, but a goal of 100% reliability is often not the most effective strategy. Higher reliability might result in significantly higher cost, both in terms of financial investment and potential limitations on innovation. If users are already happy with the current level of service, then efforts to further increase happiness might yield a low return on investment, and those resources might be better spent elsewhere.
You need to determine the level of reliability at which your users are happy, and the point where the cost of incremental improvements begins to outweigh the benefits. When you determine this level of sufficient reliability, you can allocate resources strategically and focus on features and improvements that deliver greater value to your users.
Recommendations
To set realistic reliability targets, consider the recommendations in the following subsections.
Accept some failure and prioritize components
Aim for high availability such as 99.99% uptime, but don't set a target of 100% uptime. Acknowledge that some failures are inevitable.
The gap between 100% uptime and a 99.99% target is the allowance for failure. This gap is often called the error budget. The error budget helps you take risks and innovate, which is fundamental for any business that wants to stay competitive.
Prioritize the reliability of the most critical components in the system. Accept that less critical components can have a higher tolerance for failure.
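To make the error budget concrete, you can translate an availability target into allowed downtime per period. The following sketch shows the arithmetic:

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Convert an availability SLO into allowed downtime minutes per period."""
    total_minutes = period_days * 24 * 60
    return (1 - slo) * total_minutes

# A 99.99% target leaves about 4.3 minutes of downtime per 30 days,
# while 99.9% leaves about 43 minutes.
for slo in (0.9999, 0.999, 0.99):
    print(f"SLO {slo:.2%}: {error_budget_minutes(slo):.1f} minutes per month")
```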
Balance reliability and cost
To determine the optimal reliability level for your system, conduct thorough cost-benefit analyses.
Consider factors like system requirements, the consequences of failures, and your organization's risk tolerance for the specific application. Remember to consider your disaster recovery metrics, such as the recovery time objective (RTO) and recovery point objective (RPO). Decide what level of reliability is acceptable within the budget and other constraints.
Look for ways to improve efficiency and reduce costs without compromising essential reliability features.
Build highly available systems through resource redundancy
This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to plan, build, and manage resource redundancy, which can help you to avoid failures.
This principle is relevant to the scoping focus area of reliability.
Principle overview
After you decide the level of reliability that you need, design your systems to avoid any single points of failure. Every critical component in the system must be replicated across multiple machines, zones, and regions. For example, a critical database can't be located in only one region, and a metadata server can't be deployed in a single zone or region. In those examples, if the sole zone or region has an outage, the system has a global outage.
Recommendations
To build redundant systems, consider the recommendations in the following subsections.
Identify failure domains and replicate services
Map out your system's failure domains, from individual VMs to regions, and design for redundancy across the failure domains.
To ensure high availability, distribute and replicate your services and applications across multiple zones and regions. Configure the system for automatic failover to make sure that the services and applications continue to be available in the event of zone or region outages.
For examples of multi-zone and multi-region architectures, see Design reliable infrastructure for your workloads in Google Cloud.
Detect and address issues promptly
Continuously track the status of your failure domains to detect and address issues promptly.
You can monitor the current status of Google Cloud services in all regions by using the Google Cloud Service Health dashboard. You can also view incidents relevant to your project by using Personalized Service Health. You can use load balancers to detect resource health and automatically route traffic to healthy backends. For more information, see Health checks overview.
Test failover scenarios
Like a fire drill, regularly simulate failures to validate the effectiveness of your replication and failover strategies.
For more information, see Simulate a zone outage for a regional MIG and Simulate a zone failure in GKE regional clusters.
Take advantage of horizontal scalability
This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you use horizontal scalability, which helps to ensure that your workloads in Google Cloud can scale efficiently and maintain performance.
This principle is relevant to the scoping focus area of reliability.
Principle overview
Re-architect your system to a horizontal architecture. To accommodate growth in traffic or data, you can add more resources. You can also remove resources when they're not in use.
To understand the value of horizontal scaling, consider the limitations of vertical scaling.
A common scenario for vertical scaling is to use a MySQL database as the primary database with critical data. As database usage increases, more RAM and CPU are required. Eventually, the database reaches the memory limit of the host machine and needs to be upgraded. This process might need to be repeated several times. The problem is that there are hard limits on how much a database can grow on a single machine: VM sizes are not unlimited, so the database can reach a point where it's no longer possible to add more resources.
Even if resources were unlimited, a large VM can become a single point of failure. Any problem with the primary database VM can cause error responses or a system-wide outage that affects all users. Avoid single points of failure, as described in Build highly available systems through resource redundancy.
Besides these scaling limits, vertical scaling tends to be more expensive. The cost can increase exponentially as machines with greater amounts of compute power and memory are acquired.
Horizontal scaling, by contrast, can cost less. The potential for horizontal scaling is virtually unlimited in a system that's designed to scale.
Recommendations
To transition from a single VM architecture to a horizontal multiple-machine architecture, you need to plan carefully and use the right tools. To help you achieve horizontal scaling, consider the recommendations in the following subsections.
Use managed services
Managed services remove the need to manually manage horizontal scaling. For example, with Compute Engine managed instance groups (MIGs), you can add or remove VMs to scale your application horizontally. For containerized applications, Cloud Run is a serverless platform that can automatically scale your stateless containers based on incoming traffic.
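As an illustration, the following sketch manually resizes a zonal MIG by using the google-cloud-compute Python client. The project, zone, and MIG names are hypothetical, and in production you would typically attach an autoscaler rather than resize by hand:

```python
from google.cloud import compute_v1

# Hypothetical identifiers: replace with your own project, zone, and MIG name.
PROJECT_ID = "my-project"
ZONE = "us-central1-a"
MIG_NAME = "web-backend-mig"

client = compute_v1.InstanceGroupManagersClient()

# Scale the MIG out to 10 VMs.
operation = client.resize(
    project=PROJECT_ID,
    zone=ZONE,
    instance_group_manager=MIG_NAME,
    size=10,
)
operation.result()  # Block until the resize operation completes.
print("MIG resized to 10 instances")
```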
Promote modular design
Modular components and clear interfaces help you scale individual components as needed, instead of scaling the entire application. For more information, see Promote modular design in the performance optimization pillar.
Implement a stateless design
Design applications to be stateless, meaning that they store no local data. This design lets you add or remove instances without worrying about data consistency.
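For example, instead of keeping session data in instance memory, you can keep it in an external store so that any replica can serve any request. The following sketch assumes the redis-py client and a reachable Redis or Memorystore host (the address is hypothetical):

```python
import json
import uuid

import redis  # Assumes redis-py and a reachable Redis or Memorystore host.

# Store session state externally, not in instance memory, so that instances
# can be added or removed freely.
session_store = redis.Redis(host="10.0.0.3", port=6379)  # Hypothetical host.

def create_session(user_id: str) -> str:
    session_id = str(uuid.uuid4())
    session_store.setex(
        name=f"session:{session_id}",
        time=3600,  # Expire the session after one hour.
        value=json.dumps({"user_id": user_id, "cart": []}),
    )
    return session_id

def load_session(session_id: str) -> dict | None:
    raw = session_store.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```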
Detect potential failures by using observability
This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you proactively identify areas where errors and failures might occur.
This principle is relevant to the observation focus area of reliability.
Principle overview
To maintain and improve the reliability of your workloads in Google Cloud, you need to implement effective observability by using metrics, logs, and traces.
- Metrics are numerical measurements of activities that you want to track for your application at specific time intervals. For example, you might want to track technical metrics like request rate and error rate, which can be used as service-level indicators (SLIs). You might also need to track application-specific business metrics like orders placed and payments received.
- Logs are time-stamped records of discrete events that occur within an application or system. The event could be a failure, an error, or a change in state. Logs might include metrics, and you can also use logs for SLIs.
- A trace represents the journey of a single user or transaction through a number of separate applications or the components of an application. For example, these components could be microservices. Traces help you to track what components were used in the journeys, where bottlenecks exist, and how long the journeys took.
Metrics, logs, and traces help you monitor your system continuously. Comprehensive monitoring helps you find out where and why errors occurred. You can also detect potential failures before errors occur.
Recommendations
To detect potential failures efficiently, consider the recommendations in the following subsections.
Gain comprehensive insights
To track key metrics like response times and error rates, use Cloud Monitoring and Cloud Logging. These tools also help you to ensure that the metrics consistently meet the needs of your workload.
To make data-driven decisions, analyze default service metrics to understand component dependencies and their impact on overall workload performance.
To customize your monitoring strategy, create and publish your own metrics by using the Google Cloud SDK.
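For example, the following sketch writes one data point to a hypothetical custom metric (orders placed) by using the google-cloud-monitoring Python client; the project ID and metric name are placeholders:

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # Hypothetical project ID.

client = monitoring_v3.MetricServiceClient()

# Build one time series with a single data point for a custom metric.
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/business/orders_placed"
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

client.create_time_series(name=f"projects/{PROJECT_ID}", time_series=[series])
```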
Perform proactive troubleshooting
Implement robust error handling and enable logging across all of the components of your workloads in Google Cloud. Activate logs like Cloud Storage access logs and VPC Flow Logs.
When you configure logging, consider the associated costs. To control logging costs, you can configure exclusion filters on the log sinks to exclude certain logs from being stored.
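The following sketch creates a project-level exclusion by using the generated Cloud Logging client. The exclusion name and filter are illustrative, and you should verify the import paths against your installed version of the google-cloud-logging library. Keep in mind that excluded log entries are discarded and can't be recovered later.

```python
from google.cloud.logging_v2.services.config_service_v2 import ConfigServiceV2Client
from google.cloud.logging_v2.types import LogExclusion

PROJECT_ID = "my-project"  # Hypothetical project ID.

client = ConfigServiceV2Client()

# Exclude low-severity VM logs from storage to reduce ingestion costs.
exclusion = LogExclusion(
    name="exclude-low-severity-vm-logs",
    description="Drop sub-WARNING VM logs to control logging costs",
    filter='resource.type="gce_instance" AND severity < WARNING',
)
client.create_exclusion(parent=f"projects/{PROJECT_ID}", exclusion=exclusion)
```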
Optimize resource utilization
Monitor CPU consumption, network I/O metrics, and disk I/O metrics to detect under-provisioned and over-provisioned resources in services like GKE, Compute Engine, and Dataproc. For a complete list of supported services, see Cloud Monitoring overview.
Prioritize alerts
For alerts, focus on critical metrics, set appropriate thresholds to minimize alert fatigue, and ensure timely responses to significant issues. This targeted approach lets you proactively maintain workload reliability. For more information, see Alerting overview.
Design for graceful degradation
This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you to design your Google Cloud workloads to fail gracefully.
This principle is relevant to the response focus area of reliability.
Principle overview
Graceful degradation is a design approach where a system that experiences a high load continues to function, possibly with reduced performance or accuracy. Graceful degradation ensures continued availability of the system and prevents complete failure, even if the system's work isn't optimal. When the load returns to a manageable level, the system resumes full functionality.
For example, during periods of high load, Google Search prioritizes results from higher-ranked web pages, potentially sacrificing some accuracy. When the load decreases, Google Search recomputes the search results.
Recommendations
To design your systems for graceful degradation, consider the recommendations in the following subsections.
Implement throttling
Ensure that your replicas can independently handle overloads and can throttle incoming requests during high-traffic scenarios. This approach helps you to prevent cascading failures that are caused by shifts in excess traffic between zones.
Use tools like Apigee to control the rate of API requests during high-traffic times. You can configure policy rules to reflect how you want to scale back requests.
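If you need application-level throttling in addition to a gateway like Apigee, a token bucket is a common building block. The following self-contained sketch is one minimal way to implement it:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter for application-level throttling."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # Caller should reject the request, e.g., with HTTP 429.

limiter = TokenBucket(rate_per_sec=100, burst=200)
if not limiter.allow():
    print("429 Too Many Requests")  # Shed load instead of queueing indefinitely.
```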
Drop excess requests early
Configure your systems to drop excess requests at the frontend layer to protect backend components. Dropping some requests prevents global failures and enables the system to recover more gracefully. With this approach, some users might experience errors. However, you can minimize the impact of outages, in contrast to an approach like circuit breaking, where all traffic is dropped during an overload.
Handle partial errors and retries
Build your applications to handle partial errors and retries seamlessly. This design helps to ensure that as much traffic as possible is served during high-load scenarios.
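A common pattern is to retry transient errors with exponential backoff and jitter, so that clients don't retry in synchronized waves. The following sketch uses a hypothetical TransientError exception type to represent retryable failures:

```python
import random
import time

class TransientError(Exception):
    """Hypothetical exception type that represents a retryable failure."""

def call_with_retries(func, max_attempts: int = 5, base_delay: float = 0.5):
    """Retry transient failures with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # Give up and let the caller degrade gracefully.
            # Full jitter avoids synchronized retry storms across clients.
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```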
Test overload scenarios
To validate that the throttle and request-drop mechanisms work effectively, regularly simulate overload conditions in your system. Testing helps ensure that your system is prepared for real-world traffic surges.
Monitor traffic spikes
Use analytics and monitoring tools to predict and respond to traffic surges before they escalate into overloads. Early detection and response can help maintain service availability during high-demand periods.
Perform testing for recovery from failures
This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you design and run tests for recovery in the event of failures.
This principle is relevant to the learning focus area of reliability.
Principle overview
To be sure that your system can recover from failures, you must periodically run tests that include regional failovers, release rollbacks, and data restoration from backups.
This testing helps you to practice responses to events that pose major risks to reliability, such as the outage of an entire region. This testing also helps you verify that your system behaves as intended during a disruption.
In the unlikely event of an entire region going down, you need to fail over all traffic to another region. During normal operation of your workload, when data is modified, it needs to be synchronized from the primary region to the failover region. You need to verify that the replicated data is always current, so that users don't experience data loss or session breakage. The load balancing system must also be able to shift traffic to the failover region at any time without service interruptions. To minimize downtime after a regional outage, operations engineers must also be able to manually and efficiently shift user traffic away from a region in as little time as possible. This operation is sometimes called draining a region: you stop the inbound traffic to the region and move all the traffic elsewhere.
Recommendations
When you design and run tests for failure recovery, consider the recommendations in the following subsections.
Define the testing objectives and scope
Clearly define what you want to achieve from the testing. For example, your objectives can include the following:
- Validate the recovery time objective (RTO) and the recovery point objective (RPO). For details, see Basics of DR planning.
- Assess system resilience and fault tolerance under various failure scenarios.
- Test the effectiveness of automated failover mechanisms.
Decide which components, services, or regions are in the testing scope. The scope can include specific application tiers like the frontend, backend, and database, or it can include specific Google Cloud resources like Cloud SQL instances or GKE clusters. The scope must also specify any external dependencies, such as third-party APIs or cloud interconnections.
Prepare the environment for testing
Choose an appropriate environment, preferably a staging or sandbox environment that replicates your production setup. If you conduct the test in production, ensure that you have safety measures ready, like automated monitoring and manual rollback procedures.
Create a backup plan. Take snapshots or backups of critical databases and services to prevent data loss during the test. Ensure that your team is prepared to do manual interventions if the automated failover mechanisms fail.
To prevent test disruptions, ensure that your IAM roles, policies, and failover configurations are correctly set up. Verify that the necessary permissions are in place for the test tools and scripts.
Inform stakeholders, including operations, DevOps, and application owners, about the test schedule, scope, and potential impact. Provide stakeholders with an estimated timeline and the expected behaviors during the test.
Simulate failure scenarios
Plan and execute failure simulations by using tools like Chaos Monkey. You can use custom scripts to simulate failures of critical services, such as a shutdown of a primary node in a multi-zone GKE cluster or a disabled Cloud SQL instance. You can also use scripts to simulate a region-wide network outage by using firewall rules or API restrictions, based on your test scope. Gradually escalate the failure scenarios to observe system behavior under various conditions.
Introduce load testing alongside failure scenarios to replicate real-world usage during outages. Test cascading failure impacts, such as how frontend systems behave when backend services are unavailable.
To validate configuration changes and to assess the system's resilience against human errors, test scenarios that involve misconfigurations. For example, run tests with incorrect DNS failover settings or incorrect IAM permissions.
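For example, the following sketch simulates a node failure by stopping a Compute Engine VM through the google-cloud-compute client. The project, zone, and instance names are hypothetical; run such tests only against the environment that's in your test scope.

```python
from google.cloud import compute_v1

# Hypothetical identifiers for the VM that you deliberately take down.
PROJECT_ID = "my-project"
ZONE = "us-central1-a"
INSTANCE = "primary-db-node-0"

client = compute_v1.InstancesClient()

# Simulate a node failure by stopping the VM, then observe whether failover,
# health checks, and alerting behave as expected before restarting it.
operation = client.stop(project=PROJECT_ID, zone=ZONE, instance=INSTANCE)
operation.result()  # Wait for the stop operation to complete.
print(f"{INSTANCE} stopped; monitor failover behavior now")
```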
Monitor system behavior
Monitor how load balancers, health checks, and other mechanisms reroute traffic. Use Google Cloud tools like Cloud Monitoring and Cloud Logging to capture metrics and events during the test.
Observe changes in latency, error rates, and throughput during and after the failure simulation, and monitor the overall performance impact. Identify any degradation or inconsistencies in the user experience.
Ensure that logs are generated and alerts are triggered for key events, such as service outages or failovers. Use this data to verify the effectiveness of your alerting and incident response systems.
Verify recovery against your RTO and RPO
Measure how long it takes for the system to resume normal operations after a failure, and then compare this data with the defined RTO and document any gaps.
Ensure that data integrity and availability align with the RPO. To test database consistency, compare snapshots or backups of the database before and after a failure.
Evaluate service restoration and confirm that all services are restored to a functional state with minimal user disruption.
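One simple way to measure recovery time is to poll a user-facing health endpoint from the moment you inject the failure until the service responds normally again. The following sketch uses the requests library with a hypothetical endpoint and RTO target:

```python
import time

import requests  # Assumes a user-facing health endpoint is exposed over HTTP.

HEALTH_URL = "https://app.example.com/healthz"  # Hypothetical endpoint.
RTO_SECONDS = 15 * 60  # Example target: 15 minutes.

failure_start = time.monotonic()  # Record when the simulated failure begins.
while True:
    try:
        if requests.get(HEALTH_URL, timeout=5).status_code == 200:
            break  # Service is serving user traffic again.
    except requests.RequestException:
        pass  # Still down; keep polling.
    time.sleep(10)

recovery_seconds = time.monotonic() - failure_start
print(f"Measured recovery time: {recovery_seconds:.0f}s (RTO target: {RTO_SECONDS}s)")
print("PASS" if recovery_seconds <= RTO_SECONDS else "FAIL: investigate gaps")
```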
Document and analyze results
Document each test step, failure scenario, and corresponding system behavior. Include timestamps, logs, and metrics for detailed analyses.
Highlight bottlenecks, single points of failure, or unexpected behaviors observed during the test. To help prioritize fixes, categorize issues by severity and impact.
Suggest improvements to the system architecture, failover mechanisms, or monitoring setups. Based on test findings, update any relevant failover policies and playbooks. Present a postmortem report to stakeholders. The report should summarize the outcomes, lessons learned, and next steps. For more information, see Conduct thorough postmortems.
Iterate and improve
To validate ongoing reliability and resilience, plan periodic testing (for example, quarterly).
Run tests under different scenarios, including infrastructure changes, software updates, and increased traffic loads.
Automate failover tests by using CI/CD pipelines to integrate reliability testing into your development lifecycle.
During the postmortem, use feedback from stakeholders and end users to improve the test process and system resilience.
Perform testing for recovery from data loss
This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you design and run tests for recovery from data loss.
This principle is relevant to the learning focus area of reliability.
Principle overview
To ensure that your system can recover from situations where data is lost or corrupted, you need to run tests for those scenarios. Instances of data loss might be caused by a software bug or some type of natural disaster. After such events, you need to restore data from backups and bring all of the services back up again by using the freshly restored data.
We recommend that you use three criteria to judge the success or failure of this type of recovery test: data integrity, recovery time objective (RTO), and recovery point objective (RPO). For details about the RTO and RPO metrics, see Basics of DR planning.
The goal of data restoration testing is to periodically verify that your organization can continue to meet business continuity requirements. Besides measuring RTO and RPO, a data restoration test must include testing of the entire application stack and all the critical infrastructure services with the restored data. This is necessary to confirm that the entire deployed application works correctly in the test environment.
Recommendations
When you design and run tests for recovering from data loss, consider the recommendations in the following subsections.
Verify backup consistency and test restoration processes
You need to verify that your backups contain consistent and usable snapshots of data that you can restore to immediately bring applications back into service. To validate data integrity, set up automated consistency checks to run after each backup.
To test backups, restore them in a non-production environment. To ensure your backups can be restored efficiently and that the restored data meets application requirements, regularly simulate data recovery scenarios. Document the steps for data restoration, and train your teams to execute the steps effectively during a failure.
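A basic consistency check is to record a checksum when the backup is taken and verify it after restoration. The following sketch uses hypothetical file paths:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large backups don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Compare the checksum recorded at backup time against the restored copy.
recorded = Path("/backups/orders.dump.sha256").read_text().strip()  # Hypothetical paths.
restored = sha256_of(Path("/restore/orders.dump"))
if restored != recorded:
    raise RuntimeError("Backup failed the consistency check; do not promote it")
print("Backup checksum verified")
```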
Schedule regular and frequent backups
To minimize data loss during restoration and to meet RPO targets, it's essential to have regularly scheduled backups. Establish a backup frequency that aligns with your RPO. For example, if your RPO is 15 minutes, schedule backups to run at least every 15 minutes. Optimize the backup intervals to reduce the risk of data loss.
Use Google Cloud tools like Cloud Storage, Cloud SQL automated backups, or Spanner backups to schedule and manage backups. For critical applications, use near-continuous backup solutions like point-in-time recovery (PITR) for Cloud SQL or incremental backups for large datasets.
Define and monitor RPO
Set a clear RPO based on your business needs, and monitor adherence to the RPO. If backup intervals exceed the defined RPO, use Cloud Monitoring to set up alerts.
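For example, if your backups are written to a Cloud Storage bucket, you can check the age of the newest backup object against your RPO. The following sketch uses the google-cloud-storage client with a hypothetical bucket and prefix:

```python
from datetime import datetime, timedelta, timezone

from google.cloud import storage

BUCKET = "my-backup-bucket"  # Hypothetical bucket that stores your backups.
RPO = timedelta(minutes=15)  # Example RPO target.

client = storage.Client()
blobs = list(client.list_blobs(BUCKET, prefix="db-backups/"))

# The newest object's update time approximates the last successful backup.
latest = max(blob.updated for blob in blobs)
age = datetime.now(timezone.utc) - latest

if age > RPO:
    # In production, raise a Cloud Monitoring alert instead of printing.
    print(f"RPO VIOLATION: last backup is {age} old (target: {RPO})")
else:
    print(f"Within RPO: last backup is {age} old")
```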
Monitor backup health
Use the Google Cloud Backup and DR service or similar tools to track the health of your backups and confirm that they are stored in secure and reliable locations. Ensure that the backups are replicated across multiple regions for added resilience.
Plan for scenarios beyond backup
Combine backups with disaster recovery strategies like active-active failover setups or cross-region replication for improved recovery time in extreme cases. For more information, see Disaster recovery planning guide.
Conduct thorough postmortems
This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you conduct effective postmortems after failures and incidents.
This principle is relevant to the learning focus area of reliability.
Principle overview
A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve the incident, the root causes, and the follow-up actions to prevent the incident from recurring. The goal of a postmortem is to learn from mistakes and not assign blame.
The workflow of a postmortem includes the following steps:
- Create postmortem
- Capture the facts
- Identify and analyze the root causes
- Plan for the future
- Execute the plan
Conduct postmortem analyses after major incidents and also after minor events like the following:
- User-visible downtimes or degradations beyond a certain threshold.
- Data losses of any kind.
- Interventions from on-call engineers, such as a release rollback or rerouting of traffic.
- Resolution times above a defined threshold.
- Monitoring failures, which usually imply manual incident discovery.
Recommendations
Define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary.
To conduct effective postmortems, consider the recommendations in the following subsections.
Conduct blameless postmortems
Effective postmortems focus on processes, tools, and technologies, and don't place blame on individuals or teams. The purpose of a postmortem analysis is to improve your technology and processes, not to determine who is at fault. Everyone makes mistakes. The goal should be to analyze the mistakes and learn from them.
The following examples show the difference between feedback that assigns blame and blameless feedback:
- Feedback that assigns blame: "We need to rewrite the entire complicated backend system! It's been breaking weekly for the last three quarters and I'm sure we're all tired of fixing things piecemeal. Seriously, if I get paged one more time I'll rewrite it myself…"
- Blameless feedback: "An action item to rewrite the entire backend system might actually prevent these pages from continuing to happen. The maintenance manual for this version is quite long and really difficult to be fully trained up on. I'm sure our future on-call engineers will thank us!"
Make the postmortem report readable by all the intended audiences
For each piece of information that you plan to include in the report, assess whether that information is important and necessary to help the audience understand what happened. You can move supplementary data and explanations to an appendix of the report. Reviewers who need more information can request it.
Avoid complex or over-engineered solutions
Before you start to explore solutions for a problem, evaluate the importance of the problem and the likelihood of a recurrence. Adding complexity to the system to solve problems that are unlikely to occur again can lead to increased instability.
Share the postmortem as widely as possible
To ensure that issues don't remain unresolved, publish the outcome of the postmortem to a wide audience and get support from management. The value of a postmortem is proportional to the learning that occurs after the postmortem. When more people learn from incidents, the likelihood of similar failures recurring is reduced.
Google Cloud Architecture Framework: Cost optimization
The cost optimization pillar in the Google Cloud Architecture Framework describes principles and recommendations to optimize the cost of your workloads in Google Cloud.
The intended audience includes the following:
- CTOs, CIOs, CFOs, and other executives who are responsible for strategic cost management.
- Architects, developers, administrators, and operators who make decisions that affect cost at all the stages of an organization's cloud journey.
The cost models for on-premises and cloud workloads differ significantly. On-premises IT costs include capital expenditure (CapEx) and operational expenditure (OpEx). On-premises hardware and software assets are acquired and the acquisition costs are depreciated over the operating life of the assets. In the cloud, the costs for most cloud resources are treated as OpEx, where costs are incurred when the cloud resources are consumed. This fundamental difference underscores the importance of the following core principles of cost optimization.
For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Architecture Framework.
Core principles
The recommendations in the cost optimization pillar of the Architecture Framework are mapped to the following core principles:
- Align cloud spending with business value: Ensure that your cloud resources deliver measurable business value by aligning IT spending with business objectives.
- Foster a culture of cost awareness: Ensure that people across your organization consider the cost impact of their decisions and activities, and ensure that they have access to the cost information required to make informed decisions.
- Optimize resource usage: Provision only the resources that you need, and pay only for the resources that you consume.
- Optimize continuously: Continuously monitor your cloud resource usage and costs, and proactively make adjustments as needed to optimize your spending. This approach involves identifying and addressing potential cost inefficiencies before they become significant problems.
These principles are closely aligned with the core tenets of cloud FinOps. FinOps is relevant to any organization, regardless of its size or maturity in the cloud. By adopting these principles and following the related recommendations, you can control and optimize costs throughout your journey in the cloud.
Contributors
Author: Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
Other contributors:
- Anuradha Bajpai | Solutions Architect
- Daniel Lees | Cloud Security Architect
- Eric Lam | Head of Google Cloud FinOps
- Fernando Rubbo | Cloud Solutions Architect
- Filipe Gracio, PhD | Customer Engineer
- Gary Harmson | Customer Engineer
- Jose Andrade | Enterprise Infrastructure Customer Engineer
- Kent Hua | Solutions Manager
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Radhika Kanakam | Senior Program Manager, Cloud GTM
- Steve McGhee | Reliability Advocate
- Sergei Lilichenko | Solutions Architect
- Wade Holmes | Global Solutions Director
- Zach Seils | Networking Specialist
Align cloud spending with business value
This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to align your use of Google Cloud resources with your organization's business goals.
Principle overview
To effectively manage cloud costs, you need to maximize the business value that the cloud resources provide and minimize the total cost of ownership (TCO). When you evaluate the resource options for your cloud workloads, consider not only the cost of provisioning and using the resources, but also the cost of managing them. For example, virtual machines (VMs) on Compute Engine might be a cost-effective option for hosting applications. However, when you consider the overhead to maintain, patch, and scale the VMs, the TCO can increase. On the other hand, serverless services like Cloud Run can offer greater business value. The lower operational overhead lets your team focus on core activities and helps to increase agility.
To ensure that your cloud resources deliver optimal value, evaluate the following factors:
- Provisioning and usage costs: The expenses incurred when you purchase, provision, or consume resources.
- Management costs: The recurring expenses for operating and maintaining resources, including tasks like patching, monitoring, and scaling.
- Indirect costs: The costs that you might incur to manage issues like downtime, data loss, or security breaches.
- Business impact: The potential benefits from the resources, like increased revenue, improved customer satisfaction, and faster time to market.
By aligning cloud spending with business value, you get the following benefits:
- Value-driven decisions: Your teams are encouraged to prioritize solutions that deliver the greatest business value and to consider both short-term and long-term cost implications.
- Informed resource choice: Your teams have the information and knowledge that they need to assess the business value and TCO of various deployment options, so they choose resources that are cost-effective.
- Cross-team alignment: Cross-functional collaboration between business, finance, and technical teams ensures that cloud decisions are aligned with the overall objectives of the organization.
Recommendations
To align cloud spending with business objectives, consider the following recommendations.
Prioritize managed services and serverless products
Whenever possible, choose managed services and serverless products to reduce operational overhead and maintenance costs. This choice lets your teams concentrate on their core business activities. They can accelerate the delivery of new features and functionalities, and help drive innovation and value.
The following are examples of how you can implement this recommendation:
- To run PostgreSQL, MySQL, or Microsoft SQL Server server databases, use Cloud SQL instead of deploying those databases on VMs.
- To run and manage Kubernetes clusters, use Google Kubernetes Engine (GKE) Autopilot instead of deploying containers on VMs.
- For your Apache Hadoop or Apache Spark processing needs, use Dataproc and Dataproc Serverless. Per-second billing can help to achieve significantly lower TCO when compared to on-premises data lakes.
Balance cost efficiency with business agility
Controlling costs and optimizing resource utilization are important goals. However, you must balance these goals with the need for flexible infrastructure that lets you innovate rapidly, respond quickly to changes, and deliver value faster. The following are examples of how you can achieve this balance:
- Adopt DORA metrics for software delivery performance. Metrics like change failure rate (CFR), time to detect (TTD), and time to restore (TTR) can help to identify and fix bottlenecks in your development and deployment processes. By reducing downtime and accelerating delivery, you can achieve both operational efficiency and business agility.
- Follow Site Reliability Engineering (SRE) practices to improve operational reliability. SRE's focus on automation, observability, and incident response can lead to reduced downtime, lower recovery time, and higher customer satisfaction. By minimizing downtime and improving operational reliability, you can prevent revenue loss and avoid the need to overprovision resources as a safety net to handle outages.
Enable self-service optimization
Encourage a culture of experimentation and exploration by providing your teams with self-service cost optimization tools, observability tools, and resource management platforms. Enable them to provision, manage, and optimize their cloud resources autonomously. This approach helps to foster a sense of ownership, accelerate innovation, and ensure that teams can respond quickly to changing needs while being mindful of cost efficiency.
Adopt and implement FinOps
Adopt FinOps to establish a collaborative environment where everyone is empowered to make informed decisions that balance cost and value. FinOps fosters financial accountability and promotes effective cost optimization in the cloud.
Promote a value-driven and TCO-aware mindset
Encourage your team members to adopt a holistic attitude toward cloud spending, with an emphasis on TCO and not just upfront costs. Use techniques like value stream mapping to visualize and analyze the flow of value through your software delivery process and to identify areas for improvement. Implement unit costing for your applications and services to gain a granular understanding of cost drivers and discover opportunities for cost optimization. For more information, see Maximize business value with cloud FinOps.
Foster a culture of cost awareness
This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to promote cost awareness across your organization and ensure that team members have the cost information that they need to make informed decisions.
Conventionally, the responsibility for cost management might be centralized to a few select stakeholders and primarily focused on initial project architecture decisions. However, team members across all cloud user roles (analyst, architect, developer, or administrator) can help to reduce the cost of your resources in Google Cloud. By sharing cost data appropriately, you can empower team members to make cost-effective decisions throughout their development and deployment processes.
Principle overview
Stakeholders across various roles, including product owners, developers, deployment engineers, administrators, and financial analysts, need visibility into relevant cost data and its relationship to business value. When provisioning and managing cloud resources, they need the following data:
- Projected resource costs: Cost estimates at the time of design and deployment.
- Real-time resource usage costs: Up-to-date cost data that can be used for ongoing monitoring and budget validation.
- Costs mapped to business metrics: Insights into how cloud spending affects key performance indicators (KPIs), to enable teams to identify cost-effective strategies.
Not every individual needs access to raw cost data. However, promoting cost awareness across all roles is crucial because individual decisions can affect costs.
By promoting cost visibility and ensuring clear ownership of cost management practices, you ensure that everyone is aware of the financial implications of their choices and everyone actively contributes to the organization's cost optimization goals. Whether through a centralized FinOps team or a distributed model, establishing accountability is crucial for effective cost optimization efforts.
Recommendations
To promote cost awareness and ensure that your team members have the cost information that they need to make informed decisions, consider the following recommendations.
Provide organization-wide cost visibility
To achieve organization-wide cost visibility, the teams that are responsible for cost management can take the following actions:
- Standardize cost calculation and budgeting: Use a consistent method to determine the full costs of cloud resources, after factoring in discounts and shared costs. Establish clear and standardized budgeting processes that align with your organization's goals and enable proactive cost management.
- Use standardized cost management and visibility tools: Use appropriate tools that provide real-time insights into cloud spending and generate regular (for example, weekly) cost progression snapshots. These tools enable proactive budgeting, forecasting, and identification of optimization opportunities. The tools could be cloud provider tools (like the Google Cloud Billing dashboard), third-party solutions, or open-source solutions like the Cost Attribution solution.
- Implement a cost allocation system: Allocate a portion of the overall cloud budget to each team or project. Such an allocation gives the teams a sense of ownership over cloud spending and encourages them to make cost-effective decisions within their allocated budget.
- Promote transparency: Encourage teams to discuss cost implications during the design and decision-making processes. Create a safe and supportive environment for sharing ideas and concerns related to cost optimization. Some organizations use positive reinforcement mechanisms like leaderboards or recognition programs. If your organization has restrictions on sharing raw cost data due to business concerns, explore alternative approaches for sharing cost information and insights. For example, consider sharing aggregated metrics (like the total cost for an environment or feature) or relative metrics (like the average cost per transaction or user).
Understand how cloud resources are billed
Pricing for Google Cloud resources might vary across regions. Some resources are billed monthly at a fixed price, and others might be billed based on usage. To understand how Google Cloud resources are billed, use the Google Cloud pricing calculator and product-specific pricing information (for example, Google Kubernetes Engine (GKE) pricing).
Understand resource-based cost optimization options
For each type of cloud resource that you plan to use, explore strategies to optimize utilization and efficiency. The strategies include rightsizing, autoscaling, and adopting serverless technologies where appropriate. The following are examples of cost optimization options for a few Google Cloud products:
- Cloud Run lets you configure always-allocated CPUs to handle predictable traffic loads at a fraction of the price of the default allocation method (that is, CPUs allocated only during request processing).
- You can purchase BigQuery slot commitments to save money on data analysis.
- GKE provides detailed metrics to help you understand cost optimization options.
- Understand how network pricing can affect the cost of data transfers and how you can optimize costs for specific networking services. For example, you can reduce the data transfer costs for external Application Load Balancers by using Cloud CDN or Google Cloud Armor. For more information, see Ways to lower external Application Load Balancer costs.
Understand discount-based cost optimization options
Familiarize yourself with the discount programs that Google Cloud offers, such as the following examples:
- Committed use discounts (CUDs): CUDs are suitable for resources that have predictable and steady usage. CUDs let you get significant reductions in price in exchange for committing to specific resource usage over a period (typically one to three years). You can also use CUD auto-renewal to avoid having to manually repurchase commitments when they expire.
- Sustained use discounts: For certain Google Cloud products like Compute Engine and GKE, you can get automatic discount credits after continuous resource usage beyond specific duration thresholds.
- Spot VMs: For fault-tolerant and flexible workloads, Spot VMs can help to reduce your Compute Engine costs. The cost of Spot VMs is significantly lower than that of regular VMs. However, Compute Engine might preempt Spot VMs to reclaim capacity. Spot VMs are suitable for batch jobs that can tolerate preemption and don't have high availability requirements.
- Discounts for specific product options: Some managed services like BigQuery offer discounts when you purchase dedicated or autoscaling query processing capacity.
Evaluate and choose the discount options that align with your workload characteristics and usage patterns.
Incorporate cost estimates into architecture blueprints
Encourage teams to develop architecture blueprints that include cost estimates for different deployment options and configurations. This practice empowers teams to compare costs proactively and make informed decisions that align with both technical and financial objectives.
Use a consistent and standard set of labels for all your resources
You can use labels to track costs and to identify and classify resources. Specifically, you can use labels to allocate costs to different projects, departments, or cost centers. Defining a formal labeling policy that aligns with the needs of the main stakeholders in your organization helps to make costs visible more widely. You can also use labels to filter resource cost and usage data based on target audience.
Use automation tools like Terraform to enforce labeling on every resource that is created. To enhance cost visibility and attribution further, you can use the tools provided by the open-source cost attribution solution.
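In addition to enforcing labels at creation time, you can audit resources against your labeling policy. The following minimal sketch shows the validation logic, with a hypothetical set of required label keys; it could run in a CI pipeline or a periodic audit job:

```python
REQUIRED_LABEL_KEYS = {"team", "cost-center", "environment"}  # Example policy.

def validate_labels(resource_name: str, labels: dict[str, str]) -> list[str]:
    """Return a list of labeling-policy violations for a resource."""
    missing = REQUIRED_LABEL_KEYS - labels.keys()
    return [f"{resource_name}: missing label '{key}'" for key in sorted(missing)]

# Example: check a resource before (or after) it is created.
violations = validate_labels(
    "vm/web-frontend-1",
    {"team": "checkout", "environment": "prod"},
)
for violation in violations:
    print(violation)  # vm/web-frontend-1: missing label 'cost-center'
```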
Share cost reports with team members
By sharing cost reports with your team members, you empower them to take ownership of their cloud spending. This practice enables cost-effective decision making, continuous cost optimization, and systematic improvements to your cost allocation model.
Cost reports can be of several types, including the following:
- Periodic cost reports: Regular reports inform teams about their current cloud spending. Conventionally, these reports might be spreadsheet exports. More effective methods include automated emails and specialized dashboards. To ensure that cost reports provide relevant and actionable information without overwhelming recipients with unnecessary detail, the reports must be tailored to the target audiences. Setting up tailored reports is a foundational step toward more real-time and interactive cost visibility and management.
- Automated notifications: You can configure cost reports to proactively notify relevant stakeholders (for example, through email or chat) about cost anomalies, budget thresholds, or opportunities for cost optimization. By providing timely information directly to those who can act on it, automated alerts encourage prompt action and foster a proactive approach to cost optimization.
- Google Cloud dashboards: You can use the built-in billing dashboards in Google Cloud to get insights into cost breakdowns and to identify opportunities for cost optimization. Google Cloud also provides FinOps hub to help you monitor savings and get recommendations for cost optimization. An AI engine powers the FinOps hub to recommend cost optimization opportunities for all the resources that are currently deployed. To control access to these recommendations, you can implement role-based access control (RBAC).
- Custom dashboards: You can create custom dashboards by exporting cost data to an analytics database, like BigQuery. Use a visualization tool like Looker Studio to connect to the analytics database to build interactive reports and enable fine-grained access control through role-based permissions. For a sketch of this approach, see the example query after this list.
- Multicloud cost reports: For multicloud deployments, you need a unified view of costs across all the cloud providers to ensure comprehensive analysis, budgeting, and optimization. Use tools like BigQuery to centralize and analyze cost data from multiple cloud providers, and use Looker Studio to build team-specific interactive reports.
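As an example of the custom-dashboard approach, the following sketch queries a Cloud Billing export table in BigQuery for the top cost drivers over the last 30 days. The table name is a placeholder; substitute your own billing export dataset:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical billing export table; substitute your dataset and table name.
QUERY = """
SELECT
  project.id AS project_id,
  service.description AS service,
  SUM(cost) AS total_cost
FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY project_id, service
ORDER BY total_cost DESC
LIMIT 10
"""

for row in client.query(QUERY).result():
    print(f"{row.project_id} / {row.service}: ${row.total_cost:,.2f}")
```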
Optimize resource usage
This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you plan and provision resources to match the requirements and consumption patterns of your cloud workloads.
Principle overview
To optimize the cost of your cloud resources, you need to thoroughly understand your workloads' resource requirements and load patterns. This understanding is the basis for a well-defined cost model that lets you forecast the total cost of ownership (TCO) and identify cost drivers throughout your cloud adoption journey. By proactively analyzing and forecasting cloud spending, you can make informed choices about resource provisioning, utilization, and cost optimization. This approach lets you control cloud spending, avoid overprovisioning, and ensure that cloud resources are aligned with the dynamic needs of your workloads and environments.
Recommendations
To effectively optimize cloud resource usage, consider the following recommendations.
Choose environment-specific resources
Each deployment environment has different requirements for availability, reliability, and scalability. For example, developers might prefer an environment that lets them rapidly deploy and run applications for short durations, but might not need high availability. On the other hand, a production environment typically needs high availability. To maximize the utilization of your resources, define environment-specific requirements based on your business needs. The following table lists examples of environment-specific requirements.
Environment | Requirements |
Production | High availability and reliability for business-critical workloads. |
Development and testing | Low-cost resources that support rapid deployment and short-lived use; high availability typically isn't required. |
Other environments (like staging and QA) | Production-like configurations at a reduced scale and cost, to validate releases realistically. |
Choose workload-specific resources
Each of your cloud workloads might have different requirements for availability, scalability, security, and performance. To optimize costs, you need to align resource choices with the specific requirements of each workload. For example, a stateless application might not require the same level of availability or reliability as a stateful backend. The following table lists more examples of workload-specific requirements.
Workload type | Workload requirements | Resource options |
Mission-critical | Continuous availability, robust security, and high performance | Premium resources and managed services like Spanner for high availability and global consistency of data. |
Non-critical | Cost-efficient and autoscaling infrastructure | Resources with basic features and ephemeral resources like Spot VMs. |
Event-driven | Dynamic scaling based on the current demand for capacity and performance | Serverless services like Cloud Run and Cloud Run functions. |
Experimental workloads | Low cost and flexible environment for rapid development, iteration, testing, and innovation | Resources with basic features, ephemeral resources like Spot VMs, and sandbox environments with defined spending limits. |
A benefit of the cloud is the opportunity to take advantage of the most appropriate computing power for a given workload. Some workloads are developed to take advantage of processor instruction sets, and others might not be designed in this way. Benchmark and profile your workloads accordingly. Categorize your workloads and make workload-specific resource choices (for example, choose appropriate machine families for Compute Engine VMs). This practice helps to optimize costs, enable innovation, and maintain the level of availability and performance that your workloads need.
The following are examples of how you can implement this recommendation:
- For mission-critical workloads that serve globally distributed users, consider using Spanner. Spanner removes the need for complex database deployments by ensuring reliability and consistency of data in all regions.
- For workloads with fluctuating load levels, use autoscaling to ensure that you don't incur costs when the load is low and yet maintain sufficient capacity to meet the current load. You can configure autoscaling for many Google Cloud services, including Compute Engine VMs, Google Kubernetes Engine (GKE) clusters, and Cloud Run. When you set up autoscaling, you can configure maximum scaling limits to ensure that costs remain within specified budgets.
Select regions based on cost requirements
For your cloud workloads, carefully evaluate the available Google Cloud regions and choose regions that align with your cost objectives. The region with the lowest cost might not offer optimal latency or meet your sustainability requirements. Make informed decisions about where to deploy your workloads to achieve the desired balance. You can use the Google Cloud Region Picker to understand the trade-offs between cost, sustainability, latency, and other factors.
Use built-in cost optimization options
Google Cloud products provide built-in features to help you optimize resource usage and control costs. The following table lists examples of cost optimization features that you can use in some Google Cloud products:
Product | Cost optimization feature |
Compute Engine | Committed use discounts, sustained use discounts, Spot VMs, rightsizing recommendations, and autoscaling with managed instance groups. |
GKE | Autopilot mode, cluster autoscaler, and node auto-provisioning. |
Cloud Storage | Storage classes for different access patterns, Autoclass, and Object Lifecycle Management. |
BigQuery | Slot commitments, capacity autoscaling, and partitioning and clustering to reduce the amount of data that queries scan. |
Google Cloud VMware Engine | Committed use discounts. |
Optimize resource sharing
To maximize the utilization of cloud resources, you can deploy multiple applications or services on the same infrastructure, while still meeting the security and other requirements of the applications. For example, in development and testing environments, you can use the same cloud infrastructure to test all the components of an application. For the production environment, you can deploy each component on a separate set of resources to limit the extent of impact in case of incidents.
The following are examples of how you can implement this recommendation:
- Use a single Cloud SQL instance for multiple non-production environments.
- Enable multiple development teams to share a GKE cluster by using the fleet team management feature in GKE Enterprise with appropriate access controls.
- Use GKE Autopilot to take advantage of cost-optimization techniques like bin packing and autoscaling that GKE implements by default.
- For AI and ML workloads, save GPU costs by using GPU-sharing strategies like multi-instance GPUs, time-sharing GPUs, and NVIDIA MPS.
Develop and maintain reference architectures
Create and maintain a repository of reference architectures that are tailored to meet the requirements of different deployment environments and workload types. To streamline the design and implementation process for individual projects, the blueprints can be centrally managed by a team like a Cloud Center of Excellence (CCoE). Project teams can choose suitable blueprints based on clearly defined criteria, to ensure architectural consistency and adoption of best practices. For requirements that are unique to a project, the project team and the central architecture team should collaborate to design new reference architectures. You can share the reference architectures across the organization to foster knowledge sharing and expand the repository of available solutions. This approach ensures consistency, accelerates development, simplifies decision-making, and promotes efficient resource utilization.
Review the reference architectures provided by Google for various use cases and technologies. These reference architectures incorporate best practices for resource selection, sizing, configuration, and deployment. By using these reference architectures, you can accelerate your development process and achieve cost savings from the start.
Enforce cost discipline by using organization policies
Consider using organization policies to limit the available Google Cloud locations and products that team members can use. These policies help to ensure that teams adhere to cost-effective solutions and provision resources in locations that are aligned with your cost optimization goals.
Estimate realistic budgets and set financial boundaries
Develop detailed budgets for each project, workload, and deployment environment. Make sure that the budgets cover all aspects of cloud operations, including infrastructure costs, software licenses, personnel, and anticipated growth. To prevent overspending and ensure alignment with your financial goals, establish clear spending limits or thresholds for projects, services, or specific resources. Monitor cloud spending regularly against these limits. You can use proactive quota alerts to identify potential cost overruns early and take timely corrective action.
In addition to setting budgets, you can use quotas and limits to help enforce cost discipline and prevent unexpected spikes in spending. You can exercise granular control over resource consumption by setting quotas at various levels, including projects, services, and even specific resource types.
The following are examples of how you can implement this recommendation; a budget-creation sketch follows the list:
- Project-level quotas: Set spending limits or resource quotas at the project level to establish overall financial boundaries and control resource consumption across all the services within the project.
- Service-specific quotas: Configure quotas for specific Google Cloud services like Compute Engine or BigQuery to limit the number of instances, CPUs, or storage capacity that can be provisioned.
- Resource type-specific quotas: Apply quotas to individual resource types like Compute Engine VMs, Cloud Storage buckets, Cloud Run instances, or GKE nodes to restrict their usage and prevent unexpected cost overruns.
- Quota alerts: Get notified when your project-level quota usage reaches a defined percentage of the quota limit.
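The following is a minimal sketch of creating a budget with alert thresholds programmatically, assuming the google-cloud-billing-budgets Python client. The project number, billing account ID, amount, and threshold values are placeholders; verify the field names against the current client library.

```python
from google.cloud.billing import budgets_v1

client = budgets_v1.BudgetServiceClient()

budget = budgets_v1.Budget(
    display_name="team-a-monthly-budget",
    # Scope the budget to a single project (placeholder project number).
    budget_filter=budgets_v1.Filter(projects=["projects/123456789012"]),
    amount=budgets_v1.BudgetAmount(
        specified_amount={"currency_code": "USD", "units": 5000}
    ),
    threshold_rules=[
        # Notify at 50% and 90% of actual spend.
        budgets_v1.ThresholdRule(threshold_percent=0.5),
        budgets_v1.ThresholdRule(threshold_percent=0.9),
        # Notify when spend is forecast to exceed the budget.
        budgets_v1.ThresholdRule(
            threshold_percent=1.0,
            spend_basis=budgets_v1.ThresholdRule.Basis.FORECASTED_SPEND,
        ),
    ],
)

client.create_budget(
    parent="billingAccounts/000000-AAAAAA-BBBBBB", budget=budget
)
```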
By using quotas and limits in conjunction with budgeting and monitoring, you can create a proactive and multi-layered approach to cost control. This approach helps to ensure that your cloud spending remains within defined boundaries and aligns with your business objectives. Remember, these cost controls are not permanent or rigid. To ensure that the cost controls remain aligned with current industry standards and reflect your evolving business needs, you must review the controls regularly and adjust them to include new technologies and best practices.
Optimize continuously
This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you optimize the cost of your cloud deployments based on constantly changing and evolving business goals.
As your business grows and evolves, your cloud workloads need to adapt to changes in resource requirements and usage patterns. To derive maximum value from your cloud spending, you must maintain cost-efficiency while continuing to support business objectives. This requires a proactive and adaptive approach that focuses on continuous improvement and optimization.
Principle overview
To optimize cost continuously, you must proactively monitor and analyze your cloud environment and make suitable adjustments to meet current requirements. Focus your monitoring efforts on key performance indicators (KPIs) that directly affect your end users' experience, align with your business goals, and provide insights for continuous improvement. This approach lets you identify and address inefficiencies, adapt to changing needs, and continuously align cloud spending with strategic business goals. To balance comprehensive observability with cost effectiveness, understand the costs and benefits of monitoring resource usage and use appropriate process-improvement and optimization strategies.
Recommendations
To effectively monitor your Google Cloud environment and optimize cost continuously, consider the following recommendations.
Focus on business-relevant metrics
Effective monitoring starts with identifying the metrics that are most important for your business and customers. These metrics include the following:
- User experience metrics: Latency, error rates, throughput, and customer satisfaction metrics are useful for understanding your end users' experience when using your applications.
- Business outcome metrics: Revenue, customer growth, and engagement can be correlated with resource usage to identify opportunities for cost optimization.
- DevOps Research & Assessment (DORA) metrics: Metrics like deployment frequency, lead time for changes, change failure rate, and time to restore provide insights into the efficiency and reliability of your software delivery process. By improving these metrics, you can increase productivity, reduce downtime, and optimize cost.
- Site Reliability Engineering (SRE) metrics: Error budgets help teams to quantify and manage the acceptable level of service disruption. By establishing clear expectations for reliability, error budgets empower teams to innovate and deploy changes more confidently, knowing their safety margin. This proactive approach promotes a balance between innovation and stability, helping prevent excessive operational costs associated with major outages or prolonged downtime.
Use observability for resource optimization
The following are recommendations to use observability to identify resource bottlenecks and underutilized resources in your cloud deployments; a monitoring sketch follows the list:
- Monitor resource utilization: Use resource utilization metrics to identify Google Cloud resources that are underutilized. For example, use metrics like CPU and memory utilization to identify idle VM resources. For Google Kubernetes Engine (GKE), you can view a detailed breakdown of costs and cost-related optimization metrics. For Google Cloud VMware Engine, review resource utilization to optimize CUDs, storage consumption, and ESXi right-sizing.
- Use cloud recommendations: Active Assist is a portfolio of intelligent tools that help you optimize your cloud operations. These tools provide actionable recommendations to reduce costs, increase performance, improve security, and even make sustainability-focused decisions. For example, VM rightsizing insights can help to optimize resource allocation and avoid unnecessary spending.
- Correlate resource utilization with performance: Analyze the relationship between resource utilization and application performance to determine whether you can downgrade to less expensive resources without affecting the user experience.
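For example, you can query the Compute Engine CPU utilization metric through the Cloud Monitoring API to flag VMs that might be idle. The following is a minimal sketch that assumes the google-cloud-monitoring Python client; the project ID, seven-day window, and 5% idle threshold are illustrative assumptions.

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # Placeholder project ID.

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 7 * 24 * 3600},  # Last 7 days.
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    points = [point.value.double_value for point in series.points]
    average = sum(points) / len(points) if points else 0.0
    if average < 0.05:  # Illustrative "idle" threshold: under 5% average CPU.
        instance_id = series.resource.labels.get("instance_id", "unknown")
        print(f"Possibly idle VM {instance_id}: average CPU {average:.1%}")
```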
Balance troubleshooting needs with cost
Detailed observability data can help with diagnosing and troubleshooting issues. However, storing excessive amounts of observability data or exporting unnecessary data to external monitoring tools can lead to unnecessary costs. For efficient troubleshooting, consider the following recommendations:
- Collect sufficient data for troubleshooting: Ensure that your monitoring solution captures enough data to efficiently diagnose and resolve issues when they arise. This data might include logs, traces, and metrics at various levels of granularity.
- Use sampling and aggregation: Balance the need for detailed data with cost considerations by using sampling and aggregation techniques. This approach lets you collect representative data without incurring excessive storage costs (see the sampling sketch after this list).
- Understand the pricing models of your monitoring tools and services: Evaluate different monitoring solutions and choose options that align with your project's specific needs, budget, and usage patterns. Consider factors like data volume, retention requirements, and the required features when making your selection.
- Regularly review your monitoring configuration: Avoid collecting excessive data by removing unnecessary metrics or logs.
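As one way to implement trace sampling, the following sketch configures the OpenTelemetry Python SDK to keep roughly 10% of traces. OpenTelemetry and the 10% ratio are assumptions for illustration; any comparable tracing SDK exposes a similar control.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample about 10% of new traces; child spans follow their parent's decision,
# so each sampled trace stays complete.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("checkout"):
    pass  # Application work goes here.
```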
Tailor data collection to roles and set role-specific retention policies
Consider the specific data needs of different roles. For example, developers might primarily need access to traces and application-level logs, whereas IT administrators might focus on system logs and infrastructure metrics. By tailoring data collection, you can reduce unnecessary storage costs and avoid overwhelming users with irrelevant information.
Additionally, you can define retention policies based on the needs of each role and any regulatory requirements. For example, developers might need access to detailed logs for a shorter period, while financial analysts might require longer-term data.
Consider regulatory and compliance requirements
In certain industries, regulatory requirements mandate data retention. To avoid legal and financial risks, you need to ensure that your monitoring and data retention practices help you adhere to relevant regulations. At the same time, you need to maintain cost efficiency. Consider the following recommendations:
- Determine the specific data retention requirements for your industry or region, and ensure that your monitoring strategy meets those requirements.
- Implement appropriate data archival and retrieval mechanisms to meet audit and compliance needs while minimizing storage costs.
Implement smart alerting
Alerting helps to detect and resolve issues in a timely manner. However, a balance is necessary between an approach that keeps you informed, and one that overwhelms you with notifications. By designing intelligent alerting systems, you can prioritize critical issues that have higher business impact. Consider the following recommendations:
- Prioritize issues that affect customers: Design alerts that trigger rapidly for issues that directly affect the customer experience, like website outages, slow response times, or transaction failures.
- Tune for temporary problems: Use appropriate thresholds and delay mechanisms to avoid unnecessary alerts for temporary problems or self-healing issues that don't affect customers (see the alerting sketch after this list).
- Customize alert severity: Ensure that the most urgent issues receive immediate attention by differentiating between critical and noncritical alerts.
- Use notification channels wisely: Choose appropriate channels for alert notifications (email, SMS, or paging) based on the severity and urgency of the alerts.
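To illustrate threshold and duration tuning, the following sketch creates an alert policy that fires only when a metric stays above its threshold for five minutes, which filters out brief, self-healing spikes. It assumes the google-cloud-monitoring Python client; the project ID, metric filter, and threshold value are placeholders.

```python
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

PROJECT_ID = "my-project"  # Placeholder project ID.

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="checkout-latency-critical",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="Latency above threshold for 5 minutes",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                # Placeholder filter; point it at your latency metric.
                filter=(
                    'metric.type = "custom.googleapis.com/checkout_latency" '
                    'AND resource.type = "global"'
                ),
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=2.0,  # Illustrative value, in seconds.
                # The duration acts as a delay mechanism: transient spikes
                # shorter than 5 minutes don't trigger the alert.
                duration=duration_pb2.Duration(seconds=300),
            ),
        )
    ],
)

client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)
```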
Google Cloud Architecture Framework: Performance optimization
This pillar in the Google Cloud Architecture Framework provides recommendations to optimize the performance of workloads in Google Cloud.
This document is intended for architects, developers, and administrators who plan, design, deploy, and manage workloads in Google Cloud.
The recommendations in this pillar can help your organization to operate efficiently, improve customer satisfaction, increase revenue, and reduce cost. For example, when the backend processing time of an application decreases, users experience faster response times, which can lead to higher user retention and more revenue.
The performance optimization process can involve a trade-off between performance and cost. However, optimizing performance can sometimes help you reduce costs. For example, when the load increases, autoscaling can help to provide predictable performance by ensuring that the system resources aren't overloaded. Autoscaling also helps you to reduce costs by removing unused resources during periods of low load.
Performance optimization is a continuous process, not a one-time activity. The following diagram shows the stages in the performance optimization process:
The performance optimization process is an ongoing cycle that includes the following stages:
- Define requirements: Define granular performance requirements for each layer of the application stack before you design and develop your applications. To plan resource allocation, consider the key workload characteristics and performance expectations.
- Design and deploy: Use elastic and scalable design patterns that can help you meet your performance requirements.
- Monitor and analyze: Monitor performance continually by using logs, tracing, metrics, and alerts.
- Optimize: Consider potential redesigns as your applications evolve. Rightsize cloud resources and use new features to meet changing performance requirements.
As shown in the preceding diagram, continue the cycle of monitoring, re-assessing requirements, and adjusting the cloud resources.
For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization in the Architecture Framework.
Core principles
The recommendations in the performance optimization pillar of the Architecture Framework are mapped to the following core principles:
- Plan resource allocation
- Take advantage of elasticity
- Promote modular design
- Continuously monitor and improve performance
Contributors
Authors:
- Daniel Lees | Cloud Security Architect
- Gary Harmson | Customer Engineer
- Luis Urena | Developer Relations Engineer
- Zach Seils | Networking Specialist
Other contributors:
- Filipe Gracio, PhD | Customer Engineer
- Jose Andrade | Enterprise Infrastructure Customer Engineer
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
- Ryan Cox | Principal Architect
- Radhika Kanakam | Senior Program Manager, Cloud GTM
- Wade Holmes | Global Solutions Director
Plan resource allocation
This principle in the performance optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you plan resources for your workloads in Google Cloud. It emphasizes the importance of defining granular requirements before you design and develop applications for cloud deployment or migration.
Principle overview
To meet your business requirements, define the performance requirements for your applications before design and development begin. Define these requirements as granularly as possible, for the application as a whole and for each layer of the application stack. For example, in the storage layer, you must consider the throughput and I/O operations per second (IOPS) that the applications need.
From the beginning, plan application designs with performance and scalability in mind. Consider factors such as the number of users, data volume, and potential growth over time.
Performance requirements for each workload vary and depend on the type of workload. Each workload can contain a mix of component systems and services that have unique sets of performance characteristics. For example, a system that's responsible for periodic batch processing of large datasets has different performance demands than an interactive virtual desktop solution. Your optimization strategies must address the specific needs of each workload.
Select services and features that align with the performance goals of each workload. For performance optimization, there's no one-size-fits-all solution. When you optimize each workload, the entire system can achieve optimal performance and efficiency.
Consider the following workload characteristics that can influence your performance requirements:
- Deployment archetype: The deployment archetype that you select for an application can influence your choice of products and features, which then determine the performance that you can expect from your application.
- Resource placement: When you select a Google Cloud region for your application resources, we recommend that you prioritize low latency for end users, adhere to data-locality regulations, and ensure the availability of required Google Cloud products and services.
- Network connectivity: Choose networking services that optimize data access and content delivery. Take advantage of Google Cloud's global network, high-speed backbones, interconnect locations, and caching services.
- Application hosting options: When you select a hosting platform, you must evaluate the performance advantages and disadvantages of each option. For example, consider bare metal, virtual machines, containers, and serverless platforms.
- Storage strategy: Choose an optimal storage strategy that's based on your performance requirements.
- Resource configurations: The machine type, IOPS, and throughput can have a significant impact on performance. Additionally, early in the design phase, you must consider appropriate security capabilities and their impact on resources. When you plan security features, be prepared to accommodate the necessary performance trade-offs to avoid any unforeseen effects.
Recommendations
To ensure optimal resource allocation, consider the recommendations in the following sections.
Configure and manage quotas
Ensure that your application uses only the necessary resources, such as memory, storage, and processing power. Over-allocation can lead to unnecessary expenses, while under-allocation might result in performance degradation.
To accommodate elastic scaling and to ensure that adequate resources are available, regularly monitor the capacity of your quotas. Additionally, track quota usage to identify potential scaling constraints or over-allocation issues, and then make informed decisions about resource allocation.
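For example, quota consumption is exposed as serviceruntime metrics that you can read through the Cloud Monitoring API. The following is a minimal sketch, assuming the google-cloud-monitoring Python client; the project ID and one-day window are placeholders.

```python
import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # Placeholder project ID.

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": now},
        "start_time": {"seconds": now - 24 * 3600},  # Last 24 hours.
    }
)

results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "serviceruntime.googleapis.com/quota/allocation/usage"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    quota_name = series.metric.labels.get("quota_metric", "unknown")
    latest = series.points[0].value.int64_value if series.points else 0
    print(f"{quota_name}: current usage {latest}")
```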
Educate and promote awareness
Inform your users about the performance requirements and provide educational resources about effective performance management techniques.
To evaluate progress and to identify areas for improvement, regularly document the target performance and the actual performance. Load test your application to find potential breakpoints and to understand how you can scale the application.
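As an example of a simple load test, the following sketch uses the open source Locust framework, which is an assumption of this sketch rather than a tool prescribed by this document; the endpoints and request mix are placeholders.

```python
from locust import HttpUser, between, task


class StorefrontUser(HttpUser):
    # Each simulated user waits 1-5 seconds between requests.
    wait_time = between(1, 5)

    @task(3)
    def browse_catalog(self):
        self.client.get("/products")  # Placeholder endpoint.

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"cart_id": "demo"})
```

Run the test with increasing user counts (for example, `locust --host https://staging.example.com`) and watch for the load level at which response times degrade; that level marks a potential breakpoint.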
Monitor performance metrics
Use Cloud Monitoring to analyze trends in performance metrics, to analyze the effects of experiments, to define alerts for critical metrics, and to perform retrospective analyses.
Active Assist is a set of tools that can provide insights and recommendations to help optimize resource utilization. These recommendations can help you to adjust resource allocation and improve performance.
Take advantage of elasticity
This principle in the performance optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you incorporate elasticity, which is the ability to adjust resources dynamically based on changes in workload requirements.
Elasticity allows different components of a system to scale independently. This targeted scaling can help improve performance and cost efficiency by allocating resources precisely where they're needed, without overprovisioning or underprovisioning your resources.
Principle overview
The performance requirements of a system directly influence when and how the system scales vertically or scales horizontally. You need to evaluate the system's capacity and determine the load that the system is expected to handle at baseline. Then, you need to determine how you want the system to respond to increases and decreases in the load.
When the load increases, the system must scale out horizontally, scale up vertically, or both. For horizontal scaling, add replica nodes to ensure that the system has sufficient overall capacity to fulfill the increased demand. For vertical scaling, replace the application's existing components with larger components that provide more compute capacity, memory, or storage.
When the load decreases, the system must scale down (horizontally, vertically, or both).
Define the circumstances in which the system scales up or scales down. Plan to manually scale up systems for known periods of high traffic. Use tools like autoscaling, which responds to increases or decreases in the load.
Recommendations
To take advantage of elasticity, consider the recommendations in the following sections.
Plan for peak load periods
You need to plan an efficient scaling path for known events, such as expected periods of increased customer demand.
Consider scaling up your system ahead of known periods of high traffic. For example, if you're a retail organization, you expect demand to increase during seasonal sales. We recommend that you manually scale up or scale out your systems before those sales, so that your system can immediately handle the increased load, or adjust existing limits in advance. Otherwise, the system might take several minutes to add resources in response to real-time changes. Your application's capacity might not increase quickly enough, which can cause some users to experience delays.
For unknown or unexpected events, such as a sudden surge in demand or traffic, you can use autoscaling features to trigger elastic scaling that's based on metrics. These metrics can include CPU utilization, load balancer serving capacity, latency, and even custom metrics that you define in Cloud Monitoring.
For example, consider an application that runs on a Compute Engine managed instance group (MIG). This application has a requirement that each instance performs optimally until the average CPU utilization reaches 75%. In this example, you might define an autoscaling policy that creates more instances when the CPU utilization reaches the threshold. These newly created instances help absorb the load, which helps ensure that the average CPU utilization remains at an optimal rate until the maximum number of instances that you've configured for the MIG is reached. When the demand decreases, the autoscaling policy removes the instances that are no longer needed.
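A minimal sketch of such a policy follows, assuming the google-cloud-compute Python client; the project, zone, and instance group names are placeholders, and the replica limits are illustrative.

```python
from google.cloud import compute_v1

PROJECT = "my-project"   # Placeholder values for illustration.
ZONE = "us-central1-a"
MIG_NAME = "web-mig"

autoscaler = compute_v1.Autoscaler(
    name="web-autoscaler",
    target=(
        f"https://www.googleapis.com/compute/v1/projects/{PROJECT}"
        f"/zones/{ZONE}/instanceGroupManagers/{MIG_NAME}"
    ),
    autoscaling_policy=compute_v1.AutoscalingPolicy(
        min_num_replicas=2,
        max_num_replicas=10,  # Upper bound that also caps cost.
        cool_down_period_sec=60,
        # Add instances when average CPU utilization exceeds 75%.
        cpu_utilization=compute_v1.AutoscalingPolicyCpuUtilization(
            utilization_target=0.75
        ),
    ),
)

client = compute_v1.AutoscalersClient()
operation = client.insert(
    project=PROJECT, zone=ZONE, autoscaler_resource=autoscaler
)
operation.result()  # Wait for the insert operation to complete.
```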
For other services, plan capacity ahead of peak periods: for example, reserve slots in BigQuery, and adjust the limits of Spanner autoscaling configurations by using the managed autoscaler.
Use predictive scaling
If your system components include Compute Engine, you must evaluate whether predictive autoscaling is suitable for your workload. Predictive autoscaling forecasts the future load based on your metrics' historical trends—for example, CPU utilization. Forecasts are recomputed every few minutes, so the autoscaler rapidly adapts its forecast to very recent changes in load. Without predictive autoscaling, an autoscaler can only scale a group reactively, based on observed real-time changes in load. Predictive autoscaling works with both real-time data and historical data to respond to both the current and the forecasted load.
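Continuing the autoscaler sketch from the previous section, predictive autoscaling is enabled on the CPU utilization target. The OPTIMIZE_AVAILABILITY value reflects the Compute Engine API's predictive method field; verify it against current documentation.

```python
# Extends the autoscaling policy from the earlier sketch: forecast load from
# historical CPU utilization and scale out ahead of the predicted demand.
autoscaler.autoscaling_policy.cpu_utilization.predictive_method = (
    "OPTIMIZE_AVAILABILITY"
)
```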
Implement serverless architectures
Consider implementing a serverless architecture with serverless services that are inherently elastic, such as the following:
- Cloud Run
- Cloud Run functions
- App Engine
Unlike autoscaling in services like Compute Engine, which requires you to fine-tune scaling rules, serverless autoscaling is instant and can scale down to zero resources.
Use Autopilot mode for Kubernetes
For complex applications that require the control that Kubernetes provides, consider Autopilot mode in Google Kubernetes Engine (GKE). Autopilot mode provides automation and scalability by default. GKE automatically scales nodes and resources based on traffic. GKE manages nodes, creates new nodes for your applications, and configures automatic upgrades and repairs.
Promote modular design
This principle in the performance optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you promote a modular design. Modular components and clear interfaces can enable flexible scaling, independent updates, and future component separation.
Principle overview
Understand the dependencies between the application components and the system components to design a scalable system.
Modular design enables flexibility and resilience, regardless of whether a monolithic or microservices architecture was initially deployed. By decomposing the system into well-defined, independent modules with clear interfaces, you can scale individual components to meet specific demands.
Targeted scaling can help optimize resource utilization and reduce costs in the following ways:
- Provisions only the necessary resources to each component, and allocates fewer resources to less-demanding components.
- Adds more resources during high-traffic periods to maintain the user experience.
- Removes under-utilized resources without compromising performance.
Modularity also enhances maintainability. Smaller, self-contained units are easier to understand, debug, and update, which can lead to faster development cycles and reduced risk.
While modularity offers significant advantages, you must evaluate the potential performance trade-offs. The increased communication between modules can introduce latency and overhead. Strive for a balance between modularity and performance. A highly modular design might not be universally suitable. When performance is critical, a more tightly coupled approach might be appropriate. System design is an iterative process, in which you continuously review and refine your modular design.
Recommendations
To promote modular designs, consider the recommendations in the following sections.
Design for loose coupling
Design a loosely coupled architecture. Independent components with minimal dependencies can help you build scalable and resilient applications. As you plan the boundaries for your services, you must consider the availability and scalability requirements. For example, if one component has requirements that are different from your other components, you can design the component as a standalone service. Implement a plan for graceful failures for less-important subprocesses or services that don't impact the response time of the primary services.
Design for concurrency and parallelism
Design your application to support multiple tasks concurrently, like processing multiple user requests or running background jobs while users interact with your system. Break large tasks into smaller chunks that can be processed at the same time by multiple service instances. Task concurrency lets you use features like autoscaling to increase resource allocation in products such as Compute Engine, GKE, and Cloud Run.
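The following generic sketch (plain Python, no Google Cloud APIs assumed) shows the chunking pattern: a large task is split into chunks that are processed concurrently, and in a distributed design each chunk could instead become a message that a separate, autoscaled service instance consumes.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk: list[str]) -> int:
    # Placeholder work: in practice, transform records or call a service.
    return len(chunk)


records = [f"record-{i}" for i in range(10_000)]
chunk_size = 500
chunks = [
    records[i : i + chunk_size] for i in range(0, len(records), chunk_size)
]

# Process the chunks concurrently across a pool of workers.
with ThreadPoolExecutor(max_workers=8) as pool:
    processed = sum(pool.map(process_chunk, chunks))

print(f"Processed {processed} records")
```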
Balance modularity for flexible resource allocation
Where possible, ensure that each component uses only the necessary resources (like memory, storage, and processing power) for specific operations. Resource over-allocation can result in unnecessary costs, while resource under-allocation can compromise performance.
Use well-defined interfaces
Ensure that modular components communicate effectively through clear, standardized interfaces (like APIs and message queues) to reduce overhead from translation layers or from extraneous traffic.
Use stateless models
A stateless model can help ensure that you can handle each request or interaction with the service independently from previous requests. This model facilitates scalability and recoverability, because you can grow, shrink, or restart the service without losing the data necessary for in-progress requests or processes.
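For instance, a stateless service keeps session data in an external store rather than in instance memory, so any instance can serve any request. The sketch below uses Firestore as that store; Firestore, the collection name, and the helper functions are illustrative assumptions, and Memorystore or a database would work equally well.

```python
from google.cloud import firestore

db = firestore.Client()  # Uses the ambient project by default.


def save_session(session_id: str, data: dict) -> None:
    # State lives in Firestore, not in this instance's memory, so the
    # service can grow, shrink, or restart without losing the session.
    db.collection("sessions").document(session_id).set(data)


def load_session(session_id: str) -> dict:
    doc = db.collection("sessions").document(session_id).get()
    return doc.to_dict() or {}
```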
Choose complementary technologies
Choose technologies that complement the modular design. Evaluate programming languages, frameworks, and databases for their modularity support.
Continuously monitor and improve performance
This principle in the performance optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you continuously monitor and improve performance.
After you deploy applications, continuously monitor their performance by using logs, tracing, metrics, and alerts. As your applications grow and evolve, you can use the trends in these data points to re-assess your performance requirements. You might eventually need to redesign parts of your applications to maintain or improve their performance.
Principle overview
The process of continuous performance improvement requires robust monitoring tools and strategies. Cloud observability tools can help you to collect key performance indicators (KPIs) such as latency, throughput, error rates, and resource utilization. Cloud environments offer a variety of methods to conduct granular performance assessments across the application, the network, and the end-user experience.
Improving performance is an ongoing effort that requires a multi-faceted approach. The following key mechanisms and processes can help you to boost performance:
- To provide clear direction and help track progress, define performance objectives that align with your business goals. Set SMART goals: specific, measurable, achievable, relevant, and time-bound.
- To measure performance and identify areas for improvement, gather KPI metrics.
- To continuously monitor your systems for issues, use visualized workflows in monitoring tools. Use architecture process mapping techniques to identify redundancies and inefficiencies.
- To create a culture of ongoing improvement, provide training and programs that support your employees' growth.
- To encourage proactive and continuous improvement, incentivize your employees and customers to provide ongoing feedback about your application's performance.
Recommendations
To continuously monitor and improve performance, consider the recommendations in the following sections.
Define clear performance goals and metrics
Define clear performance objectives that align with your business goals. This requires a deep understanding of your application's architecture and the performance requirements of each application component.
As a priority, optimize the most critical components that directly influence your core business functions and user experience. To help ensure that these components continue to run efficiently and meet your business needs, set specific and measurable performance targets. These targets can include response times, error rates, and resource utilization thresholds.
This proactive approach can help you to identify and address potential bottlenecks, optimize resource allocation, and ultimately deliver a seamless and high-performing experience for your users.
Monitor performance
Continuously monitor your cloud systems for performance issues and set up alerts for any potential problems. Monitoring and alerts can help you to catch and fix issues before they affect users. Application profiling can help to identify bottlenecks and can help to optimize resource use.
You can use tools that facilitate effective troubleshooting and network optimization. Use Google Cloud Observability to identify areas that have high CPU consumption, memory consumption, or network consumption. These capabilities can help developers improve efficiency, reduce costs, and enhance the user experience. Network Intelligence Center shows visualizations of the topology of your network infrastructure, and can help you to identify high-latency paths.
Incentivize continuous improvement
Create a culture of ongoing improvement that can benefit both the application and the user experience.
Provide your employees with training and development opportunities that enhance their skills and knowledge in performance techniques across cloud services. Establish a community of practice (CoP) and offer mentorship and coaching programs to support employee growth.
To prevent reactive performance management and encourage proactive performance management, encourage ongoing feedback from your employees, your customers, and your stakeholders. Consider gamifying the process by tracking performance KPIs and regularly presenting those metrics to teams in the form of a league table.
To understand your performance and user happiness over time, we recommend that you measure user feedback quantitatively and qualitatively. The HEART framework can help you capture user feedback across five categories:
- Happiness
- Engagement
- Adoption
- Retention
- Task success
By using such a framework, you can incentivize engineers with data-driven feedback, user-centered metrics, actionable insights, and a clear understanding of goals.
Design for environmental sustainability
This document in the Google Cloud Architecture Framework summarizes how you can approach environmental sustainability for your workloads in Google Cloud. It includes information about how to minimize your carbon footprint on Google Cloud.
Understand your carbon footprint
To understand the carbon footprint from your Google Cloud usage, use the Carbon Footprint dashboard. The Carbon Footprint dashboard attributes emissions to the Google Cloud projects that you own and the cloud services that you use.
Choose the most suitable cloud regions
One effective way to reduce carbon emissions is to choose cloud regions with lower carbon emissions. To help you make this choice, Google publishes carbon data for all Google Cloud regions.
When you choose a region, you might need to balance lowering emissions with other requirements, such as pricing and network latency. To help select a region, use the Google Cloud Region Picker.
Choose the most suitable cloud services
To help reduce your existing carbon footprint, consider migrating your on-premises VM workloads to Compute Engine.
Consider serverless options for workloads that don't need VMs. These managed services often optimize resource usage automatically, reducing costs and carbon footprint.
Minimize idle cloud resources
Idle resources incur unnecessary costs and emissions. Some common causes of idle resources include the following:
- Unused active cloud resources, such as idle VM instances.
- Over-provisioned resources, such as VM instances with larger machine types than a workload requires.
- Non-optimal architectures, such as lift-and-shift migrations that aren't always optimized for efficiency. Consider making incremental improvements to these architectures.
The following are some general strategies to help minimize wasted cloud resources:
- Identify idle or overprovisioned resources and either delete them or rightsize them.
- Refactor your architecture to incorporate a more optimal design.
- Migrate workloads to managed services.
Reduce emissions for batch workloads
Run batch workloads in regions with lower carbon emissions. For further reductions, run workloads at times that coincide with lower grid carbon intensity when possible.
What's next
- Learn how to use Carbon Footprint data to measure, report, and reduce your cloud carbon emissions.
Architecture Framework: AI and ML perspective
This document in the Google Cloud Architecture Framework describes principles and recommendations to help you to design, build, and manage AI and ML workloads in Google Cloud that meet your operational, security, reliability, cost, and performance goals.
The target audience for this document includes decision makers, architects, administrators, developers, and operators who design, build, deploy, and maintain AI and ML workloads in Google Cloud.
The following pages describe principles and recommendations that are specific to AI and ML, for each pillar of the Google Cloud Architecture Framework:
- AI and ML perspective: Operational excellence
- AI and ML perspective: Security
- AI and ML perspective: Reliability
- AI and ML perspective: Cost optimization
- AI and ML perspective: Performance optimization
Contributors
Authors:
- Benjamin Sadik | AI and ML Specialist Customer Engineer
- Filipe Gracio, PhD | Customer Engineer
- Isaac Lo | AI Business Development Manager
- Kamilla Kurta | GenAI/ML Specialist Customer Engineer
- Mohamed Fawzi | Benelux Security and Compliance Lead
- Rick (Rugui) Chen | AI Infrastructure Solutions Architect
- Sannya Dang | AI Solution Architect
Other contributors:
- Daniel Lees | Cloud Security Architect
- Gary Harmson | Customer Engineer
- Jose Andrade | Enterprise Infrastructure Customer Engineer
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
- Radhika Kanakam | Senior Program Manager, Cloud GTM
- Ryan Cox | Principal Architect
- Stef Ruinard | Generative AI Field Solutions Architect
- Wade Holmes | Global Solutions Director
- Zach Seils | Networking Specialist
AI and ML perspective: Operational excellence
This document in the Architecture Framework: AI and ML perspective provides an overview of the principles and recommendations to help you to build and operate robust AI and ML systems on Google Cloud. These recommendations help you to set up foundational elements like observability, automation, and scalability. This document's recommendations align with the operational excellence pillar of the Architecture Framework.
Operational excellence within the AI and ML domain is the ability to seamlessly deploy, manage, and govern the intricate AI and ML systems and pipelines that power your organization's strategic objectives. Operational excellence lets you respond efficiently to changes, reduce operational complexity, and ensure that operations remain aligned with business goals.
Build a robust foundation for model development
Establish a robust foundation to streamline model development, from problem definition to deployment. Such a foundation ensures that your AI solutions are built on reliable and efficient components and choices. This kind of foundation helps you to release changes and improvements quickly and easily.
Consider the following recommendations:
- Define the problem that the AI system solves and the outcome that you want.
- Identify and gather relevant data that's required to train and evaluate your models. Then, clean and preprocess the raw data. Implement data validation checks to ensure data quality and integrity.
- Choose the appropriate ML approach for the task. When you design the structure and parameters of the model, consider the model's complexity and computational requirements.
- Adopt a version control system for code, model, and data.
Automate the model-development lifecycle
From data preparation and training to deployment and monitoring, automation helps you to improve the quality and efficiency of your operations. Automation enables seamless, repeatable, and error-free model development and deployment. Automation minimizes manual intervention, speeds up release cycles, and ensures consistency across environments.
Consider the following recommendations:
- Use a managed pipeline orchestration system to orchestrate and automate the ML workflow. The pipeline must handle the major steps of your development lifecycle: preparation, training, deployment, and evaluation. A minimal pipeline-run sketch follows this list.
- Implement CI/CD pipelines for the model-development lifecycle. These pipelines should automate the building, testing, and deployment of models. The pipelines should also include continuous training to retrain models on new data as needed.
- Implement phased release approaches such as canary deployments or A/B testing, for safe and controlled model releases.
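As one concrete option for managed orchestration, a compiled pipeline can be run on Vertex AI Pipelines through the Vertex AI SDK. The following is a hedged sketch; the project, region, template path, pipeline root bucket, and parameter names are placeholders.

```python
from google.cloud import aiplatform

# Placeholder project and region.
aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="training-pipeline",
    # Compiled pipeline definition (for example, from Kubeflow Pipelines).
    template_path="gs://my-bucket/pipelines/train.json",
    pipeline_root="gs://my-bucket/pipeline-root",
    parameter_values={"dataset_uri": "gs://my-bucket/data/train.csv"},
)

job.run()  # Blocks until the pipeline run completes.
```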
Implement observability
When you implement observability, you can gain deep insights into model performance, data drift, and system health. Implement continuous monitoring, alerting, and logging mechanisms to proactively identify issues, trigger timely responses, and ensure operational continuity.
Consider the following recommendations:
- Implement continuous, automated performance monitoring for your models. Use metrics and success criteria for ongoing evaluation of the model after deployment.
- Monitor your deployment endpoints and infrastructure to ensure service availability.
- Set up custom alerting based on business-specific thresholds and anomalies to ensure that issues are identified and resolved in a timely manner.
- Use explainable AI techniques to understand and interpret model outputs.
Build a culture of operational excellence
Operational excellence is built on a foundation of people, culture, and professional practices. The success of your team and business depends on how effectively your organization implements methodologies that enable the reliable and rapid development of AI capabilities.
Consider the following recommendations:
- Champion automation and standardization as core development methodologies. Streamline your workflows and manage the ML lifecycle efficiently by using MLOps techniques. Automate tasks to free up time for innovation, and standardize processes to support consistency and easier troubleshooting.
- Prioritize continuous learning and improvement. Promote learning opportunities that team members can use to enhance their skills and stay current with AI and ML advancements. Encourage experimentation and conduct regular retrospectives to identify areas for improvement.
- Cultivate a culture of accountability and ownership. Define clear roles so that everyone understands their contributions. Empower teams to make decisions within boundaries and track progress by using transparent metrics.
- Embed AI ethics and safety into the culture. Prioritize responsible systems by integrating ethics considerations into every stage of the ML lifecycle. Establish clear ethics principles and foster open discussions about ethics-related challenges.
Design for scalability
Architect your AI solutions to handle growing data volumes and user demands. Use scalable infrastructure so that your models can adapt and perform optimally as your project expands.
Consider the following recommendations:
- Plan for capacity and quotas. Anticipate future growth, and plan your infrastructure capacity and resource quotas accordingly.
- Prepare for peak events. Ensure that your system can handle sudden spikes in traffic or workload during peak events.
- Scale AI applications for production. Design for horizontal scaling to accommodate increases in the workload. Use frameworks like Ray on Vertex AI to parallelize tasks across multiple machines.
- Use managed services where appropriate. Use services that help you to scale while minimizing the operational overhead and complexity of manual interventions.
Contributors
Authors:
- Sannya Dang | AI Solution Architect
- Filipe Gracio, PhD | Customer Engineer
Other contributors:
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Ryan Cox | Principal Architect
- Stef Ruinard | Generative AI Field Solutions Architect
AI and ML perspective: Security
This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to ensure that your AI and ML deployments meet the security and compliance requirements of your organization. The recommendations in this document align with the security pillar of the Architecture Framework.
Secure deployment of AI and ML workloads is a critical requirement, particularly in enterprise environments. To meet this requirement, you need to adopt a holistic security approach that starts from the initial conceptualization of your AI and ML solutions and extends to development, deployment, and ongoing operations. Google Cloud offers robust tools and services that are designed to help secure your AI and ML workloads.
Define clear goals and requirements
It's easier to integrate the required security and compliance controls early in your design and development process, than to add the controls after development. From the start of your design and development process, make decisions that are appropriate for your specific risk environment and your specific business priorities.
Consider the following recommendations:
- Identify potential attack vectors and adopt a security and compliance perspective from the start. As you design and evolve your AI systems, keep track of the attack surface, potential risks, and obligations that you might face.
- Align your AI and ML security efforts with your business goals and ensure that security is an integral part of your overall strategy. Understand the effects of your security choices on your main business goals.
Keep data secure and prevent loss or mishandling
Data is a valuable and sensitive asset that must be kept secure. Data security helps you to maintain user trust, support your business objectives, and meet your compliance requirements.
Consider the following recommendations:
- Don't collect, keep, or use data that's not strictly necessary for your business goals. If possible, use synthetic or fully anonymized data.
- Monitor data collection, storage, and transformation. Maintain logs for all data access and manipulation activities. The logs help you to audit data access, detect unauthorized access attempts, and prevent unwanted access.
- Implement different levels of access (for example, no-access, read-only, or write) based on user roles. Ensure that permissions are assigned based on the principle of least privilege. Users must have only the minimum permissions that are necessary to let them perform their role activities.
- Implement measures like encryption, secure perimeters, and restrictions on data movement. These measures help you to prevent data exfiltration and data loss.
- Guard against data poisoning for your ML training systems.
Keep AI pipelines secure and robust against tampering
Your AI and ML code and the code-defined pipelines are critical assets. Code that isn't secured can be tampered with, which can lead to data leaks, compliance failure, and disruption of critical business activities. Keeping your AI and ML code secure helps to ensure the integrity and value of your models and model outputs.
Consider the following recommendations:
- Use secure coding practices, such as dependency management or input validation and sanitization, during model development to prevent vulnerabilities.
- Protect your pipeline code and your model artifacts, like files, model weights, and deployment specifications, from unauthorized access. Implement different access levels for each artifact based on user roles and needs.
- Enforce lineage and tracking of your assets and pipeline runs. This enforcement helps you to meet compliance requirements and to avoid compromising production systems.
Deploy on secure systems with secure tools and artifacts
Ensure that your code and models run in a secure environment that has a robust access control system with security assurances for the tools and artifacts that are deployed in the environment.
Consider the following recommendations:
- Train and deploy your models in a secure environment that has appropriate access controls and protection against unauthorized use or manipulation.
- Follow standard Supply-chain Levels for Software Artifacts (SLSA) guidelines for your AI-specific artifacts, like models and software packages.
- Prefer using validated prebuilt container images that are specifically designed for AI workloads.
Protect and monitor inputs
AI systems need inputs to make predictions, generate content, or automate actions. Some inputs might pose risks or be used as attack vectors that must be detected and sanitized. Detecting potential malicious inputs early helps you to keep your AI systems secure and operating as intended.
Consider the following recommendations:
- Implement secure practices to develop and manage prompts for generative AI systems, and ensure that the prompts are screened for harmful intent.
- Monitor inputs to predictive or generative systems to prevent issues like overloaded endpoints or prompts that the systems aren't designed to handle.
- Ensure that only the intended users of a deployed system can use it.
Monitor, evaluate, and prepare to respond to outputs
AI systems deliver value because they produce outputs that augment, optimize, or automate human decision-making. To maintain the integrity and trustworthiness of your AI systems and applications, you need to make sure that the outputs are secure and within expected parameters. You also need a plan to respond to incidents.
Consider the following recommendations:
- Monitor the outputs of your AI and ML models in production, and identify any performance, security, and compliance issues.
- Evaluate model performance by implementing robust metrics and security measures, like identifying out-of-scope generative responses or extreme outputs in predictive models. Collect user feedback on model performance.
- Implement robust alerting and incident response procedures to address any potential issues.
Contributors
Authors:
- Kamilla Kurta | GenAI/ML Specialist Customer Engineer
- Filipe Gracio, PhD | Customer Engineer
- Mohamed Fawzi | Benelux Security and Compliance Lead
Other contributors:
- Daniel Lees | Cloud Security Architect
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Wade Holmes | Global Solutions Director
AI and ML perspective: Reliability
This document in the Architecture Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Architecture Framework.
In the fast-evolving AI and ML landscape, reliable systems are essential for ensuring customer satisfaction and achieving business goals. You need AI and ML systems that are robust, reliable, and adaptable to meet the unique demands of both predictive ML and generative AI. To handle the complexities of MLOps—from development to deployment and continuous improvement—you need to use a reliability-first approach. Google Cloud offers a purpose-built AI infrastructure that's aligned with Site Reliability Engineering (SRE) principles and provides a powerful foundation for reliable AI and ML systems.
Ensure that infrastructure is scalable and highly available
By architecting for scalability and availability, you enable your applications to handle varying levels of demand without service disruptions or performance degradation. This means that your AI services are still available to users during infrastructure outages and when traffic is very high.
Consider the following recommendations:
- Design your AI systems with automatic and dynamic scaling capabilities to handle fluctuations in demand. This helps to ensure optimal performance, even during traffic spikes.
- Manage resources proactively and anticipate future needs through load testing and performance monitoring. Use historical data and predictive analytics to make informed decisions about resource allocation.
- Design for high availability and fault tolerance by adopting the multi-zone and multi-region deployment archetypes in Google Cloud and by implementing redundancy and replication.
- Distribute incoming traffic across multiple instances of your AI and ML services and endpoints. Load balancing helps to prevent any single instance from being overloaded and helps to ensure consistent performance and availability.
Use a modular and loosely coupled architecture
To make your AI systems resilient to failures in individual components, use a modular architecture. For example, design the data processing and data validation components as separate modules. When a particular component fails, the modular architecture helps to minimize downtime and lets your teams develop and deploy fixes faster.
Consider the following recommendations:
- Separate your AI and ML system into small self-contained modules or components. This approach promotes code reusability, simplifies testing and maintenance, and lets you develop and deploy individual components independently.
- Design the loosely coupled modules with well-defined interfaces. This approach minimizes dependencies, and it lets you make independent updates and changes without impacting the entire system.
- Plan for graceful degradation. When a component fails, the other parts of the system must continue to provide an adequate level of functionality.
- Use APIs to create clear boundaries between modules and to hide the module-level implementation details. This approach lets you update or replace individual components without affecting interactions with other parts of the system.
Build an automated MLOps platform
With an automated MLOps platform, the stages and outputs of your model lifecycle are more reliable. By promoting consistency, loose coupling, and modularity, and by expressing operations and infrastructure as code, you remove fragile manual steps and maintain AI and ML systems that are more robust and reliable.
Consider the following recommendations:
- Automate the model development lifecycle, from data preparation and validation to model training, evaluation, deployment, and monitoring.
- Manage your infrastructure as code (IaC). This approach enables efficient version control, quick rollbacks when necessary, and repeatable deployments.
- Validate that your models behave as expected with relevant data. Automate performance monitoring of your models, and build appropriate alerts for unexpected outputs.
- Validate the inputs and outputs of your AI and ML pipelines. For example, validate data, configurations, command arguments, files, and predictions. Configure alerts for unexpected or unallowed values.
- Adopt a managed version-control strategy for your model endpoints. This kind of strategy enables incremental releases and quick recovery in the event of problems.
Maintain trust and control through data and model governance
The reliability of AI and ML systems depends on the trust and governance capabilities of your data and models. AI outputs can fail to meet expectations in silent ways. For example, the outputs might be well-formed, but incorrect or unwanted. By implementing traceability and strong governance, you can ensure that the outputs are reliable and trustworthy.
Consider the following recommendations:
- Use a data and model catalog to track and manage your assets effectively. To facilitate tracing and audits, maintain a comprehensive record of data and model versions throughout the lifecycle.
- Implement strict access controls and audit trails to protect sensitive data and models.
- Address the critical issue of bias in AI, particularly in generative AI applications. To build trust, strive for transparency and explainability in model outputs.
- Automate the generation of feature statistics and implement anomaly detection to proactively identify data issues. To ensure model reliability, establish mechanisms to detect and mitigate the impact of changes in data distributions.
Implement holistic AI and ML observability and reliability practices
To continuously improve your AI operations, you need to define meaningful reliability goals and measure progress. Observability is a foundational element of reliable systems. Observability lets you manage ongoing operations and critical events. Well-implemented observability helps you to build and maintain a reliable service for your users.
Consider the following recommendations:
- Track infrastructure metrics for processors (CPUs, GPUs, and TPUs) and for other resources like memory usage, network latency, and disk usage. Perform load testing and performance monitoring. Use the test results and metrics from monitoring to manage scaling and capacity for your AI and ML systems.
- Establish reliability goals and track application metrics. Measure metrics like throughput and latency for the AI applications that you build. Monitor the usage patterns of your applications and the exposed endpoints.
- Establish model-specific metrics, like accuracy or safety indicators, to evaluate model reliability. Track these metrics over time to identify any drift or degradation. For efficient version control and automation, define the monitoring configurations as code.
- Define and track business-level metrics to understand the impact of your models and reliability on business outcomes. To measure the reliability of your AI and ML services, consider adopting the SRE approach and define service level objectives (SLOs).
Contributors
Authors:
- Rick (Rugui) Chen | AI Infrastructure Solutions Architect
- Filipe Gracio, PhD | Customer Engineer
Other contributors:
- Jose Andrade | Enterprise Infrastructure Customer Engineer
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
AI and ML perspective: Cost optimization
This document in Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to optimize the cost of your AI systems throughout the ML lifecycle. By adopting a proactive and informed cost management approach, your organization can realize the full potential of AI and ML systems and also maintain financial discipline. The recommendations in this document align with the cost optimization pillar of the Architecture Framework.
AI and ML systems can help you to unlock valuable insights and predictive capabilities from data. For example, you can reduce friction in internal processes, improve user experiences, and gain deeper customer insights. The cloud offers vast amounts of resources and quick time-to-value without large up-front investments for AI and ML workloads. To maximize business value and to align the spending with your business goals, you need to understand the cost drivers, proactively optimize costs, set up spending controls, and adopt FinOps practices.
Define and measure costs and returns
To effectively manage your AI and ML costs in Google Cloud, you must define and measure the expenses for cloud resources and the business value of your AI and ML initiatives. Google Cloud provides comprehensive tools for billing and cost management to help you to track expenses granularly. Business value metrics that you can measure include customer satisfaction, revenue, and operational costs. By establishing concrete metrics for both costs and business value, you can make informed decisions about resource allocation and optimization.
Consider the following recommendations:
- Establish clear business objectives and key performance indicators (KPIs) for your AI and ML projects.
- Use the billing information provided by Google Cloud to implement cost monitoring and reporting processes that can help you to attribute costs to specific AI and ML activities; see the example query after this list.
- Establish dashboards, alerting, and reporting systems to track costs and returns against KPIs.
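The following sketch shows one way to attribute costs to AI and ML activities by querying a standard Cloud Billing BigQuery export for resource labels. The project, table name, and the ml-workload label key are assumptions; adjust them to match your billing export and labeling convention.

```python
# Sketch: attribute the last 30 days of cost to ML workloads by label.
# The table name and label key below are placeholder assumptions.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  label.value AS ml_workload,
  SUM(cost) AS total_cost
FROM `my-project.billing_dataset.gcp_billing_export_v1_XXXXXX`,
  UNNEST(labels) AS label
WHERE label.key = 'ml-workload'
  AND usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY ml_workload
ORDER BY total_cost DESC
"""

for row in client.query(query).result():
    print(f"{row.ml_workload}: ${row.total_cost:.2f}")
```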
Optimize resource allocation
To achieve cost efficiency for your AI and ML workloads in Google Cloud, you must optimize resource allocation. By carefully aligning resource allocation with the needs of your workloads, you can avoid unnecessary expenses and ensure that your AI and ML systems have the resources that they need to perform optimally.
Consider the following recommendations:
- Use autoscaling to dynamically adjust resources for training and inference; see the deployment sketch after this list.
- Start with small models and data. Save costs by testing hypotheses at a smaller scale when possible.
- Discover your compute needs through experimentation. Rightsize the resources that are used for training and serving based on your ML requirements.
- Adopt MLOps practices to reduce duplication, manual processes, and inefficient resource allocation.
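For example, the following sketch deploys a registered model to a Vertex AI endpoint with autoscaling bounds, so serving capacity follows traffic instead of being provisioned for peak load. The project, region, model ID, machine type, and replica limits are illustrative assumptions.

```python
# Sketch: serve a model with autoscaling bounds on Vertex AI.
# Project, region, model ID, and sizing values are assumptions.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(
    "projects/my-project/locations/us-central1/models/MODEL_ID"
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,  # scale down during quiet periods
    max_replica_count=5,  # cap spend during traffic spikes
)
print(endpoint.resource_name)
```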
Enforce data management and governance practices
Effective data management and governance practices play a critical role in cost optimization. Well-organized data helps your organization to avoid needless duplication, reduces the effort required to obtain high-quality data, and encourages teams to reuse datasets. By proactively managing data, you can reduce storage costs, enhance data quality, and ensure that your ML models are trained and operate on the most relevant and valuable data.
Consider the following recommendations:
- Establish and adopt a well-defined data governance framework.
- Apply labels and relevant metadata to datasets at the point of data ingestion; see the labeling example after this list.
- Ensure that datasets are discoverable and accessible across the organization.
- Make your datasets and features reusable throughout the ML lifecycle wherever possible.
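The following sketch illustrates labeling at ingestion for a BigQuery dataset. The dataset ID and label values are assumptions; the important point is that governance metadata is attached as soon as the data lands.

```python
# Sketch: attach governance labels to a BigQuery dataset at ingestion.
# Dataset ID and label values are placeholder assumptions.
from google.cloud import bigquery

client = bigquery.Client()

dataset = client.get_dataset("my-project.customer_churn_raw")
dataset.labels = {
    "owner": "churn-ml-team",
    "data-sensitivity": "confidential",
    "lifecycle-stage": "raw",
}
client.update_dataset(dataset, ["labels"])  # update only the labels field
```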
Automate and streamline with MLOps
A primary benefit of adopting MLOps practices is a reduction in costs, both in technology spending and in the time that your personnel spend on ML activities. Automation helps you to avoid duplication of ML activities and improve the productivity of data scientists and ML engineers.
Consider the following recommendations:
- Increase the level of automation and standardization in your data collection and processing technologies to reduce development effort and time.
- Develop automated training pipelines to reduce the need for manual interventions and increase engineer productivity; a minimal pipeline sketch follows this list. Implement mechanisms for the pipelines to reuse existing assets like prepared datasets and trained models.
- Use the model evaluation and tuning services in Google Cloud to increase model performance with fewer iterations. This enables your AI and ML teams to achieve more objectives in less time.
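The following sketch outlines an automated training pipeline defined with the Kubeflow Pipelines (kfp) SDK, which you can compile and run on Vertex AI Pipelines. The component bodies, names, and URIs are illustrative placeholders; a real pipeline would also reuse prepared datasets and registered models between runs.

```python
# Sketch: a two-step training pipeline with the kfp SDK.
# Component logic and names are placeholder assumptions.
from kfp import compiler, dsl

@dsl.component
def prepare_data(source_uri: str) -> str:
    # Placeholder: read, validate, and materialize a training dataset.
    return f"{source_uri}/prepared"

@dsl.component
def train_model(dataset_uri: str) -> str:
    # Placeholder: train a model and return its artifact location.
    return f"{dataset_uri}/model"

@dsl.pipeline(name="churn-training-pipeline")
def training_pipeline(source_uri: str):
    data_task = prepare_data(source_uri=source_uri)
    train_model(dataset_uri=data_task.output)

# Compile to a spec that Vertex AI Pipelines can run on a schedule.
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```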
Use managed services and pre-trained or existing models
There are many approaches to achieving business goals by using AI and ML. Adopt an incremental approach to model selection and model development, which helps you to avoid the excessive costs of starting from scratch every time. To control costs, start with a simple approach: use ML frameworks, managed services, and pre-trained models.
Consider the following recommendations:
- Enable exploratory and quick ML experiments by using notebook environments.
- Use existing and pre-trained models as a starting point to accelerate your model selection and development process.
- Use managed services to train or serve your models. Both AutoML and managed custom model training services can help to reduce the cost of model training. Managed services can also help to reduce the cost of your model-serving infrastructure.
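For example, the following sketch trains a tabular classification model with Vertex AI AutoML instead of building custom training infrastructure. The dataset ID, target column, and training budget are illustrative assumptions.

```python
# Sketch: managed AutoML training on Vertex AI.
# Dataset ID, target column, and budget are placeholder assumptions.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.TabularDataset(
    "projects/my-project/locations/us-central1/datasets/DATASET_ID"
)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="churn-automl",
    optimization_prediction_type="classification",
)
model = job.run(
    dataset=dataset,
    target_column="churned",
    budget_milli_node_hours=1000,  # cap training spend at one node hour
)
print(model.resource_name)
```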
Foster a culture of cost awareness and continuous optimization
Cultivate a collaborative environment that encourages communication and regular reviews. This approach helps teams to identify and implement cost-saving opportunities throughout the ML lifecycle.
Consider the following recommendations:
- Adopt FinOps principles across your ML lifecycle.
- Ensure that all costs and business benefits of AI and ML projects have assigned owners with clear accountability.
Contributors
Authors:
- Isaac Lo | AI Business Development Manager
- Filipe Gracio, PhD | Customer Engineer
Other contributors:
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Nicolas Pintaux | Customer Engineer, Application Modernization Specialist
AI and ML perspective: Performance optimization
This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to help you to optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the performance optimization pillar of the Architecture Framework.
AI and ML systems enable new automation and decision-making capabilities for your organization. The performance of these systems can directly affect your business drivers like revenue, costs, and customer satisfaction. To realize the full potential of your AI and ML systems, you need to optimize their performance based on your business goals and technical requirements. The performance optimization process often involves certain trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations like costs.
To optimize AI and ML performance, you need to make decisions regarding factors like the model architecture, parameters, and training strategy. When you make these decisions, consider the entire lifecycle of the AI and ML systems and their deployment environment. For example, very large LLMs can be highly performant on massive training infrastructure, but they might not perform well in capacity-constrained environments like mobile devices.
Translate business goals to performance objectives
To make architectural decisions that optimize performance, start with a clear set of business goals. Design AI and ML systems that provide the technical performance that's required to support your business goals and priorities. Your technical teams must understand the mapping between performance objectives and business goals.
Consider the following recommendations:
- Translate business objectives into technical requirements: Translate the business objectives of your AI and ML systems into specific technical performance requirements and assess the effects of not meeting the requirements. For example, for an application that predicts customer churn, the ML model should perform well on standard metrics, like accuracy and recall, and the application should meet operational requirements like low latency.
- Monitor performance at all stages of the model lifecycle: During experimentation and training, and after model deployment, monitor your key performance indicators (KPIs) and watch for any deviations from business objectives.
- Automate evaluation to make it reproducible and standardized: With a standardized and comparable platform and methodology for experiment evaluation, your engineers can increase the pace of performance improvement.
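The following sketch shows a minimal evaluation gate in the spirit of the last two recommendations: it encodes business requirements as hard technical thresholds that every candidate model must pass before deployment. The threshold values are assumptions for a hypothetical churn model.

```python
# Sketch: a standardized evaluation gate. Thresholds are assumptions.
def evaluation_gate(recall: float, accuracy: float, p95_latency_ms: float) -> None:
    """Raise an error if the candidate model misses any requirement."""
    requirements = {
        "recall >= 0.80": recall >= 0.80,
        "accuracy >= 0.90": accuracy >= 0.90,
        "p95 latency <= 100 ms": p95_latency_ms <= 100.0,
    }
    failures = [name for name, passed in requirements.items() if not passed]
    if failures:
        raise ValueError(f"Model rejected, failed requirements: {failures}")

# Running the same gate for every candidate keeps evaluations
# reproducible and comparable across experiments.
evaluation_gate(recall=0.84, accuracy=0.93, p95_latency_ms=72.0)
```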
Run and track frequent experiments
To transform innovation and creativity into performance improvements, you need a culture and a platform that support experimentation. Performance improvement is an ongoing process because AI and ML technologies are developing continuously and quickly. To maintain a fast-paced, iterative process, separate the experimentation space from your training and serving platforms, and establish a standardized, robust experimentation process.
Consider the following recommendations:
- Build an experimentation environment: Performance improvements require a dedicated, powerful, and interactive environment that supports the experimentation and collaborative development of ML pipelines.
- Embed experimentation as a culture: Run experiments before any production deployment. Release new versions iteratively and always collect performance data. Experiment with different data types, feature transformations, algorithms, and hyperparameters.
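As one way to standardize experiment tracking, the following sketch logs parameters and metrics to Vertex AI Experiments so that runs stay comparable across iterations. The project, experiment name, parameters, and metric values are illustrative assumptions.

```python
# Sketch: track an experiment run with Vertex AI Experiments.
# Project, experiment, and run names are placeholder assumptions.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    experiment="churn-model-experiments",
)

aiplatform.start_run("run-lr-0p01")
aiplatform.log_params({"learning_rate": 0.01, "batch_size": 128})
# ... train and evaluate the candidate model here ...
aiplatform.log_metrics({"recall": 0.84, "p95_latency_ms": 72.0})
aiplatform.end_run()
```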
Build and automate training and serving services
Training and serving AI models are core components of your AI services. You need robust platforms and practices that support fast and reliable creation, deployment, and serving of AI models. Invest time and effort to create foundational platforms for your core AI training and serving tasks. These foundational platforms help to reduce time and effort for your teams and improve the quality of outputs in the medium and long term.
Consider the following recommendations:
- Use AI-specialized components of a training service: Such components include high-performance compute and MLOps components like feature stores, model registries, metadata stores, and model performance-evaluation services.
- Use AI-specialized components of a prediction service: Such components provide high-performance and scalable resources, support feature monitoring, and enable model performance monitoring. To prevent and manage performance degradation, implement reliable deployment and rollback strategies.
Match design choices to performance requirements
When you make design choices to improve performance, carefully assess whether the choices support your business requirements or are wasteful and counterproductive. To choose the appropriate infrastructure, models, or configurations, identify performance bottlenecks and assess how they're linked to your performance measures. For example, even on very powerful GPU accelerators, your training tasks can experience performance bottlenecks due to data I/O issues from the storage layer or due to performance limitations of the model itself. The sketch after the following list shows one way to isolate such bottlenecks.
Consider the following recommendations:
- Optimize hardware consumption based on performance goals: To train and serve ML models that meet your performance requirements, you need to optimize infrastructure at the compute, storage, and network layers. You must measure and understand the variables that affect your performance goals. These variables are different for training and inference.
- Focus on workload-specific requirements: Focus your performance optimization efforts on the unique requirements of your AI and ML workloads. Rely on managed services for the performance of the underlying infrastructure.
- Choose appropriate training strategies: Many pre-trained and foundation models are available, and new models are released frequently. Choose a training strategy that can deliver optimal performance for your task. Decide whether you should build your own model, tune a pre-trained model on your data, or use a pre-trained model API.
- Recognize that performance-optimization strategies can have diminishing returns: When a particular performance-optimization strategy doesn't provide incremental business value that's measurable, stop pursuing that strategy.
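To make the bottleneck discussion concrete, the following sketch times the data-loading and compute phases of a training loop separately, which helps you decide whether to optimize storage and the input pipeline or the model itself. The loader and step functions are illustrative placeholders.

```python
# Sketch: split training time into data I/O and compute to find the
# bottleneck. The loader and the step function are placeholder stand-ins.
import time

def profile_training(loader, train_step, num_batches: int = 50) -> None:
    data_time = compute_time = 0.0
    batches = iter(loader)
    for _ in range(num_batches):
        t0 = time.perf_counter()
        batch = next(batches)  # time spent waiting on data I/O
        t1 = time.perf_counter()
        train_step(batch)      # time spent in the actual model step
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
    total = data_time + compute_time
    print(f"data: {data_time / total:.0%}, compute: {compute_time / total:.0%}")
    # If the data share dominates, faster accelerators won't help;
    # optimize storage and the input pipeline first.

def fake_loader():
    while True:
        time.sleep(0.005)  # simulate data I/O latency
        yield [0] * 32

profile_training(fake_loader(), lambda batch: time.sleep(0.002))
```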
Link performance metrics to design and configuration choices
To innovate, troubleshoot, and investigate performance issues, establish a clear link between design choices and performance outcomes. In addition to experimentation, you must reliably record the lineage of your assets, deployments, model outputs, and the configurations and inputs that produced the outputs.
Consider the following recommendations:
- Build a data and model lineage system: All of your deployed assets and their performance metrics must be linked back to the data, configurations, code, and the choices that resulted in the deployed systems. In addition, model outputs must be linked to the specific model versions, inputs, and configurations that produced them; see the lineage sketch after this list.
- Use explainability tools to improve model performance: Adopt and standardize tools and benchmarks for model exploration and explainability. These tools help your ML engineers understand model behavior and improve performance or remove biases.
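The following sketch records a minimal lineage entry that links a model version back to the data, code, and configuration that produced it. The field names and the JSON-file store are assumptions; a production system would typically use a metadata store such as Vertex ML Metadata.

```python
# Sketch: append a lineage record per deployment. Field names and the
# JSON-lines store are placeholder assumptions.
import hashlib
import json
from datetime import datetime, timezone

def lineage_record(model_version: str, dataset_uri: str,
                   git_commit: str, config: dict) -> dict:
    config_blob = json.dumps(config, sort_keys=True).encode()
    return {
        "model_version": model_version,
        "dataset_uri": dataset_uri,
        "git_commit": git_commit,
        "config_hash": hashlib.sha256(config_blob).hexdigest(),
        "config": config,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

record = lineage_record(
    model_version="churn-model@v7",
    dataset_uri="gs://my-bucket/datasets/churn/2025-01",
    git_commit="a1b2c3d",
    config={"learning_rate": 0.01, "features": ["tenure", "plan"]},
)
with open("lineage.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```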
Contributors
Authors:
- Benjamin Sadik | AI and ML Specialist Customer Engineer
- Filipe Gracio, PhD | Customer Engineer
Other contributors:
- Kumar Dhanagopal | Cross-Product Solution Developer
- Marwan Al Shawi | Partner Customer Engineer
- Zach Seils | Networking Specialist