Google Cloud Architecture Framework

Last reviewed 2024-12-06 UTC

The Google Cloud Architecture Framework provides recommendations to help architects, developers, administrators, and other cloud practitioners design and operate a cloud topology that's secure, efficient, resilient, high-performing, and cost-effective. The Google Cloud Architecture Framework is our version of a well-architected framework.

A cross-functional team of experts at Google validates the recommendations in the Architecture Framework. The team curates the Architecture Framework to reflect the expanding capabilities of Google Cloud, industry best practices, community knowledge, and feedback from you. For a summary of the significant changes to the Architecture Framework, see What's new.

The Architecture Framework is relevant to applications built for the cloud and for workloads migrated from on-premises to Google Cloud, hybrid cloud deployments, and multi-cloud environments.

Architecture Framework pillars and perspectives

The Google Cloud Architecture Framework is organized into five pillars, as shown in the following diagram. We also provide cross-pillar perspectives that focus on recommendations for selected domains, industries, and technologies like AI and machine learning (ML).

Figure: Google Cloud Architecture Framework.

Pillars

Operational excellence
Efficiently deploy, operate, monitor, and manage your cloud workloads.
Security, privacy, and compliance
Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.
Reliability
Design and operate resilient and highly available workloads in the cloud.
Cost optimization
Maximize the business value of your investment in Google Cloud.
Performance optimization
Design and tune your cloud resources for optimal performance.

Perspectives

AI and ML
A cross-pillar view of recommendations that are specific to AI and ML workloads.

Core principles

Before you explore the recommendations in each pillar of the Architecture Framework, review the following core principles:

Design for change

No system is static. The needs of its users, the goals of the team that builds the system, and the system itself are constantly changing. With the need for change in mind, build a development and production process that enables teams to regularly deliver small changes and get fast feedback on those changes. Consistently demonstrating the ability to deploy changes helps to build trust with stakeholders, including the teams responsible for the system, and the users of the system. Using DORA's software delivery metrics can help your team monitor the speed, ease, and safety of making changes to the system.
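As an illustration of monitoring delivery speed and safety, the following minimal sketch computes three of the software delivery metrics that DORA describes: deployment frequency, lead time for changes, and change failure rate. The `Deployment` record and the `delivery_metrics` function are hypothetical names for this example. The sketch assumes that you can export commit and deployment timestamps from your own delivery tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when the change reached production
    failed: bool            # whether the change required remediation


def delivery_metrics(deployments: Sequence[Deployment], window_days: int) -> dict:
    """Summarize delivery speed and safety over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": sum(d.failed for d in deployments) / len(deployments),
    }
```

Tracking these values over successive windows, rather than reading a single snapshot, shows whether changes are becoming faster and safer to make.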

Document your architecture

When you start to move your workloads to the cloud or build your applications, lack of documentation about the system can be a major obstacle. Documentation is especially important for correctly visualizing the architecture of your current deployments.

Quality documentation isn't achieved by producing a specific amount of documentation, but by how clear the content is, how useful it is, and how well it's maintained as the system changes.

A properly documented cloud architecture establishes a common language and standards, which enable cross-functional teams to communicate and collaborate effectively. The documentation also provides the information that's necessary to identify and guide future design decisions. Documentation should be written with your use cases in mind, to provide context for the design decisions.

Over time, your design decisions will evolve and change. The change history provides the context that your teams require to align initiatives, avoid duplication, and measure performance changes effectively over time. Change logs are particularly valuable when you onboard a new cloud architect who is not yet familiar with your current design, strategy, or history.

Analysis by DORA has found a clear link between documentation quality and organizational performance, which is the organization's ability to meet its performance and profitability goals.

Simplify your design and use fully managed services

Simplicity is crucial for design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.

If you're already running your workloads in production, test with managed services to see how they might help to reduce operational complexities. If you're developing new workloads, then start simple, establish a minimal viable product (MVP), and resist the urge to over-engineer. You can identify exceptional use cases, iterate, and improve your systems incrementally over time.

Decouple your architecture

Research from DORA shows that architecture is an important predictor for achieving continuous delivery. Decoupling is a technique that's used to separate your applications and service components into smaller components that can operate independently. For example, you might separate a monolithic application stack into individual service components. In a loosely coupled architecture, an application can run its functions independently, regardless of the various dependencies.

A decoupled architecture gives you increased flexibility to do the following:

  • Apply independent upgrades.
  • Enforce specific security controls.
  • Establish reliability goals for each subsystem.
  • Monitor health.
  • Granularly control performance and cost parameters.

You can start the decoupling process early in your design phase or incorporate it as part of your system upgrades as you scale.

Use a stateless architecture

A stateless architecture can increase both the reliability and scalability of your applications.

Stateful applications rely on various dependencies to perform tasks, such as local caching of data. Stateful applications often require additional mechanisms to capture progress and restart gracefully. Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.
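The following minimal sketch illustrates the stateless pattern. The request handler keeps no data between calls. All session state lives in a store that's passed in. In this example, a plain dictionary stands in for a shared storage or cache service, and the `handle_add_to_cart` function and its parameters are hypothetical.

```python
from typing import List, MutableMapping


def handle_add_to_cart(
    store: MutableMapping[str, List[str]], session_id: str, item: str
) -> List[str]:
    """Add an item to a cart without keeping any state in the handler itself.

    `store` stands in for a shared storage or cache service. Because the
    handler reads and writes only through `store`, any instance of the
    application can serve any request, and a restarted instance loses nothing.
    """
    cart = list(store.get(session_id, []))  # copy so the stored value isn't mutated in place
    cart.append(item)
    store[session_id] = cart
    return cart
```

Because the handler has no local dependencies, you can add or replace instances freely. The shared store is the only component that needs to preserve data across restarts.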

Google Cloud Architecture Framework: Operational excellence

The operational excellence pillar in the Google Cloud Architecture Framework provides recommendations to operate workloads efficiently on Google Cloud. Operational excellence in the cloud involves designing, implementing, and managing cloud solutions that provide value, performance, security, and reliability. The recommendations in this pillar help you to continuously improve and adapt workloads to meet the dynamic and ever-evolving needs in the cloud.

The operational excellence pillar is relevant to the following audiences:

  • Managers and leaders: A framework to establish and maintain operational excellence in the cloud and to ensure that cloud investments deliver value and support business objectives.
  • Cloud operations teams: Guidance to manage incidents and problems, plan capacity, optimize performance, and manage change.
  • Site reliability engineers (SREs): Best practices that help you to achieve high levels of service reliability, including monitoring, incident response, and automation.
  • Cloud architects and engineers: Operational requirements and best practices for the design and implementation phases, to help ensure that solutions are designed for operational efficiency and scalability.
  • DevOps teams: Guidance about automation, CI/CD pipelines, and change management, to help enable faster and more reliable software delivery.

To achieve operational excellence, you should embrace automation, orchestration, and data-driven insights. Automation helps to eliminate toil. It also streamlines and builds guardrails around repetitive tasks. Orchestration helps to coordinate complex processes. Data-driven insights enable evidence-based decision-making. By using these practices, you can optimize cloud operations, reduce costs, improve service availability, and enhance security.

Operational excellence in the cloud goes beyond technical proficiency in cloud operations. It includes a cultural shift that encourages continuous learning and experimentation. Teams must be empowered to innovate, iterate, and adopt a growth mindset. A culture of operational excellence fosters a collaborative environment where individuals are encouraged to share ideas, challenge assumptions, and drive improvement.

For operational excellence principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Operational excellence in the Architecture Framework.

Core principles

The recommendations in the operational excellence pillar of the Architecture Framework are mapped to the following core principles:

Ensure operational readiness and performance using CloudOps

This principle in the operational excellence pillar of the Google Cloud Architecture Framework helps you to ensure operational readiness and performance of your cloud workloads. It emphasizes establishing clear expectations and commitments for service performance, implementing robust monitoring and alerting, conducting performance testing, and proactively planning for capacity needs.

Principle overview

Different organizations might interpret operational readiness differently. Operational readiness is how your organization prepares to successfully operate workloads on Google Cloud. Preparing to operate a complex, multilayered cloud workload requires careful planning for both go-live and day-2 operations. These operations are often called CloudOps.

Focus areas of operational readiness

Operational readiness consists of four focus areas. Each focus area consists of a set of activities and components that are necessary to prepare to operate a complex application or environment in Google Cloud. The following table lists the components and activities of each focus area:

Focus area of operational readiness: activities and components

Workforce
  • Defining clear roles and responsibilities for the teams that manage and operate the cloud resources.
  • Ensuring that team members have appropriate skills.
  • Developing a learning program.
  • Establishing a clear team structure.
  • Hiring the required talent.
Processes
  • Observability.
  • Managing service disruptions.
  • Cloud delivery.
  • Core cloud operations.
Tooling
  • Tools that are required to support CloudOps processes.
Governance
  • Service levels and reporting.
  • Cloud financials.
  • Cloud operating model.
  • Architectural review and governance boards.
  • Cloud architecture and compliance.

Recommendations

To ensure operational readiness and performance by using CloudOps, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Define SLOs and SLAs

A core responsibility of the cloud operations team is to define service level objectives (SLOs) and service level agreements (SLAs) for all of the critical workloads. This recommendation is relevant to the governance focus area of operational readiness.

SLOs must be specific, measurable, achievable, relevant, and time-bound (SMART), and they must reflect the level of service and performance that you want.

  • Specific: Clearly articulates the required level of service and performance.
  • Measurable: Quantifiable and trackable.
  • Achievable: Attainable within the limits of your organization's capabilities and resources.
  • Relevant: Aligned with business goals and priorities.
  • Time-bound: Has a defined timeframe for measurement and evaluation.

For example, an SLO for a web application might be "99.9% availability" or "average response time less than 200 ms." Such SLOs clearly define the required level of service and performance for the web application, and the SLOs can be measured and tracked over time.
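To make an availability SLO concrete, you can translate it into an error budget, which is the amount of downtime that the target permits over a measurement window. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and the 30-day window in the usage note is an assumption rather than a recommendation.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime that an availability SLO allows over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = error_budget(slo_target, window)
    if budget == timedelta(0):
        # A 100% target leaves no budget: any downtime at all overspends it.
        return 0.0 if downtime == timedelta(0) else float("-inf")
    return 1.0 - downtime / budget
```

For example, `error_budget(0.999, timedelta(days=30))` evaluates to about 43 minutes, so a single 10-minute outage in that window consumes roughly a quarter of the budget.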

SLAs outline the commitments to customers regarding service availability, performance, and support. SLAs must include specific details about the services that are provided, the level of service that can be expected, the responsibilities of both the service provider and the customer, and any penalties or remedies for noncompliance. SLAs serve as a contractual agreement between the two parties, ensuring that both have a clear understanding of the expectations and obligations that are associated with the cloud service.

Google Cloud provides tools like Cloud Monitoring to help you define and track SLOs by using service level indicators (SLIs). Cloud Monitoring provides comprehensive monitoring and observability capabilities that enable your organization to collect and analyze metrics that are related to the availability, performance, and latency of cloud-based applications and services. SLIs are specific metrics that you can use to measure and track SLOs over time. By using Cloud Monitoring and SLIs, you can effectively monitor and manage cloud services, and ensure that they meet the SLOs and SLAs.

Clearly defining and communicating SLOs and SLAs for all of your critical cloud services helps to ensure reliability and performance of your deployed applications and services.

Implement comprehensive observability

To get real-time visibility into the health and performance of your cloud environment, we recommend that you use a combination of Google Cloud Observability tools and third-party solutions. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

Implementing a combination of observability solutions provides you with a comprehensive observability strategy that covers various aspects of your cloud infrastructure and applications. Google Cloud Observability is a unified platform for collecting, analyzing, and visualizing metrics, logs, and traces from various Google Cloud services, applications, and external sources. By using Cloud Monitoring, you can gain insights into resource utilization, performance characteristics, and overall health of your resources.

To ensure comprehensive monitoring, monitor important metrics that align with system health indicators such as CPU utilization, memory usage, network traffic, disk I/O, and application response times. You must also consider business-specific metrics. By tracking these metrics, you can identify potential bottlenecks, performance issues, and resource constraints. Additionally, you can set up alerts to notify relevant teams proactively about potential issues or anomalies.
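The following minimal sketch shows one way to reason about alerting on such metrics: an alert fires only when a metric stays above its threshold for several consecutive samples, which helps to avoid notifications for brief spikes. The `AlertRule` type, its fields, and the `should_alert` function are hypothetical and don't represent the Cloud Monitoring API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule for a single metric."""
    metric: str           # for example, "cpu_utilization"
    threshold: float      # a sample breaches the rule when it exceeds this value
    min_consecutive: int  # breaches required in a row before the alert fires


def should_alert(rule: AlertRule, samples: Sequence[float]) -> bool:
    """Return True if the samples contain a long enough run of breaches."""
    streak = 0
    for value in samples:
        # Extend the run on a breach; any healthy sample resets it.
        streak = streak + 1 if value > rule.threshold else 0
        if streak >= rule.min_consecutive:
            return True
    return False
```

Requiring a sustained breach trades a short detection delay for fewer false alarms. The right balance depends on how quickly the underlying issue affects users.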

To enhance your monitoring capabilities further, you can integrate third-party solutions with Google Cloud Observability. These solutions can provide additional functionality, such as advanced analytics, machine learning-powered anomaly detection, and incident management capabilities. This combination of Google Cloud Observability tools and third-party solutions lets you create a robust and customizable monitoring ecosystem that's tailored to your specific needs. By using this combination approach, you can proactively identify and address issues, optimize resource utilization, and ensure the overall reliability and availability of your cloud applications and services.

Implement performance and load testing

Performing regular performance testing helps you to ensure that your cloud-based applications and infrastructure can handle peak loads and maintain optimal performance. Load testing simulates realistic traffic patterns. Stress testing pushes the system to its limits to identify potential bottlenecks and performance limitations. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

Tools like Cloud Load Balancing and load testing services can help you to simulate real-world traffic patterns and stress-test your applications. These tools provide valuable insights into how your system behaves under various load conditions, and can help you to identify areas that require optimization.
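To illustrate what a load test measures, the following minimal sketch calls a handler concurrently and summarizes the observed latency. The `run_load_test` function is a hypothetical helper that exercises an in-process callable. A real load test would instead send traffic to a deployed endpoint by using a dedicated load testing tool.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_load_test(
    handler: Callable[[], None], requests: int, concurrency: int
) -> Dict[str, float]:
    """Call `handler` `requests` times across `concurrency` threads."""
    if requests < 2 or concurrency < 1:
        raise ValueError("need at least two requests and one worker")

    def timed_call(_: int) -> float:
        start = time.perf_counter()
        handler()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(requests)))

    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "requests": float(len(latencies)),
        "p50_s": cuts[49],
        "p95_s": cuts[94],
        "max_s": max(latencies),
    }
```

Comparing the median with the tail percentiles at increasing concurrency levels can show where response times start to degrade.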

Based on the results of performance testing, you can make decisions to optimize your cloud infrastructure and applications for optimal performance and scalability. This optimization might involve adjusting resource allocation, tuning configurations, or implementing caching mechanisms.

For example, if you find that your application is experiencing slowdowns during periods of high traffic, you might need to increase the number of virtual machines or containers that are allocated to the application. Alternatively, you might need to adjust the configuration of your web server or database to improve performance.

By regularly conducting performance testing and implementing the necessary optimizations, you can ensure that your cloud-based applications and infrastructure always run at peak performance, and deliver a seamless and responsive experience for your users. Doing so can help you to maintain a competitive advantage and build trust with your customers.

Plan and manage capacity

Proactively planning for future capacity needs, both organic and inorganic, helps you to ensure the smooth operation and scalability of your cloud-based systems. This recommendation is relevant to the processes focus area of operational readiness.

Planning for future capacity includes understanding and managing quotas for various resources like compute instances, storage, and API requests. By analyzing historical usage patterns, growth projections, and business requirements, you can accurately anticipate future capacity requirements. You can use tools like Cloud Monitoring and BigQuery to collect and analyze usage data, identify trends, and forecast future demand.

Historical usage patterns provide valuable insights into resource utilization over time. By examining metrics like CPU utilization, memory usage, and network traffic, you can identify periods of high demand and potential bottlenecks. Additionally, you can help to estimate future capacity needs by making growth projections based on factors like growth in the user base, new products and features, and marketing campaigns. When you assess capacity needs, you should also consider business requirements like SLAs and performance targets.
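As a simple illustration of a growth projection, the following sketch fits a straight line to historical usage and extrapolates it forward. A linear trend is an assumption made for this example. Real demand might be seasonal or nonlinear, and the function names and the default headroom value are hypothetical.

```python
from typing import Sequence


def forecast_linear(usage: Sequence[float], periods_ahead: int) -> float:
    """Extrapolate a least-squares trend line fitted to evenly spaced usage data."""
    n = len(usage)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(usage) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(usage))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # The last observation is at x = n - 1, so forecast from there.
    return intercept + slope * (n - 1 + periods_ahead)


def required_capacity(forecast: float, headroom: float = 0.2) -> float:
    """Add a safety margin on top of the forecast demand."""
    return forecast * (1.0 + headroom)
```

For example, monthly peaks of 100, 110, 120, and 130 units project to 160 units three months out, or 200 units with 25% headroom.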

When you determine the resource sizing for a workload, consider factors that can affect utilization of resources. Seasonal variations like holiday shopping periods or end-of-quarter sales can lead to temporary spikes in demand. Planned events like product launches or marketing campaigns can also significantly increase traffic. To make sure that your primary and disaster recovery (DR) systems can handle unexpected surges in demand, plan for capacity that can support graceful failover during disruptions like natural disasters and cyberattacks.

Autoscaling is an important strategy for dynamically adjusting your cloud resources based on workload fluctuations. By using autoscaling policies, you can automatically scale compute instances, storage, and other resources in response to changing demand. This ensures optimal performance during peak periods while minimizing costs when resource utilization is low. Autoscaling algorithms use metrics like CPU utilization, memory usage, and queue depth to determine when to scale resources.
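The following minimal sketch shows a proportional scaling rule of the kind that the Kubernetes Horizontal Pod Autoscaler documents: the desired replica count is the current count multiplied by the ratio of observed utilization to target utilization, rounded up. This sketch isn't the algorithm of any specific Google Cloud autoscaler, and the function and parameter names are hypothetical.

```python
import math


def desired_replicas(
    current: int,
    observed_util: float,
    target_util: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Scale the replica count in proportion to observed versus target utilization."""
    if current < 1 or target_util <= 0 or min_replicas > max_replicas:
        raise ValueError("invalid autoscaling parameters")
    raw = math.ceil(current * observed_util / target_util)
    # Clamp to the configured bounds so that scaling can't run away or drop to zero.
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas at 75% utilization with a 50% target scale out to six replicas. Production autoscalers typically add stabilization windows or tolerances so that small fluctuations don't cause repeated scaling.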

Continuously monitor and optimize

To manage and optimize cloud workloads, you must establish a process for continuously monitoring and analyzing performance metrics. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

To establish a process for continuous monitoring and analysis, you track, collect, and evaluate data that's related to various aspects of your cloud environment. By using this data, you can proactively identify areas for improvement, optimize resource utilization, and ensure that your cloud infrastructure consistently meets or exceeds your performance expectations.

An important aspect of performance monitoring is regularly reviewing logs and traces. Logs provide valuable insights into system events, errors, and warnings. Traces provide detailed information about the flow of requests through your application. By analyzing logs and traces, you can detect potential issues, determine the root causes of problems, and get a better understanding of how your applications behave under different conditions. Metrics like the round-trip time between services can help you to identify and understand bottlenecks in your workloads.

Further, you can use performance-tuning techniques to significantly enhance application response times and overall efficiency. The following are examples of techniques that you can use:

  • Caching: Store frequently accessed data in memory to reduce the need for repeated database queries or API calls.
  • Database optimization: Use techniques like indexing and query optimization to improve the performance of database operations.
  • Code profiling: Identify areas of your code that consume excessive resources or cause performance issues.

By applying these techniques, you can optimize your applications and ensure that they run efficiently in the cloud.
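As an example of the caching technique, the following minimal sketch uses `functools.lru_cache` from the Python standard library to keep recent results in memory. The `query_backend` function is a hypothetical stand-in for a slow database query or API call.

```python
from functools import lru_cache

backend_calls = 0  # counts how often the slow path runs


def query_backend(product_id: int) -> str:
    """Hypothetical stand-in for a slow database query or API call."""
    global backend_calls
    backend_calls += 1
    return f"product-{product_id}"


@lru_cache(maxsize=1024)
def get_product(product_id: int) -> str:
    """Return a product name, serving repeated lookups from memory."""
    return query_backend(product_id)
```

Repeated calls with the same `product_id` are answered from memory, and `get_product.cache_info()` reports the hit and miss counts. An in-process cache like this one is local to each instance, so data that changes often might also need an expiry or invalidation strategy.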

Manage incidents and problems

This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you manage incidents and problems related to your cloud workloads. It involves implementing comprehensive monitoring and observability, establishing clear incident response procedures, conducting thorough root cause analysis, and implementing preventive measures. Many of the topics that are discussed in this principle are covered in detail in the Reliability pillar.

Principle overview

Incident management and problem management are important components of a functional operations environment. How you respond to, categorize, and solve incidents of differing severity can significantly affect your operations. You must also proactively and continuously make adjustments to optimize reliability and performance. An efficient process for incident and problem management relies on the following foundational elements:

  • Continuous monitoring: Identify and resolve issues quickly.
  • Automation: Streamline tasks and improve efficiency.
  • Orchestration: Coordinate and manage cloud resources effectively.
  • Data-driven insights: Optimize cloud operations and make informed decisions.

These elements help you to build a resilient cloud environment that can handle a wide range of challenges and disruptions. These elements can also help to reduce the risk of costly incidents and downtime, and they can help you to achieve greater business agility and success. These foundational elements are spread across the four focus areas of operational readiness: Workforce, Processes, Tooling, and Governance.

Recommendations

To manage incidents and problems effectively, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Establish clear incident response procedures

Clear roles and responsibilities are essential to ensure effective and coordinated response to incidents. Additionally, clear communication protocols and escalation paths help to ensure that information is shared promptly and effectively during an incident. This recommendation is relevant to these focus areas of operational readiness: workforce, processes, and tooling.

To establish incident response procedures, you need to define the roles and expectations of each team member, such as incident commanders, investigators, communicators, and technical experts. Establishing communication and escalation paths includes identifying important contacts, setting up communication channels, and defining the process for escalating incidents to higher levels of management when necessary. Regular training and preparation help to ensure that teams are equipped with the knowledge and skills to respond to incidents effectively.

By documenting incident response procedures in a runbook or playbook, you can provide a standardized reference guide for teams to follow during an incident. The runbook must outline the steps to be taken at each stage of the incident response process, including communication, triage, investigation, and resolution. It must also include information about relevant tools and resources and contact information for important personnel. You must regularly review and update the runbook to ensure that it remains current and effective.

Centralize incident management

For effective tracking and management throughout the incident lifecycle, consider using a centralized incident management system. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

A centralized incident management system provides the following advantages:

  • Improved visibility: By consolidating all incident-related data in a single location, you eliminate the need for teams to search in various channels or systems for context. This approach saves time and reduces confusion, and it gives stakeholders a comprehensive view of the incident, including its status, impact, and progress.
  • Better coordination and collaboration: A centralized system provides a unified platform for communication and task management. It promotes seamless collaboration between the different departments and functions that are involved in incident response. This approach ensures that everyone has access to up-to-date information and it reduces the risk of miscommunication and misalignment.
  • Enhanced accountability and ownership: A centralized incident management system enables your organization to allocate tasks to specific individuals or teams and it ensures that responsibilities are clearly defined and tracked. This approach promotes accountability and encourages proactive problem-solving because team members can easily monitor their progress and contributions.

A centralized incident management system must offer robust features for incident tracking, task assignment, and communication management. These features let you customize workflows, set priorities, and integrate with other systems, such as monitoring tools and ticketing systems.

By implementing a centralized incident management system, you can optimize your organization's incident response processes, improve collaboration, and enhance visibility. Doing so leads to faster incident resolution times, reduced downtime, and improved customer satisfaction. It also helps foster a culture of continuous improvement because you can learn from past incidents and identify areas for improvement.

Conduct thorough post-incident reviews

After an incident occurs, you must conduct a detailed post-incident review (PIR), which is also known as a postmortem, to identify the root cause, contributing factors, and lessons learned. This thorough review helps you to prevent similar incidents in the future. This recommendation is relevant to these focus areas of operational readiness: processes and governance.

The PIR process must involve a multidisciplinary team that has expertise in various aspects of the incident. The team must gather all of the relevant information through interviews, documentation review, and site inspections. A timeline of events must be created to establish the sequence of actions that led up to the incident.

After the team gathers the required information, they must conduct a root cause analysis to determine the factors that led to the incident. This analysis must identify both the immediate cause and the systemic issues that contributed to the incident.

Along with identifying the root cause, the PIR team must identify any other factors that might have contributed to the incident. These factors could include human error, equipment failure, or organizational factors like communication breakdowns and lack of training.

The PIR report must document the findings of the investigation, including the timeline of events, root cause analysis, and recommended actions. The report is a valuable resource for implementing corrective actions and preventing recurrence. The report must be shared with all of the relevant stakeholders and it must be used to develop safety training and procedures.

To ensure a successful PIR process, your organization must foster a blameless culture that focuses on learning and improvement rather than assigning blame. This culture encourages individuals to report incidents without fear of retribution, and it lets you address systemic issues and make meaningful improvements.

By conducting thorough PIRs and implementing corrective measures based on the findings, you can significantly reduce the risk of similar incidents occurring in the future. This proactive approach to incident investigation and prevention helps to create a safer and more efficient work environment for everyone involved.

Maintain a knowledge base

A knowledge base of known issues, solutions, and troubleshooting guides is essential for incident management and resolution. Team members can use the knowledge base to quickly identify and address common problems. Implementing a knowledge base helps to reduce the need for escalation and it improves overall efficiency. This recommendation is relevant to these focus areas of operational readiness: workforce and processes.

A primary benefit of a knowledge base is that it lets teams learn from past experiences and avoid repeating mistakes. By capturing and sharing solutions to known issues, teams can build a collective understanding of how to resolve common problems and best practices for incident management. Use of a knowledge base saves time and effort, and helps to standardize processes and ensure consistency in incident resolution.

Along with helping to improve incident resolution times, a knowledge base promotes knowledge sharing and collaboration across teams. With a central repository of information, teams can easily access and contribute to the knowledge base, which promotes a culture of continuous learning and improvement. This culture encourages teams to share their expertise and experiences, leading to a more comprehensive and valuable knowledge base.

To create and manage a knowledge base effectively, use appropriate tools and technologies. Collaboration platforms like Google Workspace are well-suited for this purpose because they let you easily create, edit, and share documents collaboratively. These tools also support version control and change tracking, which help to keep the knowledge base up to date and accurate.

Make the knowledge base easily accessible to all relevant teams. You can achieve this by integrating the knowledge base with existing incident management systems or by providing a dedicated portal or intranet site. A knowledge base that's readily available lets teams quickly access the information that they need to resolve incidents efficiently. This availability helps to reduce downtime and minimize the impact on business operations.

Regularly review and update the knowledge base to ensure that it remains relevant and useful. Monitor incident reports, identify common issues and trends, and incorporate new solutions and troubleshooting guides into the knowledge base. An up-to-date knowledge base helps your teams resolve incidents faster and more effectively.

Automate incident response

Automation helps to streamline your incident response and remediation processes. It lets you address security breaches and system failures promptly and efficiently. By using Google Cloud products like Cloud Run functions or Cloud Run, you can automate various tasks that are typically manual and time-consuming. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

Automated incident response provides the following benefits:

  • Reduction in incident detection and resolution times: Automated tools can continuously monitor systems and applications, detect suspicious or anomalous activities in real time, and notify stakeholders or respond without intervention. This automation lets you identify potential threats or issues before they escalate into major incidents. When an incident is detected, automated tools can trigger predefined remediation actions, such as isolating affected systems, quarantining malicious files, or rolling back changes to restore the system to a known good state.
  • Reduced burden on security and operations teams: Automated incident response lets the security and operations teams focus on more strategic tasks. By automating routine and repetitive tasks, such as collecting diagnostic information or triggering alerts, your organization can free up personnel to handle more complex and critical incidents. This automation can lead to improved overall incident response effectiveness and efficiency.
  • Enhanced consistency and accuracy of the remediation process: Automated tools can ensure that remediation actions are applied uniformly across all affected systems, minimizing the risk of human error or inconsistency. This standardization of the remediation process helps to minimize the impact of incidents on users and the business.
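
As a minimal sketch of the predefined remediation actions described above, the following Python maps incident types to ordered lists of handler functions. The incident types, the handler names, and the `Incident` structure are hypothetical and aren't part of any Google Cloud API. In practice, logic like this might run in a Cloud Run function that's triggered by an alert notification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Incident:
    kind: str      # hypothetical incident type, for example "bad_deploy"
    resource: str  # identifier of the affected resource


def isolate_system(incident: Incident) -> str:
    # Placeholder: a real handler would call your platform's APIs.
    return f"isolated {incident.resource}"


def roll_back_change(incident: Incident) -> str:
    return f"rolled back {incident.resource}"


def notify_on_call(incident: Incident) -> str:
    return f"paged on-call for {incident.resource}"


# Hypothetical runbook: each incident type maps to an ordered list of actions.
RUNBOOK: Dict[str, List[Callable[[Incident], str]]] = {
    "compromised_vm": [isolate_system, notify_on_call],
    "bad_deploy": [roll_back_change, notify_on_call],
}


def respond(incident: Incident) -> List[str]:
    """Run the predefined actions for an incident, or escalate to a human."""
    actions = RUNBOOK.get(incident.kind, [notify_on_call])
    return [action(incident) for action in actions]
```

Because every incident of a given type runs the same ordered actions, the remediation is applied uniformly. Unknown incident types fall back to notifying a person instead of guessing at a fix.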

Manage and optimize cloud resources

This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you manage and optimize the resources that are used by your cloud workloads. It involves right-sizing resources based on actual usage and demand, using autoscaling for dynamic resource allocation, implementing cost optimization strategies, and regularly reviewing resource utilization and costs. Many of the topics that are discussed in this principle are covered in detail in the Cost optimization pillar.

Principle overview

Cloud resource management and optimization play a vital role in optimizing cloud spending, resource usage, and infrastructure efficiency. They include various strategies and best practices aimed at maximizing the value and return from your cloud spending.

This principle's focus on optimization extends beyond cost reduction. It emphasizes the following goals:

  • Efficiency: Using automation and data analytics to achieve peak performance and cost savings.
  • Performance: Scaling resources effortlessly to meet fluctuating demands and deliver optimal results.
  • Scalability: Adapting infrastructure and processes to accommodate rapid growth and diverse workloads.

By focusing on these goals, you achieve a balance between cost and functionality. You can make informed decisions regarding resource provisioning, scaling, and migration. Additionally, you gain valuable insights into resource consumption patterns, which lets you proactively identify and address potential issues before they escalate.

Recommendations

To manage and optimize resources, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Right-size resources

Continuously monitoring resource utilization and adjusting resource allocation to match actual demand are essential for efficient cloud resource management. Over-provisioning resources can lead to unnecessary costs, and under-provisioning can cause performance bottlenecks that affect application performance and user experience. To achieve an optimal balance, you must adopt a proactive approach to right-sizing cloud resources. This recommendation is relevant to the governance focus area of operational readiness.

Cloud Monitoring and Recommender can help you to identify opportunities for right-sizing. Cloud Monitoring provides real-time visibility into resource utilization metrics. This visibility lets you track resource usage patterns and identify potential inefficiencies. Recommender analyzes resource utilization data to make intelligent recommendations for optimizing resource allocation. By using these tools, you can gain insights into resource usage and make informed decisions about right-sizing the resources.

In addition to Cloud Monitoring and Recommender, consider using custom metrics to trigger automated right-sizing actions. Custom metrics let you track specific resource utilization metrics that are relevant to your applications and workloads. You can also configure alerts to notify administrators when predefined thresholds are met. The administrators can then take necessary actions to adjust resource allocation. This proactive approach ensures that resources are scaled in a timely manner, which helps to optimize cloud costs and prevent performance issues.
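
The following Python is a rough sketch of how a threshold-based right-sizing check might classify a resource from utilization samples. The function name and the 30% and 80% thresholds are illustrative assumptions, not Google Cloud recommendations, and the sketch doesn't reflect how Recommender works internally. The samples are assumed to be CPU utilization percentages that you exported from your monitoring system.

```python
from statistics import quantiles
from typing import Sequence


def rightsizing_advice(cpu_percent_samples: Sequence[float],
                       low: float = 30.0,
                       high: float = 80.0) -> str:
    """Classify a resource from CPU utilization samples (0-100).

    The 95th percentile is used instead of the maximum so that a few
    brief spikes don't decide the outcome on their own.
    """
    if len(cpu_percent_samples) < 2:
        return "insufficient-data"
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th.
    p95 = quantiles(cpu_percent_samples, n=100, method="inclusive")[94]
    if p95 < low:
        return "downsize"
    if p95 > high:
        return "upsize"
    return "keep"
```

A check like this could run on a schedule and open a ticket or notify an administrator instead of resizing resources directly, which keeps a person in the loop for the final decision.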

Use autoscaling

Autoscaling compute and other resources helps to ensure optimal performance and cost efficiency of your cloud-based applications. Autoscaling lets you dynamically adjust the capacity of your resources based on workload fluctuations, so that you have the resources that you need when you need them and you can avoid over-provisioning and unnecessary costs. This recommendation is relevant to the processes focus area of operational readiness.

To meet the diverse needs of different applications and workloads, Google Cloud offers various autoscaling options, including the following:

  • Compute Engine managed instance groups (MIGs) are groups of VMs that are managed and scaled as a single entity. With MIGs, you can define autoscaling policies that specify the minimum and maximum number of VMs to maintain in the group, and the conditions that trigger autoscaling. For example, you can configure a policy to add VMs in a MIG when the CPU utilization reaches a certain threshold and to remove VMs when the utilization drops below a different threshold.
  • Google Kubernetes Engine (GKE) autoscaling dynamically adjusts your cluster resources to match your application's needs. It offers the following tools:

    • Cluster Autoscaler adds or removes nodes based on Pod resource demands.
    • Horizontal Pod Autoscaler changes the number of Pod replicas based on CPU, memory, or custom metrics.
    • Vertical Pod Autoscaler fine-tunes Pod resource requests and limits based on usage patterns.
    • Node Auto-Provisioning automatically creates optimized node pools for your workloads.

    These tools work together to optimize resource utilization, ensure application performance, and simplify cluster management.

  • Cloud Run is a serverless platform that lets you run code without having to manage infrastructure. Cloud Run offers built-in autoscaling, which automatically adjusts the number of instances based on the incoming traffic. When the volume of traffic increases, Cloud Run scales up the number of instances to handle the load. When traffic decreases, Cloud Run scales down the number of instances to reduce costs.

By using these autoscaling options, you can ensure that your cloud-based applications have the resources that they need to handle varying workloads, while avoiding over-provisioning and unnecessary costs. Using autoscaling can lead to improved performance, cost savings, and more efficient use of cloud resources.
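
To illustrate how a utilization-based autoscaling policy translates a metric into capacity, the following Python sketch implements a proportional calculation that's similar to the formula in the Kubernetes Horizontal Pod Autoscaler documentation. The function and parameter names are assumptions for illustration. Real autoscalers also apply tolerances, stabilization windows, and cooldown periods that this sketch omits.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization_pct: float,
                     target_utilization_pct: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Scale the replica count in proportion to the utilization ratio.

    The result is clamped to the configured minimum and maximum.
    """
    if target_utilization_pct <= 0:
        raise ValueError("target_utilization_pct must be positive")
    raw = math.ceil(
        current_replicas * current_utilization_pct / target_utilization_pct
    )
    return max(min_replicas, min(max_replicas, raw))
```

For example, four replicas that run at 90% utilization against a 60% target yield six replicas, provided that six is within the configured bounds.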

Leverage cost optimization strategies

Optimizing cloud spending helps you to effectively manage your organization's IT budgets. This recommendation is relevant to the governance focus area of operational readiness.

Google Cloud offers several tools and techniques to help you optimize cloud costs and get the best value from your cloud spending. They help you to identify areas where costs can be reduced, for example, by identifying underutilized resources or recommending more cost-effective instance types. Google Cloud options to help optimize cloud costs include the following:

Pricing models might change over time, and new features might be introduced that offer better performance or lower cost compared to existing options. Therefore, you should regularly review pricing models and consider alternative features. By staying informed about the latest pricing models and features, you can make informed decisions about your cloud architecture to minimize costs.

Google Cloud's Cost Management tools, such as budgets and alerts, provide valuable insights into cloud spending. With budgets and alerts, you can set a budget amount and receive notifications when actual or forecasted spending reaches the thresholds that you define. These tools help you track your cloud spending and identify areas where costs can be reduced.

Track resource usage and costs

You can use tagging and labeling to track resource usage and costs. By assigning tags and labels to your cloud resources based on dimensions like project, department, or other relevant attributes, you can categorize and organize the resources. This categorization lets you monitor and analyze spending patterns for specific resources and identify areas of high usage or potential cost savings. This recommendation is relevant to these focus areas of operational readiness: governance and tooling.

Tools like Cloud Billing and Cost Management help you to get a comprehensive understanding of your spending patterns. These tools provide detailed insights into your cloud usage and they let you identify trends, forecast costs, and make informed decisions. By analyzing historical data and current spending patterns, you can identify the focus areas for your cost-optimization efforts.

Custom dashboards and reports help you to visualize cost data and gain deeper insights into spending trends. By customizing dashboards with relevant metrics and dimensions, you can monitor key performance indicators (KPIs) and track progress towards your cost optimization goals. Reports offer deeper analyses of cost data. Reports let you filter the data by specific time periods or resource types to understand the underlying factors that contribute to your cloud spending.

Regularly review and update your tags, labels, and cost analysis tools to ensure that you have the most up-to-date information on your cloud usage and costs. By staying informed and conducting cost postmortems or proactive cost reviews, you can promptly identify any unexpected increases in spending. Doing so lets you make proactive decisions to optimize cloud resources and control costs.
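
As a small illustration of label-based cost tracking, the following Python sketch totals cost rows by the value of a chosen label. The row shape (a `cost` number and a `labels` mapping) is a simplified assumption and doesn't match the Cloud Billing export schema. The `team` label key in the usage example is also hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def cost_by_label(rows: Iterable[Mapping[str, Any]],
                  label_key: str) -> Dict[str, float]:
    """Sum the cost of each row under the value of the given label.

    Rows that don't carry the label are grouped under "(unlabeled)" so
    that untracked spending stays visible.
    """
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        value = row.get("labels", {}).get(label_key, "(unlabeled)")
        totals[value] += row["cost"]
    return dict(totals)


rows = [
    {"cost": 120.0, "labels": {"team": "payments"}},
    {"cost": 30.0, "labels": {"team": "search"}},
    {"cost": 50.0, "labels": {}},
]
print(cost_by_label(rows, "team"))
```

A large "(unlabeled)" total is itself a useful signal: it suggests that the labeling policy needs review before the per-team numbers can be trusted.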

Establish cost allocation and budgeting

Accountability and transparency in cloud cost management are crucial for optimizing resource utilization and ensuring financial control. This recommendation is relevant to the governance focus area of operational readiness.

To ensure accountability and transparency, you need to have clear mechanisms for cost allocation and chargeback. By allocating costs to specific teams, projects, or individuals, your organization can ensure that each of these entities is responsible for its cloud usage. This practice fosters a sense of ownership and encourages responsible resource management. Additionally, chargeback mechanisms enable your organization to recover cloud costs from internal customers, align incentives with performance, and promote fiscal discipline.

Establishing budgets for different teams or projects is another essential aspect of cloud cost management. Budgets enable your organization to define spending limits and track actual expenses against those limits. This approach lets you make proactive decisions to prevent uncontrolled spending. By setting realistic and achievable budgets, you can ensure that cloud resources are used efficiently and aligned with business objectives. Regular monitoring of actual spending against budgets helps you to identify variances and address potential overruns promptly.

To monitor budgets, you can use tools like Cloud Billing budgets and alerts. These tools provide real-time insights into cloud spending and they notify stakeholders of potential overruns. By using these capabilities, you can track cloud costs and take corrective actions before significant deviations occur. This proactive approach helps to prevent financial surprises and ensures that cloud resources are used responsibly.
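
The threshold logic behind a budget alert can be sketched in a few lines of Python. The function name and the default thresholds (50%, 90%, and 100% of the budget) are assumptions for illustration. This sketch isn't the Cloud Billing implementation.

```python
from typing import List, Sequence


def crossed_thresholds(budget: float,
                       spend: float,
                       thresholds: Sequence[float] = (0.5, 0.9, 1.0)) -> List[float]:
    """Return the budget fractions that current spending has reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in sorted(thresholds) if ratio >= t]
```

For example, spending of 920 against a budget of 1,000 has crossed the 0.5 and 0.9 thresholds but not 1.0, so stakeholders can act before the budget is exceeded.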

Automate and manage change

This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you automate and manage change for your cloud workloads. It involves implementing infrastructure as code (IaC), establishing standard operating procedures, implementing a structured change management process, and using automation and orchestration.

Principle overview

Change management and automation play a crucial role in ensuring smooth and controlled transitions within cloud environments. For effective change management, you need to use strategies and best practices that minimize disruptions and ensure that changes are integrated seamlessly with existing systems.

Effective change management and automation include the following foundational elements:

  • Change governance: Establish clear policies and procedures for change management, including approval processes and communication plans.
  • Risk assessment: Identify potential risks associated with changes and mitigate them through risk management techniques.
  • Testing and validation: Thoroughly test changes to ensure that they meet functional and performance requirements and mitigate potential regressions.
  • Controlled deployment: Implement changes in a controlled manner, ensuring that users are transitioned smoothly to the new environment, with mechanisms to roll back quickly if needed.

These foundational elements help to minimize the impact of changes and ensure that changes have a positive effect on business operations. These elements are represented by the processes, tooling, and governance focus areas of operational readiness.

Recommendations

To automate and manage change, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Adopt IaC

Infrastructure as code (IaC) is a transformative approach for managing cloud infrastructure. You can define and manage cloud infrastructure declaratively by using tools like Terraform. IaC helps you achieve consistency, repeatability, and simplified change management. It also enables faster and more reliable deployments. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

The following are the main benefits of adopting the IaC approach for your cloud deployments:

  • Human-readable resource configurations: With the IaC approach, you can declare your cloud infrastructure resources in a human-readable format, like HashiCorp Configuration Language (HCL), JSON, or YAML. Infrastructure administrators and operators can easily understand and modify the infrastructure and collaborate with others.
  • Consistency and repeatability: IaC enables consistency and repeatability in your infrastructure deployments. You can ensure that your infrastructure is provisioned and configured the same way every time, regardless of who is performing the deployment. This approach helps to reduce errors and ensures that your infrastructure is always in a known state.
  • Accountability and simplified troubleshooting: The IaC approach helps to improve accountability and makes it easier to troubleshoot issues. By storing your IaC code in a version control system, you can track changes, and identify when changes were made and by whom. If necessary, you can easily roll back to previous versions.
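
The declarative model behind IaC can be illustrated with a conceptual Python sketch that compares a desired state with the actual state and derives the changes to make. This sketch isn't how Terraform is implemented. The resource names and the dictionary-based state format are hypothetical.

```python
from typing import Any, Dict, List, Mapping


def plan(desired: Mapping[str, Any],
         actual: Mapping[str, Any]) -> Dict[str, List[str]]:
    """Compute which resources to create, update, or delete.

    Both arguments map a resource name to its configuration.
    """
    create = sorted(desired.keys() - actual.keys())
    delete = sorted(actual.keys() - desired.keys())
    update = sorted(
        name for name in desired.keys() & actual.keys()
        if desired[name] != actual[name]
    )
    return {"create": create, "update": update, "delete": delete}


desired = {"vm-web": {"machine_type": "e2-medium"}, "bucket-logs": {"location": "US"}}
actual = {"vm-web": {"machine_type": "e2-small"}, "vm-old": {"machine_type": "e2-small"}}
print(plan(desired, actual))
```

Because the plan depends only on the declared state and the observed state, the same configuration produces the same plan regardless of who runs it. When the two states already match, the plan is empty.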

Implement version control

A version control system like Git is a key component of the IaC process. It provides robust change management and risk mitigation capabilities, which is why it's widely adopted, whether hosted in-house or used as a SaaS solution. This recommendation is relevant to these focus areas of operational readiness: governance and tooling.

By tracking changes to IaC code and configurations, version control provides visibility into the evolution of the code, making it easier to understand the impact of changes and identify potential issues. This enhanced visibility fosters collaboration among team members who work on the same IaC project.

Most version control systems let you easily roll back changes if needed. This capability helps to mitigate the risk of unintended consequences or errors. By using tools like Git in your IaC workflow, you can significantly improve change management processes, foster collaboration, and mitigate risks, which leads to a more efficient and reliable IaC implementation.

Build CI/CD pipelines

Continuous integration and continuous delivery (CI/CD) pipelines streamline the process of developing and deploying cloud applications. CI/CD pipelines automate the building, testing, and deployment stages, which enables faster and more frequent releases with improved quality control. This recommendation is relevant to the tooling focus area of operational readiness.

CI/CD pipelines ensure that code changes are continuously integrated into a central repository, typically a version control system like Git. Continuous integration facilitates early detection and resolution of issues, and it reduces the likelihood of bugs or compatibility problems.

To create and manage CI/CD pipelines for cloud applications, you can use tools like Cloud Build and Cloud Deploy.

  • Cloud Build is a fully managed build service that lets developers define and execute build steps in a declarative manner. It integrates seamlessly with popular source-code management platforms and it can be triggered by events like code pushes and pull requests.
  • Cloud Deploy is a serverless deployment service that automates the process of deploying applications to various environments, such as testing, staging, and production. It provides features like blue-green deployments, traffic splitting, and rollback capabilities, making it easier to manage and monitor application deployments.

Integrating CI/CD pipelines with version control systems and testing frameworks helps to ensure the quality and reliability of your cloud applications. By running automated tests as part of the CI/CD process, development teams can quickly identify and fix any issues before the code is deployed to the production environment. This integration helps to improve the overall stability and performance of your cloud applications.
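
The gating behavior of a pipeline, where a failed stage stops everything after it, can be sketched as follows. This Python is a simplified illustration and doesn't represent the Cloud Build or Cloud Deploy configuration format. The stage names and the `run_pipeline` function are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is a name plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: Sequence[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns whether the pipeline succeeded and which stages completed.
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            return False, completed
        completed.append(name)
    return True, completed
```

In this model, if the test stage fails, the deploy stage never runs, which is the property that keeps defective builds out of the production environment.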

Use configuration management tools

Tools like Puppet, Chef, Ansible, and VM Manager help you to automate the configuration and management of cloud resources. Using these tools, you can ensure resource consistency and compliance across your cloud environments. This recommendation is relevant to the tooling focus area of operational readiness.

Automating the configuration and management of cloud resources provides the following benefits:

  • Significant reduction in the risk of manual errors: When manual processes are involved, there is a higher likelihood of mistakes due to human error. Configuration management tools reduce this risk by automating processes, so that configurations are applied consistently and accurately across all cloud resources. This automation can lead to improved reliability and stability of the cloud environment.
  • Improvement in operational efficiency: By automating repetitive tasks, your organization can free up IT staff to focus on more strategic initiatives. This automation can lead to increased productivity and cost savings and improved responsiveness to changing business needs.
  • Simplified management of complex cloud infrastructure: As cloud environments grow in size and complexity, managing the resources can become increasingly difficult. Configuration management tools provide a centralized platform for managing cloud resources. The tools make it easier to track configurations, identify issues, and implement changes. Using these tools can lead to improved visibility, control, and security of your cloud environment.
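
Configuration management tools typically apply settings idempotently: applying the same desired configuration twice changes nothing the second time. The following Python sketch illustrates that idea under simplifying assumptions. It isn't the API of Puppet, Chef, Ansible, or VM Manager, and the setting names are hypothetical.

```python
from typing import Any, Dict, List, Mapping


def converge(current: Dict[str, Any], desired: Mapping[str, Any]) -> List[str]:
    """Bring current settings in line with desired settings, in place.

    Returns the keys that were changed. Settings that already match are
    left untouched, so repeated runs are safe.
    """
    changed: List[str] = []
    for key, value in desired.items():
        if key not in current or current[key] != value:
            current[key] = value
            changed.append(key)
    return changed


state = {"ntp_server": "time.example.com"}
desired = {"ntp_server": "time.internal.example", "ssh_root_login": "no"}
print(converge(state, desired))  # first run reports the drifted settings
print(converge(state, desired))  # second run reports no changes
```

The list of changed keys doubles as a drift report, which makes it easier to track configurations and identify issues across many resources.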

Automate testing

Integrating automated testing into your CI/CD pipelines helps to ensure the quality and reliability of your cloud applications. By validating changes before deployment, you can significantly reduce the risk of errors and regressions, which leads to a more stable and robust software system. This recommendation is relevant to these focus areas of operational readiness: processes and tooling.

The following are the main benefits of incorporating automated testing into your CI/CD pipelines:

  • Early detection of bugs and defects: Automated testing helps to detect bugs and defects early in the development process, before they can cause major problems in production. This capability saves time and resources by preventing the need for costly rework and bug fixes at later stages in the development process.
  • High quality and standards-based code: Automated testing can help improve the overall quality of your code by ensuring that the code meets certain standards and best practices. This capability leads to more maintainable and reliable applications that are less prone to errors.

You can use various types of testing techniques in CI/CD pipelines. Each test type serves a specific purpose.

  • Unit testing focuses on testing individual units of code, such as functions or methods, to ensure that they work as expected.
  • Integration testing tests the interactions between different components or modules of your application to verify that they work properly together.
  • End-to-end testing is often used along with unit and integration testing. End-to-end testing simulates real-world scenarios to test the application as a whole, and helps to ensure that the application meets the requirements of your end users.

To effectively integrate automated testing into your CI/CD pipelines, you must choose appropriate testing tools and frameworks. There are many different options, each with its own strengths and weaknesses. You must also establish a clear testing strategy that outlines the types of tests to be performed, the frequency of testing, and the criteria for passing or failing a test. By following these recommendations, you can ensure that your automated testing process is efficient and effective. Such a process provides valuable insights into the quality and reliability of your cloud applications.
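
As a brief example of the unit testing level described above, the following Python uses the standard `unittest` module to test a single function. The `apply_discount` function is a hypothetical stand-in for your application code.

```python
import unittest


def apply_discount(price_cents: int, percent: int) -> int:
    """Return the price after a percentage discount, rounded down."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(1000, 25), 750)

    def test_zero_discount_keeps_price(self):
        self.assertEqual(apply_discount(1000, 0), 1000)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 120)
```

A pipeline step can run tests like these with `python -m unittest` and treat a nonzero exit code as a failed stage, which gives a clear pass or fail criterion.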

Continuously improve and innovate

This principle in the operational excellence pillar of the Google Cloud Architecture Framework provides recommendations to help you continuously optimize cloud operations and drive innovation.

Principle overview

To continuously improve and innovate in the cloud, you need to focus on continuous learning, experimentation, and adaptation. This focus helps you to explore new technologies and optimize existing processes. It also promotes a culture of excellence that enables your organization to achieve and maintain industry leadership.

Through continuous improvement and innovation, you can achieve the following goals:

  • Accelerate innovation: Explore new technologies and services to enhance capabilities and drive differentiation.
  • Reduce costs: Identify and eliminate inefficiencies through process-improvement initiatives.
  • Enhance agility: Adapt rapidly to changing market demands and customer needs.
  • Improve decision making: Gain valuable insights from data and analytics to make data-driven decisions.

Organizations that embrace the continuous improvement and innovation principle can unlock the full potential of the cloud environment and achieve sustainable growth. This principle maps primarily to the workforce focus area of operational readiness. A culture of innovation lets teams experiment with new tools and technologies to expand capabilities and reduce costs.

Recommendations

To continuously improve and innovate your cloud workloads, consider the recommendations in the following sections. Each recommendation in this document is relevant to one or more of the focus areas of operational readiness.

Foster a culture of learning

Encourage teams to experiment, share knowledge, and learn continuously. Adopt a blameless culture where failures are viewed as opportunities for growth and improvement. This recommendation is relevant to the workforce focus area of operational readiness.

When you foster a culture of learning, teams can learn from mistakes and iterate quickly. This approach encourages team members to take risks, experiment with new ideas, and expand the boundaries of their work. It also creates a psychologically safe environment where individuals feel comfortable sharing failures and learning from them. Sharing in this way leads to a more open and collaborative environment.

To facilitate knowledge sharing and continuous learning, create opportunities for teams to learn from each other, such as informal and formal learning sessions and conferences.

By fostering a culture of experimentation, knowledge sharing, and continuous learning, you can create an environment where teams are empowered to take risks, innovate, and grow. This environment can lead to increased productivity, improved problem-solving, and a more engaged and motivated workforce. Further, by promoting a blameless culture, you can create a safe space for employees to learn from mistakes and contribute to the collective knowledge of the team. This culture ultimately leads to a more resilient and adaptable workforce that is better equipped to handle challenges and drive success in the long run.

Conduct regular retrospectives

Retrospectives give teams an opportunity to reflect on their experiences, identify what went well, and identify what can be improved. By conducting retrospectives after projects or major incidents, teams can learn from successes and failures, and continuously improve their processes and practices. This recommendation is relevant to these focus areas of operational readiness: processes and governance.

An effective way to structure a retrospective is to use the Start-Stop-Continue model:

  • Start: In the Start phase of the retrospective, team members identify new practices, processes, and behaviors that they believe can enhance their work. They discuss why the changes are needed and how they can be implemented.
  • Stop: In the Stop phase, team members identify and eliminate practices, processes, and behaviors that are no longer effective or that hinder progress. They discuss why these changes are necessary and how they can be implemented.
  • Continue: In the Continue phase, team members identify practices, processes, and behaviors that work well and must be continued. They discuss why these elements are important and how they can be reinforced.

By using a structured format like the Start-Stop-Continue model, teams can ensure that retrospectives are productive and focused. This model helps to facilitate discussion, identify the main takeaways, and identify actionable steps for future enhancements.

Stay up-to-date with cloud technologies

To maximize the potential of Google Cloud services, you must keep up with the latest advancements, features, and best practices. This recommendation is relevant to the workforce focus area of operational readiness.

Participating in relevant conferences, webinars, and training sessions is a valuable way to expand your knowledge. These events provide opportunities to learn from Google Cloud experts, understand new capabilities, and engage with industry peers who might face similar challenges. By attending these sessions, you can gain insights into how to use new features effectively, optimize your cloud operations, and drive innovation within your organization.

To ensure that your team members keep up with cloud technologies, encourage them to obtain certifications and attend training courses. Google Cloud offers a wide range of certifications that validate skills and knowledge in specific cloud domains. Earning these certifications demonstrates commitment to excellence and provides tangible evidence of proficiency in cloud technologies. The training courses that are offered by Google Cloud and our partners delve deeper into specific topics. They provide direct experience and practical skills that can be immediately applied to real-world projects. By investing in the professional development of your team, you can foster a culture of continuous learning and ensure that everyone has the necessary skills to succeed in the cloud.

Actively seek and incorporate feedback

Collect feedback from users, stakeholders, and team members. Use the feedback to identify opportunities to improve your cloud solutions. This recommendation is relevant to the workforce focus area of operational readiness.

The feedback that you collect can help you to understand the evolving needs, issues, and expectations of the users of your solutions. This feedback serves as a valuable input to drive improvements and prioritize future enhancements. You can use various mechanisms to collect feedback:

  • Surveys are an effective way to gather quantitative data from a large number of users and stakeholders.
  • User interviews provide an opportunity for in-depth qualitative data collection. Interviews let you understand the specific challenges and experiences of individual users.
  • Feedback forms that are placed within the cloud solutions offer a convenient way for users to provide immediate feedback on their experience.
  • Regular meetings with team members can facilitate the collection of feedback on technical aspects and implementation challenges.

The feedback that you collect through these mechanisms must be analyzed and synthesized to identify common themes and patterns. This analysis can help you prioritize future enhancements based on the impact and feasibility of the suggested improvements. By addressing the needs and issues that are identified through feedback, you can ensure that your cloud solutions continue to meet the evolving requirements of your users and stakeholders.

Measure and track progress

Key performance indicators (KPIs) and metrics are crucial for tracking progress and measuring the effectiveness of your cloud operations. KPIs are quantifiable measurements that reflect the overall performance. Metrics are specific data points that contribute to the calculation of KPIs. Review the metrics regularly and use them to identify opportunities for improvement and measure progress. Doing so helps you to continuously improve and optimize your cloud environment. This recommendation is relevant to these focus areas of operational readiness: governance and processes.

A primary benefit of using KPIs and metrics is that they enable your organization to adopt a data-driven approach to cloud operations. By tracking and analyzing operational data, you can make informed decisions about how to improve the cloud environment. This data-driven approach helps you to identify trends, patterns, and anomalies that might not be visible without the use of systematic metrics.

To collect and analyze operational data, you can use tools like Cloud Monitoring and BigQuery. Cloud Monitoring enables real-time monitoring of cloud resources and services. BigQuery lets you store and analyze the data that you gather through monitoring. Using these tools together, you can create custom dashboards to visualize important metrics and trends.

Operational dashboards can provide a centralized view of the most important metrics, which lets you quickly identify any areas that need attention. For example, a dashboard might include metrics like CPU utilization, memory usage, network traffic, and latency for a particular application or service. By monitoring these metrics, you can quickly identify any potential issues and take steps to resolve them.
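
To make the distinction between metrics and KPIs concrete, the following Python sketch derives an availability KPI from per-window request counts. The function names, the input shape, and the 99.9% target are assumptions for illustration. In practice, the counts might come from Cloud Monitoring data that you analyze in BigQuery.

```python
from typing import Iterable, Tuple


def availability(windows: Iterable[Tuple[int, int]]) -> float:
    """Compute the fraction of successful requests.

    Each window is a pair: (total requests, failed requests).
    """
    total = 0
    failed = 0
    for window_total, window_failed in windows:
        total += window_total
        failed += window_failed
    if total <= 0:
        raise ValueError("no requests recorded")
    return (total - failed) / total


def meets_target(windows: Iterable[Tuple[int, int]],
                 target: float = 0.999) -> bool:
    """Check the availability KPI against a target."""
    return availability(windows) >= target
```

The request counts are the metrics, and the availability ratio that's compared against a target is the KPI that you would review regularly and display on an operational dashboard.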

Google Cloud Architecture Framework: Security, privacy, and compliance

This pillar of the Google Cloud Architecture Framework shows you how to architect and operate secure services on Google Cloud. You also learn about Google Cloud products and features that support security and compliance.

The Architecture Framework describes best practices, provides implementation recommendations, and explains some of the available products and services. The framework helps you design your Google Cloud deployment so that it matches your business needs.

Moving your workloads into Google Cloud requires an evaluation of your business requirements, risks, compliance obligations, and security controls. This document helps you consider key best practices related to designing a secure solution in Google Cloud.

Google core principles include defense in depth, at scale, and by default. In Google Cloud, data and systems are protected through multiple layered defenses using policies and controls that are configured across IAM, encryption, networking, detection, logging, and monitoring.

Google Cloud comes with many security controls that you can build on, such as the following:

  • Secure options for data in transit, and default encryption for data at rest.
  • Built-in security features for Google Cloud products and services.
  • A global infrastructure that's designed for geo-redundancy, with security controls throughout the information-processing lifecycle.
  • Automation capabilities that use infrastructure as code (IaC) and configuration guardrails.

For more information about the security posture of Google Cloud, see the Google security paper and the Google Infrastructure Security Design Overview. For an example secure-by-default environment, see the Google Cloud enterprise foundations blueprint.

For security principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Security in the Architecture Framework.

In the security pillar of the Architecture Framework, you learn to do the following:

Shared responsibilities and shared fate on Google Cloud

This document describes the differences between the shared responsibility model and shared fate in Google Cloud. It discusses the challenges and nuances of the shared responsibility model. This document describes what shared fate is and how we partner with our customers to address cloud security challenges.

Understanding the shared responsibility model is important when determining how to best protect your data and workloads on Google Cloud. The shared responsibility model describes the tasks that you have when it comes to security in the cloud and how these tasks are different for cloud providers.

Understanding shared responsibility, however, can be challenging. The model requires an in-depth understanding of each service you utilize, the configuration options that each service provides, and what Google Cloud does to secure the service. Every service has a different configuration profile, and it can be difficult to determine the best security configuration. Google believes that the shared responsibility model stops short of helping cloud customers achieve better security outcomes. Instead of shared responsibility, we believe in shared fate.

Shared fate includes us building and operating a trusted cloud platform for your workloads. We provide best practice guidance and secured, attested infrastructure code that you can use to deploy your workloads in a secure way. We release solutions that combine various Google Cloud services to solve complex security problems and we offer innovative insurance options to help you measure and mitigate the risks that you must accept. Shared fate involves us more closely interacting with you as you secure your resources on Google Cloud.

Shared responsibility

You're the expert in knowing the security and regulatory requirements for your business, and knowing the requirements for protecting your confidential data and resources. When you run your workloads on Google Cloud, you must identify the security controls that you need to configure in Google Cloud to help protect your confidential data and each workload. To decide which security controls to implement, you must consider the following factors:

  • Your regulatory compliance obligations
  • Your organization's security standards and risk management plan
  • Security requirements of your customers and your vendors

Defined by workloads

Traditionally, responsibilities are defined based on the type of workload that you're running and the cloud services that you require. Cloud services include the following categories:

Infrastructure as a service (IaaS)
IaaS services include Compute Engine, Cloud Storage, and networking services such as Cloud VPN, Cloud Load Balancing, and Cloud DNS.

IaaS provides compute, storage, and network services on demand with pay-as-you-go pricing. You can use IaaS if you plan on migrating an existing on-premises workload to the cloud using lift-and-shift, or if you want to run your application on particular VMs, using specific databases or network configurations.

In IaaS, the bulk of the security responsibilities are yours, and our responsibilities are focused on the underlying infrastructure and physical security.

Platform as a service (PaaS)
PaaS services include App Engine, Google Kubernetes Engine (GKE), and BigQuery.

PaaS provides the runtime environment that you can develop and run your applications in. You can use PaaS if you're building an application (such as a website) and want to focus on development rather than on the underlying infrastructure.

In PaaS, we're responsible for more controls than in IaaS. Typically, this will vary by the services and features that you use. You share responsibility with us for application-level controls and IAM management. You remain responsible for your data security and client protection.

Software as a service (SaaS)
SaaS applications include Google Workspace, Google Security Operations, and third-party SaaS applications that are available in Google Cloud Marketplace.

SaaS provides online applications that you can subscribe to or pay for in some way. You can use SaaS applications when your enterprise doesn't have the internal expertise or business requirement to build the application itself, but does require the ability to process workloads.

In SaaS, we own the bulk of the security responsibilities. You remain responsible for your access controls and the data that you choose to store in the application.

Function as a service (FaaS) or serverless
FaaS provides the platform for developers to run small, single-purpose code (called functions) that runs in response to particular events. You would use FaaS when you want particular things to occur based on a particular event. For example, you might create a function that runs whenever data is uploaded to Cloud Storage so that it can be classified.

FaaS has a shared responsibility list that is similar to that of SaaS. Cloud Run functions is a FaaS offering.

The following diagram shows the cloud services and defines how responsibilities are shared between the cloud provider and customer.

Shared security responsibilities

As the diagram shows, the cloud provider always remains responsible for the underlying network and infrastructure, and customers always remain responsible for their access policies and data.
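
As a rough sketch, the split described in this section can be captured in a small lookup that you consult per workload. The structure below encodes only what the prose states (who carries the bulk of the responsibilities in each model, and what always stays with each party). It is not a copy of the diagram, and the workload names and the `checklist` helper are hypothetical.

```python
from enum import Enum


class ServiceModel(Enum):
    IAAS = "IaaS"
    PAAS = "PaaS"
    SAAS = "SaaS"
    FAAS = "FaaS"


# Per the text, customers always remain responsible for these.
ALWAYS_CUSTOMER = frozenset({"access policies", "data"})

# Per the text, the cloud provider always remains responsible for these.
ALWAYS_PROVIDER = frozenset({"underlying network", "infrastructure"})

# Coarse summary of who carries the bulk of the remaining responsibilities.
BULK_OWNER = {
    ServiceModel.IAAS: "customer",
    ServiceModel.PAAS: "shared",
    ServiceModel.SAAS: "provider",
    ServiceModel.FAAS: "provider",  # described as similar to SaaS
}


def checklist(workloads):
    """Summarize responsibilities for a mapping of workload -> ServiceModel."""
    return {
        name: {
            "model": model.value,
            "bulk_owner": BULK_OWNER[model],
            "customer_always": sorted(ALWAYS_CUSTOMER),
            "provider_always": sorted(ALWAYS_PROVIDER),
        }
        for name, model in workloads.items()
    }


# Hypothetical workloads that span several service models.
workloads = {
    "migrated app on Compute Engine": ServiceModel.IAAS,
    "corporate email on Google Workspace": ServiceModel.SAAS,
    "product analytics on BigQuery": ServiceModel.PAAS,
}
for name, summary in checklist(workloads).items():
    print(name, "->", summary["model"], "/ bulk owner:", summary["bulk_owner"])
```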

Defined by industry and regulatory framework

Various industries have regulatory frameworks that define the security controls that must be in place. When you move your workloads to the cloud, you must understand the following:

  • Which security controls are your responsibility
  • Which security controls are available as part of the cloud offering
  • Which default security controls are inherited

Inherited security controls (such as our default encryption and infrastructure controls) are controls that you can provide as part of your evidence of your security posture to auditors and regulators. For example, the Payment Card Industry Data Security Standard (PCI DSS) defines regulations for payment processors. When you move your business to the cloud, responsibility for meeting these regulations is shared between you and your cloud service provider (CSP). To understand how PCI DSS responsibilities are shared between you and Google Cloud, see Google Cloud: PCI DSS Shared Responsibility Matrix.

As another example, in the United States, the Health Insurance Portability and Accountability Act (HIPAA) has set standards for handling electronic protected health information (ePHI). These responsibilities are also shared between the cloud service provider (CSP) and you. For more information on how Google Cloud meets our responsibilities under HIPAA, see HIPAA - Compliance.

Other industries (for example, finance or manufacturing) also have regulations that define how data can be gathered, processed, and stored. For more information about shared responsibility related to these, and how Google Cloud meets our responsibilities, see Compliance resource center.

Defined by location

Depending on your business scenario, you might need to consider your responsibilities based on the location of your business offices, your customers, and your data. Different countries and regions have created regulations that inform how you can process and store your customers' data. For example, if your business has customers who reside in the European Union, your business might need to abide by the requirements that are described in the General Data Protection Regulation (GDPR), and you might be obligated to keep your customer data in the EU itself. In this circumstance, you are responsible for ensuring that the data that you collect remains in the Google Cloud regions in the EU. For more information about how we meet our GDPR obligations, see GDPR and Google Cloud.
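
One way to keep track of this responsibility is to check an inventory of resource locations against an allowlist. The following sketch assumes a hypothetical inventory that maps resource names to region names. The allowlist is illustrative and deliberately incomplete, so build the real list from your own compliance requirements.

```python
# Illustrative allowlist only: a few Google Cloud regions that are located
# in EU member states. This list is an assumption, not a compliance ruling.
ALLOWED_EU_REGIONS = frozenset({
    "europe-west1",   # Belgium
    "europe-west3",   # Frankfurt, Germany
    "europe-west4",   # Netherlands
    "europe-north1",  # Finland
})


def residency_violations(resource_locations):
    """Return the resources whose location is outside the allowlist."""
    return {
        name: location
        for name, location in resource_locations.items()
        if location.lower() not in ALLOWED_EU_REGIONS
    }


# Hypothetical inventory: resource name -> region that it is deployed in.
inventory = {
    "customer-db": "europe-west1",
    "analytics-bucket": "us-central1",
    "backup-bucket": "europe-west2",  # London, which is outside the EU
}
print(residency_violations(inventory))
```

An explicit allowlist is used instead of a check on the `europe-` prefix because not every region with that prefix is in an EU member state.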

For information about the requirements related to your region, see Compliance offerings. If your scenario is particularly complicated, we recommend speaking with our sales team or one of our partners to help you evaluate your security responsibilities.

Challenges for shared responsibility

Though shared responsibility helps define the security roles that you or the cloud provider has, relying on shared responsibility can still create challenges. Consider the following scenarios:

  • Most cloud security breaches are the direct result of misconfiguration (listed as number 3 in the Cloud Security Alliance's Pandemic 11 Report) and this trend is expected to increase. Cloud products are constantly changing, and new ones are constantly being launched. Keeping up with constant change can seem overwhelming. Customers need cloud providers to provide them with opinionated best practices to help keep up with the change, starting with best practices by default and having a baseline secure configuration.
  • Though dividing items by cloud services is helpful, many enterprises have workloads that require multiple cloud services types. In this circumstance, you must consider how various security controls for these services interact, including whether they overlap between and across services. For example, you might have an on-premises application that you're migrating to Compute Engine, use Google Workspace for corporate email, and also run BigQuery to analyze data to improve your products.
  • Your business and markets are constantly changing as regulations change, as you enter new markets, or as you acquire other companies. Your new markets might have different requirements, and your new acquisition might host its workloads on another cloud. To manage the constant changes, you must constantly re-assess your risk profile and be able to implement new controls quickly.
  • How and where to manage your data encryption keys is an important decision that ties in with your responsibilities to protect your data. The option that you choose depends on your regulatory requirements, whether you're running a hybrid cloud environment or still have an on-premises environment, and the sensitivity of the data that you're processing and storing.
  • Incident management is an important, and often overlooked, area where your responsibilities and the cloud provider responsibilities aren't easily defined. Many incidents require close collaboration and support from the cloud provider to help investigate and mitigate them. Other incidents can result from poorly configured cloud resources or stolen credentials, and ensuring that you meet the best practices for securing your resources and accounts can be quite challenging.
  • Advanced persistent threats (APTs) and new vulnerabilities can impact your workloads in ways that you might not consider when you start your cloud transformation. Staying up to date on the changing landscape, and knowing who is responsible for threat mitigation, is difficult, particularly if your business doesn't have a large security team.

Shared fate

We developed shared fate in Google Cloud to start addressing the challenges that the shared responsibility model doesn't address. Shared fate focuses on how all parties can better interact to continuously improve security. Shared fate builds on the shared responsibility model because it views the relationship between cloud provider and customer as an ongoing partnership to improve security.

Shared fate is about us taking responsibility for making Google Cloud more secure. Shared fate includes helping you get started with a secured landing zone and being clear, opinionated, and transparent about recommended security controls, settings, and associated best practices. It includes helping you better quantify and manage your risk with cyber-insurance, using our Risk Protection Program. Using shared fate, we want to evolve from the standard shared responsibility framework to a better model that helps you secure your business and build trust in Google Cloud.

The following sections describe various components of shared fate.

Help getting started

A key component of shared fate is the resources that we provide to help you get started in a secure configuration in Google Cloud. Starting with a secure configuration helps reduce the issue of misconfigurations, which are the root cause of most security breaches.

Our resources include the following:

  • Enterprise foundations blueprint that discusses top security concerns and our top recommendations.
  • Secure blueprints that let you deploy and maintain secure solutions using infrastructure as code (IaC). Blueprints have our security recommendations enabled by default. Many blueprints are created by Google security teams and managed as products. This support means that they're updated regularly, go through a rigorous testing process, and receive attestations from third-party testing groups. Blueprints include the enterprise foundations blueprint and the secured data warehouse blueprint.

  • Architecture Framework best practices that address the top recommendations for building security into your designs. The Architecture Framework includes a security section and a community zone that you can use to connect with experts and peers.

  • Landing zone navigation guides that step you through the top decisions that you need to make to build a secure foundation for your workloads, including resource hierarchy, identity onboarding, security and key management, and network structure.

Risk Protection Program

Shared fate also includes the Risk Protection Program (currently in preview), which helps you use the power of Google Cloud as a platform to manage risk, rather than just seeing cloud workloads as another source of risk that you need to manage. The Risk Protection Program is a collaboration between Google Cloud and two leading cyber insurance companies, Munich Re and Allianz Global Corporate & Specialty.

The Risk Protection Program includes Risk Manager, which provides data-driven insights that you can use to better understand your cloud security posture. If you're looking for cyber insurance coverage, you can share these insights from Risk Manager directly with our insurance partners to obtain a quote. For more information, see Google Cloud Risk Protection Program now in Preview.

Help with deployment and governance

Shared fate also helps with your continued governance of your environment. For example, we focus efforts on products such as the following:

Putting shared responsibility and shared fate into practice

As part of your planning process, consider the following actions to help you understand and implement appropriate security controls:

  • Create a list of the types of workloads that you will host in Google Cloud, and whether they require IaaS, PaaS, or SaaS services. You can use the shared responsibility diagram as a checklist to ensure that you know the security controls that you need to consider.
  • Create a list of regulatory requirements that you must comply with, and access resources in the Compliance resource center that relate to those requirements.
  • Review the list of available blueprints and architectures in the Architecture Center for the security controls that you require for your particular workloads. The blueprints provide a list of recommended controls and the IaC code that you require to deploy that architecture.
  • Use the landing zone documentation and the recommendations in the enterprise foundations guide to design a resource hierarchy and network architecture that meets your requirements. You can use the opinionated workload blueprints, like the secured data warehouse, to accelerate your development process.
  • After you deploy your workloads, verify that you're meeting your security responsibilities using services such as the Risk Manager, Assured Workloads, Policy Intelligence tools, and Security Command Center Premium.

For more information, see the CISO's Guide to Cloud Transformation paper.

Security principles

This document in the Google Cloud Architecture Framework explains core principles for running secure and compliant services on Google Cloud. Many of the security principles that you're familiar with in your on-premises environment apply to cloud environments.

Build a layered security approach

Implement security at each level in your application and infrastructure by applying a defense-in-depth approach. Use the features in each product to limit access and configure encryption where appropriate.

Design for secured decoupled systems

Simplify system design to accommodate flexibility where possible, and document security requirements for each component. Incorporate a robust secured mechanism to account for resiliency and recovery.

Automate deployment of sensitive tasks

Take humans out of the workstream by automating deployment and other admin tasks.

Automate security monitoring

Use automated tools to monitor your application and infrastructure. To scan your infrastructure for vulnerabilities and detect security incidents, use automated scanning in your continuous integration and continuous deployment (CI/CD) pipelines.

Meet the compliance requirements for your regions

Be mindful that you might need to obfuscate or redact personally identifiable information (PII) to meet your regulatory requirements. Where possible, automate your compliance efforts. For example, use Sensitive Data Protection and Dataflow to automate the PII redaction job before new data is stored in the system.
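
To make the redact-before-store idea concrete, here is a deliberately simplified stand-in written in plain Python. It does not use the Sensitive Data Protection or Dataflow APIs. The two regular expressions, the placeholder format, and the record shape are all assumptions for illustration, and real PII detection covers far more than these patterns.

```python
import re

# Simplified, illustrative patterns. Real detection of PII is much broader
# than this and is the job of a dedicated inspection service.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each match with a placeholder that names the PII type."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def redact_records(records):
    """Redact every string field of each record before it is stored."""
    for record in records:
        yield {
            key: redact(value) if isinstance(value, str) else value
            for key, value in record.items()
        }


# Hypothetical incoming record.
incoming = [{"id": 7, "note": "Contact jane@example.com, SSN 123-45-6789"}]
for clean in redact_records(incoming):
    print(clean)
```

The generator form mirrors a pipeline step: each record is transformed on its way to storage, so unredacted data is never written.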

Comply with data residency and sovereignty requirements

You might have internal (or external) requirements that require you to control the locations of data storage and processing. These requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and culture. Data residency describes where your data is stored. To help comply with data residency requirements, Google Cloud lets you control where data is stored, how data is accessed, and how it's processed.

Shift security left

DevOps and deployment automation let your organization increase the velocity of delivering products. To help ensure that your products remain secure, incorporate security processes from the start of the development process. For example, you can do the following:

  • Test for security issues in code early in the deployment pipeline.
  • Scan container images and the cloud infrastructure on an ongoing basis.
  • Automate detection of misconfiguration and security anti-patterns. For example, use automation to look for secrets that are hard-coded in applications or in configuration.
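
The last item in the preceding list can be sketched as a small check that runs early in a pipeline. The patterns below are illustrative assumptions and far from exhaustive. A dedicated secret scanner is the better choice for real use.

```python
import re

# Illustrative patterns only; dedicated secret scanners cover far more cases.
SECRET_PATTERNS = {
    "private key block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hard-coded credential": re.compile(
        r"(?i)(password|passwd|secret|api_key|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"
    ),
}


def find_secrets(source):
    """Return (line_number, finding) pairs for suspicious lines."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for finding, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, finding))
    return findings


# Hypothetical source file content to scan.
sample = '''import os
DB_PASSWORD = "hunter2hunter2"
api_key = os.environ["API_KEY"]
'''
for number, finding in find_secrets(sample):
    print(f"line {number}: {finding}")
```

In this sample, only the literal assigned on the second line is flagged. The value that is read from the environment on the third line is not.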

Manage risk with controls

This document in the Google Cloud Architecture Framework describes best practices for managing risks in a cloud deployment. Performing a careful analysis of the risks that apply to your organization allows you to determine the security controls that you require. You should complete risk analysis before you deploy workloads on Google Cloud, and regularly afterwards as your business needs, regulatory requirements, and the threats relevant to your organization change.

Identify risks to your organization

Before you create and deploy resources on Google Cloud, complete a risk assessment to determine what security features you need in order to meet your internal security requirements and external regulatory requirements. Your risk assessment provides you with a catalog of risks that are relevant to you, and tells you how capable your organization is in detecting and counteracting security threats.

Your risks in a cloud environment differ from your risks in an on-premises environment due to the shared responsibility arrangement that you enter with your cloud provider. For example, in an on-premises environment you need to mitigate vulnerabilities in the hardware stack. In contrast, in a cloud environment these risks are borne by the cloud provider.

In addition, your risks differ depending on how you plan on using Google Cloud. Are you transferring some of your workloads to Google Cloud, or all of them? Are you using Google Cloud only for disaster recovery purposes? Are you setting up a hybrid cloud environment?

We recommend that you use an industry-standard risk assessment framework that applies to cloud environments and to your regulatory requirements. For example, the Cloud Security Alliance (CSA) provides the Cloud Controls Matrix (CCM). In addition, there are threat models such as OWASP application threat modeling that provide you with a list of potential gaps, and that suggest actions to remediate any gaps that are found. You can check our partner directory for a list of experts in conducting risk assessments for Google Cloud.

To help catalog your risks, consider Risk Manager, which is part of the Risk Protection Program. (This program is currently in preview.) Risk Manager scans your workloads to help you understand your business risks. Its detailed reports provide you with a security baseline. In addition, you can use Risk Manager reports to compare your risks against the risks outlined in the Center for Internet Security (CIS) Benchmark.

After you catalog your risks, you must determine how to address them—that is, whether you want to accept, avoid, transfer, or mitigate them. The following section describes mitigation controls.
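
A risk catalog with a treatment decision per entry can be sketched as follows. The four treatments come from the text. The 1 to 5 scales, the likelihood-times-impact score, and the sample risks are assumptions for illustration, not a prescribed methodology.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Treatment(Enum):
    ACCEPT = "accept"
    AVOID = "avoid"
    TRANSFER = "transfer"
    MITIGATE = "mitigate"


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    treatment: Optional[Treatment] = None

    @property
    def score(self):
        return self.likelihood * self.impact


def untreated(register):
    """Risks with no treatment decision yet, highest score first."""
    pending = [risk for risk in register if risk.treatment is None]
    return sorted(pending, key=lambda risk: risk.score, reverse=True)


# Hypothetical catalog entries.
register = [
    Risk("misconfigured storage bucket", likelihood=4, impact=4),
    Risk("stolen credentials", likelihood=3, impact=5, treatment=Treatment.MITIGATE),
    Risk("regional outage", likelihood=1, impact=4),
]
for risk in untreated(register):
    print(risk.score, risk.name)
```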

Mitigate your risks

You can mitigate risks using technical controls, contractual protections, and third-party verifications or attestations. The following table lists how you can use these mitigations when you adopt new public cloud services.

Technical controls
Technical controls refer to the features and technologies that you use to protect your environment. These include built-in cloud security controls, such as firewalls and logging. Technical controls can also include using third-party tools to reinforce or support your security strategy.

There are two categories of technical controls:
  • Google Cloud includes various security controls to let you mitigate the risks that apply to you. For example, if you have an on-premises environment, you can use Cloud VPN and Cloud Interconnect to secure the connection between your on-premises resources and your cloud resources.
  • Google has robust internal controls and auditing to protect against insider access to customer data. Our audit logs provide our customers with near real-time logs of Google administrator access on Google Cloud.

Contractual protections
Contractual protections refer to the legal commitments made by us regarding Google Cloud services.

Google is committed to maintaining and expanding our compliance portfolio. The Cloud Data Processing Addendum (CDPA) document defines our commitment to maintaining our ISO 27001, 27017, and 27018 certifications and to updating our SOC 2 and SOC 3 reports every 12 months.

The CDPA also outlines the access controls that are in place to limit access by Google support engineers to customers' environments, and it describes our rigorous logging and approval process.

We recommend that you review Google Cloud's contractual controls with your legal and regulatory experts and verify that they meet your requirements. If you need more information, contact your technical account representative.

Third-party verifications or attestations
Third-party verifications or attestations refer to having a third-party vendor audit the cloud provider to ensure that the provider meets compliance requirements. For example, Google was audited by a third party for ISO 27017 compliance.

You can see the current Google Cloud certifications and letters of attestation at the Compliance Resource Center.

Manage your assets

This document in the Google Cloud Architecture Framework provides best practices for managing assets.

Asset management is an important part of your business requirements analysis. You must know what assets you have, and you must have a good understanding of all your assets, their value, and any critical paths or processes related to them. You must have an accurate asset inventory before you can design any sort of security controls to protect your assets.

To manage security incidents and meet your organization's regulatory requirements, you need an accurate and up-to-date asset inventory that includes a way to analyze historical data. You must be able to track your assets, including how their risk exposure might change over time.

Moving to Google Cloud means that you need to modify your asset management processes to adapt to a cloud environment. For example, one of the benefits of moving to the cloud is that you increase your organization's ability to scale quickly. However, the ability to scale quickly can cause shadow IT issues, in which your employees create cloud resources that aren't properly managed and secured. Therefore, your asset management processes must provide sufficient flexibility for employees to get their work done while also providing for appropriate security controls.

Use cloud asset management tools

Google Cloud asset management tools are tailored specifically to our environment and to top customer use cases.

One of these tools is Cloud Asset Inventory, which provides you with both real-time information on the current state of your resources and a five-week history. By using this service, you can get an organization-wide snapshot of your inventory for a wide variety of Google Cloud resources and policies. Automation tools can then use the snapshot for monitoring or for policy enforcement, or the tools can archive the snapshot for compliance auditing. If you want to analyze changes to the assets, Cloud Asset Inventory also lets you export metadata history.
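
To illustrate what analyzing changes between snapshots can look like, the following sketch compares two inventory snapshots that have already been exported and loaded into memory. The snapshot shape (asset name mapped to a metadata dictionary) is a simplifying assumption and not the Cloud Asset Inventory export format.

```python
def diff_snapshots(before, after):
    """Compare two snapshots that map asset name -> metadata dict."""
    created = sorted(after.keys() - before.keys())
    deleted = sorted(before.keys() - after.keys())
    changed = sorted(
        name
        for name in before.keys() & after.keys()
        if before[name] != after[name]
    )
    return {"created": created, "deleted": deleted, "changed": changed}


# Hypothetical snapshots taken a week apart.
last_week = {
    "vm-frontend": {"machine_type": "e2-medium"},
    "bucket-logs": {"public": False},
}
today = {
    "vm-frontend": {"machine_type": "e2-standard-4"},
    "bucket-exports": {"public": True},
}
print(diff_snapshots(last_week, today))
```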

For more information about Cloud Asset Inventory, see Custom solution to respond to asset changes and Detective controls.

Automate asset management

Automation lets you quickly create and manage assets based on the security requirements that you specify. You can automate aspects of the asset lifecycle in the following ways:

  • Deploy your cloud infrastructure using automation tools such as Terraform. Google Cloud provides the enterprise foundations blueprint, which helps you set up infrastructure resources that meet security best practices. In addition, it configures asset changes and policy compliance notifications in Cloud Asset Inventory.
  • Deploy your applications using automation tools such as Cloud Run and Artifact Registry.

Monitor for deviations from your compliance policies

Deviations from policies can occur during all phases of the asset lifecycle. For example, assets might be created without the proper security controls, or their privileges might be escalated. Similarly, assets might be abandoned without the appropriate end-of-life procedures being followed.

To help avoid these scenarios, we recommend that you monitor assets for deviation from compliance. The set of assets that you monitor depends on the results of your risk assessment and business requirements. For more information about monitoring assets, see Monitoring asset changes.
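
A compliance check over an asset inventory can be sketched as a function that lists each asset's deviations. The policy here (two required labels and no public access) and the asset record shape are hypothetical. Substitute the rules that follow from your own risk assessment.

```python
# Hypothetical policy: every asset must carry these labels.
REQUIRED_LABELS = {"owner", "environment"}


def find_deviations(assets):
    """Map asset name -> list of policy deviations (assets with none are omitted)."""
    deviations = {}
    for asset in assets:
        problems = []
        missing = REQUIRED_LABELS - set(asset.get("labels", {}))
        if missing:
            problems.append("missing labels: " + ", ".join(sorted(missing)))
        if asset.get("public", False):
            problems.append("publicly accessible")
        if problems:
            deviations[asset["name"]] = problems
    return deviations


# Hypothetical inventory records.
assets = [
    {"name": "bucket-reports", "labels": {"owner": "fin", "environment": "prod"}},
    {"name": "bucket-scratch", "labels": {"owner": "dev"}, "public": True},
    {"name": "vm-abandoned"},
]
for name, problems in find_deviations(assets).items():
    print(name, "->", "; ".join(problems))
```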

Integrate with your existing asset management monitoring systems

If you already use a SIEM system or other monitoring system, integrate your Google Cloud assets with that system. Integration ensures that your organization has a single, comprehensive view into all resources, regardless of environment. For more information, see Export Google Cloud security data to your SIEM system and Scenarios for exporting Cloud Logging data: Splunk.

Use data analysis to enrich your monitoring

You can export your inventory to a BigQuery table or Cloud Storage bucket for additional analysis.

Manage identity and access

This document in the Google Cloud Architecture Framework provides best practices for managing identity and access.

The practice of identity and access management (generally referred to as IAM) helps you ensure that the right people can access the right resources. IAM addresses the following aspects of authentication and authorization:

  • Account management, including provisioning
  • Identity governance
  • Authentication
  • Access control (authorization)
  • Identity federation

Managing IAM can be challenging when you have different environments or you use multiple identity providers. However, it's critical that you set up a system that can meet your business requirements while mitigating risks.

The recommendations in this document help you review your current IAM policies and procedures and determine which of those you might need to modify for your workloads in Google Cloud. For example, you must review the following:

  • Whether you can use existing groups to manage access or whether you need to create new ones.
  • Your authentication requirements (such as multi-factor authentication (MFA) using a token).
  • The impact of service accounts on your current policies.
  • If you're using Google Cloud for disaster recovery, maintaining appropriate separation of duties.

Within Google Cloud, you use Cloud Identity to authenticate your users and resources and Google's Identity and Access Management (IAM) product to dictate resource access. Administrators can restrict access at the organization, folder, project, and resource level. Google IAM policies dictate who can do what on which resources. Correctly configured IAM policies help secure your environment by preventing unauthorized access to resources.
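
The following sketch illustrates how access that is granted at a higher level of the resource hierarchy is inherited by the resources below it. It is a simplified model and not the IAM API: the hierarchy, the members, and the helper function are hypothetical, and the model ignores group membership expansion, IAM Conditions, and deny policies.

```python
# Hypothetical resource hierarchy: child -> parent.
PARENT = {
    "projects/billing-app": "folders/finance",
    "folders/finance": "organizations/example",
}

# Hypothetical allow policies: resource -> role -> members.
POLICIES = {
    "organizations/example": {
        "roles/logging.viewer": {"group:auditors@example.com"},
    },
    "folders/finance": {
        "roles/bigquery.dataEditor": {"group:finance-devs@example.com"},
    },
    "projects/billing-app": {
        "roles/storage.admin": {"user:alice@example.com"},
    },
}


def effective_roles(member, resource):
    """Roles that a member holds on a resource, including inherited ones."""
    roles = set()
    current = resource
    while current is not None:
        for role, members in POLICIES.get(current, {}).items():
            if member in members:
                roles.add(role)
        current = PARENT.get(current)  # walk up toward the organization
    return roles


print(effective_roles("group:auditors@example.com", "projects/billing-app"))
print(effective_roles("user:alice@example.com", "folders/finance"))
```

Because grants flow downward, a role that is granted at the organization level applies to every project under it, which is why broad grants at the top of the hierarchy deserve the closest review.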

For more information, see Overview of identity and access management.

Use a single identity provider

Many of our customers have user accounts that are managed and provisioned by identity providers outside of Google Cloud. Google Cloud supports federation with most identity providers and with on-premises directories such as Active Directory.

Most identity providers let you enable single sign-on (SSO) for your users and groups. For applications that you deploy on Google Cloud and that use your external identity provider, you can extend your identity provider to Google Cloud. For more information, see Reference architectures and Patterns for authenticating corporate users in a hybrid environment.

If you don't have an existing identity provider, you can use either Cloud Identity Premium or Google Workspace to manage identities for your employees.

Protect the super admin account

The super admin account (managed by Google Workspace or Cloud Identity) lets you create your Google Cloud organization. This admin account is therefore highly privileged. Best practices for this account include the following:

  • Create a new account for this purpose; don't use an existing user account.
  • Create and protect backup accounts.
  • Enable MFA.

For more information, see Super administrator account best practices.

Plan your use of service accounts

A service account is a Google account that applications can use to call the Google API of a service.

Unlike your user accounts, service accounts are created and managed within Google Cloud. Service accounts also authenticate differently than user accounts:

  • To let an application running on Google Cloud authenticate using a service account, you can attach a service account to the compute resource the application runs on.
  • To let an application running on GKE authenticate using a service account, you can use Workload Identity.
  • To let applications running outside of Google Cloud authenticate using a service account, you can use Workload Identity Federation.

When you use service accounts, you must consider an appropriate segregation of duties during your design process. Note the API calls that you must make, and determine the service accounts and associated roles that the API calls require. For example, if you're setting up a BigQuery data warehouse, you probably need identities for at least the following processes and services:

  • Cloud Storage or Pub/Sub, depending on whether you're providing a batch file or creating a streaming service.
  • Dataflow and Sensitive Data Protection to de-identify sensitive data.

For more information, see Best practices for working with service accounts.
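
One way to check segregation of duties during design is to write down which service account each process uses and then look for accounts that serve more than one process. The following sketch does that for the BigQuery example. The process names, the service account addresses, and the role assignments are illustrative assumptions, so verify the roles against what your API calls require.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical design plan: process -> (service account, roles it needs).
plan: Dict[str, Tuple[str, List[str]]] = {
    "ingest-batch": (
        "sa-ingest@example-project.iam.gserviceaccount.com",
        ["roles/storage.objectViewer"],
    ),
    "ingest-stream": (
        "sa-stream@example-project.iam.gserviceaccount.com",
        ["roles/pubsub.subscriber"],
    ),
    "deidentify": (
        "sa-deid@example-project.iam.gserviceaccount.com",
        ["roles/dataflow.worker", "roles/dlp.user"],
    ),
    "warehouse-load": (
        "sa-ingest@example-project.iam.gserviceaccount.com",
        ["roles/bigquery.dataEditor"],
    ),
}


def shared_accounts(plan: Dict[str, Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Return the service accounts that more than one process relies on."""
    users = defaultdict(list)
    for process, (account, _roles) in plan.items():
        users[account].append(process)
    return {acct: sorted(procs) for acct, procs in users.items() if len(procs) > 1}
```

In this plan, the ingest account is reused for loading the warehouse, which is the kind of overlap that the check is meant to surface.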

Update your identity processes for the cloud

Identity governance lets you track access, risks, and policy violations so that you can support your regulatory requirements. This governance requires that you have processes and policies in place so that you can grant and audit access control roles and permissions to users. Your processes and policies must reflect the requirements of your environments—for example, test, development, and production.

Before you deploy workloads on Google Cloud, review your current identity processes and update them if appropriate. Ensure that you appropriately plan for the types of accounts that your organization needs and that you have a good understanding of their role and access requirements.

To help you audit Google IAM activities, Google Cloud creates audit logs, which include the following:

  • Administrator activity. This logging can't be disabled.
  • Data access activity. You must enable this logging.

If necessary for compliance purposes, or if you want to set up log analysis (for example, with your SIEM system), you can export the logs. Because logs can increase your storage requirements, they might affect your costs. Ensure that you log only the actions that you require, and set appropriate retention schedules.
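
To log only the actions that you require, you can enable Data Access audit logs per service and per log type. The following sketch builds an auditConfigs fragment in the shape that an IAM policy uses. The helper name and the chosen services are assumptions for illustration, and the fragment still has to be merged into a policy and applied with your own tooling.

```python
from typing import Dict, Iterable, List

# Log types used in Data Access audit configuration.
VALID_LOG_TYPES = {"ADMIN_READ", "DATA_READ", "DATA_WRITE"}


def data_access_audit_config(
    services: Iterable[str],
    log_types: Iterable[str] = ("DATA_READ", "DATA_WRITE"),
) -> Dict[str, List[dict]]:
    """Build an auditConfigs fragment that enables only the given log types."""
    log_types = list(log_types)
    unknown = set(log_types) - VALID_LOG_TYPES
    if unknown:
        raise ValueError(f"unknown log types: {sorted(unknown)}")
    return {
        "auditConfigs": [
            {
                "service": service,
                "auditLogConfigs": [{"logType": t} for t in log_types],
            }
            for service in services
        ]
    }


# Example: record only writes for Cloud Storage to limit log volume.
fragment = data_access_audit_config(["storage.googleapis.com"], ["DATA_WRITE"])
```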

Set up SSO and MFA

Your identity provider manages user account authentication. Federated identities can authenticate to Google Cloud using SSO. For privileged accounts, such as super admins, you should configure MFA. Titan Security Keys are physical tokens that you can use for two-factor authentication (2FA) to help prevent phishing attacks.

Cloud Identity supports MFA using various methods. For more information, see Enforce uniform MFA to company-owned resources.

Google Cloud supports authentication for workload identities using the OAuth 2.0 protocol or signed JSON Web Tokens (JWT). For more information about workload authentication, see Authentication overview.

Implement least privilege and separation of duties

You must ensure that the right individuals get access only to the resources and services that they need in order to perform their jobs. That is, you should follow the principle of least privilege. In addition, you must ensure there is an appropriate separation of duties.

Overprovisioning user access can increase the risk of insider threat, misconfigured resources, and non-compliance with audits. Underprovisioning permissions can prevent users from being able to access the resources they need in order to complete their tasks.

One way to avoid overprovisioning is to implement just-in-time privileged access: provide privileged access only as needed, and grant it only temporarily.
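
A temporary grant can be expressed as a role binding with an IAM Conditions expression that compares request.time to an expiry timestamp. The following is a minimal sketch of how such a binding might be built. The helper name, the role, and the member are hypothetical, and applying the binding to a policy is left to your own tooling.

```python
from datetime import datetime, timedelta, timezone


def temporary_binding(role: str, member: str, granted_at: datetime,
                      duration: timedelta) -> dict:
    """Build a role binding whose condition expires after the given duration."""
    if granted_at.tzinfo is None:
        raise ValueError("granted_at must be timezone-aware")
    expires = (granted_at + duration).astimezone(timezone.utc)
    stamp = expires.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "role": role,
        "members": [member],
        "condition": {
            "title": "temporary-access",
            "description": f"Expires at {stamp}",
            # The binding stops applying once the request time passes the expiry.
            "expression": f'request.time < timestamp("{stamp}")',
        },
    }


# Example: a two-hour grant for a hypothetical on-call engineer.
binding = temporary_binding(
    "roles/compute.admin",
    "user:oncall@example.com",
    datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc),
    timedelta(hours=2),
)
```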

Be aware that when a Google Cloud organization is created, all users in your domain are granted the Billing Account Creator and Project Creator roles by default. Identify the users who will perform these duties, and revoke these roles from other users. For more information, see Creating and managing organizations.

For more information about how roles and permissions work in Google Cloud, see Overview and Understanding roles in the IAM documentation. For more information about enforcing least privilege, see Enforce least privilege with role recommendations.

Audit access

To monitor the activities of privileged accounts for deviations from approved conditions, use Cloud Audit Logs. Cloud Audit Logs records the actions that members in your Google Cloud organization have taken in your Google Cloud resources. You can work with various audit log types across Google services. For more information, see Using Cloud Audit Logs to Help Manage Insider Risk (video).

Use IAM recommender to track usage and to adjust permissions where appropriate. The roles that are recommended by IAM recommender can help you determine which roles to grant to a user based on the user's past behavior and on other criteria. For more information, see Best practices for role recommendations.

To audit and control access to your resources by Google support and engineering personnel, you can use Access Transparency. Access Transparency records the actions taken by Google personnel. Use Access Approval, which is part of Access Transparency, to grant explicit approval every time customer content is accessed. For more information, see Control cloud administrators' access to your data.

Automate your policy controls

Set access permissions programmatically whenever possible. For best practices, see Organization policy constraints. The Terraform scripts for the enterprise foundations blueprint are in the example foundation repository.

Google Cloud includes Policy Intelligence, which lets you automatically review and update your access permissions. Policy Intelligence includes the Recommender, Policy Troubleshooter, and Policy Analyzer tools, which do the following:

  • Provide recommendations for IAM role assignment.
  • Monitor and help prevent overly permissive IAM policies.
  • Assist with troubleshooting access-control-related issues.

Set restrictions on resources

Google IAM focuses on who, and it lets you authorize who can act on specific resources based on permissions. The Organization Policy Service focuses on what, and it lets you set restrictions on resources to specify how they can be configured. For example, you can use an organization policy to do the following:

In addition to using organization policies for these tasks, you can restrict access to resources using one of the following methods:

  • Use tags to manage access to your resources without defining the access permissions on each resource. Instead, you add the tag and then set the access definition for the tag itself.
  • Use IAM Conditions for conditional, attribute-based control of access to resources.
  • Implement defense-in-depth using VPC Service Controls to further restrict access to resources.

For more information about resource management, see Decide a resource hierarchy for your Google Cloud landing zone.

Implement compute and container security

Google Cloud includes controls to protect your compute resources and Google Kubernetes Engine (GKE) container resources. This document in the Google Cloud Architecture Framework describes key controls and best practices for using them.

Use hardened and curated VM images

Google Cloud includes Shielded VM, which allows you to harden your VM instances. Shielded VM is designed to prevent malicious code from being loaded during the boot cycle. It provides boot security, monitors integrity, and uses the Virtual Trusted Platform Module (vTPM). Use Shielded VM for sensitive workloads.

In addition to using Shielded VM, you can use Google Cloud partner solutions to further protect your VMs. Many partner solutions offered on Google Cloud integrate with Security Command Center, which provides event threat detection and health monitoring. You can use partners for advanced threat analysis or extra runtime security.

Use Confidential Computing for processing sensitive data

By default, Google Cloud encrypts data at rest and in transit across the network, but data isn't encrypted while it's in use in memory. If your organization handles confidential data, you need to mitigate against threats that undermine the confidentiality and integrity of either the application or the data in system memory. Confidential data includes personally identifiable information (PII), financial data, and health information.

Confidential Computing builds on Shielded VM. It protects data in use by performing computation in a hardware-based trusted execution environment. This type of secure and isolated environment helps prevent unauthorized access or modification of applications and data while that data is in use. A trusted execution environment also increases the security assurances for organizations that manage sensitive and regulated data.

In Google Cloud, you can enable Confidential Computing by running Confidential VMs or Confidential GKE nodes. Turn on Confidential Computing when you're processing confidential workloads, or when you have confidential data (for example, secrets) that must be exposed while they are processed. For more information, see the Confidential Computing Consortium.

Protect VMs and containers

OS Login lets your employees connect to your VMs using Identity and Access Management (IAM) permissions as the source of truth instead of relying on SSH keys. You therefore don't have to manage SSH keys throughout your organization. OS Login ties an administrator's access to their employee lifecycle, which means that if employees move to another role or leave your organization, their access is revoked with their account. OS Login also supports two-factor authentication, which adds an extra layer of security from account takeover attacks.

In GKE and App Engine, application instances run within Docker containers. To enable a defined risk profile and to restrict employees from making changes to containers, ensure that your containers are stateless and immutable. The principle of immutability means that your employees do not modify the container or access it interactively. If it must be changed, you build a new image and redeploy. Enable SSH access to the underlying containers only in specific debugging scenarios.

Disable external IP addresses unless they're necessary

To disable external IP address allocation (video) for your production VMs and to prevent the use of external load balancers, you can use organization policies. If you require your VMs to reach the internet or your on-premises data center, you can enable a Cloud NAT gateway.
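
As a sketch of what such an organization policy might look like, the following helper builds a policy document for the compute.vmExternalIpAccess list constraint with a deny-all rule. The helper name and project ID are hypothetical, and the document shape follows the Organization Policy API as we understand it, so confirm it against the current reference before you apply it.

```python
import json


def deny_external_ips_policy(project_id: str) -> dict:
    """Build a policy that denies external IP addresses for all VMs in a project."""
    return {
        "name": f"projects/{project_id}/policies/compute.vmExternalIpAccess",
        # For a list constraint, denyAll blocks every value (here, every VM).
        "spec": {"rules": [{"denyAll": True}]},
    }


# Example: render the policy as JSON for review or for use in a pipeline.
policy_json = json.dumps(deny_external_ips_policy("example-prod-project"), indent=2)
```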

You can deploy private clusters in GKE. In a private cluster, nodes have only internal IP addresses, which means that nodes and Pods are isolated from the internet by default. You can also define a network policy to manage Pod-to-Pod communication in the cluster. For more information, see Private access options for services.

Monitor your compute instance and GKE usage

Cloud Audit Logs are automatically enabled for Compute Engine and GKE. Audit logs let you automatically capture all activities with your cluster and monitor for any suspicious activity.

You can integrate GKE with partner products for runtime security. You can integrate these solutions with the Security Command Center to provide you with a single interface for monitoring your applications.

Keep your images and clusters up to date

Google Cloud provides curated OS images that are patched regularly. You can bring custom images and run them on Compute Engine, but if you do, you have to patch them yourself. Google Cloud regularly updates OS images to mitigate new vulnerabilities as described in security bulletins and provides remediation to fix vulnerabilities for existing deployments.

If you're using GKE, we recommend that you enable node auto-upgrade to have Google update your cluster nodes with the latest patches. Google manages GKE control planes, which are automatically updated and patched. In addition, use Google-curated container-optimized images for your deployment. Google regularly patches and updates these images.

Control access to your images and clusters

It's important to know who can create and launch instances. You can control this access using IAM. For information about how to determine what access workloads need, see Plan your workload identities.

In addition, you can use VPC Service Controls to define custom quotas on projects so that you can limit who can launch images. For more information, see the Secure your network section.

To provide infrastructure security for your cluster, GKE lets you use IAM with role-based access control (RBAC) to manage access to your cluster and namespaces.

Isolate containers in a sandbox

Use GKE Sandbox to deploy multi-tenant applications that need an extra layer of security and isolation from their host kernel. For example, use GKE Sandbox when you are executing unknown or untrusted code. GKE Sandbox is a container isolation solution that provides a second layer of defense between containerized workloads on GKE.

GKE Sandbox was built for applications that have low I/O requirements but that are highly scaled. These containerized workloads need to maintain their speed and performance, but might also involve untrusted code that demands added security. Use gVisor, a container runtime sandbox, to provide additional security isolation between applications and the host kernel. gVisor provides additional integrity checks and limits the scope of access for a service. It's not a container hardening service to protect against external threats. For more information about gVisor, see gVisor: Protecting GKE and serverless users in the real world.

Secure your network

This document in the Google Cloud Architecture Framework provides best practices for securing your network.

Extending your existing network to include cloud environments has many implications for security. Your on-premises approach to multi-layered defenses likely involves a distinct perimeter between the internet and your internal network. You probably protect the perimeter by using mechanisms like physical firewalls, routers, and intrusion detection systems. Because the boundary is clearly defined, you can monitor for intrusions and respond accordingly.

When you move to the cloud (either completely or in a hybrid approach), you move beyond your on-premises perimeter. This document describes ways that you can continue to secure your organization's data and workloads on Google Cloud. As mentioned in Manage risks with controls, how you set up and secure your Google Cloud network depends on your business requirements and risk appetite.

This section assumes that you've already created a basic architecture diagram of your Google Cloud network components. For an example diagram, see Hub-and-spoke.

Deploy zero trust networks

Moving to the cloud means that your network trust model must change. Because your users and your workloads are no longer behind your on-premises perimeter, you can't use perimeter protections in the same way to create a trusted, inner network. The zero trust security model means that no one is trusted by default, whether they are inside or outside of your organization's network. When verifying access requests, the zero trust security model requires you to check both the user's identity and context. Unlike a VPN-based approach, the zero trust model shifts access controls from the network perimeter to the users and devices.

In Google Cloud, you can use Chrome Enterprise Premium as your zero trust solution. Chrome Enterprise Premium provides threat and data protection and additional access controls. For more information about how to set it up, see Getting started with Chrome Enterprise Premium.

In addition to Chrome Enterprise Premium, Google Cloud includes Identity-Aware Proxy (IAP). IAP lets you extend zero trust security to your applications both within Google Cloud and on-premises. IAP uses access control policies to provide authentication and authorization for users who access your applications and resources.

Secure connections to your on-premises or multicloud environments

Many organizations have workloads both in cloud environments and on-premises. In addition, for resiliency, some organizations use multicloud solutions. In these scenarios, it's critical to secure your connectivity between all of your environments.

Google Cloud includes private access methods for VMs that are supported by Cloud VPN or Cloud Interconnect, including the following:

For a comparison between the products, see Choosing a Network Connectivity product.

Disable default networks

When you create a new Google Cloud project, a default Google Cloud VPC network with auto mode IP addresses and pre-populated firewall rules is automatically provisioned. For production deployments, we recommend that you delete the default networks in existing projects, and disable the creation of default networks in new projects.

Virtual Private Cloud networks let you use any internal IP address. To avoid IP address conflicts, we recommend that you first plan your network and IP address allocation across your connected deployments and across your projects. A project allows multiple VPC networks, but it's usually a best practice to limit these networks to one per project in order to enforce access control effectively.
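
When you plan IP address allocation, a quick overlap check can catch conflicts before you connect networks. The following sketch uses Python's standard ipaddress module. The network names and CIDR ranges are made up for illustration.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(ranges: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of named CIDR ranges that overlap each other."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical allocation plan across cloud subnets and an on-premises range.
planned = {
    "prod-us": "10.0.0.0/20",
    "prod-eu": "10.0.16.0/20",
    "on-prem": "10.0.8.0/21",
}
conflicts = find_overlaps(planned)
```

In this plan, the on-premises range falls inside the prod-us range, so the pair is reported as a conflict.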

Secure your perimeter

In Google Cloud, you can use various methods to segment and secure your cloud perimeter, including firewalls and VPC Service Controls.

Use Shared VPC to build a production deployment that gives you a single shared network and that isolates workloads into individual projects that can be managed by different teams. Shared VPC provides centralized deployment, management, and control of the network and network security resources across multiple projects. Shared VPC consists of host and service projects that perform the following functions:

  • A host project contains the networking and network security-related resources, such as VPC networks, subnets, firewall rules, and hybrid connectivity.
  • A service project attaches to a host project. It lets you isolate workloads and users at the project level by using Identity and Access Management (IAM), while it shares the networking resources from the centrally managed host project.

Define firewall policies and rules at the organization, folder, and VPC network level. You can configure firewall rules to permit or deny traffic to or from VM instances. For examples, see Global and regional network firewall policy examples and Hierarchical firewall policy examples. In addition to defining rules based on IP addresses, protocols, and ports, you can manage traffic and apply firewall rules based on the service account that's used by a VM instance or by using secure tags.

To control the movement of data in Google services and to set up context-based perimeter security, consider VPC Service Controls. VPC Service Controls provides an extra layer of security for Google Cloud services that's independent of IAM and firewall rules and policies. For example, VPC Service Controls lets you set up perimeters between confidential and non-confidential data so that you can apply controls that help prevent data exfiltration.

Use Google Cloud Armor security policies to allow, deny, or redirect requests to your external Application Load Balancer at the Google Cloud edge, as close as possible to the source of incoming traffic. These policies prevent unwelcome traffic from consuming resources or entering your network.

Use Secure Web Proxy to apply granular access policies to your egress web traffic and to monitor access to untrusted web services.

Inspect your network traffic

You can use Cloud Intrusion Detection System (Cloud IDS) and Packet Mirroring to help you ensure the security and compliance of workloads running in Compute Engine and Google Kubernetes Engine (GKE).

Use Cloud IDS to get visibility into the traffic moving into and out of your VPC networks. Cloud IDS creates a Google-managed peered network that has mirrored VMs. Palo Alto Networks threat protection technologies mirror and inspect the traffic. For more information, see Cloud IDS overview.

Packet Mirroring clones traffic of specified VM instances in your VPC network and forwards it for collection, retention, and examination. After you configure Packet Mirroring, you can use Cloud IDS or third-party tools to collect and inspect network traffic at scale. Inspecting network traffic in this way helps provide intrusion detection and application performance monitoring.

Use a web application firewall

For external web applications and services, you can enable Google Cloud Armor to provide distributed denial-of-service (DDoS) protection and web application firewall (WAF) capabilities. Google Cloud Armor supports Google Cloud workloads that are exposed using external HTTP(S) load balancing, TCP Proxy load balancing, or SSL Proxy load balancing.

Google Cloud Armor is offered in two service tiers, Standard and Managed Protection Plus. To take full advantage of advanced Google Cloud Armor capabilities, you should invest in Managed Protection Plus for your key workloads.

Automate infrastructure provisioning

Automation lets you create immutable infrastructure, which means that it can't be changed after provisioning. This measure gives your operations team a known good state, fast rollback, and troubleshooting capabilities. For automation, you can use tools such as Terraform, Jenkins, and Cloud Build.

To help you build an environment that uses automation, Google Cloud provides a series of security blueprints that are in turn built on the enterprise foundations blueprint. The enterprise foundations blueprint provides Google's opinionated design for a secure application environment and describes step by step how to configure and deploy your Google Cloud estate. Using the instructions and the scripts that are part of the enterprise foundations blueprint, you can configure an environment that meets our security best practices and guidelines. You can build on that blueprint with additional blueprints or design your own automation.

Monitor your network

Monitor your network and your traffic using telemetry.

VPC Flow Logs and Firewall Rules Logging provide near real-time visibility into the traffic and firewall usage in your Google Cloud environment. For example, Firewall Rules Logging logs traffic to and from Compute Engine VM instances. When you combine these tools with Cloud Logging and Cloud Monitoring, you can track, alert, and visualize traffic and access patterns to improve the operational security of your deployment.
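
After flow log records are exported, a small aggregation can show which source and destination pairs move the most data. The following sketch assumes that each record is a dictionary with a connection object (src_ip, dest_ip) and a bytes_sent value, which mirrors the flow log record layout as we understand it. The sample records are fabricated for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_talkers(records: Iterable[dict], n: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Sum bytes_sent per (source, destination) pair and return the largest n."""
    totals = Counter()
    for record in records:
        conn = record["connection"]
        # bytes_sent may arrive as a string in exported JSON, so convert it.
        totals[(conn["src_ip"], conn["dest_ip"])] += int(record["bytes_sent"])
    return totals.most_common(n)


# Fabricated sample records.
sample = [
    {"connection": {"src_ip": "10.0.0.5", "dest_ip": "10.0.1.9"}, "bytes_sent": "1200"},
    {"connection": {"src_ip": "10.0.0.5", "dest_ip": "10.0.1.9"}, "bytes_sent": "800"},
    {"connection": {"src_ip": "10.0.0.7", "dest_ip": "203.0.113.10"}, "bytes_sent": "500"},
]
```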

Firewall Insights lets you review which firewall rules matched incoming and outgoing connections and whether the connections were allowed or denied. The shadowed rules feature helps you tune your firewall configuration by showing you which rules are never triggered because another rule is always triggered first.
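
The idea behind shadowed rules can be sketched with a simplified model: a rule is shadowed when a rule with a higher priority (a lower number) matches every packet that the rule would match. The Rule class and the matching logic below are assumptions for illustration. They cover only source range, protocol, and ports for one traffic direction and one IP version, and they aren't how Firewall Insights is implemented.

```python
import ipaddress
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int                        # lower number is evaluated first
    source: str                          # source CIDR range
    protocol: str                        # "tcp", "udp", or "all"
    ports: FrozenSet[int] = frozenset()  # empty set means all ports


def covers(a: Rule, b: Rule) -> bool:
    """Return True if rule a matches every packet that rule b matches."""
    src_ok = ipaddress.ip_network(b.source).subnet_of(ipaddress.ip_network(a.source))
    proto_ok = a.protocol in ("all", b.protocol)
    ports_ok = not a.ports or (bool(b.ports) and b.ports <= a.ports)
    return src_ok and proto_ok and ports_ok


def shadowed(rules: List[Rule]) -> List[str]:
    """Return names of rules that a strictly higher-priority rule fully covers."""
    return [
        rule.name
        for rule in sorted(rules, key=lambda r: r.priority)
        if any(o.priority < rule.priority and covers(o, rule) for o in rules)
    ]


# Hypothetical rule set.
rules = [
    Rule("allow-internal", 1000, "10.0.0.0/8", "all"),
    Rule("allow-ssh-subnet", 2000, "10.1.0.0/16", "tcp", frozenset({22})),
    Rule("allow-web", 2000, "0.0.0.0/0", "tcp", frozenset({80, 443})),
]
```

Here allow-ssh-subnet never triggers because allow-internal already matches all traffic from its source range.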

Use Network Intelligence Center to see how your network topology and architecture are performing. You can get detailed insights into network performance and you can then optimize your deployment to eliminate any bottlenecks in your service. Connectivity Tests provide you with insights into the firewall rules and policies that are applied to the network path.

For more information about monitoring, see Implement logging and detective controls.

Implement data security

This document in the Google Cloud Architecture Framework provides best practices for implementing data security.

As part of your deployment architecture, you must consider what data you plan to process and store in Google Cloud, and the sensitivity of the data. Design your controls to help secure the data during its lifecycle, to identify data ownership and classification, and to help protect data from unauthorized use.

For a security blueprint that deploys a BigQuery data warehouse with the security best practices described in this document, see Secure a BigQuery data warehouse that stores confidential data.

Automatically classify your data

Perform data classification as early in the data management lifecycle as possible, ideally when the data is created. Usually, data classification efforts require only a few categories, such as the following:

  • Public: Data that has been approved for public access.
  • Internal: Non-sensitive data that isn't released to the public.
  • Confidential: Sensitive data that's available for general internal distribution.
  • Restricted: Highly sensitive or regulated data that requires restricted distribution.

Use Sensitive Data Protection to discover and classify data across your Google Cloud environment. Sensitive Data Protection has built-in support for scanning and classifying sensitive data in Cloud Storage, BigQuery, and Datastore. It also has a streaming API to support additional data sources and custom workloads.

Sensitive Data Protection can identify sensitive data using built-in infotypes. It can automatically classify, mask, tokenize, and transform sensitive elements (such as PII data) to let you manage the risk of collecting, storing, and using data. In other words, it can integrate with your data lifecycle processes to ensure that data in every stage is protected.
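
To make the idea of detecting and masking sensitive elements concrete, the following toy sketch uses regular expressions to find and replace two kinds of values. It's a conceptual stand-in only and doesn't use the Sensitive Data Protection API. The detector labels and patterns are hypothetical and far less accurate than built-in infotypes.

```python
import re
from typing import Dict, List, Pattern

# Hypothetical detectors; real infotype detection is far more robust.
DETECTORS: Dict[str, Pattern[str]] = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN_LIKE": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def find_types(text: str) -> List[str]:
    """Return the labels of all detectors that match the text."""
    return sorted(label for label, pattern in DETECTORS.items() if pattern.search(text))


def mask(text: str) -> str:
    """Replace each detected value with its detector label in brackets."""
    for label, pattern in DETECTORS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


masked = mask("Contact jane@example.com, SSN 123-45-6789")
```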

For more information, see De-identification and re-identification of PII in large-scale datasets using Sensitive Data Protection.

Manage data governance using metadata

Data governance is a combination of processes that ensure that data is secure, private, accurate, available, and usable. Although you are responsible for defining a data governance strategy for your organization, Google Cloud provides tools and technologies to help you put your strategy into practice. Google Cloud also provides a framework for data governance (PDF) in the cloud.

Use Data Catalog to find, curate, and use metadata to describe your data assets in the cloud. You can use Data Catalog to search for data assets, then tag the assets with metadata. To help accelerate your data classification efforts, integrate Data Catalog with Sensitive Data Protection to automatically identify confidential data. After data is tagged, you can use Google Identity and Access Management (IAM) to restrict which data users can query or use through Data Catalog views.

Use Dataproc Metastore or Hive metastore to manage metadata for workloads. Data Catalog has a Hive connector that allows the service to discover metadata that's inside a Hive metastore.

Use Dataprep by Trifacta to define and enforce data quality rules through a console. You can use Dataprep from within Cloud Data Fusion or use Dataprep as a standalone service.

Protect data according to its lifecycle phase and classification

After you define data within the context of its lifecycle and classify it based on its sensitivity and risk, you can assign the right security controls to protect it. You must ensure that your controls deliver adequate protections, meet compliance requirements, and reduce risk. As you move to the cloud, review your current strategy and where you might need to change your current processes.

The following table describes three characteristics of a data security strategy in the cloud.

Characteristic: Identification
Description: Understand the identity of users, resources, and applications as they create, modify, store, use, share, and delete data. Use Cloud Identity and IAM to control access to data. If your identities require certificates, consider Certificate Authority Service. For more information, see Manage identity and access.

Characteristic: Boundary and access
Description: Set up controls for how data is accessed, by whom, and under what circumstances. Access boundaries to data can be managed at these levels:

Characteristic: Visibility
Description: You can audit usage and create reports that demonstrate how data is controlled and accessed. Google Cloud Logging and Access Transparency provide insights into the activities of your own cloud administrators and Google personnel. For more information, see Monitor your data.

Encrypt your data

By default, Google Cloud encrypts customer data stored at rest, with no action required from you. In addition to default encryption, Google Cloud provides options for envelope encryption and encryption key management. For example, Compute Engine persistent disks are automatically encrypted, but you can supply or manage your own keys.

You must identify the solutions that best fit your requirements for key generation, storage, and rotation, whether you're choosing the keys for your storage, for compute, or for big data workloads.

Google Cloud includes the following options for encryption and key management:

  • Customer-managed encryption keys (CMEK). You can generate and manage your encryption keys using Cloud Key Management Service (Cloud KMS). Use this option if you have certain key management requirements, such as the need to rotate encryption keys regularly.
  • Customer-supplied encryption keys (CSEK). You can create and manage your own encryption keys, and then provide them to Google Cloud when necessary. Use this option if you generate your own keys using your on-premises key management system to bring your own key (BYOK). If you provide your own keys using CSEK, Google replicates them and makes them available to your workloads. However, the security and availability of CSEK is your responsibility because customer-supplied keys aren't stored in instance templates or in Google infrastructure. If you lose access to the keys, Google can't help you recover the encrypted data. Think carefully about which keys you want to create and manage yourself. You might use CSEK for only the most sensitive information. Another option is to perform client-side encryption on your data and then store the encrypted data in Google Cloud, where the data is encrypted again by Google.
  • Third-party key management system with Cloud External Key Manager (Cloud EKM). Cloud EKM protects your data at rest by using encryption keys that are stored and managed in a third-party key management system that you control outside of the Google infrastructure. When you use this method, you have high assurance that your data can't be accessed by anyone outside of your organization. Cloud EKM lets you achieve a secure hold-your-own-key (HYOK) model for key management. For compatibility information, see the Cloud EKM enabled services list.

Cloud KMS also lets you encrypt your data with either software-backed encryption keys or FIPS 140-2 Level 3 validated hardware security modules (HSMs). If you're using Cloud KMS, your cryptographic keys are stored in the region where you deploy the resource. Cloud HSM distributes your key management needs across regions, providing redundancy and global availability of keys.

For information on how envelope encryption works, see Encryption at rest in Google Cloud.

Control cloud administrators' access to your data

You can control access by Google support and engineering personnel to your environment on Google Cloud. Access Approval lets you explicitly approve before Google employees access your data or resources on Google Cloud. This product complements the visibility provided by Access Transparency, which generates logs when Google personnel interact with your data. These logs include the office location and the reason for the access.

Using these products together, you can deny Google the ability to decrypt your data for any reason.

Configure where your data is stored and where users can access it from

You can control the network locations from which users can access data by using VPC Service Controls. This product lets you limit access to users in a specific region. You can enforce this constraint even if the user is authorized according to your Google IAM policy. Using VPC Service Controls, you create a service perimeter that defines the virtual boundaries from which a service can be accessed. The perimeter also prevents data from being moved outside those boundaries.

Manage secrets using Secret Manager

Secret Manager lets you store all of your secrets in a centralized place. Secrets are configuration information such as database passwords, API keys, or TLS certificates. You can automatically rotate secrets, and you can configure applications to automatically use the latest version of a secret. Every interaction with Secret Manager generates an audit log, so you can view every access to every secret.
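As a minimal sketch of how an application might always read the latest version of a secret, the following Python helper builds the resource name for the `latest` alias and calls `access_secret_version` on a client object. It assumes the `google-cloud-secret-manager` client library. The project ID and secret ID used with it are hypothetical, and the client is passed in as a parameter so that the helper can also be exercised with a stand-in object.

```python
def latest_secret_name(project_id: str, secret_id: str) -> str:
    """Build the resource name that uses the `latest` version alias."""
    return f"projects/{project_id}/secrets/{secret_id}/versions/latest"


def read_latest_secret(client, project_id: str, secret_id: str) -> str:
    """Fetch the payload of the latest secret version as text.

    `client` is assumed to be a secretmanager.SecretManagerServiceClient
    (from the google-cloud-secret-manager library) or a compatible stand-in.
    """
    response = client.access_secret_version(
        request={"name": latest_secret_name(project_id, secret_id)}
    )
    # The payload is returned as bytes; this sketch assumes UTF-8 text.
    return response.payload.data.decode("utf-8")
```

Because the helper resolves the alias at call time, a rotated secret is picked up on the next read without a code change.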

Sensitive Data Protection also has a category of detectors to help you identify credentials and secrets in data that could be protected with Secret Manager.

Monitor your data

To view administrator activity and key use logs, use Cloud Audit Logs. To help secure your data, monitor logs using Cloud Monitoring to ensure proper use of your keys.

Cloud Logging captures Google Cloud events and lets you add additional sources if necessary. You can segment your logs by region, store them in buckets, and integrate custom code for processing logs. For an example, see Custom solution for automated log analysis.

You can also export logs to BigQuery to perform security and access analytics to help identify unauthorized changes and inappropriate access to your organization's data.
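The kind of access analysis described above can be sketched in plain Python over exported log entries. This is an illustration only, not a BigQuery query: it assumes entries shaped like Cloud Audit Logs records (`protoPayload.methodName` and `protoPayload.authenticationInfo.principalEmail`), assumes the method name `SetIamPolicy` (other services can log different method names), and uses a hypothetical list of approved administrators.

```python
from collections import Counter


def iam_changes_by_principal(entries, approved_admins):
    """Count SetIamPolicy calls made by principals outside an approved list."""
    counts = Counter()
    for entry in entries:
        payload = entry.get("protoPayload", {})
        # Assumption: IAM policy changes are logged with this method name.
        if payload.get("methodName") != "SetIamPolicy":
            continue
        principal = payload.get("authenticationInfo", {}).get(
            "principalEmail", "unknown"
        )
        if principal not in approved_admins:
            counts[principal] += 1
    return dict(counts)
```

A non-empty result is a starting point for investigation, not proof of misuse.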

Security Command Center can help you identify and resolve insecure-access problems to sensitive organizational data that's stored in the cloud. Through a single management interface, you can scan for a wide variety of security vulnerabilities and risks to your cloud infrastructure. For example, you can monitor for data exfiltration, scan storage systems for confidential data, and detect which Cloud Storage buckets are open to the internet.

Deploy applications securely

This document in the Google Cloud Architecture Framework provides best practices for deploying applications securely.

To deploy secure applications, you must have a well-defined software development lifecycle, with appropriate security checks during the design, development, testing, and deployment stages. When you design an application, we recommend a layered system architecture that uses standardized frameworks for identity, authorization, and access control.

Automate secure releases

Without automated tools, it can be hard to deploy, update, and patch complex application environments to meet consistent security requirements. Therefore, we recommend that you build a CI/CD pipeline for these tasks, which can solve many of these issues. Automated pipelines remove manual errors, provide standardized development feedback loops, and enable fast product iterations. For example, Cloud Build private pools let you deploy a highly secure, managed CI/CD pipeline for highly regulated industries, including finance and healthcare.

You can use automation to scan for security vulnerabilities when artifacts are created. You can also define policies for different environments (development, test, production, and so on) so that only verified artifacts are deployed.

Ensure that application deployments follow approved processes

If an attacker compromises your CI/CD pipeline, your entire stack can be affected. To help secure the pipeline, you should enforce an established approval process before you deploy the code into production.

If you plan to use Google Kubernetes Engine (GKE) or GKE Enterprise, you can establish these checks and balances by using Binary Authorization. Binary Authorization attaches configurable signatures to container images. These signatures (also called attestations) help to validate the image. At deployment, Binary Authorization uses these attestations to determine that a process was completed earlier. For example, you can use Binary Authorization to do the following:

  • Verify that a specific build system or continuous integration (CI) pipeline created a container image.
  • Validate that a container image is compliant with a vulnerability signing policy.
  • Verify that a container image passes criteria for promotion to the next deployment environment, such as from development to QA.
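The attestation checks in the preceding list can be sketched as a deploy-time gate. This is a conceptual model, not the Binary Authorization API: it uses HMAC from the Python standard library as a stand-in for attestor signatures (real attestations use asymmetric keys), and the attestor names and keys are hypothetical.

```python
import hashlib
import hmac


def sign(attestor_key: bytes, image_digest: str) -> str:
    """Create an attestation: a signature over the image digest."""
    return hmac.new(attestor_key, image_digest.encode(), hashlib.sha256).hexdigest()


def is_admitted(image_digest, attestations, required_attestors):
    """Admit the image only if every required attestor has a valid attestation.

    attestations: dict of attestor name -> signature presented with the image
    required_attestors: dict of attestor name -> verification key
    """
    for name, key in required_attestors.items():
        signature = attestations.get(name)
        if signature is None:
            return False  # A required step (for example, the CI build) is missing.
        if not hmac.compare_digest(signature, sign(key, image_digest)):
            return False  # The signature doesn't match this exact image digest.
    return True
```

Because each signature covers the image digest, an attestation created for one image can't be reused to admit a different image.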

Scan for known vulnerabilities before deployment

We recommend that you use automated tools that can continuously perform vulnerability scans on container images before the containers are deployed to production.

Use Artifact Analysis to automatically scan for vulnerabilities for containers that are stored in Artifact Registry. This process includes two tasks: scanning and continuous analysis.

To start, Artifact Analysis scans new images when they're uploaded to Artifact Registry. The scan extracts information about the system packages in the container.

Artifact Analysis then looks for vulnerabilities when you upload the image. After the initial scan, Artifact Analysis continuously monitors the metadata of scanned images in Artifact Registry for new vulnerabilities. When Artifact Analysis receives new and updated vulnerability information from vulnerability sources, it does the following:

  • Updates the metadata of the scanned images to keep them up to date.
  • Creates new vulnerability occurrences for new notes.
  • Deletes vulnerability occurrences that are no longer valid.
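The continuous analysis steps above can be pictured as a reconciliation between the packages found in an image and the current set of vulnerability notes. The following sketch is a simplified model, not the Artifact Analysis API; the data shapes and the note IDs used with it are hypothetical.

```python
def reconcile_occurrences(image_packages, occurrences, notes):
    """Recompute the vulnerability occurrences for one scanned image.

    image_packages: dict of package name -> installed version
    occurrences: set of note IDs currently recorded against the image
    notes: dict of note ID -> (package name, set of affected versions)
    Returns (current, created, deleted) sets of note IDs.
    """
    current = set()
    for note_id, (package, affected_versions) in notes.items():
        # A note applies only if the image contains an affected version.
        if image_packages.get(package) in affected_versions:
            current.add(note_id)
    created = current - occurrences   # New occurrences for new notes.
    deleted = occurrences - current   # Occurrences that are no longer valid.
    return current, created, deleted
```

Running the reconciliation each time the notes change keeps the image metadata current without rescanning the image itself.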

Monitor your application code for known vulnerabilities

It's a best practice to use automated tools that can constantly monitor your application code for known vulnerabilities such as the OWASP Top 10. For a description of Google Cloud products and features that support OWASP Top 10 mitigation techniques, see OWASP Top 10 mitigation options on Google Cloud.

Use Web Security Scanner to help identify security vulnerabilities in your App Engine, Compute Engine, and Google Kubernetes Engine web applications. The scanner crawls your application, following all links within the scope of your starting URLs, and attempts to exercise as many user inputs and event handlers as possible. It can automatically scan for and detect common vulnerabilities, including cross-site scripting (XSS), Flash injection, mixed content (HTTP in HTTPS), and outdated or insecure libraries. Web Security Scanner gives you early identification of these types of vulnerabilities with low false positive rates.

Control movement of data across perimeters

To control the movement of data across a perimeter, you can configure security perimeters around the resources of your Google-managed services. Use VPC Service Controls to place all components and services in your CI/CD pipeline (for example, Artifact Registry, Artifact Analysis, and Binary Authorization) inside a security perimeter.

VPC Service Controls improves your ability to mitigate the risk of unauthorized copying or transfer of data (data exfiltration) from Google-managed services. With VPC Service Controls, you configure security perimeters around the resources of your Google-managed services to control the movement of data across the perimeter boundary. When a service perimeter is enforced, requests that violate the perimeter policy are denied, such as requests that are made to protected services from outside a perimeter. When a service is protected by an enforced perimeter, VPC Service Controls ensures the following:

  • A service can't transmit data out of the perimeter. Protected services function as normal inside the perimeter, but can't send resources and data out of the perimeter. This restriction helps prevent malicious insiders who might have access to projects in the perimeter from exfiltrating data.
  • Requests that come from outside the perimeter to the protected service are honored only if the requests meet the criteria of access levels that are assigned to the perimeter.
  • A service can be made accessible to projects in other perimeters using perimeter bridges.
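To make the three guarantees above concrete, the following sketch models a perimeter decision in plain Python. It is a conceptual illustration, not the VPC Service Controls API: the `Perimeter` class, the use of CIDR ranges as a simplified access level, and the project names are all assumptions.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Perimeter:
    projects: set
    allowed_cidrs: tuple = ()  # Simplified stand-in for an access level.
    bridged_projects: set = field(default_factory=set)


def is_request_allowed(perimeter, source_project, source_ip, target_project):
    """Decide whether a request between two projects is permitted."""
    source_inside = source_project in perimeter.projects
    target_inside = target_project in perimeter.projects
    if not target_inside:
        # Data leaving the perimeter is allowed only across a bridge.
        # Requests that never touch the perimeter aren't restricted by it.
        return (not source_inside) or target_project in perimeter.bridged_projects
    if source_inside or source_project in perimeter.bridged_projects:
        return True
    # Requests from outside must satisfy the access level.
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr)
               for cidr in perimeter.allowed_cidrs)
```

For example, with a perimeter around hypothetical `ci` and `registry` projects, a request from `ci` to an external project is denied, while a request from an approved IP range to `registry` is allowed.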

Encrypt your container images

In Google Cloud, you can encrypt your container images using customer-managed encryption keys (CMEK). CMEK keys are managed in Cloud Key Management Service (Cloud KMS). When you use CMEK, you can temporarily or permanently disable access to an encrypted container image by disabling or destroying the key.

Manage compliance obligations

This document in the Google Cloud Architecture Framework provides best practices for managing compliance obligations.

Your cloud regulatory requirements depend on a combination of factors, including the following:

  • The laws and regulations that apply to your organization's physical locations.
  • The laws and regulations that apply to your customers' physical locations.
  • Your industry's regulatory requirements.

These requirements shape many of the decisions that you need to make about which security controls to enable for your workloads in Google Cloud.

A typical compliance journey goes through three stages: assessment, gap remediation, and continual monitoring. This section addresses the best practices that you can use during each stage.

Assess your compliance needs

Compliance assessment starts with a thorough review of all of your regulatory obligations and how your business is implementing them. To help you with your assessment of Google Cloud services, use the Compliance resource center. This site provides you with details on the following:

  • Service support for various regulations
  • Google Cloud certifications and attestations

You can ask for an engagement with a Google compliance specialist to better understand the compliance lifecycle at Google and how your requirements can be met.

For more information, see Assuring compliance in the cloud (PDF).

Deploy Assured Workloads

Assured Workloads is the Google Cloud tool that builds on the controls within Google Cloud to help you meet your compliance obligations. Assured Workloads lets you do the following:

  • Select your compliance regime. The tool then automatically sets the baseline personnel access controls.
  • Set the location for your data using organization policies so that your data at rest and your resources remain only in that region.
  • Select the key management option (such as the key rotation period) that best fits your security and compliance requirements.
  • For certain regulatory requirements such as FedRAMP Moderate, select the criteria for access by Google support personnel (for example, whether they have completed appropriate background checks).
  • Use Google-owned and Google-managed encryption keys that are FIPS-140-2 compliant and support FedRAMP Moderate compliance. For an added layer of control and separation of duties, you can use customer-managed encryption keys (CMEK). For more information about keys, see Encrypt your data.

Review blueprints for templates and best practices that apply to your compliance regime

Google has published blueprints and solutions guides that describe best practices and that provide Terraform modules to let you provision and configure an environment that helps you achieve compliance. The following table lists a selection of blueprints that address security and alignment with compliance requirements.

Standard    Description
PCI
FedRAMP
HIPAA

Monitor your compliance

Most regulations require you to monitor particular activities, including access controls. To help with your monitoring, you can use the following:

  • Access Transparency, which provides near real-time logs when Google Cloud admins access your content.
  • Firewall Rules Logging to record TCP and UDP connections inside a VPC network for any rules that you create yourself. These logs can be useful for auditing network access or for providing early warning that the network is being used in an unapproved manner.
  • VPC Flow Logs to record network traffic flows that are sent or received by VM instances.
  • Security Command Center Premium to monitor for compliance with various standards.
  • OSSEC (or another open source tool) to log the activity of individuals who have administrator access to your environment.
  • Key Access Justifications to view the reasons for a key access request.

Automate your compliance

To help you remain in compliance with changing regulations, determine if there are ways that you can automate your security policies by incorporating them into your infrastructure as code deployments. For example, consider the following:

  • Use security blueprints to build your security policies into your infrastructure deployments.

  • Configure Security Command Center to alert when non-compliance issues occur. For example, monitor for issues such as users disabling two-step verification or over-privileged service accounts. For more information, see Setting up finding notifications.

  • Set up automatic remediation to particular notifications.

For more information about compliance automation, see the Risk and Compliance as Code (RCaC) solution.
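As a rough sketch of automatic remediation, the following Python function routes a finding notification to a handler that is registered for the finding's category, and escalates anything that has no handler. The `category` and `resourceName` fields follow the shape of Security Command Center findings; the handler functions and the category-to-handler mapping are hypothetical and would need to match your own policies.

```python
def remediate(finding, handlers, log):
    """Dispatch a finding to the handler registered for its category.

    finding: dict with "category" and "resourceName" keys
    handlers: dict of category -> callable that takes the resource name
    log: list that receives a record of what was done
    Returns True if a handler ran, False if the finding needs a human.
    """
    category = finding["category"]
    resource = finding["resourceName"]
    handler = handlers.get(category)
    if handler is None:
        log.append(f"escalated {category} on {resource}")
        return False
    handler(resource)
    log.append(f"remediated {category} on {resource}")
    return True
```

Keeping the mapping explicit means that only the issue types you have reviewed are fixed automatically; everything else stays visible for manual follow-up.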

Implement data residency and sovereignty requirements

This document in the Google Cloud Architecture Framework provides best practices for implementing data residency and sovereignty requirements.

Data residency and sovereignty requirements are based on your regional and industry-specific regulations, and different organizations might have different data sovereignty requirements. For example, you might have the following requirements:

  • Control over all access to your data by Google Cloud, including what type of personnel can access the data and from which region they can access it.
  • Inspectability of changes to cloud infrastructure and services, which can have an impact on access to your data or the security of your data. Insight into these types of changes helps ensure that Google Cloud is unable to circumvent controls or move your data out of the region.
  • Survivability of your workloads for an extended time when you are unable to receive software updates from Google Cloud.

Manage your data sovereignty

Data sovereignty provides you with a mechanism to prevent Google from accessing your data. You approve access only for provider behaviors that you agree are necessary.

For example, you can manage your data sovereignty in the following ways:

Manage your operational sovereignty

Operational sovereignty provides you with assurances that Google personnel can't compromise your workloads.

For example, you can manage operational sovereignty in the following ways:

Manage software sovereignty

Software sovereignty provides you with assurances that you can control the availability of your workloads and run them wherever you want, without depending on (or being locked in to) a single cloud provider. Software sovereignty includes the ability to survive events that require you to quickly change where your workloads are deployed and what level of outside connection is allowed.

For example, Google Cloud supports hybrid and multicloud deployments. In addition, GKE Enterprise lets you manage and deploy your applications in both cloud environments and on-premises environments.

Control data residency

Data residency describes where your data is stored at rest. Data residency requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and even culture.

Controlling data residency starts with the following:

  • Understanding the type of your data and its location.
  • Determining what risks exist to your data, and what laws and regulations apply.
  • Controlling where data is or where it goes.

To help comply with data residency requirements, Google Cloud lets you control where your data is stored, how it is accessed, and how it's processed. You can use resource location policies to restrict where resources are created and to limit where data is replicated between regions. You can use the location property of a resource to identify where the service deploys and who maintains it.
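Alongside a resource location policy, you might audit an inventory of existing resources against the locations that you allow. The following sketch assumes a hypothetical inventory of dictionaries with `name` and `location` keys, and it assumes that zone names are formed by appending a single-letter suffix to the region name (for example, `europe-west1-b`).

```python
import re

# Assumption: a zone is a region name followed by "-" and one letter.
_ZONE = re.compile(r"^(?P<region>[a-z]+-[a-z]+\d+)-[a-z]$")


def region_of(location: str) -> str:
    """Map a zone to its region; leave regions and multi-regions unchanged."""
    normalized = location.lower()
    match = _ZONE.match(normalized)
    return match.group("region") if match else normalized


def find_location_violations(resources, allowed_locations):
    """Return the names of resources that are outside the allowed locations."""
    allowed = {location.lower() for location in allowed_locations}
    return [
        resource["name"]
        for resource in resources
        if region_of(resource["location"]) not in allowed
    ]
```

An audit like this complements the policy: the policy restricts where new resources are created, and the audit surfaces resources that predate it.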

For supportability information, see Resource locations supported services.

Implement privacy requirements

This document in the Google Cloud Architecture Framework provides best practices for implementing privacy requirements.

Privacy regulations help define how you can obtain, process, store, and manage your users' data. Many privacy controls (for example, controls for cookies, session management, and obtaining user permission) are your responsibility because you own your data (including the data that you receive from your users).

Google Cloud includes the following controls that promote privacy:

  • Default encryption of all data when it's at rest, when it's in transit, and while it's being processed.
  • Safeguards against insider access.
  • Support for numerous privacy regulations.

For more information, see Google Cloud Privacy Commitments.

Classify your confidential data

You must define what data is confidential and then ensure that the confidential data is properly protected. Confidential data can include credit card numbers, addresses, phone numbers, and other personally identifiable information (PII).

Using Sensitive Data Protection, you can set up appropriate classifications. You can then tag and tokenize your data before you store it in Google Cloud. For more information, see Automatically classify your data.
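To illustrate the classify-then-tokenize flow, the following sketch uses two simplified regular expressions and keyed hashing from the Python standard library. It is a stand-in for the idea only, not the Sensitive Data Protection API: the detector patterns are deliberately naive, the detector names are borrowed for readability, and the key handling is hypothetical (in practice, the key would come from a managed key store).

```python
import hashlib
import hmac
import re

# Simplified detectors; real classification needs far more robust matching.
DETECTORS = {
    "CREDIT_CARD_NUMBER": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def tokenize(value: str, key: bytes) -> str:
    """Replace a value with a deterministic, keyed, non-reversible token."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"


def classify_and_tokenize(text: str, key: bytes):
    """Return the text with detected values tokenized, plus the types found."""
    found = []

    def replacer(kind):
        def replace(match):
            found.append(kind)
            return tokenize(match.group(0), key)
        return replace

    for kind, pattern in DETECTORS.items():
        text = pattern.sub(replacer(kind), text)
    return text, found
```

Because the token is deterministic for a given key, the same input always maps to the same token, so tokenized records can still be joined or counted without exposing the original value.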

Lock down access to sensitive data

Place sensitive data in its own service perimeter using VPC Service Controls, and set Google Identity and Access Management (IAM) access controls for that data. Configure multi-factor authentication (MFA) for all users who require access to sensitive data.

For more information, see Control movement of data across perimeters and Set up SSO and MFA.

Monitor for phishing attacks

Ensure that your email system is configured to protect against phishing attacks, which are often used for fraud and malware attacks.

If your organization uses Gmail, you can use advanced phishing and malware protection. This collection of settings provides controls to quarantine emails, defends against anomalous attachment types, and helps protect against inbound spoofing emails. Security Sandbox detects malware in attachments. Gmail is continually and automatically updated with the latest security improvements and protections to help keep your organization's email safe.

Extend zero trust security to your hybrid workforce

A zero trust security model means that no one is trusted implicitly, whether they are inside or outside of your organization's network. When your IAM systems verify access requests, a zero trust security posture means that the user's identity and context (for example, their IP address or location) are considered. Unlike a VPN, zero trust security shifts access controls from the network perimeter to users and their devices. Zero trust security allows users to work more securely from any location. For example, users can access your organization's resources from their laptops or mobile devices while at home.
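The shift from network-based trust to identity and context can be sketched as an access decision that never treats network location as a reason to allow a request. This is a conceptual model, not the Identity-Aware Proxy or Chrome Enterprise Premium API; the `AccessRequest` fields and the policy inputs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AccessRequest:
    user: str
    device_managed: bool
    device_encrypted: bool
    country: str
    on_corporate_network: bool  # Recorded, but deliberately not trusted.


def evaluate(request, authorized_users, allowed_countries):
    """Return (allowed, reason) based on identity and context only."""
    if request.user not in authorized_users:
        return False, "identity not authorized"
    if not (request.device_managed and request.device_encrypted):
        return False, "device posture check failed"
    if request.country not in allowed_countries:
        return False, "location not permitted"
    return True, "identity and context verified"
```

In this model, an authorized user on a managed laptop at home is allowed, while the same user on an unmanaged device is denied even when connected to the corporate network.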

On Google Cloud, you can configure Chrome Enterprise Premium and Identity-Aware Proxy (IAP) to enable zero trust for your Google Cloud resources. If your users use Google Chrome and you enable Chrome Enterprise Premium, you can integrate zero-trust security into your users' browsers.

Implement logging and detective controls

This document in the Google Cloud Architecture Framework provides best practices for implementing logging and detective controls.

Detective controls use telemetry to detect misconfigurations, vulnerabilities, and potentially malicious activity in a cloud environment. Google Cloud lets you create tailored monitoring and detective controls for your environment. This section describes these additional features and recommendations for their use.

Monitor network performance

Network Intelligence Center gives you visibility into how your network topology and architecture are performing. You can get detailed insights into network performance and then use that information to optimize your deployment by eliminating bottlenecks on your services. Connectivity Tests provides you with insights into the firewall rules and policies that are applied to the network path.

Monitor and prevent data exfiltration

Data exfiltration is a key concern for organizations. Typically, it occurs when an authorized person extracts data from a secured system and then shares that data with an unauthorized party or moves it to an insecure system.

Google Cloud provides several features and tools that help you detect and prevent data exfiltration. For more information, see Preventing data exfiltration.

Centralize your monitoring

Security Command Center provides visibility into the resources that you have in Google Cloud and into their security state. Security Command Center helps you prevent, detect, and respond to threats. It provides a centralized dashboard that you can use to help identify security misconfigurations in virtual machines, in networks, in applications, and in storage buckets. You can address these issues before they result in business damage or loss. The built-in capabilities of Security Command Center can reveal suspicious activity in your Cloud Logging security logs or indicate compromised virtual machines.

You can respond to threats by following actionable recommendations or by exporting logs to your SIEM system for further investigation. For information about using a SIEM system with Google Cloud, see Security log analytics in Google Cloud.

Security Command Center also provides multiple detectors that help you analyze the security of your infrastructure. These detectors include the following:

Other Google Cloud services, such as Google Cloud Armor logs, also provide findings for display in Security Command Center.

Enable the services that you need for your workloads, and then only monitor and analyze important data. For more information about enabling logging on services, see the enable logs section in Security log analytics in Google Cloud.

Monitor for threats

Event Threat Detection is an optional managed service of Security Command Center Premium that detects threats in your log stream. By using Event Threat Detection, you can detect high-risk and costly threats such as malware, cryptomining, unauthorized access to Google Cloud resources, DDoS attacks, and brute-force SSH attacks. Using the tool's features to distill volumes of log data, your security teams can quickly identify high-risk incidents and focus on remediation.

To help detect potentially compromised user accounts in your organization, use the Sensitive Actions Cloud Platform logs to identify when sensitive actions are taken and to confirm that valid users took those actions for valid purposes. A sensitive action is an action, such as the addition of a highly privileged role, that could be damaging to your business if a malicious actor took the action. Use Cloud Logging to view, monitor, and query the Sensitive Actions Cloud Platform logs. You can also view the sensitive action log entries with the Sensitive Actions Service, a built-in service of Security Command Center Premium.

Google Security Operations can store and analyze all of your security data centrally. To help you see the entire span of an attack, Google SecOps can map logs into a common model, enrich them, and then link them together into timelines. Furthermore, you can use Google SecOps to create detection rules, set up indicators of compromise (IoC) matching, and perform threat-hunting activities. You write your detection rules in the YARA-L language. For sample threat detection rules in YARA-L, see the Community Security Analytics (CSA) repository. In addition to writing your own rules, you can take advantage of curated detections in Google SecOps. These curated detections are a set of predefined and managed YARA-L rules that can help you identify threats.

Another option for centralizing your logs for security analysis, audit, and investigation is to use BigQuery. In BigQuery, you can monitor common threats or misconfigurations by using SQL queries (such as those in the CSA repository) to analyze permission changes, provisioning activity, workload usage, data access, and network activity. For more information about security log analytics in BigQuery from setup through analysis, see Security log analytics in Google Cloud.

The following diagram shows how to centralize your monitoring by using both the built-in threat detection capabilities of Security Command Center and the threat detection that you do in BigQuery, Google Security Operations, or a third-party SIEM.

How the various security analytics tools and content interact in Google Cloud.

As shown in the diagram, there are a variety of security data sources that you should monitor. These data sources include logs from Cloud Logging, asset changes from Cloud Asset Inventory, Google Workspace logs, or events from a hypervisor or a guest kernel. The diagram shows that you can use Security Command Center to monitor these data sources. This monitoring occurs automatically provided that you've enabled the appropriate features and threat detectors in Security Command Center. The diagram shows that you can also monitor for threats by exporting security data and Security Command Center findings to an analytics tool such as BigQuery, Google Security Operations, or a third-party SIEM. In your analytics tool, the diagram shows that you can perform further analysis and investigation by using and extending queries and rules like those available in CSA.

Google Cloud Architecture Framework: Reliability

The reliability pillar in the Google Cloud Architecture Framework provides principles and recommendations to help you design, deploy, and manage reliable workloads in Google Cloud.

This document is intended for cloud architects, developers, platform engineers, administrators, and site reliability engineers.

Reliability is a system's ability to consistently perform its intended functions within the defined conditions and maintain uninterrupted service. Best practices for reliability include redundancy, fault-tolerant design, monitoring, and automated recovery processes.

As a part of reliability, resilience is the system's ability to withstand and recover from failures or unexpected disruptions, while maintaining performance. Google Cloud features, like multi-regional deployments, automated backups, and disaster recovery solutions, can help you improve your system's resilience.

Reliability is important to your cloud strategy for many reasons, including the following:

  • Minimal downtime: Downtime can lead to lost revenue, decreased productivity, and damage to reputation. Resilient architectures can help ensure that systems can continue to function during failures or recover efficiently from failures.
  • Enhanced user experience: Users expect seamless interactions with technology. Resilient systems can help maintain consistent performance and availability, and they provide reliable service even during high demand or unexpected issues.
  • Data integrity: Failures can cause data loss or data corruption. Resilient systems implement mechanisms such as backups, redundancy, and replication to protect data and ensure that it remains accurate and accessible.
  • Business continuity: Your business relies on technology for critical operations. Resilient architectures can help ensure continuity after a catastrophic failure, which enables business functions to continue without significant interruptions and supports a swift recovery.
  • Compliance: Many industries have regulatory requirements for system availability and data protection. Resilient architectures can help you to meet these standards by ensuring systems remain operational and secure.
  • Lower long-term costs: Resilient architectures require upfront investment, but resiliency can help to reduce costs over time by preventing expensive downtime, avoiding reactive fixes, and enabling more efficient resource use.

Organizational mindset

To make your systems reliable, you need a plan and an established strategy. This strategy must include education and the authority to prioritize reliability alongside other initiatives.

Set a clear expectation that the entire organization is responsible for reliability, including development, product management, operations, platform engineering, and site reliability engineering (SRE). Even the business-focused groups, like marketing and sales, can influence reliability.

Every team must understand the reliability targets and risks of their applications. The teams must be accountable to these requirements. Conflicts between reliability and regular product feature development must be prioritized and escalated accordingly.

Plan and manage reliability holistically, across all your functions and teams. Consider setting up a Cloud Center of Excellence (CCoE) that includes a reliability pillar. For more information, see Optimize your organization's cloud journey with a Cloud Center of Excellence.

Focus areas for reliability

The activities that you perform to design, deploy, and manage a reliable system can be categorized in the following focus areas. Each of the reliability principles and recommendations in this pillar is relevant to one of these focus areas.

  • Scoping: To understand your system, conduct a detailed analysis of its architecture. You need to understand the components, how they work and interact, how data and actions flow through the system, and what could go wrong. Identify potential failures, bottlenecks, and risks, which helps you to take actions to mitigate those issues.
  • Observation: To help prevent system failures, implement comprehensive and continuous observation and monitoring. Through this observation, you can understand trends and identify potential problems proactively.
  • Response: Even with planning and controls, failures can still occur. To reduce the impact of failures, respond appropriately and recover efficiently. Automated responses can also help reduce the impact of failures.
  • Learning: To help prevent failures from recurring, learn from each experience, and take appropriate actions.

Core principles

The recommendations in the reliability pillar of the Architecture Framework are mapped to the following core principles:

Define reliability based on user-experience goals

This principle in the reliability pillar of the Google Cloud Architecture Framework helps you to assess your users' experience, and then map the findings to reliability goals and metrics.

This principle is relevant to the scoping focus area of reliability.

Principle overview

Observability tools provide large amounts of data, but not all of the data directly relates to the impacts on the users. For example, you might observe high CPU usage, slow server operations, or even crashed tasks. However, if these issues don't affect the user experience, then they don't constitute an outage.

To measure the user experience, you need to distinguish between internal system behavior and user-facing problems. Focus on metrics like the success ratio of user requests. Don't rely solely on server-centric metrics, like CPU usage, which can lead to misleading conclusions about your service's reliability. True reliability means that users can consistently and effectively use your application or service.

Recommendations

To help you measure user experience effectively, consider the recommendations in the following sections.

Measure user experience

To truly understand your service's reliability, prioritize metrics that reflect your users' actual experience. For example, measure the users' query success ratio, application latency, and error rates.

Ideally, collect this data directly from the user's device or browser. If this direct data collection isn't feasible, shift your measurement point progressively further away from the user in the system. For example, you can use the load balancer or frontend service as the measurement point. This approach helps you identify and address issues before those issues can significantly impact your users.
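
As a minimal sketch of the metrics described in this section, the following Python example computes a request success ratio and a latency percentile from a list of request records. The `Request` record, its field names, and the rule that any status below 500 counts as a success are assumptions made for illustration; they aren't part of the Architecture Framework or of any Google Cloud API.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One user request as seen at the measurement point (hypothetical record)."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # Time the user waited for the response


def success_ratio(requests):
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # No traffic means that nothing failed.
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_percentile(requests, percentile):
    """Nearest-rank percentile of request latency, in milliseconds."""
    if not requests:
        raise ValueError("at least one request is required")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(503, 2000.0),
    Request(200, 110.0),
]
print(success_ratio(sample))           # 0.75
print(latency_percentile(sample, 50))  # 110.0
```

The same two functions can be applied to data from whichever measurement point is available, such as the client, the load balancer, or the frontend service.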

Analyze user journeys

To understand how users interact with your system, you can use tracing tools like Cloud Trace. By following a user's journey through your application, you can find bottlenecks and latency issues that might degrade the user's experience. Cloud Trace captures detailed performance data for each hop in your service architecture. This data helps you identify and address performance issues more efficiently, which can lead to a more reliable and satisfying user experience.

Set realistic targets for reliability

This principle in the reliability pillar of the Google Cloud Architecture Framework helps you define reliability goals that are technically feasible for your workloads in Google Cloud.

This principle is relevant to the scoping focus area of reliability.

Principle overview

Design your systems to be just reliable enough for user happiness. It might seem counterintuitive, but a goal of 100% reliability is often not the most effective strategy. Higher reliability might result in a significantly higher cost, both in terms of financial investment and potential limitations on innovation. If users are already happy with the current level of service, then efforts to further increase happiness might yield a low return on investment. Instead, you can better spend resources elsewhere.

You need to determine the level of reliability at which your users are happy, and determine the point where the cost of incremental improvements begins to outweigh the benefits. When you determine this level of sufficient reliability, you can allocate resources strategically and focus on features and improvements that deliver greater value to your users.

Recommendations

To set realistic reliability targets, consider the recommendations in the following subsections.

Accept some failure and prioritize components

Aim for high availability such as 99.99% uptime, but don't set a target of 100% uptime. Acknowledge that some failures are inevitable.

The gap between 100% uptime and a 99.99% target is the allowance for failure. This gap is often called the error budget. The error budget gives you room to take risks and innovate, which every business needs to do to stay competitive.
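
To make the error budget concrete, the following sketch converts an availability target into minutes of allowed downtime. The 30-day measurement window and the function names are assumptions chosen for illustration; use whatever window your own targets are defined over.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability target."""
    if not 0 < slo <= 1:
        raise ValueError("slo must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1 - downtime_minutes / budget


print(round(error_budget_minutes(0.9999), 2))    # 4.32
print(round(error_budget_minutes(0.999), 2))     # 43.2
print(round(budget_remaining(0.999, 10.8), 2))   # 0.75
```

Under these assumptions, a 99.99% target allows about 4.32 minutes of downtime in 30 days, and a 100% target allows none, which is why it leaves no room for change.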

Prioritize the reliability of the most critical components in the system. Accept that less critical components can have a higher tolerance for failure.

Balance reliability and cost

To determine the optimal reliability level for your system, conduct thorough cost-benefit analyses.

Consider factors like system requirements, the consequences of failures, and your organization's risk tolerance for the specific application. Remember to consider your disaster recovery metrics, such as the recovery time objective (RTO) and recovery point objective (RPO). Decide what level of reliability is acceptable within the budget and other constraints.

Look for ways to improve efficiency and reduce costs without compromising essential reliability features.

Build highly available systems through resource redundancy

This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to plan, build, and manage resource redundancy, which can help you to avoid failures.

This principle is relevant to the scoping focus area of reliability.

Principle overview

After you decide the level of reliability that you need, you must design your systems to avoid any single points of failure. Every critical component in the system must be replicated across multiple machines, zones, and regions. For example, a critical database can't be located in only one region, and a metadata server can't be deployed in only one single zone or region. In those examples, if the sole zone or region has an outage, the system has a global outage.

Recommendations

To build redundant systems, consider the recommendations in the following subsections.

Identify failure domains and replicate services

Map out your system's failure domains, from individual VMs to regions, and design for redundancy across the failure domains.

To ensure high availability, distribute and replicate your services and applications across multiple zones and regions. Configure the system for automatic failover to make sure that the services and applications continue to be available in the event of zone or region outages.

For examples of multi-zone and multi-region architectures, see Design reliable infrastructure for your workloads in Google Cloud.
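
The effect of replication on availability can be estimated with simple probability. The following sketch assumes that replica failures are statistically independent, which real zones and regions only approximate, and it uses a placeholder availability of 99% per replica. Both the figures and the function names are illustrative assumptions, not published values.

```python
def combined_availability(replica_availability, replicas):
    """Availability of a service that stays up if at least one replica is up.

    Assumes that replica failures are statistically independent.
    """
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1 - (1 - replica_availability) ** replicas


def serial_availability(*component_availabilities):
    """Availability of a request path that needs every component to be up."""
    result = 1.0
    for availability in component_availabilities:
        result *= availability
    return result


print(round(combined_availability(0.99, 1), 6))  # 0.99
print(round(combined_availability(0.99, 3), 6))  # 0.999999
print(round(serial_availability(0.99, 0.99), 4)) # 0.9801
```

The last line shows why a single unreplicated dependency matters: components in series multiply their availabilities, so the weakest unreplicated component bounds the whole path.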

Detect and address issues promptly

Continuously track the status of your failure domains to detect and address issues promptly.

You can monitor the current status of Google Cloud services in all regions by using the Google Cloud Service Health dashboard. You can also view incidents relevant to your project by using Personalized Service Health. You can use load balancers to detect resource health and automatically route traffic to healthy backends. For more information, see Health checks overview.

Test failover scenarios

Like a fire drill, regularly simulate failures to validate the effectiveness of your replication and failover strategies.

For more information, see Simulate a zone outage for a regional MIG and Simulate a zone failure in GKE regional clusters.

Take advantage of horizontal scalability

This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you use horizontal scalability. By using horizontal scalability, you can help ensure that your workloads in Google Cloud can scale efficiently and maintain performance.

This principle is relevant to the scoping focus area of reliability.

Principle overview

Re-architect your system to a horizontal architecture. To accommodate growth in traffic or data, you can add more resources. You can also remove resources when they're not in use.

To understand the value of horizontal scaling, consider the limitations of vertical scaling.

A common scenario for vertical scaling is to use a MySQL database as the primary database with critical data. As database usage increases, more RAM and CPU are required. Eventually, the database reaches the memory limit on the host machine, and needs to be upgraded. This process might need to be repeated several times. The problem is that there are hard limits on how much a database can grow. VM sizes are not unlimited. The database can reach a point when it's no longer possible to add more resources.

Even if resources were unlimited, a large VM can become a single point of failure. Any problem with the primary database VM can cause error responses or cause a system-wide outage that affects all users. Avoid single points of failure, as described in Build highly available systems through resource redundancy.

Besides these scaling limits, vertical scaling tends to be more expensive. The cost can increase exponentially as machines with greater amounts of compute power and memory are acquired.

Horizontal scaling, by contrast, can cost less. The potential for horizontal scaling is virtually unlimited in a system that's designed to scale.

Recommendations

To transition from a single VM architecture to a horizontal multiple-machine architecture, you need to plan carefully and use the right tools. To help you achieve horizontal scaling, consider the recommendations in the following subsections.

Use managed services

Managed services remove the need to manually manage horizontal scaling. For example, with Compute Engine managed instance groups (MIGs), you can add or remove VMs to scale your application horizontally. For containerized applications, Cloud Run is a serverless platform that can automatically scale your stateless containers based on incoming traffic.

Promote modular design

Modular components and clear interfaces help you scale individual components as needed, instead of scaling the entire application. For more information, see Promote modular design in the performance optimization pillar.

Implement a stateless design

Design applications to be stateless, meaning no locally stored data. This lets you add or remove instances without worrying about data consistency.
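
The following sketch illustrates the idea: the request handler keeps no per-user data in its own memory, so any instance can serve any request. `SharedStore` is a hypothetical in-memory stand-in for an external store that all instances share, and `CartService` is an invented example service. Neither is a Google Cloud API.

```python
class SharedStore:
    """Stand-in for an external store shared by all instances (hypothetical)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def set(self, key, value):
        self._data[key] = value


class CartService:
    """A request handler that keeps no per-user data in its own memory."""

    def __init__(self, store):
        self._store = store  # The only state lives in the shared store.

    def add_item(self, user_id, item):
        cart = list(self._store.get(user_id, []))
        cart.append(item)
        self._store.set(user_id, cart)
        return cart


store = SharedStore()
instance_a = CartService(store)
instance_b = CartService(store)  # For example, an instance added during scale-out

instance_a.add_item("user-1", "book")
print(instance_b.add_item("user-1", "pen"))  # ['book', 'pen']
```

Because the second instance sees the item that the first one added, either instance can be removed without losing the cart. In a real deployment, the read-modify-write in `add_item` would also need an atomic operation or a transaction in the external store.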

Detect potential failures by using observability

This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you proactively identify areas where errors and failures might occur.

This principle is relevant to the observation focus area of reliability.

Principle overview

To maintain and improve the reliability of your workloads in Google Cloud, you need to implement effective observability by using metrics, logs, and traces.

  • Metrics are numerical measurements of activities that you want to track for your application at specific time intervals. For example, you might want to track technical metrics like request rate and error rate, which can be used as service-level indicators (SLIs). You might also need to track application-specific business metrics like orders placed and payments received.
  • Logs are time-stamped records of discrete events that occur within an application or system. The event could be a failure, an error, or a change in state. Logs might include metrics, and you can also use logs for SLIs.
  • A trace represents the journey of a single user or transaction through a number of separate applications or the components of an application. For example, these components could be microservices. Traces help you to track what components were used in the journeys, where bottlenecks exist, and how long the journeys took.

Metrics, logs, and traces help you monitor your system continuously. Comprehensive monitoring helps you find out where and why errors occurred. You can also detect potential failures before errors occur.

Recommendations

To detect potential failures efficiently, consider the recommendations in the following subsections.

Gain comprehensive insights

To track key metrics like response times and error rates, use Cloud Monitoring and Cloud Logging. These tools also help you to ensure that the metrics consistently meet the needs of your workload.

To make data-driven decisions, analyze default service metrics to understand component dependencies and their impact on overall workload performance.

To customize your monitoring strategy, create and publish your own metrics by using the Google Cloud SDK.

Perform proactive troubleshooting

Implement robust error handling and enable logging across all of the components of your workloads in Google Cloud. Activate logs like Cloud Storage access logs and VPC Flow Logs.

When you configure logging, consider the associated costs. To control logging costs, you can configure exclusion filters on the log sinks to exclude certain logs from being stored.

Optimize resource utilization

Monitor CPU consumption, network I/O metrics, and disk I/O metrics to detect under-provisioned and over-provisioned resources in services like GKE, Compute Engine, and Dataproc. For a complete list of supported services, see Cloud Monitoring overview.

Prioritize alerts

For alerts, focus on critical metrics, set appropriate thresholds to minimize alert fatigue, and ensure timely responses to significant issues. This targeted approach lets you proactively maintain workload reliability. For more information, see Alerting overview.
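
One common way to reduce alert fatigue is to alert only when a metric stays above its threshold for several consecutive measurement windows, instead of on every brief spike. The following sketch shows that logic in isolation. The function name, the threshold, and the window count are illustrative assumptions rather than settings from any specific alerting product.

```python
def should_alert(error_rates, threshold, sustained_windows):
    """Return True if the most recent samples all exceed the threshold.

    `error_rates` holds one sample per measurement window, oldest first.
    """
    if sustained_windows < 1:
        raise ValueError("sustained_windows must be at least 1")
    if len(error_rates) < sustained_windows:
        return False  # Not enough data to call the problem sustained.
    recent = error_rates[-sustained_windows:]
    return all(rate > threshold for rate in recent)


# A single spike doesn't trigger an alert.
print(should_alert([0.01, 0.08, 0.01, 0.02], threshold=0.05, sustained_windows=3))  # False

# Three consecutive windows above the threshold do.
print(should_alert([0.01, 0.06, 0.07, 0.09], threshold=0.05, sustained_windows=3))  # True
```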

Design for graceful degradation

This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you to design your Google Cloud workloads to fail gracefully.

This principle is relevant to the response focus area of reliability.

Principle overview

Graceful degradation is a design approach where a system that experiences a high load continues to function, possibly with reduced performance or accuracy. Graceful degradation ensures continued availability of the system and prevents complete failure, even if the system's work isn't optimal. When the load returns to a manageable level, the system resumes full functionality.

For example, during periods of high load, Google Search prioritizes results from higher-ranked web pages, potentially sacrificing some accuracy. When the load decreases, Google Search recomputes the search results.

Recommendations

To design your systems for graceful degradation, consider the recommendations in the following subsections.

Implement throttling

Ensure that your replicas can independently handle overloads and can throttle incoming requests during high-traffic scenarios. This approach helps you to prevent cascading failures that are caused by shifts in excess traffic between zones.

Use tools like Apigee to control the rate of API requests during high-traffic times. You can configure policy rules to reflect how you want to scale back requests.
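
To illustrate what throttling means at the level of a single replica, the following sketch implements a token bucket, which is one widely used rate-limiting technique. It is not a description of how Apigee or any other Google Cloud product works internally. The class name, the parameters, and the injectable clock are assumptions made for this example.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is throttled."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill tokens for the elapsed time, without exceeding the capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_sec)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# A fake clock makes the behavior deterministic for the example.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, capacity=3, clock=lambda: fake_now[0])

print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
fake_now[0] = 1.0                          # One second later, two tokens are refilled.
print([bucket.allow() for _ in range(3)])  # [True, True, False]
```

Because each replica holds its own bucket, each one can protect itself independently when traffic shifts toward it from another zone.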

Drop excess requests early

Configure your systems to drop excess requests at the frontend layer to protect backend components. Dropping some requests prevents global failures and enables the system to recover more gracefully. With this approach, some users might experience errors. However, you can minimize the impact of outages, in contrast to an approach like circuit-breaking, where all traffic is dropped during an overload.
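
A minimal sketch of early request dropping, assuming a frontend that limits the number of requests in flight: once the limit is reached, new requests fail fast with an HTTP 503 status instead of queueing on the backend. The `LoadShedder` class, the `handle` function, and the limit value are hypothetical names and numbers chosen for illustration.

```python
import threading


class LoadShedder:
    """Rejects new work once `max_in_flight` requests are already being served."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, backend_call):
    """Serve the request, or fail fast with 503 before touching the backend."""
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, backend_call()
    finally:
        shedder.release()


shedder = LoadShedder(max_in_flight=1)
print(handle(shedder, lambda: "ok"))  # (200, 'ok')

# Simulate a second request that arrives while the first is still in flight.
shedder.try_acquire()
print(handle(shedder, lambda: "ok"))  # (503, 'overloaded, retry later')
shedder.release()
```

Only the requests above the limit are rejected, so the requests that are admitted still complete normally.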

Handle partial errors and retries

Build your applications to handle partial errors and retries seamlessly. This design helps to ensure that as much traffic as possible is served during high-load scenarios.
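
Retries are commonly implemented with capped exponential backoff and random jitter, so that many clients don't retry at the same instant and deepen an overload. The following sketch shows that technique. `TransientError` is a hypothetical exception type, and the attempt count and delay values are placeholder assumptions that you would tune for your own workload.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation`, retrying on TransientError with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Give up and surface the error to the caller.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients don't retry in lockstep.
            sleep(rng() * backoff)


attempts = []


def flaky():
    """Fails twice, then succeeds, to imitate a briefly overloaded backend."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("backend busy")
    return "done"


print(call_with_retries(flaky, sleep=lambda seconds: None))  # done
print(len(attempts))                                          # 3
```

Retry only operations that are safe to repeat, and keep the attempt limit low so that retries don't add to the load that caused the errors.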

Test overload scenarios

To validate that the throttle and request-drop mechanisms work effectively, regularly simulate overload conditions in your system. Testing helps ensure that your system is prepared for real-world traffic surges.

Monitor traffic spikes

Use analytics and monitoring tools to predict and respond to traffic surges before they escalate into overloads. Early detection and response can help maintain service availability during high-demand periods.

Perform testing for recovery from failures

This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you design and run tests for recovery in the event of failures.

This principle is relevant to the learning focus area of reliability.

Principle overview

To be sure that your system can recover from failures, you must periodically run tests that include regional failovers, release rollbacks, and data restoration from backups.

This testing helps you to practice responses to events that pose major risks to reliability, such as the outage of an entire region. This testing also helps you verify that your system behaves as intended during a disruption.

In the unlikely event of an entire region going down, you need to fail over all traffic to another region. During normal operation of your workload, when data is modified, it needs to be synchronized from the primary region to the failover region. You need to verify that the replicated data is always very recent, so that users don't experience data loss or session breakage. The load balancing system must also be able to shift traffic to the failover region at any time without service interruptions.

To minimize downtime after a regional outage, operations engineers also need to be able to manually and efficiently shift user traffic away from a region in as little time as possible. This operation is sometimes called draining a region, which means that you stop the inbound traffic to the region and move all the traffic elsewhere.

Recommendations

When you design and run tests for failure recovery, consider the recommendations in the following subsections.

Define the testing objectives and scope

Clearly define what you want to achieve from the testing. For example, your objectives can include the following:

  • Validate the recovery time objective (RTO) and the recovery point objective (RPO). For details, see Basics of DR planning.
  • Assess system resilience and fault tolerance under various failure scenarios.
  • Test the effectiveness of automated failover mechanisms.

Decide which components, services, or regions are in the testing scope. The scope can include specific application tiers like the frontend, backend, and database, or it can include specific Google Cloud resources like Cloud SQL instances or GKE clusters. The scope must also specify any external dependencies, such as third-party APIs or cloud interconnections.

Prepare the environment for testing

Choose an appropriate environment, preferably a staging or sandbox environment that replicates your production setup. If you conduct the test in production, ensure that you have safety measures ready, like automated monitoring and manual rollback procedures.

Create a backup plan. Take snapshots or backups of critical databases and services to prevent data loss during the test. Ensure that your team is prepared to do manual interventions if the automated failover mechanisms fail.

To prevent test disruptions, ensure that your IAM roles, policies, and failover configurations are correctly set up. Verify that the necessary permissions are in place for the test tools and scripts.

Inform stakeholders, including operations, DevOps, and application owners, about the test schedule, scope, and potential impact. Provide stakeholders with an estimated timeline and the expected behaviors during the test.

Simulate failure scenarios

Plan and execute failures by using tools like Chaos Monkey. You can use custom scripts to simulate failures of critical services such as a shutdown of a primary node in a multi-zone GKE cluster or a disabled Cloud SQL instance. You can also use scripts to simulate a region-wide network outage by using firewall rules or API restrictions based on your scope of test. Gradually escalate the failure scenarios to observe system behavior under various conditions.

Introduce load testing alongside failure scenarios to replicate real-world usage during outages. Test cascading failure impacts, such as how frontend systems behave when backend services are unavailable.

To validate configuration changes and to assess the system's resilience against human errors, test scenarios that involve misconfigurations. For example, run tests with incorrect DNS failover settings or incorrect IAM permissions.

Monitor system behavior

Monitor how load balancers, health checks, and other mechanisms reroute traffic. Use Google Cloud tools like Cloud Monitoring and Cloud Logging to capture metrics and events during the test.

Observe changes in latency, error rates, and throughput during and after the failure simulation, and monitor the overall performance impact. Identify any degradation or inconsistencies in the user experience.

Ensure that logs are generated and alerts are triggered for key events, such as service outages or failovers. Use this data to verify the effectiveness of your alerting and incident response systems.

Verify recovery against your RTO and RPO

Measure how long it takes for the system to resume normal operations after a failure, and then compare this data with the defined RTO and document any gaps.

Ensure that data integrity and availability align with the RPO. To test database consistency, compare snapshots or backups of the database before and after a failure.

Evaluate service restoration and confirm that all services are restored to a functional state with minimal user disruption.
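
The comparison against RTO and RPO can be reduced to a small calculation on timestamps that you record during the test. The following sketch assumes that you capture three moments: when the failure started, when service was restored, and the time of the last write that reached the failover copy. The function name, the dictionary keys, and all of the sample timestamps and targets are hypothetical.

```python
from datetime import datetime, timedelta


def recovery_report(failure_start, service_restored, last_replicated_write, rto, rpo):
    """Compare the measured recovery time and data-loss window with the targets."""
    recovery_time = service_restored - failure_start
    data_loss_window = failure_start - last_replicated_write
    return {
        "recovery_time": recovery_time,
        "rto_met": recovery_time <= rto,
        "rto_gap": max(timedelta(0), recovery_time - rto),
        "data_loss_window": data_loss_window,
        "rpo_met": data_loss_window <= rpo,
    }


report = recovery_report(
    failure_start=datetime(2024, 1, 1, 12, 0, 0),
    service_restored=datetime(2024, 1, 1, 12, 42, 0),
    last_replicated_write=datetime(2024, 1, 1, 11, 56, 0),
    rto=timedelta(minutes=30),
    rpo=timedelta(minutes=5),
)
print(report["rto_met"], report["rto_gap"])  # False 0:12:00
print(report["rpo_met"])                     # True
```

In this made-up run, recovery took 42 minutes against a 30-minute RTO, so the 12-minute gap is what you would document and investigate.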

Document and analyze results

Document each test step, failure scenario, and corresponding system behavior. Include timestamps, logs, and metrics for detailed analyses.

Highlight bottlenecks, single points of failure, or unexpected behaviors observed during the test. To help prioritize fixes, categorize issues by severity and impact.

Suggest improvements to the system architecture, failover mechanisms, or monitoring setups. Based on test findings, update any relevant failover policies and playbooks. Present a postmortem report to stakeholders. The report should summarize the outcomes, lessons learned, and next steps. For more information, see Conduct thorough postmortems.

Iterate and improve

To validate ongoing reliability and resilience, plan periodic testing (for example, quarterly).

Run tests under different scenarios, including infrastructure changes, software updates, and increased traffic loads.

Automate failover tests by using CI/CD pipelines to integrate reliability testing into your development lifecycle.

During the postmortem, use feedback from stakeholders and end users to improve the test process and system resilience.

Perform testing for recovery from data loss

This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you design and run tests for recovery from data loss.

This principle is relevant to the learning focus area of reliability.

Principle overview

To ensure that your system can recover from situations where data is lost or corrupted, you need to run tests for those scenarios. Instances of data loss might be caused by a software bug or some type of natural disaster. After such events, you need to restore data from backups and bring all of the services back up again by using the freshly restored data.

We recommend that you use three criteria to judge the success or failure of this type of recovery test: data integrity, recovery time objective (RTO), and recovery point objective (RPO). For details about the RTO and RPO metrics, see Basics of DR planning.

The goal of data restoration testing is to periodically verify that your organization can continue to meet business continuity requirements. Besides measuring RTO and RPO, a data restoration test must include testing of the entire application stack and all the critical infrastructure services with the restored data. This is necessary to confirm that the entire deployed application works correctly in the test environment.

Recommendations

When you design and run tests for recovering from data loss, consider the recommendations in the following subsections.

Verify backup consistency and test restoration processes

You need to verify that your backups contain consistent and usable snapshots of data that you can restore to immediately bring applications back into service. To validate data integrity, set up automated consistency checks to run after each backup.
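
One simple form of automated consistency check is to record a cryptographic digest when the backup is taken and to recompute it later from the stored copy. The following sketch uses SHA-256 from the Python standard library. The function names and the sample data are assumptions made for illustration. A matching digest shows only that the stored bytes are unchanged; it doesn't show that the application can use the data, so restore tests are still needed.

```python
import hashlib


def sha256_digest(data: bytes) -> str:
    """Hex digest that is recorded when the backup is taken."""
    return hashlib.sha256(data).hexdigest()


def verify_backup(recorded_digest: str, backup_data: bytes) -> bool:
    """True if the stored backup bytes still match the recorded digest."""
    return sha256_digest(backup_data) == recorded_digest


original = b"order-1,order-2,order-3"
recorded = sha256_digest(original)  # Stored alongside the backup

print(verify_backup(recorded, original))                # True
print(verify_backup(recorded, b"order-1,order-2,ord"))  # False (truncated copy)
```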

To test backups, restore them in a non-production environment. To ensure your backups can be restored efficiently and that the restored data meets application requirements, regularly simulate data recovery scenarios. Document the steps for data restoration, and train your teams to execute the steps effectively during a failure.

Schedule regular and frequent backups

To minimize data loss during restoration and to meet RPO targets, it's essential to have regularly scheduled backups. Establish a backup frequency that aligns with your RPO. For example, if your RPO is 15 minutes, schedule backups to run at least every 15 minutes. Optimize the backup intervals to reduce the risk of data loss.

Use Google Cloud tools like Cloud Storage, Cloud SQL automated backups, or Spanner backups to schedule and manage backups. For critical applications, use near-continuous backup solutions like point-in-time recovery (PITR) for Cloud SQL or incremental backups for large datasets.

Define and monitor RPO

Set a clear RPO based on your business needs, and monitor adherence to the RPO. If backup intervals exceed the defined RPO, use Cloud Monitoring to set up alerts.
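
The adherence check itself is a comparison between the RPO and the gaps between successful backups. The following sketch shows that logic with plain timestamps. The function name and the sample times are hypothetical, and the example doesn't call Cloud Monitoring. It only shows the condition that an alert would evaluate.

```python
from datetime import datetime, timedelta


def rpo_violations(backup_times, rpo, now):
    """Return (start, end) intervals where the gap between backups exceeded `rpo`.

    The interval from the latest backup to `now` is checked as well, so a
    backup job that has silently stopped is also reported.
    """
    points = sorted(backup_times) + [now]
    return [
        (earlier, later)
        for earlier, later in zip(points, points[1:])
        if later - earlier > rpo
    ]


backups = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 15),
    datetime(2024, 1, 1, 12, 45),  # The 12:30 backup was missed.
]
violations = rpo_violations(
    backups, rpo=timedelta(minutes=15), now=datetime(2024, 1, 1, 12, 50)
)
print(len(violations))                        # 1
print(violations[0][1] - violations[0][0])    # 0:30:00
```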

Monitor backup health

Use Google Cloud Backup and DR service or similar tools to track the health of your backups and confirm that they are stored in secure and reliable locations. Ensure that the backups are replicated across multiple regions for added resilience.

Plan for scenarios beyond backup

Combine backups with disaster recovery strategies like active-active failover setups or cross-region replication for improved recovery time in extreme cases. For more information, see Disaster recovery planning guide.

Conduct thorough postmortems

This principle in the reliability pillar of the Google Cloud Architecture Framework provides recommendations to help you conduct effective postmortems after failures and incidents.

This principle is relevant to the learning focus area of reliability.

Principle overview

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve the incident, the root causes, and the follow-up actions to prevent the incident from recurring. The goal of a postmortem is to learn from mistakes and not assign blame.

The following diagram shows the workflow of a postmortem:

The workflow of a postmortem.

The workflow of a postmortem includes the following steps:

  • Create postmortem
  • Capture the facts
  • Identify and analyze the root causes
  • Plan for the future
  • Execute the plan

Conduct postmortem analyses after major events and non-major events like the following:

  • User-visible downtimes or degradations beyond a certain threshold.
  • Data losses of any kind.
  • Interventions from on-call engineers, such as a release rollback or rerouting of traffic.
  • Resolution times above a defined threshold.
  • Monitoring failures, which usually imply manual incident discovery.

Recommendations

Define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary.

To conduct effective postmortems, consider the recommendations in the following subsections.

Conduct blameless postmortems

Effective postmortems focus on processes, tools, and technologies, and don't place blame on individuals or teams. The purpose of a postmortem analysis is to improve your technology and your future outcomes, not to find who is guilty. Everyone makes mistakes. The goal should be to analyze the mistakes and learn from them.

The following examples show the difference between feedback that assigns blame and blameless feedback:

  • Feedback that assigns blame: "We need to rewrite the entire complicated backend system! It's been breaking weekly for the last three quarters and I'm sure we're all tired of fixing things piecemeal. Seriously, if I get paged one more time I'll rewrite it myself…"
  • Blameless feedback: "An action item to rewrite the entire backend system might actually prevent these pages from continuing to happen. The maintenance manual for this version is quite long and really difficult to be fully trained up on. I'm sure our future on-call engineers will thank us!"

Make the postmortem report readable by all the intended audiences

For each piece of information that you plan to include in the report, assess whether that information is important and necessary to help the audience understand what happened. You can move supplementary data and explanations to an appendix of the report. Reviewers who need more information can request it.

Avoid complex or over-engineered solutions

Before you start to explore solutions for a problem, evaluate the importance of the problem and the likelihood of a recurrence. Adding complexity to the system to solve problems that are unlikely to occur again can lead to increased instability.

Share the postmortem as widely as possible

To ensure that issues don't remain unresolved, publish the outcome of the postmortem to a wide audience and get support from management. The value of a postmortem is proportional to the learning that occurs after the postmortem. When more people learn from incidents, the likelihood of similar failures recurring is reduced.

Google Cloud Architecture Framework: Cost optimization

The cost optimization pillar in the Google Cloud Architecture Framework describes principles and recommendations to optimize the cost of your workloads in Google Cloud.

The intended audience includes the following:

  • CTOs, CIOs, CFOs, and other executives who are responsible for strategic cost management.
  • Architects, developers, administrators, and operators who make decisions that affect cost at all the stages of an organization's cloud journey.

The cost models for on-premises and cloud workloads differ significantly. On-premises IT costs include capital expenditure (CapEx) and operational expenditure (OpEx). On-premises hardware and software assets are acquired and the acquisition costs are depreciated over the operating life of the assets. In the cloud, the costs for most cloud resources are treated as OpEx, where costs are incurred when the cloud resources are consumed. This fundamental difference underscores the importance of the following core principles of cost optimization.

For cost optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Cost optimization in the Architecture Framework.

Core principles

The recommendations in the cost optimization pillar of the Architecture Framework are mapped to the following core principles:

  • Align cloud spending with business value: Ensure that your cloud resources deliver measurable business value by aligning IT spending with business objectives.
  • Foster a culture of cost awareness: Ensure that people across your organization consider the cost impact of their decisions and activities, and ensure that they have access to the cost information required to make informed decisions.
  • Optimize resource usage: Provision only the resources that you need, and pay only for the resources that you consume.
  • Optimize continuously: Continuously monitor your cloud resource usage and costs, and proactively make adjustments as needed to optimize your spending. This approach involves identifying and addressing potential cost inefficiencies before they become significant problems.

These principles are closely aligned with the core tenets of cloud FinOps. FinOps is relevant to any organization, regardless of its size or maturity in the cloud. By adopting these principles and following the related recommendations, you can control and optimize costs throughout your journey in the cloud.

Contributors

Author: Nicolas Pintaux | Customer Engineer, Application Modernization Specialist

Align cloud spending with business value

This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to align your use of Google Cloud resources with your organization's business goals.

Principle overview

To effectively manage cloud costs, you need to maximize the business value that the cloud resources provide and minimize the total cost of ownership (TCO). When you evaluate the resource options for your cloud workloads, consider not only the cost of provisioning and using the resources, but also the cost of managing them. For example, virtual machines (VMs) on Compute Engine might be a cost-effective option for hosting applications. However, when you consider the overhead to maintain, patch, and scale the VMs, the TCO can increase. On the other hand, serverless services like Cloud Run can offer greater business value. The lower operational overhead lets your team focus on core activities and helps to increase agility.

To ensure that your cloud resources deliver optimal value, evaluate the following factors:

  • Provisioning and usage costs: The expenses incurred when you purchase, provision, or consume resources.
  • Management costs: The recurring expenses for operating and maintaining resources, including tasks like patching, monitoring, and scaling.
  • Indirect costs: The costs that you might incur to manage issues like downtime, data loss, or security breaches.
  • Business impact: The potential benefits from the resources, like increased revenue, improved customer satisfaction, and faster time to market.
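
As a rough illustration of how the first three factors combine into TCO, the following Python sketch compares two deployment options. The names `DeploymentOption` and `monthly_tco` and every figure are hypothetical placeholders, not Google Cloud prices. Business impact is deliberately left out because it's rarely reducible to a single number.

```python
from dataclasses import dataclass


@dataclass
class DeploymentOption:
    name: str
    usage_cost: float              # provisioning and usage costs per month
    ops_hours: float               # engineer hours per month (patching, monitoring, scaling)
    hourly_rate: float             # assumed loaded cost of one engineer hour
    expected_incident_cost: float  # indirect costs per month (downtime, data loss)


def monthly_tco(option: DeploymentOption) -> float:
    """Sum usage, management, and indirect costs for one month."""
    management_cost = option.ops_hours * option.hourly_rate
    return option.usage_cost + management_cost + option.expected_incident_cost


# Hypothetical numbers, for illustration only.
vms = DeploymentOption("self-managed VMs", 800.0, 20.0, 90.0, 150.0)
serverless = DeploymentOption("serverless", 1100.0, 4.0, 90.0, 50.0)

for option in (vms, serverless):
    print(f"{option.name}: {monthly_tco(option):.2f} per month")
```

In this made-up scenario, the option with the higher usage cost ends up with the lower TCO because its management and indirect costs are smaller.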

By aligning cloud spending with business value, you get the following benefits:

  • Value-driven decisions: Your teams are encouraged to prioritize solutions that deliver the greatest business value and to consider both short-term and long-term cost implications.
  • Informed resource choice: Your teams have the information and knowledge that they need to assess the business value and TCO of various deployment options, so they choose resources that are cost-effective.
  • Cross-team alignment: Cross-functional collaboration between business, finance, and technical teams ensures that cloud decisions are aligned with the overall objectives of the organization.

Recommendations

To align cloud spending with business objectives, consider the following recommendations.

Prioritize managed services and serverless products

Whenever possible, choose managed services and serverless products to reduce operational overhead and maintenance costs. This choice lets your teams concentrate on their core business activities. They can accelerate the delivery of new features and functionalities, and help drive innovation and value.

The following are examples of how you can implement this recommendation:

  • To run PostgreSQL, MySQL, or Microsoft SQL Server databases, use Cloud SQL instead of deploying those databases on VMs.
  • To run and manage Kubernetes clusters, use Google Kubernetes Engine (GKE) Autopilot instead of deploying containers on VMs.
  • For your Apache Hadoop or Apache Spark processing needs, use Dataproc and Dataproc Serverless. Per-second billing can help to achieve significantly lower TCO when compared to on-premises data lakes.

Balance cost efficiency with business agility

Controlling costs and optimizing resource utilization are important goals. However, you must balance these goals with the need for flexible infrastructure that lets you innovate rapidly, respond quickly to changes, and deliver value faster. The following are examples of how you can achieve this balance:

  • Adopt DORA metrics for software delivery performance. Metrics like change failure rate (CFR) and time to restore (TTR), together with operational measures like time to detect (TTD), can help to identify and fix bottlenecks in your development and deployment processes. By reducing downtime and accelerating delivery, you can achieve both operational efficiency and business agility.
  • Follow Site Reliability Engineering (SRE) practices to improve operational reliability. SRE's focus on automation, observability, and incident response can lead to reduced downtime, lower recovery time, and higher customer satisfaction. By minimizing downtime and improving operational reliability, you can prevent revenue loss and avoid the need to overprovision resources as a safety net to handle outages.

Enable self-service optimization

Encourage a culture of experimentation and exploration by providing your teams with self-service cost optimization tools, observability tools, and resource management platforms. Enable them to provision, manage, and optimize their cloud resources autonomously. This approach helps to foster a sense of ownership, accelerate innovation, and ensure that teams can respond quickly to changing needs while being mindful of cost efficiency.

Adopt and implement FinOps

Adopt FinOps to establish a collaborative environment where everyone is empowered to make informed decisions that balance cost and value. FinOps fosters financial accountability and promotes effective cost optimization in the cloud.

Promote a value-driven and TCO-aware mindset

Encourage your team members to adopt a holistic attitude toward cloud spending, with an emphasis on TCO and not just upfront costs. Use techniques like value stream mapping to visualize and analyze the flow of value through your software delivery process and to identify areas for improvement. Implement unit costing for your applications and services to gain a granular understanding of cost drivers and discover opportunities for cost optimization. For more information, see Maximize business value with cloud FinOps.
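
Unit costing can be as simple as dividing the cost of a service by a business metric. The following is a minimal sketch; the function name `unit_cost`, the choice of "orders" as the unit, and the monthly figures are all assumptions made for illustration.

```python
def unit_cost(total_cost: float, units: int) -> float:
    """Return the cost per unit of business output (for example, per order)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


# Hypothetical monthly figures for one service.
monthly = [
    {"month": "month-1", "cost": 4200.0, "orders": 120_000},
    {"month": "month-2", "cost": 4600.0, "orders": 160_000},
]

for row in monthly:
    per_order = unit_cost(row["cost"], row["orders"])
    print(f"{row['month']}: {per_order:.5f} per order")
```

In this example the total cost rises from one month to the next while the cost per order falls, which is the kind of signal that a total-spend view alone can hide.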

Foster a culture of cost awareness

This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to promote cost awareness across your organization and ensure that team members have the cost information that they need to make informed decisions.

Conventionally, the responsibility for cost management might be centralized to a few select stakeholders and primarily focused on initial project architecture decisions. However, team members across all cloud user roles (analyst, architect, developer, or administrator) can help to reduce the cost of your resources in Google Cloud. By sharing cost data appropriately, you can empower team members to make cost-effective decisions throughout their development and deployment processes.

Principle overview

Stakeholders across various roles – product owners, developers, deployment engineers, administrators, and financial analysts – need visibility into relevant cost data and its relationship to business value. When provisioning and managing cloud resources, they need the following data:

  • Projected resource costs: Cost estimates at the time of design and deployment.
  • Real-time resource usage costs: Up-to-date cost data that can be used for ongoing monitoring and budget validation.
  • Costs mapped to business metrics: Insights into how cloud spending affects key performance indicators (KPIs), to enable teams to identify cost-effective strategies.

Every individual might not need access to raw cost data. However, promoting cost awareness across all roles is crucial because individual decisions can affect costs.

By promoting cost visibility and ensuring clear ownership of cost management practices, you ensure that everyone is aware of the financial implications of their choices and everyone actively contributes to the organization's cost optimization goals. Whether through a centralized FinOps team or a distributed model, establishing accountability is crucial for effective cost optimization efforts.

Recommendations

To promote cost awareness and ensure that your team members have the cost information that they need to make informed decisions, consider the following recommendations.

Provide organization-wide cost visibility

To achieve organization-wide cost visibility, the teams that are responsible for cost management can take the following actions:

  • Standardize cost calculation and budgeting: Use a consistent method to determine the full costs of cloud resources, after factoring in discounts and shared costs. Establish clear and standardized budgeting processes that align with your organization's goals and enable proactive cost management.
  • Use standardized cost management and visibility tools: Use appropriate tools that provide real-time insights into cloud spending and generate regular (for example, weekly) cost progression snapshots. These tools enable proactive budgeting, forecasting, and identification of optimization opportunities. The tools could be cloud provider tools (like the Google Cloud Billing dashboard), third-party solutions, or open-source solutions like the Cost Attribution solution.
  • Implement a cost allocation system: Allocate a portion of the overall cloud budget to each team or project. Such an allocation gives the teams a sense of ownership over cloud spending and encourages them to make cost-effective decisions within their allocated budget.
  • Promote transparency: Encourage teams to discuss cost implications during the design and decision-making processes. Create a safe and supportive environment for sharing ideas and concerns related to cost optimization. Some organizations use positive reinforcement mechanisms like leaderboards or recognition programs. If your organization has restrictions on sharing raw cost data due to business concerns, explore alternative approaches for sharing cost information and insights. For example, consider sharing aggregated metrics (like the total cost for an environment or feature) or relative metrics (like the average cost per transaction or user).

Understand how cloud resources are billed

Pricing for Google Cloud resources might vary across regions. Some resources are billed monthly at a fixed price, and others might be billed based on usage. To understand how Google Cloud resources are billed, use the Google Cloud pricing calculator and product-specific pricing information (for example, Google Kubernetes Engine (GKE) pricing).

Understand resource-based cost optimization options

For each type of cloud resource that you plan to use, explore strategies to optimize utilization and efficiency. The strategies include rightsizing, autoscaling, and adopting serverless technologies where appropriate. The following are examples of cost optimization options for a few Google Cloud products:

  • Cloud Run lets you configure always-allocated CPUs to handle predictable traffic loads at a fraction of the price of the default allocation method (that is, CPUs allocated only during request processing).
  • You can purchase BigQuery slot commitments to save money on data analysis.
  • GKE provides detailed metrics to help you understand cost optimization options.
  • Understand how network pricing can affect the cost of data transfers and how you can optimize costs for specific networking services. For example, you can reduce the data transfer costs for external Application Load Balancers by using Cloud CDN or Google Cloud Armor. For more information, see Ways to lower external Application Load Balancer costs.

Understand discount-based cost optimization options

Familiarize yourself with the discount programs that Google Cloud offers, such as the following examples:

  • Committed use discounts (CUDs): CUDs are suitable for resources that have predictable and steady usage. CUDs let you get significant reductions in price in exchange for committing to specific resource usage over a period (typically one to three years). You can also use CUD auto-renewal to avoid having to manually repurchase commitments when they expire.
  • Sustained use discounts: For certain Google Cloud products like Compute Engine and GKE, you can get automatic discount credits after continuous resource usage beyond specific duration thresholds.
  • Spot VMs: For fault-tolerant and flexible workloads, Spot VMs can help to reduce your Compute Engine costs. The cost of Spot VMs is significantly lower than regular VMs. However, Compute Engine might preemptively stop or delete Spot VMs to reclaim capacity. Spot VMs are suitable for batch jobs that can tolerate preemption and don't have high availability requirements.
  • Discounts for specific product options: Some managed services like BigQuery offer discounts when you purchase dedicated or autoscaling query processing capacity.

Evaluate and choose the discount options that align with your workload characteristics and usage patterns.
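
To judge whether a commitment suits a workload, it can help to estimate the utilization at which the commitment breaks even with on-demand pricing. The sketch below uses a simplified model (an assumption, not the billing logic of any product): you pay the discounted rate for the full committed amount whether or not you use it, and usage above the commitment is billed on demand. The discount value is hypothetical.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the committed capacity that must be used for the
    commitment to cost no more than on-demand pricing."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def commitment_savings(on_demand_rate: float, committed_units: float,
                       used_units: float, discount: float) -> float:
    """Savings (positive) or loss (negative) versus pure on-demand for one period.

    Usage above the commitment is billed on demand in both scenarios,
    so it cancels out and is ignored here.
    """
    committed_cost = committed_units * on_demand_rate * (1.0 - discount)
    covered_on_demand_cost = min(used_units, committed_units) * on_demand_rate
    return covered_on_demand_cost - committed_cost


# Hypothetical 25% discount on 100 committed units at a rate of 1.0 per unit.
print(break_even_utilization(0.25))
print(commitment_savings(1.0, 100, 90, 0.25))  # usage above break-even
print(commitment_savings(1.0, 100, 60, 0.25))  # usage below break-even
```

Under this model, a workload that steadily uses more than the break-even fraction benefits from the commitment, and a workload that falls below it pays more than it would on demand.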

Incorporate cost estimates into architecture blueprints

Encourage teams to develop architecture blueprints that include cost estimates for different deployment options and configurations. This practice empowers teams to compare costs proactively and make informed decisions that align with both technical and financial objectives.

Use a consistent and standard set of labels for all your resources

You can use labels to track costs and to identify and classify resources. Specifically, you can use labels to allocate costs to different projects, departments, or cost centers. Defining a formal labeling policy that aligns with the needs of the main stakeholders in your organization helps to make costs visible more widely. You can also use labels to filter resource cost and usage data based on target audience.

Use automation tools like Terraform to enforce labeling on every resource that is created. To enhance cost visibility and attribution further, you can use the tools provided by the open-source cost attribution solution.
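
A labeling policy is easier to enforce when it can be checked automatically, for example in a CI step before resources are created. The following sketch validates a set of labels against a hypothetical policy. The required keys, the allowed `env` values, and the value pattern are assumptions for illustration; check the current Google Cloud label requirements before you rely on a pattern like this.

```python
import re

# Hypothetical policy: every resource must carry these label keys.
REQUIRED_KEYS = {"team", "env", "cost-center"}
ALLOWED_ENVS = {"dev", "test", "staging", "prod"}
# Assumed value format: lowercase letters, digits, underscores, and hyphens.
VALUE_PATTERN = re.compile(r"^[a-z0-9_-]{1,63}$")


def label_violations(labels: dict[str, str]) -> list[str]:
    """Return human-readable policy violations (an empty list means compliant)."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(labels)):
        problems.append(f"missing required label: {key}")
    for key, value in sorted(labels.items()):
        if not VALUE_PATTERN.match(value):
            problems.append(f"invalid value for {key}: {value!r}")
    if "env" in labels and labels["env"] not in ALLOWED_ENVS:
        problems.append(f"unknown env: {labels['env']!r}")
    return problems


print(label_violations({"team": "payments", "env": "prod", "cost-center": "cc-1234"}))
print(label_violations({"team": "Payments", "env": "qa"}))
```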

Share cost reports with team members

By sharing cost reports with your team members, you empower them to take ownership of their cloud spending. This practice enables cost-effective decision making, continuous cost optimization, and systematic improvements to your cost allocation model.

Cost reports can be of several types, including the following:

  • Periodic cost reports: Regular reports inform teams about their current cloud spending. Conventionally, these reports might be spreadsheet exports. More effective methods include automated emails and specialized dashboards. To ensure that cost reports provide relevant and actionable information without overwhelming recipients with unnecessary detail, the reports must be tailored to the target audiences. Setting up tailored reports is a foundational step toward more real-time and interactive cost visibility and management.
  • Automated notifications: You can configure cost reports to proactively notify relevant stakeholders (for example, through email or chat) about cost anomalies, budget thresholds, or opportunities for cost optimization. By providing timely information directly to those who can act on it, automated alerts encourage prompt action and foster a proactive approach to cost optimization.
  • Google Cloud dashboards: You can use the built-in billing dashboards in Google Cloud to get insights into cost breakdowns and to identify opportunities for cost optimization. Google Cloud also provides FinOps hub to help you monitor savings and get recommendations for cost optimization. An AI engine powers the FinOps hub to recommend cost optimization opportunities for all the resources that are currently deployed. To control access to these recommendations, you can implement role-based access control (RBAC).
  • Custom dashboards: You can create custom dashboards by exporting cost data to an analytics database, like BigQuery. Use a visualization tool like Looker Studio to connect to the analytics database to build interactive reports and enable fine-grained access control through role-based permissions.
  • Multicloud cost reports: For multicloud deployments, you need a unified view of costs across all the cloud providers to ensure comprehensive analysis, budgeting, and optimization. Use tools like BigQuery to centralize and analyze cost data from multiple cloud providers, and use Looker Studio to build team-specific interactive reports.
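
Whatever the reporting tool, a team-specific report usually reduces to grouping cost records by an ownership label. The sketch below shows that aggregation on simplified, hypothetical records. The record shape is an assumption and is not the schema of the Cloud Billing export.

```python
from collections import defaultdict


def cost_by_label(rows, label_key, unlabeled="(unlabeled)"):
    """Sum cost per value of one label, with the largest total first."""
    totals = defaultdict(float)
    for row in rows:
        owner = row.get("labels", {}).get(label_key, unlabeled)
        totals[owner] += row["cost"]
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical cost records.
rows = [
    {"service": "Compute Engine", "cost": 120.0, "labels": {"team": "search"}},
    {"service": "BigQuery", "cost": 80.0, "labels": {"team": "analytics"}},
    {"service": "Cloud Storage", "cost": 30.0, "labels": {"team": "search"}},
    {"service": "Cloud Run", "cost": 10.0, "labels": {}},
]

for team, total in cost_by_label(rows, "team").items():
    print(f"{team}: {total:.2f}")
```

Reporting the unlabeled total explicitly, rather than dropping those rows, also shows how much spend the labeling policy doesn't yet cover.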

Optimize resource usage

This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you plan and provision resources to match the requirements and consumption patterns of your cloud workloads.

Principle overview

To optimize the cost of your cloud resources, you need to thoroughly understand your workloads' resource requirements and load patterns. This understanding is the basis for a well-defined cost model that lets you forecast the total cost of ownership (TCO) and identify cost drivers throughout your cloud adoption journey. By proactively analyzing and forecasting cloud spending, you can make informed choices about resource provisioning, utilization, and cost optimization. This approach lets you control cloud spending, avoid overprovisioning, and ensure that cloud resources are aligned with the dynamic needs of your workloads and environments.

Recommendations

To effectively optimize cloud resource usage, consider the following recommendations.

Choose environment-specific resources

Each deployment environment has different requirements for availability, reliability, and scalability. For example, developers might prefer an environment that lets them rapidly deploy and run applications for short durations, but might not need high availability. On the other hand, a production environment typically needs high availability. To maximize the utilization of your resources, define environment-specific requirements based on your business needs. The following table lists examples of environment-specific requirements.

Environment Requirements
Production
  • High availability
  • Predictable performance
  • Operational stability
  • Security with robust resources
Development and testing
  • Cost efficiency
  • Flexible infrastructure with burstable capacity
  • Ephemeral infrastructure when data persistence is not necessary
Other environments (like staging and QA)
  • Tailored resource allocation based on environment-specific requirements

Choose workload-specific resources

Each of your cloud workloads might have different requirements for availability, scalability, security, and performance. To optimize costs, you need to align resource choices with the specific requirements of each workload. For example, a stateless application might not require the same level of availability or reliability as a stateful backend. The following table lists more examples of workload-specific requirements.

Workload type | Workload requirements | Resource options
Mission-critical | Continuous availability, robust security, and high performance | Premium resources and managed services like Spanner for high availability and global consistency of data.
Non-critical | Cost-efficient and autoscaling infrastructure | Resources with basic features and ephemeral resources like Spot VMs.
Event-driven | Dynamic scaling based on the current demand for capacity and performance | Serverless services like Cloud Run and Cloud Run functions.
Experimental workloads | Low cost and flexible environment for rapid development, iteration, testing, and innovation | Resources with basic features, ephemeral resources like Spot VMs, and sandbox environments with defined spending limits.

A benefit of the cloud is the opportunity to take advantage of the most appropriate computing power for a given workload. Some workloads are developed to take advantage of specific processor instruction sets, and others might not be designed in this way. Benchmark and profile your workloads accordingly. Categorize your workloads and make workload-specific resource choices (for example, choose appropriate machine families for Compute Engine VMs). This practice helps to optimize costs, enable innovation, and maintain the level of availability and performance that your workloads need.

The following are examples of how you can implement this recommendation:

  • For mission-critical workloads that serve globally distributed users, consider using Spanner. Spanner removes the need for complex database deployments by ensuring reliability and consistency of data in all regions.
  • For workloads with fluctuating load levels, use autoscaling to ensure that you don't incur costs when the load is low and yet maintain sufficient capacity to meet the current load. You can configure autoscaling for many Google Cloud services, including Compute Engine VMs, Google Kubernetes Engine (GKE) clusters, and Cloud Run. When you set up autoscaling, you can configure maximum scaling limits to ensure that costs remain within specified budgets.
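
To show how a maximum scaling limit caps cost, the following sketch implements a generic target-utilization scaling rule. It is an illustrative assumption, not the exact algorithm of any Google Cloud autoscaler, and the function name and parameters are hypothetical. Utilization values are whole percentages.

```python
import math


def desired_instances(current_instances: int, current_utilization_pct: float,
                      target_utilization_pct: float,
                      min_instances: int, max_instances: int) -> int:
    """Scale proportionally toward the target utilization, then clamp."""
    if current_instances < 1 or target_utilization_pct <= 0:
        raise ValueError("need at least one instance and a positive target")
    raw = math.ceil(current_instances * current_utilization_pct / target_utilization_pct)
    # The maximum bounds spending; the minimum preserves baseline capacity.
    return max(min_instances, min(max_instances, raw))


print(desired_instances(4, 90, 60, 1, 10))  # load is high: scale out
print(desired_instances(4, 90, 60, 1, 5))   # same load, capped by the maximum
print(desired_instances(4, 10, 60, 2, 10))  # load is low: scale in to the minimum
```

The maximum acts as a budget guardrail: when it's reached, the trade-off is degraded performance instead of unbounded cost, so choose it with both the budget and the workload's requirements in mind.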

Select regions based on cost requirements

For your cloud workloads, carefully evaluate the available Google Cloud regions and choose regions that align with your cost objectives. The region with lowest cost might not offer optimal latency or it might not meet your sustainability requirements. Make informed decisions about where to deploy your workloads to achieve the desired balance. You can use the Google Cloud Region Picker to understand the trade-offs between cost, sustainability, latency, and other factors.
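
One way to make such trade-offs explicit is a weighted score per candidate region. The sketch below is a simple illustration and is not how the Google Cloud Region Picker works. The region names, metric values, and weights are all hypothetical, and every metric is assumed to be positive with lower values being better.

```python
def rank_regions(regions: dict, weights: dict) -> list:
    """Rank regions by a weighted sum of normalized metrics (best first)."""
    metrics = list(weights)
    # Normalize each metric by its worst (largest) value across regions.
    worst = {m: max(values[m] for values in regions.values()) for m in metrics}
    scores = {
        name: sum(weights[m] * values[m] / worst[m] for m in metrics)
        for name, values in regions.items()
    }
    return sorted(scores, key=scores.get)


# Hypothetical inputs: relative price, latency to users, carbon intensity.
regions = {
    "region-a": {"price_index": 1.00, "latency_ms": 40, "carbon": 400},
    "region-b": {"price_index": 1.15, "latency_ms": 25, "carbon": 120},
    "region-c": {"price_index": 0.95, "latency_ms": 110, "carbon": 500},
}

balanced = {"price_index": 0.5, "latency_ms": 0.3, "carbon": 0.2}
cost_only = {"price_index": 1.0}

print(rank_regions(regions, balanced))
print(rank_regions(regions, cost_only))
```

With these inputs, the cheapest region ranks first only when cost is the sole criterion. Once latency and carbon intensity carry weight, it drops to last place.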

Use built-in cost optimization options

Google Cloud products provide built-in features to help you optimize resource usage and control costs. The following table lists examples of cost optimization features that you can use in some Google Cloud products:

Product Cost optimization feature
Compute Engine
  • Automatically add or remove VMs based on the current load by using autoscaling.
  • Avoid overprovisioning by creating and using custom machine types that match your workload's requirements.
  • For non-critical or fault-tolerant workloads, reduce costs by using Spot VMs.
  • In development environments, reduce costs by limiting the run time of VMs or by suspending or stopping VMs when you don't need them.
GKE
  • Automatically adjust the size of GKE clusters based on the current load by using cluster autoscaler.
  • Automatically create and manage node pools based on workload requirements and ensure optimal resource utilization by using node auto-provisioning.
Cloud Storage
  • Automatically transition data to lower-cost storage classes based on the age of data or based on access patterns by using Object Lifecycle Management.
  • Dynamically move data to the most cost-effective storage class based on usage patterns by using Autoclass.
BigQuery
  • Reduce query processing costs for steady-state workloads by using capacity-based pricing.
  • Optimize query performance and costs by using partitioning and clustering techniques.
Google Cloud VMware Engine

Optimize resource sharing

To maximize the utilization of cloud resources, you can deploy multiple applications or services on the same infrastructure, while still meeting the security and other requirements of the applications. For example, in development and testing environments, you can use the same cloud infrastructure to test all the components of an application. For the production environment, you can deploy each component on a separate set of resources to limit the extent of impact in case of incidents.

The following are examples of how you can implement this recommendation:

  • Use a single Cloud SQL instance for multiple non-production environments.
  • Enable multiple development teams to share a GKE cluster by using the fleet team management feature in GKE Enterprise with appropriate access controls.
  • Use GKE Autopilot to take advantage of cost-optimization techniques like bin packing and autoscaling that GKE implements by default.
  • For AI and ML workloads, save GPU costs by using GPU-sharing strategies like multi-instance GPUs, time-sharing GPUs, and NVIDIA MPS.
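
Bin packing is the idea behind much of the saving from shared infrastructure: workloads are placed together so that fewer nodes are needed. The following sketch uses the first-fit decreasing heuristic on hypothetical CPU requests (in millicores). It illustrates the concept only and is not GKE's scheduling algorithm.

```python
def first_fit_decreasing(requests: list, node_capacity: int) -> list:
    """Place requests on equally sized nodes: largest first, first node that fits."""
    nodes = []  # each node is the list of requests placed on it
    for request in sorted(requests, reverse=True):
        if request > node_capacity:
            raise ValueError(f"request {request} exceeds node capacity")
        for node in nodes:
            if sum(node) + request <= node_capacity:
                node.append(request)
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append([request])
    return nodes


# Hypothetical CPU requests for seven workloads, in millicores.
requests = [2500, 1500, 1000, 3000, 500, 2000, 1500]
placement = first_fit_decreasing(requests, node_capacity=4000)

print(f"{len(placement)} shared nodes instead of {len(requests)} dedicated nodes")
for node in placement:
    print(node, "used:", sum(node))
```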

Develop and maintain reference architectures

Create and maintain a repository of reference architectures that are tailored to meet the requirements of different deployment environments and workload types. To streamline the design and implementation process for individual projects, the blueprints can be centrally managed by a team like a Cloud Center of Excellence (CCoE). Project teams can choose suitable blueprints based on clearly defined criteria, to ensure architectural consistency and adoption of best practices. For requirements that are unique to a project, the project team and the central architecture team should collaborate to design new reference architectures. You can share the reference architectures across the organization to foster knowledge sharing and expand the repository of available solutions. This approach ensures consistency, accelerates development, simplifies decision-making, and promotes efficient resource utilization.

Review the reference architectures provided by Google for various use cases and technologies. These reference architectures incorporate best practices for resource selection, sizing, configuration, and deployment. By using these reference architectures, you can accelerate your development process and achieve cost savings from the start.

Enforce cost discipline by using organization policies

Consider using organization policies to limit the available Google Cloud locations and products that team members can use. These policies help to ensure that teams adhere to cost-effective solutions and provision resources in locations that are aligned with your cost optimization goals.

Estimate realistic budgets and set financial boundaries

Develop detailed budgets for each project, workload, and deployment environment. Make sure that the budgets cover all aspects of cloud operations, including infrastructure costs, software licenses, personnel, and anticipated growth. To prevent overspending and ensure alignment with your financial goals, establish clear spending limits or thresholds for projects, services, or specific resources. Monitor cloud spending regularly against these limits. You can use proactive quota alerts to identify potential cost overruns early and take timely corrective action.

In addition to setting budgets, you can use quotas and limits to help enforce cost discipline and prevent unexpected spikes in spending. You can exercise granular control over resource consumption by setting quotas at various levels, including projects, services, and even specific resource types.

The following are examples of how you can implement this recommendation:

  • Project-level quotas: Set spending limits or resource quotas at the project level to establish overall financial boundaries and control resource consumption across all the services within the project.
  • Service-specific quotas: Configure quotas for specific Google Cloud services like Compute Engine or BigQuery to limit the number of instances, CPUs, or storage capacity that can be provisioned.
  • Resource type-specific quotas: Apply quotas to individual resource types like Compute Engine VMs, Cloud Storage buckets, Cloud Run instances, or GKE nodes to restrict their usage and prevent unexpected cost overruns.
  • Quota alerts: Get notifications when your quota usage (at the project level) reaches a percentage of the maximum value.
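
The logic behind threshold-based alerts is the same for budgets and quotas: compare consumption against a limit and notify when a configured percentage is crossed. The following sketch shows that check together with a naive linear forecast. The function names, thresholds, and amounts are hypothetical, and the forecast assumes a constant daily run rate.

```python
def crossed_thresholds(spend: float, budget: float,
                       thresholds=(0.5, 0.9, 1.0)) -> list:
    """Return the thresholds (as fractions of the budget) already reached."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in sorted(thresholds) if ratio >= t]


def forecast_month_end(spend_to_date: float, day_of_month: int,
                       days_in_month: int) -> float:
    """Project month-end spend from the average daily spend so far."""
    return spend_to_date / day_of_month * days_in_month


# Hypothetical budget of 1000 with 600 spent by day 15 of a 30-day month.
print(crossed_thresholds(600.0, 1000.0))
projected = forecast_month_end(600.0, 15, 30)
print(projected, crossed_thresholds(projected, 1000.0))
```

Checking the forecast as well as the actual spend gives earlier warning. In this example, actual spend has crossed only the 50% threshold, while the projection already exceeds the budget.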

By using quotas and limits in conjunction with budgeting and monitoring, you can create a proactive and multi-layered approach to cost control. This approach helps to ensure that your cloud spending remains within defined boundaries and aligns with your business objectives. Remember, these cost controls are not permanent or rigid. To ensure that the cost controls remain aligned with current industry standards and reflect your evolving business needs, you must review the controls regularly and adjust them to include new technologies and best practices.

Optimize continuously

This principle in the cost optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you optimize the cost of your cloud deployments based on constantly changing and evolving business goals.

As your business grows and evolves, your cloud workloads need to adapt to changes in resource requirements and usage patterns. To derive maximum value from your cloud spending, you must maintain cost-efficiency while continuing to support business objectives. This requires a proactive and adaptive approach that focuses on continuous improvement and optimization.

Principle overview

To optimize cost continuously, you must proactively monitor and analyze your cloud environment and make suitable adjustments to meet current requirements. Focus your monitoring efforts on key performance indicators (KPIs) that directly affect your end users' experience, align with your business goals, and provide insights for continuous improvement. This approach lets you identify and address inefficiencies, adapt to changing needs, and continuously align cloud spending with strategic business goals. To balance comprehensive observability with cost effectiveness, understand the costs and benefits of monitoring resource usage and use appropriate process-improvement and optimization strategies.

Recommendations

To effectively monitor your Google Cloud environment and optimize cost continuously, consider the following recommendations.

Focus on business-relevant metrics

Effective monitoring starts with identifying the metrics that are most important for your business and customers. These metrics include the following:

  • User experience metrics: Latency, error rates, throughput, and customer satisfaction metrics are useful for understanding your end users' experience when using your applications.
  • Business outcome metrics: Revenue, customer growth, and engagement can be correlated with resource usage to identify opportunities for cost optimization.
  • DevOps Research & Assessment (DORA) metrics: Metrics like deployment frequency, lead time for changes, change failure rate, and time to restore provide insights into the efficiency and reliability of your software delivery process. By improving these metrics, you can increase productivity, reduce downtime, and optimize cost.
  • Site Reliability Engineering (SRE) metrics: Error budgets help teams to quantify and manage the acceptable level of service disruption. By establishing clear expectations for reliability, error budgets empower teams to innovate and deploy changes more confidently, knowing their safety margin. This proactive approach promotes a balance between innovation and stability, helping prevent excessive operational costs associated with major outages or prolonged downtime.
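
An error budget follows directly from a service level objective (SLO): it's the share of the measurement window in which the service is allowed to be unavailable. The sketch below computes a time-based budget. The SLO, the window, and the downtime figure are example values.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


# Example: a 99.9% availability SLO over 30 days, with 10.8 minutes of downtime so far.
print(round(error_budget_minutes(0.999, 30), 1), "minutes of budget")
print(round(budget_remaining(0.999, 30, 10.8), 2), "of the budget remaining")
```

A team might use the remaining fraction as a release gate, for example by slowing risky changes when little budget is left. The specific policy is an organizational choice.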

Use observability for resource optimization

The following are recommendations to use observability to identify resource bottlenecks and underutilized resources in your cloud deployments:

  • Monitor resource utilization: Use resource utilization metrics to identify Google Cloud resources that are underutilized. For example, use metrics like CPU and memory utilization to identify idle VM resources. For Google Kubernetes Engine (GKE), you can view a detailed breakdown of costs and cost-related optimization metrics. For Google Cloud VMware Engine, review resource utilization to optimize CUDs, storage consumption, and ESXi rightsizing.
  • Use cloud recommendations: Active Assist is a portfolio of intelligent tools that help you optimize your cloud operations. These tools provide actionable recommendations to reduce costs, increase performance, improve security, and even make sustainability-focused decisions. For example, VM rightsizing insights can help to optimize resource allocation and avoid unnecessary spending.
  • Correlate resource utilization with performance: Analyze the relationship between resource utilization and application performance to determine whether you can downgrade to less expensive resources without affecting the user experience.
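
As an illustration of the first recommendation, the following sketch flags VMs whose utilization samples suggest that they are idle. The thresholds, VM names, and sample format are hypothetical. In practice the samples would come from your monitoring system, and suitable thresholds depend on the workload.

```python
from statistics import mean


def find_idle_vms(samples: dict, cpu_threshold: float = 0.05,
                  memory_threshold: float = 0.20) -> list:
    """Return VMs whose peak CPU and average memory stay below the thresholds.

    samples maps a VM name to a list of (cpu, memory) utilization
    fractions between 0 and 1.
    """
    idle = []
    for name, points in samples.items():
        if not points:
            continue  # no data: don't guess
        peak_cpu = max(cpu for cpu, _ in points)
        avg_memory = mean(memory for _, memory in points)
        if peak_cpu < cpu_threshold and avg_memory < memory_threshold:
            idle.append(name)
    return sorted(idle)


# Hypothetical utilization samples.
samples = {
    "vm-a": [(0.02, 0.10), (0.03, 0.12), (0.01, 0.11)],
    "vm-b": [(0.02, 0.10), (0.60, 0.40), (0.03, 0.12)],
    "vm-c": [],
}
print(find_idle_vms(samples))
```

Using peak CPU instead of average CPU is a conservative choice: a VM that is mostly quiet but handles periodic bursts, like `vm-b` here, isn't flagged.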

Balance troubleshooting needs with cost

Detailed observability data can help with diagnosing and troubleshooting issues. However, storing excessive amounts of observability data or exporting unnecessary data to external monitoring tools can lead to unnecessary costs. For efficient troubleshooting, consider the following recommendations:

  • Collect sufficient data for troubleshooting: Ensure that your monitoring solution captures enough data to efficiently diagnose and resolve issues when they arise. This data might include logs, traces, and metrics at various levels of granularity.
  • Use sampling and aggregation: Balance the need for detailed data with cost considerations by using sampling and aggregation techniques. This approach lets you collect representative data without incurring excessive storage costs.
  • Understand the pricing models of your monitoring tools and services: Evaluate different monitoring solutions and choose options that align with your project's specific needs, budget, and usage patterns. Consider factors like data volume, retention requirements, and the required features when making your selection.
  • Regularly review your monitoring configuration: Avoid collecting excessive data by removing unnecessary metrics or logs.
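
The sampling and aggregation techniques mentioned above can be sketched as follows. Deterministic hash-based sampling keeps or drops all the data for a given trace consistently, and per-minute aggregation replaces individual events with counts. The function names, the sample rate, and the event format are assumptions for illustration.

```python
import hashlib
from collections import Counter


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Keep roughly sample_rate of traces, always deciding the same way per ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < int(sample_rate * 10_000)


def aggregate_per_minute(events) -> Counter:
    """Count events per (minute start, status) instead of storing each event.

    events is an iterable of (epoch_seconds, status) pairs.
    """
    counts = Counter()
    for timestamp, status in events:
        counts[(timestamp - timestamp % 60, status)] += 1
    return counts


kept = sum(keep_trace(f"trace-{i}", 0.1) for i in range(10_000))
print(f"kept {kept} of 10000 traces")

events = [(120, "ok"), (130, "ok"), (150, "error"), (185, "ok")]
print(aggregate_per_minute(events))
```

Sampling preserves full detail for a subset of requests, and aggregation preserves totals for all of them, so the two are often used together.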

Tailor data collection to roles and set role-specific retention policies

Consider the specific data needs of different roles. For example, developers might primarily need access to traces and application-level logs, whereas IT administrators might focus on system logs and infrastructure metrics. By tailoring data collection, you can reduce unnecessary storage costs and avoid overwhelming users with irrelevant information.

Additionally, you can define retention policies based on the needs of each role and any regulatory requirements. For example, developers might need access to detailed logs for a shorter period, while financial analysts might require longer-term data.

Consider regulatory and compliance requirements

In certain industries, regulatory requirements mandate data retention. To avoid legal and financial risks, you need to ensure that your monitoring and data retention practices help you adhere to relevant regulations. At the same time, you need to maintain cost efficiency. Consider the following recommendations:

  • Determine the specific data retention requirements for your industry or region, and ensure that your monitoring strategy meets those requirements.
  • Implement appropriate data archival and retrieval mechanisms to meet audit and compliance needs while minimizing storage costs.

Implement smart alerting

Alerting helps to detect and resolve issues in a timely manner. However, you need to strike a balance between an approach that keeps you informed and one that overwhelms you with notifications. By designing intelligent alerting systems, you can prioritize critical issues that have higher business impact. Consider the following recommendations:

  • Prioritize issues that affect customers: Design alerts that trigger rapidly for issues that directly affect the customer experience, like website outages, slow response times, or transaction failures.
  • Tune for temporary problems: Use appropriate thresholds and delay mechanisms to avoid unnecessary alerts for temporary problems or self-healing system issues that don't affect customers.
  • Customize alert severity: Ensure that the most urgent issues receive immediate attention by differentiating between critical and noncritical alerts.
  • Use notification channels wisely: Choose appropriate channels for alert notifications (email, SMS, or paging) based on the severity and urgency of the alerts.
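To illustrate the threshold-and-delay idea in the preceding list, the following sketch raises an alert only when a metric stays above its threshold for several consecutive samples, so that a brief spike doesn't notify anyone. The function names and the severity-to-channel mapping are hypothetical and aren't part of any Google Cloud API.

```python
def should_alert(values, threshold, min_consecutive):
    """Return True if the latest min_consecutive values all exceed threshold."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(values) < min_consecutive:
        return False
    return all(value > threshold for value in values[-min_consecutive:])

def choose_channel(severity):
    """Map an alert severity to a notification channel (hypothetical mapping)."""
    channels = {"critical": "paging", "warning": "email"}
    return channels.get(severity, "email")

# A single spike doesn't alert; a sustained breach does.
spike = [0.50, 0.95, 0.60]
sustained = [0.95, 0.96, 0.97]
print(should_alert(spike, 0.9, 3), should_alert(sustained, 0.9, 3))
```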

Google Cloud Architecture Framework: Performance optimization

This pillar in the Google Cloud Architecture Framework provides recommendations to optimize the performance of workloads in Google Cloud.

This document is intended for architects, developers, and administrators who plan, design, deploy, and manage workloads in Google Cloud.

The recommendations in this pillar can help your organization to operate efficiently, improve customer satisfaction, increase revenue, and reduce cost. For example, when the backend processing time of an application decreases, users experience faster response times, which can lead to higher user retention and more revenue.

The performance optimization process can involve a trade-off between performance and cost. However, optimizing performance can sometimes help you reduce costs. For example, when the load increases, autoscaling can help to provide predictable performance by ensuring that the system resources aren't overloaded. Autoscaling also helps you to reduce costs by removing unused resources during periods of low load.

Performance optimization is a continuous process, not a one-time activity. The following diagram shows the stages in the performance optimization process:

Performance optimization process

The performance optimization process is an ongoing cycle that includes the following stages:

  1. Define requirements: Define granular performance requirements for each layer of the application stack before you design and develop your applications. To plan resource allocation, consider the key workload characteristics and performance expectations.
  2. Design and deploy: Use elastic and scalable design patterns that can help you meet your performance requirements.
  3. Monitor and analyze: Monitor performance continually by using logs, tracing, metrics, and alerts.
  4. Optimize: Consider potential redesigns as your applications evolve. Rightsize cloud resources and use new features to meet changing performance requirements.

    As shown in the preceding diagram, continue the cycle of monitoring, re-assessing requirements, and adjusting the cloud resources.

For performance optimization principles and recommendations that are specific to AI and ML workloads, see AI and ML perspective: Performance optimization in the Architecture Framework.

Core principles

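The recommendations in the performance optimization pillar of the Architecture Framework are mapped to the following core principles:

  • Plan resource allocation
  • Take advantage of elasticity
  • Promote modular design
  • Continuously monitor and improve performance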

Plan resource allocation

This principle in the performance optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you plan resources for your workloads in Google Cloud. It emphasizes the importance of defining granular requirements before you design and develop applications for cloud deployment or migration.

Principle overview

To meet your business requirements, it's important that you define the performance requirements for your applications before design and development begin. Define these requirements as granularly as possible for the application as a whole and for each layer of the application stack. For example, in the storage layer, you must consider the throughput and I/O operations per second (IOPS) that the applications need.

From the beginning, plan application designs with performance and scalability in mind. Consider factors such as the number of users, data volume, and potential growth over time.

Performance requirements for each workload vary and depend on the type of workload. Each workload can contain a mix of component systems and services that have unique sets of performance characteristics. For example, a system that's responsible for periodic batch processing of large datasets has different performance demands than an interactive virtual desktop solution. Your optimization strategies must address the specific needs of each workload.

Select services and features that align with the performance goals of each workload. For performance optimization, there's no one-size-fits-all solution. When you optimize each workload, the entire system can achieve optimal performance and efficiency.

Consider the following workload characteristics that can influence your performance requirements:

  • Deployment archetype: The deployment archetype that you select for an application can influence your choice of products and features, which then determine the performance that you can expect from your application.
  • Resource placement: When you select a Google Cloud region for your application resources, we recommend that you prioritize low latency for end users, adhere to data-locality regulations, and ensure the availability of required Google Cloud products and services.
  • Network connectivity: Choose networking services that optimize data access and content delivery. Take advantage of Google Cloud's global network, high-speed backbones, interconnect locations, and caching services.
  • Application hosting options: When you select a hosting platform, you must evaluate the performance advantages and disadvantages of each option. For example, consider bare metal, virtual machines, containers, and serverless platforms.
  • Storage strategy: Choose an optimal storage strategy that's based on your performance requirements.
  • Resource configurations: The machine type, IOPS, and throughput can have a significant impact on performance. Additionally, early in the design phase, you must consider appropriate security capabilities and their impact on resources. When you plan security features, be prepared to accommodate the necessary performance trade-offs to avoid any unforeseen effects.

Recommendations

To ensure optimal resource allocation, consider the recommendations in the following sections.

Configure and manage quotas

Ensure that your application uses only the necessary resources, such as memory, storage, and processing power. Over-allocation can lead to unnecessary expenses, while under-allocation might result in performance degradation.

To accommodate elastic scaling and to ensure that adequate resources are available, regularly monitor the capacity of your quotas. Additionally, track quota usage to identify potential scaling constraints or over-allocation issues, and then make informed decisions about resource allocation.
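As a rough sketch of quota tracking, the following snippet computes the usage ratio for each quota and reports the quotas that are close to their limits. The quota names, the numbers, and the 80% warning ratio are assumptions for illustration. In practice, you would read usage and limits from your quota monitoring data.

```python
def quotas_near_limit(quotas, warn_ratio=0.8):
    """Return {quota_name: usage_ratio} for quotas at or above warn_ratio.

    quotas maps a quota name to a (usage, limit) pair.
    """
    flagged = {}
    for name, (usage, limit) in quotas.items():
        if limit <= 0:
            continue  # Skip quotas without a meaningful limit.
        ratio = usage / limit
        if ratio >= warn_ratio:
            flagged[name] = round(ratio, 2)
    return flagged

# Hypothetical quota usage.
quotas = {"cpus": (90, 100), "disk_gb": (200, 1000)}
near_limit = quotas_near_limit(quotas)
print(near_limit)
```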

Educate and promote awareness

Inform your users about the performance requirements and provide educational resources about effective performance management techniques.

To evaluate progress and to identify areas for improvement, regularly document the target performance and the actual performance. Load test your application to find potential breakpoints and to understand how you can scale the application.

Monitor performance metrics

Use Cloud Monitoring to analyze trends in performance metrics, to analyze the effects of experiments, to define alerts for critical metrics, and to perform retrospective analyses.

Active Assist is a set of tools that can provide insights and recommendations to help optimize resource utilization. These recommendations can help you to adjust resource allocation and improve performance.

Take advantage of elasticity

This principle in the performance optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you incorporate elasticity, which is the ability to adjust resources dynamically based on changes in workload requirements.

Elasticity allows different components of a system to scale independently. This targeted scaling can help improve performance and cost efficiency by allocating resources precisely where they're needed, without overprovisioning or underprovisioning your resources.

Principle overview

The performance requirements of a system directly influence when and how the system scales vertically or scales horizontally. You need to evaluate the system's capacity and determine the load that the system is expected to handle at baseline. Then, you need to determine how you want the system to respond to increases and decreases in the load.

When the load increases, the system must scale out horizontally, scale up vertically, or both. For horizontal scaling, add replica nodes to ensure that the system has sufficient overall capacity to fulfill the increased demand. For vertical scaling, replace the application's existing components with components that contain more capacity, more memory, and more storage.

When the load decreases, the system must scale down (horizontally, vertically, or both).

Define the circumstances in which the system scales up or scales down. Plan to manually scale up systems for known periods of high traffic. Use tools like autoscaling, which responds to increases or decreases in the load.

Recommendations

To take advantage of elasticity, consider the recommendations in the following sections.

Plan for peak load periods

You need to plan an efficient scaling path for known events, such as expected periods of increased customer demand.

Consider scaling up your system ahead of known periods of high traffic. For example, if you're a retail organization, you expect demand to increase during seasonal sales. We recommend that you manually scale up or scale out your systems before those sales to ensure that your system can immediately handle the increased load or immediately adjust existing limits. Otherwise, the system might take several minutes to add resources in response to real-time changes. If your application's capacity doesn't increase quickly enough, some users might experience delays.

For unknown or unexpected events, such as a sudden surge in demand or traffic, you can use autoscaling features to trigger elastic scaling that's based on metrics. These metrics can include CPU utilization, load balancer serving capacity, latency, and even custom metrics that you define in Cloud Monitoring.

For example, consider an application that runs on a Compute Engine managed instance group (MIG). This application has a requirement that each instance performs optimally until the average CPU utilization reaches 75%. In this example, you might define an autoscaling policy that creates more instances when the CPU utilization reaches the threshold. These newly-created instances help absorb the load, which helps ensure that the average CPU utilization remains at an optimal rate until the maximum number of instances that you've configured for the MIG is reached. When the demand decreases, the autoscaling policy removes the instances that are no longer needed.
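The following sketch shows one way to reason about the 75% CPU example. It uses a simple proportional rule: the recommended group size is the current size multiplied by the ratio of the observed utilization to the target utilization, rounded up and clamped to the configured limits. This rule is a simplified illustration and isn't the exact algorithm that the Compute Engine autoscaler uses. The function name and the default limits are assumptions.

```python
import math

def recommended_instances(current, avg_cpu, target_cpu=0.75, min_size=1, max_size=10):
    """Estimate the group size that brings average CPU back to target_cpu."""
    if current < 1:
        raise ValueError("current must be at least 1")
    desired = math.ceil(current * avg_cpu / target_cpu)
    # Never go below the minimum or above the configured maximum.
    return max(min_size, min(max_size, desired))

scale_out = recommended_instances(4, 0.90)   # Load is above the target.
scale_in = recommended_instances(4, 0.30)    # Load is well below the target.
capped = recommended_instances(8, 0.95)      # Limited by max_size.
print(scale_out, scale_in, capped)
```

In this sketch, a group of four instances at 90% average CPU grows to five instances, and the same group at 30% shrinks to two. The last call shows the effect of the maximum size: the rule asks for eleven instances, but the group stops at ten.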

Plan resource slot reservations in BigQuery or adjust the limits for autoscaling configurations in Spanner by using the managed autoscaler.

Use predictive scaling

If your system components include Compute Engine, you must evaluate whether predictive autoscaling is suitable for your workload. Predictive autoscaling forecasts the future load based on your metrics' historical trends—for example, CPU utilization. Forecasts are recomputed every few minutes, so the autoscaler rapidly adapts its forecast to very recent changes in load. Without predictive autoscaling, an autoscaler can only scale a group reactively, based on observed real-time changes in load. Predictive autoscaling works with both real-time data and historical data to respond to both the current and the forecasted load.
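To make the difference between reactive and predictive scaling concrete, the following sketch forecasts the load for an upcoming time slot as the mean of the same slot on previous days, and then plans for whichever is higher: the observed load or the forecast. This naive forecast is an assumption for illustration only and isn't the forecasting model that Compute Engine uses. The function names and the sample history are hypothetical.

```python
from statistics import mean

def forecast_load(history_by_day, slot):
    """Forecast the load for a time slot as the mean of that slot on previous days."""
    return mean(day[slot] for day in history_by_day)

def planning_load(current_load, history_by_day, next_slot):
    """Plan for the higher of the observed load and the forecasted load."""
    return max(current_load, forecast_load(history_by_day, next_slot))

# Hypothetical requests per second for three time slots on three previous days.
history = [
    [10, 40, 80],
    [12, 44, 90],
    [14, 48, 100],
]
before_peak = planning_load(45, history, 2)    # The forecast dominates.
during_surge = planning_load(120, history, 2)  # The real-time load dominates.
print(before_peak, during_surge)
```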

Implement serverless architectures

Consider implementing a serverless architecture with serverless services that are inherently elastic, such as the following:

Unlike autoscaling in other services that require fine-tuning rules (for example, Compute Engine), serverless autoscaling is instant and can scale down to zero resources.

Use Autopilot mode for Kubernetes

For complex applications that require greater control over Kubernetes, consider Autopilot mode in Google Kubernetes Engine (GKE). Autopilot mode provides automation and scalability by default. GKE automatically scales nodes and resources based on traffic. GKE manages nodes, creates new nodes for your applications, and configures automatic upgrades and repairs.

Promote modular design

This principle in the performance optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you promote a modular design. Modular components and clear interfaces can enable flexible scaling, independent updates, and future component separation.

Principle overview

Understand the dependencies between the application components and the system components to design a scalable system.

Modular design enables flexibility and resilience, regardless of whether a monolithic or microservices architecture was initially deployed. By decomposing the system into well-defined, independent modules with clear interfaces, you can scale individual components to meet specific demands.

Targeted scaling can help optimize resource utilization and reduce costs in the following ways:

  • Provisions only the necessary resources to each component, and allocates fewer resources to less-demanding components.
  • Adds more resources during high-traffic periods to maintain the user experience.
  • Removes under-utilized resources without compromising performance.

Modularity also enhances maintainability. Smaller, self-contained units are easier to understand, debug, and update, which can lead to faster development cycles and reduced risk.

While modularity offers significant advantages, you must evaluate the potential performance trade-offs. The increased communication between modules can introduce latency and overhead. Strive for a balance between modularity and performance. A highly modular design might not be universally suitable. When performance is critical, a more tightly coupled approach might be appropriate. System design is an iterative process, in which you continuously review and refine your modular design.

Recommendations

To promote modular designs, consider the recommendations in the following sections.

Design for loose coupling

Design a loosely coupled architecture. Independent components with minimal dependencies can help you build scalable and resilient applications. As you plan the boundaries for your services, you must consider the availability and scalability requirements. For example, if one component has requirements that are different from your other components, you can design the component as a standalone service. Implement a plan for graceful failures for less-important subprocesses or services that don't impact the response time of the primary services.
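As a minimal sketch of graceful failure for a less-important subprocess, the following example builds a page from a critical call and an optional call. If the optional call fails, the page is still returned with an empty result. The function names and the page shape are hypothetical.

```python
def render_product_page(product_id, fetch_product, fetch_recommendations):
    """Build a page. The recommendations call is optional and might fail."""
    # Critical dependency: let errors propagate to the caller.
    page = {"product": fetch_product(product_id)}
    try:
        page["recommendations"] = fetch_recommendations(product_id)
    except Exception:
        # Degrade gracefully instead of failing the whole page.
        page["recommendations"] = []
    return page

def fetch_product(product_id):
    return {"id": product_id, "name": "example"}

def failing_recommendations(product_id):
    raise TimeoutError("recommendation service unavailable")

page = render_product_page(42, fetch_product, failing_recommendations)
print(page)
```

In a real service, you would typically also set a timeout on the optional call and record the failure so that it remains visible in your monitoring.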

Design for concurrency and parallelism

Design your application to support multiple tasks concurrently, like processing multiple user requests or running background jobs while users interact with your system. Break large tasks into smaller chunks that can be processed at the same time by multiple service instances. Task concurrency lets you use features like autoscaling to increase the resource allocation in products like the following:

Balance modularity for flexible resource allocation

Where possible, ensure that each component uses only the necessary resources (like memory, storage, and processing power) for specific operations. Resource over-allocation can result in unnecessary costs, while resource under-allocation can compromise performance.

Use well-defined interfaces

Ensure modular components communicate effectively through clear, standardized interfaces (like APIs and message queues) to reduce overhead from translation layers or from extraneous traffic.

Use stateless models

A stateless model can help ensure that you can handle each request or interaction with the service independently from previous requests. This model facilitates scalability and recoverability, because you can grow, shrink, or restart the service without losing the data necessary for in-progress requests or processes.
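The following sketch illustrates the stateless model. Each service instance keeps no session data in memory. Instead, it reads and writes the data in a shared store, so any instance can handle any request. The `CartService` class is hypothetical, and the dictionary is a stand-in for an external store such as a database or a cache.

```python
class CartService:
    """A service instance that holds no per-session data in memory."""

    def __init__(self, store):
        # The only attribute is a reference to the shared external store.
        self._store = store

    def add_item(self, session_id, item):
        cart = list(self._store.get(session_id, []))
        cart.append(item)
        self._store[session_id] = cart
        return cart

shared_store = {}  # Stand-in for an external database or cache.
instance_a = CartService(shared_store)
instance_b = CartService(shared_store)

instance_a.add_item("session-1", "book")
# A different instance handles the next request and sees the same cart.
cart = instance_b.add_item("session-1", "pen")
print(cart)
```

Because no instance owns the session, you can add, remove, or restart instances between the two requests without losing the cart.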

Choose complementary technologies

Choose technologies that complement the modular design. Evaluate programming languages, frameworks, and databases for their modularity support.

Continuously monitor and improve performance

This principle in the performance optimization pillar of the Google Cloud Architecture Framework provides recommendations to help you continuously monitor and improve performance.

After you deploy applications, continuously monitor their performance by using logs, tracing, metrics, and alerts. As your applications grow and evolve, you can use the trends in these data points to re-assess your performance requirements. You might eventually need to redesign parts of your applications to maintain or improve their performance.

Principle overview

The process of continuous performance improvement requires robust monitoring tools and strategies. Cloud observability tools can help you to collect key performance indicators (KPIs) such as latency, throughput, error rates, and resource utilization. Cloud environments offer a variety of methods to conduct granular performance assessments across the application, the network, and the end-user experience.

Improving performance is an ongoing effort that requires a multi-faceted approach. The following key mechanisms and processes can help you to boost performance:

  • To provide clear direction and help track progress, define performance objectives that align with your business goals. Set SMART goals: specific, measurable, achievable, relevant, and time-bound.
  • To measure performance and identify areas for improvement, gather KPI metrics.
  • To continuously monitor your systems for issues, use visualized workflows in monitoring tools. Use architecture process mapping techniques to identify redundancies and inefficiencies.
  • To create a culture of ongoing improvement, provide training and programs that support your employees' growth.
  • To encourage proactive and continuous improvement, incentivize your employees and customers to provide ongoing feedback about your application's performance.

Recommendations

To continuously monitor and improve performance, consider the recommendations in the following sections.

Define clear performance goals and metrics

Define clear performance objectives that align with your business goals. This requires a deep understanding of your application's architecture and the performance requirements of each application component.

As a priority, optimize the most critical components that directly influence your core business functions and user experience. To help ensure that these components continue to run efficiently and meet your business needs, set specific and measurable performance targets. These targets can include response times, error rates, and resource utilization thresholds.

This proactive approach can help you to identify and address potential bottlenecks, optimize resource allocation, and ultimately deliver a seamless and high-performing experience for your users.
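As an illustration of specific and measurable targets, the following sketch compares a 95th-percentile latency and an error rate against target values. The nearest-rank percentile method, the function names, the sample data, and the target values are assumptions for this example.

```python
def percentile(values, pct):
    """Return the nearest-rank percentile of values for an integer pct (1 to 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def check_targets(latencies_ms, errors, requests, p95_target_ms, error_rate_target):
    """Compare measured performance with the targets."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / requests
    return {
        "p95_ms": p95,
        "p95_ok": p95 <= p95_target_ms,
        "error_rate_ok": error_rate <= error_rate_target,
    }

# Hypothetical measurements: 20 latency samples from 10 ms to 200 ms.
latencies = list(range(10, 210, 10))
report = check_targets(latencies, errors=3, requests=1000,
                       p95_target_ms=200, error_rate_target=0.01)
print(report)
```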

Monitor performance

Continuously monitor your cloud systems for performance issues and set up alerts for any potential problems. Monitoring and alerts can help you to catch and fix issues before they affect users. Application profiling can help to identify bottlenecks and can help to optimize resource use.

You can use tools that facilitate effective troubleshooting and network optimization. Use Google Cloud Observability to identify areas that have high CPU consumption, memory consumption, or network consumption. These capabilities can help developers improve efficiency, reduce costs, and enhance the user experience. Network Intelligence Center shows visualizations of the topology of your network infrastructure, and can help you to identify high-latency paths.

Incentivize continuous improvement

Create a culture of ongoing improvement that can benefit both the application and the user experience.

Provide your employees with training and development opportunities that enhance their skills and knowledge in performance techniques across cloud services. Establish a community of practice (CoP) and offer mentorship and coaching programs to support employee growth.

To shift from reactive to proactive performance management, encourage ongoing feedback from your employees, your customers, and your stakeholders. You can consider gamifying the process by tracking performance KPIs and presenting those metrics to teams frequently in the form of a league table.

To understand your performance and user happiness over time, we recommend that you measure user feedback quantitatively and qualitatively. The HEART framework can help you capture user feedback across five categories:

  • Happiness
  • Engagement
  • Adoption
  • Retention
  • Task success

By using such a framework, you can incentivize engineers with data-driven feedback, user-centered metrics, actionable insights, and a clear understanding of goals.

Design for environmental sustainability

This document in the Google Cloud Architecture Framework summarizes how you can approach environmental sustainability for your workloads in Google Cloud. It includes information about how to minimize your carbon footprint on Google Cloud.

Understand your carbon footprint

To understand the carbon footprint from your Google Cloud usage, use the Carbon Footprint dashboard. The Carbon Footprint dashboard attributes emissions to the Google Cloud projects that you own and the cloud services that you use.

Choose the most suitable cloud regions

One effective way to reduce carbon emissions is to choose cloud regions with lower carbon emissions. To help you make this choice, Google publishes carbon data for all Google Cloud regions.

When you choose a region, you might need to balance lowering emissions with other requirements, such as pricing and network latency. To help select a region, use the Google Cloud Region Picker.

Choose the most suitable cloud services

To help reduce your existing carbon footprint, consider migrating your on-premises VM workloads to Compute Engine.

Consider serverless options for workloads that don't need VMs. These managed services often optimize resource usage automatically, reducing costs and carbon footprint.

Minimize idle cloud resources

Idle resources incur unnecessary costs and emissions. Some common causes of idle resources include the following:

  • Unused active cloud resources, such as idle VM instances.
  • Over-provisioned resources, such as VM machine types that are larger than necessary for a workload.
  • Non-optimal architectures, such as lift-and-shift migrations that aren't always optimized for efficiency. Consider making incremental improvements to these architectures.

The following are some general strategies to help minimize wasted cloud resources:

  • Identify idle or overprovisioned resources and either delete them or rightsize them.
  • Refactor your architecture to incorporate a more optimal design.
  • Migrate workloads to managed services.

Reduce emissions for batch workloads

Run batch workloads in regions with lower carbon emissions. For further reductions, run workloads at times that coincide with lower grid carbon intensity when possible.
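The following sketch shows one way to pick where and when to run a batch job. From a list of candidate slots, it selects the slot with the lowest carbon intensity among the regions that you're allowed to use. The region names and the intensity values are invented for illustration and aren't published Google Cloud data.

```python
def pick_lowest_carbon(options, allowed_regions):
    """Return the (region, start_hour, grams_co2e_per_kwh) option with the lowest intensity."""
    candidates = [option for option in options if option[0] in allowed_regions]
    if not candidates:
        raise ValueError("no candidate slot in an allowed region")
    return min(candidates, key=lambda option: option[2])

# Hypothetical candidate slots: (region, start hour in UTC, gCO2e per kWh).
options = [
    ("region-a", 2, 120),
    ("region-a", 14, 310),
    ("region-b", 2, 90),
    ("region-c", 2, 40),
]
# region-c is excluded, for example because of data-locality requirements.
best = pick_lowest_carbon(options, allowed_regions={"region-a", "region-b"})
print(best)
```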

Architecture Framework: AI and ML perspective

This document in the Google Cloud Architecture Framework describes principles and recommendations to help you to design, build, and manage AI and ML workloads in Google Cloud that meet your operational, security, reliability, cost, and performance goals.

The target audience for this document includes decision makers, architects, administrators, developers, and operators who design, build, deploy, and maintain AI and ML workloads in Google Cloud.

The following pages describe principles and recommendations that are specific to AI and ML, for each pillar of the Google Cloud Architecture Framework:

AI and ML perspective: Operational excellence

This document in the Architecture Framework: AI and ML perspective provides an overview of the principles and recommendations to help you to build and operate robust AI and ML systems on Google Cloud. These recommendations help you to set up foundational elements like observability, automation, and scalability. This document's recommendations align with the operational excellence pillar of the Architecture Framework.

Operational excellence within the AI and ML domain is the ability to seamlessly deploy, manage, and govern the intricate AI and ML systems and pipelines that power your organization's strategic objectives. Operational excellence lets you respond efficiently to changes, reduce operational complexity, and ensure that operations remain aligned with business goals.

Build a robust foundation for model development

Establish a robust foundation to streamline model development, from problem definition to deployment. Such a foundation ensures that your AI solutions are built on reliable and efficient components and choices. This kind of foundation helps you to release changes and improvements quickly and easily.

Consider the following recommendations:

  • Define the problem that the AI system solves and the outcome that you want.
  • Identify and gather relevant data that's required to train and evaluate your models. Then, clean and preprocess the raw data. Implement data validation checks to ensure data quality and integrity.
  • Choose the appropriate ML approach for the task. When you design the structure and parameters of the model, consider the model's complexity and computational requirements.
  • Adopt a version control system for code, model, and data.
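The data validation checks mentioned in the preceding list can be as simple as the following sketch. It splits rows into valid rows and rows with problems, based on required fields and allowed numeric ranges. The field names, the ranges, and the `validate_rows` helper are hypothetical.

```python
def validate_rows(rows, required_fields, ranges):
    """Split rows into (valid_rows, errors). errors holds (row_index, problems) pairs."""
    valid, errors = [], []
    for index, row in enumerate(rows):
        problems = [f"missing {field}" for field in required_fields
                    if row.get(field) is None]
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is not None and not (low <= value <= high):
                problems.append(f"{field}={value} outside [{low}, {high}]")
        if problems:
            errors.append((index, problems))
        else:
            valid.append(row)
    return valid, errors

# Hypothetical training rows.
rows = [
    {"age": 34, "label": 1},
    {"age": -5, "label": 0},
    {"age": 51, "label": None},
]
valid, errors = validate_rows(rows, ["age", "label"], {"age": (0, 120)})
print(valid, errors)
```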

Automate the model-development lifecycle

From data preparation and training to deployment and monitoring, automation helps you to improve the quality and efficiency of your operations. Automation enables seamless, repeatable, and error-free model development and deployment. Automation minimizes manual intervention, speeds up release cycles, and ensures consistency across environments.

Consider the following recommendations:

  • Use a managed pipeline orchestration system to orchestrate and automate the ML workflow. The pipeline must handle the major steps of your development lifecycle: preparation, training, deployment, and evaluation.
  • Implement CI/CD pipelines for the model-development lifecycle. These pipelines should automate the building, testing, and deployment of models. The pipelines should also include continuous training to retrain models on new data as needed.
  • Implement phased release approaches such as canary deployments or A/B testing, for safe and controlled model releases.
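As a sketch of a canary release, the following example routes a fixed percentage of requests to a new model version. It hashes a request identifier so that the same identifier always reaches the same version. The function name and the version labels are assumptions. Managed serving platforms typically provide their own traffic-splitting settings, which you would use instead of custom routing code.

```python
import hashlib

def route_model_version(request_id, canary_percent):
    """Deterministically route about canary_percent of requests to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # Bucket in the range 0 to 99.
    return "canary" if bucket < canary_percent else "stable"

request_ids = [f"request-{i}" for i in range(10000)]
canary_count = sum(route_model_version(r, 10) == "canary" for r in request_ids)
print(canary_count / len(request_ids))  # Close to 0.10.
```

To widen the rollout, increase `canary_percent` in steps while you watch the canary's error rates and model quality metrics.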

Implement observability

When you implement observability, you can gain deep insights into model performance, data drift, and system health. Implement continuous monitoring, alerting, and logging mechanisms to proactively identify issues, trigger timely responses, and ensure operational continuity.

Consider the following recommendations:

  • Implement permanent and automated performance monitoring for your models. Use metrics and success criteria for ongoing evaluation of the model after deployment.
  • Monitor your deployment endpoints and infrastructure to ensure service availability.
  • Set up custom alerting based on business-specific thresholds and anomalies to ensure that issues are identified and resolved in a timely manner.
  • Use explainable AI techniques to understand and interpret model outputs.
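To illustrate how monitoring can surface data drift, the following sketch measures how far the mean of a feature in serving data has moved from its mean in training data, expressed in training standard deviations. This is a simple heuristic chosen for illustration and isn't the method used by any specific Google Cloud monitoring service. The function name, the sample values, and the alerting threshold of 2.0 are assumptions.

```python
from statistics import mean, stdev

def mean_shift_score(training_values, serving_values):
    """Return the shift of the serving mean, in training standard deviations."""
    baseline_std = stdev(training_values)
    if baseline_std == 0:
        raise ValueError("training values have zero variance")
    return abs(mean(serving_values) - mean(training_values)) / baseline_std

# Hypothetical values for one numeric feature.
training = [10, 12, 11, 13, 9]
serving = [15, 16, 14, 17, 15]

score = mean_shift_score(training, serving)
drift_detected = score > 2.0  # Assumed alerting threshold.
print(round(score, 2), drift_detected)
```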

Build a culture of operational excellence

Operational excellence is built on a foundation of people, culture, and professional practices. The success of your team and business depends on how effectively your organization implements methodologies that enable the reliable and rapid development of AI capabilities.

Consider the following recommendations:

  • Champion automation and standardization as core development methodologies. Streamline your workflows and manage the ML lifecycle efficiently by using MLOps techniques. Automate tasks to free up time for innovation, and standardize processes to support consistency and easier troubleshooting.
  • Prioritize continuous learning and improvement. Promote learning opportunities that team members can use to enhance their skills and stay current with AI and ML advancements. Encourage experimentation and conduct regular retrospectives to identify areas for improvement.
  • Cultivate a culture of accountability and ownership. Define clear roles so that everyone understands their contributions. Empower teams to make decisions within boundaries and track progress by using transparent metrics.
  • Embed AI ethics and safety into the culture. Prioritize responsible systems by integrating ethics considerations into every stage of the ML lifecycle. Establish clear ethics principles and foster open discussions about ethics-related challenges.

Design for scalability

Architect your AI solutions to handle growing data volumes and user demands. Use scalable infrastructure so that your models can adapt and perform optimally as your project expands.

Consider the following recommendations:

  • Plan for capacity and quotas. Anticipate future growth, and plan your infrastructure capacity and resource quotas accordingly.
  • Prepare for peak events. Ensure that your system can handle sudden spikes in traffic or workload during peak events.
  • Scale AI applications for production. Design for horizontal scaling to accommodate increases in the workload. Use frameworks like Ray on Vertex AI to parallelize tasks across multiple machines.
  • Use managed services where appropriate. Use services that help you to scale while minimizing the operational overhead and complexity of manual interventions.

AI and ML perspective: Security

This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to ensure that your AI and ML deployments meet the security and compliance requirements of your organization. The recommendations in this document align with the security pillar of the Architecture Framework.

Secure deployment of AI and ML workloads is a critical requirement, particularly in enterprise environments. To meet this requirement, you need to adopt a holistic security approach that starts from the initial conceptualization of your AI and ML solutions and extends to development, deployment, and ongoing operations. Google Cloud offers robust tools and services that are designed to help secure your AI and ML workloads.

Define clear goals and requirements

It's easier to integrate the required security and compliance controls early in your design and development process than to add the controls after development. From the start of your design and development process, make decisions that are appropriate for your specific risk environment and your specific business priorities.

Consider the following recommendations:

  • Identify potential attack vectors and adopt a security and compliance perspective from the start. As you design and evolve your AI systems, keep track of the attack surface, potential risks, and obligations that you might face.
  • Align your AI and ML security efforts with your business goals and ensure that security is an integral part of your overall strategy. Understand the effects of your security choices on your main business goals.

Keep data secure and prevent loss or mishandling

Data is a valuable and sensitive asset that must be kept secure. Data security helps you to maintain user trust, support your business objectives, and meet your compliance requirements.

Consider the following recommendations:

  • Don't collect, keep, or use data that's not strictly necessary for your business goals. If possible, use synthetic or fully anonymized data.
  • Monitor data collection, storage, and transformation. Maintain logs for all data access and manipulation activities. The logs help you to audit data access, detect unauthorized access attempts, and prevent unwanted access.
  • Implement different levels of access (for example, no-access, read-only, or write) based on user roles. Ensure that permissions are assigned based on the principle of least privilege. Users must have only the minimum permissions that are necessary to let them perform their role activities.
  • Implement measures like encryption, secure perimeters, and restrictions on data movement. These measures help you to prevent data exfiltration and data loss.
  • Guard against data poisoning for your ML training systems.
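
The role-based access levels and the least-privilege rule described above can be sketched as follows. This is a minimal illustration, not a Google Cloud API: the `Access` levels, the `ROLE_ACCESS` mapping, and the role names are all hypothetical. In practice, you would express these grants in your identity and access management system.

```python
from enum import IntEnum


class Access(IntEnum):
    """Ordered access levels: a higher value includes the lower ones."""
    NONE = 0
    READ = 1
    WRITE = 2


# Hypothetical role-to-access mapping for a single dataset.
ROLE_ACCESS = {
    "analyst": Access.READ,
    "pipeline": Access.WRITE,
}


def is_allowed(role: str, needed: Access) -> bool:
    """Return True only if the role was explicitly granted enough access."""
    # Roles that aren't listed default to no access (least privilege).
    granted = ROLE_ACCESS.get(role, Access.NONE)
    return granted >= needed
```

The default of `Access.NONE` for unknown roles is the important design choice: access must be granted explicitly, never assumed.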

Keep AI pipelines secure and robust against tampering

Your AI and ML code and the code-defined pipelines are critical assets. Code that isn't secured can be tampered with, which can lead to data leaks, compliance failure, and disruption of critical business activities. Keeping your AI and ML code secure helps to ensure the integrity and value of your models and model outputs.

Consider the following recommendations:

  • Use secure coding practices, such as dependency management or input validation and sanitization, during model development to prevent vulnerabilities.
  • Protect your pipeline code and your model artifacts, like files, model weights, and deployment specifications, from unauthorized access. Implement different access levels for each artifact based on user roles and needs.
  • Enforce lineage and tracking of your assets and pipeline runs. This enforcement helps you to meet compliance requirements and to avoid compromising production systems.

Deploy on secure systems with secure tools and artifacts

Ensure that your code and models run in a secure environment that has a robust access control system with security assurances for the tools and artifacts that are deployed in the environment.

Consider the following recommendations:

  • Train and deploy your models in a secure environment that has appropriate access controls and protection against unauthorized use or manipulation.
  • Follow standard Supply-chain Levels for Software Artifacts (SLSA) guidelines for your AI-specific artifacts, like models and software packages.
  • Prefer using validated prebuilt container images that are specifically designed for AI workloads.

Protect and monitor inputs

AI systems need inputs to make predictions, generate content, or automate actions. Some inputs might pose risks or be used as attack vectors that must be detected and sanitized. Detecting potential malicious inputs early helps you to keep your AI systems secure and operating as intended.

Consider the following recommendations:

  • Implement secure practices to develop and manage prompts for generative AI systems, and ensure that the prompts are screened for harmful intent.
  • Monitor inputs to predictive or generative systems to prevent issues like overloaded endpoints or prompts that the systems aren't designed to handle.
  • Ensure that only the intended users of a deployed system can use it.
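
As a rough sketch of input screening, the following function rejects prompts that are empty, too long, or that match a blocklist. The length limit, the single pattern, and the function name are assumptions made for illustration. A production system would typically combine checks like these with a dedicated safety classifier rather than rely on pattern matching alone.

```python
import re

MAX_PROMPT_CHARS = 4000  # assumed limit; tune for your endpoint

# Hypothetical blocklist; real screening needs far broader coverage.
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]


def screen_prompt(prompt: str) -> tuple[bool, str]:
    """Return (accepted, reason) for a candidate prompt."""
    if not prompt.strip():
        return False, "empty"
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, "too_long"
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(prompt):
            return False, "blocked_pattern"
    return True, "ok"
```

Returning a reason code alongside the decision makes it easier to log and monitor why inputs are rejected.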

Monitor, evaluate, and prepare to respond to outputs

AI systems deliver value because they produce outputs that augment, optimize, or automate human decision-making. To maintain the integrity and trustworthiness of your AI systems and applications, you need to make sure that the outputs are secure and within expected parameters. You also need a plan to respond to incidents.

Consider the following recommendations:

  • Monitor the outputs of your AI and ML models in production, and identify any performance, security, and compliance issues.
  • Evaluate model performance by implementing robust metrics and security measures, like identifying out-of-scope generative responses or extreme outputs in predictive models. Collect user feedback on model performance.
  • Implement robust alerting and incident response procedures to address any potential issues.
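
One simple way to identify extreme outputs in a predictive model is a range check on the predictions. The sketch below assumes that you know the valid range for your model's outputs. The function name and the bounds are illustrative.

```python
def find_extreme_outputs(predictions, lower, upper):
    """Return (index, value) pairs for predictions outside [lower, upper].

    NaN values are also flagged, because every comparison with NaN is False.
    """
    return [
        (i, p) for i, p in enumerate(predictions)
        if not lower <= p <= upper
    ]
```

For example, `find_extreme_outputs([0.2, 1.7, -0.1, 0.9], 0.0, 1.0)` flags the second and third predictions, which you could then route to your alerting system.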

AI and ML perspective: Reliability

This document in the Architecture Framework: AI and ML perspective provides an overview of the principles and recommendations to design and operate reliable AI and ML systems on Google Cloud. It explores how to integrate advanced reliability practices and observability into your architectural blueprints. The recommendations in this document align with the reliability pillar of the Architecture Framework.

In the fast-evolving AI and ML landscape, reliable systems are essential for ensuring customer satisfaction and achieving business goals. You need AI and ML systems that are robust, reliable, and adaptable to meet the unique demands of both predictive ML and generative AI. To handle the complexities of MLOps—from development to deployment and continuous improvement—you need to use a reliability-first approach. Google Cloud offers a purpose-built AI infrastructure that's aligned with Site Reliability Engineering (SRE) principles and provides a powerful foundation for reliable AI and ML systems.

Ensure that infrastructure is scalable and highly available

By architecting for scalability and availability, you enable your applications to handle varying levels of demand without service disruptions or performance degradation. This means that your AI services are still available to users during infrastructure outages and when traffic is very high.

Consider the following recommendations:

  • Design your AI systems with automatic and dynamic scaling capabilities to handle fluctuations in demand. This helps to ensure optimal performance, even during traffic spikes.
  • Manage resources proactively and anticipate future needs through load testing and performance monitoring. Use historical data and predictive analytics to make informed decisions about resource allocation.
  • Design for high availability and fault tolerance by adopting the multi-zone and multi-region deployment archetypes in Google Cloud and by implementing redundancy and replication.
  • Distribute incoming traffic across multiple instances of your AI and ML services and endpoints. Load balancing helps to prevent any single instance from being overloaded and helps to ensure consistent performance and availability.

Use a modular and loosely coupled architecture

To make your AI systems resilient to failures in individual components, use a modular architecture. For example, design the data processing and data validation components as separate modules. When a particular component fails, the modular architecture helps to minimize downtime and lets your teams develop and deploy fixes faster.

Consider the following recommendations:

  • Separate your AI and ML system into small self-contained modules or components. This approach promotes code reusability, simplifies testing and maintenance, and lets you develop and deploy individual components independently.
  • Design the loosely coupled modules with well-defined interfaces. This approach minimizes dependencies, and it lets you make independent updates and changes without impacting the entire system.
  • Plan for graceful degradation. When a component fails, the other parts of the system must continue to provide an adequate level of functionality.
  • Use APIs to create clear boundaries between modules and to hide the module-level implementation details. This approach lets you update or replace individual components without affecting interactions with other parts of the system.
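
Graceful degradation can be sketched as a wrapper that falls back to a simpler result when a component fails. The function and parameter names here are hypothetical: `model_call` stands for any call to a model-serving component, and `fallback_items` stands for a precomputed default, such as a list of popular items.

```python
def recommend_with_fallback(user_id, model_call, fallback_items):
    """Call the model; on failure, return a degraded but usable response."""
    try:
        return {"items": model_call(user_id), "degraded": False}
    except Exception:
        # The model component failed. Serve the precomputed fallback so that
        # the rest of the application keeps working.
        return {"items": list(fallback_items), "degraded": True}
```

The `degraded` flag lets callers and monitoring distinguish a fallback response from a normal one, so a failure isn't silently hidden.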

Build an automated MLOps platform

With an automated MLOps platform, the stages and outputs of your model lifecycle are more reliable. By promoting consistency, loose coupling, and modularity, and by expressing operations and infrastructure as code, you remove fragile manual steps and maintain AI and ML systems that are more robust and reliable.

Consider the following recommendations:

  • Automate the model development lifecycle, from data preparation and validation to model training, evaluation, deployment, and monitoring.
  • Manage your infrastructure as code (IaC). This approach enables efficient version control, quick rollbacks when necessary, and repeatable deployments.
  • Validate that your models behave as expected with relevant data. Automate performance monitoring of your models, and build appropriate alerts for unexpected outputs.
  • Validate the inputs and outputs of your AI and ML pipelines. For example, validate data, configurations, command arguments, files, and predictions. Configure alerts for unexpected or unallowed values.
  • Adopt a managed version-control strategy for your model endpoints. This kind of strategy enables incremental releases and quick recovery in the event of problems.
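
As an illustration of validating pipeline inputs, the following sketch checks a record against a small schema of expected types and allowed ranges. The schema format, the field names, and the function name are assumptions. Dedicated data validation libraries provide much richer checks.

```python
# Hypothetical schema: field name -> (expected type, allowed range or None).
SCHEMA = {
    "age": (int, (0, 120)),
    "country": (str, None),
}


def validate_record(record, schema):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, (expected_type, allowed_range) in schema.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
            continue
        if allowed_range is not None:
            low, high = allowed_range
            if not low <= value <= high:
                problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems
```

A non-empty result can drive the alerts for unexpected or unallowed values, and the same pattern applies to configurations and predictions.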

Maintain trust and control through data and model governance

The reliability of AI and ML systems depends on how well you can trust and govern your data and models. AI outputs can fail to meet expectations in silent ways. For example, the outputs might be formally consistent but they might be incorrect or unwanted. By implementing traceability and strong governance, you can ensure that the outputs are reliable and trustworthy.

Consider the following recommendations:

  • Use a data and model catalog to track and manage your assets effectively. To facilitate tracing and audits, maintain a comprehensive record of data and model versions throughout the lifecycle.
  • Implement strict access controls and audit trails to protect sensitive data and models.
  • Address the critical issue of bias in AI, particularly in generative AI applications. To build trust, strive for transparency and explainability in model outputs.
  • Automate the generation of feature statistics and implement anomaly detection to proactively identify data issues. To ensure model reliability, establish mechanisms to detect and mitigate the impact of changes in data distributions.
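
One common way to detect a change in the distribution of a feature is the population stability index (PSI), which compares binned frequencies of a baseline sample and a recent sample. The source doesn't prescribe PSI. It's used here only as an example technique, and the bin count and the small floor for empty bins are assumptions.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compute PSI between a baseline sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp values that fall outside the baseline range.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A PSI of 0 means that the binned distributions are identical. A frequently cited rule of thumb treats values above roughly 0.25 as a significant shift, but you should calibrate any threshold on your own data.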

Implement holistic AI and ML observability and reliability practices

To continuously improve your AI operations, you need to define meaningful reliability goals and measure progress. Observability is a foundational element of reliable systems. Observability lets you manage ongoing operations and critical events. Well-implemented observability helps you to build and maintain a reliable service for your users.

Consider the following recommendations:

  • Track infrastructure metrics for processors (CPUs, GPUs, and TPUs) and for other resources like memory usage, network latency, and disk usage. Perform load testing and performance monitoring. Use the test results and metrics from monitoring to manage scaling and capacity for your AI and ML systems.
  • Establish reliability goals and track application metrics. Measure metrics like throughput and latency for the AI applications that you build. Monitor the usage patterns of your applications and the exposed endpoints.
  • Establish model-specific metrics like accuracy or safety indicators in order to evaluate model reliability. Track these metrics over time to identify any drift or degradation. For efficient version control and automation, define the monitoring configurations as code.
  • Define and track business-level metrics to understand the impact of your models and reliability on business outcomes. To measure the reliability of your AI and ML services, consider adopting the SRE approach and define service level objectives (SLOs).
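
To make the SLO idea concrete, the following sketch computes how much of an error budget remains for an availability-style SLO. The function name and the example numbers are illustrative.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget that is still unspent.

    slo_target is a ratio such as 0.99 (99% of requests must succeed).
    A negative result means that the budget is exhausted and the SLO is missed.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests else 1.0
    return 1.0 - failed_requests / allowed_failures
```

For example, with a 99% target and 10,000 requests, the budget allows about 100 failures. If 25 requests failed, about three quarters of the budget remains.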

AI and ML perspective: Cost optimization

This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to optimize the cost of your AI systems throughout the ML lifecycle. By adopting a proactive and informed cost management approach, your organization can realize the full potential of AI and ML systems and also maintain financial discipline. The recommendations in this document align with the cost optimization pillar of the Architecture Framework.

AI and ML systems can help you to unlock valuable insights and predictive capabilities from data. For example, you can reduce friction in internal processes, improve user experiences, and gain deeper customer insights. The cloud offers vast amounts of resources and quick time-to-value without large up-front investments for AI and ML workloads. To maximize business value and to align the spending with your business goals, you need to understand the cost drivers, proactively optimize costs, set up spending controls, and adopt FinOps practices.

Define and measure costs and returns

To effectively manage your AI and ML costs in Google Cloud, you must define and measure the expenses for cloud resources and the business value of your AI and ML initiatives. Google Cloud provides comprehensive tools for billing and cost management to help you to track expenses granularly. Business value metrics that you can measure include customer satisfaction, revenue, and operational costs. By establishing concrete metrics for both costs and business value, you can make informed decisions about resource allocation and optimization.

Consider the following recommendations:

  • Establish clear business objectives and key performance indicators (KPIs) for your AI and ML projects.
  • Use the billing information provided by Google Cloud to implement cost monitoring and reporting processes that can help you to attribute costs to specific AI and ML activities.
  • Establish dashboards, alerting, and reporting systems to track costs and returns against KPIs.
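
Attributing costs to specific AI and ML activities usually comes down to grouping billing line items by a label. The sketch below assumes a simplified, hypothetical row format with a `cost` value and a `labels` dictionary. It doesn't reproduce the schema of any actual billing export.

```python
from collections import defaultdict


def cost_by_label(rows, label_key):
    """Sum the cost of billing rows, grouped by the value of one label."""
    totals = defaultdict(float)
    for row in rows:
        # Rows without the label are grouped so that gaps in labeling are visible.
        label = row.get("labels", {}).get(label_key, "unlabeled")
        totals[label] += row["cost"]
    return dict(totals)
```

Keeping an explicit `unlabeled` bucket helps you to measure how much spending can't yet be attributed to an activity.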

Optimize resource allocation

To achieve cost efficiency for your AI and ML workloads in Google Cloud, you must optimize resource allocation. By carefully aligning resource allocation with the needs of your workloads, you can avoid unnecessary expenses and ensure that your AI and ML systems have the resources that they need to perform optimally.

Consider the following recommendations:

  • Use autoscaling to dynamically adjust resources for training and inference.
  • Start with small models and data. Save costs by testing hypotheses at a smaller scale when possible.
  • Discover your compute needs through experimentation. Rightsize the resources that are used for training and serving based on your ML requirements.
  • Adopt MLOps practices to reduce duplication, manual processes, and inefficient resource allocation.
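
The core of a target-utilization autoscaling rule can be sketched in a few lines: scale the replica count in proportion to the ratio of observed utilization to target utilization, and then clamp the result. This is a generic illustration with hypothetical names and default bounds, not the algorithm of any specific Google Cloud service.

```python
def desired_replicas(current, utilization_pct, target_pct,
                     min_replicas=1, max_replicas=10):
    """Return the replica count that would bring utilization near the target.

    Integer percentages keep the ceiling division exact.
    """
    needed = (current * utilization_pct + target_pct - 1) // target_pct
    return max(min_replicas, min(max_replicas, needed))
```

For example, four replicas at 90% utilization with a 60% target yield six replicas. The minimum and maximum bounds keep costs and capacity within limits that you choose.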

Enforce data management and governance practices

Effective data management and governance practices play a critical role in cost optimization. Well-organized data helps your organization to avoid needless duplication, reduces the effort required to obtain high quality data, and encourages teams to reuse datasets. By proactively managing data, you can reduce storage costs, enhance data quality, and ensure that your ML models are trained and operate on the most relevant and valuable data.

Consider the following recommendations:

  • Establish and adopt a well-defined data governance framework.
  • Apply labels and relevant metadata to datasets at the point of data ingestion.
  • Ensure that datasets are discoverable and accessible across the organization.
  • Make your datasets and features reusable throughout the ML lifecycle wherever possible.

Automate and streamline with MLOps

A primary benefit of adopting MLOps practices is a reduction in costs, both from a technology perspective and in terms of personnel activities. Automation helps you to avoid duplication of ML activities and improve the productivity of data scientists and ML engineers.

Consider the following recommendations:

  • Increase the level of automation and standardization in your data collection and processing technologies to reduce development effort and time.
  • Develop automated training pipelines to reduce the need for manual interventions and increase engineer productivity. Implement mechanisms for the pipelines to reuse existing assets like prepared datasets and trained models.
  • Use the model evaluation and tuning services in Google Cloud to increase model performance with fewer iterations. This enables your AI and ML teams to achieve more objectives in less time.

Use managed services and pre-trained or existing models

There are many approaches to achieving business goals by using AI and ML. Adopt an incremental approach to model selection and model development. This helps you to avoid excessive costs that are associated with starting fresh every time. To control costs, start with a simple approach: use ML frameworks, managed services, and pre-trained models.

Consider the following recommendations:

  • Enable exploratory and quick ML experiments by using notebook environments.
  • Use existing and pre-trained models as a starting point to accelerate your model selection and development process.
  • Use managed services to train or serve your models. Both AutoML and managed custom model training services can help to reduce the cost of model training. Managed services can also help to reduce the cost of your model-serving infrastructure.

Foster a culture of cost awareness and continuous optimization

Cultivate a collaborative environment that encourages communication and regular reviews. This approach helps teams to identify and implement cost-saving opportunities throughout the ML lifecycle.

Consider the following recommendations:

  • Adopt FinOps principles across your ML lifecycle.
  • Ensure that all costs and business benefits of AI and ML projects have assigned owners with clear accountability.

AI and ML perspective: Performance optimization

This document in the Architecture Framework: AI and ML perspective provides an overview of principles and recommendations to help you to optimize the performance of your AI and ML workloads on Google Cloud. The recommendations in this document align with the performance optimization pillar of the Architecture Framework.

AI and ML systems enable new automation and decision-making capabilities for your organization. The performance of these systems can directly affect your business drivers like revenue, costs, and customer satisfaction. To realize the full potential of your AI and ML systems, you need to optimize their performance based on your business goals and technical requirements. The performance optimization process often involves certain trade-offs. For example, a design choice that provides the required performance might lead to higher costs. The recommendations in this document prioritize performance over other considerations like costs.

To optimize AI and ML performance, you need to make decisions regarding factors like the model architecture, parameters, and training strategy. When you make these decisions, consider the entire lifecycle of the AI and ML systems and their deployment environment. For example, LLMs that are very large can be highly performant on massive training infrastructure, but very large models might not perform well in capacity-constrained environments like mobile devices.

Translate business goals to performance objectives

To make architectural decisions that optimize performance, start with a clear set of business goals. Design AI and ML systems that provide the technical performance that's required to support your business goals and priorities. Your technical teams must understand the mapping between performance objectives and business goals.

Consider the following recommendations:

  • Translate business objectives into technical requirements: Translate the business objectives of your AI and ML systems into specific technical performance requirements and assess the effects of not meeting the requirements. For example, for an application that predicts customer churn, the ML model should perform well on standard metrics, like accuracy and recall, and the application should meet operational requirements like low latency.
  • Monitor performance at all stages of the model lifecycle: During experimentation and training, and after model deployment, monitor your key performance indicators (KPIs) and observe any deviations from business objectives.
  • Automate evaluation to make it reproducible and standardized: With a standardized and comparable platform and methodology for experiment evaluation, your engineers can increase the pace of performance improvement.
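
For the churn example above, accuracy and recall can be computed directly from labels and predictions. The following is a minimal sketch that assumes binary labels, where 1 means that the customer churned. The function name is illustrative.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = correct / len(pairs)
    # Recall: the share of actual churners that the model identified.
    positives = true_pos + false_neg
    recall = true_pos / positives if positives else 0.0
    return accuracy, recall
```

Recall matters in this scenario because a missed churner (a false negative) is a customer that the business had no chance to retain.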

Run and track frequent experiments

To transform innovation and creativity into performance improvements, you need a culture and a platform that supports experimentation. Performance improvement is an ongoing process because AI and ML technologies are developing continuously and quickly. To maintain a fast-paced, iterative process, you need to separate the experimentation space from your training and serving platforms. A standardized and robust experimentation process is important.

Consider the following recommendations:

  • Build an experimentation environment: Performance improvements require a dedicated, powerful, and interactive environment that supports the experimentation and collaborative development of ML pipelines.
  • Embed experimentation as a culture: Run experiments before any production deployment. Release new versions iteratively and always collect performance data. Experiment with different data types, feature transformations, algorithms, and hyperparameters.

Build and automate training and serving services

Training and serving AI models are core components of your AI services. You need robust platforms and practices that support fast and reliable creation, deployment, and serving of AI models. Invest time and effort to create foundational platforms for your core AI training and serving tasks. These foundational platforms help to reduce time and effort for your teams and improve the quality of outputs in the medium and long term.

Consider the following recommendations:

  • Use AI-specialized components of a training service: Such components include high-performance compute and MLOps components like feature stores, model registries, metadata stores, and model performance-evaluation services.
  • Use AI-specialized components of a prediction service: Such components provide high-performance and scalable resources, support feature monitoring, and enable model performance monitoring. To prevent and manage performance degradation, implement reliable deployment and rollback strategies.

Match design choices to performance requirements

When you make design choices to improve performance, carefully assess whether the choices support your business requirements or are wasteful and counterproductive. To choose the appropriate infrastructure, models, or configurations, identify performance bottlenecks and assess how they're linked to your performance measures. For example, even on very powerful GPU accelerators, your training tasks can experience performance bottlenecks due to data I/O issues from the storage layer or due to performance limitations of the model itself.

Consider the following recommendations:

  • Optimize hardware consumption based on performance goals: To train and serve ML models that meet your performance requirements, you need to optimize infrastructure at the compute, storage, and network layers. You must measure and understand the variables that affect your performance goals. These variables are different for training and inference.
  • Focus on workload-specific requirements: Focus your performance optimization efforts on the unique requirements of your AI and ML workloads. Rely on managed services for the performance of the underlying infrastructure.
  • Choose appropriate training strategies: Many pre-trained and foundation models are available, and new models are released frequently. Choose a training strategy that can deliver optimal performance for your task. Decide whether you should build your own model, tune a pre-trained model on your data, or use a pre-trained model API.
  • Recognize that performance-optimization strategies can have diminishing returns: When a particular performance-optimization strategy doesn't provide incremental business value that's measurable, stop pursuing that strategy.

Link performance metrics to design and configuration choices

To innovate, troubleshoot, and investigate performance issues, establish a clear link between design choices and performance outcomes. In addition to experimentation, you must reliably record the lineage of your assets, deployments, model outputs, and the configurations and inputs that produced the outputs.

Consider the following recommendations:

  • Build a data and model lineage system: All of your deployed assets and their performance metrics must be linked back to the data, configurations, code, and the choices that resulted in the deployed systems. In addition, model outputs must be linked to specific model versions and how the outputs were produced.
  • Use explainability tools to improve model performance: Adopt and standardize tools and benchmarks for model exploration and explainability. These tools help your ML engineers understand model behavior and improve performance or remove biases.
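
A lineage record can be as simple as a structured entry that ties a model version to the data, configuration, and code that produced it. The sketch below is illustrative only: the field names and the hashing approach are assumptions, and a managed metadata store would normally hold these records.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Return a stable SHA-256 hash of a JSON-serializable object."""
    # sort_keys makes the hash independent of dictionary key order.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def lineage_record(model_version, dataset_uri, config, code_commit):
    """Link a model version to the inputs that produced it."""
    return {
        "model_version": model_version,
        "dataset_uri": dataset_uri,
        "config_hash": fingerprint(config),
        "code_commit": code_commit,
    }
```

Hashing the configuration rather than storing only a name makes it possible to detect when two runs that look identical actually used different settings.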
