Google Cloud Architecture Framework

Last reviewed 2023-11-09 UTC

The Google Cloud Architecture Framework provides recommendations and describes best practices to help architects, developers, administrators, and other cloud practitioners design and operate a cloud topology that's secure, efficient, resilient, high-performing, and cost-effective. The Google Cloud Architecture Framework is our version of a well-architected framework.

A cross-functional team of experts at Google validates the design recommendations and best practices that make up the Architecture Framework. The team curates the Architecture Framework to reflect the expanding capabilities of Google Cloud, industry best practices, community knowledge, and feedback from you. For a summary of the significant changes, see What's new.

The design guidance in the Architecture Framework applies to applications built for the cloud and for workloads migrated from on-premises to Google Cloud, hybrid cloud deployments, and multi-cloud environments.

The Google Cloud Architecture Framework is organized into six categories (also known as pillars), as shown in the following diagram:

Google Cloud Architecture Framework

System design
This category is the foundation of the Google Cloud Architecture Framework. Define the architecture, components, modules, interfaces, and data needed to satisfy cloud system requirements, and learn about Google Cloud products and features that support system design.
Operational excellence
Efficiently deploy, operate, monitor, and manage your cloud workloads.
Security, privacy, and compliance
Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.
Reliability
Design and operate resilient and highly available workloads in the cloud.
Cost optimization
Maximize the business value of your investment in Google Cloud.
Performance optimization
Design and tune your cloud resources for optimal performance.

To view summaries of Google Cloud products and how they relate to one another, see Google Cloud products, features, and services in four words or less.

Google Cloud Architecture Framework: System design

System design is the foundational category of the Google Cloud Architecture Framework. This category provides design recommendations and describes best practices and principles to help you define the architecture, components, modules, interfaces, and data on a cloud platform to satisfy your system requirements. You also learn about Google Cloud products and features that support system design.

The documents in the system design category assume that you understand basic system design principles. These documents don't assume that you are familiar with cloud concepts or Google Cloud products.

For complex cloud migration and deployment scenarios, we recommend that you use Google Cloud consulting services. Our consultants provide expertise on best practices and guiding principles to help you succeed in your cloud journey. Google Cloud also has a strong ecosystem of partners, from large global systems integrators to partners with a deep specialization in a particular area like machine learning. We recommend that you engage Google Cloud partners to accelerate your digital transformation and improve business outcomes.

In the system design category of the Architecture Framework, you learn core design principles and best practices for topics such as deployment archetypes, geographic regions, resource management, compute, networking, storage, and databases.

Core principles of system design

This document in the Google Cloud Architecture Framework describes the core principles of system design. A robust system design is secure, reliable, scalable, and independent. It lets you make iterative and reversible changes without disrupting the system, minimize potential risks, and improve operational efficiency. To achieve a robust system design, we recommend that you follow four core principles.

Document everything

When you start to move your workloads to the cloud or build your applications, a major blocker to success is lack of documentation of the system. Documentation is especially important for correctly visualizing the architecture of your current deployments.

A properly documented cloud architecture establishes a common language and standards, which enable cross-functional teams to communicate and collaborate effectively. It also provides the information that's necessary to identify and guide future design decisions. Documentation should be written with your use cases in mind, to provide context for the design decisions.

Over time, your design decisions will evolve and change. The change history provides the context that your teams require to align initiatives, avoid duplication, and measure performance changes effectively over time. Change logs are particularly valuable when you onboard a new cloud architect who is not yet familiar with your current system design, strategy, or history.

Simplify your design and use fully managed services

Simplicity is crucial for system design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.

If you're already running your workloads in production, test with managed services to see how they might help to reduce operational complexities. If you're developing new workloads, then start simple, establish a minimum viable product (MVP), and resist the urge to over-engineer. You can identify exceptional use cases, iterate, and improve your systems incrementally over time.

Decouple your architecture

Decoupling is a technique that's used to separate your applications and service components into smaller components that can operate independently. For example, you might break up a monolithic application stack into separate service components. In a decoupled architecture, an application can run its functions independently, regardless of the various dependencies.

A decoupled architecture gives you increased flexibility to do the following:

  • Apply independent upgrades.
  • Enforce specific security controls.
  • Establish reliability goals for each subsystem.
  • Monitor health.
  • Granularly control performance and cost parameters.

You can start decoupling early in your design phase or incorporate it as part of your system upgrades as you scale.
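
To illustrate the idea, the following minimal sketch in Python uses hypothetical service names and a placeholder URL: an order component depends only on a narrow inventory interface, so the inventory service behind it can be upgraded, scaled, secured, and monitored on its own.

```python
# Minimal decoupling sketch: the order component depends on an interface,
# not on a concrete inventory implementation. Service names and the URL
# are illustrative assumptions, not Google Cloud APIs.

from abc import ABC, abstractmethod

import requests  # assumes the third-party 'requests' package is installed


class InventoryService(ABC):
    """Narrow interface that the order component depends on."""

    @abstractmethod
    def reserve(self, sku: str, quantity: int) -> bool:
        ...


class HttpInventoryService(InventoryService):
    """Calls a separately deployed inventory service over HTTP."""

    def __init__(self, base_url: str = "http://inventory.internal"):
        self.base_url = base_url

    def reserve(self, sku: str, quantity: int) -> bool:
        response = requests.post(
            f"{self.base_url}/reserve", json={"sku": sku, "qty": quantity}
        )
        return response.status_code == 200


class OrderService:
    """Can be upgraded, scaled, and monitored independently of inventory."""

    def __init__(self, inventory: InventoryService):
        self.inventory = inventory

    def place_order(self, sku: str, quantity: int) -> str:
        return "accepted" if self.inventory.reserve(sku, quantity) else "rejected"
```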

Use a stateless architecture

A stateless architecture can increase both the reliability and scalability of your applications.

Stateful applications rely on various dependencies to perform tasks, such as locally cached data. Stateful applications often require additional mechanisms to capture progress and restart gracefully. Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.

The system design category describes recommendations to make your applications stateless or to use cloud-native features to capture machine state for your stateful applications.
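
As a minimal illustration, the following sketch contrasts a handler that keeps progress in process memory with one that keeps it in a shared store so any replica can resume the work. It assumes a Memorystore for Redis instance reachable through the redis-py client; the host, port, and key layout are placeholders.

```python
# Minimal sketch contrasting a stateful handler (progress kept in process memory,
# lost on a hard restart) with a stateless handler (progress kept in a shared
# store such as Memorystore for Redis). The host, port, and key layout are
# illustrative assumptions.

import redis  # assumes the third-party redis-py package

r = redis.Redis(host="10.0.0.3", port=6379)  # e.g., a Memorystore for Redis instance

# Stateful: progress lives only in this process, so a restart loses it.
_local_progress = {}

def handle_stateful(job_id: str, step: int) -> None:
    _local_progress[job_id] = step

# Stateless: any replica can resume the job, so the service restarts and scales freely.
def handle_stateless(job_id: str, step: int) -> None:
    r.set(f"job:{job_id}:step", step)

def resume(job_id: str) -> int:
    value = r.get(f"job:{job_id}:step")
    return int(value) if value is not None else 0
```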

Choose Google Cloud deployment archetypes

This document in the Google Cloud Architecture Framework describes six deployment archetypes (zonal, regional, multi-regional, global, hybrid, and multicloud) that you can use to build architectures for your cloud workloads based on your requirements for availability, cost, performance, and operational efficiency.

What is a deployment archetype?

A deployment archetype is an abstract, provider-independent model that you use as the foundation to build application-specific deployment architectures that meet your business and technical requirements. Each deployment archetype specifies a combination of failure domains where an application can run. These failure domains can be one or more Google Cloud zones or regions, and they can extend to include your on-premises data centers or failure domains in other cloud providers.

The following diagram shows six applications deployed in Google Cloud. Each application uses a deployment archetype that meets its specific requirements.

Applications in Google Cloud deployed using different deployment archetypes.

As the preceding diagram shows, in an architecture that uses the hybrid or multicloud deployment archetype, the cloud topology is based on one of the basic archetypes: zonal, regional, multi-regional, or global. In this sense, the hybrid and multicloud deployment archetypes can be considered as composite deployment archetypes that include one of the basic archetypes.

Choosing a deployment archetype helps to simplify subsequent decisions regarding the Google Cloud products and features that you should use. For example, for a highly available containerized application, if you choose the regional deployment archetype, then regional Google Kubernetes Engine (GKE) clusters are more appropriate than zonal GKE clusters.

When you choose a deployment archetype for an application, you need to consider tradeoffs between factors like availability, cost, and operational complexity. For example, if an application serves users in multiple countries and needs high availability, you might choose the multi-regional deployment archetype. But for an internal application that's used by employees in a single geographical region, you might prioritize cost over availability and, therefore, choose the regional deployment archetype.

Overview of the deployment archetypes

The following sections provide definitions for the deployment archetypes and a summary of the use cases and design considerations for each.

Zonal

Your application runs within a single Google Cloud zone, as shown in the following diagram:

Zonal deployment archetype
Use cases
  • Development and test environments.
  • Applications that don't need high availability.
  • Low-latency networking between application components.
  • Migrating commodity workloads.
  • Applications that use license-restricted software.
Design considerations
  • Downtime during zone outages.

    For business continuity, you can provision a passive replica of the application in another zone in the same region. If a zone outage occurs, you can restore the application to production by using the passive replica.

Regional

Your application runs independently in two or more zones within a single Google Cloud region, as shown in the following diagram:

Regional deployment archetype
Use cases
  • Highly available applications that serve users within a geographic area.
  • Compliance with data residency and sovereignty requirements.
Design considerations
  • Downtime during region outages.

    For business continuity, you can back up the application and data to another region. If a region outage occurs, you can use the backups in the other region to restore the application to production.

  • Cost and effort to provision and manage redundant resources.

Multi-regional

Your application runs independently in multiple zones across two or more Google Cloud regions. You can use DNS routing policies to route incoming traffic to the regional load balancers. The regional load balancers then distribute the traffic to the zonal replicas of the application, as shown in the following diagram:

Multi-regional deployment archetype
Use cases
  • Highly available applications that serve geographically dispersed users.
  • Applications that require a low-latency experience for end users.
  • Compliance with data residency and sovereignty requirements by using a geofenced DNS routing policy.
Design considerations
  • Cost for cross-region data transfer and data replication.
  • Operational complexity.

Global

Your application runs across Google Cloud regions worldwide, either as a globally distributed (location-unaware) stack or as regionally isolated stacks. A global anycast load balancer distributes traffic to the region that's nearest to the user. Other components of the application stack can also be global, such as the database, cache, and object store.

The following diagram shows the globally distributed variant of the global deployment archetype. A global anycast load balancer forwards requests to an application stack that's distributed across multiple regions and that uses a globally replicated database.

Global deployment archetype: Globally distributed stack

The following diagram shows a variant of the global deployment archetype with regionally isolated application stacks. A global anycast load balancer forwards requests to an application stack in one of the regions. All the application stacks use a single, globally replicated database.

Global deployment archetype: Regionally isolated stacks
Use cases
  • Highly available applications that serve globally dispersed users.
  • Opportunity to optimize cost and simplify operations by using global resources instead of multiple instances of regional resources.
Design considerations
  • Costs for cross-region data transfer and data replication.

Hybrid

Certain parts of your application are deployed in Google Cloud, while other parts run on-premises, as shown in the following diagram. The topology in Google Cloud can use the zonal, regional, multi-regional, or global deployment archetype.

Hybrid deployment archetype
Use cases
  • Disaster recovery (DR) site for on-premises workloads.
  • On-premises development for cloud applications.
  • Progressive migration to the cloud for legacy applications.
  • Enhancing on-premises applications with cloud capabilities.
Design considerations
  • Setup effort and operational complexity.
  • Cost of redundant resources.

Multicloud

Some parts of your application are deployed in Google Cloud, and other parts are deployed in other cloud platforms, as shown in the following diagram. The topology in each cloud platform can use the zonal, regional, multi-regional, or global deployment archetype.

Multicloud deployment archetype
Use cases
  • Google Cloud as the primary site and another cloud as a DR site.
  • Enhancing applications with advanced Google Cloud capabilities.
Design considerations
  • Setup effort and operational complexity.
  • Cost of redundant resources and cross-cloud network traffic.

Select geographic zones and regions

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on geographic requirements. You learn how to select optimal geographic zones and regions based on availability and proximity, to support compliance, optimize costs, and implement load balancing.

When you select a region or multiple regions for your business applications, consider criteria such as service availability, end-user latency, application latency, cost, and regulatory or sustainability requirements. To support your business priorities and policies, balance these requirements and identify the best tradeoffs. For example, the most compliant region might not be the most cost-efficient region, and it might not have the lowest carbon footprint.

Deploy over multiple regions

Regions are independent geographic areas that consist of multiple zones. A zone is a deployment area for Google Cloud resources within a region; each zone represents a single failure domain within a region.

To help protect against both expected downtime (such as maintenance) and unexpected downtime (such as incidents), we recommend that you deploy fault-tolerant, highly available applications across multiple zones in one or more regions. For more information, see Geography and regions, Application deployment considerations, and Best practices for Compute Engine regions selection.

Multi-zonal deployments can provide resiliency if multi-region deployments are limited due to cost or other considerations. This approach is especially helpful in mitigating the impact of zonal or regional outages and in addressing disaster recovery and business continuity concerns. For more information, see Design for scale and high availability.

Select regions based on geographic proximity

Latency impacts the user experience and affects costs associated with serving external users. To minimize latency when serving traffic to external users, select a region or set of regions that are geographically close to your users and where your services run in a compliant way. For more information, see Cloud locations and the Compliance resource center.

Select regions based on available services

Select a region based on the available services that your business requires. Most services are available across all regions. Some enterprise-specific services might initially be available only in a subset of regions. To verify which services are available in a region, see Cloud locations.

Choose regions to support compliance

Select a specific region or set of regions to meet regulatory or compliance requirements that mandate the use of certain geographies, for example the General Data Protection Regulation (GDPR) or data residency rules. To learn more about designing secure systems, see Compliance offerings and Data residency, operational transparency, and privacy for European customers on Google Cloud.

Compare pricing of major resources

Regions have different cost rates for the same services. To identify a cost-efficient region, compare pricing of the major resources that you plan to use. Cost considerations differ depending on backup requirements and resources like compute, networking, and data storage. To learn more, see the Cost optimization category.

Use Cloud Load Balancing to serve global users

To improve the user experience when you serve global users, use Cloud Load Balancing to help provide a single IP address that is routed to your application. To learn more about designing reliable systems, see Google Cloud Architecture Framework: Reliability.

Use the Cloud Region Picker to support sustainability

Google has been carbon neutral since 2007 and is committed to being carbon-free by 2030. To select a region by its carbon footprint, use the Google Cloud Region Picker. To learn more about designing for sustainability, see Cloud sustainability.

What's next

Learn how to manage your cloud resources using Resource Manager, the Google Cloud resource hierarchy, and the Organization Policy Service.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Manage cloud resources

This document in the Google Cloud Architecture Framework provides best practices to organize and manage your resources in Google Cloud.

Resource hierarchy

Google Cloud resources are arranged hierarchically in organizations, folders, and projects. This hierarchy lets you manage common aspects of your resources like access control, configuration settings, and policies. For best practices to design the hierarchy of your cloud resources, see Decide a resource hierarchy for your Google Cloud landing zone.

Resource labels and tags

This section provides best practices for using labels and tags to organize your Google Cloud resources.

Use a simple folder structure

Folders let you group any combination of projects and subfolders. Create a simple folder structure to organize your Google Cloud resources. You can add more levels as needed to define your resource hierarchy so that it supports your business needs. The folder structure is flexible and extensible. To learn more, see Creating and managing folders.

Use folders and projects to reflect data governance policies

Use folders, subfolders, and projects to separate resources from each other to reflect data governance policies within your organization. For example, you can use a combination of folders and projects to separate resources for finance, human resources, and engineering.

Use projects to group resources that share the same trust boundary. For example, resources for the same product or microservice can belong to the same project. For more information, see Decide a resource hierarchy for your Google Cloud landing zone.

Use tags and labels at the outset of your project

Use labels and tags when you start to use Google Cloud products, even if you don't need them immediately. Adding labels and tags later on can require manual effort that can be error prone and difficult to complete.

A tag provides a way to conditionally allow or deny policies based on whether a resource has a specific tag. A label is a key-value pair that helps you organize your Google Cloud instances. For more information on labels, see requirements for labels, a list of services that support labels, and label formats.

Resource Manager provides labels and tags to help you manage resources, allocate and report on cost, and assign policies to different resources for granular access controls. For example, you can use labels and tags to apply granular access and management principles to different tenant resources and services. For information about VM labels and network tags, see Relationship between VM labels and network tags.

You can use labels for multiple purposes, including the following:

  • Managing resource billing: Labels are available in the billing system, which lets you separate cost by labels. For example, you can label different cost centers or budgets.
  • Grouping resources by similar characteristics or by relation: You can use labels to separate different application lifecycle stages or environments. For example, you can label production, development, and testing environments.

Assign labels to support cost and billing reporting

To support granular cost and billing reporting based on attributes outside of your integrated reporting structures (like per-project or per-product type), assign labels to resources. Labels can help you allocate consumption to cost centers, departments, specific projects, or internal recharge mechanisms. For more information, see the Cost optimization category.
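
For example, the following minimal sketch, assuming the google-cloud-storage Python client with placeholder bucket and label values, attaches cost-center and environment labels to a bucket so that its usage can be broken out in billing reports:

```python
# A minimal sketch, assuming the google-cloud-storage Python client, that attaches
# cost-center and environment labels to a bucket so its usage can be broken out in
# billing reports. The bucket name and label values are placeholders.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-app-prod-assets")

labels = bucket.labels
labels.update({"cost-center": "cc-1234", "environment": "production"})
bucket.labels = labels
bucket.patch()  # persists the updated labels on the bucket
```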

Avoid creating large numbers of labels

Avoid creating large numbers of labels. We recommend that you create labels primarily at the project level, and that you avoid creating labels at the sub-team level. If you create overly granular labels, it can add noise to your analytics. To learn about common use cases for labels, see Common uses of labels.

Avoid adding sensitive information to labels

Labels aren't designed to handle sensitive information. Don't include sensitive information in labels, including information that might be personally identifiable, like an individual's name or title.

Anonymize information in project names

Follow a project naming pattern like COMPANY_INITIAL_IDENTIFIER-ENVIRONMENT-APP_NAME, where the placeholders are unique and don't reveal company or application names. Don't include attributes that can change in the future, for example, a team name or technology.

Apply tags to model business dimensions

You can apply tags to model additional business dimensions like organization structure, regions, workload types, or cost centers. To learn more about tags, see Tags overview, Tag inheritance, and Creating and managing tags. To learn how to use tags with policies, see Policies and tags. To learn how to use tags to manage access control, see Tags and access control.

Organizational policies

This section provides best practices for configuring governance rules on Google Cloud resources across the cloud resource hierarchy.

Establish project naming conventions

Establish a standardized project naming convention, for example, SYSTEM_NAME-ENVIRONMENT (dev, test, uat, stage, prod).

Project names have a 30-character limit.

Although you can apply a prefix like COMPANY_TAG-SUB_GROUP/SUBSIDIARY_TAG, project names can become out of date when companies go through reorganizations. Consider moving identifiable names from project names to project labels.
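
As a minimal sketch of the convention, the following Python snippet builds a SYSTEM_NAME-ENVIRONMENT name and enforces the 30-character limit. The allowed environments and the example system code are placeholder assumptions; use anonymized identifiers rather than revealing names.

```python
# Minimal sketch of the SYSTEM_NAME-ENVIRONMENT convention with the 30-character
# limit enforced. The allowed environments and the example system code are
# placeholders; use anonymized identifiers rather than revealing names.

ALLOWED_ENVIRONMENTS = {"dev", "test", "uat", "stage", "prod"}

def build_project_name(system_name: str, environment: str) -> str:
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    name = f"{system_name}-{environment}".lower()
    if len(name) > 30:
        raise ValueError(f"project name exceeds 30 characters: {name}")
    return name

print(build_project_name("sys01", "prod"))  # sys01-prod
```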

Automate project creation

To create projects for production and large-scale businesses, use an automated process like Deployment Manager or the Google Cloud project factory Terraform module. These tools do the following:

  • Automatically create development, test, and production environments or projects that have the appropriate permissions.
  • Configure logging and monitoring.

The Google Cloud project factory Terraform module helps you to automate the creation of Google Cloud projects. In large enterprises, we recommend that you review and approve projects before you create them in Google Cloud. This process helps to ensure the following:

  • Costs can be attributed. For more information, see the Cost optimization category.
  • Approvals are in place for data uploads.
  • Regulatory or compliance requirements are met.

When you automate the creation and management of Google Cloud projects and resources, you get the benefit of consistency, reproducibility, and testability. Treating your configuration as code lets you version and manage the lifecycle of your configuration together with your software artifacts. Automation lets you support best practices like consistent naming conventions and labeling of resources. As your requirements evolve, automation simplifies project refactoring.
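
As a minimal sketch, assuming the google-cloud-resource-manager Python client (resourcemanager_v3) and placeholder folder, project, and label values, project creation might be automated as follows inside an approval or CI workflow:

```python
# A minimal sketch, assuming the google-cloud-resource-manager Python client
# (resourcemanager_v3), that creates a project under a folder with labels applied
# at creation time. The folder ID, project ID, and labels are placeholders; in
# practice this logic would run inside your approval workflow or CI pipeline.

from google.cloud import resourcemanager_v3

def create_project(project_id: str, folder_id: str, environment: str):
    client = resourcemanager_v3.ProjectsClient()
    project = resourcemanager_v3.Project(
        project_id=project_id,
        display_name=project_id,
        parent=f"folders/{folder_id}",
        labels={"environment": environment, "cost-center": "cc-1234"},
    )
    operation = client.create_project(project=project)  # long-running operation
    return operation.result()  # blocks until the project is created

# Example (placeholder values):
# create_project("sys01-dev", "123456789012", "development")
```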

Audit your systems regularly

To ensure that requests for new projects can be audited and approved, integrate with your enterprise's ticketing system or a standalone system that provides auditing.

Configure projects consistently

Configure projects to consistently meet your organization's needs. Include the following when you set up projects:

  • Project ID and naming conventions
  • Billing account linking
  • Networking configuration
  • Enabled APIs and services
  • Compute Engine access configuration
  • Logs export and usage reports
  • Project removal lien

Decouple and isolate workloads or environments

Quotas and limits are enforced at the project level. To manage quotas and limits, decouple and isolate workloads or environments at the project level. For more information, see Working with quotas.

Decoupling environments is different from data classification requirements. Separating data from infrastructure can be expensive and complex to implement, so we recommend that you implement data classification based on data sensitivity and compliance requirements.

Enforce billing isolation

Enforce billing isolation to support different billing accounts and cost visibility per workload and environment. For more information, see Create, modify, or close your self-serve Cloud Billing account and Enable, disable, or change billing for a project.

To minimize administrative complexity, use granular access management controls for critical environments at the project level, or for workloads that span multiple projects. When you curate access control for critical production applications, you ensure that workloads are secured and managed effectively.

Use the Organization Policy Service to control resources

The Organization Policy Service gives policy administrators centralized and programmatic control over your organization's cloud resources so that they can configure constraints across the resource hierarchy. For more information, see Add an organization policy administrator.

Use the Organization Policy Service to comply with regulatory policies

To meet compliance requirements, use the Organization Policy Service to enforce compliance requirements for resource sharing and access. For example, you can limit sharing with external parties or determine where to deploy cloud resources geographically. Other compliance scenarios include the following:

  • Centralizing control to configure restrictions that define how your organization's resources can be used.
  • Defining and establishing policies to help your development teams remain within compliance boundaries.
  • Helping project owners and their teams make system changes while maintaining regulatory compliance and minimizing concerns about breaking compliance rules.

Limit resource sharing based on domain

A restricted sharing organization policy helps you to prevent Google Cloud resources from being shared with identities outside your organization. For more information, see Restricting identities by domain and Organization policy constraints.

Disable service account and key creation

To help improve security, limit the use of Identity and Access Management (IAM) service accounts and corresponding keys. For more information, see Restricting service account usage.

Restrict the physical location of new resources

Restrict the physical location of newly created resources by restricting resource locations. To see a list of constraints that give you control of your organization's resources, see Organization Policy Service constraints.

What's next

Learn how to choose and manage compute.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Choose and manage compute

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on compute requirements. You learn how to choose a compute platform and a migration approach, design and scale workloads, and manage operations and VM migrations.

Computation is at the core of many workloads, whether it refers to the execution of custom business logic or the application of complex computational algorithms against datasets. Most solutions use compute resources in some form, and it's critical that you select the right compute resources for your application needs.

Google Cloud provides several options for using time on a CPU. The options differ in CPU type, performance, how your code is scheduled to run, and how usage is billed.

Google Cloud compute options include the following:

  • Virtual machines (VMs) with cloud-specific benefits like live migration.
  • Bin-packing of containers on cluster machines that can share CPUs.
  • Functions and serverless approaches, where your use of CPU time can be metered to the work performed during a single HTTP request.

Choosing compute

This section provides best practices for choosing and migrating to a compute platform.

Choose a compute platform

When you choose a compute platform for your workload, consider the technical requirements of the workload, lifecycle automation processes, regionalization, and security.

Evaluate the nature of CPU usage by your app and the entire supporting system, including how your code is packaged and deployed, distributed, and invoked. While some scenarios might be compatible with multiple platform options, a portable workload should be capable and performant on a range of compute options.

The following list summarizes the recommended Google Cloud compute services for various use cases:

Serverless
  Use cases:
  • Deploy your first app.
  • Focus on data and processing logic and on app development, rather than maintaining infrastructure operations.
  Recommended products:
  • Cloud Run: Put your business logic in containers by using this fully managed serverless option. Cloud Run is designed for workloads that are compute intensive, but not always on. Scale cost effectively from 0 (no traffic) and define the CPU and RAM of your tasks and services. Deploy with a single command and Google automatically provisions the right amount of resources.
  • Cloud Functions: Separate your code into flexible pieces of business logic without the infrastructure concerns of load balancing, updates, authentication, or scaling.
Kubernetes
  Use case: Build complex microservice architectures that need additional services like Istio to manage service mesh control.
  Recommended product:
  • Google Kubernetes Engine: An open source container-orchestration engine that automates deploying, scaling, and managing containerized apps.
Virtual machines (VMs)
  Use case: Create and run VMs from predefined and customizable VM families that support your application and workload requirements, as well as third-party software and services.
  Recommended product:
  • Compute Engine: Add graphics processing units (GPUs) to your VM instances. You can use these GPUs to accelerate specific workloads on your instances like machine learning and data processing.

To select appropriate machine types based on your requirements, see Recommendations for machine families.

For more information, see Choosing compute options.

Choose a compute migration approach

If you're migrating your existing applications from another cloud or from on-premises, use one of the following Google Cloud products to help you optimize for performance, scale, cost, and security.

  • Lift and shift: Migrate or extend your VMware workloads to Google Cloud in minutes. Recommended product: Google Cloud VMware Engine.
  • Lift and shift: Move your VM-based applications to Compute Engine. Recommended product: Migrate to Virtual Machines.
  • Upgrade to containers: Modernize traditional applications into containers on Google Kubernetes Engine. Recommended product: Migrate to Containers.

To learn how to migrate your workloads while aligning internal teams, see VM Migration lifecycle and Building a Large Scale Migration Program with Google Cloud.

Designing workloads

This section provides best practices for designing workloads to support your system.

Evaluate serverless options for simple logic

Simple logic is a type of compute that doesn't require specialized hardware or machine types like CPU-optimized machines. Before you invest in Google Kubernetes Engine (GKE) or Compute Engine implementations to abstract operational overhead and optimize for cost and performance, evaluate serverless options for lightweight logic.

Decouple your applications to be stateless

Where possible, decouple your applications to be stateless to maximize use of serverless computing options. This approach lets you use managed compute offerings, scale applications based on demand, and optimize for cost and performance. For more information about decoupling your application to design for scale and high availability, see Design for scale and high availability.

Use caching logic when you decouple architectures

If your application is designed to be stateful, use caching logic to decouple and make your workload scalable. For more information, see Database best practices.

Use live migrations to facilitate upgrades

To facilitate Google maintenance upgrades, use live migration by setting instance availability policies. For more information, see Set VM host maintenance policy.

Scaling workloads

This section provides best practices for scaling workloads to support your system.

Use startup and shutdown scripts

For stateful applications, use startup and shutdown scripts where possible to start and stop your application state gracefully. A graceful startup is when a computer is turned on by a software function and the operating system is allowed to perform its tasks of safely starting processes and opening connections.

Graceful startups and shutdowns are important because stateful applications depend on immediate availability of the data that sits close to the compute, usually on local or persistent disks, or in RAM. To avoid reprocessing application data from the beginning at each startup, use a startup script to reload the last saved data and resume the process from where it previously stopped on shutdown. To save the application memory state and avoid losing progress on shutdown, use a shutdown script. For example, use a shutdown script when a VM is scheduled to be shut down due to downscaling or Google maintenance events.
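
The following minimal sketch shows the idea in Python as an in-process handler; on Compute Engine you would typically wire equivalent logic into startup and shutdown scripts set in instance metadata. The checkpoint path is a placeholder and would normally point at a persistent disk mount or a Cloud Storage upload.

```python
# Minimal sketch of the startup/shutdown idea as an in-process SIGTERM handler:
# on shutdown, persist progress so the next startup can resume from the last
# checkpoint. The checkpoint path is a placeholder; on Compute Engine you would
# typically configure equivalent logic as startup and shutdown scripts in
# instance metadata.

import json
import signal
import sys

CHECKPOINT = "/mnt/disks/state/checkpoint.json"
progress = {"last_processed_record": 0}

def save_and_exit(signum, frame):
    with open(CHECKPOINT, "w") as f:
        json.dump(progress, f)  # shutdown: capture state before the VM stops
    sys.exit(0)

def load_checkpoint():
    try:
        with open(CHECKPOINT) as f:
            return json.load(f)  # startup: resume from the last saved state
    except FileNotFoundError:
        return {"last_processed_record": 0}

signal.signal(signal.SIGTERM, save_and_exit)
progress = load_checkpoint()
```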

Use MIGs to support VM management

When you use Compute Engine VMs, managed instance groups (MIGs) support features like autohealing, load balancing, autoscaling, auto updating, and stateful workloads. You can create zonal or regional MIGs based on your availability goals. You can use MIGs for stateless serving or batch workloads and for stateful applications that need to preserve each VM's unique state.

Use pod autoscalers to scale your GKE workloads

Use horizontal and vertical Pod autoscalers to scale your workloads, and use node auto-provisioning to scale underlying compute resources.

Distribute application traffic

To scale your applications globally, use Cloud Load Balancing to distribute your application instances across more than one region or zone. Load balancers optimize packet routing from Google Cloud edge networks to the nearest zone, which increases serving traffic efficiency and minimizes serving costs. To optimize for end-user latency, use Cloud CDN to cache static content where possible.

Automate compute creation and management

Minimize human-induced errors in your production environment by automating compute creation and management.

Managing operations

This section provides best practices for managing operations to support your system.

Use Google-supplied public images

Use public images supplied by Google Cloud. The Google Cloud public images are regularly updated. For more information, see List of public images available on Compute Engine.

You can also create your own images with specific configurations and settings. Where possible, automate and centralize image creation in a separate project that you can share with authorized users within your organization. Creating and curating a custom image in a separate project lets you update, patch, and create a VM using your own configurations. You can then share the curated VM image with relevant projects.

Use snapshots for instance backups

Snapshots let you create backups for your instances. Snapshots are especially useful for stateful applications, which aren't flexible enough to maintain state or save progress when they experience abrupt shutdowns. If you frequently use snapshots to create new instances, you can optimize your backup process by creating a base image from that snapshot.

Use a machine image to enable VM instance creation

Although a snapshot only captures an image of the data inside a machine, a machine image captures machine configurations and settings, in addition to the data. Use a machine image to store all of the configurations, metadata, permissions, and data from one or more disks that are needed to create a VM instance.

When you create a machine from a snapshot, you must configure instance settings on the new VM instances, which requires a lot of work. Using machine images lets you copy those known settings to new machines, reducing overhead. For more information, see When to use a machine image.

Capacity, reservations, and isolation

This section provides best practices for managing capacity, reservations, and isolation to support your system.

Use committed-use discounts to reduce costs

You can reduce your operational expenditure (OPEX) cost for workloads that are always on by using committed use discounts. For more information, see the Cost optimization category.

Choose machine types to support cost and performance

Google Cloud offers machine types that let you choose compute based on cost and performance parameters. You can choose a low-performance offering to optimize for cost or choose a high-performance compute option at higher cost. For more information, see the Cost optimization category.

Use sole-tenant nodes to support compliance needs

Sole-tenant nodes are physical Compute Engine servers that are dedicated to hosting only your project's VMs. Sole-tenant nodes can help you to meet compliance requirements for physical isolation, including the following:

  • Keep your VMs physically separated from VMs in other projects.
  • Group your VMs together on the same host hardware.
  • Isolate payments processing workloads.

For more information, see Sole-tenant nodes.

Use reservations to ensure resource availability

Google Cloud lets you define reservations for your workloads to ensure those resources are always available. There is no additional charge to create reservations, but you pay for the reserved resources even if you don't use them. For more information, see Consuming and managing reservations.

VM migration

This section provides best practices for migrating VMs to support your system.

Evaluate built-in migration tools

Evaluate built-in migration tools to move your workloads from another cloud or from on-premises. For more information, see Migration to Google Cloud. Google Cloud offers tools and services to help you migrate your workloads and optimize for cost and performance. To receive a free migration cost assessment based on your current IT landscape, see Google Cloud Rapid Assessment & Migration Program.

Use virtual disk import for customized operating systems

To import customized supported operating systems, see Importing virtual disks. Sole-tenant nodes can help you meet your hardware bring-your-own-license requirements for per-core or per-processor licenses. For more information, see Bringing your own licenses.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, we recommend that you do the following:

  • Review Google Cloud Marketplace offerings to evaluate whether your application is listed under a supported vendor. Google Cloud supports running various open source systems and various third-party software.

  • Consider Migrate to Containers and GKE to extract and package your VM-based application as a containerized application running on GKE.

  • Use Compute Engine to run your applications on Google Cloud. If you have legacy dependencies running in a VM-based application, verify whether they meet your vendor requirements.

  • Evaluate using a Google Cloud internal passthrough Network Load Balancer to scale your decoupled architecture. For more information, see Internal passthrough Network Load Balancer overview.

  • Evaluate your options for switching from traditional on-premises use cases like HA-Proxy usage. For more information, see best practice for floating IP address.

  • Use VM Manager to manage operating systems for your large VM fleets running Windows or Linux on Compute Engine, and apply consistent configuration policies.

  • Consider using GKE Autopilot and let Google SRE fully manage your clusters.

  • Use Policy Controller and Config Sync for policy and configuration management across your GKE clusters.

  • Ensure availability and scalability of machines in specific regions and zones. Google Cloud can scale to support your compute needs. However, if you need a lot of specific machine types in a specific region or zone, work with your account teams to ensure availability. For more information, see Reservations for Compute Engine.

What's next

Learn networking design principles.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Design your network infrastructure

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on networking design. You learn how to choose and implement Virtual Private Cloud (VPC), and how to test and manage network security.

Core principles

Networking design is critical to successful system design because it helps you optimize for performance and secure application communications with internal and external services. When you choose networking services, it's important to evaluate your application needs and evaluate how the applications will communicate with each other. For example, while some components require global services, other components might need to be geo-located in a specific region.

Google's private network connects regional locations to more than 100 global network points of presence. Google Cloud uses software-defined networking and distributed systems technologies to host and deliver your services around the world. Google's core element for networking within Google Cloud is the global VPC. VPC uses Google's global high-speed network to link your applications across regions while supporting privacy and reliability. Google ensures that your content is delivered with high throughput by using technologies like Bottleneck Bandwidth and Round-trip propagation time (BBR) congestion-control intelligence.

Developing your cloud networking design includes the following steps:

  1. Design the workload VPC architecture. Start by identifying how many Google Cloud projects and VPC networks you require.
  2. Add inter-VPC connectivity. Design how your workloads connect to other workloads in different VPC networks.
  3. Design hybrid network connectivity. Design how your workload VPCs connect to on-premises and other cloud environments.

To see a complete list of VPC specifications, see Specifications.

Workload VPC architecture

This section provides best practices for designing workload VPC architectures to support your system.

Consider VPC network design early

Make VPC network design an early part of designing your organizational setup in Google Cloud. Organizational-level design choices can't be easily reversed later in the process. For more information, see Best practices and reference architectures for VPC design and Decide the network design for your Google Cloud landing zone.

Start with a single VPC network

For many use cases that include resources with common requirements, a single VPC network provides the features that you need. Single VPC networks are simple to create, maintain, and understand. For more information, see VPC Network Specifications.

Keep VPC network topology simple

To ensure a manageable, reliable, and well-understood architecture, keep the design of your VPC network topology as simple as possible.

Use VPC networks in custom mode

To ensure that Google Cloud networking integrates seamlessly with your existing networking systems, we recommend that you use custom mode when you create VPC networks. Using custom mode helps you integrate Google Cloud networking into existing IP address management schemes and it lets you control which cloud regions are included in the VPC. For more information, see VPC.

Inter-VPC connectivity

This section provides best practices for designing inter-VPC connectivity to support your system.

Choose a VPC connection method

If you decide to implement multiple VPC networks, you need to connect those networks. VPC networks are isolated tenant spaces within Google's Andromeda software-defined network (SDN). There are several ways that VPC networks can communicate with each other. Choose how you connect your network based on your bandwidth, latency, and service level agreement (SLA) requirements. To learn more about the connection options, see Choose the VPC connection method that meets your cost, performance, and security needs.

Use Shared VPC to administer multiple working groups

For organizations with multiple teams, Shared VPC provides an effective tool to extend the architectural simplicity of a single VPC network across multiple working groups.

Use simple naming conventions

Choose simple, intuitive, and consistent naming conventions. Doing so helps administrators and users to understand the purpose of each resource, where it's located, and how it's differentiated from other resources.

Use connectivity tests to verify network security

In the context of network security, you can use connectivity tests to verify that traffic you intend to prevent between two endpoints is blocked. To verify that traffic is blocked and why it's blocked, define a test between two endpoints and evaluate the results. For example, you can test firewall rules that are intended to block traffic between a source and a destination. For more information, see Connectivity Tests overview.

Use Private Service Connect to create private endpoints

To create private endpoints that let you access Google services with your own IP address scheme, use Private Service Connect. You can access the private endpoints from within your VPC and through hybrid connectivity that terminates in your VPC.

Secure and limit external connectivity

Limit internet access only to those resources that need it. Resources with only a private, internal IP address can still access many Google APIs and services through Private Google Access.

Use Network Intelligence Center to monitor your cloud networks

Network Intelligence Center provides a comprehensive view of your Google Cloud networks across all regions. It helps you to identify traffic and access patterns that can cause operational or security risks.

What's next

Learn best practices for storage management.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Select and implement a storage strategy

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on storage. You learn how to select a storage strategy and how to manage storage, access patterns, and workloads.

To facilitate data exchange and securely back up and store data, organizations need to choose a storage plan based on workload, input/output operations per second (IOPS), latency, retrieval frequency, location, capacity, and format (block, file, and object).

Cloud Storage provides reliable, secure object storage services.

In Google Cloud, IOPS scales according to your provisioned storage space. Storage types like Persistent Disk require manual replication and backup because they are zonal or regional. By contrast, object storage is highly available and it automatically replicates data across a single region or across multiple regions.

Storage type

This section provides best practices for choosing a storage type to support your system.

Evaluate options for high-performance storage needs

Evaluate persistent disks or local solid-state drives (SSD) for compute applications that require high-performance storage. Cloud Storage is an immutable object store with versioning. Using Cloud Storage with Cloud CDN helps optimize for cost, especially for frequently accessed static objects.

Filestore supports multi-write applications that need high-performance shared space. Filestore also supports legacy and modern applications that require POSIX-like file operations through Network File System (NFS) mounts.

Cloud Storage supports use cases such as creating data lakes and addressing archival requirements. When you choose a Cloud Storage class, weigh tradeoffs between storage costs and access and retrieval costs, especially when you configure retention policies. For more information, see Design an optimal storage strategy for your cloud workload.

All storage options are encrypted at rest and in transit by default by using Google-managed keys. For storage types such as Persistent Disk and Cloud Storage, you can either supply your own keys or manage them through Cloud Key Management Service (Cloud KMS). Establish a strategy for handling such keys before you employ them on production data.

Choose Google Cloud services to support storage design

The following Google Cloud services support storage design:

  • Cloud Storage: Provides global storage and retrieval of any amount of data at any time. You can use Cloud Storage for multiple scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users through direct download.
  • Persistent Disk: High-performance block storage for Google Cloud. Persistent Disk provides SSD and hard disk drive (HDD) storage that you can attach to instances running in Compute Engine or Google Kubernetes Engine (GKE).
      • Regional disks provide durable storage and replication of data between two zones in the same region. If you need higher IOPS and low latency, Google Cloud offers Filestore.
      • Local SSDs are physically attached to the server that hosts your virtual machine instance. You can use local SSDs as temporary disk space.
  • Filestore: A managed file storage service for applications that require a file system interface and a shared file system for data. Filestore gives users a seamless experience for standing up managed Network Attached Storage (NAS) with their Compute Engine and GKE instances.
  • Cloud Storage for Firebase: Built for app developers who need to store and serve user-generated content, such as photos or videos. All your files are stored in Cloud Storage buckets, so they are accessible from both Firebase and Google Cloud.

Choose a storage strategy

To select a storage strategy that meets your application requirements, use the following recommendations:

  • You want to store data at scale at the lowest cost, and access performance is not an issue: Cloud Storage.
  • You are running compute applications that need immediate storage: Persistent Disk or Local SSD. For more information, see Optimizing Persistent Disk and Local SSD performance.
  • You are running high-performance workloads that need read and write access to shared space: Filestore.
  • You have high-performance computing (HPC) or high-throughput computing (HTC) use cases: see Using clusters for large-scale technical computing in the cloud.

Choose active or archival storage based on storage access needs

A storage class is a piece of metadata that is used by every object. For data that is served at a high rate with high availability, use the Standard Storage class. For data that is infrequently accessed and can tolerate slightly lower availability, use the Nearline Storage, Coldline Storage, or Archive Storage class. For more information about cost considerations for choosing a storage class, see Cloud Storage pricing.
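
For example, the following minimal sketch, assuming the google-cloud-storage Python client and placeholder bucket and object names, creates a bucket with Nearline as its default storage class and rewrites an infrequently read object to Coldline:

```python
# A minimal sketch, assuming the google-cloud-storage Python client and placeholder
# bucket and object names: create a bucket with Nearline as its default storage
# class for infrequently accessed data, and rewrite an existing object to Coldline.

from google.cloud import storage

client = storage.Client()

# Archival bucket with Nearline as the default storage class.
archive_bucket = client.bucket("example-archive-bucket")
archive_bucket.storage_class = "NEARLINE"
client.create_bucket(archive_bucket, location="us-central1")

# Move a rarely read object in an existing bucket to Coldline.
bucket = client.get_bucket("example-app-prod-assets")
blob = bucket.blob("reports/2022/annual.pdf")
blob.update_storage_class("COLDLINE")
```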

Evaluate storage location and data protection needs for Cloud Storage

For a Cloud Storage bucket located in a region, data contained within it is automatically replicated across zones within the region. Data replication across zones protects the data if there is a zonal failure within a region.

Cloud Storage also offers locations that are redundant across regions, which means data is replicated across multiple, geographically separate data centers. For more information, see Bucket locations.

Use Cloud CDN to improve static object delivery

To optimize the cost to retrieve objects and minimize access latency, use Cloud CDN. Cloud CDN uses the Cloud Load Balancing external Application Load Balancer to provide routing, health checking, and anycast IP address support. For more information, see Setting up Cloud CDN with cloud buckets.

Storage access pattern and workload type

This section provides best practices for choosing storage access patterns and workload types to support your system.

Use Persistent Disk to support high-performance storage access

Data access patterns depend on how you design system performance. Cloud Storage provides scalable storage, but it isn't an ideal choice when you run heavy compute workloads that need high throughput access to large amounts of data. For high-performance storage access, use Persistent Disk.

Use exponential backoff when implementing retry logic

Use exponential backoff when implementing retry logic to handle 5XX, 408, and 429 errors. Each Cloud Storage bucket is provisioned with initial I/O capacity. For more information, see Request rate and access distribution guidelines. Plan a gradual ramp-up for retry requests.
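
The following minimal sketch shows the pattern with a placeholder request callable; the Google Cloud client libraries also provide configurable built-in retry policies (for example, google.api_core.retry) that you can use instead.

```python
# Minimal sketch of exponential backoff with jitter for retryable status codes
# (408, 429, and 5xx). The request callable and limits are placeholders.

import random
import time

RETRYABLE = {408, 429} | set(range(500, 600))

def call_with_backoff(request, max_attempts=5, base_delay=1.0, max_delay=32.0):
    for attempt in range(max_attempts):
        status, body = request()
        if status not in RETRYABLE:
            return status, body
        # Back off exponentially, capped at max_delay, with random jitter.
        delay = min(max_delay, base_delay * (2 ** attempt)) + random.uniform(0, 1)
        time.sleep(delay)
    return status, body
```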

Storage management

This section provides best practices for storage management to support your system.

Assign unique names to every bucket

Make every bucket name unique across the Cloud Storage namespace. Don't include sensitive information in a bucket name. Choose bucket and object names that are difficult to guess. For more information, see the bucket naming guidelines and Object naming guidelines.

Keep Cloud Storage buckets private

Unless there is a business-related reason, ensure that your Cloud Storage bucket isn't anonymously or publicly accessible. For more information, see Overview of access control.

Assign random object names to distribute load evenly

Assign random object names to facilitate performance and avoid hotspotting. Use a randomized prefix for objects where possible. For more information, see Use a naming convention that distributes load evenly across key ranges.
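
As a minimal illustration, where the bucket layout and prefix length are assumptions, the following sketch adds a short random hex prefix so that sequential uploads don't concentrate on one key range:

```python
# Minimal sketch of a randomized object-name prefix that spreads writes across
# key ranges instead of hotspotting a sequential prefix such as a timestamp.
# The object layout shown here is an illustrative assumption.

import secrets

def randomized_object_name(logical_name: str) -> str:
    prefix = secrets.token_hex(3)  # e.g., 'a1f09c'
    return f"{prefix}/{logical_name}"

print(randomized_object_name("uploads/2024-05-01/user-42.jpg"))
# e.g., '3fb2e1/uploads/2024-05-01/user-42.jpg'
```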

Use public access prevention

To prevent public access to your data at the organization, folder, project, or bucket level, use public access prevention. For more information, see Using public access prevention.
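
For a single bucket, a minimal sketch using the google-cloud-storage Python client (the bucket name is a placeholder) might look like the following; organization-, folder-, and project-level enforcement is configured through organization policy rather than in code like this.

```python
# A minimal sketch, assuming the google-cloud-storage Python client, that enforces
# public access prevention (and uniform bucket-level access) on a single bucket.
# The bucket name is a placeholder.

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-app-prod-assets")

bucket.iam_configuration.public_access_prevention = "enforced"
bucket.iam_configuration.uniform_bucket_level_access_enabled = True
bucket.patch()
```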

What's next

Learn about Google Cloud database services and best practices, including the following:

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Optimize your database

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on database design. You learn how to design, migrate, and scale databases, encrypt database information, manage licensing, and monitor your database for events.

Key services

This document in the Architecture Framework system design category provides best practices that include various Google Cloud database services. The following table provides a high-level overview of these services:

Google Cloud service Description
Cloud SQL A fully managed database service that lets you set up, maintain, manage, and administer your relational databases that use Cloud SQL for PostgreSQL, Cloud SQL for MySQL, and Cloud SQL for SQL Server. Cloud SQL offers high performance and scalability. Hosted on Google Cloud, Cloud SQL provides a database infrastructure for applications running anywhere.
Bigtable A sparsely populated NoSQL table that can scale to billions of rows and thousands of columns, letting you store up to petabytes of data. A single value in each row is indexed; this value is known as the row key. Use Bigtable to store very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is a data source for MapReduce operations.
Spanner A scalable, globally distributed, enterprise database service built for the cloud that includes relational database structure and non-relational horizontal scale. This combination delivers high-performance transactions and consistency across rows, regions, and continents. Spanner provides a 99.999% availability SLA, no planned downtime, and enterprise-grade security.
Memorystore A fully managed Redis service for Google Cloud. Applications that run on Google Cloud can increase performance by using the highly available, scalable, secure Redis service without managing complex Redis deployments.
Firestore A NoSQL document database built for automatic scaling, high performance, and application development. Although the Firestore interface has many of the same features as traditional databases, it is a NoSQL database and it describes relationships between data objects differently.
Firebase Realtime Database A cloud-hosted database. Firebase stores data as JSON and it synchronizes in real time to every connected client. When you build cross-platform apps with Google, iOS, Android, and JavaScript SDKs, all of your clients share one real-time database instance and automatically receive updates with the newest data.
Open source databases Google partners offer different open source databases, including MongoDB, MariaDB, and Redis.
AlloyDB for PostgreSQL A fully managed PostgreSQL-compatible database service for demanding enterprise workloads. Provides up to 4x faster performance for transactional workloads and up to 100x faster analytical queries when compared to standard PostgreSQL. AlloyDB for PostgreSQL simplifies management with machine learning-enabled autopilot systems.

Database selection

This section provides best practices for choosing a database to support your system.

Consider using a managed database service

Evaluate Google Cloud managed database services before you install your own database or database cluster. Installing your own database involves maintenance overhead including installing patches and updates, and managing daily operational activities like monitoring and performing backups.

Use functional and non-functional application requirements to drive database selection. Consider low latency access, time series data processing, disaster recovery, and mobile client synchronization.

To meet these requirements, choose from the Google Cloud database options described in the following table:

Database option Description
Cloud SQL A regional service that supports read replicas in remote regions, low-latency reads, and disaster recovery.
Spanner A multi-regional offering providing external consistency, global replication, and a five nines service level agreement (SLA).
Bigtable A fully managed, scalable NoSQL database service for large analytical and operational workloads with up to 99.999% availability.
Memorystore A fully managed database service that provides a managed version of two popular open source caching solutions: Redis and Memcached.
Firebase Realtime Database The Firebase Realtime Database is a cloud-hosted NoSQL database that lets you store and sync data between your users in real time.
Firestore A NoSQL document database built for automatic scaling, high performance, and ease of application development.
Open source Alternative database options including MongoDB and MariaDB.

Database migration

To ensure that users experience zero application downtime when you migrate existing workloads to Google Cloud, it's important to choose database technologies that support your requirements. For information about database migration options and best practices, see Database migration solutions and Best practices for homogeneous database migrations.

Planning for a database migration includes the following:

  • Assessment and discovery of the current database.
  • Definitions of migration success criteria.
  • Environment setup for migration and the target database.
  • Creation of the schema in the target database.
  • Migration of the data into the target database.
  • Validation of the migration to verify that all the data is migrated correctly and is present in the database.
  • Creation of a rollback strategy.

Choose a migration strategy

Selecting the appropriate target database is one of the keys to a successful migration. The following table provides migration options for some use cases:

Use case Recommendation
New development in Google Cloud. Select one of the managed databases that's built for the cloud—Cloud SQL, Spanner, Bigtable, or Firestore—to meet your use-case requirements.
Lift-and-shift migration. Choose a compatible managed database service such as Cloud SQL for MySQL, Cloud SQL for PostgreSQL, or Cloud SQL for SQL Server.
Your application requires granular access to a database that Cloud SQL doesn't support. Run your database on Compute Engine VMs.

Use Memorystore to support your caching database layer

Memorystore is a fully managed Redis and Memcached database service that supports sub-millisecond latency. Memorystore is fully compatible with open source Redis and Memcached. If you use these caching databases in your applications, you can use Memorystore without making application-level changes in your code.
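For example, an application that already uses the open source redis-py client can point at a Memorystore for Redis endpoint with only a configuration change. The host IP in the following sketch is hypothetical.

```python
import redis

# Hypothetical Memorystore for Redis endpoint; use your instance's IP address.
cache = redis.Redis(host="10.0.0.3", port=6379)

# Standard Redis commands work unchanged against Memorystore.
cache.set("session:12345", "cached-profile-data", ex=300)  # expire after 5 minutes
print(cache.get("session:12345"))
```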

Use the Bare Metal Solution to run an Oracle database

If your workloads require an Oracle database, use the Bare Metal Solution provided by Google Cloud. This approach fits into a lift-and-shift migration strategy.

If you want to move your workload to Google Cloud and modernize after your baseline workload is working, consider using managed database options like Spanner, Bigtable, and Firestore.

Databases built for the cloud are modern managed databases that are built from the ground up on cloud infrastructure. These databases provide unique default capabilities, such as scalability and high availability, that are difficult to achieve if you run your own database.

Modernize your database

Plan your database strategy early in the system design process, whether you're designing a new application in the cloud or you're migrating an existing database to the cloud. Google Cloud provides managed database options for open source databases such as Cloud SQL for MySQL and Cloud SQL for PostgreSQL. We recommend that you use the migration as an opportunity to modernize your database and prepare it to support future business needs.

Use fixed databases with off-the-shelf applications

Commercial off-the-shelf (COTS) applications require a fixed type of database and fixed configuration. Lift and shift is usually the most appropriate migration approach for COTS applications.

Verify your team's database migration skill set

Choose a cloud database-migration approach based on your team's database migration capabilities and skill sets. Use Google Cloud Partner Advantage to find a partner to support you throughout your migration journey.

Design your database to meet HA and DR requirements

When you design your databases to meet high availability (HA) and disaster recovery (DR) requirements, evaluate the tradeoffs between reliability and cost. Database services that are built for the cloud create multiple copies of your data within a region or in multiple regions, depending upon the database and configuration.

Some Google Cloud services have multi-regional variants, such as BigQuery and Spanner. To be resilient against regional failures, use these multi-regional services in your design where possible.

If you design your database on Compute Engine VMs instead of using managed databases on Google Cloud, ensure that you run multiple copies of your databases. For more information, see Design for scale and high availability in the Reliability category.

Specify cloud regions to support data residency

Data residency describes where your data physically resides at rest. Consider choosing specific cloud regions to deploy your databases based on your data residency requirements.

If you deploy your databases in multiple regions, there might be data replication between them depending on how you configure them. Select the configuration that keeps your data at rest within the desired regions. Some databases, like Spanner, offer multi-region configurations that replicate data across regions. You can also enforce data residency by setting an organization policy that includes the resource locations constraint. For more information, see Restricting Resource Locations.

Include disaster recovery in data residency design

Include Recovery Time Objective (RTO) and Recovery Point Objective (RPO) targets in your data residency plans, and consider the trade-off between RTO/RPO and the cost of the disaster recovery solution. Smaller RTO and RPO values cost more, because a system that recovers faster from disruptions costs more to run. Also, factor customer happiness into your disaster recovery approach to make sure that your reliability investments are appropriate. For more information, see 100% reliability is the wrong target and Disaster recovery planning guide.

Make your database Google Cloud-compliant

When you choose a database for your workload, ensure that the selected service meets compliance for the geographic region that you are operating in and where your data is physically stored. For more information about Google's certifications and compliance standards, see Compliance offerings.

Encryption

This section provides best practices for identifying encryption requirements and choosing an encryption key strategy to support your system.

Determine encryption requirements

Your encryption requirements depend on several factors, including company security policies and compliance requirements. All data that is stored in Google Cloud is encrypted at rest by default, without any action required by you, using AES256. For more information, see Encryption at rest in Google Cloud.

Choose an encryption key strategy

Decide whether you want to manage encryption keys yourself or use a managed service. Google Cloud supports both scenarios. If you want a fully managed service to handle your encryption keys on Google Cloud, use Cloud Key Management Service (Cloud KMS). If you want more control over a key's lifecycle, use customer-managed encryption keys (CMEK), which you create and manage in Cloud KMS.
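For example, one way to use CMEK is to set a default Cloud KMS key on a Cloud Storage bucket so that new objects are encrypted with your key. The following Python sketch assumes the google-cloud-storage library and hypothetical project, key ring, and key names.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-cmek-bucket")  # hypothetical bucket name

# Hypothetical Cloud KMS key resource name.
kms_key = "projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key"

# New objects written without an explicit key are encrypted with this CMEK key.
bucket.default_kms_key_name = kms_key
bucket.patch()
```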

To create and manage your encryption keys outside of Google Cloud, choose one of the following options:

  • If you use a partner solution to manage your keys, use Cloud External Key Manager.
  • If you manage your keys on-premises and you want to use those keys to encrypt data on Google Cloud, import those keys into Cloud KMS, either as Cloud KMS software keys or as hardware security module (HSM) keys. Use those keys to encrypt your data on Google Cloud.

Database design and scaling

This section provides best practices for designing and scaling a database to support your system.

Use monitoring metrics to assess scaling needs

Use metrics from your existing monitoring tools and environments to establish a baseline understanding of database size and scaling requirements, such as right-sizing and designing scaling strategies for your database instance.

For new database designs, determine scaling numbers based on expected load and traffic patterns from the serving application. For more information, see Monitoring Cloud SQL instances, Monitoring with Cloud Monitoring, and Monitoring an instance.
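For example, you can read a baseline metric such as Cloud SQL CPU utilization through the Cloud Monitoring API. The following Python sketch uses the google-cloud-monitoring client library; the project ID is hypothetical and the one-hour window is only an example.

```python
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # hypothetical project ID

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# Read the last hour of Cloud SQL CPU utilization to establish a baseline.
results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "cloudsql.googleapis.com/database/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)
```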

Networking and access

This section provides best practices for managing networking and access to support your system.

Run databases inside a private network

Run your databases inside your private network and grant restricted access only from the clients who need to interact with the database. You can create Cloud SQL instances inside a VPC. Google Cloud also provides VPC Service Controls for Cloud SQL, Spanner, and Bigtable databases to ensure that access to these resources is restricted only to clients within authorized VPC networks.

Grant minimum privileges to users

Identity and Access Management (IAM) controls access to Google Cloud services, including database services. To minimize the risk of unauthorized access, grant the least number of privileges to your users. For application-level access to your databases, use service accounts with the least number of privileges.

Automation and right-sizing

This section provides best practices for defining automation and right-sizing to support your system.

Define database instances as code

One of the benefits of migrating to Google Cloud is the ability to automate your infrastructure and other aspects of your workload, such as the compute and database layers. Cloud Deployment Manager and third-party tools like Terraform let you define your database instances as code, which lets you apply a consistent and repeatable approach to creating and updating your databases.

Use Liquibase to version control your database

Google database services like Cloud SQL and Spanner support Liquibase, an open source version control tool for databases. Liquibase helps you to track your database schema changes, roll back schema changes, and perform repeatable migrations.

Test and tune your database to support scaling

Perform load tests on your database instance and tune it based on the test results to meet your application's requirements. Determine the initial scale of your database by load testing against key performance indicators (KPIs) or by using monitoring KPIs derived from your current database.

When you create database instances, start with a size that is based on the testing results or historical monitoring metrics. Test your database instances with the expected load in the cloud. Then fine-tune the instances until you get the desired results for the expected load on your database instances.

Choose the right database for your scaling requirements

Scaling databases is different from scaling compute layer components. Databases have state; when one instance of your database isn't able to handle the load, consider the appropriate strategy to scale your database instances. Scaling strategies vary depending on the database type.

Use the following table to learn about Google products that address scaling use cases.

Use case Recommended product Description
Horizontally scale your database instance by adding nodes to your database when you need to scale up the serving capacity and storage. Spanner A relational database that's built for the cloud.
Add nodes to scale your database. Bigtable Fully managed NoSQL big data database service.
Automatically handle database scaling. Firestore Flexible, scalable database for mobile, web, and server development.
To serve more queries, vertically scale up Cloud SQL database instances to give them more compute and memory capacity. In Cloud SQL, the storage layer is decoupled from the database instance. You can choose to scale your storage layer automatically whenever it approaches capacity. Cloud SQL Fully managed database service that helps you set up, maintain, manage, and administer your relational databases on Google Cloud.

Operations

This section provides best practices for operations to support your system.

Use Cloud Monitoring to monitor and set up alerts for your database

Use Cloud Monitoring to monitor your database instances and set up alerts to notify appropriate teams of events. For information about efficient alerting best practices, see Build efficient alerts.

All databases that are built for the cloud provide logging and monitoring metrics. Each service provides a dashboard to visualize logging and monitoring metrics. The monitoring metrics for all services integrate with Google Cloud Observability. Spanner provides query introspection tools like the Key Visualizer for debugging and root cause analysis. The Key Visualizer provides the following capabilities:

  • Helps you analyze Spanner usage patterns by generating visual reports for your databases. The reports display usage patterns by ranges of rows over time.
  • Provides insights into usage patterns at scale.

Bigtable also provides a Key Visualizer diagnostic tool that helps you to analyze Bigtable instance usage patterns.

Licensing

This section provides best practices for licensing to support your system.

Choose between on-demand licenses and existing licenses

If you use Cloud SQL for SQL Server, bringing your own licenses isn't supported; your licensing costs are based on per-core hour usage.

If you want to use your existing SQL Server licenses, consider running SQL Server on Compute Engine VMs instead of Cloud SQL. For more information, see Microsoft licenses and Choosing between on-demand licenses and bringing existing licenses.

If you use Oracle and if you're migrating to the Bare Metal Solution for Oracle, you can bring your own licenses. For more information, see Plan for Bare Metal Solution.

Migration timelines, methodology, and toolsets

This section provides best practices for planning and supporting your database migration to support your system.

Determine database modernization readiness

Assess whether your organization is ready to modernize your databases and use databases that are built for the cloud.

Consider database modernization when you plan workload migration timelines, because modernization is likely to affect your applications as well.

Involve relevant stakeholders in migration planning

To migrate a database, you complete the following tasks:

  • Set up the target databases.
  • Convert the schema.
  • Set up data replication between the source and target database.
  • Debug issues as they arise during the migration.
  • Establish network connectivity between the application layer and the database.
  • Implement target database security.
  • Ensure that the applications connect to the target databases.

These tasks often require different skill sets, and multiple teams must collaborate across your organization to complete the migration. When you plan the migration, include stakeholders from all teams, such as app developers, database administrators, and infrastructure and security teams.

If your team lacks skills to support this type of migration, Google's partners can help you perform your migrations. For more information, see Google Cloud Partner Advantage.

Identify tool sets for homogeneous and heterogeneous migrations

A homogeneous migration is a database migration between the source and target databases of the same database technology. A heterogeneous migration is a migration whose target database is different from the source database.

Heterogeneous migrations usually involve additional steps of schema conversion from the source database to the target database engine type. Your database teams need to assess the challenges involved in the schema conversion, because they depend on the complexity of the source database schema.

Test and validate each step in data migration

Data migrations involve multiple steps. To minimize migration errors, test and validate each step in the migration before moving to the next step. The following factors drive the migration process:

  • Whether the migration is homogeneous or heterogeneous.
  • What type of tools and skill sets you have to perform the migration.
  • For heterogeneous migrations, your experience with the target database engine.

Determine continuous data replication requirements

Create a plan to migrate the data initially and then continuously replicate the data from the source to the target database. Continue replication until the target is stabilized and the application is completely migrated to the new database. This plan helps you to identify potential downtime during the database switch and plan accordingly.

If you plan to migrate your databases to Cloud SQL, such as Cloud SQL for MySQL or Cloud SQL for PostgreSQL, use Database Migration Service to automate the process in a fully managed way. For information about third-party tools that support other types of migrations, see Cloud Marketplace.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, we recommend that you do the following:

  • Multi-tenancy for databases involves storing data from multiple customers on a shared piece of infrastructure, in this case a database. If you offer a software-as-a-service (SaaS) based offering to your customers, make sure that you understand how you can logically isolate datasets that belong to different customers, and support their access requirements. Also, evaluate your requirements based on levels of separation.

    For relational databases such as Spanner and Cloud SQL, there are multiple approaches, such as isolating tenants' data at the database-instance level, database level, schema level, or the database-table level. Like other design decisions, there is a tradeoff between the degree of isolation and other factors such as cost and performance. IAM policies control access to your database instances.

  • Choose the right database for your data model requirements.

  • Choose key values to avoid key hotspotting. A hotspot is a location within a table that receives many more access requests than other locations. For more information about hotspots, see Schema design best practices.

  • Shard your database instance whenever possible.

  • Use connection-management best practices, such as connection pooling and exponential backoff.

  • Avoid very large transactions.

  • Design and test your application's response to maintenance updates on databases.

  • Secure and isolate connections to your database.

  • Size your database and growth expectations to ensure that the database supports your requirements.

  • Test your HA and DR failover strategies.

  • Perform backups and restore as well as exports and imports so that you're familiar with the process.

Cloud SQL recommendations

  • Use private IP address networking (VPC).
  • If you need public IP address networking, use the built-in firewall with a limited or narrow IP address list, and ensure that Cloud SQL instances require incoming connections to use SSL. For more information, see Configuring SSL/TLS certificates.
  • Use limited privileges for database users.

What's next

Learn data analytics best practices, including the following:

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Analyze your data

This document in the Google Cloud Architecture Framework explains some of the core principles and best practices for data analytics in Google Cloud. You learn about some of the key data-analytics services, and how they can help at the various stages of the data lifecycle. These best practices help you to meet your data analytics needs and create your system design.

Core principles

Businesses want to analyze data and generate actionable insights from that data. Google Cloud provides you with various services that help you through the entire data lifecycle, from data ingestion through reports and visualization. Most of these services are fully managed, and some are serverless. You can also build and manage a data-analytics environment on Compute Engine VMs, such as to self-host Apache Hadoop or Beam.

Your particular focus, team expertise, and strategic outlook help you to determine which Google Cloud services you adopt to support your data analytics needs. For example, Dataflow lets you write complex transformations in a serverless approach, but you must rely on an opinionated, managed configuration for compute and processing. Alternatively, Dataproc lets you run the same transformations, but you manage the clusters and fine-tune the jobs yourself.

In your system design, think about which processing strategy your teams use, such as extract, transform, load (ETL) or extract, load, transform (ELT). Your system design should also consider whether you need to process batch analytics or streaming analytics. Google Cloud provides a unified data platform, and it lets you build a data lake or a data warehouse to meet your business needs.

Key services

The following table provides a high-level overview of Google Cloud analytics services:

Google Cloud service Description
Pub/Sub Simple, reliable, and scalable foundation for stream analytics and event-driven computing systems.
Dataflow A fully managed service to transform and enrich data in stream (real time) and batch (historical) modes.
Dataprep by Trifacta Intelligent data service to visually explore, clean, and prepare structured and unstructured data for analysis.
Dataproc Fast, easy-to-use, and fully managed cloud service to run Apache Spark and Apache Hadoop clusters.
Cloud Data Fusion Fully managed data integration service that's built for the cloud and lets you build and manage ETL/ELT data pipelines. Cloud Data Fusion provides a graphical interface and a broad open source library of preconfigured connectors and transformations.
BigQuery Fully managed, low-cost, serverless data warehouse that scales with your storage and compute power needs. BigQuery is a columnar and ANSI SQL database that can analyze terabytes to petabytes of data.
Cloud Composer Fully managed workflow orchestration service that lets you author, schedule, and monitor pipelines that span clouds and on-premises data centers.
Data Catalog Fully managed and scalable metadata management service that helps you discover, manage, and understand all your data.
Looker Studio Fully managed visual analytics service that can help you unlock insights from data through interactive dashboards.
Looker Enterprise platform that connects, analyzes, and visualizes data across multi-cloud environments.
Dataform Fully managed product to help you collaborate, create, and deploy data pipelines, and ensure data quality.
Dataplex Managed data lake service that centrally manages, monitors, and governs data across data lakes, data warehouses, and data marts using consistent controls.
Analytics Hub Platform that efficiently and securely exchanges data analytics assets across your organization to address challenges of data reliability and cost.

Data lifecycle

When you create your system design, you can group the Google Cloud data analytics services around the general data movement in any system, or around the data lifecycle.

The data lifecycle includes the following stages and example services:

The following stages and services run across the entire data lifecycle:

  • Data integration includes services such as Data Fusion.
  • Metadata management and governance includes services such as Data Catalog.
  • Workflow management includes services such as Cloud Composer.

Data ingestion

Apply the following data ingestion best practices to your own environment.

Determine the data source for ingestion

Data typically comes from another cloud provider or service, or from an on-premises location.

Consider how you want to process your data after you ingest it. For example, Storage Transfer Service only writes data to a Cloud Storage bucket, and BigQuery Data Transfer Service only writes data to a BigQuery dataset. Cloud Data Fusion supports multiple destinations.

Identify streaming or batch data sources

Consider how you need to use your data and identify where you have streaming or batch use cases. For example, if you run a global streaming service that has low latency requirements, you can use Pub/Sub. If you need your data for analytics and reporting uses, you can stream data into BigQuery.

If you need to stream data from a system like Apache Kafka in an on-premises or other cloud environment, use the Kafka to BigQuery Dataflow template. For batch workloads, the first step is usually to ingest data into Cloud Storage. Use the gsutil tool or Storage Transfer Service to ingest data.
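If you prefer to script small batch uploads in code rather than use gsutil, the following Python sketch copies local files into a Cloud Storage landing bucket with the google-cloud-storage library; the bucket name and local directory are hypothetical.

```python
from pathlib import Path

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-landing-bucket")  # hypothetical landing bucket

# Upload a local directory of exported files as the first batch ingestion step.
for path in Path("./export").glob("*.csv"):
    blob = bucket.blob(f"raw/{path.name}")
    blob.upload_from_filename(str(path))
    print(f"Uploaded {path} to gs://{bucket.name}/{blob.name}")
```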

Ingest data with automated tools

Manually moving data from other systems into the cloud can be a challenge. If possible, use tools that let you automate the data ingestion process. For example, Cloud Data Fusion provides connectors and plugins to bring data from external sources with a drag-and-drop GUI. If your teams want to write some code, Dataflow or BigQuery can help to automate data ingestion. Pub/Sub can help in either a low-code or a code-first approach. To ingest data into storage buckets, use gsutil for data sizes of up to 1 TB. To ingest amounts of data larger than 1 TB, use Storage Transfer Service.

Use migration tools to ingest from another data warehouse

If you need to migrate from another data warehouse system, such as Teradata, Netezza, or Redshift, you can use the BigQuery Data Transfer Service migration assistance. The BigQuery Data Transfer Service also provides third-party transfers that help you ingest data on a schedule from external sources. For more information, see the detailed migration approaches for each data warehouse.

Estimate your data ingestion needs

The volume of data that you need to ingest helps you to determine which service to use in your system design. For streaming ingestion of data, Pub/Sub scales to tens of gigabytes per second. Capacity, storage, and regional requirements for your data help you to determine whether Pub/Sub Lite is a better option for your system design. For more information, see Choosing Pub/Sub or Pub/Sub Lite.
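For streaming ingestion, producers publish events to a Pub/Sub topic that downstream services such as Dataflow pipelines or BigQuery subscriptions consume. The following Python sketch publishes a single event with the google-cloud-pubsub library; the project, topic, and payload are hypothetical.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream-events")  # hypothetical names

# publish() returns a future that resolves to the server-assigned message ID.
future = publisher.publish(topic_path, data=b'{"event": "page_view", "user": "123"}')
print(f"Published message {future.result()}")
```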

For batch ingestion of data, estimate how much data you want to transfer in total, and how quickly you want to do it. Review the available migration options, including an estimate on time and comparison of online versus offline transfers.

Use appropriate tools to regularly ingest data on a schedule

Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs. For fine-grain control of the timing of ingestion or the source and destination system, use a workflow-management system like Cloud Composer. If you want a more manual approach, you can use Cloud Scheduler and Pub/Sub to trigger a Cloud Function.
If you want to manage the Compute infrastructure, you can use the gsutil command with cron for data transfer of up to 1 TB. If you use this manual approach instead of Cloud Composer, follow the best practices to script production transfers.

Review FTP/SFTP server data ingest needs

If you need a code-free environment to ingest data from an FTP/SFTP server, you can use the FTP copy plugins. If you want to modernize and create a long-term workflow solution, Cloud Composer is a fully managed service that lets you read and write from various sources and sinks.

Use Apache Kafka connectors to ingest data

If you use Pub/Sub, Dataflow, or BigQuery, you can ingest data by using one of the Apache Kafka connectors. For example, the open source Pub/Sub Kafka connector lets you move data between Kafka and Pub/Sub or Pub/Sub Lite.

Additional resources

Data storage

Apply the following data storage best practices to your own environment.

Choose the appropriate data store for your needs

To help you choose what type of storage solution to use, review and understand the downstream usage of your data. The following common use cases for your data give recommendations for which Google Cloud product to use:

Data use case Product recommendation
File-based Filestore
Object-based Cloud Storage
Low latency Bigtable
Time series Bigtable
Online cache Memorystore
Transaction processing Cloud SQL
Business intelligence (BI) & analytics BigQuery
Batch processing Cloud Storage. Bigtable if incoming data is time series and you need low-latency access to it. BigQuery if you use SQL.

Review your data structure needs

For most unstructured data, such as documents and text files, audio and video files, or logs, an object-based store is the most suitable choice. You can then load and process the data from object storage when you need it.

For semi-structured data, such as XML or JSON, your use cases and data access patterns help guide your choice. You can load such datasets into BigQuery for automatic schema detection. If you have low latency requirements, you can load your JSON data into Bigtable. If you have legacy requirements or your applications work with relational databases, you can also load datasets into a relational store.

For structured data, such as CSV, Parquet, Avro, or ORC, you can use BigQuery if you have BI and analytics requirements that use SQL. For more information, see how to batch load data. If you want to create a data lake on open standards and technologies, you can use Cloud Storage.
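As an example of batch loading structured files, the following Python sketch loads Parquet files from Cloud Storage into a BigQuery table with the google-cloud-bigquery library; the table ID and source URI are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.sales"  # hypothetical destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Start a batch load job from Cloud Storage and wait for it to finish.
load_job = client.load_table_from_uri(
    "gs://my-bucket/sales/*.parquet", table_id, job_config=job_config
)
load_job.result()
print(f"Loaded table with {client.get_table(table_id).num_rows} rows.")
```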

Migrate data and reduce costs for HDFS

Look for ways to move Hadoop Distributed File System (HDFS) data from on-premises or from another cloud provider to a cheaper object-storage system. Cloud Storage is the most common choice that enterprises make as an alternative data store. For information about the advantages and disadvantages of this choice, see HDFS vs. Cloud Storage.

You can move data with a push or pull method. Both methods use the hadoop distcp command. For more information, see Migrating HDFS Data from On-Premises to Google Cloud.

You can also use the open source Cloud Storage connector to let Hadoop and Spark jobs access data in Cloud Storage. The connector is installed by default on Dataproc clusters, and can be manually installed on other clusters.

Use object storage to build a cohesive data lake

A data lake is a centralized repository designed to store, process, and secure large amounts of structured, semistructured, and unstructured data. You can use Cloud Storage as the object store for your data lake, and use services such as Cloud Composer and Cloud Data Fusion to orchestrate and integrate the data in it.

To build a modern data platform, you can use BigQuery as your central data source instead of Cloud Storage. BigQuery is a modern data warehouse with separation of storage and compute. A data lake built on BigQuery lets you perform traditional analytics from BigQuery in the Cloud console. It also lets you access the data stored from other frameworks such as Apache Spark.

Additional resources

Process and transform data

Apply the following data analytics best practices to your own environment when you process and transform data.

Explore the open source software you can use in Google Cloud

Many Google Cloud services use open source software to help make your transition seamless. Google Cloud offers managed and serverless solutions that have open APIs and are compatible with open source frameworks to reduce vendor lock-in.

Dataproc is a Hadoop-compatible managed service that lets you host open source software with little operational burden. Dataproc includes support for Spark, Hive, Pig, Presto, and Zookeeper. It also provides Hive Metastore as a managed service, which removes the metastore as a single point of failure in the Hadoop ecosystem.

You can migrate to Dataflow if you currently use Apache Beam as a batch and streaming processing engine. Dataflow is a fully managed and serverless service that uses Apache Beam. Use Dataflow to write jobs in Beam, but let Google Cloud manage the execution environment.

If you use CDAP as your data integration platform, you can migrate to Cloud Data Fusion for a fully managed experience.

Determine your ETL or ELT data-processing needs

Your team's experience and preferences help determine your system design for how to process data. Google Cloud lets you use either traditional ETL or more modern ELT data-processing systems.

Use the appropriate framework for your data use case

Your data use cases determine which tools and frameworks to use. Some Google Cloud products are built to handle all of the following data use cases, while others best support only one particular use case.

  • For a batch data processing system, you can process and transform data in BigQuery with a familiar SQL interface. If you have an existing pipeline that runs on Apache Hadoop or Spark on-premises or in another public cloud, you can use Dataproc.
    • You can also use Dataflow if you want a unified programming interface for both batch and streaming use cases. We recommend that you modernize and use Dataflow for ETL and BigQuery for ELT.
  • For streaming data pipelines, use a managed, serverless service like Dataflow that provides windowing, autoscaling, and templates. For more information, see Building production-ready data pipelines using Dataflow.

  • For real-time use cases, such as time series analysis or streaming video analytics, use Dataflow.

Retain future control over your execution engine

To minimize vendor lock-in and to be able to use a different platform in the future, use the Apache Beam programming model and Dataflow as a managed serverless solution. The Beam programming model lets you change the underlying execution engine, such as changing from Dataflow to Apache Flink or Apache Spark.
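The following Python sketch illustrates this idea with the Apache Beam SDK: the pipeline code stays the same while the runner is a configuration choice. The tiny in-memory pipeline is only an illustration.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The runner is configuration, not code: "DirectRunner" for local testing,
# "DataflowRunner" for managed execution, or another Beam runner such as Flink.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["gcs", "pubsub", "gcs", "bigquery"])
        | "CountSources" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )
```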

Use Dataflow to ingest data from multiple sources

To ingest data from multiple sources, such as Pub/Sub, Cloud Storage, HDFS, S3, or Kafka, use Dataflow. Dataflow is a managed serverless service that supports Dataflow templates, which lets your teams run templates from different tools.

Dataflow Prime provides horizontal and vertical autoscaling of the machines that are used to execute a pipeline. It also provides smart diagnostics and recommendations that identify problems and suggest how to fix them.

Discover, identify, and protect sensitive data

Use Sensitive Data Protection to inspect and transform structured and unstructured data. Sensitive Data Protection works for data located anywhere in Google Cloud, such as in Cloud Storage or databases. You can classify, mask, and tokenize your sensitive data to continue to use it safely for downstream processing. Use Sensitive Data Protection to perform actions such as to scan BigQuery data or de-identify and re-identify PII in large-scale datasets.

Modernize your data transformation processes

Use Dataform to write data transformations as code and to use version control by default. You can also apply software development best practices such as CI/CD, unit testing, and version control to your SQL code. Dataform supports all major cloud data warehouse products and databases, such as PostgreSQL.

Additional resources

Data analytics and warehouses

Apply the following data analytics and warehouse best practices to your own environment.

Review your data storage needs

Data lakes and data warehouses aren't mutually exclusive. Data lakes are useful for unstructured and semi-structured data storage and processing. Data warehouses are best for analytics and BI.

Review your data needs to help determine where to store your data and which Google Cloud product is the most appropriate to process and analyze it. Products like BigQuery can process petabytes of data and grow with your demands.

Identify opportunities to migrate from a traditional data warehouse to BigQuery

Review the traditional data warehouses that are currently in use in your environment. To reduce complexity and potentially reduce costs, identify opportunities to migrate your traditional data warehouses to a Google Cloud service like BigQuery. For more information and example scenarios, see Migrating data warehouses to BigQuery.

Plan for federated access to data

Review your data requirements and how you might need to interact with other products and services. Identify your data federation needs, and create an appropriate system design.

For example, BigQuery lets you define external tables that can read data from other sources, such as Bigtable, Cloud SQL, Cloud Storage, or Google Drive. You can join these external sources with tables that you store in BigQuery.
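For example, the following Python sketch defines a BigQuery external table over CSV files in Cloud Storage so that you can join them with native BigQuery tables; the project, dataset, and bucket names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.external_sales"  # hypothetical table ID

# Point an external table at CSV files that stay in Cloud Storage.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/sales/*.csv"]
external_config.autodetect = True

table = bigquery.Table(table_id)
table.external_data_configuration = external_config
client.create_table(table)
```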

Use BigQuery flex slots to provide on-demand burst capacity

Sometimes you need extra capacity to do experimental or exploratory analysis that needs a lot of compute resources. BigQuery lets you get additional compute capacity in the form of flex slots. These flex slots help you when there's a period of high demand or when you want to complete an important analysis.

Understand schema differences if you migrate to BigQuery

BigQuery supports both star and snowflake schemas, but by default it uses nested and repeated fields. Nested and repeated fields can be easier to read and correlate compared to other schemas. If your data is represented in a star or snowflake schema, and if you want to migrate to BigQuery, review your system design for any necessary changes to processes or analytics.

Additional resources

Reports and visualization

Apply the following reporting and visualization best practices to your own environment.

Use BigQuery BI Engine to visualize your data

BigQuery BI Engine is a fast, in-memory analysis service. You can use BI Engine to analyze data stored in BigQuery with subsecond query response time and with high concurrency. BI Engine is integrated into the BigQuery API. Use reserved BI Engine capacity to manage the on-demand or flat-rate pricing for your needs. BI Engine can also work with other BI or custom dashboard applications that require subsecond response times.

Modernize your BI processes with Looker

Looker is a modern, enterprise platform for BI, data applications, and embedded analytics. You can create consistent data models on top of your data with speed and accuracy, and you can access data inside transactional and analytical datastores. Looker can also analyze your data on multiple databases and clouds. If you have existing BI processes and tools, we recommend that you modernize and use a central platform such as Looker.

Additional resources

Use workflow management tools

Data analytics involves many processes and services. Data moves across different tools and processing pipelines during the data analytics lifecycle. To manage and maintain end-to-end data pipelines, use appropriate workflow management tools. Cloud Composer is a fully managed workflow management tool based on the open source Apache Airflow project.

You can use Cloud Composer to launch Dataflow pipelines and to use Dataproc Workflow Templates. Cloud Composer can also help you create a CI/CD pipeline to test, synchronize, and deploy DAGs or use a CI/CD pipeline for data-processing workflows. For more information, watch Cloud Composer: Development best practices.
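Cloud Composer pipelines are standard Airflow DAGs written in Python. The following minimal sketch shows the shape of such a DAG; the task bodies are placeholders, and in practice they might start a Dataflow job or a BigQuery load instead of printing.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder; a real task might trigger a Dataflow template or a BigQuery load.
    print("extracting data")


def transform():
    print("transforming data")


with DAG(
    dag_id="daily_ingestion",  # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task
```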

Migration resources

If you already run a data analytics platform and if you want to migrate some or all of the workloads to Google Cloud, review the following migration resources for best practices and guidance:

What's next

Learn about system design best practices for Google Cloud AI and machine learning, including the following:

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Implement machine learning

This document in the Google Cloud Architecture Framework explains some of the core principles and best practices for AI and machine learning (ML) in Google Cloud. You learn about some of the key AI and ML services, and how they can help during the various stages of the AI and ML lifecycle. These best practices help you to meet your AI and ML needs and create your system design. This document assumes that you're familiar with basic AI and ML concepts.

To simplify the development process and minimize overhead when you build ML models on Google Cloud, consider the highest level of abstraction that makes sense for your use case. Level of abstraction is defined as the amount of complexity by which a system is viewed or programmed. The higher the level of abstraction, the less detail is available to the viewer.

To select Google AI and ML services based on your business needs, use the following table:

Persona Google services
Business users Standard solutions such as Contact Center AI Insights, Document AI, Discovery AI, and Cloud Healthcare API.
Developers with minimum ML experience Pretrained APIs address common perceptual tasks such as vision, video, and natural language. These APIs are supported by pretrained models and provide default detectors. They are ready to use without any ML expertise or model development effort. Pretrained APIs include: Vision API, Video API, Natural Language API, Speech-to-Text API, Text-to-Speech API, and Cloud Translation API.
Generative AI for Developers Vertex AI Search and Conversation lets developers use its out-of-the-box capabilities to build and deploy chatbots in minutes and search engines in hours. Developers who want to combine multiple capabilities into enterprise workflows can use the Gen App Builder API for direct integration.
Developers and data scientists AutoML enables custom model development with your own image, video, text, or tabular data. AutoML accelerates model development with automatic search through the Google model zoo for the most performant model architecture, so you don't need to build the model. AutoML handles common tasks for you, such as choosing a model architecture, tuning hyperparameters, and provisioning machines for training and serving.
Data scientists and ML engineers Vertex AI custom model toolings let you train and serve custom models, and they operationalize the ML workflow. You can also run your ML workload on self-managed compute such as Compute Engine VMs.
Data scientists & machine learning engineers Generative AI support on Vertex AI (also known as genai) provides access to Google's large generative AI models so you can test, tune, and deploy the models in your AI-powered applications.
Data engineers, data scientists, and data analysts familiar with SQL interfaces BigQuery ML lets you develop SQL-based models on top of data that's stored in BigQuery.

Key services

The following table provides a high-level overview of AI and ML services:

Google service Description
Cloud Storage and BigQuery Provide flexible storage options for machine learning data and artifacts.
BigQuery ML Lets you build machine learning models directly from data housed inside BigQuery.
Pub/Sub, Dataflow, Cloud Data Fusion, and Dataproc Support batch and real-time data ingestion and processing. For more information, see Data Analytics.
Vertex AI Offers data scientists and machine learning engineers a single platform to create, train, test, monitor, tune, and deploy ML models for everything from generative AI to MLOps.

Tooling includes the following:
Vertex AI Search and Conversation Lets you build chatbots and search engines for websites and for use across enterprise data.
  • Conversational AI on Vertex AI Search and Conversation can help reinvent customer and employee interactions with generative-AI-powered chatbots and digital assistants. For example, with these tools, you can provide more than just information by enabling transactions from within the chat experience.
  • Enterprise Search on Vertex AI Search and Conversation, lets enterprises build search experiences for customers and employees on their public or private websites. In addition to providing high-quality multimodal search results, Enterprise Search can also summarize results and provide corresponding citations with generative AI.
Generative AI on Vertex AI Gives you access to Google's large generative AI models so you can test, tune, and deploy them for use in your AI-powered applications. Generative AI on Vertex AI is also known as genai.
  • Generative AI models, which are also known as Foundation models, are categorized by the type of content they're designed to generate. This content includes text and chat, image, code, and text embeddings.
  • Vertex AI Studio lets you rapidly prototype and test generative AI models in Google Cloud console. You can test sample prompts, design your own prompts, and customize foundation models to handle tasks that meet your application's needs.
  • Model Tuning lets you customize foundation models for specific use cases by tuning them using a dataset of input-output examples.
  • Model Garden provides enterprise-ready foundation models, task-specific models, and APIs.
Pretrained APIs
AutoML Provides custom model tooling to build, deploy, and scale ML models. Developers can upload their own data and use the applicable AutoML service to build a custom model.
  • AutoML Image: Performs image classification and object detection on image data.
  • AutoML Video: Performs object detection, classification, and action recognition on video data.
  • AutoML Text: Performs language classification, entity extraction, and sentiment analysis on text data.
  • AutoML Translation: Detects and translates between language pairs.
  • AutoML Tabular: Lets you build a regression, classification, or forecasting model. Intended for structured data.
AI infrastructure Lets you use AI accelerators to process large-scale ML workloads. These accelerators let you train and get inference from deep learning models and from machine learning models in a cost-effective way.

GPUs can help with cost-effective inference and with scale-up or scale-out training for deep learning models. Tensor Processing Units (TPUs) are custom-built ASICs used to train and execute deep neural networks.
Dialogflow Delivers virtual agents that provide a conversational experience.
Contact Center AI Delivers an automated, insights-rich contact-center experience with Agent Assist functionality for human agents.
Document AI Provides document understanding at scale for documents in general, and for specific document types like lending-related and procurement-related documents.
Lending DocAI Automates mortgage document processing. Reduces processing time and streamlines data capture while supporting regulatory and compliance requirements.
Procurement DocAI Automates procurement data capture at scale by turning unstructured documents (like invoices and receipts) into structured data to increase operational efficiency, improve customer experience, and inform decision-making.
Recommendations Delivers personalized product recommendations.
Healthcare Natural Language AI Lets you review and analyze medical documents.
Media Translation API Enables real-time speech translation from audio data.

Data processing

Apply the following data processing best practices to your own environment.

Ensure that your data meets ML requirements

The data that you use for ML should meet certain basic requirements, regardless of data type. These requirements include the data's ability to predict the target, consistency in granularity between the data used for training and the data used for prediction, and accurately labeled data for training. Your data should also be sufficient in volume. For more information, see Data processing.

Store tabular data in BigQuery

If you use tabular data, consider storing all data in BigQuery and using the BigQuery Storage API to read data from it. To simplify interaction with the API, use one of the following additional tooling options, depending on where you want to read the data:

The input data type also determines the available model development tooling. Pre-trained APIs, AutoML, and BigQuery ML can provide more cost-effective and time-efficient development environments for certain image, video, text, and structured data use cases.

Ensure you have enough data to develop an ML model

To develop a useful ML model, you need to have enough data. To predict a category, the recommended number of examples for each category is 10 times the number of features. The more categories you want to predict, the more data you need. Imbalanced datasets require even more data. If you don't have enough labeled data available, consider semi-supervised learning.

Dataset size also has training and serving implications. If you have a small dataset, you can train the model directly within a Notebooks instance. If you have a larger dataset that requires distributed training, use the Vertex AI custom training service. If you want Google to train the model on your data, use AutoML.

Prepare data for consumption

Well-prepared data can accelerate model development. When you configure your data pipeline, make sure that it can process both batch and stream data so that you get consistent results from both types of data.

Model development and training

Apply the following model development and training best practices to your own environment.

Choose managed or custom-trained model development

When you build your model, consider the highest level of abstraction possible. Use AutoML when possible so that the development and training tasks are handled for you. For custom-trained models, choose managed options for scalability and flexibility, instead of self-managed options. To learn more about model development options, see Use recommended tools and products.

Consider the Vertex AI training service instead of self-managed training on Compute Engine VMs or Deep Learning VM containers. For a JupyterLab environment, consider Vertex AI Workbench, which provides both managed and user-managed JupyterLab environments. For more information, see Machine learning development and Operationalized training.

Use pre-built or custom containers for custom-trained models

For custom-trained models on Vertex AI, you can use pre-built or custom containers depending on your machine learning framework and framework version. Pre-built containers are available for Python training applications that are created for specific TensorFlow, scikit-learn, PyTorch, and XGBoost versions.

Otherwise, you can choose to build a custom container for your training job. For example, use a custom container if you want to train your model using a Python ML framework that isn't available in a pre-built container, or if you want to train using a programming language other than Python. In your custom container, pre-install your training application and all its dependencies onto an image that runs your training job.

Consider distributed training requirements

Consider your distributed training requirements. Some ML frameworks, like TensorFlow and PyTorch, let you run identical training code on multiple machines. These frameworks automatically coordinate division of work based on environment variables that are set on each machine. Other frameworks might require additional customization.

What's next

For more information about AI and machine learning, see the following:

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Design for environmental sustainability

This document in the Google Cloud Architecture Framework summarizes how you can approach environmental sustainability for your workloads in Google Cloud. It includes information about how to minimize your carbon footprint on Google Cloud.

Understand your carbon footprint

To understand the carbon footprint from your Google Cloud usage, use the Carbon Footprint dashboard. The Carbon Footprint dashboard attributes emissions to the Google Cloud projects that you own and the cloud services that you use.

For more information, see Understand your carbon footprint in "Reduce your Google Cloud carbon footprint."

Choose the most suitable cloud regions

One simple and effective way to reduce carbon emissions is to choose cloud regions with lower carbon emissions. To help you make this choice, Google publishes carbon data for all Google Cloud regions.

When you choose a region, you might need to balance lowering emissions with other requirements, such as pricing and network latency. To help select a region, use the Google Cloud Region Picker.

For more information, see Choose the most suitable cloud regions in "Reduce your Google Cloud carbon footprint."

Choose the most suitable cloud services

To help reduce your existing carbon footprint, consider migrating your on-premises VM workloads to Compute Engine.

Also consider that many workloads don't require VMs. Often you can utilize a serverless offering instead. These managed services can optimize cloud resource usage, often automatically, which simultaneously reduces cloud costs and carbon footprint.

For more information, see Choose the most suitable cloud services in "Reduce your Google Cloud carbon footprint."

Minimize idle cloud resources

Idle resources incur unnecessary costs and emissions. Some common causes of idle resources include the following:

  • Unused active cloud resources, such as idle VM instances.
  • Over-provisioned resources, such as VM instances with larger machine types than a workload needs.
  • Non-optimal architectures, such as lift-and-shift migrations that aren't always optimized for efficiency. Consider making incremental improvements to these architectures.

The following are some general strategies to help minimize wasted cloud resources:

  • Identify idle or overprovisioned resources and either delete them or rightsize them.
  • Refactor your architecture to incorporate a more optimal design.
  • Migrate workloads to managed services.

For more information, see Minimize idle cloud resources in "Reduce your Google Cloud carbon footprint."
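To find idle VMs programmatically, you can query the Recommender API. The following sketch assumes the google-cloud-recommender client library and uses placeholder project and zone values; verify the recommender ID against the Recommender documentation.

    # Minimal sketch: pip install google-cloud-recommender
    from google.cloud import recommender_v1

    project_id = "your-project-id"   # placeholder
    zone = "us-central1-a"           # placeholder
    # Recommender that flags idle Compute Engine instances (verify the ID in the docs).
    recommender_id = "google.compute.instance.IdleResourceRecommender"

    client = recommender_v1.RecommenderClient()
    parent = f"projects/{project_id}/locations/{zone}/recommenders/{recommender_id}"

    for recommendation in client.list_recommendations(parent=parent):
        # Each recommendation describes an idle instance and the suggested action,
        # such as stopping or deleting it.
        print(recommendation.description)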

Reduce emissions for batch workloads

Run batch workloads in regions with lower carbon emissions. For further reductions, run workloads at times that coincide with lower grid carbon intensity when possible.

For more information, see Reduce emissions for batch workloads in "Reduce your Google Cloud carbon footprint."

What's next

Google Cloud Architecture Framework: Operational excellence

This category in the Google Cloud Architecture Framework shows you how to operate services efficiently on Google Cloud. It discusses how to run, manage, and monitor systems that deliver business value. It also discusses Google Cloud products and features that support operational excellence. Using the principles of operational excellence helps you build a foundation for reliability. It does so by setting up foundational elements like observability, automation, and scalability.

This Architecture Framework describes best practices, provides implementation recommendations, and explains some available products and services that help you achieve operational excellence. The framework aims to help you design your Google Cloud deployment so that it best matches your business needs.

In the operational excellence category of the Architecture Framework, you learn to do the following:

Automate your deployments

This document in the Google Cloud Architecture Framework provides best practices for automating your builds, tests, and deployments.

Automation helps you standardize your builds, tests, and deployments by eliminating human-induced errors for repeated processes like code updates. This section describes how to use various checks and guards as you automate. A standardized machine-controlled process helps ensure that your deployments are applied safely. It also provides a mechanism to restore previous deployments as needed without significantly affecting your users' experience.

Store your code in central code repositories

Store your code in central code repositories that include a version control system with tagging and the ability to roll back code changes. Version control lets you organize files and control access, updates, and deletion across teams and organizations.

For different stages of development, version and label the repositories as needed. For example, labels could be test, dev, and prod.

In Google Cloud, you can store your code in Cloud Source Repositories, version it, and integrate it with other Google Cloud products. If you are building containerized applications, use Artifact Registry, a managed registry for container images and other build artifacts.

For more details about version control, see Version control. For details about implementing trunk-based development with your repositories, see Trunk-based development.

Use continuous integration and continuous deployment (CI/CD)

Automate your deployments using a continuous integration and continuous deployment (CI/CD) approach. A CI/CD approach is a combination of pipelines that you configure and processes that your development team follows.

A CI/CD approach increases deployment velocity by making your software development team more productive. This approach lets developers make smaller and more frequent changes that are thoroughly tested while reducing the time needed to deploy those changes.

As part of your CI/CD approach, automate all the steps that are part of building, testing, and deploying your code. For example:

  • Whenever new code is committed to the repository, have the commit automatically invoke the build and test pipeline.
  • Automate integration testing.
  • Automate your deployment so that changes deploy after your build meets specific testing criteria.

In Google Cloud, you can use Cloud Build and Cloud Deploy for your CI/CD pipelines.

Use Cloud Build to define the dependencies and versions that you use to package and build an application. Version your build configuration to make sure all your builds are consistent, and to make sure you can roll back to a previous configuration if necessary.

Use Cloud Deploy to deploy your applications to specific environments on Google Cloud, and to manage your deployment pipelines.
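As a minimal sketch of this flow, the following example submits a build to Cloud Build with the Python client library. The project ID, image path, and build steps are placeholders; in most pipelines this configuration lives in a versioned cloudbuild.yaml and the build is started by a trigger on your repository rather than by hand.

    # Minimal sketch: pip install google-cloud-build
    from google.cloud.devtools import cloudbuild_v1

    project_id = "your-project-id"  # placeholder
    image = f"us-central1-docker.pkg.dev/{project_id}/app-repo/app:latest"  # placeholder

    client = cloudbuild_v1.CloudBuildClient()

    # Two steps: build the container image, then push it to Artifact Registry.
    # A real build would also attach a source (for example, a repository trigger
    # or a Cloud Storage archive) so that the Docker build context isn't empty.
    build = cloudbuild_v1.Build(
        steps=[
            cloudbuild_v1.BuildStep(
                name="gcr.io/cloud-builders/docker",
                args=["build", "-t", image, "."],
            ),
            cloudbuild_v1.BuildStep(
                name="gcr.io/cloud-builders/docker",
                args=["push", image],
            ),
        ],
        images=[image],
    )

    operation = client.create_build(project_id=project_id, build=build)
    result = operation.result()  # blocks until the build finishes
    print(f"Build {result.id} finished with status {result.status.name}")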

For more details about implementing CI/CD, read Continuous integration and Deployment automation.

Provision and manage your infrastructure using infrastructure as code

Infrastructure as code is the use of a descriptive model to manage infrastructure, such as VMs, and configurations, such as firewall rules. Infrastructure as code lets you do the following:

  • Create your cloud resources automatically, including the deployment or test environments for your CI/CD pipeline.
  • Treat infrastructure changes like you treat application changes. For example, ensure that configuration changes are reviewed, tested, and can be audited.
  • Have a single source of truth for your cloud infrastructure.
  • Replicate your cloud environment as needed.
  • Roll back to a previous configuration if necessary.

This concept of infrastructure as code also applies to projects in Google Cloud. You can use this approach to define resources such as Shared VPC connectivity or Identity and Access Management (IAM) access in your projects. For an example of this approach, see the Google Cloud Project Factory Terraform Module.

Third-party tools, like Terraform, help you to automatically create your infrastructure on Google Cloud. For more information, see Managing infrastructure as code with Terraform, Cloud Build, and GitOps.

Consider using Google Cloud features, such as project liens, Cloud Storage retention policies, and Cloud Storage bucket locks, to protect critical resources from being accidentally or maliciously deleted.

Incorporate testing throughout the software delivery lifecycle

Testing is critical to successfully launching your software. Continuous testing helps teams create high-quality software faster and enhance software stability.

Testing types:

  • Unit tests. Unit tests are fast and help you perform rapid deployments. Treat unit tests as part of the codebase and include them as part of the build process, as shown in the sketch after this list.
  • Integration tests. Integration tests are important, especially for workloads that are designed for scale and dependent on more than one service. These tests can become complex when you test for integration with interconnected services.
  • System tests. System tests are time consuming and complex, but they help you identify edge cases and fix issues before deployment.
  • Other tests. There are other tests you should run, including static testing, load testing, security testing, policy validation testing, and others. Run these tests before deploying your application in production.
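The following self-contained sketch shows what a unit test that runs in the build might look like; the pricing function is hypothetical and is inlined only to keep the example runnable with pytest.

    # test_pricing.py: a minimal unit test that your CI pipeline runs on every
    # build, for example by invoking `pytest` from a build step.
    import pytest


    def apply_discount(price: float, percent: float) -> float:
        """Hypothetical application logic under test."""
        if percent < 0 or percent > 100:
            raise ValueError("percent must be between 0 and 100")
        return round(price * (1 - percent / 100), 2)


    def test_apply_discount_rounds_to_cents():
        assert apply_discount(price=10.00, percent=15) == 8.50


    def test_apply_discount_rejects_negative_percent():
        with pytest.raises(ValueError):
            apply_discount(price=10.00, percent=-1)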

To incorporate testing:

  • Perform all types of testing continuously throughout the software delivery lifecycle.
  • Automate these tests and include them in the CI/CD pipeline. Make your pipeline fail if any of the tests fail.
  • Update and add new tests continuously to improve and maintain the operational health of your deployment.

For your testing environments:

  • Use separate Google Cloud projects for each test environment you have. For each application, use a separate project environment. This separation provides a clear demarcation between production environment resources and the resources of your lower environments. This separation helps ensure that any changes in one environment don't accidentally affect other environments.
  • Automate the creation of test environments. One way to do this automation is using infrastructure as code.
  • Use a synthetic production environment to test changes. This production-like environment lets you test your application and perform various types of tests on your workloads, including end-to-end testing and performance testing.

For more information about implementing continuous testing, see Test automation.

Launch deployments gradually

Choose your deployment strategy based on important parameters, like minimum disruption to end users, rolling updates, rollback strategies, and A/B testing strategies. For each workload, evaluate these requirements and pick a deployment strategy from proven techniques, such as rolling updates, blue/green deployments, and canary deployments.

Only let CI/CD processes make and push changes in your production environment.

Consider using an immutable infrastructure. An immutable infrastructure is an infrastructure that isn't changed or updated. When you need to deploy new code or change any other configuration in your environment, you replace the entire environment (a collection of VMs, or Pods for example) with the new environment. Blue/green deployments are an example of immutable infrastructure.

We recommend that you do canary testing and observe your system for any errors as you deploy changes. This type of observation is easier if you have a robust monitoring and alerting system. To do A/B testing or canary testing, you can use Google Cloud's managed instance groups. Then you can perform a slow rollout, or a restoration if necessary.

Consider using Cloud Deploy to automate deployments and manage your deployment pipeline. You can also use many third-party tools, like Spinnaker and Tekton, on Google Cloud for both automated deployments and for creating deployment pipelines.

Restore previous releases seamlessly

Define your restoration strategy as part of your deployment strategy. Ensure that you can roll back a deployment, or an infrastructure configuration, to a previous version of the source code. Restoring a previous stable deployment is an important step in incident management for both reliability and security incidents.

Also ensure that you can restore the environment to the state it was in before the deployment process started. This can include:

  • The ability to revert any code changes in your application.
  • The ability to revert any configuration changes made to the environment.
  • Using immutable infrastructure and ensuring that deployments don't change the environment. These practices make reverting configuration changes easier.

Monitor your CI/CD pipelines

To keep your automated build, test, and deploy process running smoothly, monitor your CI/CD pipelines. Set alerts that indicate when anything in any pipeline fails. Each step of your pipeline should write suitable log statements so that your team can perform root cause analysis if a pipeline fails.

In Google Cloud, all the CI/CD services are integrated with Google Cloud Observability. For example:

For details about monitoring and logging, see Set up monitoring, alerting, and logging.

Deploy applications securely

Review the Deploy applications securely section from the security, privacy, and compliance category of the Architecture Framework.

Establish management guidelines for version releases

To help your engineers avoid making mistakes, and to enable high-velocity software delivery, ensure that your management guidelines for releasing new software versions are clearly documented.

Release engineers oversee how software is built and delivered. The system of release engineering is guided by four practices:

  • Self-service mode. Establish guidelines to help software engineers avoid common mistakes. These guidelines are generally codified in automated processes.

  • Frequent releases. High velocity helps troubleshooting and makes fixing issues easier. Frequent releases rely on automated unit tests.

  • Hermetic builds. Ensure consistency and repeatability with your build tools. Pin the versions of your build tools, such as compilers, so that building the same source now and one month from now produces identical results.

  • Policy enforcement. All changes need code review, ideally including a set of guidelines and policy to enforce security. Policy enforcement improves code review, troubleshooting, and testing a new release.

What's next

Set up monitoring, alerting, and logging

This document in the Google Cloud Architecture Framework shows you how to set up monitoring, alerting, and logging so that you can act based on the behavior of your system. This includes identifying meaningful metrics to track and building dashboards to make it easier to view information about your systems.

The DevOps Resource and Assessment (DORA) research program defines monitoring as:

"The process of collecting, analyzing, and using information to track applications and infrastructure in order to guide business decisions. Monitoring is a key capability because it gives you insight into your systems and your work."

Monitoring enables service owners to:

  • Make informed decisions when changes to the service affect performance
  • Apply a scientific approach to incident response
  • Measure your service's alignment with business goals

With monitoring, logging, and alerting in place, you can do the following:

  • Analyze long-term trends
  • Compare your experiments over time
  • Define alerting on critical metrics
  • Build relevant real-time dashboards
  • Perform retrospective analysis
  • Monitor both business-driven metrics and system health metrics
    • Business-driven metrics help you understand how well your systems support your business. For example, use metrics to monitor the following:
      • The cost to an application to serve a user
      • The volume change in site traffic following a redesign
      • How long it takes a customer to purchase a product on your site
    • System health metrics help you understand whether your systems are operating correctly and within acceptable performance levels.

Use the following four golden signals to monitor your system:

  • Latency. The time it takes to service a request.
  • Traffic. How much demand is being placed on your system.
  • Errors. The rate of requests that fail. Failure can be explicit (for example, HTTP 500s), implicit (for example, an HTTP 200 success response coupled with the wrong content), or by policy (for example, if you commit to one-second response times, any request over one second is an error).
  • Saturation. How full your service is. Saturation is a measure of your system fraction, emphasizing the resources that are most constrained (that is, in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).

Create a monitoring plan

Create a monitoring plan that aligns with your organization's mission and its operations strategy. Include monitoring and observability planning during application development. Including a monitoring plan early in application development can drive an organization toward operational excellence.

Include the following details in your monitoring plan:

  • Include all your systems, including on-premises resources and cloud resources.
  • Include monitoring of your cloud costs to help make sure that scaling events don't cause usage to cross your budget thresholds.
  • Build different monitoring strategies for measuring infrastructure performance, user experience, and business key performance indicators (KPIs). For example, static thresholds might work well to measure infrastructure performance but don't truly reflect the user's experience.

Update the plan as your monitoring strategies mature. Iterate on the plan to improve the health of your systems.

Define metrics that measure all aspects of your organization

Define the metrics that are required to measure how your deployment behaves. To do so:

  • Define your business objectives.
  • Identify the metrics and KPIs that can provide you with quantifiable information to measure performance. Make sure your metric definitions translate to all aspects of your organization, from business needs—including cloud costs—to technical components.
  • Use these metrics to create service level indicators (SLIs) for your applications. For more information, see Choose appropriate SLIs.

Common metrics for various components

Metrics are generated at all levels of your service, from infrastructure and networking to business logic. For example:

  • Infrastructure metrics:
    • Virtual machine statistics, including instances, CPU, memory, utilization, and counts
    • Container-based statistics, including cluster utilization, cluster capacity, pod level utilization, and counts
    • Networking statistics, including ingress/egress, bandwidth between components, latency, and throughput
    • Requests per second, as measured by the load balancer
    • Total disk blocks read, per disk
    • Packets sent over a given network interface
    • Memory heap size for a given process
    • Distribution of response latencies
    • Number of invalid queries rejected by a database instance
  • Application metrics:
    • Application-specific behavior, including queries per second, writes per second, and messages sent per second
  • Managed services statistics metrics:
    • QPS, throughput, latency, utilization for Google-managed services (APIs or products such as BigQuery, App Engine, and Bigtable)
  • Network connectivity statistics metrics:
    • VPN/interconnect-related statistics about connecting to on-premises systems or systems that are external to Google Cloud.
  • SLIs
    • Metrics associated with the overall health of the system.

Set up monitoring

Set up monitoring for both your on-premises resources and your cloud resources.

Choose a monitoring solution that:

  • Is platform independent
  • Provides uniform capabilities for monitoring of on-premises, hybrid, and multi-cloud environments

Using a single platform to consolidate the monitoring data that comes in from different sources lets you build uniform metrics and visualization dashboards.

As you set up monitoring, automate monitoring tasks where possible.

Monitoring with Google Cloud

Using a monitoring service, such as Cloud Monitoring, is easier than building a monitoring service yourself. Monitoring a complex application is a substantial engineering endeavor by itself. Even with existing infrastructure for instrumentation, data collection and display, and alerting in place, it is a full-time job for someone to build and maintain.

Consider using Cloud Monitoring to obtain visibility into the performance, availability, and health of your applications and infrastructure for both on-premises and cloud resources.

Cloud Monitoring is a managed service that is part of Google Cloud Observability. You can use Cloud Monitoring to monitor Google Cloud services and custom metrics. Cloud Monitoring provides an API for integration with third-party monitoring tools.

Cloud Monitoring aggregates metrics, logs, and events from your system's cloud-based infrastructure. That data gives developers and operators a rich set of observable signals that can speed root-cause analysis and reduce mean time to resolution. You can use Cloud Monitoring to define alerts and custom metrics that meet your business objectives and help you aggregate, visualize, and monitor system health.

Cloud Monitoring provides default dashboards for cloud and open source application services. Using the metrics model, you can define custom dashboards with powerful visualization tools and configure charts in Metrics Explorer.
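For example, the following sketch writes a data point for a custom metric with the Cloud Monitoring Python client. The project ID and the metric type under custom.googleapis.com are placeholders that you would replace with your own.

    # Minimal sketch: pip install google-cloud-monitoring
    import time

    from google.cloud import monitoring_v3

    project_id = "your-project-id"  # placeholder
    client = monitoring_v3.MetricServiceClient()
    project_name = f"projects/{project_id}"

    # A hypothetical business metric: checkout latency in milliseconds.
    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/checkout/latency_ms"
    series.resource.type = "global"

    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now), "nanos": int((now - int(now)) * 1e9)}}
    )
    point = monitoring_v3.Point({"interval": interval, "value": {"double_value": 137.0}})
    series.points = [point]

    # Writes one point; call this from your application or a collection job.
    client.create_time_series(name=project_name, time_series=[series])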

Set up alerting

A good alerting system improves your ability to release features. It helps compare performance over time to determine the velocity of feature releases or the need to roll back a feature release. For information about rollbacks, see Restore previous releases seamlessly.

As you set up alerting, map alerts directly to critical metrics. These critical metrics include:

  • The four golden signals:
    • Latency
    • Traffic
    • Errors
    • Saturation
  • System health
  • Service usage
  • Security events
  • User experience

Make alerts actionable to minimize the time to resolution. To do so, for each alert:

  • Include a clear description that states what is monitored and its business impact.
  • Provide all the information necessary to act immediately. If it takes a few clicks and navigation to understand alerts, it is challenging for the on-call person to act.
  • Define priority levels for various alerts.
  • Clearly identify the person or team responsible for responding to the alert.

For critical applications and services, build self-healing actions into the alerts triggered due to common fault conditions such as service health failure, configuration change, or throughput spikes.

As you set up alerts, try to eliminate toil. For example, you can eliminate toil by addressing the causes of frequent errors or by automating the fixes for those errors, which can prevent the alerts from being triggered at all. Eliminating toil lets those on call focus on making your application's operational components reliable. For more information, see Create a culture of automation.
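As an illustration, the following sketch creates an alert policy with the Cloud Monitoring Python client. The project ID, metric filter, threshold, and display names are placeholders; map them to the critical metrics listed above and attach your own notification channels.

    # Minimal sketch: pip install google-cloud-monitoring
    from google.cloud import monitoring_v3

    project_id = "your-project-id"  # placeholder

    client = monitoring_v3.AlertPolicyServiceClient()

    policy = monitoring_v3.AlertPolicy(
        display_name="High 5xx rate (example)",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
        conditions=[
            monitoring_v3.AlertPolicy.Condition(
                display_name="Load balancer 5xx responses above threshold",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    # Placeholder filter: adjust the metric and labels for your service.
                    filter=(
                        'metric.type="loadbalancing.googleapis.com/https/request_count" '
                        'AND metric.labels.response_code_class="500"'
                    ),
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=50,
                    duration={"seconds": 300},  # sustained for 5 minutes
                    aggregations=[
                        monitoring_v3.Aggregation(
                            alignment_period={"seconds": 60},
                            per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_RATE,
                        )
                    ],
                ),
            )
        ],
    )

    created = client.create_alert_policy(
        name=f"projects/{project_id}", alert_policy=policy
    )
    print(f"Created alert policy: {created.name}")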

Build monitoring and alerting dashboards

Once monitoring is in place, build relevant, uncomplicated dashboards that include information from your monitoring and alerting systems.

Choosing visualizations that tie directly to your reliability goals can be difficult. Create dashboards to visualize both of the following:

  • Short-term and real-time analysis
  • Long-term analysis

For more information about implementing visual management, see the capability article Visual management.

Enable logging for critical applications

Logging services are critical to monitoring your systems. While metrics form the basis of specific items to monitor, logs contain valuable information that you need for debugging, security-related analysis, and compliance requirements.

Logging the data your systems generate helps you ensure an effective security posture. For more information about logging and security, see Implement logging and detective controls in the security category of the Architecture Framework.

Cloud Logging is an integrated logging service you can use to store, search, analyze, monitor, and alert on log data and events. Logging automatically collects logs from Google Cloud services and from other cloud providers. You can use these logs to build metrics for monitoring and to create logging exports to external services such as Cloud Storage, BigQuery, and Pub/Sub.
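For example, the following sketch writes a structured log entry with the Cloud Logging Python client; the log name and payload fields are placeholders. Structured payloads make the entries easier to filter and to turn into log-based metrics.

    # Minimal sketch: pip install google-cloud-logging
    import google.cloud.logging

    client = google.cloud.logging.Client()

    # Optional: route Python's standard `logging` module through Cloud Logging.
    client.setup_logging()

    # Write a structured entry to a named log (placeholder name and fields).
    logger = client.logger("checkout-service")
    logger.log_struct(
        {
            "event": "payment_failed",
            "order_id": "A-12345",
            "latency_ms": 842,
        },
        severity="WARNING",
    )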

Set up an audit trail

To help answer questions like "who did what, where, and when" in your Google Cloud projects, use Cloud Audit Logs.

Cloud Audit Logs captures several types of activity, such as the following:

  • Admin Activity logs contain log entries for API calls or other administrative actions that modify the configuration or metadata of resources. Admin Activity logs are always enabled.
  • Data Access audit logs record API calls that create, modify, or read user-provided data. Data Access audit logs are disabled by default because they can be quite large. You can configure which Google Cloud services produce data access logs.

For a list of Google Cloud services that write audit logs, see Google services with audit logs. Use Identity and Access Management (IAM) controls to limit who has access to view audit logs.
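For example, the following sketch reads recent Admin Activity audit log entries with the Cloud Logging Python client. The project ID is a placeholder, and the payload fields mentioned in the comment depend on the service that wrote each entry.

    # Minimal sketch: pip install google-cloud-logging
    import google.cloud.logging

    project_id = "your-project-id"  # placeholder
    client = google.cloud.logging.Client(project=project_id)

    # Admin Activity audit logs are written to this log name in each project.
    log_filter = (
        f'logName="projects/{project_id}/logs/cloudaudit.googleapis.com%2Factivity"'
    )

    for entry in client.list_entries(filter_=log_filter, max_results=10):
        # For audit entries, the payload typically includes fields such as
        # methodName and authenticationInfo.principalEmail.
        print(entry.timestamp, entry.log_name)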

What's next

Establish cloud support and escalation processes

This document in the Google Cloud Architecture Framework shows you how to define an effective escalation process. Establishing support from your cloud provider or other third-party service providers is a key part of effective escalation management.

Google Cloud provides various support channels, including live support and published guidance such as developer communities and product documentation. A Cloud Customer Care offering ensures that you can work with Google Cloud to run your workloads efficiently.

Establish support from your providers

Purchase a support contract from your cloud provider or other third-party service providers. Support is critical to ensure the prompt response and resolution of various operational issues.

To work with Google Cloud Customer Care, consider purchasing a Customer Care offering that includes Standard, Enhanced, or Premium Support. Consider using Enhanced or Premium Support for your major production environments.

Define your escalation process

A well-defined escalation process is key to reducing the effort and time that it takes to identify and address any issues in your systems. This includes issues that require support for Google Cloud products or for other cloud providers or third-party services.

To create your escalation path:

  • Define when and how to escalate issues internally.
  • Define when and how to create support cases with your cloud provider or other third-party service provider.
  • Learn how to work with the teams that provide you support. For Google Cloud, you and your operations teams should review the Best practices for working with Customer Care. Incorporate these practices into your escalation path.
  • Find or create documents that describe your architecture. Ensure these documents include information that is helpful for support engineers.
  • Define how your teams communicate during an outage.
  • Ensure that people who need support have appropriate levels of support permissions to access the Google Cloud Support Center, or to communicate with other support providers. To learn about using the Google Cloud Support Center, visit Support procedures.
  • Set up monitoring, alerting, and logging so that you have the information needed to act on when issues arise.
  • Create templates for incident reporting. For information to include in your incident reports, see Best practices for working with Customer Care.
  • Document your organization's escalation process. Ensure that you have clear, well-defined actions to address escalations.
  • Include a plan to teach new team members how to interact with support.

Regularly test your escalation process internally. Test your escalation process before major events, such as migrations, new product launches, and peak traffic events. If you have Google Cloud Customer Care Premium Support, your Technical Account Manager can help review your escalation process and coordinate your tests with Google Cloud Customer Care.

Ensure you receive communication from support

Ensure that your administrators are receiving communication from your cloud providers and third-party services. This information allows admins to make informed decisions and fix issues before they cause larger problems. Ensure that the following are true:

  • Security, network, and system administrators are set up to receive critical emails from Google Cloud and your other providers or services.
  • Security, network, and system administrators are set up to receive system alerts generated by monitoring tools, like Cloud Monitoring.
  • Project owners have email-routable usernames so that they can receive critical emails.

For information about managing notifications from Google Cloud, see Managing contacts for notifications.

Establish review processes

Establish a review or postmortem process. Follow this process after you raise a new support ticket or escalate an existing support ticket. As part of the postmortem, document the lessons learned and track mitigations. As you do this review, foster a blameless postmortem culture.

For more information about postmortems, see Postmortem Culture: Learning from Failure.

Build centers of excellence

It can be valuable to capture your organization's information, experience, and patterns in an internal knowledge base, such as a wiki, Google site, or intranet site. As new products and features are continually being rolled out in Google Cloud, this knowledge can help track why you chose a particular design for your applications and services. For more information, see Architecture decision records.

It's also a good practice to nominate Google Cloud experts and champions in your organization. A range of training and certification options are available to help nominated champions grow in their area of expertise. Teams can stay up to date on the latest news, announcements, and customer stories by subscribing to the Google Cloud blog.

What's next

Manage capacity and quota

This document in the Google Cloud Architecture Framework shows you how to evaluate and plan your capacity and quota on the cloud.

In conventional data centers, you typically spend cycles each quarter reviewing current resource requirements and forecasting future ones. You must consider physical, logistical, and human-resource-related concerns, such as rack space, cooling, electricity, bandwidth, cabling, procurement times, shipping times, and how many engineers are available to rack and stack new equipment. You also have to actively manage capacity and workload distributions so that resource-intensive jobs, such as Hadoop pipelines, don't interfere with services, such as web servers, that must be highly available.

In contrast, when you use Google Cloud you cede most capacity planning to Google. Using the cloud means you don't have to provision and maintain idle resources when they aren't needed. For example, you can create, scale up, and scale down VM instances as needed. Because you pay for what you use, you can optimize your spending, including excess capacity that you only need at peak traffic times. To help you save, Compute Engine provides machine type recommendations if it detects that you have underutilized VM instances that can be resized or deleted.

Evaluate your cloud capacity requirements

To manage your capacity effectively, you need to know your organization's capacity requirements.

To evaluate your capacity requirements, start by identifying your top cloud workloads. Evaluate the average and peak utilizations of these workloads, and their current and future capacity needs.

Identify the teams who use these top workloads. Work with them to establish an internal demand-planning process. Use this process to understand their current and forecasted cloud resource needs.

Analyze load patterns and call distribution. Use factors like the peak over the last 30 days, the hourly peak, and the per-minute peak in your analysis.

Consider using Cloud Monitoring to get visibility into the performance, uptime, and overall health of your applications and infrastructure.

View your infrastructure utilization metrics

To make capacity planning easier, gather and store historical data about your organization's use of cloud resources.

Ensure you have visibility into infrastructure utilization metrics. For example, for top workloads, evaluate the following:

  • Average and peak utilization
  • Spikes in usage patterns
  • Seasonal spikes based on business requirements, such as holiday periods for retailers
  • How much over-provisioning is needed to prepare for peak events and rapidly handle potential traffic spikes

Ensure your organization has set up alerts that automatically notify you when you get close to quota and capacity limits.

Use Google's monitoring tools to get insights on application usage and capacity. For example, you can define custom metrics with Monitoring. Use these custom metrics to define alerting trends. Monitoring also provides flexible dashboards and rich visualization tools to help identify emergent issues.

Create a process for capacity planning

Establish a process for capacity planning and document this plan.

As you create this plan, do the following:

  1. Run load tests to determine how much load the system can handle while meeting its latency targets, given a fixed amount of resources. Load tests should use a mix of request types that matches production traffic profiles from live users. Don't use a uniform or random mix of operations. Include spikes in usage in your traffic profile.
  2. Create a capacity model. A capacity model is a set of formulas for calculating the incremental resources needed per unit increase in service load, as determined from load testing. See the sketch after this list for a simple example.
  3. Forecast future traffic and account for growth. See the article Measure Future Load for a summary of how Google builds traffic forecasts.
  4. Apply the capacity model to the forecast to determine future resource needs.
  5. Estimate the cost of resources your organization needs. Then, get budget approval from your Finance organization. This step is essential because the business can choose to make cost versus risk tradeoffs across a range of products. Those tradeoffs can mean acquiring capacity that's lower or higher than the predicted need for a given product based on business priorities.
  6. Work with your cloud provider to get the correct amount of resources at the correct time with quotas and reservations. Involve infrastructure teams for capacity planning and have operations create capacity plans with confidence intervals.
  7. Repeat the previous steps every quarter or two.
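The following is a minimal, hypothetical sketch of the kind of capacity model described in step 2: the coefficients come from your load tests, and the forecast load comes from step 3.

    # Hypothetical capacity model: coefficients are measured in load tests.
    from dataclasses import dataclass


    @dataclass
    class CapacityModel:
        base_vcpus: float         # fixed overhead measured at near-zero load
        vcpus_per_1k_rps: float   # incremental vCPUs per 1,000 requests per second
        headroom: float = 0.3     # extra capacity for spikes and failover (30%)

        def required_vcpus(self, forecast_rps: float) -> float:
            demand = self.base_vcpus + self.vcpus_per_1k_rps * (forecast_rps / 1000)
            return demand * (1 + self.headroom)


    # Example: load tests showed about 8 vCPUs of overhead plus 12 vCPUs per 1,000 RPS.
    model = CapacityModel(base_vcpus=8, vcpus_per_1k_rps=12)
    print(f"vCPUs needed at a forecast of 5,000 RPS: {model.required_vcpus(5000):.0f}")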

For more detailed guidance on the process of planning capacity while also optimizing resource usage, see Capacity Planning.

Ensure your quotas match your capacity requirements

Google Cloud uses quotas to restrict how much of a particular shared Google Cloud resource you can use. Each quota represents a specific countable resource, such as API calls to a particular service, the number of load balancers used concurrently by your project, or the number of projects that you can create. For example, quotas ensure that a few customers or projects can't monopolize CPU cores in a particular region or zone.

As you review your quota, consider these details:

  • Plan the capacity requirements of your projects in advance to prevent unexpected limiting of your resource consumption.
  • Set up your quota and capacity to handle full region failure.
  • Use quotas to cap the consumption of a particular resource. For example, you can set a maximum query usage per day quota over the BigQuery API to ensure that a project doesn't overspend on BigQuery.
  • Plan for spikes in usage and include these spikes as part of your quota planning. Spikes in usage can be expected fluctuations throughout the day, unexpected peak traffic events, or known peak traffic and launch events. For details about how to plan for peak traffic and launch events, read the next section in Operational Excellence: Plan for peak traffic and launch events.

If your current quotas aren't sufficient, you can manage your quota using the Google Cloud console. If you require a large amount of capacity, contact your Google Cloud sales team. However, be aware that many services also have limits that are unrelated to the quota system. For more information, see Working with quotas.

Regularly review your quotas. Submit quota requests before they're needed. Read Working with quotas for details about how quota requests are evaluated and how requests are approved or denied.

There are several ways to view and manage your Google Cloud quota:
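For example, the following sketch uses the Compute Engine Python client to list regional quotas and flag any that are above 80 percent utilization; the project ID, region, and threshold are placeholders.

    # Minimal sketch: pip install google-cloud-compute
    from google.cloud import compute_v1

    project_id = "your-project-id"  # placeholder
    region = "us-central1"          # placeholder

    client = compute_v1.RegionsClient()
    region_info = client.get(project=project_id, region=region)

    # Each quota exposes a metric name, current usage, and limit.
    for quota in region_info.quotas:
        if quota.limit and quota.usage / quota.limit >= 0.8:
            print(f"{quota.metric}: {quota.usage:.0f} of {quota.limit:.0f} used (80% or more)")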

What's next

Plan for peak traffic and launch events

This document in the Google Cloud Architecture Framework shows you how to plan for peak traffic and launch events to avoid disrupting your business.

Peak events are major business-related events that cause a sharp increase in traffic beyond the application's standard baseline. These peak events require planned scaling.

For example, retail businesses with an online presence can expect peak events during holidays. Black Friday, which occurs the day after Thanksgiving in the United States, is one of the busiest shopping days of the year. For the healthcare industry in the United States, the months of October and November can have peak events due to spikes in online traffic for benefits enrollment.

Launch events are any substantial rollouts or migrations of new capability in production. For example, a migration from on-premises to the cloud, or a launch of a new product, service, or feature.

If you are launching a new product, you should expect an increased load on your systems during the announcement and potentially after. These events can often cause load increases of 5–20 times (or greater) on frontend systems. That increased load increases the load on backend systems as well. Often, these frontend and backend loads are characterized by rapid scaling over a short time as the event opens for web traffic. Launch events involve a trailing decrease in traffic to normal levels. This decrease is usually slower than the scale to peak.

Peak and launch events include three stages:

  • Planning and preparation for the launch or peak traffic event
  • Launching the event
  • Reviewing event performance and conducting a post-event analysis

The practices described in this document can help each of these stages run smoothly.

Create a general playbook for launch and peak events

Build a general playbook with a long-term view of current and future peak events. Keep adding lessons learned to the document, so it can be a reference for future peak events.

Plan for your launch and for peak events

Plan ahead. Create business projections for upcoming launches and for expected (and unexpected) peak events. Preparing your system for scale spikes depends on understanding your business projections. The more you know about prior forecasts, the more accurate you can make your new business forecasts. Those new forecasts are critical inputs into projecting expected demand on your system.

Establishing program management and coordinated planning—across your organization and with third-party vendors—is also a key to success. Get these teams set up early so that your program management team can set timelines, secure budgets, and gather resources for additional infrastructure, testing support, and training.

It's important to set up clear communication channels. Communication is critical for all stages of a launch or a peak event. Discuss risks and areas of concern early and swarm issues before they become blockers. Create event planning documentation. Condense the most critical information about the peak event and distribute it. Doing so helps people absorb planning information and resolves basic questions. The document helps bring new people up to speed on peak-event planning.

Document your plan for each event. As you document your plan, ensure that you do the following:

  • Identify any assumptions, risks, and unknown factors.
  • Review past events to determine relevant information for the upcoming launch or peak event. Determine what data is available and what value that data has provided in the past.
  • Detail the rollback plan for launch and migration events.
  • Perform an architecture review:
    • Document key resources and architectural components.
    • Use the Architecture Framework to review all aspects of the environment for risks and scale concerns.
    • Create a diagram that shows how the major components of the architecture are connected. A review of the diagram might help you isolate issues and expedite their resolution.
  • If appropriate, configure the service to use alert actions to auto-restart if there's a failure. When using Compute Engine, consider using autoscaling for handling throughput spikes.
  • To make sure that Compute Engine resources are available when you need them, use Reservations. Reservations provide a very high level of assurance in obtaining capacity for Compute Engine zonal resources. You can use reservations to help ensure that your project has resources available.
  • Identify metrics and alerts to track:
    • Identify business and system metrics to monitor for the event. If any metrics or service level indicators (SLIs) aren't being collected, modify the system to collect this data.
    • Ensure you have sufficient monitoring and alerting capabilities and have reviewed normal and previous peak traffic patterns. Ensure alerts are set appropriately. Use Google Cloud Monitoring tools to view application use, capacity, and the overall health of your applications and infrastructure.
    • Ensure system metrics are being captured with monitoring and alert levels of interest.
  • Review your increased capacity requirements with your Google Cloud account team and plan for the required quota management. For more details, review Ensure your quotas match your capacity requirements.
  • Ensure you have appropriate cloud support levels, your team understands how to open support cases, and you have an escalation path established. For more details, review Establish cloud support and escalation processes.
  • Define a communication plan, timeline, and responsibilities:
    • Engage cross-functional stakeholders to coordinate communication and program planning. These stakeholders can include appropriate people from technical, operational, and leadership teams, and third-party vendors.
    • Establish an unambiguous timeline containing critical tasks and the teams that own them.
    • Establish a responsibility assignment matrix (RACI) to communicate ownership for teams, team leads, stakeholders, and responsible parties.
    • You can use Premium Support's Event Management Service for planned peak events. With this service, Customer Care partners with your team to create a plan and provide guidance throughout the event.

Establish review processes

When the peak traffic event or launch event is over, review the event to document the lessons you learned. Then, update your playbook with those lessons. Finally, apply what you learned to the next major event. Learning from prior events is important, especially when they highlight constraints to the system while under stress.

Retrospective reviews, also called postmortems, for peak traffic events or launch events are a useful technique for capturing data and understanding the incidents that occurred. Do this review for peak traffic and launch events that went as expected, and for any incidents that caused problems. As you do this review, foster a blameless culture.

For more information about postmortems, see Postmortem Culture: Learning from Failure.

What's next

Create a culture of automation

This document in the Google Cloud Architecture Framework shows you how to assess toil and mitigate its impacts on your systems and your teams.

Toil is manual and repetitive work with no enduring value, and it increases as a service grows. Continually aim to reduce or eliminate toil. Otherwise, operational work can eventually overwhelm operators, and any growth in product use or complexity can require additional staffing.

Automation is a key way to minimize toil. Automation also improves release velocity and helps minimize human-induced errors.

For more information, see Eliminating Toil.

Create an inventory and assess the cost of toil

Start by creating an inventory and assessing the cost of toil on the teams managing your systems. Make this a continuous process, followed by investing in customized automation to extend what's already provided by Google Cloud services and partners. You can often modify Google Cloud's own automation—for example, Compute Engine's autoscaler.

Prioritize eliminating toil

Automation is useful but isn't a solution to all operational problems. As a first step in addressing known toil, we recommend that you review your inventory of existing toil and prioritize eliminating as much toil as you can. Then, you can focus on automation.

Automate necessary toil

Some toil in your systems cannot be eliminated. As a second step in addressing known toil, automate this toil using the solutions that Google Cloud provides through configurable automation.

The following are some areas where configurable automation or customized automation can assist your organization in eliminating toil:

  • Identity management—for example, Cloud Identity and Identity and Access Management.
  • Google Cloud hosted solutions, as opposed to self-designed solutions—for example, cluster management (Google Kubernetes Engine (GKE)), relational database management (Cloud SQL), data warehouse management (BigQuery), and API management (Apigee).
  • Google Cloud services and tenant provisioning—for example Terraform and Cloud Foundation Toolkit.
  • Automated workflow orchestration for multi-step operations—for example, Cloud Composer.
  • Additional capacity provisioning—for example, several Google Cloud products, like Compute Engine and GKE, offer configurable autoscaling. Evaluate the Google Cloud services you are using to determine if they include configurable autoscaling.
  • CI/CD pipelines with automated deployment—for example, Cloud Build.
  • Canary analysis to validate deployments.
  • Automated model training (for machine learning)—for example, AutoML.

If a Google Cloud product or service only partially satisfies your technical needs when automating or eliminating manual workflows, consider filing a feature request through your Google Cloud account representative. Your issue might be a priority for other customers or already a part of our roadmap. If so, knowing the feature's priority and timeline helps you to better assess the trade-offs of building your own solution versus waiting to use a Google Cloud feature.

Build or buy solutions for high-cost toil

The third step, which can be completed in parallel with the first and second steps, entails evaluating building or buying other solutions if your toil cost stays high—for example, if toil takes a significant amount of time for any team managing your production systems.

When building or buying solutions, consider integration, security, privacy, and compliance costs. Designing and implementing your own automation comes with maintenance costs and risks to reliability beyond its initial development and setup costs, so consider this option as a last resort.

What's next

Explore other categories in the Architecture Framework such as system design, security, privacy, and compliance, reliability, cost optimization, and performance optimization.

Google Cloud Architecture Framework: Security, privacy, and compliance

This category in the Google Cloud Architecture Framework shows you how to architect and operate secure services on Google Cloud. You also learn about Google Cloud products and features that support security and compliance.

The Architecture Framework describes best practices, provides implementation recommendations, and explains some of the available products and services. The framework helps you design your Google Cloud deployment so that it matches your business needs.

Moving your workloads into Google Cloud requires an evaluation of your business requirements, risks, compliance obligations, and security controls. This document helps you consider key best practices related to designing a secure solution in Google Cloud.

Google's core security principles include defense in depth, at scale, and by default. In Google Cloud, data and systems are protected through multiple layered defenses using policies and controls that are configured across IAM, encryption, networking, detection, logging, and monitoring.

Google Cloud comes with many security controls that you can build on, such as the following:

  • Secure options for data in transit, and default encryption for data at rest.
  • Built-in security features for Google Cloud products and services.
  • A global infrastructure that's designed for geo-redundancy, with security controls throughout the information processing lifecycle.
  • Automation capabilities that use infrastructure as code (IaC) and configuration guardrails.

For more information about the security posture of Google Cloud, see the Google security paper and the Google Infrastructure Security Design Overview. For an example secure-by-default environment, see the Google Cloud enterprise foundations blueprint.

In the security category of the Architecture Framework, you learn to do the following:

Shared responsibilities and shared fate on Google Cloud

This document describes the differences between the shared responsibility model and shared fate in Google Cloud. It discusses the challenges and nuances of the shared responsibility model. This document describes what shared fate is and how we partner with our customers to address cloud security challenges.

Understanding the shared responsibility model is important when determining how to best protect your data and workloads on Google Cloud. The shared responsibility model describes the tasks that you have when it comes to security in the cloud and how these tasks are different for cloud providers.

Understanding shared responsibility, however, can be challenging. The model requires an in-depth understanding of each service you utilize, the configuration options that each service provides, and what Google Cloud does to secure the service. Every service has a different configuration profile, and it can be difficult to determine the best security configuration. Google believes that the shared responsibility model stops short of helping cloud customers achieve better security outcomes. Instead of shared responsibility, we believe in shared fate.

Shared fate includes us building and operating a trusted cloud platform for your workloads. We provide best practice guidance and secured, attested infrastructure code that you can use to deploy your workloads in a secure way. We release solutions that combine various Google Cloud services to solve complex security problems and we offer innovative insurance options to help you measure and mitigate the risks that you must accept. Shared fate involves us more closely interacting with you as you secure your resources on Google Cloud.

Shared responsibility

You're the expert in knowing the security and regulatory requirements for your business, and knowing the requirements for protecting your confidential data and resources. When you run your workloads on Google Cloud, you must identify the security controls that you need to configure in Google Cloud to help protect your confidential data and each workload. To decide which security controls to implement, you must consider the following factors:

  • Your regulatory compliance obligations
  • Your organization's security standards and risk management plan
  • Security requirements of your customers and your vendors

Defined by workloads

Traditionally, responsibilities are defined based on the type of workload that you're running and the cloud services that you require. Cloud services include the following categories:

Infrastructure as a service (IaaS)

IaaS services include Compute Engine, Cloud Storage, and networking services such as Cloud VPN, Cloud Load Balancing, and Cloud DNS.

IaaS provides compute, storage, and network services on demand with pay-as-you-go pricing. You can use IaaS if you plan on migrating an existing on-premises workload to the cloud using lift-and-shift, or if you want to run your application on particular VMs, using specific databases or network configurations.

In IaaS, the bulk of the security responsibilities are yours, and our responsibilities are focused on the underlying infrastructure and physical security.

Platform as a service (PaaS)

PaaS services include App Engine, Google Kubernetes Engine (GKE), and BigQuery.

PaaS provides the runtime environment that you can develop and run your applications in. You can use PaaS if you're building an application (such as a website), and want to focus on development not on the underlying infrastructure.

In PaaS, we're responsible for more controls than in IaaS, including network controls. You share responsibility with us for application-level controls and IAM management. You remain responsible for your data security and client protection.

Software as a service (SaaS)

SaaS applications include Google Workspace, Chronicle, and third-party SaaS applications that are available in Google Cloud Marketplace.

SaaS provides online applications that you can subscribe to or pay for in some way. You can use SaaS applications when your enterprise doesn't have the internal expertise or business requirement to build the application itself, but still needs the ability to process workloads.

In SaaS, we own the bulk of the security responsibilities. You remain responsible for your access controls and the data that you choose to store in the application.

Function as a service (FaaS) or serverless

FaaS provides the platform for developers to run small, single-purpose code (called functions) that runs in response to particular events. You can use FaaS when you want specific actions to occur based on a particular event. For example, you might create a function that runs whenever data is uploaded to Cloud Storage so that the data can be classified.

FaaS has a shared responsibility model similar to that of SaaS. Cloud Functions is a FaaS offering.

The following diagram shows the cloud services and defines how responsibilities are shared between the cloud provider and customer.

Shared security responsibilities

As the diagram shows, the cloud provider always remains responsible for the underlying network and infrastructure, and customers always remain responsible for their access policies and data.

Defined by industry and regulatory framework

Various industries have regulatory frameworks that define the security controls that must be in place. When you move your workloads to the cloud, you must understand the following:

  • Which security controls are your responsibility
  • Which security controls are available as part of the cloud offering
  • Which default security controls are inherited

Inherited security controls (such as our default encryption and infrastructure controls) are controls that you can provide as part of your evidence of your security posture to auditors and regulators. For example, the Payment Card Industry Data Security Standard (PCI DSS) defines regulations for payment processors. When you move your business to the cloud, these regulatory responsibilities are shared between you and your cloud service provider (CSP). To understand how PCI DSS responsibilities are shared between you and Google Cloud, see Google Cloud: PCI DSS Shared Responsibility Matrix.

As another example, in the United States, the Health Insurance Portability and Accountability Act (HIPAA) has set standards for handling electronic personal health information (PHI). These responsibilities are also shared between the CSP and you. For more information on how Google Cloud meets our responsibilities under HIPAA, see HIPAA - Compliance.

Other industries (for example, finance or manufacturing) also have regulations that define how data can be gathered, processed, and stored. For more information about shared responsibility related to these, and how Google Cloud meets our responsibilities, see Compliance resource center.

Defined by location

Depending on your business scenario, you might need to consider your responsibilities based on the location of your business offices, your customers, and your data. Different countries and regions have created regulations that inform how you can process and store your customer's data. For example, if your business has customers who reside in the European Union, your business might need to abide by the requirements that are described in the General Data Protection Regulation (GDPR), and you might be obligated to keep your customer data in the EU itself. In this circumstance, you are responsible for ensuring that the data that you collect remains in the Google Cloud regions in the EU. For more information about how we meet our GDPR obligations, see GDPR and Google Cloud.

For information about the requirements related to your region, see Compliance offerings. If your scenario is particularly complicated, we recommend speaking with our sales team or one of our partners to help you evaluate your security responsibilities.

Challenges for shared responsibility

Though shared responsibility helps define the security roles that you or the cloud provider has, relying on shared responsibility can still create challenges. Consider the following scenarios:

  • Most cloud security breaches are the direct result of misconfiguration (listed as number 3 in the Cloud Security Alliance's Pandemic 11 Report) and this trend is expected to increase. Cloud products are constantly changing, and new ones are constantly being launched. Keeping up with constant change can seem overwhelming. Customers need cloud providers to provide them with opinionated best practices to help keep up with the change, starting with best practices by default and having a baseline secure configuration.
  • Though dividing items by cloud services is helpful, many enterprises have workloads that require multiple cloud services types. In this circumstance, you must consider how various security controls for these services interact, including whether they overlap between and across services. For example, you might have an on-premises application that you're migrating to Compute Engine, use Google Workspace for corporate email, and also run BigQuery to analyze data to improve your products.
  • Your business and markets are constantly changing: regulations change, you enter new markets, and you acquire other companies. Your new markets might have different requirements, and a new acquisition might host its workloads on another cloud. To manage the constant changes, you must regularly reassess your risk profile and be able to implement new controls quickly.
  • How and where to manage your data encryption keys is an important decision that ties with your responsibilities to protect your data. The option that you choose depends on your regulatory requirements, whether you're running a hybrid cloud environment or still have an on-premises environment, and the sensitivity of the data that you're processing and storing.
  • Incident management is an important, and often overlooked, area where your responsibilities and the cloud provider responsibilities aren't easily defined. Many incidents require close collaboration and support from the cloud provider to help investigate and mitigate them. Other incidents can result from poorly configured cloud resources or stolen credentials, and ensuring that you meet the best practices for securing your resources and accounts can be quite challenging.
  • Advanced persistent threats (APTs) and new vulnerabilities can impact your workloads in ways that you might not consider when you start your cloud transformation. Staying up to date on the changing threat landscape, and knowing who is responsible for threat mitigation, is difficult, particularly if your business doesn't have a large security team.

Shared fate

We developed shared fate in Google Cloud to start addressing the challenges that the shared responsibility model doesn't address. Shared fate focuses on how all parties can better interact to continuously improve security. Shared fate builds on the shared responsibility model because it views the relationship between cloud provider and customer as an ongoing partnership to improve security.

Shared fate is about us taking responsibility for making Google Cloud more secure. Shared fate includes helping you get started with a secured landing zone and being clear, opinionated, and transparent about recommended security controls, settings, and associated best practices. It includes helping you better quantify and manage your risk with cyber-insurance, using our Risk Protection Program. Using shared fate, we want to evolve from the standard shared responsibility framework to a better model that helps you secure your business and build trust in Google Cloud.

The following sections describe various components of shared fate.

Help getting started

A key component of shared fate is the resources that we provide to help you get started with a secure configuration in Google Cloud. Starting with a secure configuration helps reduce the issue of misconfiguration, which is the root cause of most security breaches.

Our resources include the following:

  • Enterprise foundations blueprint that discusses top security concerns and our top recommendations.
  • Secure blueprints that let you deploy and maintain secure solutions using infrastructure as code (IaC). Blueprints have our security recommendations enabled by default. Many blueprints are created by Google security teams and managed as products. This support means that they're updated regularly, go through a rigorous testing process, and receive attestations from third-party testing groups. Blueprints include the enterprise foundations blueprint, the secured data warehouse blueprint, and the Vertex AI Workbench notebooks blueprint.

  • Architecture Framework best practices that address the top recommendations for building security into your designs. The Architecture Framework includes a security section and a community zone that you can use to connect with experts and peers.

  • Landing zone navigation guides that step you through the top decisions that you need to make to build a secure foundation for your workloads, including resource hierarchy, identity onboarding, security and key management, and network structure.

Risk Protection Program

Shared fate also includes the Risk Protection Program (currently in preview), which helps you use the power of Google Cloud as a platform to manage risk, rather than just seeing cloud workloads as another source of risk that you need to manage. The Risk Protection Program is a collaboration between Google Cloud and two leading cyber insurance companies, Munich Re and Allianz Global Corporate & Specialty.

The Risk Protection Program includes Risk Manager, which provides data-driven insights that you can use to better understand your cloud security posture. If you're looking for cyber insurance coverage, you can share these insights from Risk Manager directly with our insurance partners to obtain a quote. For more information, see Google Cloud Risk Protection Program now in Preview.

Help with deployment and governance

Shared fate also helps with your continued governance of your environment. For example, we focus efforts on products such as the following:

Putting shared responsibility and shared fate into practice

As part of your planning process, consider the following actions to help you understand and implement appropriate security controls:

  • Create a list of the types of workloads that you will host in Google Cloud, and whether they require IaaS, PaaS, or SaaS services. You can use the shared responsibility diagram as a checklist to ensure that you know the security controls that you need to consider.
  • Create a list of regulatory requirements that you must comply with, and access resources in the Compliance resource center that relate to those requirements.
  • Review the list of available blueprints and architectures in the Architecture Center for the security controls that you require for your particular workloads. The blueprints provide a list of recommended controls and the IaC code that you require to deploy that architecture.
  • Use the landing zone documentation and the recommendations in the enterprise foundations guide to design a resource hierarchy and network architecture that meets your requirements. You can use the opinionated workload blueprints, like the secured data warehouse, to accelerate your development process.
  • After you deploy your workloads, verify that you're meeting your security responsibilities using services such as the Risk Manager, Assured Workloads, Policy Intelligence tools, and Security Command Center Premium.

For more information, see the CISO's Guide to Cloud Transformation paper.

What's next

Security principles

This document in the Google Cloud Architecture Framework explains core principles for running secure and compliant services on Google Cloud. Many of the security principles that you're familiar with in your on-premises environment apply to cloud environments.

Build a layered security approach

Implement security at each level in your application and infrastructure by applying a defense-in-depth approach. Use the features in each product to limit access and configure encryption where appropriate.

Design for secured decoupled systems

Simplify system design to accommodate flexibility where possible, and document the security requirements of each component. Incorporate robust security mechanisms that account for resiliency and recovery.

Automate deployment of sensitive tasks

Take humans out of the workstream by automating deployment and other admin tasks.

Automate security monitoring

Use automated tools to monitor your application and infrastructure. To scan your infrastructure for vulnerabilities and detect security incidents, use automated scanning in your continuous integration and continuous deployment (CI/CD) pipelines.

Meet the compliance requirements for your regions

Be mindful that you might need to obfuscate or redact personally identifiable information (PII) to meet your regulatory requirements. Where possible, automate your compliance efforts. For example, use Sensitive Data Protection and Dataflow to automate the PII redaction job before new data is stored in the system.
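
For example, outside of a full Dataflow pipeline, the following minimal sketch shows how the Sensitive Data Protection API can redact email addresses and phone numbers from a text record before you store it; the project ID and infoTypes are placeholders.

# Minimal sketch: redact PII from a text record with the
# Sensitive Data Protection (Cloud DLP) API before the record is stored.
from google.cloud import dlp_v2

def redact_pii(project_id: str, text: str) -> str:
    client = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    inspect_config = {
        "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]
    }
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    }
    response = client.deidentify_content(
        request={
            "parent": parent,
            "inspect_config": inspect_config,
            "deidentify_config": deidentify_config,
            "item": {"value": text},
        }
    )
    return response.item.value

# The project ID and sample text are placeholders.
print(redact_pii("my-project", "Contact jane@example.com at 555-0100"))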

Comply with data residency and sovereignty requirements

You might have internal (or external) requirements that require you to control the locations of data storage and processing. These requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and culture. Data residency describes where your data is stored. To help comply with data residency requirements, Google Cloud lets you control where data is stored, how data is accessed, and how it's processed.

Shift security left

DevOps and deployment automation let your organization increase the velocity of delivering products. To help ensure that your products remain secure, incorporate security processes from the start of the development process. For example, you can do the following:

  • Test for security issues in code early in the deployment pipeline.
  • Scan container images and the cloud infrastructure on an ongoing basis.
  • Automate detection of misconfiguration and security anti-patterns. For example, use automation to look for secrets that are hard-coded in applications or in configuration.
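
For example, a simplistic scanner like the following sketch can flag obvious hard-coded secrets as part of a CI step. The patterns and file types are illustrative only; dedicated secret-scanning tools provide much broader coverage.

# Simplistic illustration: flag likely hard-coded secrets in Python source files.
# Real pipelines typically rely on dedicated secret-scanning tools instead.
import pathlib
import re
import sys

PATTERNS = [
    re.compile(r"AIza[0-9A-Za-z_\-]{35}"),                 # Google API key format
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # embedded private keys
    re.compile(r"(password|secret|api_key)\s*=\s*['\"][^'\"]+['\"]", re.I),
]

def scan(root: str) -> int:
    """Return the number of suspicious lines found under the given directory."""
    findings = 0
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if any(p.search(line) for p in PATTERNS):
                print(f"Possible secret: {path}:{lineno}")
                findings += 1
    return findings

if __name__ == "__main__":
    # Fail the build step if anything suspicious is found.
    sys.exit(1 if scan(".") else 0)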

What's next

Learn more about core security principles with the following resources:

Manage risk with controls

This document in the Google Cloud Architecture Framework describes best practices for managing risks in a cloud deployment. Performing a careful analysis of the risks that apply to your organization allows you to determine the security controls that you require. You should complete risk analysis before you deploy workloads on Google Cloud, and regularly afterwards as your business needs, regulatory requirements, and the threats relevant to your organization change.

Identify risks to your organization

Before you create and deploy resources on Google Cloud, complete a risk assessment to determine what security features you need in order to meet your internal security requirements and external regulatory requirements. Your risk assessment provides you with a catalog of risks that are relevant to you, and tells you how capable your organization is in detecting and counteracting security threats.

Your risks in a cloud environment differ from your risks in an on-premises environment due to the shared responsibility arrangement that you enter with your cloud provider. For example, in an on-premises environment you need to mitigate vulnerabilities to the hardware stack. In contrast, in a cloud environment these risks are borne by the cloud provider.

In addition, your risks differ depending on how you plan on using Google Cloud. Are you transferring some of your workloads to Google Cloud, or all of them? Are you using Google Cloud only for disaster recovery purposes? Are you setting up a hybrid cloud environment?

We recommend that you use an industry-standard risk assessment framework that applies to cloud environments and to your regulatory requirements. For example, the Cloud Security Alliance (CSA) provides the Cloud Controls Matrix (CCM). In addition, there are threat models such as OWASP application threat modeling that provide you with a list of potential gaps, and that suggest actions to remediate any gaps that are found. You can check our partner directory for a list of experts in conducting risk assessments for Google Cloud.

To help catalog your risks, consider Risk Manager, which is part of the Risk Protection Program. (This program is currently in preview.) Risk Manager scans your workloads to help you understand your business risks. Its detailed reports provide you with a security baseline. In addition, you can use Risk Manager reports to compare your risks against the risks outlined in the Center for Internet Security (CIS) Benchmark.

After you catalog your risks, you must determine how to address them—that is, whether you want to accept, avoid, transfer, or mitigate them. The following section describes mitigation controls.

Mitigate your risks

You can mitigate risks using technical controls, contractual protections, and third-party verifications or attestations. The following table lists how you can use these mitigations when you adopt new public cloud services.

Mitigation Description
Technical controls Technical controls refer to the features and technologies that you use to protect your environment. These include built-in cloud security controls, such as firewalls and logging. Technical controls can also include using third-party tools to reinforce or support your security strategy.

There are two categories of technical controls:
  • Google Cloud includes various security controls to let you mitigate the risks that apply to you. For example, if you have an on-premises environment, you can use Cloud VPN and Cloud Interconnect to secure the connection between your on-premises and your cloud resources.
  • Google has robust internal controls and auditing to protect against insider access to customer data. Our audit logs provide our customers with near real-time logs of Google administrator access on Google Cloud.
Contractual protections Contractual protections refer to the legal commitments made by us regarding Google Cloud services.

Google is committed to maintaining and expanding our compliance portfolio. The Cloud Data Processing Addendum (CDPA) document defines our commitment to maintaining our ISO 27001, 27017, and 27018 certifications and to updating our SOC 2 and SOC 3 reports every 12 months.

The CDPA also outlines the access controls that are in place to limit access by Google support engineers to customers' environments, and it describes our rigorous logging and approval process.

We recommend that you review Google Cloud's contractual controls with your legal and regulatory experts and verify that they meet your requirements. If you need more information, contact your technical account representative.
Third-party verifications or attestations Third-party verifications or attestations refer to having a third-party vendor audit the cloud provider to verify that the provider meets compliance requirements. For example, Google was audited by a third party for ISO 27017 compliance.

You can see the current Google Cloud certifications and letters of attestation at the Compliance Resource Center.

What's next

Learn more about risk management with the following resources:

Manage your assets

This document in the Google Cloud Architecture Framework provides best practices for managing assets.

Asset management is an important part of your business requirements analysis. You must know what assets you have and have a good understanding of their value and of any critical paths or processes related to them. You must have an accurate asset inventory before you can design any security controls to protect those assets.

To manage security incidents and meet your organization's regulatory requirements, you need an accurate and up-to-date asset inventory that includes a way to analyze historical data. You must be able to track your assets, including how their risk exposure might change over time.

Moving to Google Cloud means that you need to modify your asset management processes to adapt to a cloud environment. For example, one of the benefits of moving to the cloud is that you increase your organization's ability to scale quickly. However, the ability to scale quickly can cause shadow IT issues, in which your employees create cloud resources that aren't properly managed and secured. Therefore, your asset management processes must provide sufficient flexibility for employees to get their work done while also providing for appropriate security controls.

Use cloud asset management tools

Google Cloud asset management tools are tailored specifically to our environment and to top customer use cases.

One of these tools is Cloud Asset Inventory, which provides both real-time information about the current state of your resources and a five-week history. By using this service, you can get an organization-wide snapshot of your inventory for a wide variety of Google Cloud resources and policies. Automation tools can then use the snapshot for monitoring or for policy enforcement, or they can archive the snapshot for compliance auditing. If you want to analyze changes to your assets, Cloud Asset Inventory also lets you export metadata history.
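
For example, the following minimal sketch uses the Cloud Asset Inventory search API to list a few resource types across a project; the project ID and asset types are placeholders.

# Sketch: list selected resources in a project by using the
# Cloud Asset Inventory search API. The project ID is a placeholder.
from google.cloud import asset_v1

client = asset_v1.AssetServiceClient()
response = client.search_all_resources(
    request={
        "scope": "projects/my-project",
        "asset_types": [
            "compute.googleapis.com/Instance",
            "storage.googleapis.com/Bucket",
        ],
    }
)
for resource in response:
    print(resource.asset_type, resource.name)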

For more information about Cloud Asset Inventory, see Custom solution to respond to asset changes and Detective controls.

Automate asset management

Automation lets you quickly create and manage assets based on the security requirements that you specify. You can automate aspects of the asset lifecycle in the following ways:

  • Deploy your cloud infrastructure using automation tools such as Terraform. Google Cloud provides the enterprise foundations blueprint, which helps you set up infrastructure resources that meet security best practices. In addition, the blueprint configures notifications for asset changes and for policy compliance in Cloud Asset Inventory.
  • Deploy your applications using automation tools such as Cloud Run and Artifact Registry.

Monitor for deviations from your compliance policies

Deviations from policies can occur during all phases of the asset lifecycle. For example, assets might be created without the proper security controls, or their privileges might be escalated. Similarly, assets might be abandoned without the appropriate end-of-life procedures being followed.

To help avoid these scenarios, we recommend that you monitor assets for deviation from compliance. The set of assets that you monitor depends on the results of your risk assessment and on your business requirements. For more information about monitoring assets, see Monitoring asset changes.

Integrate with your existing asset management monitoring systems

If you already use a security information and event management (SIEM) system or another monitoring system, integrate your Google Cloud assets with that system. Integration ensures that your organization has a single, comprehensive view into all resources, regardless of environment. For more information, see Export Google Cloud security data to your SIEM system and Scenarios for exporting Cloud Logging data: Splunk.

Use data analysis to enrich your monitoring

You can export your inventory to a BigQuery table or Cloud Storage bucket for additional analysis.

What's next

Learn more about managing your assets with the following resources:

Manage identity and access

This document in the Google Cloud Architecture Framework provides best practices for managing identity and access.

The practice of identity and access management (generally referred to as IAM) helps you ensure that the right people can access the right resources. IAM addresses the following aspects of authentication and authorization:

  • Account management, including provisioning
  • Identity governance
  • Authentication
  • Access control (authorization)
  • Identity federation

Managing IAM can be challenging when you have different environments or you use multiple identity providers. However, it's critical that you set up a system that can meet your business requirements while mitigating risks.

The recommendations in this document help you review your current IAM policies and procedures and determine which of those you might need to modify for your workloads in Google Cloud. For example, you must review the following:

  • Whether you can use existing groups to manage access or whether you need to create new ones.
  • Your authentication requirements (such as multi-factor authentication (MFA) using a token).
  • The impact of service accounts on your current policies.
  • If you're using Google Cloud for disaster recovery, maintaining appropriate separation of duties.

Within Google Cloud, you use Cloud Identity to authenticate your users and resources and Google's Identity and Access Management (IAM) product to dictate resource access. Administrators can restrict access at the organization, folder, project, and resource level. Google IAM policies dictate who can do what on which resources. Correctly configured IAM policies help secure your environment by preventing unauthorized access to resources.

For more information, see Overview of identity and access management.

Use a single identity provider

Many of our customers have user accounts that are managed and provisioned by identity providers outside of Google Cloud. Google Cloud supports federation with most identity providers and with on-premises directories such as Active Directory.

Most identity providers let you enable single sign-on (SSO) for your users and groups. For applications that you deploy on Google Cloud and that use your external identity provider, you can extend your identity provider to Google Cloud. For more information, see Reference architectures and Patterns for authenticating corporate users in a hybrid environment.

If you don't have an existing identity provider, you can use either Cloud Identity Premium or Google Workspace to manage identities for your employees.

Protect the super admin account

The super admin account (managed by Google Workspace or Cloud Identity) lets you create your Google Cloud organization. This admin account is therefore highly privileged. Best practices for this account include the following:

  • Create a new account for this purpose; don't use an existing user account.
  • Create and protect backup accounts.
  • Enable MFA.

For more information, see Super administrator account best practices.

Plan your use of service accounts

A service account is a special kind of Google account that applications can use to make authorized calls to Google APIs.

Unlike your user accounts, service accounts are created and managed within Google Cloud. Service accounts also authenticate differently than user accounts:

  • To let an application running on Google Cloud authenticate using a service account, you can attach a service account to the compute resource that the application runs on, as shown in the sketch after this list.
  • To let an application running on GKE authenticate using a service account, you can use Workload Identity.
  • To let applications running outside of Google Cloud authenticate using a service account, you can use Workload identity federation.
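
For example, when a service account is attached to a compute resource, an application can authenticate through Application Default Credentials without managing any key files. The following minimal sketch assumes that the attached service account has read access to a placeholder bucket.

# Sketch: an application running on a VM or in GKE with an attached
# service account authenticates through Application Default Credentials;
# no service account key files are needed.
import google.auth
from google.cloud import storage

credentials, project_id = google.auth.default()
client = storage.Client(credentials=credentials, project=project_id)

# The bucket name is a placeholder.
for blob in client.list_blobs("my-example-bucket"):
    print(blob.name)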

When you use service accounts, you must consider an appropriate segregation of duties during your design process. Note the API calls that you must make, and determine the service accounts and associated roles that the API calls require. For example, if you're setting up a BigQuery data warehouse, you probably need identities for at least the following processes and services:

  • Cloud Storage or Pub/Sub, depending on whether you're providing a batch file or creating a streaming service.
  • Dataflow and Sensitive Data Protection to de-identify sensitive data.

For more information, see Best practices for working with service accounts.

Update your identity processes for the cloud

Identity governance lets you track access, risks, and policy violations so that you can support your regulatory requirements. This governance requires that you have processes and policies in place so that you can grant and audit access control roles and permissions to users. Your processes and policies must reflect the requirements of your environments—for example, test, development, and production.

Before you deploy workloads on Google Cloud, review your current identity processes and update them if appropriate. Ensure that you appropriately plan for the types of accounts that your organization needs and that you have a good understanding of their role and access requirements.

To help you audit Google IAM activities, Google Cloud creates audit logs, which include the following:

  • Administrator activity. This logging can't be disabled.
  • Data access activity. You must enable this logging.

If necessary for compliance purposes, or if you want to set up log analysis (for example, with your SIEM system), you can export the logs. Because logs can increase your storage requirements, they might affect your costs. Ensure that you log only the actions that you require, and set appropriate retention schedules.

Set up SSO and MFA

Your identity provider manages user account authentication. Federated identities can authenticate to Google Cloud using SSO. For privileged accounts, such as super admins, you should configure MFA. Titan Security Keys are physical tokens that you can use for two-factor authentication (2FA) to help prevent phishing attacks.

Cloud Identity supports MFA using various methods. For more information, see Enforce uniform MFA to company-owned resources.

Google Cloud supports authentication for workload identities using the OAuth 2.0 protocol or signed JSON Web Tokens (JWT). For more information about workload authentication, see Authentication overview.

Implement least privilege and separation of duties

You must ensure that the right individuals get access only to the resources and services that they need in order to perform their jobs. That is, you should follow the principle of least privilege. In addition, you must ensure there is an appropriate separation of duties.

Overprovisioning user access can increase the risk of insider threat, misconfigured resources, and non-compliance with audits. Underprovisioning permissions can prevent users from being able to access the resources they need in order to complete their tasks.

One way to avoid overprovisioning is to implement just-in-time privileged access — that is, to provide privileged access only as needed, and to only grant it temporarily.

Be aware that when a Google Cloud organization is created, all users in your domain are granted the Billing Account Creator and Project Creator roles by default. Identify the users who will perform these duties, and revoke these roles from other users. For more information, see Creating and managing organizations.

For more information about how roles and permissions work in Google Cloud, see Overview and Understanding roles in the IAM documentation. For more information about enforcing least privilege, see Enforce least privilege with role recommendations.

Audit access

To monitor the activities of privileged accounts for deviations from approved conditions, use Cloud Audit Logs. Cloud Audit Logs records the actions that members in your Google Cloud organization have taken in your Google Cloud resources. You can work with various audit log types across Google services. For more information, see Using Cloud Audit Logs to Help Manage Insider Risk (video).
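
For example, the following minimal sketch uses the Cloud Logging client library to list recent Admin Activity audit log entries; the project ID and the timestamp in the filter are placeholders, and you can narrow the filter to specific principals or methods.

# Sketch: list recent Admin Activity audit log entries for a project.
# The project ID and the timestamp are placeholders.
from google.cloud import logging

client = logging.Client(project="my-project")
log_filter = (
    'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity" '
    'AND timestamp>="2024-01-01T00:00:00Z"'
)
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.log_name)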

Use IAM recommender to track usage and to adjust permissions where appropriate. The roles that are recommended by IAM recommender can help you determine which roles to grant to a user based on the user's past behavior and on other criteria. For more information, see Best practices for role recommendations.

To audit and control access to your resources by Google support and engineering personnel, you can use Access Transparency. Access Transparency records the actions taken by Google personnel. Use Access Approval, which is part of Access Transparency, to grant explicit approval every time customer content is accessed. For more information, see Control cloud administrators' access to your data.

Automate your policy controls

Set access permissions programmatically whenever possible. For best practices, see Organization policy constraints. The Terraform scripts for the enterprise foundations blueprint are in the example foundation repository.

Google Cloud includes Policy Intelligence, which lets you automatically review and update your access permissions. Policy Intelligence includes the Recommender, Policy Troubleshooter, and Policy Analyzer tools, which do the following:

  • Provide recommendations for IAM role assignment.
  • Monitor and help prevent overly permissive IAM policies.
  • Assist with troubleshooting access-control-related issues.

Set restrictions on resources

Google IAM focuses on who, and it lets you authorize who can act on specific resources based on permissions. The Organization Policy Service focuses on what, and it lets you set restrictions on resources to specify how they can be configured. For example, you can use an organization policy to do the following:

In addition to using organizational policies for these tasks, you can restrict access to resources using one of the following methods:

  • Use tags to manage access to your resources without defining the access permissions on each resource. Instead, you add the tag and then set the access definition for the tag itself.
  • Use IAM Conditions for conditional, attribute-based control of access to resources.
  • Implement defense-in-depth using VPC Service Controls to further restrict access to resources.

For more about resource management, see Resource management.

What's next

Learn more about IAM with the following resources:

Implement compute and container security

Google Cloud includes controls to protect your compute resources and Google Kubernetes Engine (GKE) container resources. This document in the Google Cloud Architecture Framework describes key controls and best practices for using them.

Use hardened and curated VM images

Google Cloud includes Shielded VM, which allows you to harden your VM instances. Shielded VM is designed to prevent malicious code from being loaded during the boot cycle. It provides boot security, monitors integrity, and uses the Virtual Trusted Platform Module (vTPM). Use Shielded VM for sensitive workloads.

In addition to using Shielded VM, you can use Google Cloud partner solutions to further protect your VMs. Many partner solutions offered on Google Cloud integrate with Security Command Center, which provides event threat detection and health monitoring. You can use partners for advanced threat analysis or extra runtime security.

Use Confidential Computing for processing sensitive data

By default, Google Cloud encrypts data at rest and in transit across the network, but data isn't encrypted while it's in use in memory. If your organization handles confidential data, you need to mitigate against threats that undermine the confidentiality and integrity of either the application or the data in system memory. Confidential data includes personally identifiable information (PII), financial data, and health information.

Confidential Computing builds on Shielded VM. It protects data in use by performing computation in a hardware-based trusted execution environment. This type of secure and isolated environment helps prevent unauthorized access or modification of applications and data while that data is in use. A trusted execution environment also increases the security assurances for organizations that manage sensitive and regulated data.

In Google Cloud, you can enable Confidential Computing by running Confidential VMs or Confidential GKE nodes. Turn on Confidential Computing when you're processing confidential workloads, or when you have confidential data (for example, secrets) that must be exposed while they are processed. For more information, see the Confidential Computing Consortium.

Protect VMs and containers

OS Login lets your employees connect to your VMs using Identity and Access Management (IAM) permissions as the source of truth instead of relying on SSH keys. You therefore don't have to manage SSH keys throughout your organization. OS Login ties an administrator's access to their employee lifecycle, which means that if employees move to another role or leave your organization, their access is revoked with their account. OS Login also supports two-factor authentication, which adds an extra layer of security from account takeover attacks.

In GKE and App Engine, application instances run within containers. To enable a defined risk profile and to restrict employees from making changes to containers, ensure that your containers are stateless and immutable. The principle of immutability means that your employees do not modify a container or access it interactively. If a container must be changed, you build a new image and redeploy. Enable SSH access to the underlying containers only in specific debugging scenarios.

Disable external IP addresses unless they're necessary

To disable external IP address allocation (video) for your production VMs and to prevent the use of external load balancers, you can use organization policies. If you require your VMs to reach the internet or your on-premises data center, you can enable a Cloud NAT gateway.

You can deploy private clusters in GKE. In a private cluster, nodes have only internal IP addresses, which means that nodes and Pods are isolated from the internet by default. You can also define a network policy to manage Pod-to-Pod communication in the cluster. For more information, see Private access options for services.

Monitor your compute instance and GKE usage

Cloud Audit Logs are automatically enabled for Compute Engine and GKE. Audit logs let you automatically capture all activities in your cluster and monitor for any suspicious activity.

You can integrate GKE with partner products for runtime security. You can integrate these solutions with the Security Command Center to provide you with a single interface for monitoring your applications.

Keep your images and clusters up to date

Google Cloud provides curated OS images that are patched regularly. You can bring custom images and run them on Compute Engine, but if you do, you have to patch them yourself. Google Cloud regularly updates OS images to mitigate new vulnerabilities as described in security bulletins and provides remediation to fix vulnerabilities for existing deployments.

If you're using GKE, we recommend that you enable node auto-upgrade to have Google update your cluster nodes with the latest patches. Google manages GKE control planes, which are automatically updated and patched. In addition, use Google-curated container-optimized images for your deployment. Google regularly patches and updates these images.

Control access to your images and clusters

It's important to know who can create and launch instances. You can control this access using IAM. For information about how to determine what access workloads need, see Plan your workload identities.

In addition, you can use VPC Service Controls to define custom quotas on projects so that you can limit who can launch images. For more information, see the Secure your network section.

To provide infrastructure security for your cluster, GKE lets you use IAM with role-based access control (RBAC) to manage access to your cluster and namespaces.

Isolate containers in a sandbox

Use GKE Sandbox to deploy multi-tenant applications that need an extra layer of security and isolation from their host kernel. For example, use GKE Sandbox when you are executing unknown or untrusted code. GKE Sandbox is a container isolation solution that provides a second layer of defense between containerized workloads on GKE.

GKE Sandbox was built for applications that have low I/O requirements but that are highly scaled. These containerized workloads need to maintain their speed and performance, but might also involve untrusted code that demands added security. Use gVisor, a container runtime sandbox, to provide additional security isolation between applications and the host kernel. gVisor provides additional integrity checks and limits the scope of access for a service. It isn't a container hardening service that protects against external threats. For more information about gVisor, see gVisor: Protecting GKE and serverless users in the real world.

What's next

Learn more about compute and container security with the following resources:

Secure your network

This document in the Google Cloud Architecture Framework provides best practices for securing your network.

Extending your existing network to include cloud environments has many implications for security. Your on-premises approach to multi-layered defenses likely involves a distinct perimeter between the internet and your internal network. You probably protect the perimeter by using physical firewalls, routers, intrusion detection systems, and so on. Because the boundary is clearly defined, you can easily monitor for intrusions and respond accordingly.

When you move to the cloud (either completely or in a hybrid approach), you move beyond your on-premises perimeter. This document describes ways that you can continue to secure your organization's data and workloads on Google Cloud. As mentioned in Manage risks with controls, how you set up and secure your Google Cloud network depends on your business requirements and risk appetite.

This section assumes that you have read the Networking section in the System design category, and that you've already created a basic architecture diagram of your Google Cloud network components. For an example diagram, see Hub-and-spoke.

Deploy zero trust networks

Moving to the cloud means that your network trust model must change. Because your users and your workloads are no longer behind your on-premises perimeter, you can't use perimeter protections in the same way to create a trusted, inner network. The zero trust security model means that no one is trusted by default, whether they are inside or outside of your organization's network. When verifying access requests, the zero trust security model requires you to check both the user's identity and context. Unlike a VPN-based approach, zero trust shifts access controls from the network perimeter to the users and their devices.

In Google Cloud, you can use BeyondCorp Enterprise as your zero trust solution. BeyondCorp Enterprise provides threat and data protection and additional access controls. For more information about how to set it up, see Getting started with BeyondCorp Enterprise.

In addition to BeyondCorp Enterprise, Google Cloud includes Identity-Aware Proxy (IAP). IAP lets you extend zero trust security to your applications both within Google Cloud and on-premises. IAP uses access control policies to provide authentication and authorization for users who access your applications and resources.

Secure connections to your on-premises or multi-cloud environments

Many organizations have workloads both in cloud environments and on-premises. In addition, for resiliency, some organizations use multi-cloud solutions. In these scenarios, it's critical to secure your connectivity between all of your environments.

Google Cloud includes private access methods for VMs that are supported by Cloud VPN or Cloud Interconnect, including the following:

For a comparison between the products, see Choosing a Network Connectivity product.

Disable default networks

When you create a new Google Cloud project, a default Google Cloud VPC network with auto mode IP addresses and pre-populated firewall rules is automatically provisioned. For production deployments, we recommend that you delete the default networks in existing projects, and disable the creation of default networks in new projects.

Virtual Private Cloud networks let you use any internal IP address. To avoid IP address conflicts, we recommend that you first plan your network and IP address allocation across your connected deployments and across your projects. A project allows multiple VPC networks, but it's usually a best practice to limit these networks to one per project in order to enforce access control effectively.

Secure your perimeter

In Google Cloud, you can use various methods to segment and secure your cloud perimeter, including firewalls and VPC Service Controls.

Use Shared VPC to build a production deployment that gives you a single shared network and that isolates workloads into individual projects that can be managed by different teams. Shared VPC provides centralized deployment, management, and control of the network and network security resources across multiple projects. Shared VPC consists of host and service projects that perform the following functions:

  • A host project contains the networking and network security-related resources, such as VPC networks, subnets, firewall rules, and hybrid connectivity.
  • A service project attaches to a host project. It lets you isolate workloads and users at the project level by using Identity and Access Management (IAM), while it shares the networking resources from the centrally managed host project.

Define firewall policies and rules at the organization, folder, and VPC network level. You can configure firewall rules to permit or deny traffic to or from VM instances. For examples, see Global and regional network firewall policy examples and Hierarchical firewall policy examples. In addition to defining rules based on IP addresses, protocols, and ports, you can manage traffic and apply firewall rules based on the service account that's used by a VM instance or by using secure tags.

To control the movement of data in Google services and to set up context-based perimeter security, consider VPC Service Controls. VPC Service Controls provides an extra layer of security for Google Cloud services that's independent of IAM and firewall rules and policies. For example, VPC Service Controls lets you set up perimeters between confidential and non-confidential data so that you can apply controls that help prevent data exfiltration.

Use Google Cloud Armor security policies to allow, deny, or redirect requests to your external Application Load Balancer at the Google Cloud edge, as close as possible to the source of incoming traffic. These policies prevent unwelcome traffic from consuming resources or entering your network.

Use Secure Web Proxy to apply granular access policies to your egress web traffic and to monitor access to untrusted web services.

Inspect your network traffic

You can use Cloud IDS and Packet Mirroring to help you ensure the security and compliance of workloads running in Compute Engine and Google Kubernetes Engine (GKE).

Use Cloud Intrusion Detection System (currently in preview) to get visibility into the traffic moving into and out of your VPC networks. Cloud IDS creates a Google-managed peered network that has mirrored VMs. The mirrored traffic is then inspected by Palo Alto Networks threat protection technologies. For more information, see Cloud IDS overview.

Packet Mirroring clones traffic of specified VM instances in your VPC network and forwards it for collection, retention, and examination. After you configure Packet Mirroring, you can use Cloud IDS or third-party tools to collect and inspect network traffic at scale. Inspecting network traffic in this way helps provide intrusion detection and application performance monitoring.

Use a web application firewall

For external web applications and services, you can enable Google Cloud Armor to provide distributed denial-of-service (DDoS) protection and web application firewall (WAF) capabilities. Google Cloud Armor supports Google Cloud workloads that are exposed using external HTTP(S) load balancing, TCP Proxy load balancing, or SSL Proxy load balancing.

Google Cloud Armor is offered in two service tiers, Standard and Managed Protection Plus. To take full advantage of advanced Google Cloud Armor capabilities, you should invest in Managed Protection Plus for your key workloads.

Automate infrastructure provisioning

Automation lets you create immutable infrastructure, which means that it can't be changed after provisioning. This measure gives your operations team a known good state, fast rollback, and troubleshooting capabilities. For automation, you can use tools such as Terraform, Jenkins, and Cloud Build.

To help you build an environment that uses automation, Google Cloud provides a series of security blueprints that are in turn built on the enterprise foundations blueprint. The security foundations blueprint provides Google's opinionated design for a secure application environment and describes step by step how to configure and deploy your Google Cloud estate. Using the instructions and the scripts that are part of the security foundations blueprint, you can configure an environment that meets our security best practices and guidelines. You can build on that blueprint with additional blueprints or design your own automation.

For more information about automation, see Use a CI/CD pipeline for data-processing workflows.

Monitor your network

Monitor your network and your traffic using telemetry.

VPC Flow Logs and Firewall Rules Logging provide near real-time visibility into the traffic and firewall usage in your Google Cloud environment. For example, Firewall Rules Logging logs traffic to and from Compute Engine VM instances. When you combine these tools with Cloud Logging and Cloud Monitoring, you can track, alert, and visualize traffic and access patterns to improve the operational security of your deployment.

Firewall Insights lets you review which firewall rules matched incoming and outgoing connections and whether the connections were allowed or denied. The shadowed rules feature helps you tune your firewall configuration by showing you which rules are never triggered because another rule is always triggered first.

Use Network Intelligence Center to see how your network topology and architecture are performing. You can get detailed insights into network performance and you can then optimize your deployment to eliminate any bottlenecks in your service. Connectivity Tests provide you with insights into the firewall rules and policies that are applied to the network path.

For more information about monitoring, see Implement logging and detective controls.

What's next

Learn more about network security with the following resources:

Implement data security

This document in the Google Cloud Architecture Framework provides best practices for implementing data security.

As part of your deployment architecture, you must consider what data you plan to process and store in Google Cloud, and the sensitivity of the data. Design your controls to help secure the data during its lifecycle, to identify data ownership and classification, and to help protect data from unauthorized use.

For a security blueprint that deploys a BigQuery data warehouse with the security best practices described in this document, see Secure a BigQuery data warehouse that stores confidential data.

Automatically classify your data

Perform data classification as early in the data management lifecycle as possible, ideally when the data is created. Usually, data classification efforts require only a few categories, such as the following:

  • Public: Data that has been approved for public access.
  • Internal: Non-sensitive data that isn't released to the public.
  • Confidential: Sensitive data that's available for general internal distribution.
  • Restricted: Highly sensitive or regulated data that requires restricted distribution.

Use Sensitive Data Protection to discover and classify data across your Google Cloud environment. Sensitive Data Protection has built-in support for scanning and classifying sensitive data in Cloud Storage, BigQuery, and Datastore. It also has a streaming API to support additional data sources and custom workloads.

Sensitive Data Protection can identify sensitive data using built-in infotypes. It can automatically classify, mask, tokenize, and transform sensitive elements (such as PII data) to let you manage the risk of collecting, storing, and using data. In other words, it can integrate with your data lifecycle processes to ensure that data in every stage is protected.
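
For example, the following minimal sketch inspects a text value for a couple of built-in infoTypes; the project ID, infoTypes, and sample value are placeholders.

# Sketch: inspect a text value for built-in infoTypes with the
# Sensitive Data Protection (Cloud DLP) API. The project ID is a placeholder.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
response = client.inspect_content(
    request={
        "parent": "projects/my-project/locations/global",
        "inspect_config": {
            "info_types": [
                {"name": "EMAIL_ADDRESS"},
                {"name": "CREDIT_CARD_NUMBER"},
            ],
            "include_quote": False,
        },
        "item": {"value": "Customer card: 4111 1111 1111 1111"},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)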

For more information, see De-identification and re-identification of PII in large-scale datasets using Sensitive Data Protection.

Manage data governance using metadata

Data governance is a combination of processes that ensure that data is secure, private, accurate, available, and usable. Although you are responsible for defining a data governance strategy for your organization, Google Cloud provides tools and technologies to help you put your strategy into practice. Google Cloud also provides a framework for data governance (PDF) in the cloud.

Use Data Catalog to find, curate, and use metadata to describe your data assets in the cloud. You can use Data Catalog to search for data assets, then tag the assets with metadata. To help accelerate your data classification efforts, integrate Data Catalog with Sensitive Data Protection to automatically identify confidential data. After data is tagged, you can use Google Identity and Access Management (IAM) to restrict which data users can query or use through Data Catalog views.

Use Dataproc Metastore or Hive metastore to manage metadata for workloads. Data Catalog has a Hive connector that allows the service to discover metadata that's inside a Hive metastore.

Use Dataprep by Trifacta to define and enforce data quality rules through a console. You can use Dataprep from within Cloud Data Fusion or use Dataprep as a standalone service.

Protect data according to its lifecycle phase and classification

After you define data within the context of its lifecycle and classify it based on its sensitivity and risk, you can assign the right security controls to protect it. You must ensure that your controls deliver adequate protections, meet compliance requirements, and reduce risk. As you move to the cloud, review your current strategy and where you might need to change your current processes.

The following table describes three characteristics of a data security strategy in the cloud.

Characteristic Description
Identification Understand the identity of users, resources, and applications as they create, modify, store, use, share, and delete data.

Use Cloud Identity and IAM to control access to data. If your identities require certificates, consider Certificate Authority Service.

For more information, see Manage identity and access.
Boundary and access Set up controls for how data is accessed, by whom, and under what circumstances. Access boundaries to data can be managed at these levels:

Visibility You can audit usage and create reports that demonstrate how data is controlled and accessed. Google Cloud Logging and Access Transparency provide insights into the activities of your own cloud administrators and Google personnel. For more information, see Monitor your data.

Encrypt your data

By default, Google Cloud encrypts customer data stored at rest, with no action required from you. In addition to default encryption, Google Cloud provides options for envelope encryption and encryption key management. For example, Compute Engine persistent disks are automatically encrypted, but you can supply or manage your own keys.

You must identify the solutions that best fit your requirements for key generation, storage, and rotation, whether you're choosing the keys for your storage, for compute, or for big data workloads.

Google Cloud includes the following options for encryption and key management:

  • Customer-managed encryption keys (CMEK). You can generate and manage your encryption keys using Cloud Key Management Service (Cloud KMS). Use this option if you have certain key management requirements, such as the need to rotate encryption keys regularly.
  • Customer-supplied encryption keys (CSEK). You can create and manage your own encryption keys, and then provide them to Google Cloud when necessary. Use this option if you generate your own keys using your on-premises key management system to bring your own key (BYOK). If you provide your own keys using CSEK, Google replicates them and makes them available to your workloads. However, the security and availability of CSEK is your responsibility because customer-supplied keys aren't stored in instance templates or in Google infrastructure. If you lose access to the keys, Google can't help you recover the encrypted data. Think carefully about which keys you want to create and manage yourself. You might use CSEK for only the most sensitive information. Another option is to perform client-side encryption on your data and then store the encrypted data in Google Cloud, where the data is encrypted again by Google.
  • Third-party key management system with Cloud External Key Manager (Cloud EKM). Cloud EKM protects your data at rest by using encryption keys that are stored and managed in a third-party key management system that you control outside of the Google infrastructure. When you use this method, you have high assurance that your data can't be accessed by anyone outside of your organization. Cloud EKM lets you achieve a secure hold-your-own-key (HYOK) model for key management. For compatibility information, see the Cloud EKM enabled services list.

Cloud KMS also lets you encrypt your data with either software-backed encryption keys or FIPS 140-2 Level 3 validated hardware security modules (HSMs). If you're using Cloud KMS, your cryptographic keys are stored in the region where you deploy the resource. Cloud HSM distributes your key management needs across regions, providing redundancy and global availability of keys.

For information on how envelope encryption works, see Encryption at rest in Google Cloud.
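
As an example of putting CMEK into practice, the following minimal sketch sets a Cloud KMS key as the default encryption key for new objects in an existing Cloud Storage bucket; the project, location, key ring, key, and bucket names are placeholders.

# Sketch: set a Cloud KMS key (CMEK) as the default encryption key for
# new objects in an existing Cloud Storage bucket. All names are placeholders.
# The Cloud Storage service agent needs the Cloud KMS CryptoKey
# Encrypter/Decrypter role on the key.
from google.cloud import storage

kms_key_name = (
    "projects/my-project/locations/us-central1/"
    "keyRings/my-key-ring/cryptoKeys/my-key"
)

client = storage.Client()
bucket = client.get_bucket("my-example-bucket")
bucket.default_kms_key_name = kms_key_name
bucket.patch()
print(f"Default KMS key for {bucket.name}: {bucket.default_kms_key_name}")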

Control cloud administrators' access to your data

You can control access by Google support and engineering personnel to your environment on Google Cloud. Access Approval lets you explicitly approve before Google employees access your data or resources on Google Cloud. This product complements the visibility provided by Access Transparency, which generates logs when Google personnel interact with your data. These logs include the office location and the reason for the access.

Using these products together, you can deny Google the ability to decrypt your data for any reason.

Configure where your data is stored and where users can access it from

You can control the network locations from which users can access data by using VPC Service Controls. This product lets you limit access to users in a specific region. You can enforce this constraint even if the user is authorized according to your Google IAM policy. Using VPC Service Controls, you create a service perimeter that defines the virtual boundaries from which a service can be accessed, which prevents data from being moved outside those boundaries.

For more information, see the following:

Manage secrets using Secret Manager

Secret Manager lets you store all of your secrets in a centralized place. Secrets are configuration information such as database passwords, API keys, or TLS certificates. You can automatically rotate secrets, and you can configure applications to automatically use the latest version of a secret. Every interaction with Secret Manager generates an audit log, so you can view every access to every secret.

Sensitive Data Protection also has a category of detectors to help you identify credentials and secrets in data that could be protected with Secret Manager.
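
For example, the following minimal sketch reads the latest version of a secret at runtime instead of hard-coding the value; the project ID and secret ID are placeholders.

# Sketch: read the latest version of a secret at runtime instead of
# hard-coding the value. The project ID and secret ID are placeholders.
from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
name = "projects/my-project/secrets/db-password/versions/latest"
response = client.access_secret_version(request={"name": name})
db_password = response.payload.data.decode("UTF-8")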

Monitor your data

To view administrator activity and key use logs, use Cloud Audit Logs. To help secure your data, monitor logs using Cloud Monitoring to ensure proper use of your keys.

Cloud Logging captures Google Cloud events and lets you add additional sources if necessary. You can segment your logs by region, store them in buckets, and integrate custom code for processing logs. For an example, see Custom solution for automated log analysis.

You can also export logs to BigQuery to perform security and access analytics to help identify unauthorized changes and inappropriate access to your organization's data.
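
For example, if you route audit logs to a BigQuery dataset, a query like the one in the following sketch can surface the most active principals; the dataset and table names are placeholders that depend on how your log sink is configured.

# Sketch: query audit logs that have been exported to BigQuery to find the
# most active principals. The dataset and table names are placeholders that
# depend on how your log sink is configured.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT protopayload_auditlog.authenticationInfo.principalEmail AS principal,
           COUNT(*) AS actions
    FROM `my-project.audit_logs.cloudaudit_googleapis_com_activity`
    WHERE TIMESTAMP_TRUNC(timestamp, DAY) >= TIMESTAMP("2024-01-01")
    GROUP BY principal
    ORDER BY actions DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.principal, row.actions)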

Security Command Center can help you identify and resolve insecure-access problems to sensitive organizational data that's stored in the cloud. Through a single management interface, you can scan for a wide variety of security vulnerabilities and risks to your cloud infrastructure. For example, you can monitor for data exfiltration, scan storage systems for confidential data, and detect which Cloud Storage buckets are open to the internet.

What's next

Learn more about data security with the following resources:

Deploy applications securely

This document in the Google Cloud Architecture Framework provides best practices for deploying applications securely.

To deploy secure applications, you must have a well-defined software development lifecycle, with appropriate security checks during the design, development, testing, and deployment stages. When you design an application, we recommend a layered system architecture that uses standardized frameworks for identity, authorization, and access control.

Automate secure releases

Without automated tools, it can be hard to deploy, update, and patch complex application environments to meet consistent security requirements. To address these challenges, we recommend that you build a CI/CD pipeline for these tasks. Automated pipelines remove manual errors, provide standardized development feedback loops, and enable fast product iterations. For example, Cloud Build private pools let you deploy a highly secure, managed CI/CD pipeline for highly regulated industries, including finance and healthcare.

You can use automation to scan for security vulnerabilities when artifacts are created. You can also define policies for different environments (development, test, production, and so on) so that only verified artifacts are deployed.

Ensure that application deployments follow approved processes

If an attacker compromises your CI/CD pipeline, your entire stack can be affected. To help secure the pipeline, you should enforce an established approval process before you deploy the code into production.

If you plan to use Google Kubernetes Engine (GKE) or GKE Enterprise, you can establish these checks and balances by using Binary Authorization. Binary Authorization attaches configurable signatures to container images. These signatures (also called attestations) help to validate the image. At deployment, Binary Authorization uses these attestations to determine that a process was completed earlier. For example, you can use Binary Authorization to do the following:

  • Verify that a specific build system or continuous integration (CI) pipeline created a container image.
  • Validate that a container image is compliant with a vulnerability signing policy.
  • Verify that a container image passes criteria for promotion to the next deployment environment, such as from development to QA.

Scan for known vulnerabilities before deployment

We recommend that you use automated tools that can continuously perform vulnerability scans on container images before the containers are deployed to production.

Use Artifact Analysis to automatically scan for vulnerabilities for containers that are stored in Artifact Registry and Container Registry. This process includes two tasks: scanning and continuous analysis.

To start, Artifact Analysis scans new images when they're uploaded to Artifact Registry or Container Registry. The scan extracts information about the system packages in the container.

Artifact Analysis then looks for vulnerabilities when you upload the image. After the initial scan, Artifact Analysis continuously monitors the metadata of scanned images in Artifact Registry and Container Registry for new vulnerabilities. When Artifact Analysis receives new and updated vulnerability information from vulnerability sources, it does the following:

  • Updates the metadata of the scanned images to keep them up to date.
  • Creates new vulnerability occurrences for new notes.
  • Deletes vulnerability occurrences that are no longer valid.

Monitor your application code for known vulnerabilities

It's a best practice to use automated tools that can constantly monitor your application code for known vulnerabilities such as the OWASP Top 10. For a description of Google Cloud products and features that support OWASP Top 10 mitigation techniques, see OWASP Top 10 mitigation options on Google Cloud.

Use Web Security Scanner to help identify security vulnerabilities in your App Engine, Compute Engine, and Google Kubernetes Engine web applications. The scanner crawls your application, following all links within the scope of your starting URLs, and attempts to exercise as many user inputs and event handlers as possible. It can automatically scan for and detect common vulnerabilities, including cross-site scripting (XSS), Flash injection, mixed content (HTTP in HTTPS), and outdated or insecure libraries. Web Security Scanner gives you early identification of these types of vulnerabilities with low false positive rates.

Control movement of data across perimeters

To control the movement of data across a perimeter, you can configure security perimeters around the resources of your Google-managed services. Use VPC Service Controls to place all components and services in your CI/CD pipeline (for example, Container Registry, Artifact Registry, Artifact Analysis, and Binary Authorization) inside a security perimeter.

VPC Service Controls improves your ability to mitigate the risk of unauthorized copying or transfer of data (data exfiltration) from Google-managed services. With VPC Service Controls, you configure security perimeters around the resources of your Google-managed services to control the movement of data across the perimeter boundary. When a service perimeter is enforced, requests that violate the perimeter policy are denied, such as requests that are made to protected services from outside a perimeter. When a service is protected by an enforced perimeter, VPC Service Controls ensures the following:

  • A service can't transmit data out of the perimeter. Protected services function as normal inside the perimeter, but can't send resources and data out of the perimeter. This restriction helps prevent malicious insiders who might have access to projects in the perimeter from exfiltrating data.
  • Requests that come from outside the perimeter to the protected service are honored only if the requests meet the criteria of access levels that are assigned to the perimeter.
  • A service can be made accessible to projects in other perimeters using perimeter bridges.

Encrypt your container images

In Google Cloud, you can encrypt your container images using customer-managed encryption keys (CMEK). CMEK keys are managed in Cloud Key Management Service (Cloud KMS). When you use CMEK, you can temporarily or permanently disable access to an encrypted container image by disabling or destroying the key.

What's next

Learn more about securing your supply chain and application security with the following resources:

Manage compliance obligations

This document in the Google Cloud Architecture Framework provides best practices for managing compliance obligations.

Your cloud regulatory requirements depend on a combination of factors, including the following:

  • The laws and regulations that apply to your organization's physical locations.
  • The laws and regulations that apply to your customers' physical locations.
  • Your industry's regulatory requirements.

These requirements shape many of the decisions that you need to make about which security controls to enable for your workloads in Google Cloud.

A typical compliance journey goes through three stages: assessment, gap remediation, and continual monitoring. This section addresses the best practices that you can use during each stage.

Assess your compliance needs

Compliance assessment starts with a thorough review of all of your regulatory obligations and how your business is implementing them. To help you with your assessment of Google Cloud services, use the Compliance resource center. This site provides you with details on the following:

  • Service support for various regulations
  • Google Cloud certifications and attestations

You can ask for an engagement with a Google compliance specialist to better understand the compliance lifecycle at Google and how your requirements can be met.

For more information, see Assuring compliance in the cloud (PDF).

Deploy Assured Workloads

Assured Workloads is the Google Cloud tool that builds on the controls within Google Cloud to help you meet your compliance obligations. Assured Workloads lets you do the following:

  • Select your compliance regime. The tool then automatically sets the baseline personnel access controls.
  • Set the location for your data using organization policies so that your data at rest and your resources remain only in that region.
  • Select the key management option (such as the key rotation period) that best fits your security and compliance requirements.
  • For certain regulatory requirements such as FedRAMP Moderate, select the criteria for access by Google support personnel (for example, whether they have completed appropriate background checks).
  • Ensure that Google-managed encryption keys are FIPS-140-2 compliant and support FedRAMP Moderate compliance. For an added layer of control and separation of duties, you can use customer-managed encryption keys (CMEK). For more information about keys, see Encrypt your data.

Review blueprints for templates and best practices that apply to your compliance regime

Google has published blueprints and solutions guides that describe best practices and that provide Terraform modules to let you roll out an environment that helps you achieve compliance. Blueprints and guides are available for standards that include the following:

  • PCI
  • FedRAMP
  • HIPAA

Monitor your compliance

Most regulations require you to monitor particular activities, including access controls. To help with your monitoring, you can use the following:

  • Access Transparency, which provides near real-time logs when Google Cloud admins access your content.
  • Firewall Rules Logging to record TCP and UDP connections inside a VPC network for any rules that you create yourself. These logs can be useful for auditing network access or for providing early warning that the network is being used in an unapproved manner.
  • VPC Flow Logs to record network traffic flows that are sent or received by VM instances.
  • Security Command Center Premium to monitor for compliance with various standards.
  • OSSEC (or another open source tool) to log the activity of individuals who have admin access to your environment.
  • Key Access Justifications to view the reasons for a key access request.

Automate your compliance

To help you remain in compliance with changing regulations, determine if there are ways that you can automate your security policies by incorporating them into your infrastructure as code deployments. For example, consider the following:

  • Use security blueprints to build your security policies into your infrastructure deployments.

  • Configure Security Command Center to alert when non-compliance issues occur. For example, monitor for issues such as users disabling two-step verification or over-privileged service accounts. For more information, see Setting up finding notifications.

  • Set up automatic remediation to particular notifications. For more information, see Cloud Functions code.

For more information about compliance automation, see the Risk and Compliance as Code (RCaC) solution.

What's next

Learn more about compliance with the following resources:

Implement data residency and sovereignty requirements

This document in the Google Cloud Architecture Framework provides best practices for implementing data residency and sovereignty requirements.

Data residency and sovereignty requirements are based on your regional and industry-specific regulations, and different organizations might have different data sovereignty requirements. For example, you might have the following requirements:

  • Control over all access to your data by Google Cloud, including what type of personnel can access the data and from which region they can access it.
  • Inspectability of changes to cloud infrastructure and services, which can have an impact on access to your data or the security of your data. Insight into these types of changes helps ensure that Google Cloud is unable to circumvent controls or move your data out of the region.
  • Survivability of your workloads for an extended time when you are unable to receive software updates from Google Cloud.

Manage your data sovereignty

Data sovereignty provides you with a mechanism to prevent Google from accessing your data. You approve access only for provider behaviors that you agree are necessary.

For example, you can manage your data sovereignty in the following ways:

Manage your operational sovereignty

Operational sovereignty provides you with assurances that Google personnel can't compromise your workloads.

For example, you can manage operational sovereignty in the following ways:

Manage software sovereignty

Software sovereignty provides you with assurances that you can control the availability of your workloads and run them wherever you want, without depending on (or being locked in to) a single cloud provider. Software sovereignty includes the ability to survive events that require you to quickly change where your workloads are deployed and what level of outside connection is allowed.

For example, Google Cloud supports hybrid and multicloud deployments. In addition, GKE Enterprise lets you manage and deploy your applications in both cloud environments and on-premises environments.

Control data residency

Data residency describes where your data is stored at rest. Data residency requirements vary based on systems design objectives, industry regulatory concerns, national law, tax implications, and even culture.

Controlling data residency starts with the following:

  • Understanding the type of your data and its location.
  • Determining what risks exist to your data, and what laws and regulations apply.
  • Controlling where data is or where it goes.

To help comply with data residency requirements, Google Cloud lets you control where your data is stored, how it is accessed, and how it's processed. You can use resource location policies to restrict where resources are created and to limit where data is replicated between regions. You can use the location property of a resource to identify where the service deploys and who maintains it.

For supportability information, see Resource locations supported services.

What's next

Learn more about data residency and sovereignty with the following resources:

Implement privacy requirements

This document in the Google Cloud Architecture Framework provides best practices for implementing privacy requirements.

Privacy regulations help define how you can obtain, process, store, and manage your users' data. Many privacy controls (for example, controls for cookies, session management, and obtaining user permission) are your responsibility because you own your data (including the data that you receive from your users).

Google Cloud includes the following controls that promote privacy:

  • Default encryption of all data when it's at rest, when it's in transit, and while it's being processed.
  • Safeguards against insider access.
  • Support for numerous privacy regulations.

For more information, see Google Cloud Privacy Commitments.

Classify your confidential data

You must define what data is confidential and then ensure that the confidential data is properly protected. Confidential data can include credit card numbers, addresses, phone numbers, and other personally identifiable information (PII).

Using Sensitive Data Protection, you can set up appropriate classifications. You can then tag and tokenize your data before you store it in Google Cloud. For more information, see Automatically classify your data.
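As a minimal sketch, the following Python snippet uses the Sensitive Data Protection (DLP API) client library to inspect a string for a few common infotypes before the data is stored. The project ID and the chosen infotypes are placeholders, and production pipelines typically inspect Cloud Storage buckets or BigQuery tables rather than individual strings.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"  # Placeholder project ID.

# Inspect a string for a few common infotypes (placeholder selection).
inspect_config = {
    "info_types": [
        {"name": "CREDIT_CARD_NUMBER"},
        {"name": "PHONE_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
    ],
}
item = {"value": "Contact: jane@example.com, card 4111 1111 1111 1111"}

response = dlp.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```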

Lock down access to sensitive data

Place sensitive data in its own service perimeter using VPC Service Controls, and set Google Identity and Access Management (IAM) access controls for that data. Configure multi-factor authentication (MFA) for all users who require access to sensitive data.

For more information, see Control movement of data across perimeters and Set up SSO and MFA.

Monitor for phishing attacks

Ensure that your email system is configured to protect against phishing attacks, which are often used for fraud and malware attacks.

If your organization uses Gmail, you can use advanced phishing and malware protection. This collection of settings provides controls to quarantine emails, defend against anomalous attachment types, and help protect against inbound spoofing emails. Security Sandbox detects malware in attachments. Gmail is continually and automatically updated with the latest security improvements and protections to help keep your organization's email safe.

Extend zero trust security to your hybrid workforce

A zero trust security model means that no one is trusted implicitly, whether they are inside or outside of your organization's network. When your IAM systems verify access requests, a zero trust security posture means that the user's identity and context (for example, their IP address or location) are considered. Unlike a VPN, zero trust security shifts access controls from the network perimeter to users and their devices. Zero trust security allows users to work more securely from any location. For example, users can access your organization's resources from their laptops or mobile devices while at home.

On Google Cloud, you can configure BeyondCorp Enterprise and Identity-Aware Proxy (IAP) to enable zero trust for your Google Cloud resources. If your users use Google Chrome and you enable BeyondCorp Enterprise, you can integrate zero-trust security into your users' browsers.

What's next

Learn more about security and privacy with the following resources:

Implement logging and detective controls

This document in the Google Cloud Architecture Framework provides best practices for implementing logging and detective controls.

Detective controls use telemetry to detect misconfigurations, vulnerabilities, and potentially malicious activity in a cloud environment. Google Cloud lets you create tailored monitoring and detective controls for your environment. This section describes these additional features and recommendations for their use.

Monitor network performance

Network Intelligence Center gives you visibility into how your network topology and architecture are performing. You can get detailed insights into network performance and then use that information to optimize your deployment by eliminating bottlenecks on your services. Connectivity Tests provides you with insights into the firewall rules and policies that are applied to the network path.

Monitor and prevent data exfiltration

Data exfiltration is a key concern for organizations. Typically, it occurs when an authorized person extracts data from a secured system and then shares that data with an unauthorized party or moves it to an insecure system.

Google Cloud provides several features and tools that help you detect and prevent data exfiltration. For more information, see Preventing data exfiltration.

Centralize your monitoring

Security Command Center provides visibility into the resources that you have in Google Cloud and into their security state. Security Command Center helps you prevent, detect, and respond to threats. It provides a centralized dashboard that you can use to help identify security misconfigurations in virtual machines, in networks, in applications, and in storage buckets. You can address these issues before they result in business damage or loss. The built-in capabilities of Security Command Center can reveal suspicious activity in your Cloud Logging security logs or indicate compromised virtual machines.

You can respond to threats by following actionable recommendations or by exporting logs to your SIEM system for further investigation. For information about using a SIEM system with Google Cloud, see Security log analytics in Google Cloud.

Security Command Center also provides multiple detectors that help you analyze the security of your infrastructure. These detectors include the following:

Other Google Cloud services, such as Google Cloud Armor logs, also provide findings for display in Security Command Center.

Enable the services that you need for your workloads, and then only monitor and analyze important data. For more information about enabling logging on services, see the enable logs section in Security log analytics in Google Cloud.

Monitor for threats

Event Threat Detection is an optional managed service of Security Command Center Premium that detects threats in your log stream. By using Event Threat Detection, you can detect high-risk and costly threats such as malware, cryptomining, unauthorized access to Google Cloud resources, DDoS attacks, and brute-force SSH attacks. Using the tool's features to distill volumes of log data, your security teams can quickly identify high-risk incidents and focus on remediation.

To help detect potentially compromised user accounts in your organization, use the Sensitive Actions Cloud Platform logs to identify when sensitive actions are taken and to confirm that valid users took those actions for valid purposes. A sensitive action is an action, such as the addition of a highly privileged role, that could be damaging to your business if a malicious actor took the action. Use Cloud Logging to view, monitor, and query the Sensitive Actions Cloud Platform logs. You can also view the sensitive action log entries with the Sensitive Actions Service, a built-in service of Security Command Center Premium.

Chronicle can store and analyze all of your security data centrally. To help you see the entire span of an attack, Chronicle can map logs into a common model, enrich them, and then link them together into timelines. Furthermore, you can use Chronicle to create detection rules, set up indicators of compromise (IoC) matching, and perform threat-hunting activities. You write your detection rules in the YARA-L language. For sample threat detection rules in YARA-L, see the Community Security Analytics (CSA) repository. In addition to writing your own rules, you can take advantage of curated detections in Chronicle. These curated detections are a set of predefined and managed YARA-L rules that can help you identify threats.

Another option for centralizing your logs for security analysis, audit, and investigation is to use BigQuery. In BigQuery, you can monitor common threats or misconfigurations by using SQL queries (such as those in the CSA repository) to analyze permission changes, provisioning activity, workload usage, data access, and network activity. For more information about security log analytics in BigQuery, from setup through analysis, see Security log analytics in Google Cloud.

The following diagram shows how to centralize your monitoring by using both the built-in threat detection capabilities of Security Command Center and the threat detection that you do in BigQuery, Chronicle, or a third-party SIEM.

How the various security analytics tools and content interact in Google Cloud.

As shown in the diagram, there are a variety of security data sources that you should monitor. These data sources include logs from Cloud Logging, asset changes from Cloud Asset Inventory, Google Workspace logs, and events from a hypervisor or a guest kernel. You can use Security Command Center to monitor these data sources; this monitoring occurs automatically, provided that you've enabled the appropriate features and threat detectors in Security Command Center. The diagram also shows that you can monitor for threats by exporting security data and Security Command Center findings to an analytics tool such as BigQuery, Chronicle, or a third-party SIEM. In your analytics tool, you can perform further analysis and investigation by using and extending queries and rules like those available in CSA.

What's next

Learn more about logging and detection with the following resources:

Google Cloud Architecture Framework: Reliability

This category in the Google Cloud Architecture Framework shows you how to architect and operate reliable services on a cloud platform. You also learn about some of the Google Cloud products and features that support reliability.

The Architecture Framework describes best practices, provides implementation recommendations, and explains some of the available products and services. The framework aims to help you design your Google Cloud deployment so that it best matches your business needs.

To run a reliable service, your architecture must include the following:

  • Measurable reliability goals, with deviations that you promptly correct.
  • Design patterns for scalability, high availability, disaster recovery, and automated change management.
  • Components that self-heal where possible, and code that includes instrumentation for observability.
  • Operational procedures that run the service with minimal manual work and cognitive load on operators, and that let you rapidly detect and mitigate failures.

Reliability is the responsibility of everyone in engineering, such as the development, product management, operations, and site reliability engineering (SRE) teams. Everyone must be accountable and understand their application's reliability targets, and risk and error budgets. Teams should be able to prioritize work appropriately and escalate priority conflicts between reliability and product feature development.

In the reliability category of the Architecture Framework, you learn to do the following:

Reliability principles

This document in the Architecture Framework explains some of the core principles to run reliable services on a cloud platform. These principles help to create a common understanding as you read additional sections of the Architecture Framework that show you how some of the Google Cloud products and features support reliable services.

Key terminology

There are several common terms associated with reliability practices. These terms might already be familiar to many readers. However, for a refresher, see the detailed descriptions on the Terminology page.

Core principles

Google's approach to reliability is based on the following core principles.

Reliability is your top feature

New product features are sometimes your top priority in the short term. However, reliability is your top product feature in the long term, because if the product is too slow or is unavailable over a long period of time, your users might leave, making other product features irrelevant.

Reliability is defined by the user

For user-facing workloads, measure the user experience. The user must be happy with how your service performs. For example, measure the success ratio of user requests, not just server metrics like CPU usage.

For batch and streaming workloads, you might need to measure key performance indicators (KPIs) for data throughput, such as rows scanned per time window, instead of server metrics such as disk usage. Throughput KPIs can help ensure a daily or quarterly report required by the user finishes on time.

100% reliability is the wrong target

Your systems should be reliable enough that users are happy, but not excessively reliable such that the investment is unjustified. Define SLOs that set the reliability threshold you want, then use error budgets to manage the appropriate rate of change.

Apply the design and operational principles in this framework to a product only if the SLO for that product or application justifies the cost.

Reliability and rapid innovation are complementary

Use error budgets to achieve a balance between system stability and developer agility. The following guidance helps you determine when to move fast or slow:

  • When an adequate error budget is available, you can innovate rapidly and improve the product or add product features.
  • When the error budget is diminished, slow down and focus on reliability features.

Design and operational principles

To maximize system reliability, the following design and operational principles apply. Each of these principles is discussed in detail in the rest of the Architecture Framework reliability category.

Define your reliability goals

The best practices covered in this section of the Architecture Framework include the following:

  • Choose appropriate SLIs.
  • Set SLOs based on the user experience.
  • Iteratively improve SLOs.
  • Use strict internal SLOs.
  • Use error budgets to manage development velocity.

For more information, see Define your reliability goals in the Architecture Framework reliability category.

Build observability into your infrastructure and applications

The following design principle is covered in this section of the Architecture Framework:

  • Instrument your code to maximize observability.

For more information, see Build observability into your infrastructure and applications in the Architecture Framework reliability category.

Design for scale and high availability

The following design principles are covered in this section of the Architecture Framework:

  • Create redundancy for higher availability.
  • Replicate data across regions for disaster recovery.
  • Design a multi-region architecture for resilience to regional outages.
  • Eliminate scalability bottlenecks.
  • Degrade service levels gracefully when overloaded.
  • Prevent and mitigate traffic spikes.
  • Sanitize and validate inputs.
  • Fail safe in a way that preserves system function.
  • Design API calls and operational commands to be retryable.
  • Identify and manage system dependencies.
  • Minimize critical dependencies.
  • Ensure that every change can be rolled back.

For more information, see Design for scale and high availability in the Architecture Framework reliability category.

Create reliable operational processes and tools

The following operational principles are covered in this section of the Architecture Framework:

  • Choose good names for applications and services.
  • Implement progressive rollouts with canary testing procedures.
  • Spread out traffic for timed promotion and launches.
  • Automate the build, test, and deployment process.
  • Defend against operator error.
  • Test failure recovery procedures.
  • Conduct disaster recovery tests.
  • Practice chaos engineering.

For more information, see Create reliable operational processes and tools in the Architecture Framework reliability category.

Build efficient alerts

The following operational principles are covered in this section of the Architecture Framework:

  • Optimize alert delays.
  • Alert on symptoms, not causes.
  • Alert on outliers, not averages.

For more information, see Build efficient alerts in the Architecture Framework reliability category.

Build a collaborative incident management process

The following operational principles are covered in this section of the Architecture Framework:

  • Assign clear service ownership.
  • Reduce time to detect (TTD) with well tuned alerts.
  • Reduce time to mitigate (TTM) with incident management plans and training.
  • Design dashboard layouts and content to minimize TTM.
  • Document diagnostic procedures and mitigation for known outage scenarios.
  • Use blameless postmortems to learn from outages and prevent recurrences.

For more information, see Build a collaborative incident management process in the Architecture Framework reliability category.

What's next

Explore other categories in the Architecture Framework such as system design, operational excellence, security, privacy, and compliance.

Define your reliability goals

This document in the Google Cloud Architecture Framework provides best practices to define appropriate ways to measure the customer experience of your services so you can run reliable services. You learn how to iterate on the service level objectives (SLOs) you define, and use error budgets to know when reliability might suffer if you release additional updates.

Choose appropriate SLIs

It's important to choose appropriate service level indicators (SLIs) to fully understand how your service performs. For example, if your application has a multi-tenant architecture that is typical of SaaS applications used by multiple independent customers, capture SLIs at a per-tenant level. If your SLIs are measured only at a global aggregate level, you might miss critical problems in your application that affect a single important customer or a minority of customers. Instead, design your application to include a tenant identifier in each user request, then propagate that identifier through each layer of the stack. This identifier lets your monitoring system aggregate statistics at the per-tenant level at every layer or microservice along the request path.

The type of service you run also determines what SLIs to monitor, as shown in the following examples.

Serving systems

The following SLIs are typical in systems that serve data:

  • Availability tells you the fraction of the time that a service is usable. It's often defined in terms of the fraction of well-formed requests that succeed, such as 99%.
  • Latency tells you how quickly a certain percentage of requests can be fulfilled. It's often defined in terms of a percentile other than 50th, such as "99th percentile at 300 ms".
  • Quality tells you how good a certain response is. The definition of quality is often service-specific, and indicates the extent to which the content of the response to a request varies from the ideal response content. The response quality could be binary (good or bad) or expressed on a scale from 0% to 100%.

Data processing systems

The following SLIs are typical in systems that process data:

  • Coverage tells you the fraction of data that has been processed, such as 99.9%.
  • Correctness tells you the fraction of output data deemed to be correct, such as 99.99%.
  • Freshness tells you how fresh the source data or the aggregated output data is, such as data that was updated within the last 20 minutes. Typically, the more recently the data was updated, the better.
  • Throughput tells you how much data is being processed, such as 500 MiB/sec or even 1000 requests per second (RPS).

Storage systems

The following SLIs are typical in systems that store data:

  • Durability tells you how likely it is that data written to the system can be retrieved in the future, such as 99.9999%. Any permanent data loss incident reduces the durability metric.
  • Throughput and latency are also common SLIs for storage systems.

Choose SLIs and set SLOs based on the user experience

One of the core principles in this Architecture Framework section is that reliability is defined by the user. Measure reliability metrics as close to the user as possible, such as the following options:

Set your SLO just high enough that almost all users are happy with your service, and no higher. Because of network connectivity or other transient client-side issues, your customers might not notice brief reliability issues in your application, allowing you to lower your SLO.

For uptime and other vital metrics, aim for a target lower than 100% but close to it. Service owners should objectively assess the minimum level of service performance and availability that would make most users happy, not just set targets based on external contractual levels.

The rate at which you change affects your system's reliability. However, the ability to make frequent, small changes helps you deliver features faster and with higher quality. Achievable reliability goals tuned to the customer experience help define the maximum pace and scope of changes (feature velocity) that customers can tolerate.

If you can't measure the customer experience and define goals around it, you can run a competitive benchmark analysis. If there's no comparable competition, measure the customer experience, even if you can't define goals yet. For example, measure system availability or the rate of meaningful and successful transactions to the customer. You can correlate this data with business metrics or KPIs such as the volume of orders in retail or the volume of customer support calls and tickets and their severity. Over a period of time, you can use such correlation exercises to get to a reasonable threshold of customer happiness. This threshold is your SLO.

Iteratively improve SLOs

SLOs shouldn't be set in stone. Revisit SLOs quarterly, or at least annually, and confirm that they continue to accurately reflect user happiness and correlate well with service outages. Make sure that they cover current business needs and new critical user journeys. Revise and augment your SLOs as needed after these periodic reviews.

Use strict internal SLOs

It's a good practice to have stricter internal SLOs than external SLAs. As SLA violations tend to require issuing a financial credit or customer refunds, you want to address problems before they have financial impact.

We recommend that you use these stricter internal SLOs together with a blameless postmortem process and incident reviews. For more information, see Build a collaborative incident management process in the Architecture Framework reliability category.

Use error budgets to manage development velocity

Error budgets tell you if your system is more or less reliable than is needed over a certain time window. Error budgets are calculated as 100% – SLO over a period of time, such as 30 days.

When you have capacity left in your error budget, you can continue to launch improvements or new features quickly. When the error budget is close to zero, freeze or slow down service changes and invest engineering resources to improve reliability features.
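The following sketch shows the arithmetic for an illustrative 99.9% SLO measured over a 30-day window; the request counts are made up.

```python
# Error budget arithmetic for a 99.9% SLO over a 30-day window (illustrative values).
slo = 0.999
total_requests = 50_000_000        # requests served in the window
failed_requests = 32_000           # requests that counted against the SLO

error_budget = 1 - slo                               # 0.1% of requests may fail
budget_requests = error_budget * total_requests      # 50,000 allowed failures
budget_consumed = failed_requests / budget_requests  # fraction of the budget spent

print(f"Allowed failures: {budget_requests:.0f}")
print(f"Budget consumed:  {budget_consumed:.0%}")    # 64% -> still room to launch
```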

Google Cloud Observability includes SLO monitoring to minimize the effort of setting up SLOs and error budgets. The services include a graphical user interface to help you to configure SLOs manually, an API for programmatic setup of SLOs, and built-in dashboards to track the error budget burn rate. For more information, see how to create an SLO.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:

  • Define and measure customer-centric SLIs, such as the availability or latency of the service.
  • Define a customer-centric error budget that's stricter than your external SLA. Include consequences for violations, such as production freezes.
  • Set up latency SLIs to capture outlier values, such as 90th or 99th percentile, to detect the slowest responses.
  • Review SLOs at least annually and confirm that they correlate well with user happiness and service outages.

What's next

Learn more about how to define your reliability goals with the following resources:

Explore other categories in the Architecture Framework such as system design, operational excellence, and security, privacy, and compliance.

Define SLOs

This document is Part 1 of two parts that show how teams that operate online services can begin to build and adopt a culture of Site Reliability Engineering (SRE) by using service level objectives (SLOs). An SLO is a target level of reliability for a service.

In software as a service (SaaS), a natural tension exists between the velocity of product development and operational stability. The more you change your system, the more likely it will break. Monitoring and observability tools can help you maintain confidence in your operational stability as you increase development velocity. However, although such tools—known also as application performance management (APM) tools—are important, one of the most important applications of these tools is in setting SLOs.

If defined correctly, an SLO can help teams make data-driven operational decisions that increase development velocity without sacrificing stability. The SLO can also align development and operations teams around a single agreed-to objective, which can alleviate the natural tension that exists between their objectives—creating and iterating products (development) and maintaining system integrity (operations).

SLOs are described in detail in The SRE Book and The SRE Workbook, alongside other SRE practices. This series attempts to simplify the process of understanding and developing SLOs to help you more easily adopt them. Once you have read and understood these articles, you can find more in the books.

This series aims to show you a clear path to implementing SLOs in your organization:

  • This document reviews what SLOs are and how to define them for your services.
  • Adopting SLOs covers different types of SLOs based on workload types, how to measure those SLOs, and how to develop alerts based on them.

This series is intended for SREs, operations teams, DevOps, systems administrators, and others who are responsible for the stability and reliability of an online service. It assumes that you understand how internet services communicate with web browsers and mobile devices, and that you have a basic understanding of how web services are monitored, deployed, and troubleshot.

The State of DevOps reports identified capabilities that drive software delivery performance. This series will help you with the following capabilities:

Why SLOs?

When you build a culture of SRE, why start with SLOs? In short, if you don't define a service level, it's difficult to measure whether your customers are happy with your service. Even if you know that you can improve your service, the lack of a defined service level makes it hard to determine where and how much to invest in improvements.

It can be tempting to develop separate SLOs for every service, user-facing or not. For instance, a common mistake is to measure two or more services separately—for example, a frontend service and a backend datastore—when the user relies on both services and isn't aware of the distinction. A better approach is to develop SLOs that are based on the product (the collection of services) and focus on the most important interactions that your users have with it.

Therefore, to develop an effective SLO, it's ideal that you understand your users' interactions with your service, which are called critical user journeys (CUJs). A CUJ considers the goals of your users, and how your users use your services to accomplish those goals. The CUJ is defined from the perspective of your customer without consideration for service boundaries. If the CUJ is met, the customer is happy, and happy customers are a key measurement of success for a service.

A key aspect to customer happiness with a service is a service's reliability. It doesn't matter what a service does if it's not reliable. Thus, reliability is the most critical feature of any service. A common metric for reliability is uptime, which conventionally means the amount of time a system has been up. However, we prefer a more helpful and precise metric: availability. Availability still answers the question of whether a system is up but in a more precise way than by measuring the time since a system was down. In today's distributed systems, services can be partially down, a factor that uptime doesn't capture well.

Availability is often described in terms of nines—such as 99.9% available (three nines), or 99.99% available (four nines). Measuring an availability SLO is one of the best ways to measure your system's reliability.

In addition to helping define operational success, an SLO can help you choose where to invest resources. For example, SRE books often note that each nine that you engineer for can result in an incremental cost with marginal utility. It is generally recognized that achieving the next nine in availability costs you ten times as much as the preceding one.
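To make the cost of each additional nine concrete, the following sketch converts a few availability targets into the downtime that each target permits over a 30-day month.

```python
# Downtime allowed per 30-day month for common availability targets.
MINUTES_PER_MONTH = 30 * 24 * 60

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed = (1 - target) * MINUTES_PER_MONTH
    print(f"{target:.3%} availability -> {allowed:7.1f} minutes of downtime per month")
```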

Choose an SLI

To determine if an SLO is met (that is, successful), you need a measurement. That measurement is called the service level indicator (SLI). An SLI measures the level of a particular service that you're delivering to your customer. Ideally, the SLI is tied to an accepted CUJ.

Select the best metrics

The first step in developing an SLI is to choose a metric to measure, such as requests per second, errors per second, queue length, the distribution of response codes during a given time period, or the number of bytes transmitted.

Such metrics tend to be of the following types:

  • Counter. For example, the number of errors that occurred up to a given point of measurement. This type of metric can increase but not decrease.
  • Distribution. For example, the number of events that populate a particular measurement segment for a given time period. You might measure how many requests take 0-10 ms to complete, how many take 11-30 ms, and how many take 31-100 ms. The result is a count for each bucket—for example, [0-10: 50], [11-30: 220], [31-100: 1103].
  • Gauge. For example, the actual value of a measurable part of the system (such as queue length). This type of metric can increase or decrease.

For more information about these types, see the Prometheus project documentation and the Cloud Monitoring metric types ValueType and MetricKind.

An important distinction about SLIs is that not all metrics are SLIs. In fact, the SRE Workbook states the following (emphasis added):

"...we generally recommend treating the SLI as the ratio of two numbers: the number of good events divided by the total number of events..."

"The SLI ranges from 0% to 100%, where 0% means nothing works, and 100% means nothing is broken. We have found this scale intuitive, and this style lends itself easily to the concept of an error budget."

Many software companies track hundreds or thousands of metrics; only a handful of metrics qualify as SLIs. So apart from being a ratio of good events to total events, what qualifies a metric as a good SLI? A good SLI metric has the following characteristics:

  • The metric directly relates to user happiness. Generally, users are unhappy if a service does not behave the way they expect it to, fails, or is slow. Any SLOs based on these metrics can be validated by comparing your SLI to other signals of user happiness—for example, the number of customer complaint tickets, support call volume, social media sentiment, or escalations. If your metric doesn't correspond to these other metrics of user happiness, it might not be a good metric to use as an SLI.
  • Metric deterioration correlates with outages. A metric that looks good during an outage is the wrong metric for an SLI. A metric that looks bad during normal operation is also the wrong metric for an SLI.
  • The metric provides a good signal-to-noise ratio. Any metric that results in a large number of false negatives or false positives is not a good SLI.
  • The metric scales monotonically, and approximately linearly, with customer happiness. As the metric improves, customer happiness improves.

Consider the graphs in the following diagram. Two metrics that might be used as SLIs for a service are graphed over time. The period when a service degrades is highlighted in red, and the period when a service is good is highlighted in blue.

Comparison of a bad SLI and a good SLI graphed over time.

In the case of the bad SLI, the user's unhappiness doesn't correspond directly with a negative event (such as service degradation, slowness, or an outage). Also, the SLI fluctuates independently of user happiness. With the good SLI, the SLI and user happiness correlate, the different happiness levels are clear, and there are far fewer irrelevant fluctuations.

Select the right number of metrics

Usually, a single service has multiple SLIs, especially if the service performs different types of work or serves different types of users. For example, separating read requests from write requests is a good idea, as these requests tend to act in different ways. In this case, it is best to select metrics appropriate to each service.

In contrast, many services perform similar types of work across the service, which can be directly comparable. For example, if you have an online marketplace, users might view a homepage, view a subcategory or a top-10 list, view a details page, or search for items. Instead of developing and measuring a separate SLI for each of these actions, you might combine them into a single SLI category—for example, browse services.

In reality, the expectations of a user don't change much between actions of a similar category. Their happiness is not dependent on the structure of the data they are browsing, whether the data is derived from a static list of promoted items or is the dynamically generated result of a machine learning-assisted search across a massive dataset. Their happiness is quantifiable by an answer to a question: "Did I see a full page of items quickly?"

Ideally, you want to use as few SLIs as possible to accurately represent the tolerances of a given service. Typically, a service should have between two and six SLIs. If you have too few SLIs, you can miss valuable signals. If you have too many SLIs, your on-call team has too much to track with only marginal added utility. Remember, SLIs should simplify your understanding of production health and provide a sense of coverage.

Choose an SLO

An SLO is composed of the following values:

  • An SLI. For example, the ratio of the number of responses with HTTP code 200 to the total number of responses.
  • A duration. The time period in which a metric is measured. This period can be calendar-based (for example, from the first day of one month to the first day of the next) or a rolling window (for example, the last 30 days).
  • A target. For example, a target percentage of good events to total events (such as 99.9%) that you expect to meet for a given duration.

As you develop an SLO, defining the duration and target can be difficult. One way to begin the process is to identify SLIs and chart them over time. If you can't decide what duration and target to use, remember that your SLO doesn't have to be perfect right away. You likely will iterate on your SLO to ensure that it aligns with customer happiness and meets your business needs. You might try starting with two nines (99.0%) for a month.
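For example, the following sketch evaluates a 99.0% target over a rolling 30-day window, given daily counts of good and total events. The counts and window length are illustrative.

```python
from collections import deque

TARGET = 0.990
WINDOW_DAYS = 30

# Rolling window of (good_events, total_events), one entry per day; values are illustrative.
window = deque(maxlen=WINDOW_DAYS)

def record_day(good: int, total: int) -> None:
    window.append((good, total))

def slo_met() -> bool:
    good = sum(g for g, _ in window)
    total = sum(t for _, t in window)
    return total == 0 or good / total >= TARGET

record_day(good=989_500, total=1_000_000)
print(slo_met())  # False: 98.95% is below the 99.0% target
```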

As you track SLO compliance during events such as deployments, outages, and daily traffic patterns, you can gain insights on what target is good, bad, or even tolerable. For example, in a background process, you might define 75% success as adequate. But for mission-critical, user-facing requests, you might aim for something more aggressive, like 99.95%.

Of course, there isn't a single SLO that you can apply for every use case. SLOs depend on several factors:

  • customer expectations
  • workload type
  • infrastructure for serving, execution, and monitoring
  • the problem domain

Part 2 in this series, Adopt SLOs, focuses on domain-independent SLOs. Domain-independent SLOs (such as service availability) do not replace high-level indicators (such as widgets sold per minute). However, they can help measure whether a service is working regardless of the business use case.

Domain-independent indicators can often be reduced to a question, such as "Is the service available?" or "Is the service fast enough?" The answer is most often found in an SLO that accounts for two factors: availability and latency. You might describe an SLO in the following terms, where X = 99.9% and Y = 800 ms:

X% of valid requests are successful and succeed faster than Y ms.

What's next?

Adopt SLOs

This document defines several service level objectives (SLOs) that are useful for different types of common service workloads. This document is Part 2 of two parts. Part 1, Define SLOs, introduces SLOs, shows how SLOs are derived from service level indicators (SLIs), and describes what makes a good SLO.

The State of DevOps reports identified capabilities that drive software delivery performance. These two documents will help you with the following capabilities:

What to measure

Regardless of your domain, many services share common features and can use generic SLOs. The following discussion about generic SLOs is organized by service type and provides detailed explanations of SLIs that apply to each SLO.

Request-driven services

A request-driven service receives a request from a client (another service or a user), performs some computation, possibly sends network requests to a backend, and then returns a response to the client. Request-driven services are most often measured by availability and latency SLIs.

Availability as an SLI

The SLI for availability indicates whether the service is working. The SLI for availability is defined as follows:

The proportion of valid requests served successfully.

You first have to define valid. Some basic definitions might be "not zero-length" or "adheres to a client-server protocol," but it is up to a service owner to define what they mean by valid. A common method to gauge validity is to use an HTTP (or RPC) response code. For example, we often consider HTTP 500 errors to be server errors that count against an SLO, while 400 errors are client errors that do not.

After you decide what to measure, you need to examine every response code returned by your system to ensure that the application uses those codes properly and consistently. When using error codes for SLOs, it's important to ask whether a code is an accurate indicator of your users' experience of your service. For example, if a user attempts to order an item that is out of stock, does the site break and return an error message, or does the site suggest similar products? For use with SLOs, error codes need to be tied to users' expectations.

Developers can misuse errors. In the case where a user asks for a product that is temporarily out of stock, a developer might mistakenly program an error to be returned. However, the system is actually functioning correctly and not in error. The code needs to return as a success, even though the user could not purchase the item they wanted. Of course, the owners of this service need to know that a product is out of stock, but the inability to make a sale is not an error from the customer's perspective and should not count against an SLO. However, if the service cannot connect to the database to determine if the item is in stock, that is an error that counts against your error budget.
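The following sketch shows that classification, under the assumption that HTTP status codes accurately reflect the user experience: 5xx responses count against the SLO, 4xx responses are treated as client errors that don't, and the SLI is the ratio of good responses to valid responses.

```python
def availability_sli(status_codes: list[int]) -> float:
    """Ratio of good responses to valid responses.

    Assumes 5xx responses count against the SLO and 4xx responses are
    client errors that don't (see the out-of-stock discussion above).
    """
    valid = [code for code in status_codes if 200 <= code < 600]
    good = [code for code in valid if code < 500]
    return len(good) / len(valid) if valid else 1.0

# Illustrative sample of response codes.
print(availability_sli([200, 200, 404, 500, 200, 503, 200]))  # ~0.71
```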

Your service might be more complex. For example, perhaps your service handles asynchronous requests or provides a long-running process for customers. In these cases, you might expose availability in another way. However, we recommend that you still represent availability as the proportion of valid requests that are successful. You might define availability as the number of minutes that a customer's workload is running as requested. (This approach is sometimes referred to as the "good minutes" method of measuring availability.) In the case of a virtual machine, you could measure availability in terms of the proportion of minutes after an initial request for a VM that the VM is accessible through SSH.

Latency as an SLI

The SLI for latency (sometimes called speed) indicates whether the service is fast enough. The SLI for latency is defined similarly to availability:

The proportion of valid requests served faster than a threshold.

You can measure latency by calculating the difference between when a timer starts and when it stops for a given request type. The key is a user's perception of latency. A common pitfall is to be too precise in measuring latency. In reality, users cannot distinguish between a 100-millisecond (ms) and a 300-ms refresh and might accept any point between 300 ms and 1000 ms.

Instead, it's a good idea to develop activity-centric metrics that keep the user in focus, for example, in the following processes:

  • Interactive: 1000 ms for the time that a user waits for a result after clicking an element.
  • Write: 1500 ms for changing an underlying distributed system. While this length of time is considered slow for a system, users tend to accept it. We recommend that you explicitly distinguish between writes and reads in your metrics.
  • Background: 5000 ms for an action that is not user-visible, like a periodic refresh of data or other asynchronous requests.

Latency is commonly measured as a distribution (see Choosing an SLI in Part 1 of this series). Given a distribution, you can measure various percentiles. For example, you might measure the number of requests that are slower than the historical 99th percentile. In this case, we consider good events to be events that are faster than this threshold, which was set by examining the historical distribution. You can also set this threshold based on product requirements. You can even set multiple latency SLOs, for example typical latency versus tail latency.

We recommend that you do not use only the average (or median) latency as your SLI. Discovering that the median latency is too slow means that half your users are already unhappy. In other words, you can have bad latency for days before you discover a real threat to your long-term error budget. Therefore, we recommend that you define your SLO for tail latency (95th percentile) and for median latency (50th percentile).
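
As a small sketch of these calculations, the following Python example (synthetic latencies, nearest-rank percentiles) derives median and tail percentiles and the proportion of requests served faster than a threshold:

# Sketch: compute latency percentiles and a threshold-based latency SLI
# from a sample of request latencies (values in milliseconds, synthetic).

def percentile(sorted_values, p):
    """Return the value at percentile p (0-100) using nearest-rank."""
    index = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
    return sorted_values[index]

latencies_ms = sorted([120, 180, 240, 95, 310, 150, 900, 200, 175, 260])

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

threshold_ms = 300
good = sum(1 for l in latencies_ms if l < threshold_ms)
sli = good / len(latencies_ms)  # proportion of requests faster than the threshold

print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms, SLI={sli:.2%}")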

In the ACM article Metrics That Matter, Benjamin Treynor Sloss writes the following:

"A good practical rule of thumb ... is that the 99th-percentile latency should be no more than three to five times the median latency."

Treynor Sloss continues:

"We find the 50th-, 95th-, and 99th-percentile latency measures for a service are each individually valuable, and we will ideally set SLOs around each of them."

A good model to follow is to determine your latency thresholds based on historical percentiles, then measure how many requests fall into each bucket. For more details, see the section on latency alerts later in this document.

Quality as an SLI

Quality is a helpful SLI for complex services that are designed to fail gracefully by degrading when dependencies are slow or unavailable. The SLI for quality is defined as follows:

The proportion of valid requests served without degradation of service.

For example, a web page might load its main content from one datastore and load ancillary, optional assets from 100 other services and datastores. If one optional service is out of service or too slow, the page can still be rendered without the ancillary elements. By measuring the number of requests that are served a degraded response (that is, a response missing at least one backend service's response), you can report the ratio of requests that were bad. You might even track how many responses to the user were missing a response from a single backend, or were missing responses from multiple backends.
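
The following is a minimal sketch of such a quality SLI, assuming each response records which optional backends failed to contribute (the field names are hypothetical):

# Sketch: compute a quality SLI as the proportion of responses served
# without degradation. Each response records which optional backends
# were missing from the rendered page.

responses = [
    {"missing_backends": []},               # fully rendered
    {"missing_backends": ["ads"]},          # degraded: one backend missing
    {"missing_backends": []},
    {"missing_backends": ["ads", "recs"]},  # degraded: two backends missing
]

undegraded = sum(1 for r in responses if not r["missing_backends"])
quality_sli = undegraded / len(responses)
print(f"Quality SLI: {quality_sli:.0%}")  # 2 of 4 responses => 50%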

Data processing services

Some services are not built to respond to user requests but instead consume data from an input, process that data, and generate an output. How these services perform at intermediate steps is not as important as the final result. With services like these, your strongest SLIs are freshness, coverage, correctness, and throughput, not latency and availability.

Freshness as an SLI

The SLI for freshness is defined as follows:

The proportion of valid data updated more recently than a threshold.

In batch processing systems, for example, freshness can be measured as the time elapsed since a processing run completed successfully for a given output. In more complex or real-time processing systems, you might track the age of the most-recent record processed in a pipeline.

For example, consider an online game that generates map tiles in real time. Users might not notice how quickly map tiles are created, but they might notice when map data is missing or is not fresh.

Or, consider a system that reads records from an in-stock tracking system to generate the message "X items in stock" for an ecommerce website. You might define the SLI for freshness as follows:

The percentage of views that used stock information that was refreshed within the last minute.

You can also use a metric for serving non-fresh data to inform the SLI for quality.

Coverage as an SLI

The SLI for coverage is defined as follows:

The proportion of valid data processed successfully.

To define coverage, you first determine whether to accept an input as valid or to skip it. For example, if an input record is corrupted or zero-length and cannot be processed, you might consider that record as invalid for measuring your system.

Next, you count the number of your valid records. You might do this step with a simple count() method or another method. This number is your total record count.

Finally, to generate your SLI for coverage, you count the number of records that processed successfully and compare that number against the total valid record count.
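
The following minimal sketch (synthetic records and hypothetical field names) walks through those steps: filter out invalid records, count the valid ones, and compare them against the successfully processed count:

# Sketch: compute a coverage SLI. A record is treated as valid if it is
# non-empty and not corrupted; field names are hypothetical.

records = [
    {"payload": "a", "corrupted": False, "processed_ok": True},
    {"payload": "",  "corrupted": False, "processed_ok": False},  # invalid: empty
    {"payload": "b", "corrupted": True,  "processed_ok": False},  # invalid: corrupted
    {"payload": "c", "corrupted": False, "processed_ok": False},  # valid but failed
    {"payload": "d", "corrupted": False, "processed_ok": True},
]

valid = [r for r in records if r["payload"] and not r["corrupted"]]
processed = [r for r in valid if r["processed_ok"]]
coverage_sli = len(processed) / len(valid)
print(f"Coverage: {coverage_sli:.0%}")  # 2 of 3 valid records => 67%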

Correctness as an SLI

The SLI for correctness is defined as follows:

The proportion of valid data that produced correct output.

In some cases, there are methods of determining the correctness of an output that can be used to validate the processing of the output. For example, a system that rotates or colorizes an image should never produce a zero-byte image, or an image with a length or width of zero. It is important to separate this validation logic from the processing logic itself.

One method of measuring a correctness SLI is to use known-good test input data, which is data that has a known correct output. The input data needs to be representative of user data. In other cases, it is possible that a mathematical or logical check might be made against the output, like in the preceding example of rotating an image. Another example might be a billing system that determines if a transaction is valid by checking whether the difference between the balance before the transaction and the balance after the transaction matches the value of the transaction itself.

Throughput as an SLI

The SLI for throughput is defined as follows:

The proportion of time where the data processing rate was faster than a threshold.

In a data processing system, throughput is often more representative of user happiness than, for example, a single latency measurement for a given piece of work. For example, if the size of each input varies dramatically, it might not make sense to compare how long each element takes to finish if a job progresses at an acceptable rate.

Bytes per second is a common way to measure the amount of work it takes to process data regardless of the size of a dataset. But any metric that roughly scales linearly with respect to the cost of processing can work.

It might be worthwhile to partition your data processing systems based upon expected throughput rates, or implement a quality of service system to ensure that high-priority inputs are handled and low-priority inputs are queued. Either way, measuring throughput as defined in this section can help you determine if your system is working as expected.
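
For illustration, here is a minimal sketch of this throughput SLI, using synthetic per-minute byte counts and an assumed threshold:

# Sketch: compute a throughput SLI as the proportion of time windows in
# which the processing rate exceeded a threshold (synthetic data).

bytes_per_minute = [9e8, 1.2e9, 1.5e9, 4e8, 1.1e9, 1.3e9]  # one value per minute
threshold_bytes_per_minute = 1e9

good_minutes = sum(1 for b in bytes_per_minute if b > threshold_bytes_per_minute)
throughput_sli = good_minutes / len(bytes_per_minute)
print(f"Throughput SLI: {throughput_sli:.0%}")  # 4 of 6 minutes => 67%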

Scheduled execution services

For services that need to perform an action at a regular interval, such as Kubernetes cron jobs, you can measure skew and execution duration. The following is a sample scheduled Kubernetes cron job:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "0 * * * *"

Skew as an SLI

As an SLI, skew is defined as follows:

The proportion of executions that start within an acceptable window of the expected start time.

Skew measures the time difference between when a job is scheduled to start and when it does start. For example, if the preceding Kubernetes cron job, which is set up to start at minute zero of every hour, starts at three minutes past the hour, then the skew is three minutes. When a job runs early, you have a negative skew.

You can measure skew as a distribution over time, with corresponding acceptable ranges that define good skew. To determine the SLI, you would compare the number of runs that were within a good range.
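
The following is a minimal sketch of a skew SLI for an hourly job, using synthetic scheduled and actual start times in epoch seconds:

# Sketch: compute a skew SLI for a job scheduled at minute zero of every
# hour. Each entry is (scheduled_start, actual_start) in epoch seconds.

runs = [
    (3600, 3605),    # 5 seconds late
    (7200, 7380),    # 3 minutes late
    (10800, 10795),  # 5 seconds early (negative skew)
    (14400, 14400),  # on time
]

acceptable_skew_seconds = 60  # good if within one minute of the schedule

good_runs = sum(
    1 for scheduled, actual in runs
    if abs(actual - scheduled) <= acceptable_skew_seconds
)
skew_sli = good_runs / len(runs)
print(f"Skew SLI: {skew_sli:.0%}")  # 3 of 4 runs => 75%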

Execution duration as an SLI

As an SLI, execution duration is defined as follows:

The proportion of executions that complete within the acceptable duration window.

Execution duration is the time a job takes to complete. For a given execution, a common failure mode is for actual duration to exceed scheduled duration.

One interesting case is how to apply this SLI to catch a never-ending job. Because these jobs don't finish, you need to record the time spent on a given job instead of waiting for a job to complete. This approach provides an accurate distribution of how long work takes to complete, even in worst-case scenarios.

As with skew, you can track execution duration as a distribution and define acceptable upper and lower bounds for good events.
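
The following is a minimal sketch of that approach, using synthetic timestamps: elapsed time is measured for running jobs (including a job that never finishes) instead of waiting for completion:

# Sketch: compute an execution-duration SLI that also accounts for jobs
# that are still running, by measuring elapsed time so far rather than
# waiting for completion.

import time

now = time.time()
jobs = [
    {"start": now - 120, "end": now - 60},   # completed in 60 s
    {"start": now - 900, "end": now - 100},  # completed in 800 s
    {"start": now - 5000, "end": None},      # still running: 5000 s so far
]

acceptable_duration_seconds = 600

def duration(job):
    end = job["end"] if job["end"] is not None else now
    return end - job["start"]

good = sum(1 for j in jobs if duration(j) <= acceptable_duration_seconds)
duration_sli = good / len(jobs)
print(f"Execution duration SLI: {duration_sli:.0%}")  # 1 of 3 jobs => 33%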

Types of metrics for other systems

Many other workloads have their own metrics that you can use to generate SLIs and SLOs. Consider the following examples:

  • Storage systems: durability, throughput, time to first byte, blob availability
  • Media/video: client playback continuity, time to start playback, transcode graph execution completeness
  • Gaming: time to match active players, time to generate a map

How to measure

After you know what you're measuring, you can decide how to take the measurement. You can gather your SLIs in several ways.

Server-side logging

Method to generate SLIs

Processing server-side logs of requests or processed data.

Considerations

Advantages:

  • Existing logs can be reprocessed to backfill historical SLI records.
  • Cross-service session identifiers can reconstruct complex user journeys across multiple services.

Disadvantages:

  • Requests that do not arrive at the server are not recorded.
  • Requests that cause a server to crash might not be recorded.
  • Length of time to process logs can result in stale SLIs, which might be inadequate data for an operational response.
  • Writing code to process logs can be an error-prone, time-consuming task.

Application server metrics

Method to generate SLIs

Exporting SLI metrics from the code that serves requests from users or processes their data.

Considerations

Advantage:

  • Adding new metrics to code is typically fast and inexpensive.

Disadvantages:

  • Requests that do not arrive at application servers are not recorded.
  • Multi-service requests might be hard to track.

Frontend infrastructure metrics

Method to generate SLIs

Utilizing metrics from the load-balancing infrastructure (for example, Google Cloud's global Layer 7 load balancer).

Considerations

Advantages:

  • Metrics and historical data often already exist, thus reducing the engineering effort to get started.
  • Measurements are taken at the point nearest the customer yet still within the serving infrastructure.

Disadvantages:

  • Not viable for data processing SLIs.
  • Can only approximate multi-request user journeys.

Synthetic clients or data

Method to generate SLIs

Building a client that sends fabricated requests at regular intervals and validates the responses. For data processing pipelines, creating synthetic known-good input data and validating outputs.

Considerations

Advantages:

  • Measures all steps of a multi-request user journey.
  • Sending requests from outside your infrastructure captures more of the overall request path in the SLI.

Disadvantages:

  • Approximates user experience with synthetic requests, which might be misleading (both false positives and false negatives).
  • Covering all corner cases is hard and can devolve into integration testing.
  • High reliability targets require frequent probing for accurate measurement.
  • Probe traffic can drown out real traffic.

Client instrumentation

Method to generate SLIs

Adding observability features to the client that the user interacts with, and logging events back to your serving infrastructure that tracks SLIs.

Considerations

Advantages:

  • Provides the most accurate measure of user experience.
  • Can quantify reliability of third parties, for example, CDN or payments providers.

Disadvantages:

  • The latency of client log ingestion and processing makes these SLIs unsuitable for triggering an operational response.
  • SLI measurements will contain a number of highly variable factors potentially outside of direct control.
  • Building instrumentation into the client can involve lots of engineering work.

Choose a measurement method

Ideally, you need to choose a measurement method that most closely aligns with your customer's experience of your service and demands the least effort on your part. To achieve this ideal, you might need to use a combination of the methods in the preceding tables. Here is a suggested approach that you can implement over time, listed in order of increasing effort:

  1. Using application server exports and infrastructure metrics. Typically, you can access these metrics immediately, and they quickly provide value. Some APM tools include built-in SLO tooling.
  2. Using client instrumentation. Because legacy systems typically lack built-in, end-user client instrumentation, setting up instrumentation might require a significant investment. However, if you use an APM suite or frontend framework that provides client instrumentation, you can quickly gain insight into your customer's happiness.
  3. Using logs processing. If you cannot implement server exports or client instrumentation but logs exist, you might find logs processing to be your best value. Another approach is to combine exports and logs processing, using exports as an immediate source for some SLIs (such as immediate availability) and logs processing for long-term signals (such as the slow-burn alerts discussed later in the SLOs and alerts guidance).
  4. Implementing synthetic testing. After you have a basic understanding of how your customers use your service, you can test your service level. For example, you can seed test accounts with known-good data and query for it. This testing can help highlight failure modes that aren't easily observed, such as in the case of low-traffic services.

Set your objectives

One of the best ways to set objectives is to create a shared document that describes your SLOs and how you developed them. Your team can iterate on the document as it implements and iterates on the SLOs over time.

We recommend that business owners, product owners, and executives review this document. Those stakeholders can offer insights about service expectations and your product's reliability tradeoffs.

For your company's most important critical user journeys (CUJs), here is a template for developing an SLO:

  1. Choose an SLI specification (for example, availability or freshness).
  2. Define how to implement the SLI specification.
  3. Read through your plan to ensure that your CUJs are covered.
  4. Set SLOs based on past performance or business needs.

CUJs should not be constrained to a single service, nor to a single development team or organization. If your users depend on hundreds of microservices that each operate at 99.5% availability, but nobody tracks end-to-end availability, your customers are likely unhappy.

Suppose that you have a query that depends on five services that work in sequence: a load balancer, a frontend, a mixer, a backend, and a database.

If each component has a 99.5% availability, the worst-case user-facing availability is as follows:

99.5% * 99.5% * 99.5% * 99.5% * 99.5% = 97.52%

This is the worst-case user-facing availability because the overall system fails if any one of the five services fails. This would only be true if all layers of the stack must always be immediately available to handle each user request, without any resilience factors such as intermediate retries, caches, or queues. A system with such tight coupling between services is a bad design and defies the microservices model.
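
The worst-case figure is simply the product of the component availabilities. The following is a minimal sketch of that calculation, assuming no retries, caches, or queues between components:

# Sketch: worst-case availability of five serially dependent services,
# assuming every request needs all five components to succeed.

component_availability = 0.995
components = 5

worst_case = component_availability ** components
print(f"Worst-case user-facing availability: {worst_case:.2%}")  # ~97.52%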

Simply measuring performance against the SLO of a distributed system in this piecemeal manner (service by service) doesn't accurately reflect your customer's experience and might result in an overly sensitive interpretation.

Instead, you should measure performance against the SLO at the frontend to understand what users experience. The user does not care if a component service fails, causing a query to be automatically and successfully retried, if the user's query still succeeds. If you have shared internal services, these services can separately measure performance against their SLOs, with the user-facing services acting as their customers. You should handle these SLOs separately from each other.

It is possible to build a highly available service (for example, 99.99%) on top of a less-available service (for example, 99.9%) by using resilience factors such as smart retries, caching, and queueing.

As a general rule, anyone with a working knowledge of statistics should be able to read and understand your SLO without understanding your underlying service or organizational layout.

Example SLO worksheet

When you develop your SLO, remember to do the following:

  • Make sure that your SLIs specify an event, a success criterion, and where and how you record success or failure.
  • Define the SLI specification in terms of the proportion of events that are good.
  • Make sure that your SLO specifies both a target level and a measurement window.
  • Describe the advantages and disadvantages of your approach so that interested parties understand the tradeoffs and subtleties involved.

For example, consider the following SLO worksheet.

CUJ: Home page load

SLI type: Latency

SLI specification: Proportion of home page requests served in less than 100 ms

SLI implementations:

  • Proportion of home page requests served in less than 100 ms as measured from the latency column of the server log. (Disadvantage: This measurement misses requests that fail to reach the backend.)
  • Proportion of home page requests served in less than 100 ms as measured by probers that execute JavaScript in a browser running in a virtual machine. (Advantages and disadvantages: This measurement catches errors when requests cannot reach the network but might miss issues affecting only a subset of users.)

SLO: 99% of home page requests in the past 28 days served in less than 100 ms

What's next

Terminology

This document provides common definitions for SLO-related terms used in the Google Cloud Architecture Framework: Reliability section.

  • service level: a measurement of how well a given service performs its expected work for the user. You can describe this measurement in terms of user happiness and measure it by various methods, depending on what the service does and what the user expects it to do or is told it can do.

    Example: "A user expects our service to be available and fast."

  • critical user journey (CUJ): a set of interactions a user has with a service to achieve a single goal—for example, a single click or a multi-step pipeline.

    Example: "A user clicks the checkout button and waits for the response that the cart is processed and a receipt is returned."

  • service level indicator (SLI): a gauge of user happiness that can be measured quantitatively for a service level. In other words, to measure a service level, you must measure an indicator that represents user happiness with that service level—for example, a service's availability. An SLI can be thought of as a line on a graph that changes over time, as the service improves or degrades. It tends to be a ratio of "good" to "total" events, expressed as a unitless percentage. By consistently using these percentages, teams can understand the SLI without deep knowledge of its implementation.

    Example: "Measure the number of successful requests in the last 10 minutes divided by the number of all valid requests in the last 10 minutes."

  • service level objective (SLO): the level that you expect a service to achieve most of the time and against which an SLI is measured.

    Example: "Service responses are be faster than 400 milliseconds (ms) for 95% of all valid requests measured over 14 days."

    (Diagram: the relationship between SLOs and SLIs.)

  • service level agreement (SLA): a description of what must happen if an SLO is not met. Generally, an SLA is a legal agreement between providers and customers and might even include terms of compensation. In technical discussions about SRE, this term is often avoided.

    Example: "If the service does not provide 99.95% availability over a calendar month, the service provider compensates the customer for every minute out of compliance."

  • error budget: how much time or how many negative events you can withstand before you violate your SLO. This measurement tells you how many errors your business can expect or tolerate. The error budget is critical in helping you make potentially risky decisions.

    Example: "If our SLO is 99.9% available, we allow 0.1% of our requests to serve errors, either through incidents, accidents, or experimentation."

SLOs and alerts

This document in the Google Cloud Architecture Framework: Reliability section provides details about alerting around SLOs.

A mistaken approach to introducing a new observability system like SLOs is to use the system to completely replace an earlier system. Rather, you should see SLOs as a complementary system. For example, instead of deleting your existing alerts, we recommend that you run them in parallel with the SLO alerts introduced here. This approach lets you discover which legacy alerts are predictive of SLO alerts, which alerts fire in parallel with your SLO alerts, and which alerts never fire.

A tenet of SRE is to alert based on symptoms, not on causes. SLOs are, by their very nature, measurements of symptoms. As you adopt SLO alerts, you might find that the symptom alert fires alongside other alerts. If you discover that your legacy, cause-based alerts fire with no SLO or symptoms, these are good candidates to be turned off entirely, turned into ticketing alerts, or logged for later reference.

For more information, see SRE Workbook, Chapter 5.

SLO burn rate

An SLO's burn rate is a measurement of how quickly an outage exposes users to errors and depletes the error budget. By measuring your burn rate, you can determine the time until a service violates its SLO. Alerting based on the SLO burn rate is a valuable approach. Remember that your SLO is based on a duration, which might be quite long (weeks or even months). However, the goal is to quickly detect a condition that results in an SLO violation before that violation actually occurs.

The following table shows the time it takes to exceed an objective if 100% of requests are failing for the given interval, assuming queries per second (QPS) is constant. For example, if you have a 99.9% SLO measured over 30 days, you can withstand 43.2 minutes of full downtime during those 30 days. That downtime can occur all at once or be spread over several incidents.

Objective   90 days       30 days       7 days        1 day
90%         9 days        3 days        16.8 hours    2.4 hours
99%         21.6 hours    7.2 hours     1.7 hours     14.4 minutes
99.9%       2.2 hours     43.2 minutes  10.1 minutes  1.4 minutes
99.99%      13 minutes    4.3 minutes   1 minute      8.6 seconds
99.999%     1.3 minutes   25.9 seconds  6 seconds     0.9 seconds
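
As a quick check of the values in the preceding table, the following sketch computes the allowed full-downtime budget for an SLO target over a measurement window (plain arithmetic, not tied to any monitoring API):

# Sketch: allowed full-downtime budget for an SLO over a window.

def downtime_budget_minutes(slo_target, window_days):
    error_budget_fraction = 1 - slo_target
    return window_days * 24 * 60 * error_budget_fraction

print(f"{downtime_budget_minutes(0.999, 30):.1f} minutes")       # 43.2 minutes for 99.9% over 30 days
print(f"{downtime_budget_minutes(0.99, 7):.1f} minutes")         # 100.8 minutes (about 1.7 hours) for 99% over 7 days
print(f"{downtime_budget_minutes(0.9999, 1) * 60:.1f} seconds")  # 8.6 seconds for 99.99% over 1 day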

In practice, you can't afford any 100%-outage incidents if you want to achieve high success percentages. However, many distributed systems can partially fail or degrade gracefully. Even in those partial-failure cases, you still want to know whether a human needs to step in, and SLO alerts give you a way to determine that.

When to alert

An important question is when to act based on your SLO burn rate. As a rule, if you will exhaust your error budget in 24 hours, the time to page someone to fix an issue is now.

Measuring the rate of failure isn't always straightforward. A series of small errors might look terrifying in the moment but turn out to be short-lived and have an inconsequential impact on your SLO. Similarly, if a system is slightly broken for a long time, these errors can add up to an SLO violation.

Ideally, your team will react to these signals so that you spend almost all of your error budget (but not exceed it) for a given time period. If you spend too much, you violate your SLO. If you spend too little, you're not taking enough risk or possibly burning out your on-call team.

You need a way to determine when a system is broken enough that a human should intervene. The following sections discuss some approaches to that question.

Fast burns

One type of SLO burn is a fast SLO burn because it burns through your error budget quickly and demands that you intervene to avoid an SLO violation.

Suppose your service operates normally at 1000 queries per second (QPS), and you want to maintain 99% availability as measured over a seven-day week. Your error budget is about 6 million allowable errors (out of about 600 million requests). If you have 24 hours before your error budget is exhausted, for example, that gives you a limit of about 70 errors per second, or 252,000 errors in one hour. These parameters are based on the general rule that pageable incidents should consume at least 1% of the quarterly error budget.
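
As a quick check of that arithmetic, the following sketch derives the fast-burn paging threshold from the example numbers (plain arithmetic, not a specific monitoring API):

# Sketch: fast-burn alert threshold for 1000 QPS at 99% availability
# measured over a seven-day window.

qps = 1000
window_seconds = 7 * 24 * 3600
total_requests = qps * window_seconds          # about 600 million requests per week
error_budget = total_requests * (1 - 0.99)     # about 6 million allowable errors

# If the whole budget could be burned in 24 hours, the corresponding
# error rate is the threshold for paging.
errors_per_second_to_page = error_budget / (24 * 3600)
errors_per_hour = errors_per_second_to_page * 3600

print(f"{errors_per_second_to_page:.0f} errors/s, {errors_per_hour:,.0f} errors/hour")
# about 70 errors per second, or about 252,000 errors in one hour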

You can choose to detect this rate of errors before that one hour has elapsed. For example, after observing 15 minutes of a 70-error-per-second rate, you might decide to page the on-call engineer.


Ideally, the problem is solved before you expend one hour of your 24-hour budget. Choosing to detect this rate in a shorter window (for example, one minute) is likely to be too error-prone. If your target time to detect is shorter than 15 minutes, this number can be adjusted.

Slow burns

Another type of burn rate is a slow burn. Suppose you introduce a bug that burns through your weekly error budget by day five or six, or through your monthly budget by week two. What is the best response?

In this case, you might introduce a slow SLO burn alert that lets you know you're on course to consume your entire error budget before the end of the alerting window. Of course, that alert might return many false positives. For example, there might often be a condition where errors occur briefly but at a rate that would quickly consume your error budget. In these cases, the condition is a false positive because it lasts only a short time and does not threaten your error budget in the long term. Remember, the goal is not to eliminate all sources of error; it is to stay within the acceptable range to not exceed your error budget. You want to avoid alerting a human to intervene for events that are not legitimately threatening your error budget.

We recommend that you notify a ticket queue (as opposed to paging or emailing) for slow-burn events. Slow-burn events are not emergencies but do require human attention before the budget expires. These alerts shouldn't be emails to a team list, which quickly become a nuisance to be ignored. Tickets should be trackable, assignable, and transferrable. Teams should develop reports for ticket load, closure rates, actionability, and duplicates. Excessive, unactionable tickets are a great example of toil.

Using SLO alerts skillfully can take time and depend on your team's culture and expectations. Remember that you can fine-tune your SLO alerts over time. You can also have multiple alert methods, with varying alert windows, depending on your needs.

Latency alerts

In addition to availability alerts, you can also have latency alerts. With latency SLOs, you're measuring the percent of requests that are not meeting a latency target. By using this model, you can use the same alerting model that you use to detect fast or slow burns of your error budget.

As noted earlier about median latency SLOs, fully half your requests can be out of SLO. In other words, your users can suffer bad latency for days before you detect the impact on your long-term error budget. Instead, services should define tail latency objectives and typical latency objectives. We suggest using the historical 90th percentile to define typical and the 99th percentile for tail. After you set these targets, you can define SLOs based on the number of requests you expect to land in each latency category and how many are too slow. This approach is the same concept as an error budget and should be treated the same. Thus, you might end up with a statement like "90% of requests will be handled within typical latency and 99.9% within tail latency targets." These targets ensure that most users experience your typical latency and still let you track how many requests are slower than your tail latency targets.

Some services might have widely varying expected runtimes. For example, you might have dramatically different performance expectations for reading from a datastore system versus writing to it. Instead of enumerating every possible expectation, you can introduce runtime performance buckets, as the following tables show. This approach presumes that these types of requests are identifiable and pre-categorized into each bucket. You shouldn't expect to categorize requests on the fly.

User-facing website

Bucket          Expected maximum runtime
Read            1 second
Write / update  3 seconds

Data processing systems

Bucket          Expected maximum runtime
Small           10 seconds
Medium          1 minute
Large           5 minutes
Giant           1 hour
Enormous        8 hours

By measuring the system as it is today, you can understand how long these requests typically take to run. As an example, consider a system for processing video uploads. If a video is very long, the processing is expected to take longer. We can use the length of the video in seconds to categorize this work into a bucket, as the following table shows. The table records the number of requests per bucket as well as various percentiles for runtime distribution over the course of a week.

Video length  Number of requests measured in one week  10%               90%          99.95%
Small         0                                        -                 -            -
Medium        1.9 million                              864 milliseconds  17 seconds   86 seconds
Large         25 million                               1.8 seconds       52 seconds   9.6 minutes
Giant         4.3 million                              2 seconds         43 seconds   23.8 minutes
Enormous      81,000                                   36 seconds        1.2 minutes  41 minutes

From such analysis, you can derive a few parameters for alerting:

  • fast_typical: At most, 10% of requests are faster than this time. If too many requests are faster than this time, your targets might be wrong, or something about your system might have changed.
  • slow_typical: At least 90% of requests are faster than this time. This limit drives your main latency SLO. This parameter indicates whether most of the requests are fast enough.
  • slow_tail: At least 99.95% of requests are faster than this time. This limit ensures that there aren't too many slow requests.
  • deadline: The point at which a user RPC or background processing times out and fails (a limit typically already hard-coded into the system). These requests aren't counted as slow; they fail with an error and count against your availability SLO instead.

A guideline in defining buckets is to keep a bucket's fast_typical, slow_typical, and slow_tail within an order of magnitude of each other. This guideline ensures that you don't have too broad of a bucket. We recommend that you don't attempt to prevent overlap or gaps between the buckets.

Bucket    fast_typical      slow_typical  slow_tail                deadline
Small     100 milliseconds  1 second      10 seconds               30 seconds
Medium    600 milliseconds  6 seconds     60 seconds (1 minute)    300 seconds
Large     3 seconds         30 seconds    300 seconds (5 minutes)  10 minutes
Giant     30 seconds        6 minutes     60 minutes (1 hour)      3 hours
Enormous  5 minutes         50 minutes    500 minutes (8 hours)    12 hours

This results in a rule like api.method: SMALL => [1s, 10s]. In this case, the SLO tracking system sees a request, determines its bucket (perhaps by analyzing its method name or URI and comparing the name to a lookup table), and then updates the statistic based on the runtime of that request. If the request took 700 milliseconds, it is within the slow_typical target. If it took 3 seconds, it is within slow_tail. If it took 22 seconds, it is beyond slow_tail, but not yet an error.
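
The following is a minimal sketch of such bucket-based classification. The method-to-bucket lookup is hypothetical; the thresholds (in seconds) follow the preceding table:

# Sketch: classify request runtimes against pre-defined latency buckets.

BUCKETS = {
    "SMALL":  {"slow_typical": 1, "slow_tail": 10, "deadline": 30},
    "MEDIUM": {"slow_typical": 6, "slow_tail": 60, "deadline": 300},
}

METHOD_TO_BUCKET = {"api.get_item": "SMALL", "api.update_item": "MEDIUM"}

def classify(method, runtime_seconds):
    b = BUCKETS[METHOD_TO_BUCKET[method]]
    if runtime_seconds >= b["deadline"]:
        return "error"         # counts against the availability SLO instead
    if runtime_seconds <= b["slow_typical"]:
        return "within slow_typical"
    if runtime_seconds <= b["slow_tail"]:
        return "within slow_tail"
    return "beyond slow_tail"  # too slow, but not yet an error

print(classify("api.get_item", 0.7))   # within slow_typical
print(classify("api.get_item", 3))     # within slow_tail
print(classify("api.get_item", 22))    # beyond slow_tail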

In terms of user happiness, you can think of missing tail latency as equivalent to being unavailable. (That is, the response is so slow that it should be considered a failure.) Due to this, we suggest using the same percentage that you use for availability, for example:

99.95% of all requests are satisfied within 10 seconds.

What you consider typical latency is up to you. Some teams within Google consider 90% to be a good target. This is related to your analysis and how you chose durations for slow_typical. For example:

90% of all requests are handled within 1 second.

Suggested alerts

Given these guidelines, the following table includes a suggested baseline set of SLO alerts.

SLOs: Availability (fast burn); typical latency; tail latency
Measurement window: 1-hour window
Burn rate: Less than 24 hours to SLO violation
Action: Page someone

SLOs: Availability (slow burn); typical latency (slow burn); tail latency (slow burn)
Measurement window: 7-day window
Burn rate: Greater than 24 hours to SLO violation
Action: Create a ticket

SLO alerting is a skill that can take time to develop. The durations in this section are suggestions; you can adjust these according to your own needs and level of precision. Tying your alerts to the measurement window or error budget expenditure might be helpful, or you might add another layer of alerting between fast burns and slow burns.

Build observability into your infrastructure and applications

This document in the Google Cloud Architecture Framework provides best practices to add observability into your services so that you can better understand your service performance and quickly identify issues. Observability includes monitoring, logging, tracing, profiling, debugging, and similar systems.

Monitoring is at the base of the service reliability hierarchy in the Google SRE Handbook. Without proper monitoring, you can't tell whether an application works correctly.

Instrument your code to maximize observability

A well-designed system aims to have the right amount of observability that starts in its development phase. Don't wait until an application is in production before you start to observe it. Instrument your code and consider the following guidance:

  • To debug and troubleshoot efficiently, think about what log and trace entries to write out, and what metrics to monitor and export. Prioritize by the most likely or frequent failure modes of the system.
  • Periodically audit and prune your monitoring. Delete unused or useless dashboards, graphs, alerts, tracing, and logging to eliminate clutter.

Google Cloud Observability provides real-time monitoring, hybrid multi-cloud monitoring and logging (such as for AWS and Azure), plus tracing, profiling, and debugging. Google Cloud Observability can also auto-discover and monitor microservices running on App Engine or in a service mesh like Istio.

If you generate lots of application data, you can optimize large-scale ingestion of analytics events logs with BigQuery. BigQuery is also suitable for persisting and analyzing high-cardinality timeseries data from your monitoring framework. This approach is useful because it lets you run arbitrary queries at a lower cost rather than trying to design your monitoring perfectly from the start, and decouples reporting from monitoring. You can create reports from the data using Looker Studio or Looker.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:

  • Implement monitoring early, such as before you initiate a migration or before you deploy a new application to a production environment.
  • Disambiguate between application issues and underlying cloud issues by using the Monitoring API or other Cloud Monitoring products, and by checking the Google Cloud Status Dashboard.
  • Define an observability strategy beyond monitoring that includes tracing, profiling, and debugging.
  • Regularly clean up observability artifacts that you don't use or that don't provide value, such as unactionable alerts.
  • If you generate large amounts of observability data, send application events to a data warehouse system such as BigQuery.

What's next

Explore other categories in the Architecture Framework such as system design, operational excellence, security, privacy, and compliance.

Design for scale and high availability

This document in the Google Cloud Architecture Framework provides design principles to architect your services so that they can tolerate failures and scale in response to customer demand. A reliable service continues to respond to customer requests when there's a high demand on the service or when there's a maintenance event. The following reliability design principles and best practices should be part of your system architecture and deployment plan.

Create redundancy for higher availability

Systems with high reliability needs must have no single points of failure, and their resources must be replicated across multiple failure domains. A failure domain is a pool of resources that can fail independently, such as a VM instance, zone, or region. When you replicate across failure domains, you get a higher aggregate level of availability than individual instances could achieve. For more information, see Regions and zones.

As a specific example of redundancy that might be part of your system architecture, to isolate DNS registration failures to individual zones, use zonal DNS names for instances on the same network to access each other.

Design a multi-zone architecture with failover for high availability

Make your application resilient to zonal failures by architecting it to use pools of resources distributed across multiple zones, with data replication, load balancing and automated failover between zones. Run zonal replicas of every layer of the application stack, and eliminate all cross-zone dependencies in the architecture.

Replicate data across regions for disaster recovery

Replicate or archive data to a remote region to enable disaster recovery in the event of a regional outage or data loss. When replication is used, recovery is quicker because storage systems in the remote region already have data that is almost up to date, aside from the possible loss of a small amount of data due to replication delay. When you use periodic archiving instead of continuous replication, disaster recovery involves restoring data from backups or archives in a new region. This procedure usually results in longer service downtime than activating a continuously updated database replica and could involve more data loss due to the time gap between consecutive backup operations. Whichever approach is used, the entire application stack must be redeployed and started up in the new region, and the service will be unavailable while this is happening.

For a detailed discussion of disaster recovery concepts and techniques, see Architecting disaster recovery for cloud infrastructure outages.

Design a multi-region architecture for resilience to regional outages

If your service needs to run continuously even in the rare case when an entire region fails, design it to use pools of compute resources distributed across different regions. Run regional replicas of every layer of the application stack.

Use data replication across regions and automatic failover when a region goes down. Some Google Cloud services have multi-regional variants, such as Spanner. To be resilient against regional failures, use these multi-regional services in your design where possible. For more information on regions and service availability, see Google Cloud locations.

Make sure that there are no cross-region dependencies so that the breadth of impact of a region-level failure is limited to that region.

Eliminate regional single points of failure, such as a single-region primary database that might cause a global outage when it is unreachable. Note that multi-region architectures often cost more, so consider the business need versus the cost before you adopt this approach.

For further guidance on implementing redundancy across failure domains, see the survey paper Deployment Archetypes for Cloud Applications (PDF).

Eliminate scalability bottlenecks

Identify system components that can't grow beyond the resource limits of a single VM or a single zone. Some applications scale vertically, where you add more CPU cores, memory, or network bandwidth on a single VM instance to handle the increase in load. These applications have hard limits on their scalability, and you must often manually configure them to handle growth.

If possible, redesign these components to scale horizontally, such as with sharding (partitioning) across VMs or zones. To handle growth in traffic or usage, you add more shards. Use standard VM types that can be added automatically to handle increases in per-shard load. For more information, see Patterns for scalable and resilient apps.

If you can't redesign the application, you can replace components managed by you with fully managed cloud services that are designed to scale horizontally with no user action.

Degrade service levels gracefully when overloaded

Design your services to tolerate overload. Services should detect overload and return lower quality responses to the user or partially drop traffic, not fail completely under overload.

For example, a service can respond to user requests with static web pages and temporarily disable dynamic behavior that's more expensive to process. This behavior is detailed in the warm failover pattern from Compute Engine to Cloud Storage. Or, the service can allow read-only operations and temporarily disable data updates.

Operators should be notified to correct the error condition when a service degrades.

Prevent and mitigate traffic spikes

Don't synchronize requests across clients. Too many clients that send traffic at the same instant causes traffic spikes that might cause cascading failures.

Implement spike mitigation strategies on the server side such as throttling, queueing, load shedding or circuit breaking, graceful degradation, and prioritizing critical requests.

Mitigation strategies on the client include client-side throttling and exponential backoff with jitter.
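
For example, here is a minimal sketch of client-side exponential backoff with full jitter. The request function and retryable error type are placeholders, not a specific Google Cloud client library:

# Sketch: retry a request with exponential backoff and full jitter.

import random
import time

class TransientError(Exception):
    """Placeholder for an error type that is safe to retry."""

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=32.0):
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the exponential cap.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))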

Sanitize and validate inputs

To prevent erroneous, random, or malicious inputs that cause service outages or security breaches, sanitize and validate input parameters for APIs and operational tools. For example, Apigee and Google Cloud Armor can help protect against injection attacks.

Regularly use fuzz testing where a test harness intentionally calls APIs with random, empty, or too-large inputs. Conduct these tests in an isolated test environment.

Operational tools should automatically validate configuration changes before the changes roll out, and should reject changes if validation fails.

Fail safe in a way that preserves function

If there's a failure due to a problem, the system components should fail in a way that allows the overall system to continue to function. These problems might be a software bug, bad input or configuration, an unplanned instance outage, or human error. What your service processes helps determine whether you should err on the side of being overly permissive or overly simplistic, rather than overly restrictive.

Consider the following example scenarios and how to respond to failure:

  • It's usually better for a firewall component with a bad or empty configuration to fail open and allow unauthorized network traffic to pass through for a short period of time while the operator fixes the error. This behavior keeps the service available, rather than failing closed and blocking 100% of traffic. The service must rely on authentication and authorization checks deeper in the application stack to protect sensitive areas while all traffic passes through.
  • However, it's better for a permissions server component that controls access to user data to fail closed and block all access. This behavior causes a service outage when the configuration is corrupt, but it avoids the risk of a leak of confidential user data if the component fails open.

In both cases, the failure should raise a high priority alert so that an operator can fix the error condition. Service components should err on the side of failing open unless it poses extreme risks to the business.

Design API calls and operational commands to be retryable

APIs and operational tools must make invocations retry-safe as far as possible. A natural approach to many error conditions is to retry the previous action, but you might not know whether the first try was successful.

Your system architecture should make actions idempotent: if you perform the identical action on an object two or more times in succession, it should produce the same results as a single invocation. Non-idempotent actions require more complex code to avoid corruption of the system state.
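
One common way to make a mutating action retry-safe is a caller-supplied request ID (idempotency key). The following is a minimal sketch with hypothetical names and an in-memory dict standing in for a durable datastore:

# Sketch: make a mutating operation idempotent with a request ID, so
# that retries don't apply the change twice.

_completed_requests = {}  # request_id -> previous result (use a durable store in practice)

def apply_change(request_id, account_id, amount, balances):
    """Debit an account exactly once per request_id, even if retried."""
    if request_id in _completed_requests:
        return _completed_requests[request_id]  # replay the earlier result

    balances[account_id] -= amount
    result = {"account_id": account_id, "new_balance": balances[account_id]}
    _completed_requests[request_id] = result
    return result

balances = {"acct-1": 100}
print(apply_change("req-42", "acct-1", 30, balances))  # new_balance: 70
print(apply_change("req-42", "acct-1", 30, balances))  # retried: same result, no second debit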

Identify and manage service dependencies

Service designers and owners must maintain a complete list of dependencies on other system components. The service design must also include recovery from dependency failures, or graceful degradation if full recovery is not feasible. Take account of dependencies on cloud services used by your system and external dependencies, such as third party service APIs, recognizing that every system dependency has a non-zero failure rate.

When you set reliability targets, recognize that the SLO for a service is mathematically constrained by the SLOs of all its critical dependencies. You can't be more reliable than the lowest SLO of one of the dependencies. For more information, see the calculus of service availability.

Startup dependencies

Services behave differently when they start up compared to their steady-state behavior. Startup dependencies can differ significantly from steady-state runtime dependencies.

For example, at startup, a service may need to load user or account information from a user metadata service that it rarely invokes again. When many service replicas restart after a crash or routine maintenance, the replicas can sharply increase load on startup dependencies, especially when caches are empty and need to be repopulated.

Test service startup under load, and provision startup dependencies accordingly. Consider a design that degrades gracefully by saving a copy of the data that the service retrieves from critical startup dependencies. This behavior allows your service to restart with potentially stale data rather than being unable to start when a critical dependency has an outage. Your service can later load fresh data, when feasible, to revert to normal operation.

Startup dependencies are also important when you bootstrap a service in a new environment. Design your application stack with a layered architecture, with no cyclic dependencies between layers. Cyclic dependencies may seem tolerable because they don't block incremental changes to a single application. However, cyclic dependencies can make it difficult or impossible to restart after a disaster takes down the entire service stack.

Minimize critical dependencies

Minimize the number of critical dependencies for your service, that is, other components whose failure will inevitably cause outages for your service. To make your service more resilient to failures or slowness in other components it depends on, consider the following example design techniques and principles to convert critical dependencies into non-critical dependencies:

  • Increase the level of redundancy in critical dependencies. Adding more replicas makes it less likely that an entire component will be unavailable.
  • Use asynchronous requests to other services instead of blocking on a response or use publish/subscribe messaging to decouple requests from responses.
  • Cache responses from other services to recover from short-term unavailability of dependencies.

To render failures or slowness in your service less harmful to other components that depend on it, consider the following example design techniques and principles:

  • Use prioritized request queues and give higher priority to requests where a user is waiting for a response.
  • Serve responses out of a cache to reduce latency and load.
  • Fail safe in a way that preserves function.
  • Degrade gracefully when there's a traffic overload.

Ensure that every change can be rolled back

If there's no well-defined way to undo certain types of changes to a service, change the design of the service to support rollback. Test the rollback processes periodically. APIs for every component or microservice must be versioned, with backward compatibility such that the previous generations of clients continue to work correctly as the API evolves. This design principle is essential to permit progressive rollout of API changes, with rapid rollback when necessary.

Rollback can be costly to implement for mobile applications. Firebase Remote Config is a Google Cloud service to make feature rollback easier.

You can't readily roll back database schema changes, so execute them in multiple phases. Design each phase to allow safe schema read and update requests by the latest version of your application, and the prior version. This design approach lets you safely roll back if there's a problem with the latest version.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:

  • Implement exponential backoff with randomization in the error retry logic of client applications.
  • Implement a multi-region architecture with automatic failover for high availability.
  • Use load balancing to distribute user requests across shards and regions.
  • Design the application to degrade gracefully under overload. Serve partial responses or provide limited functionality rather than failing completely.
  • Establish a data-driven process for capacity planning, and use load tests and traffic forecasts to determine when to provision resources.
  • Establish disaster recovery procedures and test them periodically.

What's next

Explore other categories in the Architecture Framework such as system design, operational excellence, and security, privacy, and compliance.

Create reliable operational processes and tools

This document in the Google Cloud Architecture Framework provides operational principles to run your service in a reliable manner, such as how to deploy updates, run services in production environments, and test for failures. Architecting for reliability should cover the whole lifecycle of your service, not just software design.

Choose good names for applications and services

Avoid using internal code names in production configuration files, because they can be confusing, particularly to newer employees, potentially increasing time to mitigate (TTM) for outages. As much as possible, choose good names for all of your applications, services, and critical system resources such as VMs, clusters, and database instances, subject to their respective limits on name length. A good name describes the entity's purpose; is accurate, specific, and distinctive; and is meaningful to anybody who sees it. A good name avoids acronyms, code names, abbreviations, and potentially offensive terminology, and would not create a negative public response even if published externally.

Implement progressive rollouts with canary testing

Instantaneous global changes to service binaries or configuration are inherently risky. Roll out new versions of executables and configuration changes incrementally. Start with a small scope, such as a few VM instances in a zone, and gradually expand the scope. Roll back rapidly if the change doesn't perform as you expect, or negatively impacts users at any stage of the rollout. Your goal is to identify and address bugs when they only affect a small portion of user traffic, before you roll out the change globally.

Set up a canary testing system that's aware of service changes and does A/B comparison of the metrics of the changed servers with the remaining servers. The system should flag unexpected or anomalous behavior. If the change doesn't perform as you expect, the canary testing system should automatically halt rollouts. Problems can be clear, such as user errors, or subtle, like CPU usage increase or memory bloat.

It's better to stop and roll back at the first hint of trouble and diagnose issues without the time pressure of an outage. After the change passes canary testing, propagate it to larger scopes gradually, such as to a full zone, then to a second zone. Allow time for the changed system to handle progressively larger volumes of user traffic to expose any latent bugs.

For more information, see Application deployment and testing strategies.

Spread out traffic for timed promotions and launches

You might have promotional events, such as sales that start at a precise time and encourage many users to connect to the service simultaneously. If so, design client code to spread the traffic over a few seconds, using random delays before clients initiate requests.

You can also pre-warm the system. When you pre-warm the system, you send the user traffic you anticipate to it ahead of time to ensure it performs as you expect. This approach prevents instantaneous traffic spikes that could crash your servers at the scheduled start time.

Automate build, test, and deployment

Eliminate manual effort from your release process with the use of continuous integration and continuous delivery (CI/CD) pipelines. Perform automated integration testing and deployment. For example, create a modern CI/CD process with GKE.

For more information, see continuous integration, continuous delivery, test automation, and deployment automation.

Defend against operator error

Design your operational tools to reject potentially invalid configurations. Detect and alert when a configuration version is empty, partial or truncated, corrupt, logically incorrect or unexpected, or not received within the expected time. Tools should also reject configuration versions that differ too much from the previous version.

Disallow changes or commands with too broad a scope that are potentially destructive. These broad commands might be to "Revoke permissions for all users", "Restart all VMs in this region", or "Reformat all disks in this zone". Such changes should only be applied if the operator adds emergency override command-line flags or option settings when they deploy the configuration.

Tools must display the breadth of impact of risky commands, such as number of VMs the change impacts, and require explicit operator acknowledgment before the tool proceeds. You can also use features to lock critical resources and prevent their accidental or unauthorized deletion, such as Cloud Storage retention policy locks.

Test failure recovery

Regularly test your operational procedures to recover from failures in your service. Without regular tests, your procedures might not work when you need them if there's a real failure. Items to test periodically include regional failover, how to roll back a release, and how to restore data from backups.

Conduct disaster recovery tests

Like with failure recovery tests, don't wait for a disaster to strike. Periodically test and verify your disaster recovery procedures and processes.

You might create a system architecture to provide high availability (HA). This architecture doesn't entirely overlap with disaster recovery (DR), but it's often necessary to take HA into account when you think about recovery time objective (RTO) and recovery point objective (RPO) values.

HA helps you to meet or exceed an agreed level of operational performance, such as uptime. When you run production workloads on Google Cloud, you might deploy a passive or active standby instance in a second region. With this architecture, the application continues to provide service from the unaffected region if there's a disaster in the primary region. For more information, see Architecting disaster recovery for cloud outages.

Practice chaos engineering

Consider the use of chaos engineering in your test practices. Introduce actual failures into different components of production systems under load in a safe environment. This approach helps to ensure that there's no overall system impact because your service handles failures correctly at each level.

Failures you inject into the system can include crashing tasks, errors and timeouts on RPCs, or reductions in resource availability. Use random fault injection to test intermittent failures (flapping) in service dependencies. These behaviors are hard to detect and mitigate in production.

Chaos engineering ensures that the fallout from such experiments is minimized and contained. Treat such tests as practice for actual outages and use all of the information collected to improve your outage response.

What's next

Explore other categories in the Architecture Framework such as system design, operational excellence, and security, privacy, and compliance.

Build efficient alerts

This document in the Google Cloud Architecture Framework provides operational principles to create alerts that help you run reliable services. The more information you have on how your service performs, the more informed your decisions are when there's an issue. Design your alerts for early and accurate detection of all user-impacting system problems, and minimize false positives.

Optimize the alert delay

There's a balance between alerts that are sent too soon, which stress the operations team, and alerts that are sent too late, which cause long service outages. Tune the delay before the monitoring system notifies humans of a problem so that you minimize time to detect while maximizing signal versus noise. Use the error budget consumption rate to derive the optimal alert configuration.
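
The following minimal sketch in Python shows one way to turn the error budget consumption rate (burn rate) into a paging decision. The 99.9% SLO, the two evaluation windows, and the 14.4x paging threshold (which consumes roughly 2% of a 30-day budget in one hour) are illustrative assumptions for the example, not values that Google Cloud prescribes.

    SLO_TARGET = 0.999                 # 99.9% availability SLO (assumed)
    ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% of requests may fail

    def burn_rate(bad_requests: int, total_requests: int) -> float:
        """Ratio of the observed error rate to the error rate the budget allows."""
        if total_requests == 0:
            return 0.0
        return (bad_requests / total_requests) / ERROR_BUDGET

    def should_page(short_window_burn: float, long_window_burn: float) -> bool:
        """Page only when both a short and a long window burn fast.

        The long window shows that significant budget has been consumed; the
        short window confirms the problem is still happening, which reduces
        false positives while keeping the alert delay low.
        """
        return short_window_burn >= 14.4 and long_window_burn >= 14.4

    # Example: 300 failures out of 20,000 requests in the last 5 minutes, and
    # 3,000 failures out of 200,000 requests in the last hour.
    print(should_page(burn_rate(300, 20_000), burn_rate(3_000, 200_000)))  # True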

Alert on symptoms rather than causes

Trigger alerts based on the direct impact to user experience. Noncompliance with global or per-customer SLOs indicates a direct impact. Don't alert on every possible root cause of a failure, especially when the impact is limited to a single replica. A well-designed distributed system recovers seamlessly from single-replica failures.

Alert on outlier values rather than averages

When monitoring latency, define SLOs and set alerts on high-percentile latency, for example two of the 90th, 95th, and 99th percentiles, not on average or 50th percentile latency. Good mean or median latency values can hide unacceptably high values at the 90th percentile or above that cause a very poor user experience. Therefore, apply this principle of alerting on outlier values when you monitor latency for any critical operation, such as a request-response interaction with a web server, batch completion in a data processing pipeline, or a read or write operation on a storage service.
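
The following minimal sketch uses hypothetical latency samples and only the Python standard library to show how a healthy-looking mean can hide an unacceptable tail.

    import statistics

    def percentile(samples, p):
        """Return the p-th percentile (0-100) of samples using nearest rank."""
        ordered = sorted(samples)
        rank = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[rank]

    # 95 fast requests and 5 very slow ones, in milliseconds.
    latencies = [40.0] * 95 + [2500.0] * 5

    print(f"mean: {statistics.mean(latencies):.0f} ms")   # ~163 ms, looks healthy
    print(f"p50:  {percentile(latencies, 50):.0f} ms")    # 40 ms
    print(f"p99:  {percentile(latencies, 99):.0f} ms")    # 2500 ms, unacceptable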

Build a collaborative incident management process

This document in the Google Cloud Architecture Framework provides best practices to manage services and define processes to respond to incidents. Incidents occur in all services, so you need a well-documented process to efficiently respond to these issues and mitigate them.

Incident management overview

It's inevitable that your well-designed system will eventually fail to meet its SLOs. In the absence of an SLO, your customers implicitly define the acceptable service level based on their past experience, and they escalate to your technical support or a similar group regardless of what's in your SLA.

To properly serve your customers, establish and regularly test an incident management plan. The plan can be as short as a single-page checklist with ten items. This process helps your team to reduce time to detect (TTD) and time to mitigate (TTM).

TTM is preferred over time to repair (TTR), because the R, for repair or recovery, is often used to mean a full fix rather than mitigation. TTM emphasizes fast mitigation to quickly end the customer impact of an outage, followed by the often much longer process of fully fixing the problem.

A well-designed system where operations are excellent increases the time between failures (TBF). In other words, operational principles for reliability, including good incident management, aim to make failures less frequent.

To run reliable services, apply the following best practices in your incident management process.

Assign clear service ownership

All services and their critical dependencies must have clear owners responsible for adherence to their SLOs. If there are reorganizations or team attrition, engineering leads must ensure that ownership is explicitly handed off to a new team, along with the documentation and training as required. The owners of a service must be easily discoverable by other teams.

Reduce time to detect (TTD) with well tuned alerts

Before you can reduce TTD, review and implement the recommendations in the build observability into your infrastructure and applications and define your reliability goals sections of the Architecture Framework reliability category. For example, disambiguate between application issues and underlying cloud issues.

A well-tuned set of SLIs alerts your team at the right time without alert overload. For more information, see the build efficient alerts section of the Architecture Framework reliability category or Tune up your SLI metrics: CRE life lessons.

Reduce time to mitigate (TTM) with incident management plans and training

To reduce TTM, define a documented and well-exercised incident management plan. Have readily available data on what's changed in the environment. Make sure that teams know generic mitigations they can quickly apply to minimize TTM. These mitigation techniques include draining, rolling back changes, upsizing resources, and degrading quality of service.

As discussed in another Architecture Framework reliability category document, create reliable operational processes and tools to support the safe and rapid rollback of changes.

Design dashboard layouts and content to minimize TTM

Organize your service dashboard layout and navigation so that an operator can determine in a minute or two if the service and all of its critical dependencies are running. To quickly pinpoint potential causes of problems, operators must be able to scan all charts on the dashboard to rapidly look for graphs that change significantly at the time of the alert.

The following list of example graphs might be on your dashboard to help troubleshoot issues. Incident responders should be able to glance at them in a single view:

  • Service level indicators, such as successful requests divided by total valid requests
  • Configuration and binary rollouts
  • Requests per second to the system
  • Error responses per second from the system
  • Requests per second from the system to its dependencies
  • Error responses per second to the system from its dependencies

Other common graphs that help with troubleshooting include latency, saturation, request size, response size, query cost, thread pool utilization, and Java virtual machine (JVM) metrics (where applicable). Saturation refers to fullness relative to some limit, such as quota or system memory size. Thread pool utilization helps you look for regressions due to pool exhaustion.

Test the placement of these graphs against a few outage scenarios to ensure that the most important graphs are near the top, and that the order of the graphs matches your standard diagnostic workflow. You can also apply machine learning and statistical anomaly detection to surface the right subset of these graphs.

Document diagnostic procedures and mitigation for known outage scenarios

Write playbooks and link to them from alert notifications. If these documents are accessible from the alert notifications, operators can quickly get the information they need to troubleshoot and mitigate problems.

Use blameless postmortems to learn from outages and prevent recurrences

Establish a blameless postmortem culture and an incident review process. Blameless means that your team evaluates and documents what went wrong in an objective manner, without the need to assign blame.

Mistakes are opportunities to learn, not a cause for criticism. Always aim to make the system more resilient so that it can recover quickly from human error, or even better, detect and prevent human error. Extract as much learning as possible from each postmortem and follow up diligently on each postmortem action item in order to make outages less frequent, thereby increasing TBF.

Incident management plan example

A production issue has been detected, for example through an alert or a page, or has been escalated to me:

  • Should I delegate to someone else?
    • Yes, if you and your team can't resolve the issue.
  • Is this issue a privacy or security breach?
    • If yes, delegate to the privacy or security team.
  • Is this issue an emergency or are SLOs at risk?
    • If in doubt, treat it as an emergency.
  • Should I involve more people?
    • Yes, if it impacts more than X% of customers or if it takes more than Y minutes to resolve. If in doubt, always involve more people, especially within business hours.
  • Define a primary communications channel, such as IRC, Hangouts Chat, or Slack.
  • Delegate previously defined roles, such as the following:
    • Incident commander who is responsible for overall coordination.
    • Communications lead who is responsible for internal and external communications.
    • Operations lead who is responsible to mitigate the issue.
  • Define when the incident is over. This decision might require an acknowledgment from a support representative or other similar teams.
  • Collaborate on the blameless postmortem.
  • Attend a postmortem incident review meeting to discuss and staff action items.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:

What's next

Learn more about how to build a collaborative incident management process with the following resources:

Explore other categories in the Architecture Framework such as system design, operational excellence, and security, privacy, and compliance.

Google Cloud Architecture Framework: Product reliability guides

This section in the Architecture Framework has product-specific best practice guidance for reliability, availability, and consistency of some Google Cloud products. The guides also provide recommendations for minimizing and recovering from failures and for scaling your applications well under load.

The product reliability guides are organized under the following areas:

Compute Engine reliability guide

Compute Engine is a customizable compute service that enables users to create and run virtual machines on demand on Google's infrastructure.

Best practices

Cloud Run reliability guide

Cloud Run is a serverless, managed compute platform for deploying containerized applications. Cloud Run abstracts away all infrastructure management so users can focus on building applications.

Best practices

  • Cloud Run general tips - how to implement a Cloud Run service, start containers quickly, use global variables, and improve container security.
  • Load testing best practices - how to load test Cloud Run services, including addressing concurrency problems before load testing, managing the maximum number of instances, choosing the best region for load testing, and ensuring services scale with load.
  • Instance scaling - how to scale and limit container instances and minimize response time by keeping some instances idle instead of stopping them.
  • Using minimum instances - specify the minimum number of container instances kept ready to serve; when set appropriately high, this setting minimizes average response time by reducing the number of cold starts.
  • Optimizing Java applications for Cloud Run - understand the tradeoffs of some optimizations for Cloud Run services written in Java, and reduce startup time and memory usage.
  • Optimizing Python applications for Cloud Run - optimize the container image by improving efficiency of the WSGI server, and optimize applications by reducing the number of threads and executing startup tasks in parallel.

Cloud Functions reliability guide

Cloud Functions is a scalable, event-driven, serverless platform to help build and connect services. Cloud Functions can be called via HTTP request or triggered based on background events.

Best practices

Google Kubernetes Engine reliability guide

Google Kubernetes Engine (GKE) is a system for operating containerized applications in the cloud, at scale. GKE deploys, manages, and provisions resources for your containerized applications. The GKE environment consists of Compute Engine instances grouped together to form a cluster.

Best practices

  • Best practices for operating containers - how to use logging mechanisms, ensure containers are stateless and immutable, monitor applications, and do liveness and readiness probes.
  • Best practices for building containers - how to package a single application per container, handle process identifiers (PIDs), optimize for the Docker build cache, and build smaller images for faster upload and download times.
  • Best practices for Google Kubernetes Engine networking - use VPC-native clusters for easier scaling, plan IP addresses, scale cluster connectivity, use Google Cloud Armor to block Distributed Denial-of-Service (DDoS) attacks, implement container-native load balancing for lower latency, use the health check functionality of external Application Load Balancers for graceful failover, and use regional clusters to increase the availability of applications in a cluster.
  • Prepare cloud-based Kubernetes applications - learn the best practices to plan for application capacity, grow application horizontally or vertically, set resource limits relative to resource requests for memory versus CPU, make containers lean for faster application startup, and limit Pod disruption by setting a Pod Disruption Budget (PDB). Also, understand how to set up liveness probes and readiness probes for graceful application startup, ensure non-disruptive shutdowns, and implement exponential backoff on retried requests to prevent traffic spikes that overwhelm your application.
  • GKE multi-tenancy best practices - how to design a multi-tenant cluster architecture for high availability and reliability, use Google Kubernetes Engine (GKE) usage metering for per-tenant usage metrics, provide tenant-specific logs, and provide tenant-specific monitoring.

Cloud Storage reliability guide

Cloud Storage is a durable and highly available object repository with advanced security and sharing capabilities. This service is used for storing and accessing data on Google Cloud infrastructure.

Best practices

  • Best practices for Cloud Storage - general best practices for Cloud Storage, including tips to maximize availability and minimize latency of your applications, improve the reliability of upload operations, and improve the performance of large-scale data deletions.
  • Request rate and access distribution guidelines - how to minimize latency and error responses on read, write, and delete operations at very high request rates by understanding how Cloud Storage auto-scaling works.

Firestore reliability guide

Firestore is a NoSQL document database that lets you store, sync, and query data for your mobile and web applications, at global scale.

Best practices

  • Firestore best practices - how to select your database location for increased reliability, minimize performance pitfalls in indexing, improve the performance of read and write operations, reduce latency for change notifications, and design for scale.

Bigtable reliability guide

Bigtable is a fully managed, scalable, NoSQL database for large analytical and operational workloads. It is designed as a sparsely populated table that can scale to billions of rows and thousands of columns, and supports high read and write throughput at low latency.

Best practices

  • Understand Bigtable performance - estimating throughput for Bigtable, how to plan Bigtable capacity by looking at throughput and storage use, how enabling replication affects read and write throughput differently, and how Bigtable optimizes data over time.
  • Bigtable schema design - guidance on designing Bigtable schema, including concepts of key/value store, designing row keys based on planned read requests, handling columns and rows, and special use cases.
  • Bigtable replication overview - how to replicate Bigtable across multiple zones or regions, understand performance implications of replication, and how Bigtable resolves conflicts and handles failovers.
  • About Bigtable backups - how to save a copy of a table's schema and data with Bigtable Backups, which can help you recover from application-level data corruption or from operator errors, such as accidentally deleting a table.

Cloud SQL reliability guide

Cloud SQL is a fully managed relational database service for MySQL, PostgreSQL, and SQL Server. Cloud SQL easily integrates with existing applications and Google Cloud services such as Google Kubernetes Engine and BigQuery.

Best practices

Spanner reliability guide

Spanner is a distributed SQL database management and storage service that provides features such as global transactions, high availability, horizontal scaling, and transactional consistency.

Best practices

  • Spanner backup and restore - key features of Spanner Backup and Restore, comparison of Backup and Restore with Import and Export, implementation details, and how to control access to Spanner resources.
  • Regional and multi-region configurations - description of the two types of instance configurations that Spanner offers: regional configurations and multi-region configurations. The description includes the differences and trade-offs between each configuration.
  • Autoscaling Spanner - introduction to the Autoscaler tool for Spanner (Autoscaler), an open source tool that you can use as a companion tool to Cloud Spanner. This tool lets you automatically increase or reduce the number of nodes or processing units in one or more Spanner instances based on the utilization metrics of each Spanner instance.
  • About point-in-time recovery (PITR) - description of Spanner point-in-time recovery (PITR), a feature that protects against accidental deletion or writes of Spanner data. For example, an operator inadvertently writes data or an application rollout corrupts the database. With PITR, you can seamlessly recover your data from a point in time in the past, up to a maximum of seven days.
  • Spanner best practices - guidance on bulk loading, using Data Manipulation Language (DML), designing schema to avoid hotspots, and SQL best practices.

Filestore reliability guide

Filestore is a managed file storage service for Google Cloud applications, with a filesystem interface and a shared filesystem for data. Filestore offers petabyte-scale online network attached storage (NAS) for Compute Engine and Google Kubernetes Engine instances.

Best practices

  • Filestore performance - performance settings and Compute Engine machine type recommendations, NFS mount options for best performance on Linux client VM instances, and using the fio tool to test performance. Includes recommendations for improved performance across multiple Google Cloud resources.

  • Filestore backups - description of Filestore backups, common use cases, and best practices for creating and using backups.

  • Filestore snapshots - description of Filestore snapshots, common use cases, and best practices for creating and using snapshots.

  • Filestore networking - networking and IP resource requirements needed to use Filestore.

Memorystore reliability guide

Memorystore is a fully managed, in-memory store that provides a managed version of two open source caching solutions: Redis and Memcached. Memorystore is scalable, and automates complex tasks such as provisioning, replication, failover, and patching.

Best practices

  • Redis general best practices - guidance on exporting Redis Database (RDB) backups, resource-intensive operations, and operations requiring connection retry. In addition, information on maintenance, memory management, and setting up Serverless VPC Access connector, as well as private services access connection mode, and monitoring and alerts.
  • Redis memory management best practices - memory management concepts such as instance capacity and Maxmemory configuration, export, scaling, and version upgrade operations, memory management metrics, and how to resolve an out-of-memory condition.
  • Redis exponential backoff - how exponential backoff works, an example algorithm, and how maximum backoff and maximum number of retries work (a minimal sketch of this pattern follows this list).
  • Memcached best practices - how to design applications for cache misses, connect directly to nodes' IP addresses, and use the Memcached Auto Discovery service. Also, guidance on configuring the max-item-size parameter, balancing clusters, and using Cloud Monitoring to monitor essential metrics.
  • Memcached memory management best practices - configuring memory for a Memcached instance, Reserved Memory configuration, when to increase Reserved Memory, and metrics for memory usage.
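
A minimal Python sketch of client-side exponential backoff with jitter, the retry pattern that the Redis guidance above describes. The delay values, retry limit, and the hypothetical redis_client call in the usage comment are illustrative assumptions, not Memorystore requirements.

    import random
    import time

    def call_with_backoff(operation, max_retries=5, base_delay=0.1, max_delay=30.0):
        """Retry a flaky operation, roughly doubling the wait before each attempt."""
        for attempt in range(max_retries + 1):
            try:
                return operation()
            except ConnectionError:
                if attempt == max_retries:
                    raise  # Give up after the maximum number of retries.
                # Exponential delay, capped at max_delay, plus random jitter to
                # avoid synchronized retries from many clients.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(delay + random.uniform(0, base_delay))

    # Example usage with a hypothetical Redis client:
    # value = call_with_backoff(lambda: redis_client.get("session:123"))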

Cloud DNS reliability guide

Cloud DNS is a low-latency domain name system that helps you register, manage, and serve your domains. Cloud DNS scales to large numbers of DNS zones and records, and lets you create and update millions of DNS records through a user interface.

Best practices

  • Cloud DNS best practices - learn how to manage private zones, configure DNS forwarding, and create DNS server policies. Includes guidance on using Cloud DNS in a hybrid environment.

Cloud Load Balancing reliability guide

Cloud Load Balancing is a fully distributed, software-defined, managed service for all your traffic. Cloud Load Balancing also provides seamless autoscaling, Layer 4 and Layer 7 load balancing, and support for features such as IPv6 global load balancing.

Best practices

  • Performance best practices - how to spread load across application instances to deliver optimal performance. Strategies include backend placement in regions closest to traffic, caching, forwarding rule protocol selection, and configuring session affinity.
  • Using load balancing for highly available applications - how to use Cloud Load Balancing with Compute Engine to provide high availability, even during a zonal outage.

Cloud CDN reliability guide

Cloud CDN (Content Delivery Network) is a service that accelerates internet content delivery by using Google's edge network to bring content as close as possible to the user. Cloud CDN helps reduce latency, cost, and load, making it easier to scale services to users.

Best practices

BigQuery reliability guide

BigQuery is Google Cloud's data warehouse platform for storing and analyzing data at scale.

Best practices

  • Introduction to reliability - reliability best practices and introduction to concepts such as availability, durability, and data consistency.
  • Availability and durability - the types of failure domains that can occur in Google Cloud data centers, how BigQuery provides storage redundancy based on data storage location, and why cross-region datasets enhance disaster recovery.
  • Best practices for multi-tenant workloads on BigQuery - common patterns used in multi-tenant data platforms. These patterns include ensuring reliability and isolation for customers of software as a service (SaaS) vendors, important BigQuery quotas and limits for capacity planning, using BigQuery Data Transfer Service to copy relevant datasets into another region, and more.
  • Use Materialized Views - how to use BigQuery Materialized Views for faster queries at lower cost, including querying materialized views, aligning partitions, and understanding smart-tuning (automatic rewriting of queries).

Dataflow reliability guide

Dataflow is a fully managed data processing service that enables fast, simplified development of streaming data pipelines using open source Apache Beam libraries. Dataflow minimizes latency, processing time, and cost through autoscaling and batch processing.

Best practices

Building production-ready data pipelines using Dataflow - a document series on using Dataflow including planning, developing, deploying, and monitoring Dataflow pipelines.

  • Overview - introduction to Dataflow pipelines.
  • Planning - measuring SLOs, understanding the impact of data sources and sinks on pipeline scalability and performance, and taking high availability, disaster recovery, and network performance into account when specifying regions to run your Dataflow jobs.
  • Developing and testing - setting up deployment environments, preventing data loss by using dead letter queues for error handling, and reducing latency and cost by minimizing expensive per-element operations. Also, using batching to reduce performance overhead without overloading external services, unfusing inappropriately fused steps so that the steps are separated for better performance, and running end-to-end tests in preproduction to ensure that the pipeline continues to meet your SLOs and other production requirements.
  • Deploying - continuous integration (CI) and continuous delivery and deployment (CD), with special considerations for deploying new versions of streaming pipelines. Also, an example CI/CD pipeline, and some features for optimizing resource usage. Finally, a discussion of high availability, geographic redundancy, and best practices for pipeline reliability, including regional isolation, use of snapshots, handling job submission errors, and recovering from errors and outages impacting running pipelines.
  • Monitoring - observe service level indicators (SLIs) which are important indicators of pipeline performance, and define and measure service level objectives (SLOs).

Dataproc reliability guide

Dataproc is a fully managed, scalable service for running Apache Hadoop and Spark jobs. With Dataproc, virtual machines can be customized and scaled up and down as needed. Dataproc integrates tightly with Cloud Storage, BigQuery, Bigtable, and other Google Cloud services.

Best practices

  • Dataproc High Availability mode - compare Hadoop High Availability (HA) mode with the default non-HA mode in terms of instance names, Apache ZooKeeper, Hadoop Distributed File System (HDFS), and Yet Another Resource Negotiator (YARN). Also, how to create a high availability cluster.
  • Autoscaling clusters - when to use Dataproc autoscaling, how to create an autoscaling policy, multi-cluster policy usage, reliability best practices for autoscaling configuration, and metrics and logs.
  • Dataproc Enhanced Flexibility Mode (EFM) - examples of using Enhanced Flexibility Mode to minimize job progress delays, advanced configuration such as partitioning and parallelism, and YARN graceful decommissioning on EFM clusters.
  • Graceful decommissioning - using graceful decommissioning to minimize the impact of removing workers from a cluster, how to use this feature with secondary workers, and command examples for graceful decommissioning.
  • Restartable jobs - by using optional settings, you can set jobs to restart on failure to mitigate common types of job failure, including out-of-memory issues and unexpected Compute Engine virtual machine reboots.

Google Cloud Architecture Framework: Cost optimization

This category in the Google Cloud Architecture Framework provides design recommendations and describes best practices to help architects, developers, administrators, and other cloud practitioners optimize the cost of workloads in Google Cloud.

Moving your IT workloads to the cloud can help you to innovate at scale, deliver features faster, and respond to evolving customer needs. To migrate existing workloads or deploy applications built for the cloud, you need a topology that's optimized for security, resilience, operational excellence, cost, and performance.

In the cost optimization category of the Architecture Framework, you learn to do the following:

Adopt and implement FinOps

This document in the Google Cloud Architecture Framework outlines strategies to help you consider the cost impact of your actions and decisions when provisioning and managing resources in Google Cloud. It discusses FinOps, a practice that combines people, processes, and technology to promote financial accountability and the discipline of cost optimization in an organization, regardless of its size or maturity in the cloud.

The guidance in this section is intended for CTOs, CIOs, and executives responsible for controlling their organization's spend in the cloud. The guidance also helps individual cloud operators understand and adopt FinOps.

Every employee in your organization can help reduce the cost of your resources in Google Cloud, regardless of role (analyst, architect, developer, or administrator). In teams that have not had to track infrastructure costs in the past, you might have to educate employees about the need for collective responsibility.

A common model is for a central FinOps team or Cloud Center of Excellence (CCoE) to standardize the process for optimizing cost across all the cloud workloads. This model assumes that the central team has the required knowledge and expertise to identify high-value opportunities to improve efficiency.

Although centralized cost-control might work well in the initial stages of cloud adoption when usage is low, it doesn't scale well when cloud adoption and usage increase. The central team might struggle with scaling, and project teams might not accept decisions made by anyone outside their teams.

We recommend that the central team delegate the decision making for resource optimization to the project teams. The central team can drive broader efforts to encourage the adoption of FinOps across the organization. To enable the individual project teams to practice FinOps, the central team must standardize the process, reporting, and tooling for cost optimization. The central team must work closely with teams that aren't familiar with FinOps practices, and help them consider cost in their decision-making processes. The central team must also act as an intermediary between the finance team and the individual project teams.

The next sections describe the design principles that we recommend your central team promote.

Encourage individual accountability

Any employee who creates and uses cloud resources affects the usage and the cost of those resources. For an organization to succeed at implementing FinOps, the central team must help employees transition from viewing cost as someone else's responsibility, to owning cost as their own individual responsibility. With this transition, employees own and make cost decisions that are appropriate for their workloads, team, and the organization. This ownership extends to implementing data-driven cost-optimization actions.

To encourage accountability for cost, the central team can take the following actions:

  • Educate users about cost-optimization opportunities and techniques.
  • Reward employees who optimize cost, and celebrate success.
  • Make costs visible across the organization.

Make costs visible

For employees to consider cost when provisioning and managing resources in the cloud, they need a complete view of relevant data, as close to real time as possible. Data in reports and dashboards must show the cost and business impact of team members' decisions as the relevant impacts occur. Usage and cost data of other teams can serve as baselines for identifying efficient deployment patterns. This data can help promote a shared understanding of the best ways to use cloud services.

If an organization doesn't encourage and promote sharing cost data, employees might be reluctant to share data. Sometimes, for business reasons, an organization might not permit sharing of raw cost data. Even in these cases, we recommend that you avoid a default policy that restricts access to cost information.

To make costs visible across the organization, the central team can take the following actions:

  • Use a single, well-defined method for calculating the fully loaded costs of cloud resources. For example, the method could consider the total cloud spend adjusted for purchased discounts and shared costs, like the cost of shared databases (a minimal sketch of one such calculation follows this list).
  • Set up dashboards that enable employees to view their cloud spend in near real time.
  • To motivate individuals in the team to own their costs, allow wide visibility of cloud spending across teams.
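
The following minimal sketch, under assumed inputs, shows one way to compute a team's fully loaded cost: list-price spend, minus the team's share of purchased discounts, plus an allocated share of shared costs such as a shared database. The numbers and the allocation fraction are illustrative assumptions, not a prescribed chargeback model.

    def fully_loaded_cost(list_price_spend: float,
                          discount_share: float,
                          shared_costs: float,
                          allocation_fraction: float) -> float:
        """Adjust raw spend for discounts and a proportional share of shared costs."""
        return list_price_spend - discount_share + shared_costs * allocation_fraction

    # Example: $12,000 list-price spend, $1,500 of committed use discounts
    # attributed to this team, and a 20% share of a $5,000 shared database bill.
    print(fully_loaded_cost(12_000, 1_500, 5_000, 0.20))  # 11500.0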

Enable collaborative behavior

Effective cost management for cloud resources requires that teams collaborate to improve their technical and operational processes. A collaborative culture helps teams design cost-effective deployment patterns based on a consistent set of business objectives and factors.

To enable collaborative behavior, the central team can take the following actions:

  • Create a workload-onboarding process that helps ensure cost efficiency in the design stage through peer reviews of proposed architectures by other engineers.
  • Create a cross-team knowledge base of cost-efficient architectural patterns.

Establish a blameless culture

Promote a culture of learning and growth that makes it safe to take risks, make corrections when required, and innovate. Acknowledge that mistakes, sometimes costly ones, can happen at any stage during the IT design and deployment lifecycle, as in any other part of the business.

Rather than blaming and shaming individuals who have overspent or introduced wastage, promote a blameless culture that helps identify the cause of cost overruns and miscalculations. In this environment, team members are more likely to share their views and experience. Mistakes are anonymized and shared across the business to prevent recurrence.

Don't confuse a blameless culture with a lack of accountability. Employees continue to be accountable for the decisions they make and the money they spend. But when mistakes occur, the emphasis is on the learning opportunity to prevent the errors from occurring again.

To establish a blameless culture, the central team can take the following actions:

  • Run blameless postmortems for major cost issues, focusing on the systemic root cause of the issues, rather than the people involved.
  • Celebrate team members who respond to cost overruns and who share lessons learned. Encourage other members in the team to share mistakes, actions taken, and lessons learned.

Focus on business value

While FinOps practices are often focused on cost reduction, the focus for a central team must be on enabling project teams to make decisions that maximize the business value of their cloud resources. It can be tempting to make decisions that reduce cost to a point where the minimum service levels are met. But, such decisions often shift cost to other resources, can lead to higher maintenance cost, and might increase your total cost of ownership. For example, to reduce cost, you might decide to use virtual machines (VMs) instead of a managed service. But, a VM-based solution requires more effort to maintain when compared with a managed service, and so the managed service might offer a higher net business value.

FinOps practices can provide project teams the visibility and insights that they need to make architectural and operational decisions that maximize the business value of their cloud resources.

To help employees focus on business value, the central team can take the following actions:

  • Use managed services and serverless architectures to reduce the total cost of ownership of your compute resources. For more information, see Choose a compute platform.

  • Correlate cloud usage to business-value metrics like cost efficiency, resilience, feature velocity, and innovation that drive cost-optimization decisions. To learn more about business-value metrics, see the Cloud FinOps whitepaper.

  • Implement unit costing for all your applications and services running in the cloud.

What's next

Monitor and control cost

This document in Google Cloud Architecture Framework describes best practices, tools, and techniques to help you track and control the cost of your resources in Google Cloud.

The guidance in this section is intended for users who provision or manage resources in the cloud.

Cost-management focus areas

The cost of your resources in Google Cloud depends on the quantity of resources that you use and the rate at which you're billed for the resources.

To manage the cost of cloud resources, we recommend that you focus on the following areas:

  • Cost visibility
  • Resource optimization
  • Rate optimization

Cost visibility

Track how much you spend and how your resources and services are billed, so that you can analyze the effect of cost on business outcomes. We recommend that you follow the FinOps operating model, which suggests the following actions to make cost information visible across your organization:

  • Allocate: Assign an owner for every cost item.
  • Report: Make cost data available, consumable, and actionable.
  • Forecast: Estimate and track future spend.

Resource optimization

Align the number and size of your cloud resources to the requirements of your workload. Where feasible, consider using managed services or re-architecting your applications. Typically, individual engineering teams have more context than the central FinOps (financial operations) team on opportunities and techniques to optimize resource deployment. We recommend that the FinOps team work with the individual engineering teams to identify resource-optimization opportunities that can be applied across the organization.

Rate optimization

The FinOps team often makes rate optimization decisions centrally. We recommend that the individual engineering teams work with the central FinOps team to take advantage of deep discounts for reservations, committed usage, Spot VMs, flat-rate pricing, and volume and contract discounting.

Design recommendations

This section suggests approaches that you can use to monitor and control costs.

Consolidate billing and resource management

To manage billing and resources in Google Cloud efficiently, we recommend that you use a single billing account for your organization, and use internal chargeback mechanisms to allocate costs. Use multiple billing accounts for loosely structured conglomerates and organizations with entities that don't affect each other. For example, resellers might need distinct accounts for each customer. Using separate billing accounts might also help you meet country-specific tax regulations.

Another recommended best practice is to move all the projects that you manage into your organization. We recommend using Resource Manager to build a resource hierarchy that helps you achieve the following goals:

  • Establish a hierarchy of resource-ownership based on the relationship of each resource to its immediate parent.
  • Control how access policies and cost-allocation tags or labels are attached to and inherited by the resources in your organization.

In addition, we recommend that you allocate the cost of shared services proportionally based on consumption. Review and adjust the cost allocation parameters periodically based on changes in your business goals and priorities.

Track and allocate cost using tags or labels

Tags and labels are two different methods that you can use to annotate your Google Cloud resources. Tags provide more capabilities than labels. For example, you can implement fine-grained control over resources by creating Identity and Access Management (IAM) policies that are conditional based on whether a tag is attached to a supported resource. In addition, the tags that are associated with a resource are inherited by all the child resources in the hierarchy. For more information about the differences between tags and labels, see Tags overview.

If you're building a new framework for cost allocation and tracking, we recommend using tags.

To categorize cost data at the required granularity, establish a tagging or labeling schema that suits your organization's chargeback mechanism and helps you allocate costs appropriately. You can define tags at the organization or project level. You can assign labels at the project level, and define a set of labels that can be applied by default to all the projects.

Define a process to detect and correct tagging and labeling anomalies and unlabeled projects. For example, from Cloud Asset Inventory, you can download an inventory (.csv file) of all the resources in a project and analyze the inventory to identify resources that aren't assigned any tags or labels.
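
The following minimal sketch scans such an exported inventory CSV for resources that have no labels. The column names ("name" and "labels") are assumptions about the export format; adjust them to match the columns in your actual file.

    import csv

    def find_unlabeled_resources(path):
        """Return the names of resources whose labels column is empty."""
        unlabeled = []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if not (row.get("labels") or "").strip():
                    unlabeled.append(row.get("name", "<unknown>"))
        return unlabeled

    # Example usage:
    # for resource in find_unlabeled_resources("asset_inventory.csv"):
    #     print(resource)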

To track the cost of shared resources and services (for example, common datastores, multi-tenant clusters, and support subscriptions), consider using a special tag or label to identify projects that contain shared resources.

Configure billing access control

To control access to Cloud Billing, we recommend that you assign the Billing Account Administrator role to only those users who manage billing contact information. For example, employees in finance, accounting, and operations might need this role.

To avoid a single point of failure for billing support, assign the Billing Account Administrator role to multiple users or to a group. Only users with the Billing Account Administrator role can contact support. For detailed guidance, see Cloud Billing access control examples and Important Roles.

Make the following configurations to manage access to billing:

  • To associate a billing account with a project, members need the Billing Account User role on the billing account and the Project Billing Manager role on the project.
  • To enable teams to manually associate billing accounts with projects, you can assign the Project Billing Manager role at the organization level and the Billing Account User role on the billing account. You can automate the association of billing accounts during project creation by assigning the Project Billing Manager and Billing Account User roles to a service account. We recommend that you restrict the Billing Account Creator role or remove all assignments of this role.
  • To prevent outages caused by unintentional changes to the billing status of a project, you can lock the link between the project and its billing account. For more information, see Secure the link between a project and its billing account.

Configure billing reports

Set up billing reports to provide data for the key metrics that you need to track. We recommend that you track the following metrics:

  • Cost trends
  • Largest spenders (by project and by product)
  • Areas of irregular spending
  • Key organization-wide insights as follows:
    • Anomaly detection
    • Trends over time
    • Trends that occur in a set pattern (for example, month-on-month)
    • Cost comparison and benchmark analysis between internal and external workloads
    • Business case tracking and value realization (for example, cloud costs compared with the cost of similar on-premises resources)
    • Validation that Google Cloud bills are as expected and accurate

Customize and analyze cost reports using BigQuery Billing Export, and visualize cost data using Looker Studio. Assess the trend of actual costs and how much you might spend by using the forecasting tool.

Optimize resource usage and cost

This section recommends best practices to help you optimize the usage and cost of your resources across Google Cloud services.

To prevent overspending, consider configuring default budgets and alerts with high thresholds for all your projects. To help keep within budgets, we recommend that you do the following:

  • Configure budgets and alerts for projects where absolute usage limits are necessary (for example, training or sandbox projects).

  • Define budgets based on the financial budgets that you need to track. For example, if a department has an overall cloud budget, set the scope of the Google Cloud budget to include the specific projects that you need to track.

  • To ensure that budgets are maintained, delegate the responsibility for configuring budgets and alerts to the teams that own the workloads.

To help optimize costs, we also recommend that you do the following:

  • Cap API usage in cases where it has minimal or no business impact. Capping can be useful for sandbox or training projects and for projects with fixed budgets (for example, ad-hoc analytics in BigQuery). Capping doesn't remove all the resources and data from the associated projects.
  • Use quotas to set hard limits that throttle resource deployment. Quotas help you control cost and prevent malicious use or misuse of resources. Quotas are applied at the project level, per resource type and location.
  • View and implement the cost-optimization recommendations in the Recommendation Hub.
  • Purchase committed use discounts (CUD) to save money on resources for workloads with predictable resource needs.

Tools and techniques

The on-demand provisioning and pay-per-use characteristics of the cloud help you to optimize your IT spend. This section describes tools that Google Cloud provides and techniques that you can use to track and control the cost of your resources in the cloud. Before you use these tools and techniques, review the basic Cloud Billing concepts.

Billing reports

Google Cloud provides billing reports within the Google Cloud console to help you view your current and forecasted spend. The billing reports enable you to view cost data on a single page, discover and analyze trends, forecast the end-of-period cost, and take corrective action when necessary.

Billing reports provide the following data:

  • The costs and cost trends for a given period, organized as follows:
    • By billing account
    • By project
    • By product (for example, Compute Engine)
    • By SKU (for example, static IP addresses)
  • The potential costs if discounts or promotional credits were excluded
  • The forecasted spend

Data export to BigQuery

You can export billing reports to BigQuery, and analyze costs using granular and historical views of data, including data that's categorized using labels or tags. You can perform more advanced analyses using BigQuery ML. We recommend that you enable export of billing reports to BigQuery when you create the Cloud Billing account. Your BigQuery dataset contains billing data from the date you set up Cloud Billing export. The dataset doesn't include data for the period before you enabled export.
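
The following minimal sketch queries exported billing data with the BigQuery Python client to list the highest-cost services for a given invoice month. It assumes that the google-cloud-bigquery library is installed and that credentials and a default project come from the environment; the project, dataset, and table name are placeholders that follow the standard gcp_billing_export_v1_<BILLING_ACCOUNT_ID> naming pattern.

    from google.cloud import bigquery

    client = bigquery.Client()  # Uses application default credentials.

    query = """
        SELECT
          service.description AS service_name,
          SUM(cost)
            + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) AS c), 0))
            AS net_cost
        FROM `my-project.billing_dataset.gcp_billing_export_v1_XXXXXX_XXXXXX_XXXXXX`
        WHERE invoice.month = '202311'
        GROUP BY 1
        ORDER BY net_cost DESC
        LIMIT 10
    """

    for row in client.query(query).result():
        print(f"{row.service_name}: {row.net_cost:.2f}")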

To visualize cost data, you can create custom dashboards that integrate with BigQuery (example templates: Looker, Looker Studio).

You can use tags and labels as criteria for filtering the exported billing data. The number of labels included in the billing export is limited: up to 1,000 label maps within a one-hour period are preserved. Labels don't appear in the invoice PDF or CSV. Consider annotating resources by using tags or labels that indicate the business unit, internal chargeback unit, and other relevant metadata.

Billing access control

You can control access to Cloud Billing for specific resources by defining Identity and Access Management (IAM) policies for the resources. To grant or limit access to Cloud Billing, you can set an IAM policy at the organization level, the billing account level, or the project level.

Access control for billing and resource management follows the principle of separation of duties. Each user has only the permissions necessary for their business role. The Organization Administrator and Billing Administrator roles don't have the same permissions.

You can set billing-related permissions at the billing account level and the organization level. The common roles are Billing Account Administrator, Billing Account User, and Billing Account Viewer.

We recommend that you use invoiced billing, or configure a backup payment method. Maintain contact and notification settings for billing and payment.

Budgets, alerts, and quotas

Budgets help you track actual Google Cloud costs against planned spending. When you create a budget, you can configure alert rules to trigger email notifications when the actual or forecasted spend exceeds a defined threshold. You can also use budgets to automate cost-control responses.

Budgets can trigger alerts to inform you about resource usage and cost trends, and prompt you to take cost-optimization actions. However, budgets don't prevent the use or billing of your services when the actual cost reaches or exceeds the budget or threshold. To automatically control cost, you can use budget notifications to programmatically disable Cloud Billing for a project. You can also limit API usage to stop incurring cost after a defined usage threshold.
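
The following minimal sketch of that pattern is a Pub/Sub-triggered Cloud Function (first-generation background function signature) that detaches the billing account from a project when a budget notification reports that the actual cost has exceeded the budget. The project ID is a placeholder, the google-api-python-client dependency and the billing permissions on the function's service account are assumptions, and detaching billing disables all billable services in the project, so treat this as a sketch rather than a drop-in implementation.

    import base64
    import json

    from googleapiclient import discovery

    PROJECT_ID = "my-project"  # Placeholder: the project whose spend you want to cap.

    def stop_billing(event, context):
        """Triggered by a Cloud Billing budget notification on Pub/Sub."""
        notification = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
        if notification["costAmount"] <= notification["budgetAmount"]:
            return  # Still within budget; take no action.

        billing = discovery.build("cloudbilling", "v1", cache_discovery=False)
        # Setting an empty billing account name detaches billing, which stops
        # further charges but also disables billable services in the project.
        billing.projects().updateBillingInfo(
            name=f"projects/{PROJECT_ID}",
            body={"billingAccountName": ""},
        ).execute()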

You can configure alerts for billing accounts and projects. Configure at least one budget for an account.

To prevent provisioning resources beyond a predetermined level or to limit the volume of specific operations, you can set quotas at the resource or API level. The following are examples of how you can use quotas:

  • Control the number of API calls per second.
  • Limit the number of VMs created.
  • Restrict the amount of data queried per day in BigQuery.

Project owners can reduce the amount of quota that can be charged against a quota limit by using the Service Usage API to apply consumer overrides to specific quota limits. For more information, see Creating a consumer quota override.

Workload efficiency improvement

We recommend the following strategies to help make your workloads in Google Cloud cost-efficient:

  • Optimize resource usage by improving product efficiency.
  • Reduce the rate at which you're billed for resources.
  • Control and limit resource usage and spending.

When selecting cost-reduction techniques and Google Cloud features, consider the effort required and the expected savings, as shown in the following graph:

Cost optimization strategies: effort-to-savings map

The following is a summary of the techniques shown in the preceding graph:

  • The following techniques potentially yield high savings with low effort:
    • Committed use discounts
    • Autoscaling
    • BigQuery slots
  • The following techniques potentially yield high savings with moderate-to-high effort:
    • Spot VMs
    • Re-architecting as serverless or containerized applications
    • Re-platforming to use managed services
  • The following techniques potentially yield moderate savings with moderate effort:
    • Custom machine types
    • Cloud Storage lifecycle management
    • Rightsizing
    • Reclaiming idle resources

The techniques explained in the following sections can help you improve the efficiency of your workloads.

Refactoring or re-architecting

You can achieve substantial cost savings by refactoring or re-architecting your workload to use Google Cloud products. For example, moving to serverless services (like Cloud Storage, Cloud Run, BigQuery, and Cloud Functions) that support scaling to zero can help improve efficiency. To assess and compare the cost of these products, you can use the pricing calculator.

Rightsizing

This technique helps you ensure that the scale of your infrastructure matches the intended usage. This strategy is relevant primarily to infrastructure-as-a-service (IaaS) solutions, where you pay for the underlying infrastructure. For example, you've deployed 50 VMs, but the VMs aren't fully utilized, and you determine that the workloads could run effectively on fewer (or smaller) VMs. In this case, you can remove or resize some of the VMs. Google Cloud provides rightsizing recommendations to help you detect opportunities to save money without affecting performance by provisioning smaller VMs. Rightsizing requires less effort if done during the design phase than after deploying resources to production.

Autoscaling

If the products you use support dynamic autoscaling, consider designing the workloads to take advantage of autoscaling to get cost and performance benefits. For example, for compute-intensive workloads, you can use managed instance groups in Compute Engine or containerize the applications and deploy them to a Google Kubernetes Engine cluster.

Active Assist recommendations

Active Assist uses data, intelligence, and machine learning to reduce cloud complexity and administrative effort. Active Assist makes it easy to optimize the security, performance, and cost of your cloud topology. It provides intelligent recommendations for optimizing your costs and usage. You can apply these recommendations for immediate cost savings and greater efficiency.

The following are examples of recommendations provided by Active Assist:

  • Compute Engine resource rightsizing: Resize your VM instances to optimize for cost and performance based on usage. Identify and delete or back up idle VMs and persistent disks to optimize your infrastructure cost.
  • Committed-use discount (CUD): Google Cloud analyzes your historical usage, finds the optimal commitment quantity for your workloads, and provides easy-to-understand, actionable recommendations for cost savings. For more information, see Committed use discount recommender.
  • Unattended projects: Discover unattended projects in your organization, and remove or reclaim them. For more information, see Unattended project recommender.

For a complete list, see Recommenders.

What's next

Optimize cost: Compute, containers, and serverless

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the cost of your virtual machines (VMs), containers, and serverless resources in Google Cloud.

The guidance in this section is intended for architects, developers, and administrators who are responsible for provisioning and managing compute resources for workloads in the cloud.

Compute resources are the most important part of your cloud infrastructure. When you migrate your workloads to Google Cloud, a typical first choice is Compute Engine, which lets you provision and manage VMs efficiently in the cloud. Compute Engine offers a wide range of machine types, and is available globally in all the Google Cloud regions. Compute Engine's predefined and custom machine types let you provision VMs that offer similar compute capacity as your on-premises infrastructure, enabling you to accelerate the migration process. Compute Engine gives you the pricing advantage of paying only for the infrastructure that you use and provides significant savings as you use more compute resources with sustained-use discounts.

In addition to Compute Engine, Google Cloud offers containers and serverless compute services. The serverless approach can be more cost-efficient for new services that aren't always running (for example, APIs, data processing, and event processing).

Along with general recommendations, this document provides guidance to help you optimize the cost of your compute resources when using the following products:

  • Compute Engine
  • Google Kubernetes Engine (GKE)
  • Cloud Run
  • Cloud Functions
  • App Engine

General recommendations

The following recommendations are applicable to all the compute, containers, and serverless services in Google Cloud that are discussed in this section.

Track usage and cost

Use the following tools and techniques to monitor resource usage and cost:

Control resource provisioning

Use the following recommendations to control the quantity of resources provisioned in the cloud and the location where the resources are created:

  • To help ensure that resource consumption and cost don't exceed the forecast, use resource quotas.
  • Provision resources in the lowest-cost region that meets the latency requirements of your workload. To control where resources are provisioned, you can use the organization policy constraint gcp.resourceLocations.

Get discounts for committed use

Committed use discounts (CUDs) are ideal for workloads with predictable resource needs. After migrating your workload to Google Cloud, find the baseline for the resources required, and get deeper discounts for committed usage. For example, purchase a one or three-year commitment, and get a substantial discount on Compute Engine VM pricing.

Automate cost-tracking using labels

Define and assign labels consistently. The following are examples of how you can use labels to automate cost-tracking:

  • For VMs that only developers use during business hours, assign the label env: development. You can use Cloud Scheduler to set up a serverless Cloud Function to shut down these VMs after business hours, and restart them when necessary (see the sketch after this list).

  • For an application that has several Cloud Run services and Cloud Functions instances, assign a consistent label to all the Cloud Run and Cloud Functions resources. Identify the high-cost areas, and take action to reduce cost.
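
A minimal sketch of such a cleanup handler, using the google-cloud-compute client library to stop every running VM that carries the env: development label. The project ID is a placeholder, and wiring the function to a Cloud Scheduler trigger (for example, through Pub/Sub or HTTP) is assumed rather than shown.

    from google.cloud import compute_v1

    PROJECT_ID = "my-project"  # Placeholder project ID.

    def stop_development_vms(request=None):
        """Stop running VMs labeled env=development across all zones."""
        instances_client = compute_v1.InstancesClient()
        for zone_path, scoped_list in instances_client.aggregated_list(project=PROJECT_ID):
            for instance in scoped_list.instances or []:
                if (instance.labels.get("env") == "development"
                        and instance.status == "RUNNING"):
                    instances_client.stop(
                        project=PROJECT_ID,
                        zone=zone_path.rsplit("/", 1)[-1],  # "zones/us-central1-a" -> zone name
                        instance=instance.name,
                    )
        return "done"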

Customize billing reports

Configure your Cloud Billing reports by setting up the required filters and grouping the data as necessary (for example, by projects, services, or labels).

Promote a cost-saving culture

Train your developers and operators on your cloud infrastructure. Create and promote learning programs using traditional or online classes, discussion groups, peer reviews, pair programming, and cost-saving games. As shown in Google's DORA research, organizational culture is a key driver for improving performance, reducing rework and burnout, and optimizing cost. By giving employees visibility into the cost of their resources, you help them align their priorities and activities with business objectives and constraints.

Compute Engine

This section provides guidance to help you optimize the cost of your Compute Engine resources. In addition to this guidance, we recommend that you follow the general recommendations discussed earlier.

Understand the billing model

To learn about the billing options for Compute Engine, see Pricing.

Analyze resource consumption

To help you to understand resource consumption in Compute Engine, export usage data to BigQuery. Query the BigQuery datastore to analyze your project's virtual CPU (vCPU) usage trends, and determine the number of vCPUs that you can reclaim. If you've defined thresholds for the number of cores per project, analyze usage trends to spot anomalies and take corrective actions.

Reclaim idle resources

Use the following recommendations to identify and reclaim unused VMs and disks, such as VMs for proof-of-concept projects that have since been deprioritized:

  • Use the idle VM recommender to identify inactive VMs and persistent disks based on usage metrics.
  • Before deleting resources, assess the potential impact of the action and plan to recreate the resources if that becomes necessary.
  • Before deleting a VM, consider taking a snapshot. When you delete a VM, the attached disks are deleted, unless you've selected the Keep disk option.
  • When feasible, consider stopping VMs instead of deleting them. When you stop a VM, the instance is terminated, but disks and IP addresses are retained until you detach or delete them.

Adjust capacity to match demand

Schedule your VMs to start and stop automatically. For example, if a VM is used only eight hours a day for five days a week (that's 40 hours in the week), you can potentially reduce costs by 75 percent by stopping the VM during the 128 hours in the week when the VM is not used.

Autoscale compute capacity based on demand by using managed instance groups. You can autoscale capacity based on the parameters that matter to your business (for example, CPU usage or load-balancing capacity).

Choose appropriate machine types

Size your VMs to match your workload's compute requirements by using the VM machine type recommender.

For workloads with predictable resource requirements, tailor the machine type to your needs and save money by using custom VMs.

For batch-processing workloads that are fault-tolerant, consider using Spot VMs. High-performance computing (HPC), big data, media transcoding, continuous integration and continuous delivery (CI/CD) pipelines, and stateless web applications are examples of workloads that can be deployed on Spot VMs. For an example of how Descartes Labs reduced their analysis costs by using preemptible VMs (the older version of Spot VMs) to process satellite imagery, see the Descartes Labs case study.

Evaluate licensing options

When you migrate third-party workloads to Google Cloud, you might be able to reduce cost by bringing your own licenses (BYOL). For example, to deploy Microsoft Windows Server VMs, instead of using a premium image that incurs additional cost for the third-party license, you can create and use a custom Windows BYOL image. You then pay only for the VM infrastructure that you use on Google Cloud. This strategy helps you continue to realize value from your existing investments in third-party licenses.

If you decide to use a BYOL approach, we recommend that you do the following:

  • Provision the required number of compute CPU cores independently of memory by using custom machine types, and limit the third-party licensing cost to the number of CPU cores that you need.
  • Reduce the number of vCPUs per core from 2 to 1 by disabling simultaneous multithreading (SMT), and reduce your licensing costs by 50 percent.

If your third-party workloads need dedicated hardware to meet security or compliance requirements, you can bring your own licenses to sole-tenant nodes.

Google Kubernetes Engine

This section provides guidance to help you optimize the cost of your GKE resources.

In addition to the following recommendations, see the general recommendations discussed earlier:

  • Use GKE Autopilot to let GKE maximize the efficiency of your cluster's infrastructure. You don't need to monitor the health of your nodes, handle bin-packing, or calculate the capacity that your workloads need.
  • Fine-tune GKE autoscaling by using Horizontal Pod Autoscaler (HPA), Vertical Pod Autoscaler (VPA), Cluster Autoscaler (CA), or node auto-provisioning based on your workload's requirements.
  • For batch workloads that aren't sensitive to startup latency, use the optimize-utilization autoscaling profile to help improve the utilization of the cluster.
  • Use node auto-provisioning to extend the GKE cluster autoscaler, and efficiently create and delete node pools based on the specifications of pending pods without over-provisioning.
  • Use separate node pools: a static node pool for static load, and dynamic node pools with cluster autoscaling groups for dynamic loads.
  • Use Spot VMs for Kubernetes node pools when your pods are fault-tolerant and can terminate gracefully in less than 25 seconds. Combined with the GKE cluster autoscaler, this strategy helps you ensure that the node pool with lower-cost VMs (in this case, the node pool with Spot VMs) scales first.
  • Choose cost-efficient machine types (for example, E2, N2D, or T2D), which provide 20–40% higher performance for the price.
  • Use GKE usage metering to analyze your clusters' usage profiles by namespaces and labels. Identify the team or application that's spending the most, the environment or component that caused spikes in usage or cost, and the team that's wasting resources.
  • Use resource quotas in multi-tenant clusters to prevent any tenant from using more than its assigned share of cluster resources.
  • Schedule automatic downscaling of development and test environments after business hours.
  • Follow the best practices for running cost-optimized Kubernetes applications on GKE.

Cloud Run

This section provides guidance to help you optimize the cost of your Cloud Run resources.

In addition to the following recommendations, see the general recommendations discussed earlier:

  • Adjust the concurrency setting (default: 80) to reduce cost. Cloud Run determines the number of requests to be sent to an instance based on CPU and memory usage. By increasing the request concurrency, you can reduce the number of instances required.
  • Set a limit for the number of instances that can be deployed.
  • Estimate the number of instances required by using the Billable Instance Time metric. For example, if the metric shows 100s/s, around 100 instances were scheduled. Add a 30% buffer to preserve performance; that is, 130 instances for 100s/s of traffic.
  • To reduce the impact of cold starts, configure a minimum number of instances. When these instances are idle, they are billed at a tenth of the price.
  • Track CPU usage, and adjust the CPU limits accordingly.
  • Use traffic management to determine a cost-optimal configuration.
  • Consider using Cloud CDN or Firebase Hosting for serving static assets.
  • For Cloud Run apps that handle requests globally, consider deploying the app to multiple regions, because cross-continent data transfer can be expensive. This design is recommended if you use a load balancer and CDN.
  • Reduce the startup times for your instances, because the startup time is also billable.
  • Purchase Committed Use Discounts, and save up to 17% off the on-demand pricing for a one-year commitment.

Cloud Functions

This section provides guidance to help you optimize the cost of your Cloud Functions resources.

In addition to the following recommendations, see the general recommendations discussed earlier:

  • Observe the execution time of your functions. Experiment and benchmark to design the smallest function that still meets your required performance threshold.
  • If your Cloud Functions workloads run constantly, consider using GKE or Compute Engine to handle the workloads. Containers or VMs might be lower-cost options for always-running workloads.
  • Limit the number of function instances that can co-exist.
  • Benchmark the runtime performance of the Cloud Functions programming languages against the workload of your function. Programs in compiled languages have longer cold starts, but run faster. Programs in interpreted languages run slower, but have a lower cold-start overhead. Short, simple functions that run frequently might cost less in an interpreted language.
  • Delete temporary files written to the local disk, which is an in-memory file system. Temporary files consume memory that's allocated to your function, and sometimes persist between invocations. If you don't delete these files, an out-of-memory error might occur and trigger a cold start, which increases the execution time and cost.

App Engine

This section provides guidance to help you optimize the cost of your App Engine resources.

In addition to the following recommendations, see the general recommendations discussed earlier:

  • Set maximum instances based on your traffic and request latency. App Engine usually scales capacity based on the traffic that the applications receive. You can control cost by limiting the number of instances that App Engine can create.
  • To limit the memory or CPU available for your application, set an instance class. For CPU-intensive applications, allocate more CPU. Test a few configurations to determine the optimal size.
  • Benchmark your App Engine workload in multiple programming languages. For example, a workload implemented in one language may need fewer instances and lower cost to complete tasks on time than the same workload programmed in another language.
  • Optimize for fewer cold starts. When possible, reduce CPU-intensive or long-running tasks that occur in the global scope. Try to break down the task into smaller operations that can be "lazy loaded" into the context of a request.
  • If you expect bursty traffic, configure a minimum number of idle instances that are pre-warmed. If you are not expecting traffic, you can configure the minimum idle instances to zero.
  • To balance performance and cost, run an A/B test by splitting traffic between two versions, each with a different configuration. Monitor the performance and cost of each version, tune as necessary, and decide the configuration to which traffic should be sent.
  • Configure request concurrency, and set the maximum concurrent requests higher than the default. The more requests each instance can handle concurrently, the more efficiently you can use existing instances to serve traffic.

What's next

Optimize cost: Storage

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the usage and cost of your Cloud Storage, Persistent Disk, and Filestore resources.

The guidance in this section is intended for architects and administrators responsible for provisioning and managing storage for workloads in the cloud.

Cloud Storage

When you plan Cloud Storage for your workloads, consider your requirements for performance, data retention, and access patterns.

Storage class

Choose a storage class that suits the data-retention and access-frequency requirements of your workloads, as recommended in the following list:

  • Data that's accessed frequently (high-throughput analytics or data lakes, websites, streaming videos, and mobile apps): Standard storage.
  • Low-cost storage for infrequently accessed data that can be stored for at least 30 days (for example, backups and long-tail multimedia content): Nearline storage.
  • Infrequently accessed data that can be stored for at least 90 days (for example, data replicas for disaster recovery): Coldline storage.
  • Lowest-cost storage for infrequently accessed data that can be stored for at least 365 days (for example, legal and regulatory archives): Archive storage.

Location

Select the location for your buckets based on your requirements for performance, availability, and data redundancy.

  • Regions are recommended when the region is close to your end users. You can select a specific region, and get guaranteed redundancy within the region. Regions offer fast, redundant, and affordable storage for datasets that users within a particular geographical area access frequently.
  • Multi-regions provide high availability for distributed users. However, the storage cost is higher than for regions. Multi-region buckets are recommended for content-serving use cases and for low-end analytics workloads.
  • Dual-regions provide high availability and data redundancy. Google recommends dual-region buckets for high-performance analytics workloads and for use cases that require true active-active buckets with compute and storage colocated in multiple locations. Dual-regions let you choose where your data is stored, which can help you meet compliance requirements. For example, you can use a dual-region bucket to meet industry-specific requirements regarding the physical distance between copies of your data in the cloud.

Lifecycle policies

Optimize storage cost for your objects in Cloud Storage by defining lifecycle policies. These policies help you save money by automatically downgrading the storage class of specific objects or deleting objects based on conditions that you set.

Configure lifecycle policies based on how frequently objects are accessed and how long you need to retain them. The following are examples of lifecycle policies:

  • Downgrade policy: You expect a dataset to be accessed frequently but for only around three months. To optimize the storage cost for this dataset, use Standard storage, and configure a lifecycle policy to downgrade objects older than 90 days to Coldline storage.
  • Deletion policy: A dataset must be retained for 365 days to meet certain legal requirements and can be deleted after that period. Configure a policy to delete any object that's older than 365 days.

    To help you ensure that data that needs to be retained for a specific period (for legal or regulatory compliance) is not deleted before that date or time, configure retention policy locks.
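
As a minimal sketch of the two example policies above, assuming the google-cloud-storage Python client and a hypothetical bucket name, you can attach the rules programmatically:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("analytics-archive")   # hypothetical bucket

    # Downgrade objects older than 90 days to Coldline storage.
    bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)

    # Delete objects older than 365 days.
    bucket.add_lifecycle_delete_rule(age=365)

    bucket.patch()

You can define the same rules declaratively in the bucket's lifecycle configuration through the Google Cloud console or the gcloud CLI.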

Accountability

To drive accountability for operational charges, network charges, and data-retrieval cost, use the Requester Pays configuration where appropriate. With this configuration, the costs are charged to the department or team that uses the data, rather than the owner.

Define and assign cost-tracking labels consistently for all your buckets and objects. Automate labeling when feasible.
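
For example, the following sketch (using the google-cloud-storage Python client with a hypothetical bucket name and labels) enables Requester Pays and applies consistent cost-tracking labels:

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("shared-datasets")   # hypothetical bucket

    # Charge data-access costs to the requesting project, not the bucket owner.
    bucket.requester_pays = True

    # Consistent cost-tracking labels that appear in billing reports.
    bucket.labels = {"team": "analytics", "env": "production"}

    bucket.patch()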

Redundancy

Use the following techniques to maintain the required storage redundancy without data duplication:

  • To maintain data resilience with a single source of truth, use a dual-region or multi-region bucket rather than redundant copies of data in different buckets. Dual-region and multi-region buckets provide redundancy across regions. Your data is replicated asynchronously across two or more locations, and is protected against regional outages.
  • If you enable object versioning, consider defining lifecycle policies to remove the oldest version of an object as newer versions become noncurrent. Each noncurrent version of an object is charged at the same rate as the live version of the object.
  • Disable object versioning policies when they are no longer necessary.
  • Review your backup and snapshot retention policies periodically, and adjust them to avoid unnecessary backups and data retention.

Persistent Disk

Every VM instance that you deploy in Compute Engine has a boot disk, and (optionally) one or more data disks. Each disk incurs cost depending on the provisioned size, region, and disk type. Any snapshots you take of your disks incur costs based on the size of the snapshot.

Use the following design and operational recommendations to help you optimize the cost of your persistent disks:

  • Don't over-allocate disk space. You can't reduce disk capacity after provisioning. Start with a small disk, and increase the size when required. Persistent disks are billed for provisioned capacity, not the data that's stored on the disks.
  • Choose a disk type that matches the performance characteristics of your workload. SSD provides high IOPS and throughput, but costs more than standard persistent disks.

  • Use regional persistent disks only when protecting data against zonal outages is essential. Regional persistent disks are replicated to another zone within the region, so you incur double the cost of equivalent zonal disks.

  • Track the usage of your persistent disks by using Cloud Monitoring, and set up alerts for disks with low usage.

  • Delete disks that you no longer need.

  • For disks that contain data that you might need in the future, consider archiving the data to low-cost Cloud Storage and then deleting the disks.

  • Look for and respond to the recommendations in the Recommendation Hub.

Consider also using Hyperdisk for high-performance storage and ephemeral disks (local SSDs) for temporary storage.

Disk snapshots are incremental by default and automatically compressed. Consider the following recommendations for optimizing the cost of your disk snapshots:

  • When feasible, organize your data in separate persistent disks. You can then choose to back up disks selectively, and reduce the cost of disk snapshots.
  • When you create a snapshot, select a location based on your availability requirements and the associated network costs.
  • If you intend to use a boot-disk snapshot to create multiple VMs, create an image from the snapshot, and then use the image to create your VMs. This approach helps you avoid network charges for data traveling between the location of the snapshot and the location where you restore it.
  • Consider setting up a retention policy to minimize long-term storage costs for disk snapshots.
  • Delete disk snapshots that you no longer need. Each snapshot in a chain might depend on data stored in a previous snapshot. So deleting a snapshot doesn't necessarily delete all the data in the snapshot. To definitively delete data from snapshots, you should delete all the snapshots in the chain.

Filestore

The cost of a Filestore instance depends on its service tier, the provisioned capacity, and the region where the instance is provisioned. The following are design and operational recommendations to optimize the cost of your Filestore instances:

  • Select a service tier and storage type (HDD or SSD) that's appropriate for your storage needs.
  • Don't over-allocate capacity. Start with a small size and increase the size later when required. Filestore billing is based on provisioned capacity, not the stored data.
  • Where feasible, organize your data in separate Filestore instances. You can then choose to back up instances selectively, and reduce the cost of Filestore backups.
  • When choosing the region and zone, consider creating instances in the same zone as the clients. You're billed for data transfer traffic from the zone of the Filestore instance.
  • When you decide the region where Filestore backups should be stored, consider the data transfer charges for storing backups in a different region from the source instance.
  • Track the usage of your Filestore instances by using Cloud Monitoring, and set up alerts for instances with low usage.
  • Scale down the allocated capacity for Filestore instances that have low usage. You can reduce the capacity of instances except for the Basic tier.

What's next

Optimize cost: Databases and smart analytics

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the cost of your databases and analytics workloads in Google Cloud.

The guidance in this section is intended for architects, developers, and administrators responsible for provisioning and managing databases and analytics workloads in the cloud.

This section includes cost-optimization recommendations for the following products:

Cloud SQL

Cloud SQL is a fully managed relational database for MySQL, PostgreSQL, and SQL Server.

Monitor usage

Review the metrics on the monitoring dashboard, and validate that your deployment meets the requirements of your workload.

Optimize resources

The following are recommendations to help you optimize your Cloud SQL resources:

  • Design a high availability and disaster recovery strategy that aligns with your recovery time objective (RTO) and recovery point objective (RPO). Depending on your workload, we recommend the following:
  • Provision the database with the minimum required storage capacity.
  • To scale storage capacity automatically as your data grows, enable the automatic storage increase feature.
  • Choose a storage type, solid-state drives (SSD) or hard disk drives (HDD), that's appropriate for your use case. SSD is the most efficient and cost-effective choice for most use cases. HDD might be appropriate for large datasets (>10 TB) that aren't latency-sensitive or are accessed infrequently.

Optimize rates

Consider purchasing committed use discounts for workloads with predictable resource needs. You can save 25% off on-demand pricing for a 1-year commitment and 52% for a 3-year commitment.

Spanner

Spanner is a cloud-native, unlimited-scale, strong-consistency database that offers up to 99.999% availability.

Monitor usage

The following are recommendations to help you track the usage of your Spanner resources:

  • Monitor your deployment, and configure the node count based on CPU recommendations.
  • Set alerts on your deployments to optimize storage resources. To determine the appropriate configuration, refer to the recommended limits per node.

Optimize resources

The following are recommendations to help you optimize your Spanner resources:

  • Run smaller workloads on Spanner at much lower cost by provisioning resources in processing units (PUs) instead of nodes; one Spanner node is equal to 1,000 PUs (a code sketch follows this list).
  • Improve query execution performance by using the query optimizer.
  • Construct SQL statements using best practices for building efficient execution plans.
  • Manage the usage and performance of Spanner deployments by using the Autoscaler tool. The tool monitors instances, adds or removes nodes automatically, and helps you ensure that the instances remain within the recommended CPU and storage limits.
  • Protect against accidental deletion or writes by using point-in-time recovery (PITR). Databases with longer version retention periods (particularly databases that overwrite data frequently) use more system resources and need more nodes.
  • Review your backup strategy, and choose between the following options:
    • Backup and restore
    • Export and import
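
As referenced earlier in this list, the following is a minimal sketch of provisioning a small Spanner instance with processing units instead of nodes, assuming the google-cloud-spanner Python client; the project, instance ID, configuration, and display name are hypothetical.

    from google.cloud import spanner

    client = spanner.Client(project="my-project")   # hypothetical project

    # 500 processing units = half the capacity of one node (1 node == 1,000 PUs).
    instance = client.instance(
        "orders-small",
        configuration_name="projects/my-project/instanceConfigs/regional-us-central1",
        display_name="Orders (small)",
        processing_units=500,
    )
    operation = instance.create()
    operation.result(timeout=300)   # wait for the create operation to finish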

Optimize rates

When deciding the location of your Spanner nodes, consider the cost differences between Google Cloud regions. For example, a node that's deployed in the us-central1 region costs considerably less per hour than a node in the southamerica-east1 region.

Bigtable

Bigtable is a cloud-native, wide-column NoSQL store for large scale, low-latency workloads.

Monitor usage

The following are recommendations to help you track the usage of your Bigtable resources:

  • Analyze usage metrics to identify opportunities for resource optimization.
  • Identify hotspots and hotkeys in your Bigtable cluster by using the Key Visualizer diagnostic tool.

Optimize resources

The following are recommendations to help you optimize your Bigtable resources:

  • To help you ensure CPU and disk usage that provides a balance between latency and storage capacity, evaluate and adjust the node count and size of your Bigtable cluster.
  • Maintain performance at the lowest cost possible by programmatically scaling your Bigtable cluster to adjust the node count automatically.
  • Evaluate the most cost-effective storage type (HDD or SSD) for your use case, based on the following considerations:

    • HDD storage costs less than SSD, but has lower performance.
    • SSD storage costs more than HDD, but provides faster and more predictable performance.

    The cost savings from HDD are minimal, relative to the cost of the nodes in your Bigtable cluster, unless you store large amounts of data. HDD storage is sometimes appropriate for large datasets (>10 TB) that are not latency-sensitive or are accessed infrequently.

  • Remove expired and obsolete data using garbage collection.

  • To avoid hotspots, apply best practices for row key design.

  • Design a cost-effective backup plan that aligns with your RPO.

  • To lower the cluster usage and reduce the node count, consider adding a cache for cacheable queries by using Memorystore.

Additional reading

BigQuery

BigQuery is a serverless, highly scalable, and cost-effective multicloud data warehouse designed for business agility.

Monitor usage

The following are recommendations to help you track the usage of your BigQuery resources:

  • Visualize your BigQuery costs segmented by projects and users. Identify the most expensive queries and optimize them.
  • Analyze slot utilization across projects, jobs, and reservations by using INFORMATION_SCHEMA metadata tables.
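
For example, the following sketch (using the google-cloud-bigquery Python client) approximates the slot-hours consumed per user over the last seven days by querying the INFORMATION_SCHEMA.JOBS_BY_PROJECT view in the US region; adjust the region qualifier and time window to your environment.

    from google.cloud import bigquery

    client = bigquery.Client()

    sql = """
        SELECT
          user_email,
          SUM(total_slot_ms) / (1000 * 60 * 60) AS slot_hours
        FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
        WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
        GROUP BY user_email
        ORDER BY slot_hours DESC
    """
    for row in client.query(sql).result():
        print(row.user_email, round(row.slot_hours, 2))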

Optimize resources

The following are recommendations to help you optimize your BigQuery resources:

  • Set up dataset-level, table-level, or partition-level expirations for data, based on your compliance strategy.
  • Limit query costs by restricting the number of bytes billed per query (see the sketch after this list). To prevent accidental human errors, enable cost control at the user level and project level.
  • Query only the data that you need. Avoid full query scans. To explore and understand data semantics, use the no-charge data preview options.
  • To reduce the processing cost and improve performance, partition and cluster your tables when possible.
  • Filter your query as early and as often as you can.
  • When processing data from multiple sources (like Bigtable, Cloud Storage, Google Drive, and Cloud SQL), avoid duplicating data by using a federated access data model and querying data directly from the sources.
  • Take advantage of BigQuery's backup instead of duplicating data. See Disaster recovery scenarios for data.
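
As referenced earlier in this list, the following is a minimal sketch of per-query cost control, assuming the google-cloud-bigquery Python client: setting maximum_bytes_billed makes a query fail instead of scanning more data than you budgeted for. The 1 GiB limit is illustrative, and the query runs against a public dataset.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Fail the query if it would bill more than 1 GiB of scanned data.
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=1024 * 1024 * 1024)

    sql = """
        SELECT name, SUM(number) AS total
        FROM `bigquery-public-data.usa_names.usa_1910_2013`
        WHERE state = 'TX'
        GROUP BY name
        ORDER BY total DESC
        LIMIT 10
    """
    for row in client.query(sql, job_config=job_config).result():
        print(row.name, row.total)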

Optimize rates

The following are recommendations to help you reduce the billing rates for your BigQuery resources:

  • Evaluate how you edit data, and take advantage of lower long-term storage prices.
  • Review the differences between flat-rate and on-demand pricing, and choose an option that suits your requirements.
  • Assess whether you can use batch loading instead of streaming inserts for your data workflows. Use streaming inserts only if the data loaded into BigQuery must be consumed immediately.
  • To increase performance and reduce the cost of retrieving data, use cached query results.

Additional reading

Dataflow

Dataflow is a fast and cost-effective serverless service for unified stream and batch data processing.

Monitor usage

The following are recommendations to help you track the usage of your Dataflow resources:

Optimize resources

The following are recommendations to help you optimize your Dataflow resources:

  • Consider Dataflow Prime for processing big data efficiently.
  • Reduce batch-processing costs by using Flexible Resource Scheduling (FlexRS) for autoscaled batched pipelines. FlexRS uses advanced scheduling, Dataflow shuffle, and a combination of preemptible and regular VMs to reduce the cost for batch pipelines.
  • Improve performance by using the in-memory shuffle service instead of Persistent Disk and worker nodes.
  • For more responsive autoscaling, and to reduce resource consumption, use Streaming Engine, which moves pipeline execution out of the worker VMs and into the Dataflow service backend.
  • If the pipeline doesn't need access to or from the internet and other Google Cloud networks, disable public IP addresses. Disabling internet access helps you reduce network costs and improve pipeline security.
  • Follow the best practices for efficient pipelining with Dataflow.

Dataproc

Dataproc is a managed Apache Spark and Apache Hadoop service for batch processing, querying, streaming, and machine learning.

The following are recommendations to help you optimize the cost of your Dataproc resources:

What's next

Optimize cost: Networking

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the cost of your networking resources in Google Cloud.

The guidance in this section is intended for architects and administrators responsible for provisioning and managing networking for workloads in the cloud.

Design considerations

A fundamental difference between on-premises and cloud networks is the dynamic, usage-based cost model in the cloud, compared with the fixed cost of networks in traditional data centers.

When planning cloud networks, it's critical to understand the pricing model, which is based on the traffic direction, as follows:

  • You incur no charges for traffic inbound to Google Cloud. Resources that process inbound traffic, like Cloud Load Balancing, incur costs.
  • For data transfer traffic, which includes both traffic between virtual machines (VMs) in Google Cloud and traffic from Google Cloud to the internet and to on-premises hosts, pricing is based on the following factors:
    • Whether the traffic uses an internal or external IP address
    • Whether the traffic crosses zone or region boundaries
    • Whether the traffic leaves Google Cloud
    • The distance that the traffic travels before leaving Google Cloud

When two VMs or cloud resources within Google Cloud communicate, traffic in each direction is designated as outbound data transfer at the source and inbound data transfer at the destination, and is priced accordingly.

Consider the following factors for designing cost-optimal cloud networks:

  • Geo-location
  • Network layout
  • Connectivity options
  • Network Service Tiers
  • Logging

These factors are discussed in more detail in the following sections.

Geo-location

Networking costs can vary depending on the Google Cloud region where your resources are provisioned. To analyze network bandwidth between regions, you can use VPC Flow Logs and the Network Intelligence Center. For traffic flowing between Google Cloud regions, cost can vary depending on the location of the regions even if the traffic doesn't go through the internet.

Besides the Google Cloud region, consider the zones where your resources are deployed. Depending on availability requirements, you might be able to design your applications to communicate at no cost within a zone through internal IP addresses. When considering a single-zone architecture, weigh the potential savings in networking cost against the impact on availability.

Network layout

Analyze your network layout, how traffic flows between your applications and users, and the bandwidth consumed by each application or user. The Network Topology tool provides comprehensive visibility into your global Google Cloud deployment and its interaction with the public internet, including an organization-wide view of the topology, and associated network performance metrics. You can identify inefficient deployments, and take necessary actions to optimize your regional and intercontinental data transfer costs.

Connectivity options

When you need to push a large volume of data (TBs or PBs) frequently from on-premises environments to Google Cloud, consider using Dedicated Interconnect or Partner Interconnect. A dedicated connection can be cheaper when compared with costs associated with traversing the public internet or using a VPN.

Use Private Google Access when possible to reduce cost and improve your security posture.

Network Service Tiers

Google's premium network infrastructure (Premium Tier) is used by default for all services. For resources that don't need the high performance and low latency that Premium Tier offers, you can choose Standard Tier, which costs less.

When choosing a service tier, consider the differences between the tiers and the limitations of Standard Tier. Fine-tune the network to the needs of your application, and potentially reduce the networking cost for services that can tolerate more latency and don't require an SLA.

Logging

VPC Flow Logs, Firewall Rule Logging, and Cloud NAT logging let you analyze network logs and identify opportunities to reduce cost.

For VPC Flow Logs and Cloud Load Balancing, you can also enable sampling, which can reduce the volume of logs written to the database. You can vary the sampling rate from 1.0 (all log entries are retained) to 0.0 (no logs are kept). For troubleshooting or custom use cases, you can choose to always collect telemetry for a particular VPC network or subnet, or monitor a specific VM Instance or virtual interface.
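
The following is a sketch of lowering the flow-log sampling rate on an existing subnet, assuming the google-cloud-compute Python client; the project, region, subnet name, and 25% sampling rate are hypothetical.

    from google.cloud import compute_v1

    client = compute_v1.SubnetworksClient()
    subnet = client.get(project="my-project", region="us-central1", subnetwork="app-subnet")

    # Keep flow logs enabled, but sample only 25% of flows to reduce log volume.
    subnet.log_config = compute_v1.SubnetworkLogConfig(enable=True, flow_sampling=0.25)

    client.patch(
        project="my-project",
        region="us-central1",
        subnetwork="app-subnet",
        subnetwork_resource=subnet,
    )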

Design recommendations

To optimize network traffic, we recommend the following:

  • Design your solutions to bring applications closer to your user base. Use Cloud CDN to reduce traffic volume and latency, and take advantage of CDN's lower pricing to serve content that you expect many users to access frequently.
  • Avoid synchronizing data globally across regions that are distant from the end user or that can incur high networking costs. If an application is used only within a region, avoid cross-region data processing.
  • Ensure that communication between VMs within a zone is routed through their internal IP addresses, and not routed externally.
  • Reduce data transfer cost and client latency by compressing data output.
  • Analyze spending patterns and identify opportunities to control cost by observing outbound and inbound traffic flows for critical projects using VPC Flow Logs.
  • When designing your networks in the cloud, consider the trade-off between the high availability that a distributed network offers and the cost savings from centralizing traffic within a single zone or region.

To optimize the price that you pay for networking services, we recommend the following:

  • If the server location is not a constraint, assess the cost at different regions, and select the most cost-efficient region. For general outbound traffic, like content served by a group of web servers, prices can vary depending on the region where the servers are provisioned.
  • To reduce the cost of moving high volumes of data frequently to the cloud, use a direct connection between the on-premises and Google Cloud networks. Consider using Dedicated Interconnect or Partner Interconnect.
  • Choose an appropriate service tier for each environment: that is, Standard Tier for development and test environments, and Premium Tier for production.

What's next

Optimize cost: Cloud operations

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the cost of monitoring and managing your resources in Google Cloud.

The guidance in this section is intended for cloud users who are responsible for monitoring and controlling the usage and cost of their organization's resources in the cloud.

Google Cloud Observability is a collection of managed services that you can use to monitor, troubleshoot, and improve the performance of your workloads in Google Cloud. These services include Cloud Monitoring, Cloud Logging, Error Reporting, Cloud Trace, and Cloud Profiler. One of the benefits of managed services in Google Cloud is that the services are usage-based: you pay only for the volume of data that you use, with free monthly data-usage allotments and unlimited access to Google Cloud metrics and audit logs.

Cloud Logging

The following are recommendations to help you optimize the cost of your Logging operations:

  • Filter billing reports to show Logging costs.
  • Reduce the volume of logs ingested and stored, by excluding or filtering unnecessary log entries.
  • Verify whether the exclusion filters are adequate by monitoring the billing/bytes_ingested and billing/monthly_bytes_ingested metrics in the Google Cloud console.
  • Offload and export logs to lower-cost storage.
  • When streaming logs from third-party applications, reduce log volumes by using the logging agent on only production instances or by configuring it to send less data.

Cloud Monitoring

The following are recommendations to help you optimize the cost of your Monitoring operations:

  • Optimize metric and label usage by limiting the number of labels, and avoid labels with high cardinality. For example, if you use an IP address as a label, each IP address creates its own label value, which results in a very large number of label combinations when you have many VMs. (A sketch of writing a custom metric with low-cardinality labels follows this list.)
  • Reduce the volume of detailed metrics for applications that don't require these metrics, or remove the monitoring agent, especially for nonessential environments.
  • Minimize the ingestion volume by reducing the number of custom metrics that your application sends.
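
As referenced earlier in this list, the following is a minimal sketch of writing a custom metric with a single low-cardinality label, assuming the google-cloud-monitoring Python client; the project ID, metric type, label, and value are hypothetical.

    import time

    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"   # hypothetical project

    client = monitoring_v3.MetricServiceClient()

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/checkout/queue_depth"   # hypothetical metric
    # Keep label cardinality low: a handful of environments, not per-IP or per-user values.
    series.metric.labels["env"] = "prod"
    series.resource.type = "global"
    series.resource.labels["project_id"] = PROJECT_ID

    now = time.time()
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
    )
    series.points = [
        monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
    ]

    client.create_time_series(name=f"projects/{PROJECT_ID}", time_series=[series])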

Cloud Trace

The following are recommendations to help you optimize the cost of your Trace operations:

  • If you use Trace as an export destination for your OpenCensus traces, reduce the volume of traces that are ingested by using the sampling feature in OpenCensus (see the sketch after this list).
  • Limit the usage of Trace, and control cost by using quotas. You can enforce span quotas using the API-specific quota page in the Google Cloud console.
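
As a minimal sketch of trace sampling, assuming the opencensus and opencensus-ext-stackdriver Python packages and a hypothetical project ID, you can export roughly 10% of traces instead of all of them:

    from opencensus.ext.stackdriver.trace_exporter import StackdriverExporter
    from opencensus.trace.samplers import ProbabilitySampler
    from opencensus.trace.tracer import Tracer

    # Sample about 10% of requests; unsampled traces are never exported to Cloud Trace.
    tracer = Tracer(
        exporter=StackdriverExporter(project_id="my-project"),   # hypothetical project
        sampler=ProbabilitySampler(rate=0.1),
    )

    with tracer.span(name="process-request"):
        pass   # application work goes here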

What's next

Google Cloud Architecture Framework: Performance optimization

This category in the Google Cloud Architecture Framework describes the performance optimization process and best practices to optimize the performance of workloads in Google Cloud.

The information in this document is intended for architects, developers, and administrators who plan, design, deploy, and manage workloads in Google Cloud.

Optimizing the performance of workloads in the cloud can help your organization operate efficiently, improve customer satisfaction, increase revenue, and reduce cost. For example, when the backend processing time of an application decreases, users experience faster response times, which can lead to higher user retention and more revenue.

There might be trade-offs between performance and cost. But sometimes, optimizing performance can help you reduce cost. For example, autoscaling helps provide predictable performance when the load increases by ensuring that the resources aren't overloaded. Autoscaling also helps you reduce cost during periods of low load by removing unused resources.

In this category of the Architecture Framework, you learn to do the following:

Performance optimization process

This document in the Google Cloud Architecture Framework provides an overview of the performance optimization process.

Performance optimization is a continuous process, not a one-time activity. The following diagram shows the stages in the performance optimization process:

Performance optimization process

The following is an overview of the stages in the performance optimization process:

Define performance requirements

Before you start to design and develop the applications that you intend to deploy or migrate to the cloud, determine the performance requirements. Define the requirements as granularly as possible for each layer of the application stack: frontend load balancing, web or application servers, database, and storage. For example, for the storage layer of the stack, decide on the throughput and I/O operations per second (IOPS) that your applications need.

Design and deploy your applications

Design your applications by using elastic and scalable design patterns that can help you meet the performance requirements. Consider the following guidelines for designing applications that are elastic and scalable:

  • Architect the workloads for optimal content placement.
  • Isolate read and write traffic.
  • Isolate static and dynamic traffic.
  • Implement content caching. Use data caches for internal layers.
  • Use managed services and serverless architectures.

Google Cloud provides open source tools that you can use to benchmark the performance of Google Cloud services with other cloud platforms.

Monitor and analyze performance

After you deploy your applications, continuously monitor performance by using logs and alerts, analyze the data, and identify performance issues. As your applications grow and evolve, reassess your performance requirements. You might have to redesign some parts of the applications to maintain or improve performance.

Optimize performance

Based on the performance of your applications and changes in requirements, configure the cloud resources to meet the current performance requirements. For example, resize the resources or set up autoscaling. When you configure the resources, evaluate opportunities to use recently released Google Cloud features and services that can help further optimize performance.

The performance optimization process doesn't end at this point. Continue the cycle of monitoring performance, reassessing requirements when necessary, and adjusting the cloud resources to maintain and improve performance.

What's next

Monitor and analyze performance

This document in the Google Cloud Architecture Framework describes the services in Google Cloud Observability that you can use to record, monitor, and analyze the performance of your workloads.

Monitor performance metrics

Use Cloud Monitoring to analyze trends of performance metrics, analyze the effects of experiments, define alerts for critical metrics, and perform retrospective analyses.

Log critical data and events

Cloud Logging is an integrated logging service that you can use to store, analyze, monitor, and set alerts for log data and events. Cloud Logging can collect logs from Google Cloud services and from other cloud providers.

Analyze code performance

Code that performs poorly can increase the latency of your applications and the cost of running them. Cloud Profiler helps you identify and address performance issues by continuously analyzing the performance of CPU-intensive or memory-intensive functions that an application uses.

Collect latency data

In complex application stacks and microservices-based architectures, assessing latency in inter-service communication and identifying performance bottlenecks can be difficult. Cloud Trace and OpenTelemetry tools help you collect latency data from your deployments at scale. These tools also help you analyze the latency data efficiently.

Monitor network performance

The Performance Dashboard of the Network Intelligence Center gives you a comprehensive view of performance metrics for the Google network and the resources in your project. These metrics can help you determine the cause of network-related performance issues. For example, you can identify whether a performance issue is the result of a problem in your project or the Google network.

What's next

Optimize compute performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your Compute Engine, Google Kubernetes Engine (GKE), and serverless resources.

Compute Engine

This section provides guidance to help you optimize the performance of your Compute Engine resources.

Autoscale resources

Managed instance groups (MIGs) let you scale your stateless apps deployed on Compute Engine VMs efficiently. Autoscaling helps your apps continue to deliver predictable performance when the load increases. In a MIG, a group of Compute Engine VMs is launched based on a template that you define. In the template, you configure an autoscaling policy, which specifies one or more signals that the autoscaler uses to scale the group. The autoscaling signals can be schedule-based, like start time or duration, or based on target metrics such as average CPU utilization. For more information, see Autoscaling groups of instances.
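
The following is a sketch of attaching an autoscaler that targets 60% average CPU utilization to an existing zonal MIG, assuming the google-cloud-compute Python client; the project, zone, MIG name, and scaling thresholds are hypothetical.

    from google.cloud import compute_v1

    PROJECT = "my-project"       # hypothetical identifiers
    ZONE = "us-central1-a"

    autoscaler = compute_v1.Autoscaler(
        name="web-autoscaler",
        target=f"projects/{PROJECT}/zones/{ZONE}/instanceGroupManagers/web-mig",
        autoscaling_policy=compute_v1.AutoscalingPolicy(
            min_num_replicas=2,
            max_num_replicas=10,
            cool_down_period_sec=60,
            # Scale out when average CPU utilization across the group exceeds 60%.
            cpu_utilization=compute_v1.AutoscalingPolicyCpuUtilization(utilization_target=0.6),
        ),
    )

    operation = compute_v1.AutoscalersClient().insert(
        project=PROJECT, zone=ZONE, autoscaler_resource=autoscaler
    )
    operation.result()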

Disable SMT

Each virtual CPU (vCPU) that you allocate to a Compute Engine VM is implemented as a single hardware multithread. By default, two vCPUs share a physical CPU core. This architecture is called simultaneous multi-threading (SMT).

For workloads that are highly parallel or that perform floating point calculations (such as transcoding, Monte Carlo simulations, genetic sequence analysis, and financial risk modeling), you can improve performance by disabling SMT. For more information, see Set the number of threads per core.
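
As a sketch of disabling SMT at VM creation time, assuming the google-cloud-compute Python client, you can set threads_per_core to 1 in the instance's advanced machine features; the project, zone, machine type, image, and disk size are illustrative.

    from google.cloud import compute_v1

    PROJECT = "my-project"        # hypothetical identifiers
    ZONE = "us-central1-a"

    instance = compute_v1.Instance(
        name="hpc-node-1",
        machine_type=f"zones/{ZONE}/machineTypes/c2-standard-16",
        # Disable SMT: schedule one thread (vCPU) per physical core.
        advanced_machine_features=compute_v1.AdvancedMachineFeatures(threads_per_core=1),
        disks=[
            compute_v1.AttachedDisk(
                boot=True,
                auto_delete=True,
                initialize_params=compute_v1.AttachedDiskInitializeParams(
                    source_image="projects/debian-cloud/global/images/family/debian-12",
                    disk_size_gb=50,
                ),
            )
        ],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )

    operation = compute_v1.InstancesClient().insert(
        project=PROJECT, zone=ZONE, instance_resource=instance
    )
    operation.result()   # wait for the create operation to finish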

Use GPUs

For workloads such as machine learning and visualization, you can add graphics processing units (GPUs) to your VMs. Compute Engine provides NVIDIA GPUs in passthrough mode so that your VMs have direct control over the GPUs and the associated memory. For graphics-intensive workloads such as 3D visualization, you can use NVIDIA RTX virtual workstations. After you deploy the workloads, monitor the GPU usage and review the options for optimizing GPU performance.

Use compute-optimized machine types

Workloads like gaming, media transcoding, and high performance computing (HPC) require consistently high performance per CPU core. Google recommends that you use compute-optimized machine types for the VMs that run such workloads. Compute-optimized VMs are built on an architecture that uses features like non-uniform memory access (NUMA) for optimal and reliable performance.

Tightly coupled HPC workloads have a unique set of requirements for achieving peak efficiency in performance. For more information, see the following documentation:

Choose appropriate storage

Google Cloud offers a wide range of storage options for Compute Engine VMs: Persistent disks, local solid-state drive (SSD) disks, Filestore, and Cloud Storage. For design recommendations and best practices to optimize the performance of each of these storage options, see Optimize storage performance.

Google Kubernetes Engine

This section provides guidance to help you optimize the performance of your Google Kubernetes Engine (GKE) resources.

Autoscale resources

You can automatically resize the node pools in a GKE cluster to match the current load by using the cluster autoscaler feature. Autoscaling helps your apps continue to deliver predictable performance when the load increases. The cluster autoscaler resizes node pools automatically based on the resource requests (rather than actual resource utilization) of the Pods running on the nodes. When you use autoscaling, there can be a trade-off between performance and cost. Review the best practices for configuring cluster autoscaling efficiently.

Use C2D VMs

You can improve the performance of compute-intensive containerized workloads by using C2D machine types. You can add C2D nodes to your GKE clusters by choosing a C2D machine type in your node pools.

Disable SMT

Simultaneous multi-threading (SMT) can increase application throughput significantly for general computing tasks and for workloads that need high I/O. But for workloads in which both virtual cores are compute-bound, SMT can cause inconsistent performance. To get better and more predictable performance, you can disable SMT for your GKE nodes by setting the number of vCPUs per core to 1.

Use GPUs

For compute-intensive workloads like image recognition and video transcoding, you can accelerate performance by creating node pools that use GPUs. For more information, see Running GPUs.

Use container-native load balancing

Container-native load balancing enables load balancers to distribute traffic directly and evenly to Pods. This approach provides better network performance and improved visibility into network latency between the load balancer and the Pods. Because of these benefits, container-native load balancing is the recommended solution for load balancing through Ingress.

Define a compact placement policy

Tightly coupled batch workloads need low network latency between the nodes in the GKE node pool. You can deploy such workloads to single-zone node pools, and ensure that the nodes are physically close to each other by defining a compact placement policy. For more information, see Define compact placement for GKE nodes.

Serverless compute services

This section provides guidance to help you optimize the performance of your serverless compute services in Google Cloud: Cloud Run and Cloud Functions. These services provide autoscaling capabilities, where the underlying infrastructure handles scaling automatically. By using these serverless services, you can reduce the effort to scale your microservices and functions, and focus on optimizing performance at the application level.

For more information, see the following documentation:

What's next

Review the best practices for optimizing the performance of your storage, networking, database, and analytics resources:

Optimize storage performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your storage resources in Google Cloud.

Cloud Storage

This section provides best practices to help you optimize the performance of your Cloud Storage operations.

Assess bucket performance

Assess the performance of your Cloud Storage buckets by using the gsutil perfdiag command. This command tests the performance of the specified bucket by sending a series of read and write requests with files of different sizes. You can tune the test to match the usage pattern of your applications. Use the diagnostic report that the command generates to set performance expectations and identify potential bottlenecks.

Cache frequently accessed objects

To improve the read latency for frequently accessed objects that are publicly accessible, you can configure such objects to be cached. Although caching can improve performance, stale content could be served if a cache has the old version of an object.

Scale requests efficiently

As the request rate for a bucket increases, Cloud Storage automatically increases the I/O capacity for the bucket by distributing the request load across multiple servers. To achieve optimal performance when scaling requests, follow the best practices for ramping up request rates and distributing load evenly.

Enable multithreading and multiprocessing

When you use gsutil to upload numerous small files, you can improve the performance of the operation by using the -m option. This option causes the upload request to be implemented as a batched, parallel (that is, multithreaded and multiprocessing) operation. Use this option only when you perform operations over a fast network connection. For more information, see the documentation for the -m option in Global Command-Line Options.

Upload large files as composites

To upload large files, you can use a strategy called parallel composite uploads. With this strategy, the large file is split into chunks, which are uploaded in parallel and then recomposed in the cloud. Parallel composite uploads can be faster than regular upload operations when network bandwidth and disk speed are not limiting factors. However, this strategy has some limitations and cost implications. For more information, see Parallel composite uploads.

Persistent disks and local SSDs

This section provides best practices to help you optimize the performance of your Persistent Disks and Local SSDs that are attached to Compute Engine VMs.

The performance of persistent disks and local SSDs depends on the disk type and size, VM machine type, and number of vCPUs. Use the following guidelines to manage the performance of your persistent disks and local SSDs:

Filestore

This section provides best practices to help you optimize the performance of your Filestore instances. You can use Filestore to provision fully managed Network File System (NFS) file servers for Compute Engine VMs and GKE clusters.

  • When you provision a Filestore instance, choose a service tier that meets the performance and capacity requirements of your workload.
  • For client VMs that run cache-dependent workloads, use a machine type that helps optimize the network performance of the Filestore instance. For more information, see Recommended client machine type.
  • To optimize the performance of Filestore instances for client VMs that run Linux, Google recommends specific NFS mount settings. For more information, see Linux client mount options.
  • To minimize network latency, provision your Filestore instances in regions and zones that are close to where you plan to use the instances.
  • Monitor the performance of your Filestore instances, and set up alerts by using Cloud Monitoring.

What's next

Review the best practices for optimizing the performance of your compute, networking, database, and analytics resources:

Optimize networking and API performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your networking resources and APIs in Google Cloud.

Network Service Tiers

Network Service Tiers lets you optimize the network cost and performance of your workloads. You can choose from the following tiers:

  • Premium Tier uses Google's highly reliable global backbone to help you achieve minimal packet loss and latency. Traffic enters and leaves the Google network at a global edge point of presence (PoP) that's close to your end user. We recommend using Premium Tier as the default tier for optimal performance. Premium Tier supports both regional and global external IP addresses for VMs and load balancers.
  • Standard Tier is available only for resources that use regional external IP addresses. Traffic enters and leaves the Google network at an edge PoP that's closest to the Google Cloud location where your workload runs. The pricing for Standard Tier is lower than Premium Tier. Standard Tier is suitable for traffic that isn't sensitive to packet loss and that doesn't have low latency requirements.

You can view the network latency for Standard Tier and Premium Tier for each cloud region in the Network Intelligence Center Performance Dashboard.

Jumbo frames

Virtual Private Cloud (VPC) networks have a default maximum transmission unit (MTU) of 1460 bytes. However, you can configure your VPC networks to support an MTU of up to 8896 bytes (jumbo frames).

With a higher MTU, the network needs fewer packets to send the same amount of data, thus reducing the bandwidth used up by TCP/IP headers. This leads to a higher effective bandwidth for the network.
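
The following sketch creates an auto-mode VPC network with an 8896-byte MTU, assuming the google-cloud-compute Python client; the project and network name are hypothetical. You can also change the MTU of an existing network, subject to the constraints described in the VPC documentation.

    from google.cloud import compute_v1

    network = compute_v1.Network(
        name="jumbo-frame-vpc",        # hypothetical network name
        auto_create_subnetworks=True,
        mtu=8896,                      # jumbo frames; the default MTU is 1460
    )

    operation = compute_v1.NetworksClient().insert(
        project="my-project", network_resource=network
    )
    operation.result()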

For more information about intra-VPC MTU and the maximum MTU of other connections, see the Maximum transmission unit page in the VPC documentation.

VM performance

Compute Engine VMs have a maximum egress bandwidth that depends in part on the machine type. One aspect of choosing an appropriate machine type is to consider how much traffic you expect the VM to generate.

The Network bandwidth page contains a discussion and table of network bandwidths for Compute Engine machine types.

If your inter-VM bandwidth requirements are very high, consider VMs that support Tier_1 networking.

Cloud Load Balancing

This section provides best practices to help you optimize the performance of your Cloud Load Balancing instances.

Deploy applications close to your users

Provision your application backends close to the location where you expect user traffic to arrive at the load balancer. The closer your users or client applications are to your workload servers, the lower the network latency between the users and the workload. To minimize latency to clients in different parts of the world, you might have to deploy the backends in multiple regions. For more information, see Best practices for Compute Engine regions selection.

Choose an appropriate load balancer type

The type of load balancer that you choose for your application can determine the latency that your users experience. For information about measuring and optimizing application latency for different load balancer types, see Optimizing application latency with load balancing.

Enable caching

To accelerate content serving, enable caching and Cloud CDN as part of your default external HTTP load balancer configuration. Make sure that the backend servers are configured to send the response headers that are necessary for static responses to be cached.
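
For example, a backend that serves static assets might set an explicit Cache-Control header so that Cloud CDN can cache the responses. The following Flask sketch is illustrative; the route, directory, and max-age value are assumptions.

    from flask import Flask, send_from_directory

    app = Flask(__name__)

    @app.route("/assets/<path:filename>")
    def assets(filename):
        # Cloud CDN caches responses that carry a public, cacheable Cache-Control directive.
        response = send_from_directory("static", filename)
        response.headers["Cache-Control"] = "public, max-age=3600"
        return response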

Use HTTP when HTTPS isn't necessary

Google automatically encrypts traffic between proxy load balancers and backends at the packet level. Packet-level encryption makes Layer 7 encryption using HTTPS between the load balancer and the backends redundant for most purposes. Consider using HTTP rather than HTTPS or HTTP/2 for traffic between the load balancer and your backends. By using HTTP, you can also reduce the CPU usage of your backend VMs. However, when the backend is an internet network endpoint group (NEG), use HTTPS or HTTP/2 for traffic between the load balancer and the backend. This helps ensure that your traffic is secure on the public internet. For optimal performance, we recommend benchmarking your application's traffic patterns.

Network Intelligence Center

Google Cloud Network Intelligence Center provides a comprehensive view of the performance of the Google Cloud network across all regions. Network Intelligence Center helps you determine whether latency issues are caused by problems in your project or in the network. You can also use this information to select the regions and zones where you should deploy your workloads to optimize network performance.

Use the following tools provided by Network Intelligence Center to monitor and analyze network performance for your workloads in Google Cloud:

  • Performance Dashboard shows latency between Google Cloud regions and between individual regions and locations on the internet. Performance Dashboard can help you determine where to place workloads for the best latency, and whether an application issue is caused by underlying network issues.

  • Network Topology shows a visual view of your Virtual Private Cloud (VPC) networks, hybrid connectivity with your on-premises networks, and connectivity to Google-managed services. Network Topology provides real-time operational metrics that you can use to analyze and understand network performance and identify unusual traffic patterns.

  • Network Analyzer is an automatic configuration monitoring and diagnostics tool. It verifies VPC network configurations for firewall rules, routes, configuration dependencies, and connectivity for services and applications. It helps you identify network failures, and provides root cause analysis and recommendations. Network Analyzer provides prioritized insights to help you analyze problems with network configuration, such as high utilization of IP addresses in a subnet.

API Gateway and Apigee

This section provides recommendations to help you optimize the performance of the APIs that you deploy in Google Cloud by using API Gateway and Apigee.

API Gateway lets you create and manage APIs for Google Cloud serverless backends, including Cloud Functions, Cloud Run, and App Engine. These services are managed services, and they scale automatically. But as the applications that are deployed on these services scale, you might need to increase the quotas and rate limits for API Gateway.

Apigee provides the following analytics dashboards to help you monitor the performance of your managed APIs:

If you use Apigee Integration, consider the system-configuration limits when you build and manage your integrations.

What's next

Review the best practices for optimizing the performance of your compute, storage, database, and analytics resources:

Optimize database performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your databases in Google Cloud.

Cloud SQL

The following recommendations help you to optimize the performance of your Cloud SQL instances running SQL Server, MySQL, and PostgreSQL databases.

For more information, see the following documentation:

Bigtable

This section provides recommendations to help you optimize the performance of your Bigtable instances.

Plan capacity based on performance requirements

You can use Bigtable in a broad spectrum of applications, each with a different optimization goal. For example, for batch data-processing jobs, throughput might be more important than latency. For an online service that serves user requests, you might need to prioritize lower latency over throughput. When you plan capacity for your Bigtable clusters, consider the tradeoffs between throughput and latency. For more information, see Plan your Bigtable capacity.

Follow schema-design best practices

Your tables can scale to billions of rows and thousands of columns, enabling you to store petabytes of data. When you design the schema for your Bigtable tables, consider the schema design best practices.
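One of the most common schema-design pitfalls is a row key that begins with a timestamp, which sends all new writes to the same part of the key space. The following sketch, which uses hypothetical field names, contrasts that pattern with a key that leads with a high-cardinality identifier:

```python
# Illustrative row-key construction for Bigtable. Field names are examples only.
import datetime

def hot_row_key(event_time: datetime.datetime, device_id: str) -> str:
    # Anti-pattern: keys that start with a timestamp are written in
    # monotonically increasing order, which concentrates load on one node.
    return f"{event_time.isoformat()}#{device_id}"

def balanced_row_key(event_time: datetime.datetime, device_id: str) -> str:
    # Better: lead with a high-cardinality identifier so that writes are
    # spread across the key space, and keep the timestamp for range scans
    # within a single device.
    return f"{device_id}#{event_time.isoformat()}"

now = datetime.datetime(2023, 11, 9, 12, 0, 0)
print(hot_row_key(now, "device-42"))       # 2023-11-09T12:00:00#device-42
print(balanced_row_key(now, "device-42"))  # device-42#2023-11-09T12:00:00
```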

Monitor performance and make adjustments

Monitor the CPU and disk usage for your instances, analyze the performance of each cluster, and review the sizing recommendations that are shown in the monitoring charts.

Spanner

This section provides recommendations to help you optimize the performance of your Spanner instances.

Choose a primary key that prevents a hotspot

A hotspot is a single server that is forced to handle many requests. When you choose the primary key for your database, follow the schema design best practices to prevent a hotspot.
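For example, a primary key that increases monotonically (such as a timestamp or a sequential ID) sends every insert to the end of the same split. One common alternative, sketched below with a hypothetical Singers table, is to generate a version 4 UUID for the key:

```python
# Sketch: generating non-sequential primary keys to avoid write hotspots.
# The table and column names are hypothetical.
import uuid

def new_singer_row(first_name: str, last_name: str) -> dict:
    return {
        # A random (version 4) UUID distributes inserts across the key space
        # instead of appending them all to the end of one split.
        "SingerId": str(uuid.uuid4()),
        "FirstName": first_name,
        "LastName": last_name,
    }

print(new_singer_row("Marc", "Richards"))
```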

Follow best practices for SQL coding

The SQL compiler in Spanner converts each declarative SQL statement that you write into an imperative query execution plan. Spanner uses the execution plan to run the SQL statement. When you construct SQL statements, follow SQL best practices to make sure that Spanner uses execution plans that yield optimal performance.
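One broadly applicable practice is to use query parameters instead of concatenating literal values into the SQL text, which lets Spanner reuse cached execution plans. A minimal sketch with the Spanner Python client follows; the instance ID, database ID, and Singers table are assumptions:

```python
# Sketch: running a parameterized query with the Spanner Python client.
# The instance ID, database ID, and Singers table are hypothetical.
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database("my-database")

with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT SingerId, FirstName FROM Singers WHERE LastName = @last_name",
        params={"last_name": "Richards"},
        param_types={"last_name": spanner.param_types.STRING},
    )
    for row in results:
        print(row)
```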

Use query options to manage the SQL query optimizer

Spanner uses a SQL query optimizer to transform SQL statements into efficient query execution plans. The query execution plan that the optimizer produces might change slightly when the query optimizer itself evolves, or when the database statistics are updated. You can minimize the potential for performance regression when the query optimizer or the database statistics change by using query options.
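For example, you can pin the optimizer version for an individual query so that optimizer upgrades take effect on your schedule. The following sketch reuses the hypothetical instance and database from the previous example and passes query_options to execute_sql, which is one way the Python client exposes this setting; the version value shown is only an example:

```python
# Sketch: pinning the query optimizer version for a single query.
# The instance and database IDs are hypothetical; "3" is an example version.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

with database.snapshot() as snapshot:
    results = snapshot.execute_sql(
        "SELECT 1",
        # Pin the optimizer version so that plan changes from optimizer
        # upgrades don't affect this query until you choose to move.
        query_options={"optimizer_version": "3"},
    )
    print(list(results))
```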

Visualize and tune the structure of query execution plans

To analyze query performance issues, you can visualize and tune the structure of the query execution plans by using the query plan visualizer.

Use operations APIs to manage long-running operations

For certain method calls, Spanner creates long-running operations, which might take a substantial amount of time to complete. For example, when you restore a database, Spanner creates a long-running operation to track restore progress. To help you monitor and manage long-running operations, Spanner provides operations APIs. For more information, see Managing long-running operations.
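In the Python client, methods that start long-running operations return an Operation object that you can poll or block on. The following sketch uses database creation as the example operation; the instance ID, database ID, and DDL are hypothetical:

```python
# Sketch: tracking a long-running operation returned by the Spanner client.
# The instance ID, database ID, and DDL statement are hypothetical.
from google.cloud import spanner

client = spanner.Client()
instance = client.instance("my-instance")
database = instance.database(
    "new-database",
    ddl_statements=[
        "CREATE TABLE Singers (SingerId STRING(36)) PRIMARY KEY (SingerId)"
    ],
)

operation = database.create()      # returns a long-running operation
print("Done?", operation.done())   # poll without blocking
operation.result(timeout=300)      # block until the operation completes
print("Database created")
```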

Follow best practices for bulk loading

Spanner supports several options for loading large amounts of data in bulk. The performance of a bulk-load operation depends on factors such as partitioning, the number of write requests, and the size of each request. To load large amounts of data efficiently, follow bulk-loading best practices.
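For example, committing rows in batches of mutations is generally much faster than inserting them one at a time. The following single-worker sketch uses the Python client with hypothetical IDs and table names; a real bulk load would also partition the rows by primary-key range across multiple workers:

```python
# Sketch: committing rows in batches of mutations with the Spanner client.
# Instance, database, and table names are hypothetical; real bulk loads
# should also partition rows by primary-key range across workers.
import uuid
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-database")

rows = [(str(uuid.uuid4()), f"name-{i}") for i in range(10_000)]
BATCH_SIZE = 500  # keep each commit well under Spanner's mutation limits

for start in range(0, len(rows), BATCH_SIZE):
    with database.batch() as batch:
        batch.insert(
            table="Singers",
            columns=("SingerId", "FirstName"),
            values=rows[start:start + BATCH_SIZE],
        )
```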

Monitor and control CPU utilization

The CPU utilization of your Spanner instance can affect request latencies. An overloaded backend server can cause higher request latencies. Spanner provides CPU utilization metrics to help you investigate high CPU utilization. For performance-sensitive applications, you might need to reduce CPU utilization by increasing the compute capacity.

Analyze and solve latency issues

When a client makes a remote procedure call to Spanner, the API request is first prepared by the client libraries. The request then passes through the Google Front End and the Cloud Spanner API frontend before it reaches the Spanner database. To analyze and solve latency issues, you must measure and analyze the latency for each segment of the path that the API request traverses. For more information, see Spanner end-to-end latency guide.

Launch applications after the database reaches the warm state

As your Spanner database grows, it divides the key space of your data into splits. Each split is a range of rows that contains a subset of your table. To balance the overall load on the database, Spanner dynamically moves individual splits independently and assigns them to different servers. When the splits are distributed across multiple servers, the database is considered to be in a warm state. A database that's warm can maximize parallelism and deliver improved performance. Before you launch your applications, we recommend that you warm up your database with test data loads.

What's next

Review the best practices for optimizing the performance of your compute, storage, networking, and analytics resources:

Optimize analytics performance

This document in the Google Cloud Architecture Framework provides recommendations to help you optimize the performance of your analytics workloads in Google Cloud.

BigQuery

This section provides recommendations to help you optimize the performance of queries in BigQuery.

Optimize query design

Query performance depends on factors like the number of bytes that your queries read and write, and the volume of data that's passed between slots. To optimize the performance of your queries in BigQuery, apply the best practices that are described in the following documentation:

Define and use materialized views efficiently

To improve the performance of workloads that use common and repeated queries, you can use materialized views. There are limits to the number of materialized views that you can create. Don't create a separate materialized view for every permutation of a query. Instead, define materialized views that you can use for multiple patterns of queries.
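For example, one materialized view that pre-aggregates by the columns that several query shapes share can serve all of those queries. The following sketch uses the BigQuery Python client; the dataset, table, and column names are hypothetical:

```python
# Sketch: creating one materialized view that covers several related query
# patterns. The dataset, table, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

client.query(
    """
    CREATE MATERIALIZED VIEW my_dataset.store_sales_mv AS
    SELECT
      store_id,
      SUM(amount) AS total_amount,
      COUNT(*) AS order_count
    FROM my_dataset.orders
    GROUP BY store_id
    """
).result()

# Queries that need per-store totals, counts, or averages can all be
# served from this single view instead of one view per query shape.
```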

Improve JOIN performance

You can use materialized views to reduce the cost and latency of a query that performs aggregation on top of a JOIN. Consider a case where you join a large fact table with a few small dimension tables, and then perform an aggregation on top of the join. It might be practical to rewrite the query to first perform the aggregation on top of the fact table with foreign keys as grouping keys. Then, join the result with the dimension tables. Finally, perform a post-aggregation.
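A hedged sketch of that rewrite follows, with hypothetical fact (orders) and dimension (stores) tables:

```python
# Sketch: pre-aggregating a large fact table before joining it with a
# dimension table. Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

rewritten_query = """
WITH pre_aggregated AS (
  -- Aggregate the large fact table first, grouped by the foreign key.
  SELECT store_id, SUM(amount) AS total_amount
  FROM my_dataset.orders
  GROUP BY store_id
)
-- Join the much smaller aggregated result with the dimension table,
-- then perform the post-aggregation.
SELECT d.region, SUM(p.total_amount) AS region_total
FROM pre_aggregated AS p
JOIN my_dataset.stores AS d USING (store_id)
GROUP BY d.region
"""

for row in client.query(rewritten_query).result():
    print(row.region, row.region_total)
```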

Dataflow

This section provides recommendations to help you optimize the performance of your Dataflow pipelines.

When you create and deploy pipelines, you can configure execution parameters, like the Compute Engine machine type that should be used for the Dataflow worker VMs. For more information, see Pipeline options.
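For example, with the Apache Beam Python SDK you can set the worker machine type and other execution parameters when you construct the pipeline options. The project, region, bucket, and machine type in the following sketch are placeholder values:

```python
# Sketch: configuring Dataflow execution parameters with the Beam Python SDK.
# The project, region, bucket, and machine type values are examples only.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    worker_machine_type="n2-standard-4",  # machine type for the worker VMs
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["hello", "world"])
        | "Print" >> beam.Map(print)
    )
```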

After you deploy pipelines, Dataflow manages the Compute Engine and Cloud Storage resources that are necessary to run your jobs. In addition, the following features of Dataflow help optimize the performance of the pipelines:

You can monitor the performance of Dataflow pipelines by using the web-based monitoring interface or the Dataflow command-line interface in the gcloud CLI.

Dataproc

This section describes best practices to optimize the performance of your Dataproc clusters.

Autoscale clusters

To ensure that your Dataproc clusters deliver predictable performance, you can enable autoscaling. Dataproc uses Hadoop YARN memory metrics and an autoscaling policy that you define to automatically adjust the number of worker VMs in a cluster. For more information about how to use and configure autoscaling, see Autoscaling clusters.

Provision appropriate storage

Choose an appropriate storage option for your Dataproc cluster based on your performance and cost requirements:

  • If you need a low-cost Hadoop-compatible file system (HCFS) that Hadoop and Spark jobs can read from and write to with minimal changes, use Cloud Storage. The data stored in Cloud Storage is persistent, and can be accessed by other Dataproc clusters and other products such as BigQuery.
  • If you need a low-latency Hadoop Distributed File System (HDFS) for your Dataproc cluster, use Compute Engine persistent disks attached to the worker nodes. The data stored in HDFS storage is transient, and the storage cost is higher than the Cloud Storage option.
  • To get the performance advantage of Compute Engine persistent disks and the cost and durability benefits of Cloud Storage, you can combine both of the storage options. For example, you can store your source and final datasets in Cloud Storage, and provision limited HDFS capacity for the intermediate datasets. When you decide on the size and type of the disks for HDFS storage, consider the recommendations in the Persistent disks and local SSDs section.

Reduce latency when using Cloud Storage

To reduce latency when you access data that's stored in Cloud Storage, we recommend the following:

  • Create your Cloud Storage bucket in the same region as the Dataproc cluster.
  • Disable auto.purge for Apache Hive-managed tables stored in Cloud Storage.
  • When you use Spark SQL, consider creating Dataproc clusters with the latest available image versions. By using the latest version, you can avoid performance issues that might remain in older versions, such as slow INSERT OVERWRITE performance in Spark 2.x.
  • To minimize the possibility of writing many files with varying or small sizes to Cloud Storage, you can configure the Spark SQL parameters spark.sql.shuffle.partitions and spark.default.parallelism or the Hadoop parameter mapreduce.job.reduces, as shown in the sketch after this list.
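The following PySpark sketch shows one way to set those parameters; the partition counts and Cloud Storage paths are placeholder values, not tuned recommendations:

```python
# Sketch: reducing the number of small output files by tuning shuffle
# parallelism in a PySpark job. The partition counts and paths are examples.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tune-output-files")
    # Fewer shuffle partitions generally means fewer, larger output files.
    .config("spark.sql.shuffle.partitions", "96")
    .config("spark.default.parallelism", "96")
    .getOrCreate()
)

df = spark.read.parquet("gs://my-bucket/input/")
df.groupBy("store_id").count().write.parquet("gs://my-bucket/output/")
```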

Monitor and adjust storage load and capacity

The persistent disks attached to the worker nodes in a Dataproc cluster hold shuffle data. To perform optimally, the worker nodes need sufficient disk space. If the nodes don't have sufficient disk space, the nodes are marked as UNHEALTHY in the YARN NodeManager log. If this issue occurs, either increase the disk size for the affected nodes, or run fewer jobs concurrently.

Enable EFM

When worker nodes are removed from a running Dataproc cluster, such as due to downscaling or preemption, shuffle data might be lost. To minimize job delays in such scenarios, we recommend that you enable Enhanced Flexibility Mode (EFM) for clusters that use preemptible VMs or that only autoscale the secondary worker group.

What's next

Review the best practices for optimizing the performance of your compute, storage, networking, and database resources:

What's new in the Architecture Framework

This document lists significant changes to the Google Cloud Architecture Framework.

November 28, 2023

November 9, 2023

September 8, 2023

  • Cost optimization category:

    • Added information about using tags for cost allocation and governance.
    • Updated the guidance for identifying labeling anomalies.

    For more information, see Track and allocate cost using tags or labels.

August 28, 2023

August 23, 2023

  • Cost optimization category:
    • Added guidance about optimizing Spanner resource usage for small workloads by using Processing Units instead of nodes.

August 18, 2023

August 9, 2023

July 13, 2023

  • System design:
  • Cost optimization:
    • Added guidance about Google Cloud Hyperdisk and local SSDs in the Persistent Disk section.

June 23, 2023

June 15, 2023

March 30, 2023

September 16, 2022

August 10, 2022

August 4, 2022

July 13, 2022

June 27, 2022

June 13, 2022

June 1, 2022

May 7, 2022

May 4, 2022

February 25, 2022

  • Changes to the security category:

    • Updated compliance best practices to discuss automation.

December 15, 2021

October 25, 2021

October 7, 2021

  • Major refresh of all the categories.

