Google Cloud Architecture Framework

The Google Cloud Architecture Framework provides recommendations and describes best practices to help architects, developers, administrators, and other cloud practitioners design and operate a cloud topology that's secure, efficient, resilient, high-performing, and cost-effective.

A cross-functional team of experts at Google validates the design recommendations and best practices that make up the Architecture Framework. The team curates the Architecture Framework to reflect the expanding capabilities of Google Cloud, industry best practices, community knowledge, and feedback from you. For a summary of the significant changes, see What's new.

The design guidance in the Architecture Framework applies to applications built for the cloud and for workloads migrated from on-premises to Google Cloud, hybrid cloud deployments, and multi-cloud environments.

The Google Cloud Architecture Framework is organized into six categories (also known as pillars), as shown in the following diagram:

[Diagram: Google Cloud Architecture Framework]

System design
This category is the foundation of the Google Cloud Architecture Framework. Define the architecture, components, modules, interfaces, and data needed to satisfy cloud system requirements, and learn about Google Cloud products and features that support system design.
Operational excellence
Efficiently deploy, operate, monitor, and manage your cloud workloads.
Security, privacy, and compliance
Maximize the security of your data and workloads in the cloud, design for privacy, and align with regulatory requirements and standards.
Reliability
Design and operate resilient and highly available workloads in the cloud.
Cost optimization
Maximize the business value of your investment in Google Cloud.
Performance optimization
Design and tune your cloud resources for optimal performance.

If you have any questions or need help, join our open discussion forums and get expert recommendations in the Architecture Framework space of the Google Cloud Community. The community space also has a series of articles with questions and practical guidance to help you address any challenges around designing and operating your cloud architecture.

Google Cloud Architecture Framework: System design

System design is the foundational category of the Google Cloud Architecture Framework. This category provides design recommendations and describes best practices and principles to help you define the architecture, components, modules, interfaces, and data on a cloud platform to satisfy your system requirements. You also learn about Google Cloud products and features that support system design.

The documents in the system design category assume that you understand basic system design principles. These documents don't assume that you are familiar with cloud concepts and Google Cloud products.

For complex cloud migration and deployment scenarios, we recommend that you use Google Cloud consulting services. Our consultants provide expertise on best practices and guiding principles to help you succeed in your cloud journey. Google Cloud also has a strong ecosystem of partners, from large global systems integrators to partners with a deep specialization in a particular area like machine learning. We recommend that you engage Google Cloud partners to accelerate your digital transformation and improve business outcomes.

In the system design category of the Architecture Framework, you learn to do the following:

Core principles of system design

This document in the Google Cloud Architecture Framework describes the core principles of system design. A robust system design is secure, reliable, scalable, and independent. It lets you apply changes atomically, minimize potential risks, and improve operational efficiency. To achieve a robust system design, we recommend that you follow four core principles.

Document everything

When you start to move your workloads to the cloud or build your applications, a major blocker to success is lack of documentation of the system. Documentation is especially important for correctly visualizing the architecture of your current deployments.

A properly documented cloud architecture establishes a common language and standards, which enable cross-functional teams to communicate and collaborate effectively. It also provides the information that's necessary to identify and guide future design decisions. Documentation should be written with your use cases in mind, to provide context for the design decisions.

Over time, your design decisions will evolve and change. The change history provides the context that your teams require to align initiatives, avoid duplication, and measure performance changes effectively over time. Change logs are particularly valuable when you onboard a new cloud architect who is not yet familiar with your current system design, strategy, or history.

Simplify your design and use fully managed services

Simplicity is crucial for system design. If your architecture is too complex to understand, it will be difficult to implement the design and manage it over time. Where feasible, use fully managed services to minimize the risks, time, and effort associated with managing and maintaining baseline systems.

If you're already running your workloads in production, test with managed services to see how they might help to reduce operational complexities. If you're developing new workloads, then start simple, establish a minimal viable product (MVP), and resist the urge to over-engineer. You can identify exceptional use cases, iterate, and improve your systems incrementally over time.

Decouple your architecture

Decoupling is a technique that's used to separate your applications and service components into smaller components that can operate independently. For example, you might break up a monolithic application stack into separate service components. In a decoupled architecture, an application can run its functions independently, regardless of the various dependencies.

A decoupled architecture gives you increased flexibility to do the following:

  • Apply independent upgrades.
  • Enforce specific security controls.
  • Establish reliability goals for each subsystem.
  • Monitor health.
  • Granularly control performance and cost parameters.

You can start decoupling early in your design phase or incorporate it as part of your system upgrades as you scale.
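As a minimal sketch of decoupling, the following Python example separates a producer from a consumer with an in-process queue. The names `submit_order` and `process_orders` are hypothetical, and `queue.Queue` only stands in for whatever messaging layer sits between the two components in a real deployment.

```python
import queue


def submit_order(orders: queue.Queue, order_id: str) -> None:
    """Producer: hands work to the queue without knowing who consumes it."""
    orders.put(order_id)


def process_orders(orders: queue.Queue) -> list[str]:
    """Consumer: drains whatever is currently queued and returns the results."""
    processed = []
    while True:
        try:
            order_id = orders.get_nowait()
        except queue.Empty:
            break
        processed.append(f"processed:{order_id}")
    return processed
```

Because the two functions share only the queue, either side can be upgraded, monitored, or scaled without changing the other.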

Use a stateless architecture

A stateless architecture can increase both the reliability and scalability of your applications.

Stateful applications rely on various dependencies to perform tasks, such as locally cached data. Stateful applications often require additional mechanisms to capture progress and restart gracefully. Stateless applications can perform tasks without significant local dependencies by using shared storage or cached services. A stateless architecture enables your applications to scale up quickly with minimum boot dependencies. The applications can withstand hard restarts, have lower downtime, and provide better performance for end users.

The system design category describes recommendations to make your applications stateless or to utilize cloud-native features to improve capturing machine state for your stateful applications.
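The difference between stateful and stateless handling can be sketched in a few lines of Python. In this illustration the function `handle_page_view` and the `store` argument are hypothetical; `store` stands in for a shared storage or cache service, and a plain dictionary plays that role here.

```python
from typing import MutableMapping


def handle_page_view(store: MutableMapping[str, int], user_id: str) -> int:
    """Increment and return a user's view count, keeping no state in the process.

    `store` stands in for a shared service (hypothetical here); any instance
    of the application that receives the request sees the same count.
    """
    count = store.get(user_id, 0) + 1
    store[user_id] = count
    return count
```

Because the handler keeps nothing between calls, an instance can be restarted or replaced without losing progress. The read-then-write step is not atomic in this sketch; a real shared store would likely need an atomic increment.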

Select geographic zones and regions

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on geographic requirements. You learn how to select optimal geographic zones and regions based on availability and proximity, to support compliance, optimize costs, and implement load balancing.

When you select a region or multiple regions for your business applications, you consider criteria including service availability, end-user latency, application latency, cost, and regulatory or sustainability requirements. To support your business priorities and policies, balance these requirements and identify the best tradeoffs. For example, the most compliant region might not be the most cost-efficient region or it might not have the lowest carbon footprint.
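One way to make such tradeoffs explicit is a simple weighted score. The following Python sketch is only an illustration: the region names, criteria, and scores are hypothetical placeholders, not measurements, and the weighting scheme is an assumption rather than a method prescribed by the framework.

```python
def rank_regions(candidates, weights):
    """Return region names sorted from best to worst weighted score.

    candidates: {region: {criterion: score between 0 and 1, higher is better}}
    weights:    {criterion: relative importance}
    """
    def score(region):
        return sum(weights[c] * candidates[region].get(c, 0.0) for c in weights)

    return sorted(candidates, key=score, reverse=True)


# Hypothetical scores for illustration only; not real measurements.
EXAMPLE_CANDIDATES = {
    "region-a": {"latency": 0.9, "cost": 0.4, "carbon": 0.5},
    "region-b": {"latency": 0.6, "cost": 0.8, "carbon": 0.9},
}
```

With only latency weighted, `region-a` ranks first; with cost and carbon weighted equally, `region-b` does. Hard requirements such as compliance are usually better handled by filtering the candidate list before scoring than by weighting.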

Deploy over multiple regions

Regions are independent geographic areas that consist of multiple zones. A zone is a deployment area for Google Cloud resources within a region; each zone represents a single failure domain within a region.

To help protect against both expected downtime (including maintenance) and unexpected downtime (such as incidents), we recommend that you deploy fault-tolerant, highly available applications across multiple zones in one or more regions. For more information, see Geography and regions, Application deployment considerations, and Best practices for Compute Engine regions selection.

Multi-zonal deployments can provide resiliency if multi-region deployments are limited due to cost or other considerations. This approach is especially helpful in protecting against zonal outages and in addressing disaster recovery and business continuity concerns; protecting against a regional outage still requires a multi-region deployment. For more information, see Design for scale and high availability.

Select regions based on geographic proximity

Latency impacts the user experience and affects costs associated with serving external users. To minimize latency when serving traffic to external users, select a region or set of regions that are geographically close to your users and where your services run in a compliant way. For more information, see Cloud locations and the Compliance resource center.

Select regions based on available services

Select a region based on the available services that your business requires. Most services are available across all regions. Some enterprise-specific services might be available in a subset of regions with their initial release. To verify region selection, see Cloud locations.

Choose regions to support compliance

Select a specific region or set of regions to meet geographic regulatory or compliance regulations that require the use of certain geographies, for example General Data Protection Regulation (GDPR) or data residency. To learn more about designing secure systems, see Compliance offerings and Data residency, operational transparency, and privacy for European customers on Google Cloud.

Compare pricing of major resources

Regions have different cost rates for the same services. To identify a cost-efficient region, compare pricing of the major resources that you plan to use. Cost considerations differ depending on backup requirements and resources like compute, networking, and data storage. To learn more, see the Cost optimization category.

Use Cloud Load Balancing to serve global users

To improve the user experience when you serve global users, use Cloud Load Balancing to help provide a single IP address that is routed to your application. To learn more about designing reliable systems, see Google Cloud Architecture Framework: Reliability.

Use the Cloud Region Picker to support sustainability

Google has been carbon neutral since 2007 and is committed to being carbon-free by 2030. To select a region by its carbon footprint, use the Google Cloud Region Picker. To learn more about designing for sustainability, see Cloud sustainability.

What's next

Learn how to manage your cloud resources using Resource Manager, the Google Cloud resource hierarchy, and the Organization Policy Service.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Manage cloud resources

This document in the Google Cloud Architecture Framework provides best practices to organize and manage your resources in Google Cloud.

Resource hierarchy

Google Cloud resources are arranged hierarchically in organizations, folders, and projects. This hierarchy lets you manage common aspects of your resources like access control, configuration settings, and policies. For best practices to design the hierarchy of your cloud resources, see Decide a resource hierarchy for your Google Cloud landing zone.

Resource labels and tags

This section provides best practices for using labels and tags to organize your Google Cloud resources.

Use a simple folder structure

Folders let you group any combination of projects and subfolders. Create a simple folder structure to organize your Google Cloud resources. You can add more levels as needed to define your resource hierarchy so that it supports your business needs. The folder structure is flexible and extensible. To learn more, see Creating and managing folders.

Use folders and projects to reflect data governance policies

Use folders, subfolders, and projects to separate resources from each other to reflect data governance policies within your organization. For example, you can use a combination of folders and projects to separate financial, human resources, and engineering.

Use projects to group resources that share the same trust boundary. For example, resources for the same product or microservice can belong to the same project. For more information, see Decide a resource hierarchy for your Google Cloud landing zone.

Use tags and labels at the outset of your project

Use labels and tags when you start to use Google Cloud products, even if you don't need them immediately. Adding labels and tags later on can require manual effort that can be error prone and difficult to complete.

A tag provides a way to conditionally allow or deny policies based on whether a resource has a specific tag. A label is a key-value pair that helps you organize your Google Cloud instances. For more information on labels, see requirements for labels, a list of services that support labels, and label formats.

Resource Manager provides labels and tags to help you manage resources, allocate and report on cost, and assign policies to different resources for granular access controls. For example, you can use labels and tags to apply granular access and management principles to different tenant resources and services. For information about VM labels and network tags, see Relationship between VM labels and network tags.

You can use labels for multiple purposes, including the following:

  • Managing resource billing: Labels are available in the billing system, which lets you separate cost by labels. For example, you can label different cost centers or budgets.
  • Grouping resources by similar characteristics or by relation: You can use labels to separate different application lifecycle stages or environments. For example, you can label production, development, and testing environments.

Assign labels to support cost and billing reporting

To support granular cost and billing reporting based on attributes outside of your integrated reporting structures (like per-project or per-product type), assign labels to resources. Labels can help you allocate consumption to cost centers, departments, specific projects, or internal recharge mechanisms. For more information, see the Cost optimization category.
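As a rough illustration of label-based cost allocation, the following Python sketch groups cost rows by the value of one label. The row shape (`cost` plus a `labels` dictionary) and the `cost-center` key are assumptions made for this example; they do not reproduce the schema of any billing export.

```python
from collections import defaultdict


def cost_by_label(rows, label_key):
    """Sum cost per value of `label_key`; rows without the label go to 'unlabeled'."""
    totals = defaultdict(float)
    for row in rows:
        value = row.get("labels", {}).get(label_key, "unlabeled")
        totals[value] += row["cost"]
    return dict(totals)
```

The `unlabeled` bucket makes resources that were created without labels visible in the report, so their cost is not silently dropped.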

Avoid creating large numbers of labels

Avoid creating large numbers of labels. We recommend that you create labels primarily at the project level, and that you avoid creating labels at the sub-team level. If you create overly granular labels, it can add noise to your analytics. To learn about common use cases for labels, see Common uses of labels.

Avoid adding sensitive information to labels

Labels aren't designed to handle sensitive information. Don't include sensitive information in labels, including information that might be personally identifiable, like an individual's name or title.

Anonymize information in project names

Follow a project naming pattern like COMPANY_INITIAL_IDENTIFIER-ENVIRONMENT-APP_NAME, where the placeholders are unique and don't reveal company or application names. Don't include attributes that can change in the future, for example, a team name or technology.

Apply tags to model business dimensions

You can apply tags to model additional business dimensions like organization structure, regions, workload types, or cost centers. To learn more about tags, see Tags overview, Tag inheritance, and Creating and managing tags. To learn how to use tags with policies, see Policies and tags. To learn how to use tags to manage access control, see Tags and access control.

Organizational policies

This section provides best practices for configuring governance rules on Google Cloud resources across the cloud resource hierarchy.

Establish project naming conventions

Establish a standardized project naming convention, for example, SYSTEM_NAME-ENVIRONMENT (dev, test, uat, stage, prod).

Project names have a 30-character limit.

Although you can apply a prefix like COMPANY_TAG-SUB_GROUP/SUBSIDIARY_TAG, project names can become out of date when companies go through reorganizations. Consider moving identifiable names from project names to project labels.
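A naming convention is easier to keep consistent when it is checked in code. The following Python sketch builds a SYSTEM_NAME-ENVIRONMENT name and enforces the 30-character limit. The helper `build_project_name` is hypothetical, and lowercasing the system name is an assumption of this example rather than a stated rule.

```python
ENVIRONMENTS = ("dev", "test", "uat", "stage", "prod")
MAX_PROJECT_NAME_LENGTH = 30  # the 30-character limit on project names


def build_project_name(system_name: str, environment: str) -> str:
    """Return SYSTEM_NAME-ENVIRONMENT, or raise ValueError if it breaks a rule."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    # Lowercasing is an assumption of this sketch, chosen for consistency.
    name = f"{system_name.lower()}-{environment}"
    if len(name) > MAX_PROJECT_NAME_LENGTH:
        raise ValueError(f"{name!r} is longer than {MAX_PROJECT_NAME_LENGTH} characters")
    return name
```

For example, `build_project_name("Billing", "prod")` returns `billing-prod`, while an unknown environment or an overlong name raises `ValueError` before any project is requested.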

Automate project creation

To create projects for production and large-scale businesses, use an automated process like Deployment Manager or the Google Cloud project factory Terraform module. These tools do the following:

  • Automatically create development, test, and production environments or projects that have the appropriate permissions.
  • Configure logging and monitoring.

The Google Cloud project factory Terraform module helps you to automate the creation of Google Cloud projects. In large enterprises, we recommend that you review and approve projects before you create them in Google Cloud. This process helps to ensure the following:

  • Costs can be attributed. For more information, see the Cost optimization category.
  • Approvals are in place for data uploads.
  • Regulatory or compliance requirements are met.

When you automate the creation and management of Google Cloud projects and resources, you get the benefit of consistency, reproducibility, and testability. Treating your configuration as code lets you version and manage the lifecycle of your configuration together with your software artifacts. Automation lets you support best practices like consistent naming conventions and labeling of resources. As your requirements evolve, automation simplifies project refactoring.
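The idea of treating configuration as code can be sketched without any specific tool. The following Python example is a stand-in for a real automation tool, not a replacement for one: the function `project_specs`, the spec fields, and the label keys are all hypothetical.

```python
def project_specs(system_name, environments=("dev", "test", "prod"), cost_center="unassigned"):
    """Return one declarative spec per environment with consistent names and labels."""
    return [
        {
            "name": f"{system_name}-{env}",
            "labels": {"environment": env, "cost-center": cost_center},
        }
        for env in environments
    ]
```

Because every environment is generated from the same function, naming and labeling stay consistent, and the output can be reviewed, versioned, and tested like any other code artifact.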

Audit your systems regularly

To ensure that requests for new projects can be audited and approved, integrate with your enterprise's ticketing system or a standalone system that provides auditing.

Configure projects consistently

Configure projects to consistently meet your organization's needs. Include the following when you set up projects:

  • Project ID and naming conventions
  • Billing account linking
  • Networking configuration
  • Enabled APIs and services
  • Compute Engine access configuration
  • Logs export and usage reports
  • Project removal lien

Decouple and isolate workloads or environments

Quotas and limits are enforced at the project level. To manage quotas and limits, decouple and isolate workloads or environments at the project level. For more information, see Working with quotas.

Decoupling environments is different from data classification requirements. Separating data from infrastructure can be expensive and complex to implement, so we recommend that you implement data classification based on data sensitivity and compliance requirements.

Enforce billing isolation

Enforce billing isolation to support different billing accounts and cost visibility per workload and environment. For more information, see Create, modify, or close your self-serve Cloud Billing account and Enable, disable, or change billing for a project.

To minimize administrative complexity, use granular access management controls for critical environments at the project level, or for workloads that spread across multiple projects. When you curate access control for critical production applications, you ensure that workloads are secured and managed effectively.

Use the Organization Policy Service to control resources

The Organization Policy Service gives policy administrators centralized and programmatic control over your organization's cloud resources so that they can configure constraints across the resource hierarchy. For more information, see Add an organization policy administrator.

Use the Organization Policy Service to comply with regulatory policies

To meet compliance requirements, use the Organization Policy Service to enforce compliance requirements for resource sharing and access. For example, you can limit sharing with external parties or determine where to deploy cloud resources geographically. Other compliance scenarios include the following:

  • Centralizing control to configure restrictions that define how your organization's resources can be used.
  • Defining and establishing policies to help your development teams remain within compliance boundaries.
  • Helping project owners and their teams make system changes while maintaining regulatory compliance and minimizing concerns about breaking compliance rules.

Limit resource sharing based on domain

A restricted sharing organization policy helps you to prevent Google Cloud resources from being shared with identities outside your organization. For more information, see Restricting identities by domain and Organization policy constraints.

Disable service account and key creation

To help improve security, limit the use of Identity and Access Management (IAM) service accounts and corresponding keys. For more information, see Restricting service account usage.

Restrict the physical location of new resources

Restrict the physical location of newly created resources by restricting resource locations. To see a list of constraints that give you control of your organization's resources, see Organization Policy Service constraints.
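The effect of a location restriction can be illustrated with a small local check. The following Python sketch is a hypothetical pre-deployment helper; it does not call the Organization Policy Service, and the prefix-matching rule is an assumption of this example. The service itself remains the enforcement point.

```python
def is_location_allowed(location: str, allowed_prefixes: tuple[str, ...]) -> bool:
    """Return True if `location` equals or sits under one of the allowed prefixes.

    For example, the prefix "europe-west1" allows the region "europe-west1"
    and its zones such as "europe-west1-b".
    """
    return any(
        location == prefix or location.startswith(prefix + "-")
        for prefix in allowed_prefixes
    )
```

A check like this can run in a review pipeline to catch a disallowed location early, before the request reaches the policy and is rejected.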

What's next

Learn how to choose and manage compute.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Choose and manage compute

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on compute requirements. You learn how to choose a compute platform and a migration approach, design and scale workloads, and manage operations and VM migrations.

Computation is at the core of many workloads, whether it refers to the execution of custom business logic or the application of complex computational algorithms against datasets. Most solutions use compute resources in some form, and it's critical that you select the right compute resources for your application needs.

Google Cloud provides several options for using time on a CPU. Options are based on CPU types, performance, and how your code is scheduled to run, including usage billing.

Google Cloud compute options include the following:

  • Virtual machines (VMs) with cloud-specific benefits like live migration.
  • Bin-packing of containers on cluster machines that can share CPUs.
  • Functions and serverless approaches, where your use of CPU time can be metered to the work performed during a single HTTP request.

Choosing compute

This section provides best practices for choosing and migrating to a compute platform.

Choose a compute platform

When you choose a compute platform for your workload, consider the technical requirements of the workload, lifecycle automation processes, regionalization, and security.

Evaluate the nature of CPU usage by your app and the entire supporting system, including how your code is packaged and deployed, distributed, and invoked. While some scenarios might be compatible with multiple platform options, a portable workload should be capable and performant on a range of compute options.

The following table describes Google Cloud compute platform options:

Compute platform: Serverless
Use cases:
  • Your first app.
  • Focus on data and processing logic and on app development, rather than maintaining infrastructure operations.
Recommended products:
  • Cloud Run: Put your business logic in containers by using this fully managed serverless option. Cloud Run is designed for workloads that are compute intensive, but not always on. Scale cost effectively from 0 (no traffic) and define the CPU and RAM of your tasks and services. Deploy with a single command and Google automatically provisions the right amount of resources.
  • Cloud Functions: Separate your code into flexible pieces of business logic without the infrastructure concerns of load balancing, updates, authentication, or scaling.
  • App Engine: Develop modern web applications and scalable mobile backends. App Engine lets you build and host web applications on Google's infrastructure.

Compute platform: Kubernetes
Use cases:
  • Complex microservice architectures that need additional services like Istio to manage service mesh control.
Recommended products:
  • Google Kubernetes Engine: An open source container-orchestration engine that automates deploying, scaling, and managing containerized apps.

Compute platform: Compute Engine
Use cases:
  • You want to create and run VMs from predefined and customizable VM families that support your application and workload requirements, as well as third-party software and services.
Recommended products:
  • Compute Engine: Add graphics processing units (GPUs) to your VM instances. You can use these GPUs to accelerate specific workloads on your instances like machine learning and data processing.

To select appropriate machine types based on your requirements, see Recommendations for machine families.

For more information, see Choosing compute options.

Choose a compute migration approach

If you're migrating your existing applications from another cloud or from on-premises, use one of the following Google Cloud products to help you optimize for performance, scale, cost, and security.

Migration goal: Lift and shift
Use case: Migrate or extend your VMware workloads to Google Cloud in minutes.
Recommended product: Google Cloud VMware Engine

Migration goal: Lift and shift
Use case: Move your VM-based applications to Compute Engine.
Recommended product: Migrate to Virtual Machines

Migration goal: Upgrade to containers
Use case: Modernize traditional applications into built-in containers on Google Kubernetes Engine.
Recommended product: Migrate to Containers

To learn how to migrate your workloads while aligning internal teams, see VM Migration lifecycle and Building a Large Scale Migration Program with Google Cloud.

Designing workloads

This section provides best practices for designing workloads to support your system.

Evaluate serverless options for simple logic

Simple logic is a type of compute that doesn't require specialized hardware or machine types like CPU-optimized machines. To abstract operational overhead and optimize for cost and performance, evaluate serverless options for lightweight logic before you invest in Google Kubernetes Engine (GKE) or Compute Engine implementations.

Decouple your applications to be stateless

Where possible, decouple your applications to be stateless to maximize use of serverless computing options. This approach lets you use managed compute offerings, scale applications based on demand, and optimize for cost and performance. For more information about decoupling your application to design for scale and high availability, see Design for scale and high availability.

Use caching logic when you decouple architectures

If your application is designed to be stateful, use caching logic to decouple and make your workload scalable. For more information, see Database best practices.
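One common form of caching logic is the cache-aside pattern, sketched below in Python. The function `get_profile`, the `cache` mapping, and the `load_from_database` callable are hypothetical names for this illustration; a dictionary stands in for a shared cache service.

```python
def get_profile(user_id, cache, load_from_database):
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    if user_id in cache:
        return cache[user_id]
    profile = load_from_database(user_id)
    cache[user_id] = profile  # populate the cache for later requests
    return profile
```

Only the first request for a given user reaches the database; later requests are served from the cache, which lets any instance answer without holding the data locally. This sketch omits expiry and invalidation, which a production cache would need.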

Use live migrations to facilitate upgrades

To facilitate Google maintenance upgrades, use live migration by setting instance availability policies. For more information, see Set VM host maintenance policy.

Scaling workloads

This section provides best practices for scaling workloads to support your system.

Use startup and shutdown scripts

For stateful applications, use startup and shutdown scripts where possible to start and stop your application state gracefully. A graceful startup is when a computer is turned on by a software function and the operating system is allowed to perform its tasks of safely starting processes and opening connections.

Graceful startups and shutdowns are important because stateful applications depend on immediate availability of the data that sits close to the compute, usually on local or persistent disks, or in RAM. To avoid reprocessing application data from the beginning at each startup, use a startup script to reload the last saved data and resume the process from where it stopped at shutdown. To avoid losing progress on shutdown, use a shutdown script to save the application memory state. For example, use a shutdown script when a VM is scheduled to be shut down due to downscaling or Google maintenance events.
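The save-on-shutdown and reload-on-startup cycle can be sketched as a pair of checkpoint helpers. This Python example is an illustration only: the function names, the JSON format, and the checkpoint path are assumptions, and wiring the functions into actual startup and shutdown scripts is left out.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Shutdown side: write state to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    # Renaming within one filesystem avoids leaving a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    """Startup side: resume from the last saved state, or start fresh."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A shutdown script would call `save_checkpoint` with the current progress, and a startup script would call `load_checkpoint` so that processing continues from the saved position instead of from the beginning.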

Use MIGs to support VM management

When you use Compute Engine VMs, managed instance groups (MIGs) support features like autohealing, load balancing, autoscaling, auto updating, and stateful workloads. You can create zonal or regional MIGs based on your availability goals. You can use MIGs for stateless serving or batch workloads and for stateful applications that need to preserve each VM's unique state.

Use pod autoscalers to scale your GKE workloads

Use horizontal and vertical Pod autoscalers to scale your workloads, and use node auto-provisioning to scale underlying compute resources.
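To show how horizontal scaling reacts to load, the following Python sketch implements the core ratio rule that the Kubernetes documentation describes for the Horizontal Pod Autoscaler: the replica count is scaled by the ratio of the observed metric to the target metric and rounded up. The function name is hypothetical, and tolerances, stabilization windows, and minimum or maximum replica bounds are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Core horizontal scaling rule: scale replicas by the metric ratio, rounding up."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * (current_metric / target_metric))
```

For example, 4 replicas running at 90% average CPU against a 60% target yield 6 replicas, and 3 replicas at 50% against a 100% target yield 2.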

Distribute application traffic

To scale your applications globally, use Cloud Load Balancing to distribute traffic across application instances in more than one region or zone. Load balancers optimize packet routing from Google Cloud edge networks to the nearest zone, which increases serving traffic efficiency and minimizes serving costs. To optimize for end-user latency, use Cloud CDN to cache static content where possible.

Automate compute creation and management

Minimize human-induced errors in your production environment by automating compute creation and management.

Managing operations

This section provides best practices for managing operations to support your system.

Use Google-supplied public images

Use public images supplied by Google Cloud. The Google Cloud public images are regularly updated. For more information, see List of public images available on Compute Engine.

You can also create your own images with specific configurations and settings. Where possible, automate and centralize image creation in a separate project that you can share with authorized users within your organization. Creating and curating a custom image in a separate project lets you update, patch, and create a VM using your own configurations. You can then share the curated VM image with relevant projects.

Use snapshots for instance backups

Snapshots let you create backups for your instances. Snapshots are especially useful for stateful applications, which aren't flexible enough to maintain state or save progress when they experience abrupt shutdowns. If you frequently use snapshots to create new instances, you can optimize your backup process by creating a base image from that snapshot.

Use a machine image to enable VM instance creation

Although a snapshot only captures an image of the data inside a machine, a machine image captures machine configurations and settings, in addition to the data. Use a machine image to store all of the configurations, metadata, permissions, and data from one or more disks that are needed to create a VM instance.

When you create a machine from a snapshot, you must configure instance settings on the new VM instances, which requires a lot of work. Using machine images lets you copy those known settings to new machines, reducing overhead. For more information, see When to use a machine image.

Capacity, reservations, and isolation

This section provides best practices for managing capacity, reservations, and isolation to support your system.

Use committed-use discounts to reduce costs

You can reduce your operational expenditure (OPEX) for workloads that are always on by using committed use discounts. For more information, see the Cost optimization category.
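Whether a commitment pays off depends on how many hours the workload actually runs. The following Python sketch compares the two billing models under simplifying assumptions: the hourly rate and discount are hypothetical inputs, a month is taken as 730 hours, and a commitment is modeled as paying the discounted rate for every hour whether or not the VM runs. Actual pricing terms may differ.

```python
def monthly_costs(hourly_rate, hours_used, discount, hours_in_month=730):
    """Return (on_demand_cost, committed_cost) for one month.

    On demand bills only the hours used; a commitment bills every hour
    in the month at the discounted rate, whether or not the VM runs.
    """
    on_demand = hourly_rate * hours_used
    committed = hourly_rate * (1 - discount) * hours_in_month
    return on_demand, committed


def commitment_pays_off(hours_used, discount, hours_in_month=730):
    """True when utilization is high enough for the commitment to be cheaper."""
    return hours_used / hours_in_month > 1 - discount
```

Under this model, the break-even utilization is one minus the discount: with a hypothetical 30% discount, the commitment is cheaper only when the workload runs more than 70% of the month, which is why the recommendation targets always-on workloads.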

Choose machine types to support cost and performance

Google Cloud offers machine types that let you choose compute based on cost and performance parameters. You can choose a low-performance offering to optimize for cost or choose a high-performance compute option at higher cost. For more information, see the Cost optimization category.

Use sole-tenant nodes to support compliance needs

Sole-tenant nodes are physical Compute Engine servers that are dedicated to hosting only your project's VMs. Sole-tenant nodes can help you to meet compliance requirements for physical isolation, including the following:

  • Keep your VMs physically separated from VMs in other projects.
  • Group your VMs together on the same host hardware.
  • Isolate payments processing workloads.

For more information, see Sole-tenant nodes.

Use reservations to ensure resource availability

Google Cloud lets you define reservations for your workloads to ensure those resources are always available. There is no additional charge to create reservations, but you pay for the reserved resources even if you don't use them. For more information, see Consuming and managing reservations.

VM migration

This section provides best practices for migrating VMs to support your system.

Evaluate built-in migration tools

Evaluate built-in migration tools to move your workloads from another cloud or from on-premises. For more information, see Migration to Google Cloud. Google Cloud offers tools and services to help you migrate your workloads and optimize for cost and performance. To receive a free migration cost assessment based on your current IT landscape, see Google Cloud Rapid Assessment & Migration Program.

Use virtual disk import for customized operating systems

To import customized supported operating systems, see Importing virtual disks. Sole-tenant nodes can help you meet your hardware bring-your-own-license requirements for per-core or per-processor licenses. For more information, see Bringing your own licenses.

Recommendations

To apply the guidance in the Architecture Framework to your own environment, we recommend that you do the following:

  • Review Google Cloud Marketplace offerings to evaluate whether your application is listed under a supported vendor. Google Cloud supports running various open source systems and various third-party software.

  • Consider Migrate to Containers and GKE to extract and package your VM-based application as a containerized application running on GKE.

  • Use Compute Engine to run your applications on Google Cloud. If you have legacy dependencies running in a VM-based application, verify whether they meet your vendor requirements.

  • Evaluate using Google Cloud Internal TCP/UDP Load Balancing to scale your decoupled architecture. For more information, see Internal TCP/UDP Load Balancing overview.

  • Evaluate your options for moving away from traditional on-premises patterns, such as HAProxy usage. For more information, see the best practices for floating IP addresses.

  • Use VM Manager to manage operating systems for your large VM fleets running Windows or Linux on Compute Engine, and apply consistent configuration policies.

  • Consider using GKE Autopilot and let Google SRE fully manage your clusters.

  • Use Anthos Config Management for policy and configuration management across your GKE clusters.

  • Ensure availability and scalability of machines in specific regions and zones. Google Cloud can scale to support your compute needs. However, if you need a lot of specific machine types in a specific region or zone, work with your account teams to ensure availability. For more information, see Reservations for Compute Engine.

What's next

Learn networking design principles.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Design your network infrastructure

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on networking design. You learn how to choose and implement Virtual Private Cloud (VPC), and how to test and manage network security.

Core principles

Networking design is critical to successful system design because it helps you optimize for performance and secure application communications with internal and external services. When you choose networking services, it's important to evaluate your application needs and how the applications will communicate with each other. For example, while some components require global services, other components might need to be geo-located in a specific region.

Google's private network connects regional locations to more than 100 global network points of presence. Google Cloud uses software-defined networking and distributed systems technologies to host and deliver your services around the world. Google's core element for networking within Google Cloud is the global VPC. VPC uses Google's global high-speed network to link your applications across regions while supporting privacy and reliability. Google ensures that your content is delivered with high throughput by using technologies like Bottleneck Bandwidth and Round-trip propagation time (BBR) congestion-control intelligence.

Developing your cloud networking design includes the following steps:

  1. Design the workload VPC architecture. Start by identifying how many Google Cloud projects and VPC networks you require.
  2. Add inter-VPC connectivity. Design how your workloads connect to other workloads in different VPC networks.
  3. Design hybrid network connectivity. Design how your workload VPCs connect to on-premises and other cloud environments.

When you design your Google Cloud network, consider the following:

To see a complete list of VPC specifications, see Specifications.

Workload VPC architecture

This section provides best practices for designing workload VPC architectures to support your system.

Consider VPC network design early

Make VPC network design an early part of designing your organizational setup in Google Cloud. Organizational-level design choices can't be easily reversed later in the process. For more information, see Best practices and reference architectures for VPC design and Decide the network design for your Google Cloud landing zone.

Start with a single VPC network

For many use cases that include resources with common requirements, a single VPC network provides the features that you need. Single VPC networks are simple to create, maintain, and understand. For more information, see VPC Network Specifications.

Keep VPC network topology simple

To ensure a manageable, reliable, and well-understood architecture, keep the design of your VPC network topology as simple as possible.

Use VPC networks in custom mode

To ensure that Google Cloud networking integrates seamlessly with your existing networking systems, we recommend that you use custom mode when you create VPC networks. Using custom mode helps you integrate Google Cloud networking into existing IP address management schemes and it lets you control which cloud regions are included in the VPC. For more information, see VPC.
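Because custom mode leaves subnet ranges under your control, it helps to check planned ranges against the ranges that are already in use before you create anything. The following is a minimal sketch that uses Python's standard ipaddress module. The on-premises ranges, region names, and planned subnets are hypothetical values chosen for illustration.

```python
import ipaddress


def find_overlaps(planned, existing):
    """Return (planned_cidr, existing_cidr) pairs whose ranges overlap."""
    overlaps = []
    for planned_cidr in planned:
        planned_net = ipaddress.ip_network(planned_cidr)
        for existing_cidr in existing:
            if planned_net.overlaps(ipaddress.ip_network(existing_cidr)):
                overlaps.append((planned_cidr, existing_cidr))
    return overlaps


# Hypothetical ranges already allocated on-premises.
on_prem_ranges = ["10.0.0.0/16", "192.168.10.0/24"]

# Hypothetical subnets planned for a custom mode VPC network, keyed by region.
planned_subnets = {
    "us-central1": "10.1.0.0/20",
    "europe-west1": "10.0.128.0/20",  # collides with 10.0.0.0/16
}

for conflict in find_overlaps(planned_subnets.values(), on_prem_ranges):
    print("overlap:", conflict)
```

A check like this can run in a planning script or a review step, so that conflicts surface before hybrid connectivity is configured.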

Inter-VPC connectivity

This section provides best practices for designing inter-VPC connectivity to support your system.

Choose a VPC connection method

If you decide to implement multiple VPC networks, you need to connect those networks. VPC networks are isolated tenant spaces within Google's Andromeda software-defined network (SDN). There are several ways that VPC networks can communicate with each other. Choose how you connect your network based on your bandwidth, latency, and service level agreement (SLA) requirements. To learn more about the connection options, see Choose the VPC connection method that meets your cost, performance, and security needs.

Use Shared VPC to administer multiple working groups

For organizations with multiple teams, Shared VPC provides an effective tool to extend the architectural simplicity of a single VPC network across multiple working groups.

Use simple naming conventions

Choose simple, intuitive, and consistent naming conventions. Doing so helps administrators and users to understand the purpose of each resource, where it's located, and how it's differentiated from other resources.

Use connectivity tests to verify network security

In the context of network security, you can use connectivity tests to verify that traffic you intend to prevent between two endpoints is blocked. To verify that traffic is blocked and why it's blocked, define a test between two endpoints and evaluate the results. For example, you might test a VPC feature that lets you define rules that support blocking traffic. For more information, see Connectivity Tests overview.

Use Private Service Connect to create private endpoints

To create private endpoints that let you access Google services with your own IP address scheme, use Private Service Connect. You can access the private endpoints from within your VPC and through hybrid connectivity that terminates in your VPC.

Secure and limit external connectivity

Limit internet access only to those resources that need it. Resources with only a private, internal IP address can still access many Google APIs and services through Private Google Access.

Use Network Telemetry to enhance visibility into your cloud network

Identify traffic and access patterns that can impose security or operational risks to your organization in near real time. Network Telemetry provides both network and security operations with in-depth, responsive logs for Google Cloud networking services.

What's next

Learn best practices for storage management.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.

Select and implement a storage strategy

This document in the Google Cloud Architecture Framework provides best practices to deploy your system based on storage. You learn how to select a storage strategy and how to manage storage, access patterns, and workloads.

To facilitate data exchange and securely back up and store data, organizations need to choose a storage plan based on workload, input/output operations per second (IOPS), latency, retrieval frequency, location, capacity, and format (block, file, and object).

Cloud Storage provides reliable, secure object storage services, including the following:

In Google Cloud, IOPS scales according to your provisioned storage space. Storage types like Persistent Disk require manual replication and backup because they are zonal or regional. By contrast, object storage is highly available and it automatically replicates data across a single region or across multiple regions.

Storage type

This section provides best practices for choosing a storage type to support your system.

Evaluate options for high-performance storage needs

Evaluate persistent disks or local solid-state drives (SSD) for compute applications that require high-performance storage. Cloud Storage is an immutable object store with versioning. Using Cloud Storage with Cloud CDN helps optimize for cost, especially for frequently accessed static objects.

Filestore supports multi-write applications that need high-performance shared space. Filestore also supports legacy and modern applications that require POSIX-like file operations through Network File System (NFS) mounts.

Cloud Storage supports use cases such as creating data lakes and addressing archival requirements. When you choose a Cloud Storage class, weigh the tradeoffs between storage costs and access and retrieval costs, especially when you configure retention policies. For more information, see Design an optimal storage strategy for your cloud workload.

All storage options are by default encrypted at rest and in transit using Google-managed keys. For storage types such as Persistent Disk and Cloud Storage, you can either supply your own keys or manage them through Cloud Key Management Service (Cloud KMS). Establish a strategy for handling such keys before you employ them on production data.

Choose Google Cloud services to support storage design

To learn about the Google Cloud services that support storage design, use the following table:

Google Cloud service | Description
Cloud Storage | Provides global storage and retrieval of any amount of data at any time. You can use Cloud Storage for multiple scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users through direct download.
Persistent Disk | High-performance block storage for Google Cloud. Persistent Disk provides SSD and hard disk drive (HDD) storage that you can attach to instances running in Compute Engine or Google Kubernetes Engine (GKE).
  • Regional disks provide durable storage and replication of data between two zones in the same region. If you need higher IOPS and low latency, Google Cloud offers Filestore.
  • Local SSDs are physically attached to the server that hosts your virtual machine instance. You can use local SSDs as temporary disk space.
Filestore | A managed file storage service for applications that require a file system interface and a shared file system for data. Filestore gives users a seamless experience for standing up managed Network Attached Storage (NAS) with their Compute Engine and GKE instances.
Cloud Storage for Firebase | Built for app developers who need to store and serve user-generated content, such as photos or videos. All your files are stored in Cloud Storage buckets, so they are accessible from both Firebase and Google Cloud.

Choose a storage strategy

To select a storage strategy that meets your application requirements, use the following table:

Use case | Recommendations
You want to store data at scale at the lowest cost, and access performance is not an issue. | Cloud Storage
You are running compute applications that need immediate storage. For more information, see Optimizing Persistent Disk and Local SSD performance. | Persistent Disk or Local SSD
You are running high-performance workloads that need read and write access to shared space. | Filestore
You have high-performance computing (HPC) or high-throughput computing (HTC) use cases. | Using clusters for large-scale technical computing in the cloud

Choose active or archival storage based on storage access needs

A storage class is a piece of metadata that is used by every object. For data that is served at a high rate with high availability, use the Standard Storage class. For data that is infrequently accessed and can tolerate slightly lower availability, use the Nearline Storage, Coldline Storage, or Archive Storage class. For more information about cost considerations for choosing a storage class, see Cloud Storage pricing.
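The choice between active and archival classes can be expressed as a small decision rule. The following is a heuristic sketch, not an official selection algorithm. It assumes minimum storage durations of 30, 90, and 365 days for Nearline, Coldline, and Archive Storage (verify these against the current Cloud Storage pricing documentation), and it picks the coldest class whose minimum duration fits both the expected interval between reads and the planned retention period.

```python
# (storage class, assumed minimum storage duration in days), coldest first.
_COLD_CLASSES = [
    ("Archive Storage", 365),
    ("Coldline Storage", 90),
    ("Nearline Storage", 30),
]


def suggest_storage_class(days_between_reads, retention_days):
    """Suggest a storage class from access frequency and retention.

    A colder class is suggested only when the data is read less often than
    the class's minimum storage duration AND is kept at least that long.
    Otherwise early-deletion and retrieval charges could outweigh the savings.
    """
    for name, min_days in _COLD_CLASSES:
        if days_between_reads >= min_days and retention_days >= min_days:
            return name
    return "Standard Storage"


print(suggest_storage_class(days_between_reads=1, retention_days=10))
print(suggest_storage_class(days_between_reads=45, retention_days=400))
print(suggest_storage_class(days_between_reads=400, retention_days=3650))
```

The rule ignores object size, request volume, and availability targets, so treat its output as a starting point for a cost comparison.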

Evaluate storage location and data protection needs for Cloud Storage

For a Cloud Storage bucket located in a region, data contained within it is automatically replicated across zones within the region. Data replication across zones protects the data if there is a zonal failure within a region.

Cloud Storage also offers locations that are geo-redundant, which means data is replicated across multiple, geographically separate data centers. For more information, see Bucket locations.

Use Cloud CDN to improve static object delivery

To optimize the cost to retrieve objects and minimize access latency, use Cloud CDN. Cloud CDN uses the Cloud Load Balancing external HTTP(S) load balancer to provide routing, health checking, and anycast IP address support. For more information, see Setting up Cloud CDN with cloud buckets.

Storage access pattern and workload type

This section provides best practices for choosing storage access patterns and workload types to support your system.

Use Persistent Disk to support high-performance storage access

Data access patterns depend on how you design system performance. Cloud Storage provides scalable storage, but it isn't an ideal choice when you run heavy compute workloads that need high throughput access to large amounts of data. For high-performance storage access, use Persistent Disk.

Use exponential backoff when implementing retry logic

Use exponential backoff when implementing retry logic to handle 5XX, 408, and 429 errors. Each Cloud Storage bucket is provisioned with initial I/O capacity, so plan a gradual ramp-up for retry requests. For more information, see Request rate and access distribution guidelines.
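The retry guidance above can be sketched as truncated exponential backoff with random jitter. This is an illustrative sketch rather than the behavior of any Cloud Storage client library: the request callable, the default attempt count, and the base and cap values are all assumptions.

```python
import random
import time


def is_retryable(status):
    """Return True for the status codes named above: 408, 429, and 5XX."""
    return status in (408, 429) or 500 <= status <= 599


def call_with_backoff(request, max_attempts=5, base=1.0, cap=32.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() until it returns a non-retryable status.

    request is a hypothetical zero-argument callable that returns a
    (status_code, body) tuple. Before retry n (0-based), the function
    waits a random time in [0, min(cap, base * 2**n)) seconds.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = request()
        if not is_retryable(status):
            return status, body
        if attempt < max_attempts - 1:
            # The upper bound doubles on every retry, up to the cap.
            sleep(rng() * min(cap, base * 2 ** attempt))
    return status, body  # retries exhausted; hand back the last response
```

The jitter spreads retries from many clients over time, which supports the gradual ramp-up that the bucket's initial I/O capacity calls for. The sleep and rng parameters are injectable so that the logic can be tested without waiting.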

Storage management

This section provides best practices for storage management to support your system.

Assign unique names to every bucket

Make every bucket name unique across the Cloud Storage namespace. Don't include sensitive information in a bucket name. Choose bucket and object names that are difficult to guess. For more information, see Bucket naming guidelines and Object naming guidelines.
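One way to get names that are both unique and difficult to guess is to append a random suffix to a non-sensitive prefix. The following sketch assumes a hypothetical prefix and checks only a subset of the naming rules (length, allowed characters, and the reserved "goog" prefix and "google" substring). It deliberately rejects dots, which carry additional requirements. Verify the full rules in Bucket naming guidelines.

```python
import re
import secrets

# 3 to 63 characters: lowercase letters, digits, dashes, and underscores,
# starting and ending with a letter or digit. Dots are excluded on purpose.
_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def make_bucket_name(prefix):
    """Return prefix plus a 16-character random hexadecimal suffix."""
    return f"{prefix}-{secrets.token_hex(8)}"


def looks_valid(name):
    """Apply the subset of naming checks described above."""
    if not _NAME_RE.match(name):
        return False
    return not name.startswith("goog") and "google" not in name


name = make_bucket_name("logs")  # "logs" is a hypothetical prefix
print(name, looks_valid(name))
```

The suffix comes from the secrets module rather than random, because the goal is a name that can't be predicted.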

Keep Cloud Storage buckets private

Unless there is a business-related reason, ensure that your Cloud Storage bucket isn't anonymously or publicly accessible. For more information, see Overview of access control.
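A bucket becomes publicly or anonymously accessible when its IAM policy grants a role to the special principals allUsers or allAuthenticatedUsers. The following sketch scans a policy that has already been loaded as a dictionary in the standard IAM JSON shape (a "bindings" list of role and members entries). How the policy is fetched is out of scope, and the sample policy is hypothetical.

```python
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def public_bindings(policy):
    """Return (role, member) pairs that expose the bucket publicly."""
    found = []
    for binding in policy.get("bindings", []):
        for member in binding.get("members", []):
            if member in PUBLIC_MEMBERS:
                found.append((binding.get("role"), member))
    return found


# Hypothetical policy, for illustration only.
sample_policy = {
    "bindings": [
        {"role": "roles/storage.objectViewer", "members": ["allUsers"]},
        {"role": "roles/storage.admin", "members": ["user:admin@example.com"]},
    ]
}

print(public_bindings(sample_policy))
```

An empty result means that no binding in the policy grants access to these two principals. It doesn't cover legacy object ACLs, which need a separate check.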

Assign random object names to distribute load evenly

Assign random object names to facilitate performance and avoid hotspotting. Use a randomized prefix for objects where possible. For more information, see Use a naming convention that distributes load evenly across key ranges.
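A common way to randomize a prefix while keeping names reproducible is to derive the prefix from a hash of the original object name. The following is a minimal sketch. The choice of SHA-256, the six-character prefix length, and the sample names are assumptions made for illustration.

```python
import hashlib


def prefixed_object_name(name, prefix_len=6):
    """Prepend the first prefix_len hex characters of the name's SHA-256 hash.

    The prefix is deterministic, so the same input always maps to the same
    object name, but sequential inputs (such as timestamps) are scattered
    across the key range instead of clustering together.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{name}"


# Hypothetical sequential names that would otherwise share a common prefix.
for original in ["2016-05-10-12-00-00/file1", "2016-05-10-12-00-01/file1"]:
    print(prefixed_object_name(original))
```

Because the prefix is computed from the name, a reader that knows the original name can recompute the full object name without a lookup table.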

Use public access prevention

To prevent public access at the organization, folder, project, or bucket level, use public access prevention. For more information, see Using public access prevention.

What's next

Learn about Google Cloud database services and best practices.

Explore other categories in the Architecture Framework such as reliability, operational excellence, and security, privacy, and compliance.