Architecting disaster recovery for locality-restricted workloads

This document discusses how you can use Google Cloud to architect for disaster recovery (DR) to meet location-specific requirements. For some regulated industries, workloads must adhere to these requirements. In this scenario, one or more of the following requirements apply:

  • Data at rest must be restricted to a specified country.
  • Data must be processed in the country where it resides.
  • Workloads are accessible only from predefined locations.
  • Data must be encrypted by using keys that the customer manages.
  • If you are using cloud services, each cloud service must provide a minimum of two locations that are redundant to each other. For an example of location redundancy requirements, see the Cloud computing compliance criteria catalogue (C5).

This article is the first part of a series that discusses disaster recovery (DR) in Google Cloud. This part provides an overview of the DR planning process: what you need to know in order to design and implement a DR plan.

Terminology

Before you begin architecting for DR for locality-restricted workloads, it's a good idea to review locality terminology used in Google Cloud.

Google Cloud provides services in regions throughout the Americas, Europe and the Middle East, and Asia Pacific. For example, London (europe-west2) is a region in Europe, and Oregon (us-west1) is a region in North America. Some Google Cloud products group multiple regions into a specific multi-region location that you can use in the same way that you would use a region.

Regions are further divided into zones where you deploy certain Google Cloud resources like virtual machines, Kubernetes clusters, or Cloud SQL databases. Resources on Google Cloud are multi-regional, regional, or zonal. Some resources and products that are by default designated as multi-regional can also be restricted to a region. The different types of resources are explained as follows:

  • Multi-regional resources are designed by Google Cloud to be redundant and distributed in and across regions. Multi-regional resources are resilient to the failure of a single region.
  • Regional resources are redundantly deployed across multiple zones in a region, and are resilient to the failure of a zone within the region.
  • Zonal resources operate in a single zone. If a zone becomes unavailable, all zonal resources in that zone are unavailable until service is restored. Consider a zone as a single-failure domain. You need to architect your applications to mitigate the effects of a single zone becoming unavailable.

The following table shows the current relationship between regions, zones, and locations for Europe.

Region         Zones    Location
europe-north1  a, b, c  Hamina, Finland
europe-west1   b, c, d  St. Ghislain, Belgium
europe-west2   a, b, c  London, England, UK
europe-west3   a, b, c  Frankfurt, Germany
europe-west4   a, b, c  Eemshaven, Netherlands
europe-west6   a, b, c  Zürich, Switzerland

For more information, see Geography and regions.
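
As a minimal sketch, the table above can be restated as data so that tooling can reason about regions and zones. The dictionary name and the helper functions are hypothetical and exist only for illustration. The sketch assumes that zone names follow the "<region>-<zone letter>" pattern seen in names such as europe-west1-b.

```python
# Illustrative only: the Europe table restated as data.
EUROPE_REGIONS = {
    "europe-north1": {"zones": ("a", "b", "c"), "location": "Hamina, Finland"},
    "europe-west1": {"zones": ("b", "c", "d"), "location": "St. Ghislain, Belgium"},
    "europe-west2": {"zones": ("a", "b", "c"), "location": "London, England, UK"},
    "europe-west3": {"zones": ("a", "b", "c"), "location": "Frankfurt, Germany"},
    "europe-west4": {"zones": ("a", "b", "c"), "location": "Eemshaven, Netherlands"},
    "europe-west6": {"zones": ("a", "b", "c"), "location": "Zürich, Switzerland"},
}


def region_of(zone: str) -> str:
    """Return the region part of a zone name such as 'europe-west1-b'."""
    region, sep, letter = zone.rpartition("-")
    # A zone name ends in a single-letter suffix; anything else is rejected.
    if not sep or not region or len(letter) != 1:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def zones_in(region: str) -> list[str]:
    """List the full zone names for a region that appears in the table."""
    return [f"{region}-{letter}" for letter in EUROPE_REGIONS[region]["zones"]]
```

For example, region_of("europe-west2-a") returns "europe-west2", which you can then look up in the table to confirm that the zone is in London.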

Planning for DR for locality-restricted workloads

The approach you take to designing your application depends on the type of workload and the locality requirements you must meet. Also consider why you must meet those requirements because what you decide directly influences your DR architecture.

Start by reading the Google Cloud Disaster recovery planning guide. As you consider locality-restricted workloads, focus on the requirements discussed in this planning section.

Define your locality requirements

Before you start your design, define your locality requirements by answering these questions:

  • Where is the data at rest? The answer dictates what services you can use and the high availability (HA) and DR methods you can employ to achieve your RTO/RPO values. Use the Cloud locations page to determine what products are in scope.
  • Can you use encryption techniques to mitigate the requirement? If you are able to mitigate locality requirements by employing encryption techniques using Cloud External Key Manager and Cloud Key Management Service, you can use multi-regional and dual-regional services and follow the standard HA/DR techniques outlined in Disaster recovery scenarios for data.
  • Can data be processed outside of where it rests? You can use products such as Anthos to provide a hybrid environment to address your requirements or implement product-specific controls such as load-balancing Compute Engine instances across multiple zones in a region. Use the Organization policy Resource Location constraint to restrict where resources can be deployed.

    If data can be processed outside of where it needs to be at rest, you can design the "processing" parts of your application by following the guidance in Disaster recovery building blocks and Disaster recovery scenarios for applications.

    Configure a VPC Service Controls perimeter to control who can access the data and to restrict what resources can process the data.

  • Can you use more than one region? If you can use more than one region, you can use many of the techniques outlined in the Disaster Recovery series. Check the multi-region and region constraints for Google Cloud products.

  • Do you need to restrict who can access your application? Google Cloud has several products and features that help you restrict who can access your applications:

    • Identity-Aware Proxy (IAP). Verifies a user's identity and then determines whether that user should be permitted to access an application. Organization policy uses the domain-restricted sharing constraint to define the allowed Cloud Identity or Google Workspace IDs that are permitted in IAM policies.
    • Product-specific locality controls. Refer to each product you want to use in your architecture for appropriate locality constraints. For example, use Anthos Config Management to apply locality-specific policies to your Anthos-managed GKE clusters. If you're using Cloud Storage, create buckets in specified regions.

Identify the services that you can use

Identify what services can be used based on your locality and regional granularity requirements. Designing applications that are subject to locality restrictions requires understanding what products can be restricted to what region and what controls can be applied to enforce location restriction requirements.

Identify the regional granularity for your application and data

Identify the regional granularity for your application and data by answering these questions:

  • Can you use multi-regional services in your design? By using multi-regional services, you can create highly available resilient architectures.
  • Does access to your application have location restrictions? Use Google Cloud products such as Identity-Aware Proxy (IAP) and VPC Service Controls to help enforce where your applications can be accessed from.
  • Is your data at rest restricted to a specific region? If you use managed services, ensure that the services you are using can be configured so that your data stored in the service is restricted to a specific region. For example, use BigQuery locality restrictions to dictate where your datasets are stored and backed up to.
  • What regions do you need to restrict your application to? Some Google Cloud products do not have regional restrictions. Use the Cloud locations page and the product-specific pages to validate what regions you can use the product in and what mitigating features if any are available to restrict your application to a specific region.

Meeting locality restrictions using Google Cloud products

This section details features and mitigating techniques for using Google Cloud products as part of your DR strategy for locality-restricted workloads. We recommend reading this section along with Disaster recovery building blocks.

Organization policies

The Organization Policy Service gives you centralized control over your Google Cloud resources. Using organization policies, you can configure restrictions across your entire resource hierarchy. Consider the following policy constraints when architecting for locality-restricted workloads:

  • Domain-restricted sharing: By default, all user identities are allowed to be added to IAM policies. The allowed/denied list must specify one or more Cloud Identity or Google Workspace customer identities. If this constraint is active, only identities in the allowed list are eligible to be added to IAM policies.

  • Location-restricted resources: This constraint refers to the set of locations where location-based Google Cloud resources can be created. Policies for this constraint can specify as allowed or denied locations any of the following: multi-regions such as Asia and Europe, regions such as us-east1 or europe-west1, or individual zones such as europe-west1-b. For a list of supported services, see Resource locations supported services.
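
The allowed and denied lists described above can be reasoned about with a small model. The following is a simplified sketch, not the evaluation logic of the Organization Policy Service. The function name is hypothetical, and the sketch makes two assumptions of its own: that a denied entry takes precedence over an allowed entry, and that allowing or denying a region also covers the zones in that region. Check the service documentation for how the real constraint treats regions, zones, and location groups.

```python
def is_location_allowed(location, allowed, denied=frozenset()):
    """Decide whether a resource location passes a simplified policy.

    `location` is a multi-region, region, or zone name. `allowed` and
    `denied` are collections of such names.
    """
    candidates = {location}

    # Assumption of this sketch: a zone such as europe-west1-b is also
    # covered by a policy entry for its region, europe-west1.
    region, sep, suffix = location.rpartition("-")
    if sep and len(suffix) == 1:
        candidates.add(region)

    # Assumption of this sketch: a denied entry always wins.
    if candidates & set(denied):
        return False
    return bool(candidates & set(allowed))
```

With allowed set to {"europe-west1", "europe-west3"}, this model accepts europe-west1-b and rejects us-east1. A pre-deployment check of this kind can catch configuration drift early, but it does not replace enforcing the constraint in the organization policy itself.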

Encryption

If your data locality requirements concern restricting who can access the data, then implementing encryption methods might be an applicable strategy. By using external key management systems to manage keys that you supply outside of Google Cloud, you might be able to deploy a multi-region architecture to meet your locality requirements. Without the keys available, the data cannot be decrypted.

Google Cloud has two products that let you use keys that you manage:

  • Cloud External Key Manager (Cloud EKM): Cloud EKM lets you encrypt data in BigQuery and Compute Engine with encryption keys that are stored and managed in a third-party key management system that's deployed outside Google's infrastructure.
  • Customer-supplied encryption keys (CSEK): You can use CSEK with Cloud Storage and Compute Engine. Google uses your key to protect the Google-generated keys that are used to encrypt and decrypt your data.

    If you provide a customer-supplied encryption key, Google does not permanently store your key on Google's servers or otherwise manage your key. Instead, you provide your key for each operation, and your key is purged from Google's servers after the operation is complete.

When managing your own key infrastructure, you must carefully consider latency and reliability issues and ensure that you implement appropriate HA and recovery processes for your external key manager. You must also understand your RTO requirements. The keys are integral to writing the data, so RPO isn't the critical concern because no data can be safely written without the keys. The real concern is RTO because without your keys you cannot decrypt or safely write data.

Storage

When architecting DR for locality-restricted workloads, you must ensure that data at rest is located in the region you require. You can configure Google Cloud object and file store services to meet your requirements.

Cloud Storage

You can create Cloud Storage buckets that meet locality restrictions.

Beyond the features discussed in the Cloud Storage section of the Disaster Recovery Building Blocks article, when you architect for DR for locality-restricted workloads, consider geo-redundancy. Geo-redundancy refers to storing data redundantly in at least two geographic areas separated by at least 100 miles. Objects stored in multi-regions and dual-regions are geo-redundant, regardless of their storage class. Geo-redundancy ensures maximum availability of your data, even during large-scale disruptions, such as natural disasters. For dual regions, you can achieve geo-redundancy by using two specific regions. For multi-regions, you can achieve geo-redundancy by using any combination of data centers in the specified multi-region, which might include data centers that are not explicitly listed as available regions.
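
If you need to check a candidate pair of sites against the 100-mile separation described above, a great-circle calculation gives a first approximation. The following sketch uses the haversine formula with a mean Earth radius. The function names are hypothetical, and any coordinates you pass in are your own inputs: data center coordinates are not given in this document, so city-level coordinates only approximate the real separation.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def meets_distance_requirement(site_a, site_b, minimum_miles=100.0):
    """Return True if two (lat, lon) sites are at least minimum_miles apart."""
    return great_circle_miles(*site_a, *site_b) >= minimum_miles


# Approximate city-centre coordinates, used here only as placeholders.
london = (51.51, -0.13)
frankfurt = (50.11, 8.68)
print(meets_distance_requirement(london, frankfurt))
```

The same check applies when you pair a Google Cloud region with an on-premises site or another cloud, where the minimum distance might be set by your regulator rather than by the 100-mile figure used here.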

Data synchronization between the buckets occurs asynchronously. If you need a high degree of confidence that the data has been written to an alternative region to meet your RTO and RPO values, one strategy is to use two single-region buckets. You can then either dual-write the object or write to one bucket and have Cloud Storage copy the object (rewriteTo) to the second bucket.
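
The dual-write strategy can be sketched as follows. This is an illustration of the pattern only, and it does not call the Cloud Storage client library: ObjectStore, InMemoryStore, PartialWriteError, and dual_write are hypothetical names, and the in-memory store stands in for a single-region bucket. In a real system, each put would be an upload to a bucket in a different region.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """Hypothetical interface for anything that can store an object."""

    def put(self, name: str, data: bytes) -> None: ...


class InMemoryStore:
    """Test double that stands in for a single-region bucket."""

    def __init__(self, location: str, fail_writes: bool = False):
        self.location = location
        self.fail_writes = fail_writes
        self.objects: dict[str, bytes] = {}

    def put(self, name: str, data: bytes) -> None:
        if self.fail_writes:
            raise IOError(f"write to {self.location} failed")
        self.objects[name] = data


class PartialWriteError(Exception):
    """The primary write succeeded but the secondary write did not."""


def dual_write(name: str, data: bytes,
               primary: ObjectStore, secondary: ObjectStore) -> None:
    """Return only after both stores have accepted the object."""
    primary.put(name, data)
    try:
        secondary.put(name, data)
    except Exception as exc:
        # The stores now disagree. The caller must retry or reconcile
        # before treating the object as protected in both regions.
        raise PartialWriteError(name) from exc
```

Because dual_write returns only after both writes succeed, the caller knows that the object exists in both locations at that point. The cost is higher write latency and the need to handle the partial-write case explicitly.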

Single-region mitigation strategies when using Cloud Storage

If your requirements restrict you to using a single region—for example, London or Zürich—then you are restricted to that region and cannot implement a geo-redundant architecture using Google Cloud alone. In this scenario, consider using one or more of the following techniques:

  • Adopt a multi-cloud or hybrid strategy. This approach lets you choose another cloud or on-premises solution in the same geographic area as your Google Cloud region. You can store copies of the data from your Cloud Storage buckets on-premises, or alternatively, use Cloud Storage as the target for your backup data.

    To use this approach, do the following:

    • Ensure that distance requirements are met.
    • If you are using AWS as your other cloud provider, refer to the Cloud Storage interoperability guide for how to configure access to Amazon S3 using Google Cloud tools.
    • For other clouds and on-premises solutions, consider open source solutions such as MinIO and Ceph to provide an on-premises object store.
    • Consider using partner solutions to provide an on-premises object store.
    • Consider using a partner solution to implement workflows that let you write data to both Cloud Storage and an alternative cloud's object store service.
    • Consider using Cloud Composer with the gsutil command-line utility to transfer data from an on-premises object store to Cloud Storage.
    • Use the Transfer service for on-premises data to copy data stored on-premises to Cloud Storage.
  • Implement encryption techniques. If your locality requirements permit using encryption techniques as a workaround, you can then use multi-region or dual-region buckets.

Filestore

Filestore provides managed file storage that you can deploy in regions and zones according to your locality restriction requirements.

Managed databases

Disaster recovery scenarios for data describes methods for implementing backup and recovery strategies for Google Cloud managed database services. In addition to using these methods, you must also consider locality restrictions for each managed database service that you use in your architecture—for example:

  • Cloud Bigtable is available in zonal locations in a region. Production instances have a minimum of two clusters, which must be in unique zones in the region. Replication between clusters in a Cloud Bigtable instance is automatically managed by Google. Cloud Bigtable synchronizes your data between the clusters, creating a separate, independent copy of your data in each zone where your instance has a cluster. Replication makes it possible for incoming traffic to fail over to another cluster in the same instance.

  • BigQuery has locality restrictions that dictate where your datasets are stored and backed up to. Dataset locations can be regional or multi-regional. For multi-regional configurations, data is stored in a single region but is backed up in a geographically separated region to provide resilience during a regional disaster. BigQuery manages the recovery and failover process. If you select EU Multi-region, you exclude Zürich and London from being part of the multi-region configuration.

    To understand the implications of adopting single-region or multi-region BigQuery configurations, see the BigQuery documentation.

  • You can use Firestore to store your Firestore data in either a multi-region location or a regional location. Data in a multi-region location operates in a multi-zone and multi-region replicated configuration. Select a multi-region location if your locality restriction requirements permit it and you want to maximize the availability and durability of your database. Multi-region locations can withstand loss of entire regions and maintain availability without data loss. Data in a regional location operates in a multi-zone replicated configuration.

  • You can configure Cloud SQL for high availability. A Cloud SQL instance configured for HA is also called a regional instance and is located in a primary and secondary zone in the configured region. In a regional instance, the configuration is made up of a primary instance and a standby instance. Ensure that you understand the typical failover time from the primary to the standby instance.

    If your requirements permit, you can configure Cloud SQL with cross-region replicas. If a disaster occurs, the read replica in a different region can be promoted. Because read replicas are not configured for HA automatically, an extra step is needed after replica promotion.

  • You can configure Cloud Spanner as either regional or multi-region. For any regional configuration, Cloud Spanner maintains three read-write replicas, each in a different Google Cloud zone in that region. Each read-write replica contains a full copy of your operational database that is able to serve read/write and read-only requests.

    Cloud Spanner uses replicas in different zones so that if a single-zone failure occurs, your database remains available. A Cloud Spanner multi-region deployment provides a consistent environment across multiple regions, including two read-write regions and one witness region containing a witness replica. You must validate that the locations of all the regions meet your locality restriction requirements.
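
Validating the regions of a multi-region configuration against your permitted locations can be sketched as a set comparison. The class, the function, and the configuration name "example-config" are hypothetical, and the regions in the usage note are placeholders rather than a real Cloud Spanner configuration. Look up the actual regions of the configuration you intend to use before you rely on a check like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiRegionConfig:
    """Hypothetical description of a multi-region database configuration."""

    name: str
    read_write_regions: tuple
    witness_region: str

    def all_regions(self) -> set:
        # The witness region counts too: it participates in the
        # configuration even though it does not serve reads or writes.
        return set(self.read_write_regions) | {self.witness_region}


def disallowed_regions(config: MultiRegionConfig, permitted) -> list:
    """Return the regions in the configuration that are not permitted."""
    return sorted(config.all_regions() - set(permitted))
```

For a placeholder configuration with read-write regions europe-west1 and europe-west4 and witness region europe-north1, a permitted set that omits europe-north1 yields ["europe-north1"]. An empty result means that every region in the configuration, including the witness region, is within your permitted locations.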

Compute Engine

Compute Engine resources are global, regional, or zonal. Compute Engine resources such as virtual machine instances or zonal persistent disks are referred to as zonal resources. Other resources, such as static external IP addresses, are regional. Regional resources can be used by any resources in that region, regardless of zone, while zonal resources can only be used by other resources in the same zone.

Putting resources in different zones in a region isolates those resources from most types of physical infrastructure failure and infrastructure software-service failures. Also, putting resources in different regions provides an even higher degree of failure independence. This approach lets you design robust systems with resources spread across different failure domains.

For more information, see regions and zones.

Using on-premises or another cloud as a production site

You might be using a Google Cloud region that prevents you from using dual or multi-region combinations for your DR architecture. To meet locality restrictions in this case, consider using your own data center or another cloud as the production site or as the failover site.

This section discusses Google Cloud products that are optimized for hybrid workloads. DR architectures that use on-premises and Google Cloud are discussed in Disaster recovery scenarios for applications.

Anthos

Anthos is Google Cloud's open hybrid and multi-cloud application platform that helps you securely run your container-based workloads anywhere. Anthos enables consistency between on-premises and cloud environments, letting you have a consistent operating model and a single view of your Google Kubernetes Engine (GKE) clusters, no matter where you are running them.

As part of your DR strategy, Anthos simplifies the configuration and operation of HA and failover architectures across dissimilar environments (between Google Cloud and on-premises or another cloud). You can run your production Anthos clusters on-premises and if a disaster occurs, you can fail over to run the same workloads on Anthos clusters in Google Cloud.

Anthos on Google Cloud has three types of clusters:

  • Single-zone cluster. A single-zone cluster has a single control plane running in one zone. This control plane manages workloads on nodes that are running in the same zone.
  • Multi-zonal cluster. A multi-zonal cluster has a single replica of the control plane running in a single zone, and has nodes running in multiple zones.
  • Regional cluster. Regional clusters replicate the control plane and nodes across multiple zones in a single region. For example, a regional cluster in the us-east1 region creates replicas of the control plane and nodes in three us-east1 zones: us-east1-b, us-east1-c, and us-east1-d.

Regional clusters are the most resilient to zonal outages.

Google Cloud VMware Engine

Google Cloud VMware Engine lets you run VMware workloads in the cloud. If your on-premises workloads are VMware based, you can architect your DR solution to run on the same virtualization solution that you are running on-premises. You can select the region that meets your locality requirements.

Networking

When your DR plan is based on moving data from on-premises to Google Cloud or from another cloud provider to Google Cloud, then you must address your networking strategy. For more information, see the Transferring data to and from Google Cloud section of the "Disaster recovery building blocks" document.

VPC Service Controls

When planning your DR strategy, you must ensure that the security controls that apply to your production environment also extend to your failover environment. By using VPC Service Controls, you can define a security perimeter from on-premises networks to your projects in Google Cloud.

VPC Service Controls enables a context-aware access approach to controlling your cloud resources. You can create granular access control policies in Google Cloud based on attributes like user identity and IP address. These policies help ensure that the appropriate security controls are in place in your on-premises and Google Cloud environments.

What's next