Choosing the right architecture for global data distribution

This solution describes three example architectures that you can use to distribute data across Google Cloud regions.

Many enterprises work with data from geographically dispersed locations while responding to client requests in near-real time. For example, a demand-side platform (DSP) for digital advertising might have customers who expect database response times of less than twenty milliseconds, regardless of their geographical location or the current network load. Implementing this sort of global DSP solution isn't practical if the architecture is based on a single centralized database, because latency to that database grows with physical distance and the database is heavily affected by usage spikes.

You can meet these needs with a distributed architecture for data storage. Not all architectures are appropriate for all business needs, and each architecture has varying strengths and weaknesses. This solution therefore offers various Google Cloud alternatives to help implement your overall business strategy and guide your network implementation approach.

Google Cloud advantages

Google Cloud offers robust and stable network bandwidth around the world, along with the following advantages:

You can use Google Cloud to build a global virtual network that lets your applications communicate more securely across regions using private IP addresses. For example, you can set up Compute Engine virtual machine (VM) instances in two regions, such as us-central1 and asia-east1. By creating a Virtual Private Cloud (VPC) network, you can have these VM instances communicate directly with each other over private IP addresses. In this way, your organization can help maintain secure communications between instances.
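As a rough illustration of the private addressing described above, the following Python sketch uses the standard ipaddress module to check that two VM addresses are private and fall inside the subnets of one network. The subnet ranges, the addresses, and the function names are assumptions chosen for illustration; they are not values taken from this solution.

```python
import ipaddress
from typing import Optional

# Hypothetical subnet ranges for one VPC network (illustrative assumptions).
SUBNETS = {
    "us-central1": ipaddress.ip_network("10.128.0.0/20"),
    "asia-east1": ipaddress.ip_network("10.140.0.0/20"),
}


def region_for(address: str) -> Optional[str]:
    """Return the region whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for region, subnet in SUBNETS.items():
        if ip in subnet:
            return region
    return None


def can_use_private_path(src: str, dst: str) -> bool:
    """True when both addresses are private and inside the network's subnets."""
    both_private = all(ipaddress.ip_address(a).is_private for a in (src, dst))
    both_known = region_for(src) is not None and region_for(dst) is not None
    return both_private and both_known
```

A check like this only models addressing. Whether two instances can actually reach each other also depends on firewall rules and routing, which the sketch doesn't cover.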

With Google Cloud anycast IP, a single global IP address is assigned to a managed service, such as load balancing. Using anycast IP, you can create a single, global load balancer instead of configuring load balancers in every region. The global load balancer routes client requests to applications running in the nearest regions and automatically scales to meet changing demand.
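The routing idea can be pictured with a toy model. The Python sketch below picks the healthy backend with the lowest latency to a client. It is not how Google Cloud load balancing is implemented; the Backend type, the region list, and the latency figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    region: str
    latency_ms: float      # hypothetical measured latency from the client
    healthy: bool = True


def pick_backend(backends: List[Backend]) -> Backend:
    """Choose the healthy backend with the lowest latency to the client."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backend available")
    return min(candidates, key=lambda b: b.latency_ms)
```

In this model, marking the closest backend as unhealthy causes the next-closest region to be selected, which mirrors the single-address, many-regions behavior described above.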

Three example data distribution architectures

This section outlines three deployment architectures and discusses when each architecture is appropriate. The architectures and use cases are:

  • Hybrid deployment, consisting of Google Cloud and on-premises services. You want to maintain some on-premises services but would like to take advantage of Google Cloud features. Google Cloud is linked to your current network and incorporated into your ongoing company processes. Some or all on-premises data is copied to or incorporated with Google Cloud.

  • Hybrid deployment, consisting of Google Cloud and other cloud service provider platforms. You want to maintain your current cloud service provider operations, but would like to include some Google Cloud features and configure the two systems to communicate.

  • Google Cloud using multiple regions. You want to support near-synchronous data transfers, possibly on a global scale. Configuring Google Cloud in multiple regions allows extremely rapid and near-simultaneous data transfer across the world.

Hybrid deployment: Google Cloud and on-premises services

Combining Google Cloud with on-premises services is appropriate for use cases involving applications that store data on-premises and that also propagate data to Google Cloud.

For example, in the retail industry, primary data (also sometimes known as master data) about new products might be inserted into on-premises databases for a legacy inventory management system. The company might also need to propagate that data to a Google Cloud database that's used for online web stores. With a hybrid approach, you can build a new system that uses Google Cloud without affecting the existing on-premises system. In this architecture, Google Cloud essentially works in parallel with the on-premises network structures.

You should consider the following issues when deciding whether to implement a hybrid Google Cloud and on-premises deployment:

  • If data is both on-premises and in Google Cloud, you must decide which data to treat as primary data and where this primary data should reside. For example, you might define Google Cloud data to be the primary data. In that case, Google Cloud behaves as a data hub connecting one or multiple on-premises environments, exchanging data between them. After data is added or updated in the Google Cloud environment, the data is transmitted to on-premises systems. Alternatively, on-premises systems could hold the primary data and periodically update Google Cloud.
  • If you are developing an application for this hybrid environment, keep in mind that managed services are available only for the resources in Google Cloud. Applications that run both on-premises and in the Google Cloud environment might not be able to rely on managed services such as automated backup, redundancy, and scalability.
  • To keep data portable and to help ensure consistent administrative operations, it might be easier to host cross-platform data stores, such as MySQL, on virtual machines in both your on-premises and Google Cloud deployments.
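The first consideration above, treating one location as the holder of primary data and propagating changes outward, can be sketched as follows. This is a minimal in-memory model, not a replication implementation: the PrimaryStore class and its method names are hypothetical, and the replicas are plain dictionaries standing in for on-premises systems.

```python
class PrimaryStore:
    """Holds the primary copy and pushes each change to registered replicas."""

    def __init__(self):
        self.records = {}
        self.replicas = []

    def register(self, replica: dict) -> None:
        # Bring a new replica up to date before it receives further changes.
        replica.update(self.records)
        self.replicas.append(replica)

    def upsert(self, key: str, value: dict) -> None:
        # Write to the primary first, then propagate to every replica.
        self.records[key] = value
        for replica in self.replicas:
            replica[key] = value
```

In a real deployment the propagation step would cross a network, so it would also need retries, ordering guarantees, and conflict handling, none of which are modeled here.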

Example hybrid architecture

The following diagram illustrates an example of a hybrid architecture with Google Cloud and on-premises systems.

Architecture of a hybrid system.

In the example architecture:

  • Data is exchanged between on-premises file servers and Cloud Storage. This could involve backing up local files to Google Cloud, batch processing files as input, or downloading files from Google Cloud to on-premises networks.
  • Custom applications in local data centers use REST APIs to access applications on App Engine to retrieve or submit data. REST requests are typically synchronous and block clients until results are returned. In this architecture, App Engine provides auto-scaling to grow capacity as required, which helps keep latency low for these synchronous calls.
  • Custom applications submit messages directly to Pub/Sub to store them in a replicated queue for later processing. When messages arrive at Pub/Sub, Pub/Sub returns the status immediately and doesn't block clients. Messages can be retrieved and processed asynchronously using Cloud Functions, Dataflow, applications running on Compute Engine, and other methods. Client applications in on-premises environments can also retrieve messages.
  • Data stored in on-premises databases is exported (perhaps as CSV files) and uploaded to Google Cloud for batch loading into databases managed by Cloud SQL.
  • A Firebase database is used to synchronize data between on-premises systems and Google Cloud. Applications subscribe to keys in the database and whenever values are updated, applications are notified in real time and receive updated values. Applications that interact with Firebase can be on-premises, on Google Cloud, or both.
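The difference between the blocking REST calls and the non-blocking message submission in the list above can be sketched with an in-process queue. The code below is only an analogy built on the Python standard library: it doesn't use the Pub/Sub client library, and the run_demo function and the "accepted" status string are assumptions made for illustration.

```python
import queue
import threading


def run_demo(payloads):
    """Publish payloads without waiting, then process them asynchronously."""
    messages = queue.Queue()   # in-process stand-in for a managed queue
    processed = []

    def publish(payload):
        # Enqueue and return a status immediately; no processing happens here.
        messages.put(payload)
        return "accepted"

    def worker():
        # Retrieve and process messages independently of the publisher.
        while True:
            payload = messages.get()
            if payload is None:    # sentinel: stop the worker
                break
            processed.append(payload.upper())

    # Every publish call returns before the consumer has even started.
    statuses = [publish(p) for p in payloads]
    consumer = threading.Thread(target=worker)
    consumer.start()
    messages.put(None)
    consumer.join()
    return statuses, processed
```

Because the publisher only waits for the enqueue step, slow processing on the consumer side doesn't delay the client, which is the property the architecture relies on.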

Hybrid deployment: Google Cloud and other cloud providers

You might combine Google Cloud with other cloud providers to more effectively distribute your data, to leverage multiple fail-safe mechanisms, or to take advantage of specific Google Cloud features. This architecture is a good choice when you already have production services running on other cloud providers, but want to take advantage of Google Cloud features. For example, you might want to use BigQuery to analyze application data, as well as logs and monitoring metrics.

This architecture is similar to the hybrid on-premises and Google Cloud architecture described earlier. You should consider the following issues when implementing a hybrid deployment of Google Cloud and other cloud providers:

  • You can use open source multi-cloud client libraries such as jclouds and libcloud to help integrate APIs between Google Cloud and other cloud services.
  • Google Cloud offers ways to bring in data from Amazon Web Services (AWS): Storage Transfer Service for files, and Cloud Monitoring and Cloud Logging for metrics and logs. You can export the logs to BigQuery for further analysis.
  • Pub/Sub is a global service, and your applications don't need to know in which region Pub/Sub queues exist. You can publish messages or subscribe to globally available topics. With Google Cloud, client apps need to be aware of only a single set of IP addresses and ports. For other cloud providers, queues might be specific to a region. In that case, when you deploy apps across multiple regions, client apps need to be aware of the endpoints for every region. Keeping track of the endpoints can be cumbersome, especially if you add services from new regions.
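The endpoint bookkeeping described in the last item can be contrasted in a few lines of Python. All endpoint names and region names below are hypothetical placeholders on example domains; they don't correspond to any provider's real endpoints.

```python
# Hypothetical endpoint names, used only to contrast the two configurations.
GLOBAL_ENDPOINT = "messaging.example.com:443"

REGIONAL_ENDPOINTS = {
    "us-east": "queue.us-east.example.net:443",
    "eu-west": "queue.eu-west.example.net:443",
}


def endpoint_for_global_service(region: str) -> str:
    """A global service needs no per-region lookup."""
    return GLOBAL_ENDPOINT


def endpoint_for_regional_service(region: str) -> str:
    """A regional service requires an entry for every region the app uses."""
    try:
        return REGIONAL_ENDPOINTS[region]
    except KeyError:
        raise LookupError(f"no endpoint configured for region {region!r}") from None
```

With the regional table, deploying to a new region fails until someone adds an entry, whereas the global lookup needs no change.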

Example architecture for Google Cloud combined with another cloud provider

The following diagram illustrates a hybrid architecture that includes Google Cloud and other cloud providers.

Architecture of a system involving Google Cloud and another cloud provider.

In the example architecture:

  • Messages are exchanged between Pub/Sub and other public clouds. Pub/Sub provides a global endpoint and can act as a message hub between clouds, because applications don't need to know in which region the message queues actually exist.
  • Instances of the Cloud Monitoring agent are installed in virtual machines of other public clouds to collect metrics about CPU utilization, memory usage, process information, and so on. Cloud Monitoring monitors resource usage across hybrid cloud environments.
  • Custom applications running on virtual machines in other cloud environments use REST APIs to call applications hosted on App Engine to submit or retrieve data.
  • Storage Transfer Service directly transfers files from Amazon S3 on demand or periodically. Transferred files can be processed on Compute Engine to load into Cloud SQL.
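The last step in the list above, processing a transferred file and loading it into a relational database, might look like the following sketch. It uses the sqlite3 module from the Python standard library as a stand-in for a managed database, and the products table, its columns, and the load_csv function are assumptions made for illustration.

```python
import csv
import io
import sqlite3


def load_csv(conn: sqlite3.Connection, csv_text: str) -> int:
    """Batch-insert CSV rows (with a header row) into a products table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products "
        "(sku TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR REPLACE INTO products (sku, name, price) "
            "VALUES (:sku, :name, :price)",
            rows,
        )
    return len(rows)
```

Using INSERT OR REPLACE keyed on sku makes the load repeatable, so running the same file twice doesn't create duplicate rows. A production load into another database engine would use that engine's own upsert syntax and driver.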

Google Cloud with multiple regions

An architecture based on Google Cloud in multiple regions is a good choice when your application needs to serve users globally and synchronize data between regions with minimum latency. An example is an internet-enabled video game, which must function throughout the world with near real-time synchronization between players.

This architecture takes advantage of Google Cloud managed services to reduce administrative tasks and to ease system design, so that you can focus on your applications instead of spending time on infrastructure design. You should consider the following issues when implementing a Google Cloud deployment with multiple regions:

  • You can easily deploy multi-regional data processing services, because message publishers and subscribers can run in any region. Pub/Sub can exchange messages between applications running in different regions without you having to specify where the application is running. In this architecture, Pub/Sub messages stay entirely within Google Cloud and are not sent across the internet, resulting in lower latency.
  • Applications on Compute Engine instances can directly communicate across regions using private IP addresses within a Google Cloud VPC network.
  • You can use REST APIs to make custom applications loosely coupled. Because the architecture is fully inside the Google Cloud environment, you can use App Engine for managing applications where you expect minimal administrative tasks.
  • After distributing data across regions, you can use Dataflow or Dataproc for processing ETL or analytical workloads.

Example multi-region Google Cloud architecture

The following diagram illustrates the architecture of a Google Cloud deployment with multiple regions.

Architecture of a system involving multiple Google Cloud regions.

In the example architecture:

  • As with the hybrid cloud architecture, Cloud Monitoring monitors all compute resources and displays a consolidated global view of resource usage. Collected logs and metrics are exported to BigQuery for further analysis.
  • As with the hybrid cloud architecture, Pub/Sub is used as a message hub. Pub/Sub allows services to be loosely coupled and independent from where the application actually runs.
  • Custom applications that run on App Engine or Compute Engine directly exchange messages with other custom applications using REST APIs. This architecture is more tightly coupled than the hybrid architectures, and can therefore achieve more predictable latency.
  • Storage Transfer Service is used to synchronize Cloud Storage buckets. Alternatively, the gsutil tool can be used for on-demand transfers between buckets across regions.
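The bucket synchronization in the last item comes down to comparing two object listings. The sketch below shows that general idea in plain Python. It doesn't describe how Storage Transfer Service or gsutil work internally; the plan_sync function and the use of a name-to-checksum dictionary as a bucket listing are assumptions.

```python
from typing import Dict, List, Tuple


def plan_sync(
    source: Dict[str, str], destination: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Compare {object name: checksum} listings and return (to_copy, to_delete)."""
    # Copy objects that are missing from the destination or differ in content.
    to_copy = sorted(
        name for name, checksum in source.items()
        if destination.get(name) != checksum
    )
    # Delete objects that no longer exist in the source.
    to_delete = sorted(name for name in destination if name not in source)
    return to_copy, to_delete
```

Comparing checksums instead of timestamps avoids recopying objects whose content hasn't changed. Whether to act on the to_delete list is a policy choice, since deleting from the destination makes the synchronization destructive.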

Next steps

Learn more about data management on Google Cloud: