Building Scalable Web Applications with Cloud Datastore

This article presents an overview of how to build large web applications with Datastore, a scalable, highly available, high-performance, and fully managed NoSQL database system. The article includes scenarios of full-fledged web applications that use Datastore jointly with other products in the Google Cloud Platform (GCP) ecosystem.

Infrastructure administrators who manage database systems for large web applications face significant challenges. In particular, configuration changes can be complex and risky when unforeseen situations arise due to the stateful nature of those database systems. Before launching a new application, administrators must often spend a lot of time planning for capacity, such as the number of virtual machines (VMs), the amount of disk storage, and the network configuration that will achieve maximum throughput and minimal latency. This type of planning is challenging because there are many unknown factors, including the volume and frequency of open database connections and the evolution of usage patterns over time. Regular maintenance work is also necessary to upgrade database software and scale resources to meet growing demand.

All of this planning and maintenance takes time, money, and attention away from developing new application features, so it is important to find a balance between provisioning enough resources to handle heavy loads and overspending on unused resources. Datastore can help minimize these challenges to allow infrastructure administrators to focus on the application.

Use cases for Datastore

At a high level, Datastore is best suited for storing hierarchical, transactional data that has a flexible, non-relational schema. In particular, consider using Datastore if your web application requires some of the following:

  • Any amount of storage capacity: Datastore is agnostic to the amount of data stored. It handles amounts from kilobytes to petabytes in the same way, without affecting performance.
  • Atomicity, consistency, isolation, and durability (ACID) compliance: Datastore supports ACID-compliant transactions that span multiple entities.
  • Balanced consistency: Datastore combines strong and eventual consistency. Entity lookups by key and ancestor queries are always strongly consistent, while all other queries are eventually consistent.
  • Indexing: Along with primary indexes, Datastore supports secondary and composite indexes.
  • Multi-tenancy: Datastore supports multi-tenant databases by providing separate data partitions, called namespaces, for multiple client organizations.
  • Autoscaling: Datastore scales up to tens of thousands of machines automatically, with no downtime. This scaling mechanism, which has been tested in App Engine for over 10 years, allows Datastore to serve millions of requests per second. Customers pay only for their actual usage based on storage size and the number of operations. See more information on Datastore pricing.
  • Security: Datastore encrypts all data automatically before it is written to disk. Datastore also offers identity and access management (IAM), which you can use to permit more granular access to specific resources and prevent unwanted access to other resources.
  • Redundancy: Datastore offers two levels of redundancy that are based on two different types of replication in multiple locations:

    • Regional replication: data is replicated in at least three different zones within the same region, which makes the database resilient to zonal outages. This type of replication is ideal if your top priority is achieving low write latency, in which case you might want to co-locate your application's compute machines in the same region.
    • Multi-region replication: data is replicated in multiple zones across at least two different regions, which results in increased availability and redundancy but higher write latency. A witness node is deployed in a third region to act as a tiebreaker between the two replicated regions, as the following diagram illustrates.

Figure 1. Replication scheme for a multi-region database in Datastore.

Datastore's ability to meet all the preceding requirements makes it well suited for a wide range of use cases, including:

  • User profiles to customize a user's experience based on their past activities and preferences. With Datastore's flexible schema, you can evolve the structure of user profiles over time—for example, by adding new properties to support new features in your application. Schema changes happen with no downtime, and performance doesn't degrade even as the number of users grows.
  • Real-time inventories, such as product catalogs for a retailer. You can use Datastore's rich, nested entities to store vast amounts of non-homogeneous, sparse data for diverse products without the need to overspecialize the structure.
  • User session management, such as for shopping carts in retail or for a multipart processing form for booking events. Datastore's support for ACID transactions helps ensure that users can lock down certain items for a period of time until their transaction is complete.
  • State mutations, such as in gaming to maintain a consistent state for all players. You can use Datastore's ACID transactions to propagate mutations across massive numbers of concurrent users. For a specific example, see Fast and Reliable Ranking in Datastore for a large game service.
  • A persistent write-through cache, such as a simple-to-use key-value store. You can use Datastore's high availability and durability to persist state and prevent data loss in the event of an application crash.
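
The session-management use case in the preceding list depends on a read, check, and write cycle that runs as a single transaction. The following is a minimal sketch of that cycle, not a definitive implementation. It assumes a client object that behaves like the Python client library's `Client` (its `transaction()`, `get()`, and `put()` methods). The function `reserve_item`, the property names `available` and `reserved_by`, and the `FakeClient` and `FakeEntity` stand-ins are hypothetical and exist only so that the sketch runs without a Datastore project.

```python
from contextlib import contextmanager


def reserve_item(client, item_key, cart_id):
    """Reserve one unit of an item for a cart; return True on success."""
    # Reads and writes inside this block belong to one transaction: it
    # commits when the block exits normally and rolls back if an
    # exception escapes.
    with client.transaction():
        item = client.get(item_key)
        if item is None or item["available"] < 1:
            return False
        item["available"] -= 1
        item["reserved_by"] = item.get("reserved_by", []) + [cart_id]
        client.put(item)
        return True


class FakeEntity(dict):
    """Hypothetical stand-in for an entity: a dict that carries a key."""

    def __init__(self, key, **properties):
        super().__init__(**properties)
        self.key = key


class FakeClient:
    """Hypothetical in-memory stand-in for a Datastore client."""

    def __init__(self, entities):
        self._entities = {entity.key: entity for entity in entities}

    @contextmanager
    def transaction(self):
        # A real transaction also isolates concurrent callers. This fake
        # only restores the previous state if the block raises.
        snapshot = dict(self._entities)
        try:
            yield
        except Exception:
            self._entities = snapshot
            raise

    def get(self, key):
        entity = self._entities.get(key)
        return None if entity is None else FakeEntity(entity.key, **entity)

    def put(self, entity):
        self._entities[entity.key] = FakeEntity(entity.key, **entity)
```

With a real client, a conflicting concurrent transaction would cause the commit to fail, so callers would typically retry the whole function.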

Alternative storage options

When evaluating Datastore as a potential database solution, be sure to verify that its production limits are acceptable for your application's requirements. While Datastore is versatile and applicable in many instances, other storage options in GCP might be a better fit for certain scenarios.

Extremely low latency

Datastore prioritizes durability and availability over latency by doing cross-region or cross-zone synchronous writes. If your application demands consistent sub–10 millisecond latency when reading or writing data, consider using an in-memory database like Memcached or Redis. You can also use an in-memory database as a cache for Datastore, which is discussed later in the platform-as-a-service (PaaS) scenario. For more information about Memcached, refer to the documentation on using it with App Engine and with Google Kubernetes Engine.

Extreme loads

Datastore can handle operations on a massive scale. However, to support complex features like replication and transactions, Datastore makes some tradeoffs that might slow performance on systems that deal with specific types of extreme loads. If your application generates extremely heavy writes, such as continuous data ingestion from Internet of Things sensors, consider using Cloud Bigtable for greater data ingestion capabilities at the expense of transactions and secondary indexes. If your application often displays the same information to users, such as a player leaderboard in gaming, consider client-side caching to reduce load by avoiding unnecessary requests to the server.

SQL schema and semantics

To express queries, Datastore supports a SQL-like syntax called GQL. However, Datastore is a non-relational database, so it does not support relational schemas or queries that use SQL semantics. In particular, Datastore does not support join operations, inequality filtering on multiple properties, or filtering based on the results of a subquery. If your application requires SQL support, use Cloud SQL for workloads that do not need horizontal scaling, or Cloud Spanner for workloads that need horizontal and global scale.
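
If an application needs only an occasional join, one common workaround is to perform it in application code. The following sketch illustrates that idea under stated assumptions. The entity shapes (`customer_key`, `id`, `total`, `name`) are hypothetical, and `fetch_customers` is a placeholder for a batched lookup by key, such as the Python client library's `get_multi`.

```python
def join_orders_with_customers(orders, fetch_customers):
    """Attach customer names to orders, emulating a SQL inner join."""
    # Collect each referenced customer key once, so that one batched
    # lookup replaces one lookup per order.
    customer_keys = sorted({order["customer_key"] for order in orders})
    customers = {c["key"]: c for c in fetch_customers(customer_keys)}

    joined = []
    for order in orders:
        customer = customers.get(order["customer_key"])
        if customer is None:
            continue  # Inner-join semantics: drop orders with no customer.
        joined.append({
            "order_id": order["id"],
            "total": order["total"],
            "customer_name": customer["name"],
        })
    return joined
```

This approach costs an extra round trip and moves work into the application, which is why a relational database remains the better fit when joins are frequent.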

Data analytics

Datastore is optimized for online transaction processing (OLTP). If your application requires a storage option for full table scans and interactive querying in an online analytical processing (OLAP) system, use BigQuery. If your application needs both OLTP and OLAP systems, use Datastore as your OLTP system and export Datastore data to BigQuery for analysis, as discussed later in the data analytics platform scenario.

Real-time mobile applications

Datastore is designed to store massive amounts of data that are accessible on the server side. If your application is mobile-based and requires real-time functionality, such as automatic push updates that are triggered by data changes, consider using Firebase. The Firebase platform provides many mobile-focused features, including crash reporting, user management and authentication, and user event analytics.

Datastore's integration with other GCP products

A large web application requires more than just a scalable database system. It also needs scalable web servers, a caching system, a storage solution for backups, and potentially a data warehouse with an extract, transform, load (ETL) pipeline to support data analytics or machine learning (ML) workloads. While Datastore is well suited to manage the single source of truth that is central to your application, it also integrates with multiple other GCP products to serve a wide range of needs. The following diagram illustrates this flexibility.

Figure 2. Overview of Datastore's integration with other GCP products.

Programs that run on Compute Engine, Google Kubernetes Engine, App Engine, and Cloud Functions can use Datastore to read and write transactional data at scale. The runtime environment you choose depends on the level of management and customization that your development and infrastructure teams require—for example:

  • Compute Engine gives you full control over virtual machines (VMs).
  • Google Kubernetes Engine (GKE) facilitates the deployment and orchestration of your containers.
  • App Engine manages the entire infrastructure so you can focus on code.
  • Cloud Functions enables you to deploy event-driven microservices.

Programs interact with Datastore by using low-level REST or RPC APIs, or one of the cross-platform client libraries that are available for the C#, Go, Java, Node.js, PHP, Python, and Ruby languages.

Datastore supports data exports to Cloud Storage for archival or disaster recovery purposes, or to BigQuery for data analytics, either directly or through Cloud Storage. You can also write a custom Dataflow pipeline to, for example, filter on entities with a property that is set to true, or to preprocess data before loading it into BigQuery for analysis. For regular backups, you can use App Engine (flexible environment only) or Cloud Functions to schedule the execution of a Dataflow pipeline, or you can use cron jobs with Compute Engine.
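
As an illustration of the export path described above, the following sketch builds the URL and JSON body for a managed export request to the Datastore REST API. It is a sketch only: it does not send the request, and it omits authentication, which a real call needs in the form of an OAuth 2.0 bearer token. The function name `build_export_request` and the project, bucket, and kind names in the usage note are hypothetical.

```python
import json


def build_export_request(project_id, bucket, kinds=None, namespace_ids=None):
    """Return the (url, json_body) pair for a managed export request."""
    url = f"https://datastore.googleapis.com/v1/projects/{project_id}:export"
    body = {"outputUrlPrefix": f"gs://{bucket}"}

    # An omitted entity filter exports all kinds in all namespaces.
    entity_filter = {}
    if kinds:
        entity_filter["kinds"] = list(kinds)
    if namespace_ids:
        entity_filter["namespaceIds"] = list(namespace_ids)
    if entity_filter:
        body["entityFilter"] = entity_filter

    return url, json.dumps(body, sort_keys=True)
```

For example, `build_export_request("my-project", "my-backups", kinds=["Order"])` returns a URL and body that a scheduled job could send as an HTTP POST.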

Finally, you can use Datastore as part of a machine learning pipeline. For example, the pipeline could process data in AI Platform, either directly by using Datastore's Python library or indirectly by exporting data to Cloud Storage or BigQuery.

Reference architectures

This section describes scenarios for building large web applications that combine Datastore with other GCP products. It covers different types of functionality, such as daily exports, caching, data processing, and training models for machine learning, and provides a reference architecture for each scenario.

The scenarios and architectures are not prescriptive. Instead, they aim to highlight the breadth of possible uses for Datastore in building scalable web applications. They might also inspire you to change your own web application, because you can reorganize and adapt their components to fulfill your specific requirements.

Scenario 1: Simple autoscaling infrastructure with daily database snapshots

In this scenario, a managed instance group of Compute Engine VMs serves a large web application. The instance group scales automatically based on increases or decreases in load, and Cloud HTTP(S) Load Balancer routes incoming traffic between available instances. The VMs interact directly with Datastore to read and write transactional data. Daily snapshots of the database are stored in Cloud Storage, and object lifecycle management rules periodically move old snapshots to Cloud Storage Coldline to reduce the cost of storing archived data.

Figure 3. Simple autoscaling infrastructure with daily database snapshots.
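
The lifecycle rule in this scenario can be expressed as a small JSON configuration that is applied to the snapshot bucket. The following sketch generates such a configuration. The 30-day threshold is an assumption chosen for illustration, not a value from this article, and the function name `coldline_lifecycle_config` is hypothetical.

```python
import json


def coldline_lifecycle_config(age_days):
    """Build a lifecycle configuration that moves old objects to Coldline."""
    return {
        "rule": [
            {
                # Change the storage class instead of deleting the object.
                "action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
                # The rule applies to objects older than age_days days.
                "condition": {"age": age_days},
            }
        ]
    }


if __name__ == "__main__":
    # Print the configuration so that it can be saved to a file.
    print(json.dumps(coldline_lifecycle_config(30), indent=2))
```

The printed JSON can be saved to a file and applied to the bucket with the Cloud Storage tooling of your choice.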

Scenario 2: Container-based infrastructure

In this scenario, a gaming platform supports concurrent access by tens of thousands of players. The platform is composed of hundreds of containerized microservices that are hosted on GKE. The platform has three layers of microservices, all deployed as autoscaling Kubernetes pods: NGINX Controller pods, frontend pods, and backend pods.

Cloud SSL proxy load-balances incoming TCP and SSL traffic between NGINX Controller pods, which handle HTTP load balancing between frontend pods by using session stickiness. This approach ensures that requests from the same client are always forwarded to the same frontend pod. To reduce latency, frontend pods can cache client-specific information and player configurations locally. Backend pods can reduce latency by managing particular areas of the game's world and caching area-specific information locally. Whenever frontend and backend pods update their local caches, they also write the same information to Datastore for long-term storage. This method ensures that no state is lost if frontend or backend pods are terminated.

Figure 4. Container-based gaming platform powered by Kubernetes Engine and Datastore.

For more information about deploying the NGINX Controller on GCP, follow the hands-on guide. For an alternate reference architecture for a complex gaming platform, see a more detailed solution.
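
The caching behavior of the frontend and backend pods in this scenario resembles a write-through cache. The following sketch shows the pattern in isolation. The class `WriteThroughCache` is hypothetical, and `store_put` and `store_get` are placeholders for whatever functions write to and read from Datastore in a real pod.

```python
class WriteThroughCache:
    """Local cache whose writes also go to a durable store."""

    def __init__(self, store_put, store_get):
        self._local = {}
        self._store_put = store_put
        self._store_get = store_get

    def put(self, key, value):
        # Persist first. If the durable write raises, the local copy is
        # not updated, so the cache never holds state the store lacks.
        self._store_put(key, value)
        self._local[key] = value

    def get(self, key):
        if key in self._local:
            return self._local[key]
        # Cache miss, for example right after a pod restart: fall back
        # to the durable store and repopulate the local cache.
        value = self._store_get(key)
        if value is not None:
            self._local[key] = value
        return value
```

Writing to the durable store before updating the local copy trades some write latency for the guarantee that a terminated pod loses no acknowledged state.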

Scenario 3: Function-based infrastructure

In this scenario, an event management platform exposes an API to mobile and desktop user applications. Each API endpoint is implemented as a microservice that is deployed on Cloud Functions. Each API endpoint provides a specific service such as listing events, returning venue information, or booking tickets. Microservices are implemented in JavaScript and use the Node.js client library for Datastore to read and write persistent data. With this fully decoupled serverless architecture, you can scale each service independently.

Figure 5. Serverless function-based infrastructure writing persistent data to Datastore.

Scenario 4: Platform as a service

In this scenario, you use the App Engine standard environment, the PaaS solution from Google, to build a retail application. The application is decomposed into independent App Engine microservices, each managing its own area of responsibility: user profiles, purchase records, product information, and customer service.

App Engine handles the automatic scaling of each microservice and the automatic load balancing between instances. Each microservice is implemented by using whichever of the supported programming languages (Java, Python, PHP, and Go) best suits its purpose, and each microservice uses the App Engine standard client libraries to access Datastore. To speed up user interactions and lighten database loads, App Engine caches the results of frequent database queries in its built-in Memcached component.

Figure 6. Autoscaling web application managed by the App Engine standard environment, Datastore, and Memcached.
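
Caching the results of frequent queries, as this scenario describes, is often implemented with a cache-aside pattern. The following is a minimal sketch of that pattern, not the App Engine Memcached API. The `TTLCache` class is a hypothetical in-process stand-in for a Memcached-style cache, and `cached_query`, `run_query`, and the 60-second default expiry are assumptions made for illustration.

```python
import time


class TTLCache:
    """Hypothetical in-process stand-in for a Memcached-style cache."""

    def __init__(self, clock=time.monotonic):
        self._items = {}
        self._clock = clock

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # Expired: treat as a miss.
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)


def cached_query(cache, key, run_query, ttl_seconds=60):
    """Return a cached result, running the query only on a cache miss."""
    result = cache.get(key)
    if result is None:
        result = run_query()  # Only a miss reaches the database.
        cache.set(key, result, ttl_seconds)
    return result
```

The expiry bounds how stale a cached result can become. Shorter values keep results fresher at the cost of more database reads.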

Scenario 5: Data analytics platform

In this scenario, a web platform is divided into two main compartments: business operations and business analytics. Business operations consist of a web application that responds to client requests. The application is built by using an autoscaling group of Compute Engine instances that are load-balanced by Cloud HTTP(S) Load Balancer. The application reads and writes data directly to the database in Datastore. Snapshots of the database are created daily and stored in Cloud Storage.

The business analytics compartment monitors and gains insights into the performance of business operations in order to plan and implement future improvements. The system loads historical data daily into BigQuery from the database snapshots that are saved in Cloud Storage. The system also loads live application logs from Compute Engine instances into BigQuery in real time through Pub/Sub. This design allows data analysts to explore both historical and live data and generate interactive visual dashboards in Google Data Studio that business analysts and other decision-makers can use.

Figure 7. Data analytics platform that manages both historical data from Datastore and live, real-time data through Pub/Sub.

Scenario 6: Machine learning platform

In this scenario, an online store application offers a wide range of products for sale. The application is built by using an autoscaling group of Compute Engine instances that are load-balanced by Cloud HTTP(S) Load Balancer. All information about users and products is saved in Datastore.

A machine learning pipeline manages a recommendation engine that provides customized promotions for users. Historical data is regularly extracted from Datastore and transformed by Dataflow in batch mode. The transformed data is stored in Cloud Storage and used as features by AI Platform to train the recommendation engine's model. Each iteration of the trained model is saved to Cloud Storage and loaded into AI Platform to generate live promotion offers that are displayed on the application's user interface. The overall process is orchestrated through Apache Airflow running on Compute Engine. Data scientists experiment with new machine learning models through Datalab.

Figure 8. Machine learning platform that processes historical data from Datastore to generate live promotion offers for users.

What's next