Structuring Data for Strong Consistency

Note: Developers building new applications are strongly encouraged to use the NDB Client Library, which has several benefits compared to this client library, such as automatic entity caching via the Memcache API. If you are currently using the older DB Client Library, read the DB to NDB Migration Guide.

Datastore provides high availability, scalability, and durability by distributing data over many machines and using synchronous replication over a wide geographic area. This design involves a tradeoff: the write throughput for any single entity group is limited to about one commit per second, and queries and transactions that span multiple entity groups are subject to limitations. This page describes these limitations in more detail and discusses best practices for structuring your data to support strong consistency while still meeting your application's write throughput requirements.

Strongly consistent reads always return current data and, if performed within a transaction, appear to come from a single, consistent snapshot. However, a query must specify an ancestor filter to be strongly consistent or to participate in a transaction, and a transaction can involve at most 25 entity groups. Eventually consistent reads do not have these limitations, and they are adequate in many cases. Eventually consistent reads let you distribute your data among a larger number of entity groups, so you can obtain greater write throughput by executing commits in parallel across the different entity groups. But you need to understand the characteristics of eventually consistent reads to determine whether they are suitable for your application:

  • The results from these reads might not reflect the latest transactions. This can occur because these reads do not ensure that the replica they are running on is up-to-date. Instead, they use whatever data is available on that replica at the time of query execution. Replication latency is almost always less than a few seconds.
  • A committed transaction that spanned multiple entity groups might appear to have been applied to some of the entity groups and not others. Note, though, that a transaction will never appear to have been partially applied within a single entity group.
  • The query results can include entities that should not have been included according to the filter criteria, and might exclude entities that should have been included. This can occur because the indexes might be read at a different version than the entity itself.
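
Transactions, by contrast, always see current data. A transaction that must touch more than one entity group (up to the 25-group limit noted above) has to be declared cross-group (XG). Here is a minimal sketch using the DB client library; the Counter model and key names are purely illustrative:

from google.appengine.ext import db

class Counter(db.Model):
  # Illustrative model: each root Counter entity is its own entity group.
  count = db.IntegerProperty(default=0)

def increment_both(key_a, key_b):
  # Touching two root entities means touching two entity groups,
  # which requires a cross-group (XG) transaction.
  for key in (key_a, key_b):
    counter = db.get(key)
    counter.count += 1
    counter.put()

key_a = Counter(key_name='a').put()
key_b = Counter(key_name='b').put()
xg_options = db.create_transaction_options(xg=True)
db.run_in_transaction_options(xg_options, increment_both, key_a, key_b)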

To understand how to structure your data for strong consistency, compare two different approaches for a simple guestbook application. The first approach creates a new root entity for each greeting:

import webapp2
from google.appengine.ext import db

class Guestbook(webapp2.RequestHandler):
  def post(self):
    # No parent key is given, so each greeting becomes a root entity
    # in its own entity group.
    greeting = Greeting()
    ...
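
These snippets reference a Greeting model that is not defined on this page. A minimal sketch of what it might look like; the author, content, and date properties are assumptions:

from google.appengine.ext import db

class Greeting(db.Model):
  # A single guestbook entry; the property names here are illustrative.
  author = db.StringProperty()
  content = db.StringProperty(required=True)
  date = db.DateTimeProperty(auto_now_add=True)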

It then queries on the entity kind Greeting for the ten most recent greetings.

import webapp2
from google.appengine.ext import db

class MainPage(webapp2.RequestHandler):
  def get(self):
    self.response.out.write('<html><body>')
    # A non-ancestor (global) query, which is only eventually consistent.
    greetings = db.GqlQuery("SELECT * "
                            "FROM Greeting "
                            "ORDER BY date DESC LIMIT 10")
    for greeting in greetings:
      self.response.out.write('<blockquote>%s</blockquote>'
                              % greeting.content)
    self.response.out.write('</body></html>')

However, because this is a non-ancestor query, the replica used to perform the query might not have seen the new greeting by the time the query executes. Nonetheless, nearly all writes are available to non-ancestor queries within a few seconds of commit. For many applications, showing the current user's own changes alongside the results of an eventually consistent non-ancestor query is enough to make such replication latencies acceptable.
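
One way to apply that idea, sketched here as a standalone helper; the results and new_greeting names are illustrative, with new_greeting standing in for the entity the current user just wrote, carried over from the post handler:

def merge_own_write(results, new_greeting):
  # Make sure the user's just-written greeting appears in the list even
  # if the eventually consistent query has not caught up with it yet.
  if new_greeting is not None and \
     new_greeting.key() not in set(g.key() for g in results):
    results.insert(0, new_greeting)
  return results[:10]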

If strong consistency is important to your application, an alternate approach is to write entities with an ancestor path that identifies the same root entity across all entities that must be read in a single, strongly consistent ancestor query:

import webapp2
from google.appengine.ext import db

class Guestbook(webapp2.RequestHandler):
  def post(self):
    guestbook_name = self.request.get('guestbook_name')
    # All greetings for a guestbook share the same root entity,
    # placing them in a single entity group.
    greeting = Greeting(parent=guestbook_key(guestbook_name))
    ...
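
The guestbook_key() helper is not defined in these snippets. A minimal sketch that derives a fixed root key from the guestbook name; the 'default_guestbook' fallback is an assumption:

from google.appengine.ext import db

def guestbook_key(guestbook_name=None):
  # Constructs a Datastore key for a Guestbook root entity. Every greeting
  # created with this key as its parent belongs to the same entity group.
  return db.Key.from_path('Guestbook', guestbook_name or 'default_guestbook')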

You will then be able to perform a strongly consistent ancestor query within the entity group identified by the common root entity:

import webapp2
from google.appengine.ext import db

class MainPage(webapp2.RequestHandler):
  def get(self):
    self.response.out.write('<html><body>')
    guestbook_name = self.request.get('guestbook_name')

    # An ancestor query: confined to one entity group and strongly consistent.
    greetings = db.GqlQuery("SELECT * "
                            "FROM Greeting "
                            "WHERE ANCESTOR IS :1 "
                            "ORDER BY date DESC LIMIT 10",
                            guestbook_key(guestbook_name))
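
Reads and writes performed inside a transaction on this entity group are likewise strongly consistent. A minimal sketch using db.run_in_transaction, reusing the Greeting model and guestbook_key() helper sketched above; the add_greeting function and its arguments are illustrative:

def add_greeting(guestbook_name, content):
  # Runs atomically against the guestbook's entity group; any reads made
  # here see the group's most recently committed state.
  greeting = Greeting(parent=guestbook_key(guestbook_name), content=content)
  greeting.put()

db.run_in_transaction(add_greeting, 'default_guestbook', 'Hello!')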

This approach achieves strong consistency by writing to a single entity group per guestbook, but it also limits changes to that guestbook to no more than about one write per second (the supported limit for entity groups). If your application is likely to encounter heavier write usage, you might need to consider other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from memcache and Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user during the period in which the user is posting to your application. Remember: if you do a get, an ancestor query, or any other operation within a transaction, you will always see the most recently written data.
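
A minimal sketch of the memcache idea; the cache key format, the ten-item cap, and the 60-second expiration are all assumptions:

from google.appengine.api import memcache

def cache_recent_greeting(guestbook_name, greeting):
  # Prepend the new post to a short-lived, per-guestbook list so a heavily
  # written guestbook can show fresh posts without an ancestor query.
  # Best-effort only: this read-modify-write sequence is not atomic.
  cache_key = 'recent:%s' % guestbook_name
  recent = memcache.get(cache_key) or []
  recent.insert(0, greeting)
  memcache.set(cache_key, recent[:10], time=60)

On the read path, you would merge whatever this cache returns with the results of an eventually consistent query; the cache only needs to cover the short window before the user's own post propagates to the query results.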