Structuring Data for Strong Consistency

The Google App Engine High Replication Datastore (HRD) provides very high availability and durability by using masterless synchronous replication over a wide geographic area. This means that once a transaction commit succeeds, the caller can be assured that the changes have already been accepted by a majority of replicas in different locations. The tradeoff in this design is twofold: the write throughput for any single entity group is limited to about one transaction per second, and the database cannot guarantee that queries spanning multiple entity groups ("non-ancestor queries") will see completely consistent and current data.

The following types of inconsistency may occur in a non-ancestor query:

  • The results may not reflect the latest transactions, returning slightly stale data. This can occur because your query can execute on any replica, and non-ancestor queries do not ensure that the replica they are running on is up-to-date. Instead, they use the latest data that had already been applied to that replica at the time of query execution.
  • A transaction that spans multiple entities may appear to have been applied to one of the entities and not another. Note, though, that a transaction will never appear to have been partially applied within a single entity.
  • A query may include entities in the result set that should not have been included, or exclude entities that should have been included. This can occur because the result set may be determined by the state of the indexes rather than the state of the entity itself, and transactions may be applied to the indexes either before or after they are applied to the entity.

To avoid these types of inconsistency, you need to use an ancestor query, limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group contains all transactions that were committed before the start of the query. If your application relies on strongly-consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency while still meeting your application's write throughput requirements.
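The difference between the two kinds of query can be illustrated with a toy model. The sketch below is not how the Datastore is implemented: `ToyReplica` and its methods are hypothetical names, and a single in-memory replica stands in for the real system. It only demonstrates the rule described above, namely that a non-ancestor query reads whatever has already been applied, while an ancestor query first brings its entity group up to date.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a single replica; an illustration only, not the Datastore's implementation. */
final class ToyReplica {
    // Committed writes per entity group, in commit order.
    private final Map<String, List<String>> committedLog = new LinkedHashMap<>();
    // How many committed writes of each group are visible on this replica.
    private final Map<String, Integer> appliedCount = new LinkedHashMap<>();

    /** Records a committed write. It is deliberately NOT applied to the replica yet. */
    void commit(String entityGroup, String greeting) {
        committedLog.computeIfAbsent(entityGroup, g -> new ArrayList<>()).add(greeting);
    }

    /** Non-ancestor query: returns only what has already been applied, so it may be stale. */
    List<String> queryAllGroups() {
        List<String> visible = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : committedLog.entrySet()) {
            int applied = appliedCount.getOrDefault(entry.getKey(), 0);
            visible.addAll(entry.getValue().subList(0, applied));
        }
        return visible;
    }

    /** Ancestor query: applies every committed write in the group before reading it. */
    List<String> queryGroup(String entityGroup) {
        List<String> log = committedLog.getOrDefault(entityGroup, new ArrayList<>());
        appliedCount.put(entityGroup, log.size());
        return new ArrayList<>(log);
    }
}
```

In this model, calling `commit("guestbook", "hello")` and then `queryAllGroups()` returns an empty list, whereas `queryGroup("guestbook")` returns the new greeting. The model ignores index inconsistency and multiple replicas.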

To understand how to structure your data for strong consistency, compare two different approaches for the guestbook example application from the App Engine Getting Started exercise. The first approach creates a new root entity for each greeting:

import com.google.appengine.api.datastore.Entity;

Entity greeting = new Entity("Greeting");
// No parent key specified, so Greeting is a root entity.

greeting.setProperty("user", user);
greeting.setProperty("date", date);
greeting.setProperty("content", content);

It then queries on the entity kind Greeting for the ten most recent greetings.

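import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;
import java.util.List;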

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

Query query = new Query("Greeting")
                    .addSort("date", Query.SortDirection.DESCENDING);

List<Entity> greetings = datastore.prepare(query)
                                  .asList(FetchOptions.Builder.withLimit(10));

However, because we are using a non-ancestor query, the replica used to perform the query in this scheme may not have seen the new greeting by the time the query is executed. Nonetheless, nearly all writes will be available for non-ancestor queries within a few seconds of commit. For many applications, a solution that provides the results of a non-ancestor query in the context of the current user's own changes will usually be sufficient to make such replication latencies completely acceptable.
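One way to present query results in the context of the current user's own changes is to merge the greetings that user just wrote into the possibly stale results. The following is a minimal sketch under stated assumptions: the application keeps the user's recent writes somewhere it can read back (for example, a session), `OwnWritesMerger` is a hypothetical helper, and its `Greeting` class is a simplified stand-in rather than the Datastore `Entity` class.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that overlays a user's own recent writes on query results. */
final class OwnWritesMerger {

    /** Simplified stand-in for a stored greeting; not the Datastore Entity class. */
    static final class Greeting {
        final String id;
        final String content;
        final long dateMillis;

        Greeting(String id, String content, long dateMillis) {
            this.id = id;
            this.content = content;
            this.dateMillis = dateMillis;
        }
    }

    /** Returns up to {@code limit} greetings, newest first, with duplicates removed by id. */
    static List<Greeting> merge(List<Greeting> queryResults,
                                List<Greeting> ownRecentWrites,
                                int limit) {
        Map<String, Greeting> byId = new LinkedHashMap<>();
        for (Greeting g : queryResults) {
            byId.put(g.id, g);
        }
        // The user's own copies replace any query result with the same id.
        for (Greeting g : ownRecentWrites) {
            byId.put(g.id, g);
        }
        List<Greeting> merged = new ArrayList<>(byId.values());
        merged.sort((a, b) -> Long.compare(b.dateMillis, a.dateMillis));
        return new ArrayList<>(merged.subList(0, Math.min(limit, merged.size())));
    }
}
```

Deduplicating by id matters because a greeting may appear both in the user's recent writes and in the query results once replication has caught up.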

If strong consistency is important to your application, an alternate approach is to write entities with an ancestor path that identifies the same root entity across all entities that must be read in a single, strongly-consistent ancestor query:

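import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import java.util.Date;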

String guestbookName = req.getParameter("guestbookName");
Key guestbookKey = KeyFactory.createKey("Guestbook", guestbookName);
String content = req.getParameter("content");
Date date = new Date();

// Place greeting in same entity group as guestbook
Entity greeting = new Entity("Greeting", guestbookKey);
greeting.setProperty("user", user);
greeting.setProperty("date", date);
greeting.setProperty("content", content);

You will then be able to perform a strongly-consistent ancestor query within the entity group identified by the common root entity:

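import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;
import java.util.List;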

DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

Key guestbookKey = KeyFactory.createKey("Guestbook", guestbookName);
// setAncestor restricts the query to the guestbook's entity group.
Query query = new Query("Greeting")
                    .setAncestor(guestbookKey)
                    .addSort("date", Query.SortDirection.DESCENDING);

List<Entity> greetings = datastore.prepare(query)
                                  .asList(FetchOptions.Builder.withLimit(10));

This approach achieves strong consistency by writing to a single entity group per guestbook, but it also limits changes to the guestbook to no more than one write per second (the supported limit for entity groups). If your application is likely to encounter heavier write usage, you may need to consider other means. For example, you might put recent posts in memcache with an expiration and display a mix of recent posts from memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or use another mechanism entirely. The goal is to find a caching solution that provides the current user's data for the period of time in which the user is posting to your application. Remember that a get, an ancestor query, or any operation within a transaction always sees the most recently written data.
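The idea of holding recent posts in a cache with an expiration can be sketched with a small in-process stand-in. This example does not call the App Engine Memcache API. `RecentPostCache`, its time-to-live parameter, and the explicit `nowMillis` argument (used here so the behavior is easy to test) are all assumptions made for illustration, and the sketch assumes that calls arrive with non-decreasing timestamps.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical in-process stand-in for a cache of recent posts with an expiration. */
final class RecentPostCache {

    private static final class Entry {
        final String content;
        final long expiresAtMillis;

        Entry(String content, long expiresAtMillis) {
            this.content = content;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    // Newest entry at the head, oldest at the tail.
    private final Deque<Entry> entries = new ArrayDeque<>();
    private final long ttlMillis;

    RecentPostCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Remembers a post written at {@code nowMillis}; it expires after the time-to-live. */
    void add(String content, long nowMillis) {
        entries.addFirst(new Entry(content, nowMillis + ttlMillis));
    }

    /** Drops expired posts, then returns the remaining ones, newest first. */
    List<String> unexpired(long nowMillis) {
        // With a fixed time-to-live, the oldest entries at the tail expire first.
        while (!entries.isEmpty() && entries.peekLast().expiresAtMillis <= nowMillis) {
            entries.removeLast();
        }
        List<String> contents = new ArrayList<>();
        for (Entry entry : entries) {
            contents.add(entry.content);
        }
        return contents;
    }
}
```

The expiration only needs to cover the period in which a non-ancestor query might still miss the post. After that, the Datastore results alone are expected to include it.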