The App Engine Datastore supports a fast and efficient means of querying large data sets while providing a relatively flexible set of SQL-like features. We accomplish this by shifting most of the querying cost from query time to write time through prebuilt indexes. Doing so requires knowing the form of all queries that can be possibly run in your application.
However, there are situations where it's impossible to know the exact form of a query ahead of time, like when the filters of the query are constructed dynamically based on user input. In these cases, all possible combinations of filters must be supported by the indexes defined by the application. Traditionally, this has required a combinatorial explosion in the number of indexes defined. Recent enhancements to the App Engine Datastore query planner have eliminated the requirement for such a proliferation in an application's indexes. This article describes how to take full advantage of these enhancements.
For brevity, we are using a shorthand notation for index definitions. The following definition denotes an index of kind
model, with property
prop1 ascending and
Index(model, prop1, -prop2)
This is equivalent to the following configuration in
indexes: - kind: model ancestor: no properties: - name: prop1 - name: prop2 direction: desc
or this configuration in
<datastore-indexes> <datastore-index kind="model" ancestor="false"> <property name="prop1" direction="asc" /> <property name="prop2" direction="desc" /> </datastore-index> </datastore-indexes>
Consider the following definition for a "Photo" entity:
||List of strings||Tokenized keywords|
||Float||Aggregate user rating|
||Integer||Number of comments|
||Integer||Number of downloads|
If you know that you will be running queries of the form
SELECT * FROM Photo WHERE license=<reuse> ORDER BY date_added DESC
The following index must be added to your application at deployment time:
Index(Photo, license, -date_added)
As an example, let's create an "Advanced Search" feature for photos. Users can specify the following options:
To allow for this type of search, there must be an index for every combination of filters and orders. As App Engine automatically builds ascending and descending indexes on properties, we do not need to include indexes where there are no filters and a single order; however, the following indexes are needed:
Index(Photo, owner_id, -date_added) Index(Photo, owner_id, -rating) Index(Photo, owner_id, -comments) Index(Photo, owner_id, -downloads) Index(Photo, size, -date_added) Index(Photo, size, -rating) Index(Photo, size, -comments) Index(Photo, size, -downloads) ... Index(Photo, owner_id, size, -date_added) Index(Photo, owner_id, size, -rating) Index(Photo, owner_id, size, -comments) Index(Photo, owner_id, size, -downloads) ... Index(Photo, aspect, coloration, license, owner_id, size, -date_added) Index(Photo, aspect, coloration, license, owner_id, size, -rating) Index(Photo, aspect, coloration, license, owner_id, size, -comments) Index(Photo, aspect, coloration, license, owner_id, size, -downloads)
The total number of indexes is 2^(number of filters) * (number of different orders) = 2 ^ 5 * 4 = 128 indexes
It is possible to specify this many indexes, but doing so has risks:
- Potential to exceed the index cap (200)
- Greatly increased storage cost per entity (as this cost includes the size of the index entries)
However, specifying many indexes also has some benefits:
- Query performance is as fast as possible.
- Query performance is consistent and not dependent on shape of data.
You may have noticed that the last example did not include a
tag property. Exposing the filters on
tag to the user complicates the system greatly. The most desirable way to expose this in an "Advanced Search" box is to allow the user to specify a search string, which is tokenized into tags and a filter added for each tag.
For example, "outside, family" would be tokenized into "outside" and "family" and the filters "tag = 'outside' AND tag = 'family'" would be added to the query that fetches the photos. Including up to two tag filters in the combinatorial explosion of query possibilities results in: 2 ^ (# of filters) * (number of different orders) = 2 ^ 7 * 4 = 512 indexes.
Creating this many indexes:
- Exceeds the index cap.
- Supports only two tags when it is impossible to know how many tags the user will enter.
- Causes the number of index values to explode exponentially (Python | Java), because a multivalued property such as
tagis repeated in an index. As a result, the number of index values per entity rapidly rises and the storage cost of each entity increases accordingly.
If we look at how the index values exploded, we find that we can list
tag a maximum of five times in a single index for there to be any chance of returning a result. The following table shows the number of values generated in each index when storing an entity with a tag property containing N values. It also shows the maximum number of tag values for that entity:
|Index||# of index values||Maximum # of tags|
|Built-in (asc, desc)||2N||2500 (capped at 1000 for a single list property)|
||N||1666 (capped at 1000 for a single list property)|
||N ^ 2||69|
||N ^ 3||16|
||N ^ 4||8|
||N ^ 5||5|
Improved Query Planner
Luckily, the App Engine datastore now provides a new and improved query planner that can shift some of the query cost from write time to query time. It does this efficiently using the zigzag merge join algorithm, which tends to scale with the size of the result set much like querying against the indexes described above under Advanced Search. By design, the datastore supports only operations that scale in this fashion. The performance of operations that scale with the size of the entire database will deteriorate as the datastore grows.
The query planner can merge the results of any scans of indexes that are sorted by the same properties and find the results that are common to all such indexes (using the zigzag merge join algorithm). For example,
Index(Photo, owner, -date_added) and
Index(Photo, size, -date_added) are both ordered by
-date_added. This means that these indexes are sufficient if you want to find entities that have both
size=1 ordered by
date_added descending. Previously, an additional index
Index(Photo, owner, size, -date_added) would have been required.
The new query planner can also merge results from multiple sections of the same index. To find entities where the list of tags contained both "family" and "outside" ordered by
date_added descending (
SELECT * FROM Photo WHERE tag = 'family' AND tag = 'outside' ORDER BY date_added DESC), you need only the single index
Index(Photo, tag, -date_added). Before the improvements, you would have needed the exploding index
Index(Photo, tag, tag, -date_added).
To support the entire advanced photo search (including
tag filters), the following is the minimal list of indexes required to run any combination of user-selected filters and sort orders:
Index(Photo, owner_id, -date_added) Index(Photo, owner_id, -rating) Index(Photo, owner_id, -comments) Index(Photo, owner_id, -downloads) Index(Photo, size, -date_added) Index(Photo, size, -rating) Index(Photo, size, -comments) Index(Photo, size, -downloads) ... Index(Photo, tag, -date_added) Index(Photo, tag, -rating) Index(Photo, tag, -comments) Index(Photo, tag, -downloads)
The number of index entries needed is (number of filters + 1) * (number of orders) = 7 * 4 = 28. This is a much more manageable number. Additionally, none of these indexes are exploding, so the additional storage cost of entities is similarly small. With these indexes, a photo can have up to 1666 tags (capped at 1000) per photo, and there is no limit on the number of tags you can filter on. The drawback is that the performance of a query that uses zigzag merge join depends on the shape of the data. We'll discuss this further in the following section.
Because the zigzag merge join algorithm that enables this type of query combines filter components at read time instead of write time, the performance can vary depending on what filters are imposed. The best-case performance of zigzag merge join is
O(R), where R is the size of the result set
while the worst case performance is
O(S), where S is the size of smallest set of entities matching a single index scan
An index scan usually maps to a single filter in the query: for example, using the indexes listed above, this query
WHERE owner_id='12345' AND size=1 AND tag='family' ORDER BY date_added DESC
maps to an index scan per filter on the following indexes:
Index(Photo, owner_id, -date_added) Index(Photo, size, -date_added) Index(Photo, tag, -date_added)
Index(Photo, owner_id, size, -date_added) were defined, it would map to two index scans on the following indexes:
Index(Photo, owner_id, size, -date_added) # Satisfies both 'owner_id=12345' and 'size=1' Index(Photo, tag, -date_added)
The actual performance depends on the shape of the data. Specifically, the average number of entities considered for each result returned is O(S/R). This indicates that poor performance is likely when many entities match each scan, but few entities match the query as a whole (R is small and S is large).
Four things mitigate this risk:
- The query planner looks up the actual entity only when it knows that entity matches the entire query.
- The zigzag algorithm does not need to find all results to return the next result. This means if you request the first 10 results, you only pay the latency for finding those 10 results. Additionally, App Engine uses streaming and asynchronous prefetching to hide this latency as much as possible.
- The zigzag algorithm will skip large sections of false positive results (results present in some, but not all, scans), so the worst-case performance happens only if false positive results are perfectly interwoven (in sort order) between scans.
- The latency is based on the number of entities found in each "index scan", not the number of entities that match each "filter". This means that you can use better indexes when needed.
For example, consider a scenario in which you have a large number of panoramic images and a lot of black-and-white images, but very few panoramic images that are black-and-white. In this scenario, queries will be slow if they contain filters on these properties that use the following indexes:
Index(Photo, aspect, ...) Index(Photo, coloration, ...)
However, the following index will produce very few results for the same filters, which in turn lowers the value of S and improves performance:
Index(Photo, aspect, coloration, ...)
This approach costs only an additional index and index entry per entity, but it substantially improves performance. Figuring out what indexes are optimal for your app is a difficult problem. The answer can change as the shape of your data changes. However, there is no harm in experimentation if you keep the base indexes needed. For example, sampling query performance can give you a good idea of what queries are common and what queries are slow. Adding indexes to improve the performance of queries that are both common and slow is likely a step in the right direction.
As of the 1.5.3 SDKs, you can test the validity of your index definitions in the development web server. To do so, configure the development server so it does not auto-generate indexes. If you have a Python application, you can disable auto-generated indexes in the development web server using
If you have a Java application, you can disable auto-generated indexes in the development web server by specifying
After you've disabled auto-generation, if a query cannot be satisfied by the available indexes, the development web server throws an exception (NeedIndexError in Python, or a DatastoreNeedIndexException in Java). Additionally, the exception will list both the minimal index and the index it would have auto-generated. The minimal index is the smallest single index that satisfies the given query. It is meant to give you an idea of what is needed, but is not necessarily to be used verbatim.