Documents and Indexes

The Search API provides a model for indexing documents that contain structured data. You can search an index, and organize and present search results. The API supports full text matching on string fields. Documents and indexes are saved in a separate persistent store optimized for search operations. The Search API can index any number of documents. The App Engine Datastore may be more appropriate for applications that need to retrieve very large result sets.

Overview

The Search API is based on four main concepts: documents, indexes, queries, and results.

Documents

A document is an object with a unique ID and a list of fields containing user data. Each field has a name and a type. There are several types of fields, identified by the kinds of values they contain:

  • Atom Field - an indivisible character string
  • Text Field - a plain text string that can be searched word by word
  • HTML Field - a string that contains HTML markup tags; only the text outside the markup tags can be searched
  • Number Field - a floating point number
  • Date Field - a date object
  • Geopoint Field - a data object with latitude and longitude coordinates

The maximum size of a document is 1 MB.

Indexes

An index stores documents for retrieval. You can retrieve a single document by its ID, a range of documents with consecutive IDs, or all the documents in an index. You can also search an index to retrieve documents that satisfy given criteria on fields and their values, specified as a query string. You can manage groups of documents by putting them into separate indexes.

There is no limit to the number of documents in an index or the number of indexes you can use. The total size of all the documents in a single index is limited to 10 GB by default, but this limit can be raised to as much as 200 GB by submitting a request.

Queries

To search an index, you construct a query, which has a query string and possibly some additional options. A query string specifies conditions for the values of one or more document fields. When you search an index you get back only those documents in the index with fields that satisfy the query.

The simplest query, sometimes called a "global search", is a string that contains only field values. This example searches for documents that contain the words "rose" and "water":

index.search("rose water");

This one searches for documents with date fields that contain the date July 4, 1776, or text fields that include the string "1776-07-04":

index.search("1776-07-04");

A query string can also be more specific. It can contain one or more terms, each naming a field and a constraint on the field's value. The exact form of a term depends on the type of the field. For instance, assuming there is a text field called "product", and a number field called "price", here's a query string with two terms:

// search for documents with pianos that cost less than $5000
index.search("product = piano AND price < 5000");

Query options, as the name implies, are not required. They enable a variety of features:

  • Control how many documents are returned in the search results.
  • Specify what document fields to include in the results. The default is to include all the fields from the original document. You can specify that the results only include a subset of fields (the original document is not affected).
  • Sort the results.
  • Create "computed fields" for documents using FieldExpressions and abridged text fields using snippets.
  • Support paging through the search results by returning only a portion of the matched documents on each query (using offsets and cursors).

Search results

A call to search() can only return a limited number of matching documents. Your search may find more documents than can be returned in a single call. Each search call returns an instance of the Results class, which contains information about how many documents were found and how many were returned, along with the list of returned documents. You can repeat the same search, using cursors or offsets to retrieve the complete set of matching documents.
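The offset-based variant of this loop can be sketched without the Search API itself. The example below is only an illustration: the class OffsetPagingSketch and its methods searchPage and fetchAll are hypothetical names, and an in-memory list stands in for the documents a real search would match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of offset paging; not part of the Search API. */
final class OffsetPagingSketch {
    private OffsetPagingSketch() {}

    /** Stand-in for one search call: returns at most limit matches starting at offset. */
    static List<String> searchPage(List<String> allMatches, int offset, int limit) {
        int from = Math.min(offset, allMatches.size());
        int to = Math.min(from + limit, allMatches.size());
        return new ArrayList<>(allMatches.subList(from, to));
    }

    /** Repeats the same search with a growing offset until no more matches come back. */
    static List<String> fetchAll(List<String> allMatches, int limit) {
        List<String> collected = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<String> page = searchPage(allMatches, offset, limit);
            if (page.isEmpty()) {
                break; // nothing left to retrieve
            }
            collected.addAll(page);
            offset += page.size(); // the next call starts after the documents already seen
        }
        return collected;
    }

    public static void main(String[] args) {
        List<String> matches = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        System.out.println(fetchAll(matches, 2)); // prints [d1, d2, d3, d4, d5]
    }
}
```

The loop stops on the first empty page, so it needs no advance knowledge of how many documents were found.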

Additional training material

In addition to this documentation, you can read the two-part training class on the Search API at the Google Developer's Academy. (Although the class uses the Python API, you may find the additional discussion of the Search concepts useful.)

Documents and fields

The Document class represents documents. Each document has a document identifier and a list of fields.

Document identifier

Every document in an index must have a unique document identifier, or doc_id. The identifier can be used to retrieve a document from an index without performing a search. By default, the Search API automatically generates a doc_id when a document is created. You can also specify the doc_id yourself when you create a document. A doc_id must contain only visible, printable ASCII characters (ASCII codes 33 through 126 inclusive) and be no longer than 500 characters. A document identifier cannot begin with an exclamation point ('!'), and it can't begin and end with double underscores ("__").
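The doc_id rules above can be checked before a document is created. The following is a minimal sketch, not the API's own validation: DocIdRules and isValidDocId are hypothetical names, and the sketch assumes that empty identifiers are rejected and that the double-underscore rule applies only when an identifier both begins and ends with "__".

```java
/** Hypothetical helper that mirrors the doc_id rules in the text; not part of the Search API. */
final class DocIdRules {
    static final int MAX_LENGTH = 500;

    private DocIdRules() {}

    static boolean isValidDocId(String id) {
        // Assumption: null and empty identifiers are rejected (the text does not say).
        if (id == null || id.isEmpty() || id.length() > MAX_LENGTH) {
            return false;
        }
        // A doc_id cannot begin with an exclamation point.
        if (id.charAt(0) == '!') {
            return false;
        }
        // Assumption: only identifiers that both begin and end with "__" are disallowed.
        if (id.startsWith("__") && id.endsWith("__")) {
            return false;
        }
        // Only visible, printable ASCII characters (codes 33 through 126) are allowed.
        for (int i = 0; i < id.length(); i++) {
            char c = id.charAt(i);
            if (c < 33 || c > 126) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidDocId("part-SN-00042")); // prints true
        System.out.println(isValidDocId("!urgent"));       // prints false
        System.out.println(isValidDocId("__reserved__"));  // prints false
    }
}
```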

While it is convenient to create readable, meaningful unique document identifiers, you cannot include the doc_id in a search. Consider this scenario: You have an index with documents that represent parts, using the part's serial number as the doc_id. It will be very efficient to retrieve the document for any single part, but it will be impossible to search for a range of serial numbers along with other field values, such as purchase date. Storing the serial number in an atom field solves the problem.

Document fields

A document contains fields that have a name, a type, and a single value of that type. Two or more fields can have the same name, but different types. For instance, you can define two fields with the name "age": one with a text type (the value "twenty-two"), the other with a number type (value 22).

Field names

Field names are case sensitive and can only contain ASCII characters. They must start with a letter and can contain letters, digits, or underscores. A field name cannot be longer than 500 characters.
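As a rough illustration, the naming rules fit in a single regular expression plus a length check. FieldNameRules and isValidFieldName are hypothetical names used only for this sketch, which assumes that "letter" means an ASCII letter.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that mirrors the field-name rules in the text; not part of the Search API. */
final class FieldNameRules {
    static final int MAX_LENGTH = 500;

    // A letter first, then any mix of letters, digits, and underscores (ASCII only).
    private static final Pattern NAME = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    private FieldNameRules() {}

    static boolean isValidFieldName(String name) {
        return name != null
                && name.length() <= MAX_LENGTH
                && NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidFieldName("unit_price2")); // prints true
        System.out.println(isValidFieldName("2nd_price"));   // prints false
    }
}
```

Because names are case sensitive, "Price" and "price" both pass this check and would name two different fields.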

Multi-valued fields

A field can contain only one value, which must match the field's type. Field names do not have to be unique. A document can have multiple fields with the same name and same type, which is a way to represent a field with multiple values. (However, date and number fields with the same name can't be repeated.) A document can also contain multiple fields with the same name and different field types.
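The repetition rule can be sketched with a small stand-in model. The Field and Type types below are hypothetical and only mimic the shape described in the text; they are not the Search API classes. The sketch also assumes that a number field and a date field may share a name, since they differ in type.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of the multi-valued field rule; not part of the Search API. */
final class RepeatedFieldCheck {
    enum Type { TEXT, HTML, ATOM, NUMBER, DATE, GEO_POINT }

    static final class Field {
        final String name;
        final Type type;
        final Object value;

        Field(String name, Type type, Object value) {
            this.name = name;
            this.type = type;
            this.value = value;
        }
    }

    private RepeatedFieldCheck() {}

    /** Returns false if a number or date field name is repeated with the same type. */
    static boolean repeatsAllowed(List<Field> fields) {
        Set<String> seen = new HashSet<>();
        for (Field f : fields) {
            if (f.type != Type.NUMBER && f.type != Type.DATE) {
                continue; // other field types may repeat freely
            }
            // Key on type and name so that same-name fields of different types can coexist.
            if (!seen.add(f.type + ":" + f.name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Field> tags = Arrays.asList(
                new Field("tag", Type.ATOM, "red"),
                new Field("tag", Type.ATOM, "sale"));
        System.out.println(repeatsAllowed(tags)); // prints true

        List<Field> prices = Arrays.asList(
                new Field("price", Type.NUMBER, 10.0),
                new Field("price", Type.NUMBER, 12.5));
        System.out.println(repeatsAllowed(prices)); // prints false
    }
}
```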

Field types

There are three kinds of fields that store java.lang.String character strings; collectively we refer to them as string fields:

  • Text Field: A string with maximum length 1024 * 1024 (1,048,576) characters.
  • HTML Field: An HTML-formatted string with maximum length 1024 * 1024 (1,048,576) characters.
  • Atom Field: A string with maximum length 500 characters.

There are also three field types that store non-textual data:

  • Number Field: A double precision floating point value between -2,147,483,647 and 2,147,483,647.
  • Date Field: A java.util.Date.
  • Geopoint Field: A point on earth described by latitude and longitude coordinates.

The field types are specified using the Field.FieldType enums TEXT, HTML, ATOM, NUMBER, DATE, and GEO_POINT.
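The size and range limits listed above can be checked before a field is built. This is a sketch under assumptions: FieldLimits and its methods are hypothetical names, "characters" is taken to mean String.length() (UTF-16 code units), and the numeric bounds are treated as inclusive.

```java
/** Hypothetical pre-checks for the field limits in the text; not part of the Search API. */
final class FieldLimits {
    static final int MAX_TEXT_OR_HTML_CHARS = 1024 * 1024;
    static final int MAX_ATOM_CHARS = 500;
    static final double MIN_NUMBER = -2147483647.0;
    static final double MAX_NUMBER = 2147483647.0;

    private FieldLimits() {}

    /** Text and HTML fields share the same maximum length. */
    static boolean fitsTextOrHtml(String value) {
        return value != null && value.length() <= MAX_TEXT_OR_HTML_CHARS;
    }

    /** Atom fields are limited to 500 characters. */
    static boolean fitsAtom(String value) {
        return value != null && value.length() <= MAX_ATOM_CHARS;
    }

    /** Assumption: both bounds are inclusive. NaN fails both comparisons and is rejected. */
    static boolean fitsNumber(double value) {
        return value >= MIN_NUMBER && value <= MAX_NUMBER;
    }

    public static void main(String[] args) {
        System.out.println(fitsAtom("bad weather")); // prints true
        System.out.println(fitsNumber(22));          // prints true
        System.out.println(fitsNumber(3.0e9));       // prints false
    }
}
```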

Special treatment of string and date fields

When a document with date, text, or HTML fields is added to an index, some special handling occurs. It's helpful to understand what's going on "under the hood" in order to use the Search API effectively.

Tokenizing string fields

When an HTML or text field is indexed, its contents are tokenized. The string is split into tokens wherever whitespace or special characters (punctuation marks, hash sign, etc.) appear. The index will include an entry for each token. This enables you to search for keywords and phrases comprising only part of a field's value. For instance, a search for "dark" will match a document with a text field containing the string "it was a dark and stormy night", and a search for "time" will match a document with a text field containing the string "this is a real-time system".

In HTML fields, text within markup tags is not tokenized, so a document with an HTML field containing "it was a <strong>dark</strong> night" will match a search for "night", but not for "strong". If you want to be able to search markup text, store it in a text field.

Atom fields are not tokenized. A document with an atom field that has the value "bad weather" will only match a search for the entire string "bad weather". It will not match a search for "bad" or "weather" alone.
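The difference between the three string field types can be illustrated with a deliberately simplified tokenizer. This is a sketch, not the service's implementation: SimpleTokenizer and its methods are hypothetical names, tokens are lowercased as an assumption, tags are stripped with a naive regular expression, atom matching is assumed to be exact string equality, and the context-sensitive rules for characters such as '+', '#', and '.' are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified, hypothetical model of string-field matching; not part of the Search API. */
final class SimpleTokenizer {
    private SimpleTokenizer() {}

    /** Splits on every character that is not a letter, digit, underscore, or ampersand. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_' || c == '&') {
                current.append(Character.toLowerCase(c)); // assumption: case-insensitive tokens
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Text field: the search word must equal one of the tokens. */
    static boolean textMatches(String fieldValue, String word) {
        return tokenize(fieldValue).contains(word.toLowerCase(Locale.ROOT));
    }

    /** HTML field: markup tags are dropped before tokenizing, so tag names never match. */
    static boolean htmlMatches(String fieldValue, String word) {
        return textMatches(fieldValue.replaceAll("<[^>]*>", " "), word);
    }

    /** Atom field: only the entire string matches (assumed here to be exact equality). */
    static boolean atomMatches(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    public static void main(String[] args) {
        System.out.println(tokenize("this is a real-time system"));
        // prints [this, is, a, real, time, system]
        System.out.println(htmlMatches("it was a <strong>dark</strong> night", "strong")); // prints false
        System.out.println(atomMatches("bad weather", "bad")); // prints false
    }
}
```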

Tokenizing Rules
  • The underscore (_) and ampersand (&) characters do not break words into tokens.

  • These whitespace characters always break words into tokens: space, carriage return, line feed, horizontal tab, vertical tab, form feed, and NULL.

  • These characters are treated as punctuation, and will break words into tokens:

!"%()
*,-|/
[]]^`
:=>?@
{}~$

  • The characters in the following table usually break words into tokens, but they can be handled differently depending on the context in which they appear:

    Character   Rule
    <           In an HTML field, the "less than" sign indicates the start of an HTML tag, which is ignored.
    +           A string of one or more "plus" signs is treated as part of the word if it appears at the end of the word (C++).
    #           The "hash" sign is treated as part of the word if it is preceded by a, b, c, d, e, f, g, j, or x (a# through g# are musical notes; j# and x# are programming languages; c# is both). If a term is preceded by '#' (#google), it is treated as a hashtag and the hash becomes part of the word.
    '           The apostrophe is treated as a letter if it precedes the letter "s" followed by a word break, as in "John's hat".
    .           If a decimal point appears between digits, it is part of a number (that is, the decimal separator). It can also be part of a word if used in an acronym (A.B.C).
    -           The dash is part of a word if used in an acronym (I-B-M).

  • All other 7-bit characters other than letters and digits ('A-Z', 'a-z', '0-9') are handled as punctuation and break words into tokens.

  • Everything else is parsed as a UTF-8 character.

Acronyms

Tokenization uses special rules to recognize acronyms (strings like "I.B.M.", "a-b-c", or "C I A"). An acronym is a string of single alphabetic characters, with the same separator character between all of them. The valid separators are the period, dash, or any number of spaces. The separator character is removed from the string when an acronym is tokenized. So the example strings mentioned above become the tokens "ibm", "abc", and "cia". The original text remains in the document field.
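The acronym rule can be approximated with three patterns, one per separator. This is a sketch under assumptions: AcronymSketch and acronymToken are hypothetical names, only ASCII letters are recognized, a trailing period is tolerated because the example "I.B.M." has one, and the length limit on acronyms is not applied.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical approximation of acronym tokenization; not part of the Search API. */
final class AcronymSketch {
    // Single letters joined by one kind of separator: periods, dashes, or runs of spaces.
    private static final Pattern DOTTED = Pattern.compile("[A-Za-z](\\.[A-Za-z])+\\.?");
    private static final Pattern DASHED = Pattern.compile("[A-Za-z](-[A-Za-z])+");
    private static final Pattern SPACED = Pattern.compile("[A-Za-z]( +[A-Za-z])+");

    private AcronymSketch() {}

    /** Returns the lowercase token with separators removed, or null if s is not an acronym. */
    static String acronymToken(String s) {
        boolean isAcronym = DOTTED.matcher(s).matches()
                || DASHED.matcher(s).matches()
                || SPACED.matcher(s).matches();
        if (!isAcronym) {
            return null;
        }
        return s.replaceAll("[^A-Za-z]", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(acronymToken("I.B.M.")); // prints ibm
        System.out.println(acronymToken("a-b-c"));  // prints abc
        System.out.println(acronymToken("C I A"));  // prints cia
        System.out.println(acronymToken("I.B-M"));  // prints null (mixed separators)
    }
}
```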

When dealing with acronyms, note that:

  • An acronym cannot contain more than 21 letters. A valid acronym string with more than 21 letters will be broken into a series of acronyms, each with 21 letters or fewer.