The Search API provides a model for indexing documents that contain structured data. You can search an index, and organize and present search results. The API supports full text matching on string fields. Documents and indexes are saved in a separate persistent store optimized for search operations. The Search API can index any number of documents. The App Engine Datastore may be more appropriate for applications that need to retrieve very large result sets.
To view the contents of the search package, see the search package reference.
Overview
The Search API is based on four main concepts: documents, indexes, queries, and results.
Documents
A document is an object with a unique ID and a list of fields containing user data. Each field has a name and a type. There are several types of fields, identified by the kinds of values they contain:
- Atom Field - an indivisible character string.
- Text Field - a plain text string that can be searched word by word.
- HTML Field - a string that contains HTML markup tags; only the text outside the markup tags can be searched.
- Number Field - a floating point number.
- Time Field - a time.Time value, which is stored with millisecond precision.
- Geopoint Field - a data object with latitude and longitude coordinates.
The maximum size of a document is 1 MB.
Indexes
An index stores documents for retrieval. You can retrieve a single document by its ID, a range of documents with consecutive IDs, or all the documents in an index. You can also search an index to retrieve documents that satisfy given criteria on fields and their values, specified as a query string. You can manage groups of documents by putting them into separate indexes.
There is no limit to the number of documents in an index or the number of indexes you can use. The total size of all the documents in a single index is limited to 10 GB by default. Those with the App Engine Admin role can submit a request from the Google Cloud console App Engine Search page to increase the size up to 200 GB.
Queries
To search an index, you construct a query, which has a query string and possibly some additional options. A query string specifies conditions for the values of one or more document fields. When you search an index you get back only those documents in the index with fields that satisfy the query.
The simplest query, sometimes called a "global search", is a string that contains only field values. For example, the query string "rose water" searches for documents that contain the words "rose" and "water".
The query string "1776-07-04" searches for documents with date fields that contain the date July 4, 1776, or text fields that include the string "1776-07-04".
A query string can also be more specific. It can contain one or more terms, each naming a field and a constraint on the field's value. The exact form of a term depends on the type of the field. For instance, assuming there is a text field called "Product" and a number field called "Price", a query string with two terms might be "Product = piano AND Price < 5000".
Query options, as the name implies, are not required. They enable a variety of features:
- Control how many documents are returned in the search results.
- Specify what document fields to include in the results. The default is to include all the fields from the original document. You can specify that the results only include a subset of fields (the original document is not affected).
- Sort the results.
- Create "computed fields" for documents using FieldExpressions and abridged text fields using snippets.
- Support paging through the search results by returning only a portion of the matched documents on each query (using offsets and cursors).
We recommend logging query strings in your application if you wish to keep a record of queries that have been executed.
Search results
A Search call returns an Iterator value, which can be used to return the complete set of matching documents.
Additional training material
In addition to this documentation, you can read the two-part training class on the Search API at the Google Developer's Academy. (Although the class uses the Python API, you may find the additional discussion of the Search concepts useful.)
Documents and fields
Documents are represented by Go structs, comprising a list of fields. Documents can also be represented by any type implementing the FieldLoadSaver interface.
Document identifier
Every document in an index must have a unique document identifier, or docID. The identifier can be used to retrieve a document from an index without performing a search. By default, the Search API automatically generates a docID when a document is created. You can also specify the docID yourself when you create a document. A docID must contain only visible, printable ASCII characters (ASCII codes 33 through 126 inclusive) and be no longer than 500 characters. A document identifier cannot begin with an exclamation point ('!'), and it can't begin and end with double underscores ("__").
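The docID rules above can be checked before a document is stored. The following is a minimal sketch, not part of the search package: the function name validDocID is hypothetical, and treating an empty string as invalid (rather than as a request for an auto-generated ID) is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// maxDocIDLen is the 500-character limit stated in the text.
const maxDocIDLen = 500

// validDocID reports whether id follows the documented docID rules.
// Rejecting the empty string is an assumption of this sketch.
func validDocID(id string) bool {
	if id == "" || len(id) > maxDocIDLen {
		return false
	}
	for i := 0; i < len(id); i++ {
		// Only visible, printable ASCII (codes 33 through 126) is allowed.
		if id[i] < 33 || id[i] > 126 {
			return false
		}
	}
	if id[0] == '!' {
		return false
	}
	// Reject IDs that both begin and end with a double underscore.
	if len(id) >= 4 && strings.HasPrefix(id, "__") && strings.HasSuffix(id, "__") {
		return false
	}
	return true
}

func main() {
	for _, id := range []string{"PA6-5000", "!urgent", "__internal__", "has space"} {
		fmt.Printf("%q valid: %v\n", id, validDocID(id))
	}
}
```

Running a check like this on the client side gives an early, readable error instead of a failed call to the index.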
While it is convenient to create readable, meaningful unique document identifiers, you cannot include the docID in a search. Consider this scenario: You have an index with documents that represent parts, using the part's serial number as the docID. It will be very efficient to retrieve the document for any single part, but it will be impossible to search for a range of serial numbers along with other field values, such as purchase date. Storing the serial number in an atom field solves the problem.
Document fields
A document contains fields that have a name, a type, and a single value of that type. Two or more fields can have the same name, but different types. For instance, you can define two fields with the name "age": one with a text type (the value "twenty-two"), the other with a number type (value 22).
Field names
Field names are case-sensitive and can contain only ASCII characters. They must start with a letter and can contain letters, digits, or underscores. A field name cannot be longer than 500 characters.
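The field name rules can be expressed as a regular expression plus a length check. This is an illustrative sketch only; validFieldName is a hypothetical helper, not a function of the search package.

```go
package main

import (
	"fmt"
	"regexp"
)

// fieldNameRE encodes the rule: an ASCII letter followed by ASCII letters,
// digits, or underscores.
var fieldNameRE = regexp.MustCompile(`^[A-Za-z][A-Za-z0-9_]*$`)

// validFieldName reports whether name follows the documented rules.
func validFieldName(name string) bool {
	return len(name) <= 500 && fieldNameRE.MatchString(name)
}

func main() {
	for _, name := range []string{"Price", "price_2", "2price", "_price"} {
		fmt.Printf("%q valid: %v\n", name, validFieldName(name))
	}
}
```

Because names are case-sensitive, "Price" and "price" would both pass this check and would be treated as two different fields.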
Multi-valued fields
A field can contain only one value, which must match the field's type. Field names do not have to be unique. A document can have multiple fields with the same name and same type, which is a way to represent a field with multiple values. (However, date and number fields with the same name can't be repeated.) A document can also contain multiple fields with the same name and different field types.
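The repetition rule can be sketched with a simplified model of a document as a list of name/value pairs. The field type and the checkRepeats helper below are hypothetical stand-ins written for this example; they are not the search package's own types.

```go
package main

import (
	"fmt"
	"time"
)

// field is a simplified stand-in for a document field: a name and one value.
type field struct {
	Name  string
	Value interface{}
}

// checkRepeats returns an error if a number or time field name is repeated.
// String-like fields may repeat freely, and the same name may be reused
// with a different type.
func checkRepeats(fields []field) error {
	type key struct{ name, kind string }
	seen := map[key]bool{}
	for _, f := range fields {
		var kind string
		switch f.Value.(type) {
		case float64:
			kind = "number"
		case time.Time:
			kind = "time"
		default:
			continue
		}
		k := key{f.Name, kind}
		if seen[k] {
			return fmt.Errorf("%s field %q is repeated", kind, f.Name)
		}
		seen[k] = true
	}
	return nil
}

func main() {
	ok := []field{{"tag", "red"}, {"tag", "large"}, {"age", "twenty-two"}, {"age", 22.0}}
	bad := []field{{"price", 1.0}, {"price", 2.0}}
	fmt.Println(checkRepeats(ok), checkRepeats(bad))
}
```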
Field types
There are three kinds of fields that store character strings; collectively we refer to them as string fields:
- Text Field: A string with maximum length 1024**2 (1,048,576) characters.
- HTML Field: An HTML-formatted string with maximum length 1024**2 (1,048,576) characters.
- Atom Field: A string with maximum length 500 characters.
There are also three field types that store non-textual data:
- Number Field: A double precision floating point value between -2,147,483,647 and 2,147,483,647.
- Time Field: A time.Time value, which is stored with millisecond precision.
- Geopoint Field: A point on earth described by latitude and longitude coordinates.
The string field types are Go's built-in string type and the search package's HTML and Atom types. Number fields are represented with Go's built-in float64 type, time fields use the time.Time type, and geopoint fields use the appengine package's GeoPoint type.
Special treatment of string and time fields
When a document with time, text, or HTML fields is added to an index, some special handling occurs. It's helpful to understand what's going on "under the hood" in order to use the Search API effectively.
Tokenizing string fields
When an HTML or text field is indexed, its contents are tokenized. The string is split into tokens wherever whitespace or special characters (punctuation marks, hash sign, backslash, etc.) appear. The index will include an entry for each token. This enables you to search for keywords and phrases comprising only part of a field's value. For instance, a search for "dark" will match a document with a text field containing the string "it was a dark and stormy night", and a search for "time" will match a document with a text field containing the string "this is a real-time system".
In HTML fields, text within markup tags is not tokenized, so a document with an HTML field containing "it was a <strong>dark</strong> night" will match a search for "night", but not for "strong". If you want to be able to search markup text, store it in a text field.
Atom fields are not tokenized. A document with an atom field that has the value "bad weather" will only match a search for the entire string "bad weather". It will not match a search for "bad" or "weather" alone.
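The difference between tokenized text fields and untokenized atom fields can be illustrated with a deliberately simplified model. This is not the service's tokenizer: it only splits on characters that are not letters, digits, underscores, or ampersands, and it ignores the context-dependent rules for characters such as "+", "#", and ".". Case-insensitive matching is an assumption of this sketch, and all function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize splits s wherever a character is not a letter, digit,
// underscore, or ampersand. Lowercasing is an assumption of this sketch.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !(unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_' || r == '&')
	})
}

// matchesText reports whether word is one of the tokens of a text field.
func matchesText(value, word string) bool {
	for _, tok := range tokenize(value) {
		if tok == strings.ToLower(word) {
			return true
		}
	}
	return false
}

// matchesAtom compares the whole atom value with the whole query.
func matchesAtom(value, query string) bool {
	return strings.EqualFold(value, query)
}

func main() {
	fmt.Println(tokenize("this is a real-time system"))
	fmt.Println(matchesText("it was a dark and stormy night", "dark"))
	fmt.Println(matchesAtom("bad weather", "bad"))
}
```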
Tokenizing Rules
The underscore (_) and ampersand (&) characters do not break words into tokens.
These whitespace characters always break words into tokens: space, carriage return, line feed, horizontal tab, vertical tab, form feed, and NULL.
These characters are treated as punctuation, and will break words into tokens:
! " % ( ) * , - | / [ ] ^ ` : = > ? @ { } ~ $
The characters in the following table usually break words into tokens, but they can be handled differently depending on the context in which they appear:
Character | Rule |
---|---|
< | In an HTML field the "less than" sign indicates the start of an HTML tag which is ignored. |
+ | A string of one or more "plus" signs is treated as a part of the word if it appears at the end of the word (C++). |
# | The "hash" sign is treated as a part of the word if it is preceded by a, b, c, d, e, f, g, j, or x (a# - g# are musical notes; j# and x# are programming languages; c# is both). If a term is preceded by '#' (#google), it is treated as a hashtag and the hash becomes part of the word. |
' | Apostrophe is a letter if it precedes the letter "s" followed by a word-break, as in "John's hat". |
. | If a decimal point appears between digits, this is part of a number (i.e., the decimal-separator). This can also be part of a word if used in an acronym (A.B.C). |
- | The dash is part of a word if used in an acronym (I-B-M). |
All other 7-bit characters other than letters and digits ('A-Z', 'a-z', '0-9') are handled as punctuation and break words into tokens.
Everything else is parsed as a UTF-8 character.
Acronyms
Tokenization uses special rules to recognize acronyms (strings like "I.B.M.", "a-b-c", or "C I A"). An acronym is a string of single alphabetic characters, with the same separator character between all of them. The valid separators are the period, dash, or any number of spaces. The separator character is removed from the string when an acronym is tokenized. So the example strings mentioned above become the tokens "ibm", "abc", and "cia". The original text remains in the document field.
When dealing with acronyms, note that:
- An acronym cannot contain more than 21 letters. A valid acronym string with more than 21 letters will be broken into a series of acronyms, each 21 letters or fewer.
- If the letters in an acronym are separated by spaces, all the letters must be the same case. Acronyms constructed with period and dash can use mixed case letters.
- When searching for an acronym, you can enter the canonical form of the acronym (the string without any separators), or the acronym punctuated with either the dash or the dot (but not both) between its letters. So the text "I.B.M" could be retrieved with any of the search terms "I-B-M", "I.B.M", or "IBM".
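The acronym rules can be sketched as a function that reduces an acronym to its canonical form. This is an illustration under stated assumptions, not the service's implementation: it handles ASCII letters only, accepts one trailing period (as in "I.B.M."), and rejects strings longer than 21 letters instead of splitting them into a series. The names canonicalAcronym and acronymMatches are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const maxAcronymLetters = 21

func isASCIILetter(c byte) bool {
	return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
}

// canonicalAcronym returns the separator-free, lowercase form of s and
// whether s is an acronym under the simplified rules of this sketch.
func canonicalAcronym(s string) (string, bool) {
	var parts []string
	spaced := false
	switch {
	case strings.Contains(s, "."):
		parts = strings.Split(strings.TrimSuffix(s, "."), ".")
	case strings.Contains(s, "-"):
		parts = strings.Split(s, "-")
	default:
		spaced = true
		parts = strings.Fields(s) // any number of spaces between letters
	}
	if len(parts) < 2 || len(parts) > maxAcronymLetters {
		return "", false
	}
	var b strings.Builder
	upper, lower := 0, 0
	for _, p := range parts {
		// Every part must be a single letter; a mixed separator fails here.
		if len(p) != 1 || !isASCIILetter(p[0]) {
			return "", false
		}
		if p[0] <= 'Z' {
			upper++
		} else {
			lower++
		}
		b.WriteString(strings.ToLower(p))
	}
	// Space-separated acronyms must use a single letter case.
	if spaced && upper > 0 && lower > 0 {
		return "", false
	}
	return b.String(), true
}

// acronymMatches reports whether term would retrieve fieldText, accepting
// either a punctuated term or the canonical form.
func acronymMatches(fieldText, term string) bool {
	want, ok := canonicalAcronym(fieldText)
	if !ok {
		return false
	}
	if got, ok := canonicalAcronym(term); ok {
		return got == want
	}
	return strings.ToLower(term) == want
}

func main() {
	for _, s := range []string{"I.B.M.", "a-b-c", "C I A", "C i A"} {
		tok, ok := canonicalAcronym(s)
		fmt.Printf("%q -> %q (%v)\n", s, tok, ok)
	}
	fmt.Println(acronymMatches("I.B.M", "I-B-M"), acronymMatches("I.B.M", "IBM"))
}
```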
Time field accuracy
When you create a time field in a document you set its value to a time.Time. For the purpose of indexing and searching the time field, any time component is ignored and the date is converted to the number of days since 1/1/1970 UTC. This means that even though a time field can contain a precise time value, a date query can only specify a time field value in the form yyyy-mm-dd. This also means the sorted order of time fields with the same date is not well-defined.
While the time.Time type represents time with nanosecond precision, the Search API stores time values with only millisecond precision.
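The two effects described here, day-level granularity for indexing and millisecond precision for storage, can be reproduced with the standard time package. This is a sketch of the arithmetic only; the helper names are hypothetical and nothing here calls the Search API.

```go
package main

import (
	"fmt"
	"time"
)

const secondsPerDay = 24 * 60 * 60

// mustParse is a small helper for the examples; it panics on bad input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// indexedDay returns the number of whole days between 1970-01-01 UTC and t,
// rounding toward earlier days for dates before 1970.
func indexedDay(t time.Time) int64 {
	secs := t.UTC().Unix()
	days := secs / secondsPerDay
	if secs%secondsPerDay < 0 {
		days--
	}
	return days
}

// dateQueryValue formats t in the yyyy-mm-dd form that a date query accepts.
func dateQueryValue(t time.Time) string {
	return t.UTC().Format("2006-01-02")
}

// storedValue drops everything below millisecond precision.
func storedValue(t time.Time) time.Time {
	return t.Truncate(time.Millisecond)
}

func main() {
	morning := mustParse("2014-03-01T08:00:00.123456789Z")
	evening := mustParse("2014-03-01T20:00:00Z")
	// Both values fall on the same indexed day, so their relative sort
	// order is not well-defined.
	fmt.Println(indexedDay(morning) == indexedDay(evening))
	fmt.Println(dateQueryValue(morning), storedValue(morning).Nanosecond())
}
```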
Other document properties
The rank of a document is a positive integer which determines the default ordering of documents returned from a search. By default, the rank is set at the time the document is created to the number of seconds since January 1, 2011. You can set the rank explicitly when you create a document. It's a bad idea to assign the same rank to many documents, and you should never give more than 10,000 documents the same rank.
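The default rank can be reproduced with a short calculation. This sketch assumes the reference instant is midnight UTC on January 1, 2011; the names rankEpoch and defaultRank are hypothetical and only illustrate the arithmetic.

```go
package main

import (
	"fmt"
	"time"
)

// rankEpoch is the reference instant for default ranks. Midnight UTC is an
// assumption of this sketch.
var rankEpoch = time.Date(2011, time.January, 1, 0, 0, 0, 0, time.UTC)

// defaultRank returns the number of whole seconds between rankEpoch and now.
func defaultRank(now time.Time) int {
	return int(now.Sub(rankEpoch) / time.Second)
}

func main() {
	fmt.Println(defaultRank(time.Now()))
}
```

Because the default has one-second resolution, documents created in the same second share a rank, which is one reason to set the rank explicitly when many documents are added in bulk.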
If you specify sort options, you can use the rank as a sort key. Note that when rank is used in a sort expression or field expression it is referenced as _rank. See the DocumentMetadata reference for more information about setting rank.
The Language property of the Field struct specifies the language in which that field is encoded.
Linking from a document to other resources
You can use a document's docID and other fields as links to other resources in your application. For example, if you use Blobstore you can associate the document with a specific blob by setting the docID or the value of an Atom field to the BlobKey of the data.
Creating a document
The following code sample shows how to create a document object. The User type specifies the document structure, and a User value is constructed in the usual way.
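A minimal sketch of such a document type follows. It uses only built-in and standard library types, which the text above maps to text, number, and time fields; the field names and sample values are illustrative assumptions. A fuller version might also use the search package's HTML and Atom types.

```go
package main

import (
	"fmt"
	"time"
)

// User specifies the document structure. Each struct field becomes a
// document field whose type follows from its Go type.
type User struct {
	Name      string    // text field
	Visits    float64   // number field
	LastVisit time.Time // time field
}

// sampleUser builds a User value with a plain struct literal.
func sampleUser() *User {
	return &User{
		Name:      "Joe Jackson",
		Visits:    7,
		LastVisit: time.Now(),
	}
}

func main() {
	fmt.Printf("%+v\n", *sampleUser())
}
```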
Working with an index
Putting documents in an index
When you put a document into an index, the document is copied to persistent storage and each of its fields is indexed according to its name, type, and the docID.
The following code example shows how to access an Index and put a document into it.
When you put a document into an index and the index already contains a document with the same docID, the new document replaces the old one. No warning is given. You can call Index.Get before creating or adding a document to an index to check whether a specific docID already exists.
The Put method returns a docID. If you did not specify the docID yourself, you can examine the result to discover the docID that was generated.
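The check-before-put pattern and the generated docID can be sketched against a small interface. Everything below is a stand-in written for this example: the index interface only assumes Get and Put methods shaped like the ones described above, memIndex is an in-memory fake, errNoSuchDocument is a local error value, and the "generated-N" ID format is made up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errNoSuchDocument stands in for the "not found" error of a real index.
var errNoSuchDocument = errors.New("no such document")

// User is a hypothetical document type.
type User struct{ Name string }

// index lists the two operations this sketch relies on.
type index interface {
	Get(ctx context.Context, id string, dst interface{}) error
	Put(ctx context.Context, id string, src interface{}) (string, error)
}

// memIndex is an in-memory stand-in for an index.
type memIndex struct {
	docs   map[string]User
	nextID int
}

func (m *memIndex) Get(_ context.Context, id string, dst interface{}) error {
	u, ok := m.docs[id]
	if !ok {
		return errNoSuchDocument
	}
	p, ok := dst.(*User)
	if !ok {
		return errors.New("dst must be *User")
	}
	*p = u
	return nil
}

func (m *memIndex) Put(_ context.Context, id string, src interface{}) (string, error) {
	u, ok := src.(*User)
	if !ok {
		return "", errors.New("src must be *User")
	}
	if id == "" {
		// Made-up ID format; a real index generates its own docIDs.
		m.nextID++
		id = fmt.Sprintf("generated-%d", m.nextID)
	}
	m.docs[id] = *u // silently replaces any document with the same docID
	return id, nil
}

// putIfAbsent stores u under id only when Get reports that id is unused.
func putIfAbsent(ctx context.Context, idx index, id string, u *User) (bool, error) {
	var existing User
	err := idx.Get(ctx, id, &existing)
	if err == nil {
		return false, nil // docID already taken; keep the old document
	}
	if !errors.Is(err, errNoSuchDocument) {
		return false, err
	}
	_, err = idx.Put(ctx, id, u)
	return err == nil, err
}

// demo runs the sketch against the in-memory stand-in.
func demo() (generatedID string, first, second bool, kept string) {
	ctx := context.Background()
	idx := &memIndex{docs: map[string]User{}}
	generatedID, _ = idx.Put(ctx, "", &User{Name: "Ada"})
	first, _ = putIfAbsent(ctx, idx, "PA6-5000", &User{Name: "Joe"})
	second, _ = putIfAbsent(ctx, idx, "PA6-5000", &User{Name: "Bob"})
	var u User
	_ = idx.Get(ctx, "PA6-5000", &u)
	return generatedID, first, second, u.Name
}

func main() {
	fmt.Println(demo())
}
```

Note that the check and the put are two separate calls, so this pattern is not atomic: another request could add the same docID between them.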
Note that creating an instance of the Index type does not guarantee that a persistent index actually exists. A persistent index is created the first time you add a document to it with the Put method.
Updating documents
A document cannot be changed once you've added it to an index. You can't add or remove fields, or change a field's value. However, you can replace the document with a new document that has the same docID.
Retrieving documents by docID
Use the Index.Get method to retrieve a document from an index by its docID.
Searching for documents by their contents
To retrieve documents from an index, you construct a query string and call Index.Search. Search returns an iterator that yields matching documents in order of decreasing rank.
Deleting an index
Each index consists of its indexed documents and an index schema. To delete an index, delete all the documents in the index and then delete the index schema.
You can delete documents in an index by passing the docID of the document you wish to delete to the Index.Delete method.
Eventual consistency
When you put, update, or delete a document in an index, the change propagates across multiple data centers. This usually happens quickly, but the time it takes can vary. The Search API guarantees eventual consistency. This means that in some cases, a search or a retrieval of one or more documents might return results that do not reflect the most recent changes.
Index schemas
Every index has a schema that shows all the field names and field types that appear in the documents it contains. You cannot define a schema yourself. Schemas are maintained dynamically; they are updated as documents are added to an index. A simple schema might look like this, in JSON-like form:
{'comment': ['TEXT'], 'date': ['DATE'], 'author': ['TEXT'], 'count': ['NUMBER']}
Each key in the dictionary is the name of a document field. The key's value is a list of the field types used with that field name. If you have used the same field name with different field types the schema will list more than one field type for a field name, like this:
{'ambiguous-integer': ['TEXT', 'NUMBER', 'ATOM']}
Once a field appears in a schema it can never be removed. There is no way to delete a field, even if the index no longer contains any documents with that particular field name.
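The way a schema accumulates field names and types can be modeled with a map of sets. This is an illustrative sketch of the behavior described above, not how the service stores schemas; the schema type and its methods are hypothetical. The type deliberately has no method for removing an entry, mirroring the rule that a field never leaves a schema.

```go
package main

import (
	"fmt"
	"sort"
)

// schema maps a field name to the set of field types seen under that name.
type schema map[string]map[string]bool

// observe records that a document used the given field name and type.
func (s schema) observe(name, fieldType string) {
	if s[name] == nil {
		s[name] = map[string]bool{}
	}
	s[name][fieldType] = true
}

// types returns the sorted list of types recorded for name.
func (s schema) types(name string) []string {
	var out []string
	for t := range s[name] {
		out = append(out, t)
	}
	sort.Strings(out)
	return out
}

func main() {
	s := schema{}
	s.observe("comment", "TEXT")
	s.observe("age", "TEXT")
	s.observe("age", "NUMBER")
	s.observe("age", "TEXT") // seeing the same pair again changes nothing
	fmt.Println(s.types("comment"), s.types("age"))
}
```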
A schema does not define a "class" in the object-programming sense. As far as the Search API is concerned, every document is unique and indexes can contain different kinds of documents. If you want to treat collections of objects with the same list of fields as instances of a class, that's an abstraction you must enforce in your code. For instance, you could ensure that all documents with the same set of fields are kept in their own index. The index schema could be seen as the class definition, and each document in the index would be an instance of the class.
Viewing indexes in the Google Cloud console
In the Google Cloud console, you can view information about your application's indexes and the documents they contain. Clicking an index name displays the documents that index contains. You'll see all the defined schema fields for the index; for each document with a field of that name, you'll see the field's value. You can also issue queries on the index data directly from the console.
Search API quotas
The Search API has several free quotas:
Resource or API call | Free Quota |
---|---|
Total storage (documents and indexes) | 0.25 GB |
Queries | 1000 queries per day |
Adding documents to indexes | 0.01 GB per day |
The Search API imposes these limits to ensure the reliability of the service. These apply to both free and paid apps:
Resource | Safety Quota |
---|---|
Maximum query usage | 100 aggregated minutes of query execution time per minute |
Maximum documents added or deleted | 15,000 per minute |
Maximum size per index (unlimited number of indexes allowed) | 10 GB |
API usage is counted in different ways depending on the type of call:
- Index.Search: Each API call counts as one query; execution time is equivalent to the latency of the call.
- Index.Put: When you add documents to indexes, the size of each document and the number of documents count towards the indexing quota.
- All other Search API calls are counted based on the number of operations they involve:
  - Index.Get: 1 operation counted for each document actually returned, or 1 operation if nothing is returned.
  - Index.Delete: 1 operation counted for each document in the request, or 1 operation if the request is empty.
The quota on query throughput is imposed so that a single user cannot monopolize the search service. Because queries can execute simultaneously, each application is allowed to run queries that consume up to 100 minutes of execution time per one minute of clock time. If you are running many short queries, you probably will not reach this limit. Once you exceed the quota, subsequent queries will fail until the next time slice, when your quota is restored. The quota is not strictly imposed in one-minute slices; a variation of the leaky bucket algorithm is used to control search bandwidth in five-second increments.
More information on quotas can be found on the Quotas page. When an app tries to exceed these amounts, an insufficient quota error is returned.
Note that although these limits are enforced by the minute, the console displays the daily totals for each. Customers with Standard, Enhanced, or Premium support can request higher throughput limits by contacting their support representative.
Search API pricing
The following charges are applied to usage beyond the free quotas:
Resource | Cost |
---|---|
Total storage (documents and indexes) | $0.18 per GB per month |
Queries | $0.50 per 10K queries |
Indexing searchable documents | $2.00 per GB |
Additional information on pricing is on the Pricing page.
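As a rough worked example, the free quotas and prices from the tables above can be combined into a cost estimate. This sketch assumes that the free query and indexing quotas are subtracted per day and the free storage quota per month; how charges are actually metered and prorated is not stated here, so treat the result as an approximation. All names are hypothetical.

```go
package main

import "fmt"

const (
	freeStorageGB     = 0.25 // free total storage
	freeQueriesPerDay = 1000 // free queries per day
	freeIndexingGBDay = 0.01 // free indexing volume per day

	storagePricePerGBMonth = 0.18
	queryPricePer10K       = 0.50
	indexingPricePerGB     = 2.00
)

// billable returns the part of used that exceeds free, never negative.
func billable(used, free float64) float64 {
	if used > free {
		return used - free
	}
	return 0
}

// dailyUsageCost estimates one day's query and indexing charges in dollars.
func dailyUsageCost(queries int, indexedGB float64) float64 {
	q := billable(float64(queries), freeQueriesPerDay) / 10000 * queryPricePer10K
	i := billable(indexedGB, freeIndexingGBDay) * indexingPricePerGB
	return q + i
}

// monthlyStorageCost estimates one month's storage charge in dollars.
func monthlyStorageCost(storedGB float64) float64 {
	return billable(storedGB, freeStorageGB) * storagePricePerGBMonth
}

// approx compares two amounts while tolerating floating point rounding.
func approx(a, b float64) bool {
	d := a - b
	return d < 1e-9 && d > -1e-9
}

func main() {
	fmt.Printf("daily: $%.2f\n", dailyUsageCost(21000, 0.51))
	fmt.Printf("storage: $%.2f\n", monthlyStorageCost(10.25))
}
```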