Search API Basics

Amy Unruh, Oct 2012
Google Developer Relations

Introduction

This lesson covers the basics of using the Search API: indexing content and making queries on an index. In it, you'll learn how to

  • Create a search index
  • Add content to it via an index document
  • Make simple full-text search queries on that indexed data

Objectives

Learn the basics of using the App Engine Search API.

Prerequisites

Indexes

App Engine's Search API operates through an Index object. This object lets you store data via an index document, retrieve documents using search queries, modify documents, and delete documents.

Each index has an index name and, optionally, a namespace. The name uniquely identifies the index within a given namespace. It must be a visible, printable ASCII string not starting with !. Whitespace characters are excluded. You can create multiple Index objects, but any two such objects that have the same index name in the same namespace reference the same index.
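The naming rule above can be checked before an Index object is constructed. The following is a minimal sketch only: is_valid_name is a hypothetical helper, not part of the Search API, and the requirement that the name be non-empty is an assumption. The Search API performs its own validation and may enforce further limits that this sketch does not cover.

```python
import string

# Visible, printable ASCII: letters, digits, and punctuation (no whitespace).
_VISIBLE_ASCII = set(string.ascii_letters + string.digits + string.punctuation)


def is_valid_name(name):
    """Hypothetical check: visible ASCII only, not starting with '!'.

    Rejecting the empty string is an assumption made for this sketch.
    """
    if not name or name.startswith('!'):
        return False
    return all(ch in _VISIBLE_ASCII for ch in name)
```

For example, is_valid_name('productsearch1') is true, while names such as '!hidden' or 'my index' are rejected.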

You can use namespaces and indexes to organize your documents. For the example product search application, all the product documents are in one index, with another index containing information about store locations. We can filter a query on the product category if we want to search for, say, only books.

In your code, you create an Index object by specifying the index name:

from google.appengine.api import search
index = search.Index(name='productsearch1')

or

index = search.Index(name='yourindex', namespace='yournamespace')

The underlying document index will be created at first access if it does not already exist; you don't have to create it explicitly.

You can't currently delete indexes, though you can delete documents from them, as will be described in the next class, A Deeper Look at the Python Search API.

Documents

Documents hold an index's searchable content. A document is a container for structuring indexable data. From a technical point of view, a Document object represents a collection of fields that is uniquely identified by a document ID. Fields are named, typed values. Documents do not have kinds in the same sense as Datastore entities.

In our example application, for instance, our product categories are books and HD televisions. The store has a rather limited selection of products. Each product document in the example application always includes the following core fields, defined by docs.Product class variables:

  • CATEGORY (set to books or hd_televisions)
  • PID (product ID)
  • PRODUCT_NAME
  • DESCRIPTION
  • PRICE
  • AVG_RATING
  • UPDATED (date of last update)
Figure 1: Product document fields.

The books and HD televisions categories each have some additional fields of their own. For books, the extra fields are:

  • title
  • author
  • publisher
  • pages
  • isbn

For HD televisions, they are:

  • brand
  • tv_type
  • size

The application itself enforces an application-level semantic consistency for documents of each product type. That is, all product documents will always include the same core fields, all books have the same set of additional fields, and so on. However, a search index doesn't impose any cross-document schematic consistency on the fields that are used, so there is no explicit concept of querying for "product" documents specifically.
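Because the index does not enforce a schema, a check like the one below has to live in the application. This is a sketch under stated assumptions: the lowercase core field names are hypothetical stand-ins for the values of the docs.Product class variables (which are not shown here), and missing_fields is an illustrative helper, not code from the example application.

```python
# Hypothetical field names; the real values come from docs.Product.
CORE_FIELDS = {'category', 'pid', 'product_name', 'description',
               'price', 'avg_rating', 'updated'}

# Category-specific fields, as listed above.
EXTRA_FIELDS = {
    'books': {'title', 'author', 'publisher', 'pages', 'isbn'},
    'hd_televisions': {'brand', 'tv_type', 'size'},
}


def missing_fields(category, field_names):
    """Return the sorted names of required fields absent from field_names."""
    required = CORE_FIELDS | EXTRA_FIELDS.get(category, set())
    return sorted(required - set(field_names))
```

An application could call missing_fields() before building a document and refuse to index a product whose result is non-empty.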

Field types

Each document field has exactly one field type. The type can be any of the following classes, which are defined in the Python search module:

  • TextField: A plain text string.
  • HtmlField: HTML-formatted text. If your string is HTML, use this field type, as the Search API can take the markup into account when creating result snippets and in document scoring.
  • AtomField: A string treated as a single token. A query will not match if it includes only a substring rather than the full field value.
  • NumberField: A numeric (integer or floating-point) value.
  • DateField: A date with no time component.
  • GeoField: A geographical location, denoted by a GeoPoint object specifying latitude and longitude coordinates.

For text fields (TextField, HtmlField, and AtomField), the values should be Unicode strings.
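If field values arrive as encoded bytes (for example, from a file or a request body), they can be decoded before a text field is built. The helper below is a hypothetical sketch, and UTF-8 is an assumed default encoding; use whatever encoding your data actually has.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings to Unicode text; return other values unchanged.

    The UTF-8 default is an assumption made for this sketch.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

The result of to_text() can then be passed as the value argument of a TextField, HtmlField, or AtomField.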

Example: Building product document fields and creating a document

To construct a Document object, you build a list of its fields, define its document ID if desired, and then pass this information to the Document constructor.

The example application uses the TextField, AtomField, NumberField, and DateField field types for product documents.

Defining the product document fields

The core product fields (those which are included in all product documents) look like this, where we assume the value arguments of the constructors below are set to appropriate values:

import datetime

from google.appengine.api import search
...
fields = [
    search.TextField(name=docs.Product.PID, value=pid),  # the product ID
    # The 'updated' field is set to the current date.
    search.DateField(name=docs.Product.UPDATED,
                     value=datetime.datetime.now().date()),
    search.TextField(name=docs.Product.PRODUCT_NAME, value=name),
    search.TextField(name=docs.Product.DESCRIPTION, value=description),
    # The category names are atomic.
    search.AtomField(name=docs.Product.CATEGORY, value=category),
    # The average rating starts at 0 for a new product.
    search.NumberField(name=docs.Product.AVG_RATING, value=0.0),
    search.NumberField(name=docs.Product.PRICE, value=price),
]

Note that the category field is typed as AtomField. Atom fields are useful for things like categories, where exact matches are desired; Text fields are better for strings like titles or descriptions. One of our example categories is hd televisions. If we search for just televisions, we will not get a match (assuming that that string is not contained in another product field). But, if we search for the full field string, hd televisions, we will match on the category field.
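The difference between the two matching behaviors can be pictured with a deliberately simplified model. This is not how the Search API is implemented: the two functions are hypothetical, tokenization is reduced to splitting on whitespace, and case handling and stemming are ignored.

```python
def text_matches(field_value, term):
    """Toy text-field model: the term matches any whitespace-separated token."""
    return term in field_value.split()


def atom_matches(field_value, term):
    """Toy atom-field model: the term must equal the whole field value."""
    return term == field_value
```

In this model, text_matches('hd televisions', 'televisions') succeeds, whereas atom_matches('hd televisions', 'televisions') fails and only the full string 'hd televisions' matches.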

The example application also includes fields specific to individual product categories. These are added to the field list as well, depending on the category. For example, for the television category, there are additional fields for size (a number field), brand, and tv_type (text fields). Books have a different set of fields.

Creating Documents

Given the field list, we can create a document object. For each product document, we'll set its document ID to be the predefined unique ID of that product:

d = search.Document(doc_id=product_id, fields=fields)

This design has some advantages for us (as we'll discuss in the follow-on class to this one), but if we didn't specify the document ID, one would be generated for us automatically when the document is added to an index.

Example: Using geopoints in store location documents

The Search API supports Geosearch on documents that include fields of type GeoField. If your documents contain such fields, you can query an index for matches based on distance comparisons.

A location is defined by the GeoPoint class, which stores latitude and longitude coordinates. The latitude specifies the angular distance, in degrees, north or south of the equator. The longitude specifies the angular distance, again in degrees, east or west of the prime meridian. For example, the location of the Opera House in Sydney is defined by GeoPoint(-33.857, 151.215). To store a geopoint in a document, you need to add a GeoField field with a GeoPoint object set as its value.
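To make the idea of a distance comparison between two latitude/longitude pairs concrete, here is a sketch of the haversine great-circle formula. It only illustrates the concept; the Search API computes distances itself, and nothing here is taken from its implementation. The function name and the mean Earth radius constant are assumptions of this sketch.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in meters between two points.

    Coordinates are in degrees, as in GeoPoint(latitude, longitude).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2 +
         math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

A "stores within 5 km" check then amounts to comparing haversine_m(...) against 5000 for each candidate location.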

Here is how the fields for the store location documents in the product search application are constructed:

from google.appengine.api import search
...
geopoint = search.GeoPoint(latitude, longitude)
fields = [
    search.TextField(name=docs.Store.STORE_NAME, value=storename),
    search.TextField(name=docs.Store.STORE_ADDRESS, value=store_address),
    search.GeoField(name=docs.Store.STORE_LOCATION, value=geopoint),
]

Indexing documents

Before you can query a document's contents, you must add the document to an index, using the Index object's put() method. Indexing allows the document to be searched with the Search API's query language and query options.

You can specify your own document ID when constructing a document. The document ID must be a visible, printable ASCII string not starting with !. Whitespace characters are excluded. (As we'll see later, if you index a document using the ID of an existing document, that existing document will be reindexed.) If you don't specify a document ID, a unique numeric ID will be generated automatically when the document is added to the index.

You can add documents one at a time, or alternatively you can add a list of documents in batch, which is more efficient. Here's how to construct a document, given a fields list, and add it to an index:

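import logging

from google.appengine.api import search

# Here we do not specify a document ID, so one will be auto-generated on put.
d = search.Document(fields=fields)
try:
  add_result = search.Index(name=INDEX_NAME).put(d)
except search.Error:
  # Handle the failure here, for example by logging it.
  logging.exception('Failed to add the document to the index')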

You should catch and handle any exceptions resulting from the put(), which will be of type search.Error.

If you want to specify the document ID, pass it to the Document constructor like this:

d = search.Document(doc_id=doc_id, fields=fields)

You can get the IDs of the documents that were added via the id property of each search.PutResult object in the list returned from the put() operation:

doc_id = add_result[0].id
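Adding documents in batch could be sketched as follows. The helper takes any object with a put() method that accepts a list, so it runs here against FakeIndex, a hypothetical in-memory stand-in, rather than a real search.Index. The default batch size of 200 is an assumption about the per-call limit; check the current quota documentation before relying on it.

```python
def put_in_batches(index, documents, batch_size=200):
    """Call index.put() on successive slices of documents.

    Returns the concatenated results of all put() calls. The default
    batch_size of 200 is an assumed limit, not a documented guarantee.
    """
    results = []
    for start in range(0, len(documents), batch_size):
        results.extend(index.put(documents[start:start + batch_size]))
    return results


class FakeIndex(object):
    """Hypothetical stand-in so the sketch runs without App Engine."""

    def __init__(self):
        self.calls = []  # size of each batch received

    def put(self, documents):
        self.calls.append(len(documents))
        return list(documents)
```

With a real index, the call would be wrapped in the same try/except search.Error handling used for single documents.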

Basic search queries

Adding documents to an index makes the document content searchable. You can then perform full-text search queries over the documents in the index.

There are two ways to submit a search query. Most simply, you can pass a query string to the Index object's search() method. Alternatively, you can create a Query object and pass that to the search() method. Constructing a query object allows you to specify query, sort, and result presentation options for your search.

In this lesson, we'll look at how to construct simple queries using both approaches. Recall that some search queries are not fully supported on the Development Web Server (running locally), so you'll need to run them using a deployed application.

Search using a query string

A query string can be any Unicode string that can be parsed by the Search API's query language. Once you've constructed a query string, pass it to the Index.search() method. For example:

import logging

from google.appengine.api import search

# A query string like this comes from the client.
query = "stories"
try:
  index = search.Index(INDEX_NAME)
  search_results = index.search(query)
  for doc in search_results:
    # Process each matching document here.
    logging.info('Matched document %s', doc.doc_id)
except search.Error:
  logging.exception('Search failed')

Search using a query object

A Query object gives you more control over your query options than a query string does. In this example, we first construct a QueryOptions object whose limit argument specifies that the query should return at most doc_limit results. (If you've looked at the product search application code, you'll see more complex QueryOptions objects; we'll look at these in the following class, A Deeper Look at the Python Search API.) Next, we construct the Query object from the query string and the QueryOptions object. We then pass the Query object to the Index.search() method, just as we did above with the query string.

import logging

from google.appengine.api import search

# A query string like this comes from the client.
querystring = "stories"
try:
  index = search.Index(INDEX_NAME)
  search_query = search.Query(
      query_string=querystring,
      options=search.QueryOptions(
          limit=doc_limit))
  search_results = index.search(search_query)
except search.Error:
  logging.exception('Search failed')

Processing the query results

After you've submitted a query, matching search results are returned to the application in an iterable SearchResults object. This object includes the number of results found, the actual results returned, and an optional query cursor object.

The number of results returned is the length of the object's results property, while the number_found property is set to the number of hits found. Iterating on the SearchResults object gives you the returned documents, which you can process as you like:

import logging

try:
  search_results = index.search("stories")
  returned_count = len(search_results.results)
  number_found = search_results.number_found
  for doc in search_results:
    doc_id = doc.doc_id
    fields = doc.fields
    # etc.
except search.Error:
  logging.exception('Search failed')
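When processing a returned document, it is often convenient to look up field values by name. The sketch below assumes only what the loop above shows, plus that each item in doc.fields exposes name and value attributes. fields_to_dict is a hypothetical helper, and Field and Doc are stand-in types so the example runs without App Engine. Values are collected into lists in case a document carries more than one field with the same name.

```python
import collections


def fields_to_dict(doc):
    """Map each field name in doc.fields to the list of its values."""
    values = collections.defaultdict(list)
    for field in doc.fields:
        values[field.name].append(field.value)
    return dict(values)


# Hypothetical stand-ins for a returned document and its fields.
Field = collections.namedtuple('Field', ['name', 'value'])
Doc = collections.namedtuple('Doc', ['doc_id', 'fields'])
```

Inside the result loop, fields_to_dict(doc) would then give a dictionary keyed by field name for each matching document.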

Summary and review

In this lesson, we've learned the basics of creating indexed documents and querying their contents. To check your knowledge, try recreating these steps yourself in your own simple application:

  • Create an Index object.
  • Build a list of document fields (say, using the TextField type) and construct a Document object with that field list. Add the document to the index.
  • Search the index using a search string consisting of a term in one of your field values. Is the document you created returned as a match?

In the next lesson, we'll take a closer look at Search API indexes.
