An image search application that uses the Vision API and AutoML Vision

This article explores how to use the Vision API and AutoML Vision to power your image search and classification application. When combined with other Google Cloud services, these services make it easy for you to:

  • Search within images for detected objects and scenes.
  • Classify images into different categories based on detected image labels.
  • Use image labels and categories as search facets.

The Vision API is powered by Google's deep learning models and provides advanced computer vision capabilities, including:

  • Label detection
  • Face and landmark detection
  • Optical character recognition (OCR)
  • Explicit content detection

With AutoML Vision, you can train high-quality models to perform custom label detection using Google's Neural Architecture Search and state-of-the-art transfer learning. These technologies let you bring your own training data to create a custom vision model, with minimal machine learning expertise required.

You can integrate these capabilities into new and existing applications through a REST API.
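For example, label detection is a single POST to the Vision API's `images:annotate` REST endpoint. The following sketch builds the JSON request body; the bucket path is a placeholder:

```python
import json

def build_label_detection_request(image_uri, max_results=10):
    """Build the JSON body for a Vision API images:annotate request."""
    return {
        "requests": [{
            "image": {"source": {"imageUri": image_uri}},
            "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
        }]
    }

body = build_label_detection_request("gs://example-bucket/city.jpg")
print(json.dumps(body, indent=2))
```

You would POST this body to `https://vision.googleapis.com/v1/images:annotate`, authenticating with an API key or OAuth token.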

To see how to build the application described in this article, see Building an image search application using the Vision API and AutoML Vision.

Using label detection to make images searchable

Label detection is an image annotation feature in the Vision API and AutoML Vision. This feature predicts the most appropriate labels that describe an image. The feature identifies broad object sets across thousands of different object categories and then returns a label annotation for each detected label in an image. It also returns the following:

  • Label Identifier: An opaque entity ID for the label, such as "/m/0bt9lr".
  • Label Description: A textual description of the label, such as "dog."
  • Confidence Score: A number associated with every returned label annotation, representing the Vision API's assessment of the label's accuracy. Confidence scores range from 0 (no confidence) to 1 (very high confidence).
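In code, working with these three fields might look like the following sketch; the annotations are made-up stand-ins for a parsed API response:

```python
# Hypothetical label annotations, as parsed from a Vision API response.
annotations = [
    {"mid": "/m/0bt9lr", "description": "dog", "score": 0.97},
    {"mid": "/m/01z5f", "description": "canidae", "score": 0.85},
    {"mid": "/m/04rky", "description": "mammal", "score": 0.43},
]

def confident_labels(annotations, threshold=0.5):
    """Keep only the labels whose confidence score meets the threshold."""
    return [a["description"] for a in annotations if a["score"] >= threshold]

print(confident_labels(annotations))  # ['dog', 'canidae']
```

Filtering on the confidence score before indexing is a common way to keep low-confidence labels out of your search facets.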

With AutoML Vision, you provide labeled datasets in order to train models that perform custom label detection with your labels. By combining label detection with a search index, you make images searchable in new ways. The following diagram illustrates one such approach:

label detection and search index

In this solution architecture:

  1. A user uploads an image from a client application to a Cloud Storage bucket.
  2. When a new image arrives in the bucket, Pub/Sub generates a notification.
  3. The Pub/Sub notification contains the new image file details, and the Pub/Sub topic used for notifications is configured for push delivery to the App Engine endpoint.
  4. The App Engine backend is now aware of the new file's existence. App Engine calls the Vision API on the uploaded image to process and add labels to it. These labels are also added to the search index.
  5. (Optional) App Engine adds custom labels detected using AutoML Vision to the search index.
  6. (Optional) App Engine calls AI Platform to classify images into user-defined categories using the detected labels. The category is also added to the search index.
  7. Users search the detected image labels by using App Engine's Search API, which provides a range of search functionality, including keyword and faceted search.
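Steps 2 through 4 hinge on the App Engine endpoint decoding the Pub/Sub push payload to learn which object arrived. A minimal sketch, with placeholder bucket and file names (a real Cloud Storage notification carries full object metadata):

```python
import base64
import json

def parse_storage_notification(push_body):
    """Extract the bucket and object name from a Pub/Sub push message
    that carries a Cloud Storage object-change notification."""
    data = base64.b64decode(push_body["message"]["data"])
    obj = json.loads(data)
    return obj["bucket"], obj["name"]

# A hypothetical push payload for a newly uploaded image.
payload = {
    "message": {
        "data": base64.b64encode(
            json.dumps({"bucket": "uploads", "name": "city.jpg"}).encode()
        ).decode(),
    },
    "subscription": "projects/demo/subscriptions/new-images",
}

print(parse_storage_notification(payload))  # ('uploads', 'city.jpg')
```

With the bucket and object name in hand, the endpoint can request label detection for the image and write the results to the search index.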

Using labels for faceted searching

Faceted search is a way to expose the Vision API and AutoML Vision labels (referred to as image labels in this article) in a search interface. When you use faceted search, image labels and label counts are presented alongside search results as a navigable search facet. After users start a general keyword search by querying against different index fields, they can use the search facet to refine their search results to images containing specific image labels.

The search interface also details how many results are contained in each label refinement. Faceted search is particularly effective when the results include a large number of common labels.

Faceted search example

When you search images by using a simple keyword such as "city," the search results contain thousands of images. In that case, you need to add keywords to narrow your results, but you might be unsure which keywords to add. Faceted search helps you choose by collecting other labels attached to images found using the keyword "city." These labels are treated as facets, and frequently occurring facets are displayed in a candidate list for selection.

For example, a faceted search might show a list of top ten labels commonly attached to the search result images. This list allows you to select additional keywords from the prepopulated list. The following screenshot shows a deployed example.
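Conceptually, that candidate list is a frequency count of labels across the matching result set (App Engine's Search API computes these facet counts for you). A self-contained sketch with made-up data:

```python
from collections import Counter

# Hypothetical labels attached to images matching the keyword "city".
matching_images = {
    "a.jpg": ["city", "cityscape", "night"],
    "b.jpg": ["city", "cityscape", "skyline"],
    "c.jpg": ["city", "night", "street"],
}

def label_facets(images, top_n=10):
    """Count label occurrences across the matching images and return
    the most frequent labels as (label, count) facet values."""
    counts = Counter(label for labels in images.values() for label in labels)
    return counts.most_common(top_n)

print(label_facets(matching_images))
```

Selecting a facet value such as "night" would then rerun the query restricted to images carrying that label.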

deployed facet search

This screenshot illustrates how image labels are exposed as search facets. When a user performs a search, detected image labels from the set of matching documents are presented alongside search results as a clickable search facet. In this example, the Image Label facet is exposed. Selecting one of the Image Label links triggers a search query refinement and returns only images containing the selected label, such as "cityscape" or "night."

In addition, image labels are added to a document index field using App Engine's Search API. This example also displays predetermined image categories (the Mapped Category and Most Similar Category facets) as lists of additional keywords. The next section explains how to implement this feature.

Using labels to classify images

Sometimes you might want your application to expose labels directly. Other times, rather than exposing the labels directly, you might want to classify images from detected labels into predetermined categories.

For example, you might want to allow users to search for images that match predetermined categories such as "nature" or "cityscapes" in addition to searching image labels directly. In this case, you can use image labels in different ways to derive the most appropriate image category.

To enable this scenario, you can use the Vision API and AutoML Vision:

  • Vision API label detection is ideal if the API already recognizes your categories and returns them as image labels. Label detection is also helpful if your application handles images relating to diverse subject matter that can benefit from the Vision API's broad understanding.

    In both of these cases, you can use the image labels that the Vision API returns in order to determine broader category contexts in different ways—for example, image labels such as "pollution," "factory," "landfill," and "iceberg" could be used to determine a broader category, such as "climate change." For more information, see Classifying images with the Vision API later in this document.

  • AutoML Vision is ideal for custom image classification with user-provided, labeled training sets. The custom label detection feature in AutoML Vision returns user-defined labels included in the training set, which you can use to create custom image categories.

    If Vision API label detection doesn't return appropriate labels for your categorization task, we recommend using AutoML Vision to train a custom image model. For more information, see Classifying images with AutoML Vision later in this document.

Classifying images with the Vision API

The Vision API detects objects from thousands of different categories, both specific and abstract. This broad understanding can be used to classify images into predetermined categories that are useful for your application.

To enable this scenario, you need to implement a method to associate Vision API image labels with specific categories. The following sections describe two possible approaches:

  • Mapping a detected label to a predetermined category.
  • Using word vectors to find a similar category.

In both approaches, the image labels from the Vision API provide an appropriate context to classify images.

Mapping detected labels to predetermined categories

Imagine you're developing a website selling stock photography. Your user interface might allow visitors to search or browse images within predefined categories such as wildlife, nature, and cityscapes. When the Vision API returns "giraffe," "elephants," or "savanna" as image labels, you want the image automatically organized under the wildlife category.

The Vision API label detection returns broad sets of categories within images, not scores for specific predetermined categories. One simple method is to map Vision API labels to specific categories, where each category is associated with one or more specific Vision API labels. (For the remainder of this document, this method is called fixed label-to-category mapping.) In this scheme, the labels that the Vision API returns are compared against the list of words defining each category, and the image is associated with the most suitable category as determined by image label confidence scores.

The Vision API returns one or more labels for an image. For each image, the detected image labels are compared against the words defining a given category. When there are one or more direct matches, the Vision API confidence score for each matching label is summed, creating a category confidence score for each category. This score is a numerical representation of how well the words defining a predetermined category map to a given image's returned Vision API labels. The category with the highest category confidence value is chosen as the image category and added to the search index. In the case of a tie, you could add the image to the qualifying categories or define an additional heuristic for uniquely mapping the image to a single category.
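A minimal sketch of this fixed label-to-category mapping follows. The word sets are illustrative, and ties are not handled:

```python
# Hypothetical fixed label-to-category mappings.
CATEGORIES = {
    "wildlife": {"giraffe", "elephant", "savanna", "lion"},
    "cityscapes": {"skyline", "skyscraper", "street", "night"},
}

def classify(label_scores, categories=CATEGORIES):
    """Sum the confidence scores of labels that directly match each
    category's word set, then return the highest-scoring category,
    or None if no label matches any category."""
    scores = {
        name: sum(score for label, score in label_scores.items() if label in words)
        for name, words in categories.items()
    }
    best, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best if best_score > 0 else None

labels = {"giraffe": 0.96, "savanna": 0.88, "sky": 0.71}
print(classify(labels))  # wildlife
```

Note that an image whose labels match no category word at all (for example, only "horse") is left unclassified, which is exactly the limitation discussed next.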

The following diagram illustrates such an approach for a small set of predetermined categories.

mapping to predetermined categories

Mapping detected labels to categories works well when you are confident you can anticipate the specific image labels associated with each category—in other words, your set of words defining various predetermined categories is likely to contain a high percentage of returned labels. For example, if images often have some combination of "dog," "cat," or "bird" detected as labels, it's easy to define a predetermined category for "animals" by defining your category using these exact labels. When a wide range of labels is returned for images, it is increasingly difficult to determine how to map these labels to specific categories. In the previous example, if "horse" is detected as a label, but not "dog," "cat," or "bird," the image isn't classified correctly because "horse" isn't part of the word set defining the "animals" category.

Another limitation is that detected image labels might relate to multiple categories, further complicating the process of matching the image to the highest-scoring category. For example, if multiple categories share similar fixed label-to-category mappings, they each receive the same contribution toward their category confidence scores for these image labels. When it's important to place images into unique categories, it's necessary to find unique label mappings for each category to increase differentiation between category scores. However, it can be challenging to identify unique labels for exact matching.

Using word vectors to find the most appropriate category

Depending on the variety of detected labels returned by the Vision API, it can be difficult to create fixed mappings between labels and categories. In this case, you can use another approach that measures conceptual similarities between labels rather than directly comparing label values.

Consider an example where "bird," "parrot," "vertebrate," and "fauna" are returned as detected image labels. These labels can be compared to representative labels associated with the following set of predetermined categories:

Category Label1 Label2 Label3 Label4 Label5 Similar?
Animal cat animal parakeet canine horse Yes
Vehicle automobile truck car tram ship No

Although none of the category labels in this table precisely match the returned image labels, the Animal category is clearly the most conceptually appropriate. In other words, the detected labels are more similar to the words defining the Animal category than they are to the words defining the Vehicle category. You want your system to recognize this similarity and correctly classify the image as an Animal.

Natural language processing techniques often depend on transforming individual words into vector representations. You can mathematically manipulate these vectors to pinpoint correlations between words. Pretrained word embedding dictionaries, such as word2vec or GloVe, have already converted common words into real-number vector representations. Using these vector representations, you can calculate similarity scores between image labels and category labels. The category with the closest similarity score to the image is associated with the image.

Calculating image and category vectors using GloVe

To generate real-number vectors for images, the detected labels are converted into equivalent vector representations using GloVe and then reduced to a single combined vector by summing the individual word vectors. Using the previous example, this method would convert the image's detected labels ("bird," "parrot," "vertebrate," and "fauna") into individual word vectors. The combined word vector for the image is created by summing the individual word vectors for each detected label. In this way, a single image with multiple labels is transformed into a combined vector representing its overall semantic meaning.

A combined vector is also calculated for each category using the sum of word vectors that best represents the category; this combined vector doesn't need to be all the words defining the category, and is often a relevant subset. In order to convert words to vectors for both images and categories, both image labels and category elements must exist in the pretrained GloVe embeddings.

The number of words defining each category can vary. When multiple words are used for a category, the sum of the word vectors approximates a combined semantic for all the words in the category. Using a single word for a category is also valid.
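The combination step can be sketched as follows, using a tiny made-up embedding table standing in for GloVe; the same function serves both image label lists and category word lists:

```python
# A tiny hypothetical embedding table standing in for GloVe vectors.
EMBEDDINGS = {
    "bird":   [0.9, 0.1],
    "parrot": [0.8, 0.2],
    "fauna":  [0.7, 0.3],
    "car":    [0.1, 0.9],
}

def combined_vector(words, embeddings=EMBEDDINGS):
    """Sum the word vectors of all words present in the embedding
    table to form a single combined vector."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    return [sum(dims) for dims in zip(*vectors)]

# "vertebrate" is silently skipped: it isn't in this toy table.
image_vec = combined_vector(["bird", "parrot", "fauna", "vertebrate"])
print(image_vec)
```

Words missing from the embedding table contribute nothing, which is why both image labels and category words must exist in the pretrained embeddings.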

When selecting words to generate the combined vector for a category, you should avoid using words that have similar meaning across categories. Retaining such words makes it hard to disambiguate between categories. Instead, choose words that uniquely describe categories. As a hypothetical example, the words "concrete" and "asphalt" are likely too similar to sufficiently disambiguate between the Buildings and Roads categories. In contrast, "house" and "street" are more likely to be distinctive.

The number and value of the words that define various categories are different in each situation. You can experiment and run tests with different combinations to find the best result.

Calculating similarity between image and category vectors

In order to classify an image with a specific category, you can calculate the cosine similarity between the combined image vector and each combined category vector. The higher the similarity, the better the correlation between image and category. The following diagram illustrates this process:

calculating similarity between image and category vectors

Here, word vectors from GloVe are used to calculate combined image and category vectors. The cosine similarity between the combined image and category vectors is used to judge equivalence. In this case, the cosine similarity is relatively high (0.75), indicating that picture.jpg could be usefully associated with the Animals category.
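The similarity calculation itself is short. A self-contained sketch using made-up two-dimensional vectors (real GloVe vectors have 50 to 300 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical combined vectors for an image and two categories.
image_vec    = [2.4, 0.6]
animals_vec  = [2.0, 0.5]
vehicles_vec = [0.3, 2.1]

for name, vec in [("animals", animals_vec), ("vehicles", vehicles_vec)]:
    print(name, round(cosine_similarity(image_vec, vec), 3))
```

The category whose combined vector yields the highest cosine similarity with the image vector is chosen as the image's category.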

To implement this approach, you deploy a simple TensorFlow model to AI Platform. The model accepts detected labels from the Vision API, converts the detected labels to a combined vector representation, and calculates the cosine similarity for each combined category vector. The TensorFlow model uses pretrained word embeddings to vectorize the image and category labels, requiring no additional training.

The following diagram shows an updated solution architecture that incorporates AI Platform:

architecture that incorporates ML

Triggering the App Engine backend

Triggering the App Engine backend is similar to the previous approach, which used fixed label-to-category mappings. In this approach, a new image uploaded to a Cloud Storage bucket triggers the App Engine endpoint. The triggered endpoint requests label detection for the image using the Vision API and calls a prediction model in AI Platform to convert the image labels to a vector representation. The prediction model then calculates a cosine similarity score for the image vector and each category vector. The category that results in the highest similarity is applied to the image and written to the search index.

Differences between methods

The following screenshots illustrate the difference between deriving categories from fixed label-to-category mappings (the left-hand screenshot) and deriving categories by using word vectors (the right-hand screenshot).

label-to-category mappings word vectors

In this example, Mapped Category: animals (using a fixed label-to-category mapping) is defined by the following set of words: "dog," "cat," "fish," "horse," "animal," "bird," "parrot," and "budgie." For cat.jpg, Vision returned the following labels: "fauna," "wildlife," and "zoo."

When the Mapped Category: animals facet value is selected in the left-hand screenshot, an image of a cat is missing. Mapped Category: animals represents a fixed label-to-category approach. Because no Vision labels match the category elements, the image isn't associated with the animals category, which requires exact matches to calculate category similarity scores. In this case, the category similarity score between image and category is zero—there is no overlap between label and category elements.

The right-hand screenshot, which incorporates GloVe word vectors, includes the cat image in Most Similar Category: animals. Using word vectors, the elements defining Most Similar Category: animals ("animal," "creature," "species," and "pet") are transformed into a combined category vector. Similarly, the image labels returned for cat.jpg ("fauna," "wildlife," and "zoo") are transformed into a combined image vector. Because the cosine similarity between the combined category and image vectors is sufficiently high, they are determined to have legitimate semantic similarity. In other words, the cat.jpg image labels are similar enough to the animals category elements for the TensorFlow model to place the cat image into the correct category, despite the precise "cat" label never being returned.

In both screenshots, the counters shown next to the facet values reflect the total occurrences of a given label within the displayed result set. This explains why the right-hand screenshot displays animals (1) within the Mapped Category facet, while the left-hand screenshot indicates cat.jpg wasn't directly mapped to animals using fixed label-to-category mappings.

Classifying images with AutoML Vision

Although Vision API label detection detects broad sets of categories within an image, your requirements might include categories that the Vision API doesn't detect. These categories might include highly domain-specific labels, such as those for specialized use cases (for example, categorizing proprietary machine parts). In this case, we recommend using AutoML Vision to train a custom image model with a user-provided dataset.

When training custom image models using AutoML Vision, make sure that your training images are suitable for AutoML Vision and are representative of the images used in prediction. For example, an image model intended to detect animals, but trained only with pictures of four-legged creatures, is unlikely to identify a bird as an animal. By comparison, the Vision API's existing label detection is sufficiently broad to account for this difference, either by returning "animal" as a specific label, or by returning related labels from which the correct category could be derived. (See Classifying images with the Vision API earlier in this document.)

The following diagram shows a solution architecture that incorporates AutoML Vision into your image search application. In the example below, the user-defined labels in your training dataset are used directly as image categories.

architecture that incorporates AutoML Vision

As with Vision API classification, in this approach a new image uploaded to a Cloud Storage bucket triggers the App Engine endpoint. The triggered endpoint requests custom label detection for the image using AutoML Vision. The image label with the highest confidence score is assigned as the image category and added to the search index.

AutoML Vision can be used together with the Vision API for combined functionality. For example, you might use Vision API label detection for faceted searching across broad subject matters, and use AutoML Vision to categorize images based on custom image labels.

What's next?