Overview: Extracting and serving feature embeddings for machine learning

This article is part of a series that explores the process of extracting and serving feature embeddings for machine learning (ML). This article describes the concept of feature embeddings and why they're important. It also discusses the domains and use cases for which feature embeddings are relevant, focusing on semantic similarity of text and images. In addition, the article addresses architectures and technologies such as TensorFlow Hub (tf.Hub) that can enable extraction and serving of feature embeddings on Google Cloud at scale.

A second article in this series (Analyzing text semantic similarity using TensorFlow Hub and Dataflow) describes how to perform document semantic-similarity analysis using text embeddings.

Introduction

The objective of ML is to extract patterns from data and use those patterns to make predictions. These predictive patterns represent the relationships between the input data features and the output target to be predicted. Typically, you expect that instances with similar feature values will lead to similar predicted output. Therefore, the representation of these input features directly affects the nature and the quality of the learned patterns.

For example, suppose you want to build a price estimator for housing rentals that's based on house listing data. You need to represent each house with a feature vector of real (numeric) values, where each element of this vector represents the value of a feature of the house. Useful features in this case might include things like property size, age, number of rooms, location, and orientation (the direction that the house faces). Houses with similar features should yield similar prices.

The values assigned to features in the house pricing example either are already numeric or can be converted to numeric features through one-hot encoding. However, feature representation in other cases is more complicated. Consider the following cases:

  • Identifying the topic of an article, given its title or content.
  • Finding similar magazines based on a specific magazine cover.
  • Understanding the sentiment of a customer in regard to a product, given a text-based review.
  • Generating a list of new songs for users, given the ones they listened to in the last week.
  • Suggesting similar fashion items or works of art, given one that the user is currently viewing.
  • Retrieving the most relevant answer from an FAQ list, given a natural language query.

As you can see from these examples, the input data is unstructured or includes features that contain unstructured content. These types of features include text like article titles and contents or customer product reviews; images like magazine covers, fashion items, or works of art; and audio, such as songs. To use these types of data for ML tasks, you need compact real-valued feature vector representations of these types of data. These vectors are called embeddings.

What is an embedding?

An embedding is a translation of a high-dimensional vector into a low-dimensional space. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space.

Text embeddings

Consider the example of text representation. You can represent the words in an English sentence, such as the title of an article, in either of the following ways:

  • As an extremely large (high-dimensional) sparse vector in which each cell represents a separate English word, with perhaps a million elements to represent a million discrete words. The value in a cell represents the number of times that word appears in the sentence. Because a single English sentence is unlikely to use more than 50 words, nearly every cell in the vector will contain a 0.

  • As a comparatively small but dense vector (perhaps only several hundred elements). Each element represents a different characteristic of the word, and each contains a value between 0 and 1 that indicates the extent to which the word represents that characteristic. In effect, the word is semantically encoded using as many attributes as there are in the vector. This vector is an embedding, which tries to capture the semantics of the article's title.
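To make the contrast concrete, here is a minimal sketch of the two representations for one sentence. The tiny vocabulary, the sentence, and the values in the dense vector are invented for illustration; a real dense embedding would be learned by a model.

import numpy as np

vocabulary = ["squad", "team", "win", "victory", "football", "soccer", "banana"]
sentence = "team win soccer"

# Sparse representation: one cell per vocabulary word, counting occurrences.
sparse_vector = np.array([sentence.split().count(word) for word in vocabulary])
print(sparse_vector)   # [0 1 1 0 0 1 0]; mostly zeros, and it grows with the vocabulary

# Dense representation: a short vector of learned characteristics (values made up here).
dense_vector = np.array([0.91, 0.08, 0.77])   # e.g., "sports", "formality", "competition"
print(dense_vector)    # small fixed size, regardless of vocabulary size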

The embedding for a given title is close in the embedding vector space to the embedding of a similar title, even if the titles' wordings are different. For example, "The squad is ready to win the football match" and "The team is prepared to achieve victory in the soccer game" have the same meaning but share almost no vocabulary. But in this embedding representation, they should be close to one another in the embedding space, because their semantic encoding is very similar.
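To make "close in the embedding space" concrete, a common proximity metric is cosine similarity. The following minimal sketch compares two toy vectors; the three-dimensional values are invented for illustration and are not the output of a real encoder.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

title_a = np.array([0.80, 0.10, 0.62])   # toy embedding of the first title
title_b = np.array([0.75, 0.15, 0.58])   # toy embedding of the paraphrased title
print(cosine_similarity(title_a, title_b))   # close to 1.0, so the titles are treated as similar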

Several models—including neural-net language models (NNLM), global vectors for word representation (GloVe), deep contextualized word representations (ELMo), and Word2vec—are designed to learn word embeddings, which are real-valued feature vectors, for each word.

For example, training a Word2vec model by using a large corpus of text, such as the English Wikipedia corpus, produces embeddings that capture meaningful distance and direction between words with semantic relationships, such as male-female, verb tenses, and even country-capital relationships. Figure 1 illustrates the embedding space for some example vectors. (For more detail, see the article Linguistic Regularities in Continuous Space Word Representations.)

Semantic relationships between words
Figure 1. Semantic relationships between words
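You can reproduce relationships like these with pre-trained Word2vec vectors. The following sketch assumes the gensim library and its downloadable word2vec-google-news-300 vectors; it is an illustration, not part of the architecture described later in this article.

import gensim.downloader as api

# Downloads the pre-trained Word2vec vectors (a large file) on first use.
word_vectors = api.load("word2vec-google-news-300")

# The classic analogy: vector("king") - vector("man") + vector("woman") is close to vector("queen").
print(word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))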

The Universal Sentence Encoder encodes text that is longer than a single word, such as sentences, phrases, or short paragraphs, into a single real-valued feature vector. Semantically similar sentences are encoded as vectors that are close to each other in the embedding space.

Image embeddings

Unlike text systems, image processing systems work with rich, high-dimensional datasets in which each image is represented by its raw pixel intensities. However, an image in its raw, dense form might not be very useful for some tasks. For example, take the task of finding similar magazines, given the magazine cover image, or finding photographs similar to a reference photo. Comparing the raw pixels of the input picture (2,048 ✕ 2,048) to another picture to determine whether they are similar is neither efficient nor effective. Extracting lower-dimensional feature vectors (embeddings) for the image, however, provides some indication of what the image contains and can lead to a better comparison.

The idea is to train an image classification model, such as Inception, Deep Residual Learning (ResNet), or Network Architecture Search (NASNet), on a large image dataset (for example, ImageNet). You then use the model without the last softmax classifier layer to extract a feature vector for a given input image. This type of feature vector can effectively represent the image in search or similarity-matching tasks. Feature vectors can also function as an additional input feature (vector) along with other features for an ML task. For example, in a system that recommends fashion items to someone who's shopping for clothes, you might have attributes that describe individual items, including color, size, price, type, and subtype. All of these features can be used in your recommender model, along with the features extracted from the fashion item images.
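As a minimal sketch of this idea (using tf.keras and a placeholder image file name), the following code loads InceptionV3 without its softmax classifier and uses average pooling to produce a 2,048-dimensional feature vector for one image.

import numpy as np
import tensorflow as tf

# InceptionV3 trained on ImageNet, with the classifier head removed (include_top=False).
base_model = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

image = tf.keras.preprocessing.image.load_img("magazine_cover.jpg", target_size=(299, 299))
pixels = tf.keras.preprocessing.image.img_to_array(image)[np.newaxis, ...]
pixels = tf.keras.applications.inception_v3.preprocess_input(pixels)

embedding = base_model.predict(pixels)   # shape (1, 2048): the image's feature vector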

The t-SNE Map experiment illustrated in Figure 2 is one of the examples that uses embeddings to find similar artworks. These images are drawn from content in some of the Google Arts & Culture collections.

Images mapped for similarity using t-SNE, based on Google ML
Figure 2. Images mapped for similarity using t-SNE, based on Google ML

Similarly, for audio data, lower-dimensional feature vectors can be extracted from high-dimensional power spectral density coefficients. These feature vectors can then be used effectively in various search tasks, recommendation apps, and other ML-based applications.
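For example, one common approach (sketched below, assuming the librosa library and a placeholder audio file) is to compute mel-frequency cepstral coefficients from the signal and average them into a compact track-level vector.

import numpy as np
import librosa

signal, sample_rate = librosa.load("song.wav")                     # placeholder audio file
mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=20)   # (20, n_frames) spectral features
track_embedding = np.mean(mfcc, axis=1)                            # compact 20-dimensional vector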

Collaborative filtering and embeddings

Embeddings were originally motivated by the task of collaborative filtering (CF). CF is a technique used by recommender systems, where the task is to make predictions about a user's interests based on the interests of many other users. As an example, imagine the task of recommending movies. Suppose you have 1,000,000 users, a catalog of 500,000 movies, and records of which movies each user has watched. For purposes of this example, you don't have (or won't use) any other information about the movies or the users, just the user's ID, the movie ID, and a watched flag. Together, this information creates a very sparse dataset.

You want to determine which movies are similar to each other based on users' watching history. You can do this by embedding the movies into a low-dimensional space in which movies that have been watched by the same user are near one another in the "movie preference" (embedding) space. You also create embeddings for users in the same space, based on the movies that they've watched. In the resulting space, user embeddings are close to the embeddings of movies that they've watched. This method enables you to recommend other movies based on those movies' proximity to a user embedding, because nearby users and movies share preferences.

Embeddings like those for users and movies can be learned using techniques like matrix factorization, singular value decomposition, neural collaborative filtering, and neural factorization machines. Figure 3 shows an example of movie embedding in two-dimensional space. The first dimension describes whether the movie is for children (negative values) or adults (positive values), while the second dimension represents the degree to which each movie is a blockbuster (positive values) or an art-house movie (negative values). (For this discussion, the assignment of negative and positive values is arbitrary and is used only to determine coordinates.) In this example, movies like "Shrek" and "Incredibles" are near each other in the embedding space, because both are considered movies for children and both are blockbusters.

The ML algorithm learns these dimension values for each movie without knowing that they are factors relevant to adults versus children, or blockbuster versus art house. For more about how a system like this is built, see Building a Recommendation System with TensorFlow.

Example of movie embedding in two-dimensional space
Figure 3. An example of movie embedding in two-dimensional space
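The following is a minimal tf.keras sketch of how user and movie embeddings can be learned from watch records alone. The embedding dimension and the sigmoid output are illustrative choices, not values taken from this article.

import tensorflow as tf

num_users, num_movies, embedding_dim = 1000000, 500000, 32

user_id = tf.keras.Input(shape=(1,), dtype="int32")
movie_id = tf.keras.Input(shape=(1,), dtype="int32")

# One embedding table for users and one for movies, learned jointly.
user_vector = tf.keras.layers.Flatten()(
    tf.keras.layers.Embedding(num_users, embedding_dim)(user_id))
movie_vector = tf.keras.layers.Flatten()(
    tf.keras.layers.Embedding(num_movies, embedding_dim)(movie_id))

# The dot product measures user-movie affinity; a sigmoid turns it into a watch probability.
affinity = tf.keras.layers.Dot(axes=1)([user_vector, movie_vector])
watched_probability = tf.keras.layers.Activation("sigmoid")(affinity)

model = tf.keras.Model(inputs=[user_id, movie_id], outputs=watched_probability)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit([user_ids, movie_ids], watched_flags, ...) with your sparse watch records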

Other types of embeddings

Embedding-based learning can also be used to represent complex data structures, such as a node in a graph, or a whole graph structure, with respect to the graph connectivity. This is useful in linkage-intensive domains, such as drug design, friendship recommendation in social networks, and protein-to-protein interaction detection. Techniques for these types of tasks include graph convolutional neural networks and graph matrix completion.

Another application of embedding is multimodal learning, where the objective is to translate information from one modality to another. For example, in automatic image captioning (illustrated in Figure 4), captions can be automatically generated for an image. (As shown in the image, the translation might not be perfect; for example, one of the captions is generated as "fruit salad" instead of something like "fruit stand.") In addition, users can search for images that match a natural language description, or they can generate music for a given piece of video footage. To enable this type of matching, you can encode these multimodal entities in the same embedding space.

A process flow showing automatic image captioning.
Figure 4. Automatic image captioning: generating a caption from embeddings of a vision model

The idea of learning embeddings that represent domain entities based on their context (for example, the way Word2vec learns word embeddings from surrounding words) can be applied in other domains, so that entities with similar contexts have similar embeddings. Examples include creating embeddings from documents (Doc2vec), from customer data (Customer2vec), from video files (Movie2vec), from graphs (Graph2vec), from genetic mappings (Gene2vec), from protein structures (Protein2vec), and others.

To summarize, embeddings can be extracted to represent:

  • Unstructured data, such as text (words, sentences, and entire docs), images, audio, and so on.
  • Entities that have no input features, only interaction context, such as a user ID and the list of movies that the user has watched.
  • Complex-structure data, such as graphs and networks. Examples include social networks and biochemical compounds.
  • Multimodal translation, such as captioning images and searching for images using a text description.
  • Sparse features (by converting them into dense features), such as location and occupation.
  • Entities with high dimensionality (by converting them into a more compact representation), such as customer records with 300 or more demographic, social, financial, and behavioral attributes.

Use cases for embedding

Representing entities like fashion items, movies, holiday properties, and news articles as compact, real-valued feature vectors (embeddings) that encode key traits enables a set of interesting usage scenarios. This section discusses some use cases for these entity embeddings.

Similarity analysis

After your entities have been encoded in low-dimensional, real-valued feature vectors, you can perform exploratory analytics and visualization to discover interesting information and patterns. This can include:

  • Finding the nearest items to a given one. For example, as mentioned earlier, if you have a repository of news articles or research papers, you can find documents that are most similar to a given one. This technique also allows you to label the article with the topics of its neighbor articles.
  • Identifying groups of similar items in the embedding space, such as trial products, movies, or property listings. You can then study the common customer characteristics of each group in order to adjust a marketing campaign.
  • Finding the density of items in a particular item's neighborhood. This helps you to identify hot or trending topics, types of fashion items, music genres, and so on.
  • Identifying boundary or intergroup items as well as exceptional or outlier items. This can be useful for fraud detection.
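As a minimal sketch of the nearest-item lookup described in the first bullet above (using synthetic vectors in place of real article embeddings and scikit-learn's exact nearest-neighbor search):

import numpy as np
from sklearn.neighbors import NearestNeighbors

article_embeddings = np.random.rand(10000, 128)   # stand-in for precomputed embeddings

index = NearestNeighbors(n_neighbors=6, metric="cosine")
index.fit(article_embeddings)

# The nearest neighbor of an article is itself, so request 6 and skip the first result.
distances, neighbors = index.kneighbors(article_embeddings[0].reshape(1, -1))
print(neighbors[0][1:])   # indices of the 5 most similar articles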

Search and retrieval

Searching for similar or relevant items is one of the most important and widely used scenarios for embeddings. This application can be implemented in two different flavors: reactive and proactive.

In reactive retrieval, the user provides an item to search, and you retrieve the items that are similar to it from your data stores. For example, you might find pet pictures that are similar to the user's pet picture, or you might search for research papers similar to an article that the user provides. In this case, your system extracts the embedding of the user's item online. Then the system retrieves the items from your data store whose embeddings are most similar to the input embedding, based on a proximity metric.

In proactive retrieval, your system automatically identifies the context and converts it to a search query in order to retrieve the most similar items. For example, you might have a library of documentary videos, and your application allows users to search and upload videos. When the user logs in, the context here is just the user's ID (perhaps with some other optional attributes, such as date/time and location). This information is used as the search query to retrieve the most relevant videos for that user, based on the similarity between the user embedding and the video embeddings.

In both reactive and proactive retrieval, your system should be able to convert the implicit or explicit query to an embedding, compare that embedding to the embedding of the stored items, and retrieve the most similar ones.

Additional examples of how embeddings can be used for search and retrieval include:

  • Retrieving the most relevant news articles for a given search query.
  • Retrieving the pictures that are most similar to a user-provided picture.
  • Finding a music track most similar to the one that the user provides.
  • Finding a list of relevant games or apps, given one that the user recently installed.
  • Retrieving the most relevant answer from an FAQ list, given a natural language query.
  • Discovering new, interesting movies or songs for a given user.

Machine transfer learning

Many use cases involve encoding sparse, complex, high-dimensional, or unstructured data into embeddings to train ML models.

Using pre-trained embeddings to encode text, images, or other types of input data into feature vectors is referred to as transfer learning. In essence, transfer learning transfers information from one ML task to another one. A typical use case is to use a model trained on large amounts of data for a task where you have less data. For example, you can use pre-trained text embeddings that are trained on a large text corpus of hundreds of millions of words and sentences to train a sentiment classification model where you only have 10,000 customer reviews of a product.

With transfer learning, not only do you reuse the knowledge (embeddings) extracted from large amounts of data, but you also save thousands of CPU and GPU computing hours that would otherwise be needed to train these embeddings.

You can use feature embeddings along with other features to train your ML models. An example use case that was noted earlier is estimating the rental price of a holiday property, given:

  • Base features of the property, such as size, number of rooms, orientation (for example, facing the beach), location, average prices in the area, rental dates, season, and so on.
  • Embedded features, such as natural language descriptions of the property, and images of the property.

Embeddings can be incorporated into the model in two ways. The first way is by extracting the embeddings and using them as input feature vectors. You concatenate the base input features to the pre-trained embeddings (which you earlier extracted in an offline preprocessing step) to form one input feature vector. The key point is that these embeddings are not trainable—they are not tuned as part of training this model, and instead are treated as inputs. This approach is illustrated in Figure 5.

Using pre-trained embeddings as input
Figure 5. Using pre-trained embeddings as input
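A minimal sketch of this first approach follows, using randomly generated stand-ins for the base features and for the embeddings that were extracted offline. In a real pipeline, the embedding columns come from the preprocessing step and are not updated during training.

import numpy as np
import tensorflow as tf

n = 1000
base_features = np.random.rand(n, 10).astype("float32")       # size, rooms, location, ...
text_embeddings = np.random.rand(n, 512).astype("float32")    # pre-extracted, not trainable
prices = np.random.rand(n, 1).astype("float32")

# Concatenate base features and embeddings into one input feature vector.
inputs = np.concatenate([base_features, text_embeddings], axis=1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(inputs, prices, epochs=1, batch_size=32)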

Alternatively, you can include the embeddings as trainable layers in the model. In this approach, you include pre-trained embedding layers for the respective modalities in your model architecture, which lets you fine-tune the embedding weights during training for the model's task. As shown in Figure 6, the text and image features are fed into the model in their raw form, without being pre-embedded. The pre-trained embeddings are plugged in as trainable layers, and their outputs are joined with the base input features to form one input feature vector that feeds the model's internal architecture.

Using pre-trained embeddings as trainable layers
Figure 6. Using pre-trained embeddings as trainable layers

The second approach allows you to tune the embeddings with respect to the model task to produce more effective models. However, the model size is usually much larger than in the first approach, because it contains the embedding weights.

Embedding modules in TensorFlow Hub

TensorFlow Hub (tf.Hub) is a library of reusable ML modules. These modules can be pre-trained models or embeddings that are extracted from text, images, and so on.

More precisely, a module is a self-contained piece of a TensorFlow graph, along with its weights and assets, that can be reused across different tasks. By reusing a module, you can train a model using a smaller dataset, improve generalization, or simply speed up training. Each module has an interface that allows it to be used in a replaceable way, with little or no knowledge of its internals. Modules can be applied like an ordinary Python function to build part of the TensorFlow graph, or used as a feature column in your TensorFlow estimator.

The following code snippet shows how to use the Universal Sentence Encoder (version 2) module as a function that gets the embedding of a given input text. You can use this code if you are preprocessing text to embeddings for similarity analysis or search and retrieval, or prior to training an ML model (as shown previously in Figure 5).

import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
review_embeddings = embed(["The movie was super interesting!"])

The following code snippet shows how to use the Universal Sentence Encoder (version 2) module as a text_embedding_column to encode an input feature (review) and use it in the DNNClassifier estimator. You can use this code to have the embedding as a trainable layer in your ML model (as shown in Figure 6).

import tensorflow as tf
import tensorflow_hub as hub

review_column = hub.text_embedding_column("review",
    "https://tfhub.dev/google/universal-sentence-encoder/2", trainable=True)
estimator = tf.estimator.DNNClassifier(hidden_units=[512, 128],  # example layer sizes
                                       feature_columns=[review_column])

As you can see, the module URL specifies the publisher, the module name, and the version. Once modules are downloaded to disk, they are self-contained, and they can be used by developers who do not have access to the code and data that was used to create and train the module. In addition, if you set the trainable parameter to True, the weights of the embeddings are tuned as you train your model.

Figure 7 shows a list of text embedding and image feature vector modules available in tf.Hub. These modules are developed, trained, and published by Google.

Text embedding and image feature vector modules in tf.Hub
Figure 7. Text embedding and image feature vector modules in tf.Hub

High-level architectures

This section illustrates architectures for extracting embeddings, implementing similarity matching, and using ML at scale on Google Cloud.

Extracting embeddings from documents or images

If you have text documents or an image dataset, and you want to perform similarity analysis or build a real-time search and retrieval system, you need to convert your unstructured items (docs or images) into real-valued feature vectors. You can do this by using pre-trained tf.Hub modules. The tf.Hub module is not used as a trainable part of an ML model. Instead, it's used in an extract, transform, load (ETL) process to convert input data (text or image) to an output feature vector (embedding).

If you have a large input dataset (a corpus of millions of docs or images), you need to extract these embeddings at scale. To do this, you can use Dataflow, which is a fully managed, serverless, reliable service for running data processing pipelines at scale on Google Cloud. You implement the pipeline using Apache Beam, an open source unified programming model that runs both streaming and batch data processing jobs. For a detailed example, see Analyzing text semantic similarity using TensorFlow Hub and Dataflow.

Figure 8 shows how Apache Beam is used to implement an ETL pipeline to extract embeddings.

High-level architecture for extracting embeddings at scale
Figure 8. High-level architecture for extracting embeddings at scale

The figure shows the following flow:

  1. Read raw data from Cloud Storage.
  2. Process each item by calling a tf.Hub module, which returns the feature vector that represents the item. For example, if you process news articles, you can use the Universal Sentence Encoder to convert the title into a real-valued feature vector.
  3. Write the encoded data to Cloud Storage.

The data can be written as a .tfrecords file if you are planning to use the embeddings for training a TensorFlow ML model, as described earlier in the first approach under Machine transfer learning. In addition, you can use the extracted embeddings in BigQuery for exploratory similarity analysis.
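The following is a simplified Apache Beam sketch of that flow. It is not the pipeline from the companion article: the bucket paths are placeholders, it uses the TF2-style hub.load API with Universal Sentence Encoder v4, and it writes embeddings as text lines rather than as .tfrecords for brevity.

import apache_beam as beam
import tensorflow_hub as hub

class EmbedTitle(beam.DoFn):
    def setup(self):
        # Load the tf.Hub module once per worker, not once per element.
        self._embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    def process(self, title):
        vector = self._embed([title]).numpy()[0]
        yield "%s\t%s" % (title, ",".join(str(value) for value in vector))

with beam.Pipeline() as pipeline:
    (pipeline
     | "ReadTitles" >> beam.io.ReadFromText("gs://your-bucket/titles.txt")
     | "ExtractEmbeddings" >> beam.ParDo(EmbedTitle())
     | "WriteEmbeddings" >> beam.io.WriteToText("gs://your-bucket/embeddings/part"))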

Matching and retrieving similar embeddings

If you are building an image search app that accepts an image provided by the user and retrieves the most similar images in your repository, you need to perform several tasks, which take place in different parts of your overall architecture:

  1. Accept the user-provided image, using a frontend app.
  2. Convert the input image to a feature vector in a backend app, using a tf.Hub image module (for example, inception_v3/feature_vector).
  3. Match the extracted feature vector of the input image against the image embeddings stored in your database, and retrieve the most similar ones. This is also done in the backend app.

The challenge is to organize and store the feature vectors in a way that minimizes the matching and retrieval time. You don't want to compare the input query feature vector against every vector in your repository in order to find the most similar ones. For applications in which retrieval and ranking of best matches must be fast, feature vectors must be organized using some sort of indexing mechanism to reduce the number of comparisons. Approximate nearest-neighbor techniques like locality-sensitive hashing, k-d trees, vantage-point trees, and hierarchical clustering can help to improve search performance.
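As a minimal sketch of approximate nearest-neighbor matching (using the Annoy library, which is one of several options, and synthetic vectors in place of real image embeddings):

import numpy as np
from annoy import AnnoyIndex

dim = 2048                                  # e.g., the size of an InceptionV3 feature vector
index = AnnoyIndex(dim, "angular")          # angular distance approximates cosine similarity

# Add each stored image embedding under its item ID (IDs map to Cloud Storage URLs elsewhere).
for item_id in range(10000):
    index.add_item(item_id, np.random.rand(dim).tolist())
index.build(10)                             # 10 trees; more trees improve recall at a memory cost

query_embedding = np.random.rand(dim)       # embedding of the user-provided image
similar_ids = index.get_nns_by_vector(query_embedding.tolist(), 10)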

The feature vector index is maintained in memory, while the actual content (the references) is maintained in a key/value store, where the key is a hash of the feature vector and the value is the Cloud Storage URL of the doc or image. This provides low-latency retrieval. The flow of this approach is shown in Figure 9.

High-level architecture for storing and serving embeddings
Figure 9. High-level architecture for storing and serving embeddings

Training and serving ML models

As discussed in the embeddings usage scenarios section, pre-trained embeddings can be useful for training ML models and improving their generalizability, especially if you have small training datasets or you want to speed up training on large datasets. tf.Hub modules were trained on large datasets, a process that consumed thousands of GPU hours. In addition, you can include a tf.Hub module as part of your model so that its weights can be fine-tuned if needed.

To manage a model that includes tf.Hub modules at scale, you can use AI Platform. AI Platform is a serverless platform that can train, tune (using the hyperparameter tuning functionality), and serve TensorFlow models at scale with minimal DevOps management required. AI Platform supports deploying trained models as REST APIs for online predictions, as well as submitting batch prediction jobs. The process is shown in Figure 10.

High-level architecture for training and serving TensorFlow models
Figure 10. High-level architecture for training and serving TensorFlow models

What's next