Train embeddings with the Two-Tower built-in algorithm

This guide outlines how to configure and use the Two-Tower built-in algorithm to train embeddings. This step is useful for customers who do not have an in-house solution to train embeddings for Vertex Matching Engine.

Overview

The Two-Tower model trains embeddings with labeled data (pairs of similar objects). It learns to represent items of various types (such as user profiles, search queries, web documents, answer passages, or images) in the same vector space, so that similar or related items are close to each other. The Two-Tower model consists of two encoder towers: the query tower and the candidate tower. These embed independent items into a shared embedding space, enabling retrieval of similarly matched items with Vertex Matching Engine.

A Two-Tower model is trained on pairs of relevant items, where each pair consists of a query document and a candidate document. A document is a collection of arbitrary customer-defined features, such as text, numeric, and categorical features. After training, the Two-Tower built-in algorithm exports two TensorFlow SavedModels (a query encoder and a candidate encoder) that convert the corresponding documents into embeddings. The candidate encoder is used by Vertex Matching Engine to index all the items and serve them with an approximate nearest neighbor solution. The query encoder generates an embedding for a given query item, which is then used to look up similar candidate embeddings in the index.
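As an illustration of this retrieval flow, the following sketch pairs toy query and candidate embeddings with an exhaustive nearest-neighbor search. The names and vectors are made up; real embeddings come from the exported encoders, and Matching Engine performs the search approximately at scale.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy candidate embeddings, standing in for the output of the candidate encoder.
candidate_index = {
    "mission impossible": [0.9, 0.1, 0.0],
    "some comedy":        [0.1, 0.9, 0.2],
}

def retrieve(query_embedding, index):
    # Exhaustive nearest-neighbor search; Matching Engine performs the same
    # operation with an approximate method at scale.
    return max(index, key=lambda name: cosine_similarity(query_embedding, index[name]))

# A toy query embedding, standing in for the output of the query encoder.
print(retrieve([0.8, 0.2, 0.1], candidate_index))  # -> mission impossible
```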

Input data

The Two-Tower built-in algorithm expects two inputs:

  • Training data: Pairs of documents used to train the model. The supported file formats are JSON Lines (JSONL) and TFRecord.
  • Input schema: A JSON file with the schema of the input data, plus additional feature configurations.

Training data

Training data consists of query document and candidate document pairs. Only positive pairs need to be provided, where the query and candidate documents are considered a match. Negative samples are automatically generated by the framework, leveraging non-paired documents that exist in the same batch.
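The in-batch sampling described above can be illustrated with a small sketch; this is a simplification of what the framework does internally:

```python
# For a batch of positive (query, candidate) pairs, every candidate from a
# different pair in the same batch serves as a negative for a given query.
batch = [
    ("q1", "c1"),
    ("q2", "c2"),
    ("q3", "c3"),
]

pairs = []
for i, (query, _) in enumerate(batch):
    for j, (_, candidate) in enumerate(batch):
        label = 1 if i == j else 0  # 1 = positive pair, 0 = in-batch negative
        pairs.append((query, candidate, label))

positives = [p for p in pairs if p[2] == 1]
negatives = [p for p in pairs if p[2] == 0]
print(len(positives), len(negatives))  # 3 positives, 6 negatives
```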

The query and candidate documents consist of a number of user-defined features. The algorithm currently supports the feature types listed below.

  1. Text: List of strings, representing a text feature.
  2. Id: Scalar string, representing the unique id of an item.
  3. Categorical: List of strings, representing a categorical feature. The strings must be a subset of a vocabulary list specified in the input schema.
  4. Numeric: List of floats, representing a numeric feature.
  5. Vector: A fixed-length float vector that is passed through unchanged as an input.

JSONL format

Each line is the JSON representation of a single training instance: a dictionary that contains two keys, query and candidate. These, in turn, contain keys representing the features of each document.

Example:

    {
        "query":
        {
            "movie_genre": ["action"],
        },
        "candidate":
        {
            "ratings": [7.3, 6.0, 8.1],
            "movie_name": ["mission impossible"]
        }
    }
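A file in this format can be generated with any JSON library. For example, in Python (the file name and features here are illustrative):

```python
import json

# Each training instance is one positive (query, candidate) pair.
instances = [
    {
        "query": {"movie_genre": ["action"]},
        "candidate": {
            "ratings": [7.3, 6.0, 8.1],
            "movie_name": ["mission impossible"],
        },
    },
]

with open("training_data.jsonl", "w") as f:
    for instance in instances:
        # One JSON object per line, with no pretty-printing.
        f.write(json.dumps(instance) + "\n")
```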

TFRecord format

Each training instance is a serialized TF Example. Each feature's name is prefixed by query_ or candidate_, to identify the document that the feature belongs to. We recommend you follow the TensorFlow tutorial about TFRecord to generate the TFRecord file.

Example:

    features {
        feature {
            key: "query_movie_genre"
            value {
                bytes_list {
                    value: "action"
                }
            }
        }
        feature {
            key: "candidate_ratings"
            value {
                float_list {
                    value: [7.3, 6.0, 8.1]
                }
            }
        }
        feature {
            key: "candidate_movie_name"
            value {
                bytes_list {
                    value: "mission impossible"
                }
            }
        }
    }
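The query_/candidate_ prefixing can be seen as a flattening of the JSONL structure. The helper below is a hypothetical sketch of that mapping, not part of the algorithm:

```python
def flatten_instance(instance):
    # Flatten {"query": {...}, "candidate": {...}} into a single feature
    # dict whose keys carry the query_/candidate_ prefix used in TF Examples.
    flat = {}
    for document in ("query", "candidate"):
        for name, value in instance[document].items():
            flat[f"{document}_{name}"] = value
    return flat

instance = {
    "query": {"movie_genre": ["action"]},
    "candidate": {"ratings": [7.3, 6.0, 8.1], "movie_name": ["mission impossible"]},
}
print(flatten_instance(instance))
```

In a real pipeline, each flattened string feature would then be stored in a bytes_list and each float feature in a float_list of a tf.train.Example before serialization, as in the example above.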

Input schema

The input schema is a JSON file that describes the schema of the training input, along with the feature configurations. Like the training data, it contains two keys: query and candidate, indicating the input schemas of the query and candidate documents, respectively. The value of each key is itself a dictionary that maps feature names to their feature type and configuration.

For Id, Categorical, and Vector features, a config with additional information is required; for Text features, the config is optional.

  • An Id feature requires:
    • num_buckets: Number of hash buckets of the ID feature. If the number of IDs is small, we recommend that the number of buckets equals the number of IDs.
  • A Categorical feature requires:
    • vocab: A list of all possible values (strings) for this feature. Any unknown (out-of-vocabulary) categorical value that is encountered will be encoded as a separate out-of-vocabulary value.
  • A Text feature can have:
    • embedding_module (optional): A string that specifies the underlying text embedding module (see the Text encoding section in this guide).
  • A Vector feature requires:
    • length: A positive integer that represents the length (dimension) of the vector input.

The following example provides an input schema for the example training data shown in the preceding section.

    {
        "query":
        {
            "movie_genre": {
                "feature_type": "Categorical",
                "config": {
                    "vocab": ["action", "comedy", "drama"]
                }
            }
        },
        "candidate":
        {
            "ratings": {
                "feature_type": "Numeric"
            },
            "movie_name": {
                "feature_type": "Text",
                "config": {
                    "embedding_module": "gs://my_bucket/my_saved_model"
                }
            }
        }
    }
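Before launching a job, it can help to sanity-check training instances against the schema. The validator below is a hypothetical helper, not part of the built-in algorithm; it only enforces the feature-type shapes described above:

```python
def validate_document(document, schema):
    # Check each feature in a document against its declared feature_type.
    for name, value in document.items():
        feature_type = schema[name]["feature_type"]
        if feature_type == "Id":
            assert isinstance(value, str), f"{name}: Id must be a scalar string"
        elif feature_type in ("Text", "Categorical"):
            assert all(isinstance(v, str) for v in value), f"{name}: expected strings"
        elif feature_type in ("Numeric", "Vector"):
            assert all(isinstance(v, (int, float)) for v in value), f"{name}: expected floats"
        if feature_type == "Vector":
            length = schema[name]["config"]["length"]
            assert len(value) == length, f"{name}: expected length {length}"

schema = {
    "movie_genre": {"feature_type": "Categorical",
                    "config": {"vocab": ["action", "comedy", "drama"]}},
    "ratings": {"feature_type": "Numeric"},
}
validate_document({"movie_genre": ["action"], "ratings": [7.3, 6.0, 8.1]}, schema)
print("ok")
```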

Text encoding

Text features often provide the most significant signals for Two-Tower models, so Vertex Matching Engine provides several different options for specifying the text encoders that are best-suited for customers' datasets. A different text encoder can be specified for each text feature. By default, the pretrained NNLM TFHub module is used.

Text encoders accept as input a one-dimensional batch of strings and output a two-dimensional float tensor of embeddings of shape [batch_size, embedding_dimension]. All text encoders for the Two-Tower built-in algorithm, custom-specified or not, must follow this format. Note that text encoders must have their parameters fixed; their weights cannot be updated during training.

There are three possible options for the text embedding module, as shown below.

  • A TFHub path to a text embedding module, which will be wrapped in a hub.KerasLayer. For example, the Universal Sentence Encoder can be specified with https://tfhub.dev/google/universal-sentence-encoder/4.
  • A Cloud Storage path to a TensorFlow SavedModel. This SavedModel will also be wrapped in a hub.KerasLayer. It is important that your project has access to this Cloud Storage path.
  • A preset embedding module that Vertex Matching Engine provides (name starts with @). Presets are pre-trained text embedding modules that have been verified to work with distributed training in the Two-Tower built-in algorithm. Currently, there are two presets: NNLM (@nnlm, the default text encoder) and ALBERT (@albert-base).
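Regardless of which option is used, the encoder must honor the input/output contract described above. The following toy stand-in (hash-based, with no semantic meaning; purely illustrative) shows the expected shape of the mapping:

```python
import hashlib

EMBEDDING_DIMENSION = 4  # real modules use much larger dimensions

def toy_text_encoder(batch_of_strings):
    # Maps a 1-D batch of strings to a [batch_size, embedding_dimension]
    # list of float vectors, mimicking the required encoder signature.
    # Deterministic hash-derived values; these carry no semantic meaning.
    batch = []
    for text in batch_of_strings:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vector = [b / 255.0 for b in digest[:EMBEDDING_DIMENSION]]
        batch.append(vector)
    return batch

embeddings = toy_text_encoder(["mission impossible", "some other movie"])
print(len(embeddings), len(embeddings[0]))  # 2 4
```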

Training

You can use the following commands to perform training with the Two-Tower built-in algorithm; they are shown as they would appear in a notebook. The commands create a CustomJob resource that uses a single CPU machine for training. The Flags section of this guide lists all the flags that can be used during training. For single-node training, the training_steps_per_epoch and eval_steps_per_epoch parameters are optional.

# Set a unique name for the job to run.
JOB_NAME = f'two_tower_cpu_{dataset_name}_{timestamp}'

# URI of the Two-Tower Docker image.
LEARNER_IMAGE_URI = 'us-docker.pkg.dev/vertex-ai-restricted/builtin-algorithm/two-tower'

# The region to run the job in.
REGION = 'us-central1'

# Set the training data and input schema paths.
TRAINING_DATA_PATH = f'gs://cloud-samples-data/vertex-ai/matching-engine/two-tower/{dataset_name}/training_data/*'
INPUT_SCHEMA_PATH = f'gs://cloud-samples-data/vertex-ai/matching-engine/two-tower/{dataset_name}/input_schema.json'

# Set a location for the output.
OUTPUT_DIR = f'gs://{your_bucket_name}/experiment/output'

# Batch size and number of epochs to train.
TRAIN_BATCH_SIZE = 100
NUM_EPOCHS = 10

!gcloud beta ai custom-jobs create \
  --display-name={JOB_NAME} \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,container-image-uri={LEARNER_IMAGE_URI} \
  --region={REGION} \
  --args=--training_data_path={TRAINING_DATA_PATH} \
  --args=--input_schema_path={INPUT_SCHEMA_PATH} \
  --args=--job-dir={OUTPUT_DIR} \
  --args=--train_batch_size={TRAIN_BATCH_SIZE} \
  --args=--num_epochs={NUM_EPOCHS}

Tutorial

For a detailed walkthrough on how to feed more complex inputs to the Two-Tower built-in algorithm and train on Vertex AI, launch the example notebook in Colab. This notebook also demonstrates how to train on GPUs and how to view details and logs in TensorBoard while your job is executing.

Flags

Generic training flags

The most common custom training flags are listed below. For more details, see the guide to creating a CustomJob.

  • worker-pool-spec: The worker pool configuration used by the custom job. You can specify multiple worker pool specs in order to create a custom job with multiple worker pools.

    The spec can contain the following fields, which are listed with corresponding fields in the WorkerPoolSpec API message.

    • machine-type (Required): The machine type for the pool. See the list of supported machines.
    • replica-count: The number of replicas of the machine in the pool.
    • container-image-uri: The Docker image to run on each worker. To use the Two-Tower built-in algorithm, this must be set to us-docker.pkg.dev/vertex-ai-restricted/builtin-algorithm/two-tower:latest.
  • display-name: The name of the job.

  • region: The region where you want the job to run.

Two-Tower model-specific training flags

  • training_data_path (Required): The Cloud Storage URI pattern specifying where training data is stored (for example: gs://<bucket>/<folder>/*).
  • eval_data_path: The Cloud Storage URI pattern specifying where evaluation data is stored (for example: gs://<bucket>/<folder>/*).
  • input_schema_path (Required): The Cloud Storage URI specifying where the JSON input schema is stored (for example: gs://<bucket>/<folder>/<schema_file>).
  • input_file_format: The format of the input file: jsonl (default) or tfrecord.
  • job-dir (Required): The Cloud Storage URI of a directory where you want to store training outputs.
  • candidate_data_path: The Cloud Storage URI pattern specifying where candidate data is stored (for example: gs://<bucket>/<folder>/*). Only needed when you use a non-default metric such as top_k_categorical_accuracy instead of auc. If not set, the candidate set is built from the training and eval data.
  • train_batch_size: The batch size for training. Default value is 100.
  • eval_batch_size: The batch size for evaluation. Default value is 100.
  • eval_split: The fraction of data from training_data_path to hold out for evaluation when eval_data_path is not provided. Default is 0.2 (20% of the training data is used as eval data).
  • metrics: The metrics used to evaluate the model. Can be either auc, top_k_categorical_accuracy or precision_at_1. Default is auc. Evaluation metrics can be found in the log files, or can be loaded into TensorBoard (specifically from gs://<job-dir>/tensorboard/validation), as explained in the TensorBoard documentation.
  • optimizer: The optimizer used to train the model. The lowercase string name of any TensorFlow 2.3 Keras optimizer is supported (sgd, adam, ftrl, etc.). See the TensorFlow optimizer documentation. Default is adagrad.
  • learning_rate: The learning rate for training with the optimizer.
  • momentum: The momentum for sgd or rmsprop optimizer, if specified.
  • num_epochs: The number of epochs for training. Default is 10.
  • num_hidden_layers: The number of hidden layers. By default, there are no hidden layers.
  • num_nodes_hidden_layer{index}: The number of nodes for the hidden layer with that specific index. For example, for the first hidden layer use num_nodes_hidden_layer1. The range of indexes is 1 to 20.
  • output_dim: A single integer representing the output encoding dimension for each tower of the Two-Tower model. Default is 64.
  • cache_input: A boolean indicating whether to cache training and evaluation data in memory. This is recommended only for small datasets.
  • training_steps_per_epoch: The number of steps per epoch to run the training for. Only needed if you are using more than 1 node for training or if your primary training node has more than 1 GPU. Default is None.
  • eval_steps_per_epoch: The number of steps per epoch to run the evaluation for. Only needed if you are using more than 1 node for training or if your primary training node has more than 1 GPU. Default is None.
  • gpu_memory_alloc: The amount of memory allocated per GPU (in MB). Default is no limit.
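To see how the architecture flags combine, the per-tower layer sizes can be sketched as follows. This is a hypothetical helper that only mirrors the flag semantics described above, not the algorithm's own code:

```python
def tower_layer_sizes(flags):
    # Derive the per-tower layer sizes from the model-architecture flags:
    # num_hidden_layers hidden layers, sized by num_nodes_hidden_layer{index},
    # followed by the final encoding layer of size output_dim.
    sizes = []
    for index in range(1, flags.get("num_hidden_layers", 0) + 1):
        sizes.append(flags[f"num_nodes_hidden_layer{index}"])
    sizes.append(flags.get("output_dim", 64))  # final encoding dimension
    return sizes

flags = {
    "num_hidden_layers": 2,
    "num_nodes_hidden_layer1": 256,
    "num_nodes_hidden_layer2": 128,
    "output_dim": 64,
}
print(tower_layer_sizes(flags))  # [256, 128, 64]
```

With no architecture flags set, each tower has no hidden layers and a 64-dimensional output, matching the defaults above.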

What's next