Train embeddings by using the Two-Tower built-in algorithm

If you don't have an in-house solution for training embeddings for Vertex AI Matching Engine, you can use the Two-Tower built-in algorithm instead.

Overview

The Two-Tower model trains embeddings by using labeled data. The Two-Tower model pairs similar types of objects, such as user profiles, search queries, web documents, answer passages, or images, in the same vector space, so that related items are close to each other. The Two-Tower model consists of two encoder towers: the query tower and the candidate tower. These towers embed independent items into a shared embedding space, which lets Matching Engine retrieve similarly matched items.

To train a Two-Tower model, Google uses pairs of relevant items. Each pair consists of a query document and a candidate document. Documents contain arbitrary customer-defined features including text, numeric, and categorical features. After training, the Two-Tower built-in algorithm exports two TensorFlow SavedModels—a query encoder and a candidate encoder—that convert the corresponding documents into embeddings. Given a query item, Matching Engine uses the query encoder to generate a query embedding, and uses the index to find similar candidate embeddings. Matching Engine uses the candidate encoder to index all the items and serve them by using an approximate nearest neighbor solution.
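
The retrieval flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vectors are made-up stand-ins for encoder outputs, the extra movie titles are invented, `top_k_candidates` is a hypothetical helper, and an exhaustive dot-product scan stands in for the approximate nearest neighbor index that Matching Engine uses.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_candidates(
    query_embedding: List[float],
    candidate_embeddings: Dict[str, List[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the k best.

    Exhaustive scan for illustration only; a real index would use an
    approximate nearest neighbor search instead.
    """
    scored = [
        (candidate_id, dot(query_embedding, embedding))
        for candidate_id, embedding in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up values standing in for the candidate encoder's output.
candidate_embeddings = {
    "mission impossible": [0.9, 0.1, 0.0],
    "the hangover": [0.1, 0.9, 0.1],
    "die hard": [0.8, 0.2, 0.1],
}

# Made-up value standing in for the query encoder's output.
query_embedding = [1.0, 0.0, 0.0]

results = top_k_candidates(query_embedding, candidate_embeddings, k=2)
print(results)
```

Because both towers map into the same vector space, the query never needs to be compared against raw candidate features, only against precomputed candidate embeddings.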

Input Data

The Two-Tower built-in algorithm expects two inputs:

  • Training data: Pairs of documents used to train the model. The supported file formats are JSON Lines and TFRecord.
  • Input schema: A JSON file with the schema of the input data, plus feature configurations.

Training data

Training data consists of query document and candidate document pairs. You need to provide only positive pairs, where the query and candidate documents are considered a match.

Query and candidate documents consist of several user-defined features. The algorithm currently supports the following feature types:

Feature type    Description
Text            A list of strings representing a text feature.
Id              A scalar string, representing the unique ID of an item.
Categorical     A list of strings, representing a categorical feature. The strings must be a subset of a vocabulary list specified in the input schema.
Numeric         A list of floats, representing a numeric feature.
Vector          A fixed-length float vector that is input as is.

JSON Lines format

Each line is a JSON string of a single instance of the training data. The format is a dictionary that contains two keys: query and candidate. These keys, in turn, contain keys representing the features of each document.

Example:

    {
        "query":
        {
            "movie_genre": ["action"]
        },
        "candidate":
        {
            "ratings": [7.3, 6.0, 8.1],
            "movie_name": ["mission impossible"]
        }
    }
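
The example above is spread over several lines for readability, but in the file itself each instance occupies a single line. As a rough sketch, the following Python serializes a list of instances into that layout. The helper name `to_json_lines` is hypothetical, and the check it performs is a local convenience, not part of the algorithm.

```python
import json


def to_json_lines(instances):
    """Serialize training instances, one JSON object per line."""
    lines = []
    for instance in instances:
        # Every instance needs both documents of the positive pair.
        if set(instance) != {"query", "candidate"}:
            raise ValueError("each instance needs exactly 'query' and 'candidate' keys")
        lines.append(json.dumps(instance))
    return "\n".join(lines) + "\n"


instances = [
    {
        "query": {"movie_genre": ["action"]},
        "candidate": {
            "ratings": [7.3, 6.0, 8.1],
            "movie_name": ["mission impossible"],
        },
    },
]

jsonl_text = to_json_lines(instances)
print(jsonl_text, end="")
```

The resulting text can be written to a local file and then copied to the Cloud Storage location that training_data_path points to.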

TFRecord format

Each training instance is a serialized TF example. A prefix of query_ or candidate_ identifies the document that the feature belongs to.

We recommend that you follow the TensorFlow tutorial about TFRecord to generate the TFRecord file.

Example:

    features {
        feature {
            key: "query_movie_genre"
            value {
                bytes_list {
                    value: "action"
                }
            }
        }
        feature {
            key: "candidate_ratings"
            value {
                float_list {
                    value: [7.3, 6.0, 8.1]
                }
            }
        }
        feature {
            key: "candidate_movie_name"
            value {
                bytes_list {
                    value: "mission impossible"
                }
            }
        }
    }
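
The naming convention in this example can be derived mechanically from a JSON Lines style instance. The sketch below shows only that renaming step, with a hypothetical helper `flatten_instance`. Serializing the flattened features into a tf.train.Example is left to the TensorFlow tutorial mentioned above.

```python
def flatten_instance(instance):
    """Map nested query/candidate features to prefixed feature keys."""
    flat = {}
    for document in ("query", "candidate"):
        for name, value in instance.get(document, {}).items():
            # The prefix records which document the feature belongs to.
            flat[f"{document}_{name}"] = value
    return flat


instance = {
    "query": {"movie_genre": ["action"]},
    "candidate": {
        "ratings": [7.3, 6.0, 8.1],
        "movie_name": ["mission impossible"],
    },
}

flat_features = flatten_instance(instance)
for key, value in flat_features.items():
    print(key, value)
```

String values would then go into a bytes_list and float values into a float_list, as in the example above.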

Input schema

The input schema is a JSON file that describes the schema of the training input, along with the feature configurations. Like the training data, the input schema contains two keys: query and candidate, indicating the query and candidate documents' input schemas, respectively. The values associated with these keys are themselves a dictionary, mapping the feature names to their feature type and configuration.

For the Id, Categorical, Text, and Vector features, the feature's schema entry takes a config object with the following additional information:

  • An Id feature requires:
    • num_buckets: The number of hash buckets of the ID feature. If the number of IDs is small, we recommend that the number of buckets equal the number of IDs.
  • A Categorical feature requires:
    • vocab: A list of all possible values (strings) for this feature. Any unknown (out-of-vocabulary) categorical value that is encountered is encoded as a separate out-of-vocabulary value.
  • A Text feature can have:
    • embedding_module: Optional: A string that specifies the underlying text embedding module. For more information, see Text encoding.
  • A Vector feature requires:
    • length: A positive integer that represents the length (dimension) of the vector input.

The following example shows an input schema.

    {
       "query":
       {
          "movie_genre": {
            "feature_type": "Categorical",
            "config": {
              "vocab": ["action", "comedy", "drama"]
            }
          }
       },
       "candidate":
       {
          "ratings": {
            "feature_type": "Numeric"
          },
          "movie_name": {
            "feature_type": "Text",
            "config": {
               "embedding_module":"gs://my_bucket/my_saved_model"
            }
          }
       }
    }
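
Before you upload a schema, it can help to check it locally against the configuration requirements listed earlier. The following is a minimal sketch of such a check, not the validation that the algorithm itself performs. The function name `check_input_schema` and the message wording are assumptions made for illustration.

```python
# Config keys that the feature types above list as required.
REQUIRED_CONFIG = {
    "Id": ["num_buckets"],
    "Categorical": ["vocab"],
    "Vector": ["length"],
}
KNOWN_TYPES = {"Text", "Id", "Categorical", "Numeric", "Vector"}


def check_input_schema(schema):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    for document in ("query", "candidate"):
        if document not in schema:
            problems.append(f"missing top-level key: {document}")
            continue
        for name, spec in schema[document].items():
            feature_type = spec.get("feature_type")
            if feature_type not in KNOWN_TYPES:
                problems.append(f"{document}.{name}: unknown feature_type {feature_type!r}")
                continue
            config = spec.get("config", {})
            for key in REQUIRED_CONFIG.get(feature_type, []):
                if key not in config:
                    problems.append(
                        f"{document}.{name}: {feature_type} feature requires config.{key}"
                    )
    return problems


schema = {
    "query": {
        "movie_genre": {
            "feature_type": "Categorical",
            "config": {"vocab": ["action", "comedy", "drama"]},
        }
    },
    "candidate": {
        "ratings": {"feature_type": "Numeric"},
        "movie_name": {
            "feature_type": "Text",
            "config": {"embedding_module": "gs://my_bucket/my_saved_model"},
        },
    },
}

print(check_input_schema(schema))
```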

Text encoding

Text features often provide the most significant signals for Two-Tower models. Matching Engine lets you specify a text encoder in several ways, and you can specify a different text encoder for each text feature. The default text encoder is the pretrained NNLM TFHub module.

Text encoders accept as input a one-dimensional batch of strings and output a two-dimensional float tensor of embeddings of shape [batch_size, embedding_dimension]. All text encoders for the Two-Tower built-in algorithm must follow this format. Text encoders must have their parameters fixed; their weights can't be updated during training.

The following three options for the text embedding module are possible:

  • A TFHub path to a text embedding module, wrapped in a hub.KerasLayer. For example, to specify the universal-sentence-encoder, use https://tfhub.dev/google/universal-sentence-encoder/4.
  • A Cloud Storage path to a TensorFlow SavedModel. This SavedModel is also wrapped in a hub.KerasLayer. It is important that your project has access to this Cloud Storage path.
  • A preset embedding module that Matching Engine provides (name starts with @). Presets are pre-trained text embedding modules that are verified to work with distributed training in the Two-Tower built-in algorithm. The following are two presets:
    • NNLM (@nnlm, the default text encoder)
    • ALBERT (@albert-base)
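
To make the three options concrete, the sketch below builds schema entries for several text features, one per option plus one that falls back to the default encoder. The feature names (`title`, `plot`, `tagline`) and the helper `text_feature` are hypothetical. The module values are the ones mentioned in this section.

```python
import json


def text_feature(embedding_module=None):
    """Build a Text feature entry; omit the config to use the default encoder."""
    spec = {"feature_type": "Text"}
    if embedding_module is not None:
        spec["config"] = {"embedding_module": embedding_module}
    return spec


candidate_schema = {
    # Option 1: a TFHub path.
    "title": text_feature("https://tfhub.dev/google/universal-sentence-encoder/4"),
    # Option 2: a Cloud Storage path to a SavedModel that your project can read.
    "plot": text_feature("gs://my_bucket/my_saved_model"),
    # Option 3: a preset, whose name starts with @.
    "tagline": text_feature("@albert-base"),
    # No embedding_module: the default text encoder is used.
    "movie_name": text_feature(),
}

print(json.dumps(candidate_schema, indent=2))
```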

Training

To perform single-node training with the Two-Tower built-in algorithm, use the following command. This command creates a CustomJob resource that uses a single CPU virtual machine (VM) instance. For information about the flags you can use during training, see Flags. With single-node training, the parameters training_steps_per_epoch and eval_steps_per_epoch are optional.

    # This snippet is meant to run in a notebook. It assumes that dataset_name,
    # timestamp, and your_bucket_name are already defined in earlier cells.

    # Set a unique name for the job to run.
    JOB_NAME = f'two_tower_cpu_{dataset_name}_{timestamp}'

    # URI of the Two-Tower Docker image.
    LEARNER_IMAGE_URI = 'us-docker.pkg.dev/vertex-ai-restricted/builtin-algorithm/two-tower'

    # The region to run the job in.
    REGION = 'us-central1'

    # Set the training data and input schema paths.
    TRAINING_DATA_PATH = f'gs://cloud-samples-data/vertex-ai/matching-engine/two-tower/{dataset_name}/training_data/*'
    INPUT_SCHEMA_PATH = f'gs://cloud-samples-data/vertex-ai/matching-engine/two-tower/{dataset_name}/input_schema.json'

    # Set a location for the output.
    OUTPUT_DIR = f'gs://{your_bucket_name}/experiment/output'

    # Batch size and number of epochs to train.
    TRAIN_BATCH_SIZE = 100
    NUM_EPOCHS = 10

    # Create the CustomJob. Each --args entry is passed through to the trainer.
    !gcloud beta ai custom-jobs create \
        --display-name={JOB_NAME} \
        --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,container-image-uri={LEARNER_IMAGE_URI} \
        --region={REGION} \
        --args=--training_data_path={TRAINING_DATA_PATH} \
        --args=--input_schema_path={INPUT_SCHEMA_PATH} \
        --args=--job-dir={OUTPUT_DIR} \
        --args=--train_batch_size={TRAIN_BATCH_SIZE} \
        --args=--num_epochs={NUM_EPOCHS}
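
Outside a notebook, the same command line can be assembled in plain Python. The following is a sketch under stated assumptions: `build_custom_job_command` is a hypothetical helper, the display name is made up, and the `gs://<bucket>/...` values are placeholders to replace with your own paths. It only builds and prints the command and does not submit a job.

```python
import shlex


def build_custom_job_command(
    display_name,
    region,
    image_uri,
    trainer_args,
    machine_type="n1-standard-8",
    replica_count=1,
):
    """Assemble the gcloud argument list for a single worker pool."""
    command = [
        "gcloud", "beta", "ai", "custom-jobs", "create",
        f"--display-name={display_name}",
        "--worker-pool-spec="
        f"machine-type={machine_type},"
        f"replica-count={replica_count},"
        f"container-image-uri={image_uri}",
        f"--region={region}",
    ]
    # Each trainer flag is forwarded to the container through its own --args entry.
    for flag, value in trainer_args.items():
        command.append(f"--args=--{flag}={value}")
    return command


trainer_args = {
    "training_data_path": "gs://<bucket>/<folder>/*",
    "input_schema_path": "gs://<bucket>/<folder>/<schema_file>",
    "job-dir": "gs://<bucket>/<output_folder>",
    "train_batch_size": 100,
    "num_epochs": 10,
}

command = build_custom_job_command(
    display_name="two_tower_cpu_example",
    region="us-central1",
    image_uri="us-docker.pkg.dev/vertex-ai-restricted/builtin-algorithm/two-tower",
    trainer_args=trainer_args,
)
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) would submit the job, provided that the gcloud CLI is installed and authenticated.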

Tutorial

For information about how to feed more complex inputs to the Two-Tower built-in algorithm and train on Vertex AI, launch the example notebook in Colab. This notebook also shows you how to train on GPUs and how to view details and logs in TensorBoard while your job is executing.

Flags

Use the following generic and Two-Tower model-specific training flags when training a model by using the Two-Tower built-in algorithm.

Generic training flags

The following custom training flags are the most common. For more information, see Create custom training jobs.

  • worker-pool-spec: The worker pool configuration used by the custom job. To create a custom job with multiple worker pools, specify multiple worker-pool-spec configurations.

    A worker-pool-spec can contain the following fields, which are listed with corresponding fields in the WorkerPoolSpec API message.

    • machine-type: The machine type for the pool. For a list of supported machines, see Machine types.
    • replica-count: The number of replicas of the machine in the pool.
    • container-image-uri: The Docker image to run on each worker. To use the Two-Tower built-in algorithm, the Docker image must be set to us-docker.pkg.dev/vertex-ai-restricted/builtin-algorithm/two-tower:latest.
  • display-name: The name of the job.

  • region: The region where you want the job to run.

Two-Tower model-specific training flags

  • training_data_path: The Cloud Storage URI pattern specifying where training data is stored, for example, gs://<bucket>/<folder>/*.
  • eval_data_path: The Cloud Storage URI pattern specifying where evaluation data is stored, for example, gs://<bucket>/<folder>/*.
  • input_schema_path: The Cloud Storage URI specifying where the JSON input schema is stored, for example, gs://<bucket>/<folder>/<schema_file>.
  • input_file_format: The format of the input file. input_file_format can be one of the following formats:
    • jsonl (default)
    • tfrecord
  • job-dir: The Cloud Storage URI of a directory where you want to store training outputs.
  • candidate_data_path: The Cloud Storage URI pattern specifying where candidate data is stored, for example, gs://<bucket>/<folder>/*. candidate_data_path is needed only when you use the top_k_categorical_accuracy metric instead of the default auc. If candidate_data_path isn't set, the candidate set is created from the training and evaluation data.
  • train_batch_size: The batch size for training. The default value is 100.
  • eval_batch_size: The batch size for evaluation. The default value is 100.
  • eval_split: The split to use for the evaluation dataset, if eval_data_path isn't provided. The default value is 0.2 (20% of data from training_data_path is used as evaluation data).
  • metrics: The metrics used to evaluate the model. You can find evaluation metrics in the log files, or you can load evaluation metrics into TensorBoard, specifically from gs://<job-dir>/tensorboard/validation, as explained in the TensorBoard documentation. metrics can be one of the following:

    • auc (default)
    • top_k_categorical_accuracy
    • precision_at_1
  • optimizer: The optimizer used to train the model. You can use the lowercase string name of any TensorFlow 2.3 Keras optimizer, for example, sgd, adam, ftrl. For more information, see Module: tf.keras.optimizers. The default value is adagrad.

  • learning_rate: The learning rate that the optimizer uses during training.

  • momentum: The momentum value, which applies only if the optimizer is sgd or rmsprop.

  • num_epochs: The number of epochs for training. The default value is 10.

  • num_hidden_layers: The number of hidden layers. By default, there are no hidden layers.

  • num_nodes_hidden_layer{index}: The number of nodes for the hidden layer with that specific index. For example, for the first hidden layer use num_nodes_hidden_layer1. The range of indexes is from 1 through 20.

  • output_dim: A single integer representing the output encoding dimension for each tower of the Two-Tower model. The default value is 64.

  • cache_input: A Boolean value indicating whether to cache training and evaluation data in memory. This flag is recommended only for small datasets.

  • training_steps_per_epoch: The number of steps per epoch to run the training for. This flag is needed only if you are using more than 1 node for training or if your primary training node has more than 1 GPU. The default value is None.

  • eval_steps_per_epoch: The number of steps per epoch to run the evaluation for. This flag is needed only if you are using more than 1 node for training or your primary training node has more than 1 GPU. The default value is None.

  • gpu_memory_alloc: The amount of memory, in megabytes, allocated per GPU. The default value is no limit.
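
The num_hidden_layers and num_nodes_hidden_layer{index} flags in the list above have to agree with each other, so it can be convenient to derive both from a single list of layer sizes. The following is a small sketch of that idea. The helper `hidden_layer_flags` and the layer sizes are hypothetical.

```python
MAX_HIDDEN_LAYERS = 20  # indexes range from 1 through 20


def hidden_layer_flags(layer_sizes):
    """Turn a list of layer sizes into matching trainer flags."""
    if len(layer_sizes) > MAX_HIDDEN_LAYERS:
        raise ValueError(f"at most {MAX_HIDDEN_LAYERS} hidden layers are supported")
    flags = [f"--num_hidden_layers={len(layer_sizes)}"]
    # Layer indexes in the flag names start at 1, not 0.
    for index, nodes in enumerate(layer_sizes, start=1):
        flags.append(f"--num_nodes_hidden_layer{index}={nodes}")
    return flags


flags = hidden_layer_flags([128, 64])
print(flags)
```

Each resulting flag would then be passed to the job through its own --args entry, in the same way as the other Two-Tower flags.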

What's next