Customers who don't have an in-house solution for training embeddings can use the Two-Tower built-in algorithm to train embeddings for Vertex AI Matching Engine.
Overview
The Two-Tower model trains embeddings by using labeled data. It embeds similar types of objects, such as user profiles, search queries, web documents, answer passages, or images, in the same vector space, so that related items are close to each other. The Two-Tower model consists of two encoder towers: the query tower and the candidate tower. These towers embed independent items into a shared embedding space, which lets Matching Engine retrieve similarly matched items.
To train a Two-Tower model, Google uses pairs of relevant items. Each pair consists of a query document and a candidate document. Documents contain arbitrary customer-defined features including text, numeric, and categorical features. After training, the Two-Tower built-in algorithm exports two TensorFlow SavedModels—a query encoder and a candidate encoder—that convert the corresponding documents into embeddings. Given a query item, Matching Engine uses the query encoder to generate a query embedding, and uses the index to find similar candidate embeddings. Matching Engine uses the candidate encoder to index all the items and serve them by using an approximate nearest neighbor solution.
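The following is a minimal conceptual sketch of the two-tower idea in Keras. It is not the built-in algorithm's implementation; the tower architecture, feature dimensions, and similarity computation are illustrative assumptions, and the built-in algorithm constructs its towers from your input schema.

```python
import tensorflow as tf

EMBEDDING_DIM = 64  # corresponds to the output_dim training flag; value is illustrative

def make_tower(name, input_dim):
    """A toy encoder tower: numeric features -> embedding in the shared space."""
    inputs = tf.keras.Input(shape=(input_dim,), name=f"{name}_features")
    hidden = tf.keras.layers.Dense(128, activation="relu")(inputs)
    outputs = tf.keras.layers.Dense(EMBEDDING_DIM)(hidden)
    return tf.keras.Model(inputs, outputs, name=f"{name}_tower")

# Two independent encoders that embed into the same vector space.
query_tower = make_tower("query", input_dim=16)
candidate_tower = make_tower("candidate", input_dim=32)

# Embeddings for a batch of (query, candidate) pairs.
q = tf.nn.l2_normalize(query_tower(tf.random.normal([4, 16])), axis=-1)
c = tf.nn.l2_normalize(candidate_tower(tf.random.normal([4, 32])), axis=-1)

# Training pulls matching pairs together, for example by maximizing cosine similarity.
similarity = tf.reduce_sum(q * c, axis=-1)  # shape: (4,)
```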
Input data
The Two-Tower built-in algorithm expects two inputs:
- Training data: Pairs of documents used to train the model. Supported file formats are JSON Lines and TFRecord, described in the following sections.
- Input schema: A JSON file with the schema of the input data, plus feature configurations.
Training data
Training data consists of query document and candidate document pairs. You need to provide only positive pairs, where the query and candidate documents are considered a match. Training with negative pairs or partial matches is not supported.
Query and candidate documents consist of several user-defined features. The algorithm currently supports the following feature types:
| Feature type | Description |
|---|---|
| Text | A list of strings representing a text feature. |
| Id | A scalar string representing the unique ID of an item. |
| Categorical | A list of strings representing a categorical feature. The strings must be a subset of a vocabulary list specified in the input schema. |
| Numeric | A list of floats representing a numeric feature. |
| Vector | A fixed-length float vector that is input as is. |
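For illustration, a hypothetical query document using each feature type might look like the following (the feature names are invented for this example):

```python
# Hypothetical document showing the value shape expected for each feature type.
query_document = {
    "title": ["two tower model overview"],   # Text: list of strings
    "doc_id": "doc_12345",                   # Id: scalar string
    "genre": ["action", "comedy"],           # Categorical: list of strings from a vocabulary
    "ratings": [7.3, 6.0, 8.1],              # Numeric: list of floats
    "precomputed_vector": [0.1, 0.0, -0.4],  # Vector: fixed-length float vector
}
```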
JSON Lines format
Each line is a JSON string of a single instance of the training data. The format is a dictionary that contains two keys: `query` and `candidate`. These keys, in turn, contain keys representing the features of each document.
Example:
{ "query": { "movie_genre": ["action"], }, "candidate": { "ratings": [7.3, 6.0, 8.1], "movie_name": ["mission impossible"] } }
TFRecord format
Each training instance is a serialized TF example. A prefix of `query_` or `candidate_` identifies the document that the feature belongs to.
We recommend that you follow the TensorFlow tutorial about TFRecord to generate the TFRecord file.
Example:
```
features {
  feature {
    key: "query_movie_genre"
    value { bytes_list { value: "action" } }
  }
  feature {
    key: "candidate_ratings"
    value { float_list { value: [7.3, 6.0, 8.1] } }
  }
  feature {
    key: "candidate_movie_name"
    value { bytes_list { value: "mission impossible" } }
  }
}
```
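As a minimal sketch, the record above could be produced with TensorFlow as follows; the helper functions and output path are illustrative:

```python
import tensorflow as tf

def bytes_feature(values):
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[v.encode("utf-8") for v in values]))

def float_feature(values):
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))

# One training pair; query_/candidate_ prefixes mark which document a feature belongs to.
example = tf.train.Example(features=tf.train.Features(feature={
    "query_movie_genre": bytes_feature(["action"]),
    "candidate_ratings": float_feature([7.3, 6.0, 8.1]),
    "candidate_movie_name": bytes_feature(["mission impossible"]),
}))

with tf.io.TFRecordWriter("training_data.tfrecord") as writer:
    writer.write(example.SerializeToString())
```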
Input schema
The input schema is a JSON file that describes the schema of the training input, along with the feature configurations. Like the training data, the input schema contains two keys: `query` and `candidate`, indicating the query and candidate documents' input schemas, respectively. The values associated with these keys are themselves dictionaries, mapping feature names to their feature type and configuration.
For the `Id`, `Categorical`, `Text`, and `Vector` features, you need to provide a configuration with the following additional information:

- An `Id` feature requires `num_buckets`: the number of hash buckets for the ID feature. If the number of IDs is small, we recommend that the number of buckets equal the number of IDs.
- A `Categorical` feature requires `vocab`: a list of all possible values (strings) for this feature. Any unknown (out-of-vocabulary) categorical value that is encountered is encoded as a separate out-of-vocabulary value.
- A `Text` feature can have `embedding_module` (optional): a string that specifies the underlying text embedding module. For more information, see Text encoding.
- A `Vector` feature requires `length`: a positive integer that represents the length (dimension) of the vector input.
The following example shows an input schema.
"query": { "movie_genre": { "feature_type": "Categorical", "config": { "vocab": ["action", "comedy", "drama"] } } }, "candidate": { "ratings": { "feature_type": "Numeric" }, "movie_name": { "feature_type": "Text", "config": { "embedding_module":"gs://my_bucket/my_saved_model" } } }
Text encoding
Text features often provide the most significant signals for Two-Tower models. Matching Engine lets you specify a text encoder in several ways, and you can specify a different text encoder for each text feature. The default text encoder is the pretrained NNLM TFHub module.
Text encoders accept as input a one-dimensional batch of strings and output a two-dimensional float tensor of embeddings of shape `[batch_size, embedding_dimension]`. All text encoders for the Two-Tower built-in algorithm must follow this format. Text encoders must have their parameters fixed; their weights can't be updated during training.
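For example, a minimal sketch of packaging a compliant text encoder as a TensorFlow SavedModel, using the universal-sentence-encoder module listed in the options below; the Cloud Storage destination is illustrative and must be a path that your project can access:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Wrap a pretrained, non-trainable text embedding module. The encoder takes a
# 1-D batch of strings and returns a [batch_size, embedding_dimension] float tensor.
encoder = tf.keras.Sequential([
    hub.KerasLayer(
        "https://tfhub.dev/google/universal-sentence-encoder/4",
        input_shape=[],       # one scalar string per example
        dtype=tf.string,
        trainable=False,      # weights stay fixed, as required
    ),
])

embeddings = encoder(tf.constant(["mission impossible", "airplane"]))
print(embeddings.shape)  # (2, 512) for this module

# Export as a SavedModel to a bucket your project can access (path is illustrative).
encoder.save("gs://my_bucket/my_saved_model")
```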
The following three options for the text embedding module are possible:

- A TFHub path to a text embedding module, wrapped in a `hub.KerasLayer`. For example, to specify the universal-sentence-encoder, use `https://tfhub.dev/google/universal-sentence-encoder/4`.
- A Cloud Storage path to a TensorFlow SavedModel, also wrapped in a `hub.KerasLayer`. It is important that your project has access to this Cloud Storage path.
- A preset embedding module that Matching Engine provides (the name starts with `@`). Presets are pretrained text embedding modules that are verified to work with distributed training in the Two-Tower built-in algorithm. The following are two presets:
Training
To perform single-node training with the Two-Tower built-in algorithm, use the following command. This command creates a `CustomJob` resource that uses a single CPU virtual machine (VM) instance. For information about the flags you can use during training, see Flags. With single-node training, the parameters `training_steps_per_epoch` and `eval_steps_per_epoch` are optional.
```python
# Set a unique name for the job to run.
JOB_NAME = f'two_tower_cpu_{dataset_name}_{timestamp}'

# URI of the Two-Tower Docker image.
LEARNER_IMAGE_URI = 'us-docker.pkg.dev/vertex-ai-restricted/builtin-algorithm/two-tower'

# The region to run the job in.
REGION = 'us-central1'

# Set the training data and input schema paths.
TRAINING_DATA_PATH = f'gs://cloud-samples-data/vertex-ai/matching-engine/two-tower/{dataset_name}/training_data/*'
INPUT_SCHEMA_PATH = f'gs://cloud-samples-data/vertex-ai/matching-engine/two-tower/{dataset_name}/input_schema.json'

# Set a location for the output.
OUTPUT_DIR = f'gs://{your_bucket_name}/experiment/output'

# Batch size and number of epochs to train.
TRAIN_BATCH_SIZE = 100
NUM_EPOCHS = 10

!gcloud beta ai custom-jobs create \
  --display-name={JOB_NAME} \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,container-image-uri={LEARNER_IMAGE_URI} \
  --region={REGION} \
  --args=--training_data_path={TRAINING_DATA_PATH} \
  --args=--input_schema_path={INPUT_SCHEMA_PATH} \
  --args=--job-dir={OUTPUT_DIR} \
  --args=--train_batch_size={TRAIN_BATCH_SIZE} \
  --args=--num_epochs={NUM_EPOCHS}
```
Tutorial
For information about how to feed more complex inputs to the Two-Tower built-in algorithm and train on Vertex AI, launch the example notebook in Colab. This notebook also shows you how to train on GPUs and how to view details and logs in TensorBoard while your job is executing.
Flags
Use the following generic and Two-Tower model-specific training flags when training a model by using the Two-Tower built-in algorithm.
Generic training flags
The following custom training flags are the most common. For more information, see Create custom training jobs.
- `worker-pool-spec`: The worker pool configuration used by the custom job. To create a custom job with multiple worker pools, specify multiple `worker-pool-spec` configurations (see the example after this list).

  A `worker-pool-spec` can contain the following fields, which are listed with corresponding fields in the WorkerPoolSpec API message:

  - `machine-type`: The machine type for the pool. For a list of supported machines, see Machine types.
  - `replica-count`: The number of replicas of the machine in the pool.
  - `container-image-uri`: The Docker image to run on each worker. To use the Two-Tower built-in algorithm, the Docker image must be set to `us-docker.pkg.dev/vertex-ai-restricted/builtin-algorithm/two-tower:latest`.
- `display-name`: The name of the job.
- `region`: The region where you want the job to run.
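For example, a hypothetical multi-pool configuration might look like the following. It reuses the variables from the Training section; the machine types, replica counts, and step counts are illustrative. With more than one node, set `training_steps_per_epoch` and `eval_steps_per_epoch` (see the model-specific flags below).

```python
!gcloud beta ai custom-jobs create \
  --display-name={JOB_NAME} \
  --region={REGION} \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,container-image-uri={LEARNER_IMAGE_URI} \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=2,container-image-uri={LEARNER_IMAGE_URI} \
  --args=--training_data_path={TRAINING_DATA_PATH} \
  --args=--input_schema_path={INPUT_SCHEMA_PATH} \
  --args=--job-dir={OUTPUT_DIR} \
  --args=--training_steps_per_epoch=1000 \
  --args=--eval_steps_per_epoch=100
```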
Two-Tower model-specific training flags
- `training_data_path`: The Cloud Storage URI pattern specifying where training data is stored, for example, `gs://<bucket>/<folder>/*`.
- `eval_data_path`: The Cloud Storage URI pattern specifying where evaluation data is stored, for example, `gs://<bucket>/<folder>/*`.
- `input_schema_path`: The Cloud Storage URI specifying where the JSON input schema is stored, for example, `gs://<bucket>/<folder>/<schema_file>`.
- `input_file_format`: The format of the input file. `input_file_format` can be one of the following formats:
  - `jsonl` (default)
  - `tfrecord`
- `job-dir`: The Cloud Storage URI of a directory where you want to store training outputs.
- `candidate_data_path`: The Cloud Storage URI pattern specifying where candidate data is stored, for example, `gs://<bucket>/<folder>/*`. `candidate_data_path` is needed only when using non-default `top_k_categorical_accuracy` metrics instead of `auc`. If `candidate_data_path` isn't set, the candidate set is created from the training and evaluation data.
- `train_batch_size`: The batch size for training. The default value is `100`.
- `eval_batch_size`: The batch size for evaluation. The default value is `100`.
- `eval_split`: The split to use for the evaluation dataset, if `eval_data_path` isn't provided. The default value is `0.2` (20% of the data from `training_data_path` is used as evaluation data).
- `metrics`: The metrics used to evaluate the model. You can find evaluation metrics in the log files, or you can load evaluation metrics into TensorBoard, specifically from `gs://<job-dir>/tensorboard/validation`, as explained in the TensorBoard documentation. We highly recommend that you use `auc` as your metric, because it is considered the most reliable metric for recommendation models. `metrics` can be one of the following:
  - `auc` (default)
  - `top_k_categorical_accuracy`
  - `precision_at_1`
- `optimizer`: The optimizer used to train the model. You can use the lowercase string name of any TensorFlow 2.3 Keras optimizer, for example, `sgd`, `adam`, or `ftrl`. For more information, see Module: tf.keras.optimizers. The default value is `adagrad`.
- `learning_rate`: The learning rate for training with the optimizer.
- `momentum`: The momentum for the `sgd` or `rmsprop` optimizer, if specified.
- `num_epochs`: The number of epochs for training. The default value is `10`.
- `num_hidden_layers`: The number of hidden layers. By default, there are no hidden layers.
- `num_nodes_hidden_layer{index}`: The number of nodes for the hidden layer with that specific index. For example, for the first hidden layer, use `num_nodes_hidden_layer1`. The range of indexes is from 1 through 20.
- `output_dim`: A single integer representing the output encoding dimension for each tower of the Two-Tower model. The default value is `64`.
- `cache_input`: A Boolean value indicating whether to cache training and evaluation data in memory. This flag is recommended only for small datasets.
- `training_steps_per_epoch`: The number of steps per epoch to run the training for. This flag is needed only if you are using more than one node for training or if your primary training node has more than one GPU. The default value is `None`.
- `eval_steps_per_epoch`: The number of steps per epoch to run the evaluation for. This flag is needed only if you are using more than one node for training or if your primary training node has more than one GPU. The default value is `None`.
- `gpu_memory_alloc`: The amount of memory, in megabytes, allocated per GPU. The default value is no limit.
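As an illustration, several of these flags can be combined with the single-node command from the Training section; the flag values below are arbitrary examples, not recommendations:

```python
!gcloud beta ai custom-jobs create \
  --display-name={JOB_NAME} \
  --region={REGION} \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,container-image-uri={LEARNER_IMAGE_URI} \
  --args=--training_data_path={TRAINING_DATA_PATH} \
  --args=--input_schema_path={INPUT_SCHEMA_PATH} \
  --args=--job-dir={OUTPUT_DIR} \
  --args=--input_file_format=jsonl \
  --args=--metrics=auc \
  --args=--optimizer=adam \
  --args=--learning_rate=0.001 \
  --args=--num_epochs=20 \
  --args=--num_hidden_layers=2 \
  --args=--num_nodes_hidden_layer1=128 \
  --args=--num_nodes_hidden_layer2=64 \
  --args=--output_dim=64
```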
What's next
- Launch the example notebook in Colab.
- View the notebook in GitHub.
- Learn more about Matching Engine.