You can train an embedding model by using Vertex AI Pipelines with the Submatrix-wise Vector Embedding Learner (Swivel) algorithm.
Swivel is a method for generating item embeddings from an item co-occurrence matrix. For structured data, such as purchase orders, you can compute the co-occurrence matrix by counting, for every pair of products that you want to generate embeddings for, the number of purchase orders that contain both products.
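To make the counting concrete, the following is a minimal sketch (using standard Unix tools; it is not part of the Swivel pipeline, which computes the matrix for you) that tallies co-occurring pairs in a hypothetical baskets.txt file, where each line is one purchase order of space-separated product IDs:

awk '{
  for (i = 1; i <= NF; i++)
    for (j = i + 1; j <= NF; j++) {
      # Use a sorted key so that "A B" and "B A" count as the same pair.
      a = ($i < $j) ? $i : $j
      b = ($i < $j) ? $j : $i
      count[a " " b]++
    }
}
END { for (p in count) print p, count[p] }' baskets.txt

Each output line is a product pair followed by the number of baskets that contain both products.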
The following tutorial shows how to use the Swivel pipeline template to generate embeddings from your own text or structured data. The steps performed include:
- Setup: Enable APIs, grant permissions, prepare input data.
- Configure parameters: Set the appropriate parameter values for the Swivel job.
- Train on Vertex AI Pipelines: Create a Swivel job on Vertex AI Pipelines using the Swivel pipeline template.
Before you begin
Enable the Dataflow API service.
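If you prefer the command line to the console, one way to enable the required services is with the gcloud CLI (a sketch, assuming gcloud is installed and authenticated against your project):

gcloud services enable dataflow.googleapis.com aiplatform.googleapis.com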
Configure a service account with granular permissions.
- If you don't specify a service account, your pipeline run uses the Compute Engine default service account, which has the Project Editor role by default.
- For information about how to create your own service account, see Configure a service account with granular permissions, and grant the following roles to the service account:
  - roles/aiplatform.user
  - roles/iam.serviceAccountUser
  - roles/dataflow.developer
  - roles/compute.viewer
  - roles/storage.objectAdmin
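As a sketch of how to grant these roles from the command line, where SA_EMAIL is a placeholder for your service account's email address:

for role in roles/aiplatform.user roles/iam.serviceAccountUser roles/dataflow.developer roles/compute.viewer roles/storage.objectAdmin; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:SA_EMAIL" \
    --role="$role"
done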
Input data
You can use the Swivel pipeline template to generate embeddings for words in a text corpus or items in structured datasets. The Swivel pipeline template expects input data files to be in the following format:
Each line represents a co-occurrence context. For example, words in a sentence, a set of products in a shopping basket, a set of songs in a playlist, or a set of URLs visited in a user session.
Items in each line are separated by a space. Therefore, item names must not have spaces in them.
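For example, you can create a minimal items-style input file (with hypothetical product IDs, one shopping basket per line) and stage it in Cloud Storage as follows; the my_items path is arbitrary:

cat > baskets.txt <<EOF
product_123 product_456 product_789
product_123 product_456
product_456 product_789
EOF
gsutil cp baskets.txt gs://BUCKET_NAME/my_items/

Replace BUCKET_NAME with the name of a Cloud Storage bucket that you manage.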
Input data types
There are two types of inputs: text and items. Specifying the correct one for your use case is important because the input specification affects how the input data is processed.
Text input is expected to have capitalization and punctuation, so Swivel lowercases the text and removes punctuation. By doing this, Swivel ensures that tokens like "apple", "Apple", and "APPLE!" are treated as the same token and thus have the same embeddings.
Items input is split on whitespace, and each token is expected to be an item ID. In addition, order does not matter for items input because groups of items in something like a basket are not inherently ordered. Put another way, each line for items input is treated like a set, not an array. Thus, Swivel treats each item in the line as equally related to the others.
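As a rough illustration of the normalization applied to text input (a sketch using standard Unix tools, not Swivel's actual implementation):

echo 'apple Apple APPLE!' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]'
# Prints "apple apple apple": all three tokens now match.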
Sample data
Copy the following sample data to your Cloud Storage bucket for testing:
- wikipedia: A text corpus dataset created from a Wikimedia dump that you can use to learn word embeddings. This dataset uses the text input type.
To copy this data, run the following command:
gsutil cp -r gs://cloud-samples-data/vertex-ai/matching-engine/swivel/wikipedia/* gs://BUCKET_NAME/wikipedia
Replace BUCKET_NAME with the name of a Cloud Storage bucket that you manage.
- movielens_25m: A movie rating dataset that you can use to create embeddings for movies. This dataset is processed so that each line contains the movies that received the same rating from the same user. The directory also includes movies.csv, which maps the movie IDs to their names. This dataset uses the items input type.
To copy this data, run the following command:
gsutil cp -r gs://cloud-samples-data/vertex-ai/matching-engine/swivel/movielens_25m/data* gs://BUCKET_NAME/movielens_25m
Replace BUCKET_NAME with the name of a Cloud Storage bucket that you manage.
Create a Swivel job using Vertex AI Pipelines
To run a pipeline, use the Google Cloud console:

1. Download the precompiled pipeline template (swivel_pipeline.json) and the configuration script (swivel_template_configuration.sh) locally.
2. Configure the template by running the swivel_template_configuration.sh script:

   ./swivel_template_configuration.sh \
     -pipeline_suffix YOUR_PIPELINE_SUFFIX \
     -project_id PROJECT_ID \
     -machine_type n1-standard-16 \
     -accelerator_count 2 \
     -accelerator_type NVIDIA_TESLA_V100 \
     -pipeline_root BUCKET_NAME \
     -region YOUR_REGION \
     -network_name YOUR_NETWORK_NAME
The following is a list of the parameters:
- pipeline_suffix: the suffix of your pipeline name (lowercase letters and hyphens are allowed).
- machine_type: for example, n1-standard-16. For more information, see Configure compute resources for custom training.
- accelerator_count: the number of GPUs in each machine.
- accelerator_type: for example, NVIDIA_TESLA_P100 or NVIDIA_TESLA_V100. For more information, see GPUs.
- pipeline_root: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
- region: Optional. The default is us-central1.
- network_name: Optional. Your subnetwork.
The swivel_template_configuration.sh script generates a precompiled pipeline template, swivel_pipeline.json, that uses your configuration.

If you don't use a private IP address and aren't required to set up a VPC network, remove the following lines from swivel_pipeline.json before submitting the Swivel job:

--subnetwork=regions/%REGION%/subnetworks/%NETWORK_NAME%
--no_use_public_ips
network: projects/%PROJECT_NUMBER%/global/networks/%NETWORK_NAME%
3. Create a pipeline run by using swivel_pipeline.json in the console. If you prefer to submit the run programmatically, see the sketch after the following table.

The following table shows the runtime parameters of the Swivel job that you need to set in the console:
| Parameter | Data type | Description | Required |
| --- | --- | --- | --- |
| embedding_dim | int | Dimensions of the embeddings to train. | No. Default is 100. |
| input_base | string | Cloud Storage path where the input data is stored. | Yes |
| input_type | string | Type of the input data: either text (for the Wikipedia sample) or items (for the MovieLens sample). | Yes |
| max_vocab_size | int | Maximum vocabulary size to generate embeddings for. | No. Default is 409600. |
| num_epochs | int | Number of epochs for training. | No. Default is 20. |

- For the items input type, each line of your input data must contain space-separated item IDs. Each line is tokenized by splitting on whitespace.
- For the text input type, each line of your input data is considered equivalent to a sentence. Each line is tokenized by lowercasing and splitting on whitespace.
- For the Wikipedia sample, use input_type = text and input_base = {SOURCE_DATA}/wikipedia.
- For the MovieLens sample, use input_type = items and input_base = {SOURCE_DATA}/movielens_25m.
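The following is a minimal sketch of submitting the same run through the Vertex AI REST API instead of the console. It assumes that you upload swivel_pipeline.json to a Cloud Storage location of your choosing (the gs://BUCKET_NAME/templates/ path here is hypothetical) and that you replace PROJECT_ID, REGION, and BUCKET_NAME with your own values. Depending on the schema version of the compiled template, the runtime parameters field may need to be parameters rather than parameterValues:

gsutil cp swivel_pipeline.json gs://BUCKET_NAME/templates/swivel_pipeline.json

cat > request.json <<'EOF'
{
  "displayName": "swivel-run",
  "templateUri": "gs://BUCKET_NAME/templates/swivel_pipeline.json",
  "runtimeConfig": {
    "gcsOutputDirectory": "gs://BUCKET_NAME/pipeline_root",
    "parameterValues": {
      "input_base": "gs://BUCKET_NAME/movielens_25m",
      "input_type": "items"
    }
  }
}
EOF

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://REGION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/REGION/pipelineJobs" \
  -d @request.json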
Use your trained embedding model
After the pipeline run completes successfully, the trained TensorFlow SavedModel is exported to a Cloud Storage directory:
{BUCKET_NAME}/pipeline_root/{PROJECT_NUMBER}/swivel-{TIMESTAMP}/EmbTrainerComponent_-{SOME_NUMBER}/model/model/
You can find this path in the Vertex ML Metadata section of the console.
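As a quick sanity check of the export, you can copy the SavedModel locally and inspect its signatures (a sketch, assuming TensorFlow is installed on your machine, which provides the saved_model_cli tool; substitute the placeholder path with the actual output directory of your run):

gsutil -m cp -r gs://{BUCKET_NAME}/pipeline_root/{PROJECT_NUMBER}/swivel-{TIMESTAMP}/EmbTrainerComponent_-{SOME_NUMBER}/model/model ./swivel_model
saved_model_cli show --dir ./swivel_model --all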
Tutorial
To learn how to train the Swivel pipeline template on Vertex AI, explore the trained embeddings, and deploy the embedding model for online serving, launch the example notebook in Colab.
What's next
- Launch the example notebook in Colab.
- View the notebook in GitHub.
- Learn more about Vertex AI Matching Engine, which you can use with the trained embeddings that you generated.