Create embeddings by using the Swivel pipeline template

You can train an embedding model by using Vertex AI Pipelines with the Submatrix-wise Vector Embedding Learner (Swivel) algorithm.

Swivel is a method for generating item embeddings from an item co-occurrence matrix. For structured data, such as purchase orders, the co-occurrence matrix can be computed by counting, for each pair of products that you want to generate embeddings for, the number of purchase orders that contain both products.
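
To make this concrete, the following minimal Python sketch counts, for each pair of items, the number of orders that contain both. The order data and item names are illustrative, not part of the pipeline:

    # Count item co-occurrences across purchase orders (illustrative data).
    from collections import Counter
    from itertools import combinations

    orders = [
        {"product_a", "product_b", "product_c"},
        {"product_a", "product_b"},
        {"product_b", "product_c"},
    ]

    cooccurrence = Counter()
    for order in orders:
        # Each order is an unordered set, so count every unordered pair once.
        for pair in combinations(sorted(order), 2):
            cooccurrence[pair] += 1

    print(cooccurrence[("product_a", "product_b")])  # 2: two orders contain both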

The following tutorial shows how to use the Swivel pipeline template to generate embeddings from your own text or structured data. The steps performed include:

  1. Setup: Enable APIs, grant permissions, prepare input data.
  2. Configure parameters: Set the appropriate parameter values for the Swivel job.
  3. Train on Vertex AI Pipelines: Create a Swivel job on Vertex AI Pipelines using the Swivel pipeline template.

Before you begin

  1. Set up your Google Cloud project for Vertex AI Pipelines.

  2. Enable the Dataflow API.

  3. Configure a service account with granular permissions.

    • If you don't specify a service account, your pipeline run uses the Compute Engine default service account, which has the Project Editor role by default.
    • For information about how to create your own service account, see Configure a service account with granular permissions. Grant the following roles to the service account:
      • roles/aiplatform.user
      • roles/iam.serviceAccountUser
      • roles/dataflow.developer
      • roles/compute.viewer
      • roles/storage.objectAdmin

Input data

You can use the Swivel pipeline template to generate embeddings for words in a text corpus or items in structured datasets. The Swivel pipeline template expects input data files to be in the following format:

  • Each line represents a co-occurrence context. For example, words in a sentence, a set of products in a shopping basket, a set of songs in a playlist, or a set of URLs visited in a user session.

  • Items in each line are separated by a space. Therefore, item names must not have spaces in them.

Input data types

There are two types of input: text and items. Specifying the correct type for your use case is important because the input type determines how the input data is processed.

  1. Text input is expected to have capitalization and punctuation, so Swivel lowercases the text and removes punctuation. By doing this, Swivel ensures that tokens like "apple", "Apple", and "APPLE!" are treated as the same token and thus have the same embeddings.

  2. Items input is split on whitespace, and each token is expected to be an item ID. In addition, order does not matter for items input because groups of items in something like a basket are not inherently ordered. Put another way, each line for items input is treated like a set, not an array. Thus, Swivel treats each item in the line as equally related to the others.
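
To make the distinction concrete, a text input file might contain ordinary sentences, one per line, such as:

    The quick brown fox jumps over the lazy dog.

An items input file contains one co-occurrence group per line as space-separated item IDs; the IDs below are made up:

    item_001 item_042 item_317
    item_042 item_123 item_317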

Sample data

Copy the following sample data to your Cloud Storage bucket for testing:

  • wikipedia: A text corpus dataset created from a Wikimedia dump that you can use to learn word embeddings. This dataset uses the text input type.

To copy this data, run the following command:

gsutil cp -r gs://cloud-samples-data/vertex-ai/matching-engine/swivel/wikipedia/* gs://BUCKET_NAME/wikipedia

Replace BUCKET_NAME with the name of a Cloud Storage bucket that you manage.

  • movielens_25m: A movie rating dataset that you can use to create embeddings for movies. This dataset is processed so that each line contains the movies that received the same rating from the same user. The directory also includes movies.csv, which maps movie IDs to their names. This dataset uses the items input type.

To copy this data, run the following command:

gsutil cp -r gs://cloud-samples-data/vertex-ai/matching-engine/swivel/movielens_25m/data* gs://BUCKET_NAME/movielens_25m

Replace BUCKET_NAME with the name of a Cloud Storage bucket that you manage.

Create a Swivel job using Vertex AI Pipelines

To run a pipeline, use the Google Cloud console:

  1. Download the precompiled pipeline template (swivel_pipeline.json) and the configuration script (swivel_template_configuration.sh) to your local machine.

  2. Configure the template by using the following swivel_template_configuration.sh script:

    ./swivel_template_configuration.sh \
        -pipeline_suffix YOUR_PIPELINE_SUFFIX \
        -project_id PROJECT_ID \
        -machine_type n1-standard-16 \
        -accelerator_count 2 \
        -accelerator_type NVIDIA_TESLA_V100 \
        -pipeline_root BUCKET_NAME \
        -region YOUR_REGION \
        -network_name YOUR_NETWORK_NAME
    

    The following is a list of the parameters:

    • pipeline_suffix: the suffix of your pipeline name (lowercase letters and hyphens are allowed).
    • project_id: the ID of your Google Cloud project.
    • machine_type: for example, n1-standard-16. For more information, see Configure compute resources for custom training.
    • accelerator_count: the number of GPUs in each machine.
    • accelerator_type: for example, NVIDIA_TESLA_P100 or NVIDIA_TESLA_V100. For more information, see GPUs.
    • pipeline_root: the name of your Cloud Storage bucket. Don't include the gs:// prefix.
    • region: Optional. The default is us-central1.
    • network_name: Optional. Your subnetwork.

    The swivel_template_configuration.sh script generates a precompiled pipeline template that uses your configuration.

    If you don't use a private IP address and aren't required to set up a VPC network, remove the following lines in your swivel_pipeline.json before submitting the Swivel job:

    • --subnetwork=regions/%REGION%/subnetworks/%NETWORK_NAME%
    • --no_use_public_ips
    • network: projects/%PROJECT_NUMBER%/global/networks/%NETWORK_NAME%
  3. Create a pipeline run by using swivel_pipeline.json in the console.

    The Swivel job uses the following runtime parameters, which you set in the console:

    • embedding_dim (int): the dimension of the embeddings to train. Optional; the default is 100.
    • input_base (string): the Cloud Storage path where the input data is stored. Required.
    • input_type (string): the type of the input data, either text (for the Wikipedia sample) or items (for the MovieLens sample). Required.
    • max_vocab_size (int): the maximum vocabulary size to generate embeddings for. Optional; the default is 409600.
    • num_epochs (int): the number of epochs for training. Optional; the default is 20.

    Note the following:

    • For the items input type, each line of your input data must contain space-separated item IDs. Each line is tokenized by splitting on whitespace.
    • For the text input type, each line of your input data is treated as a sentence. Each line is tokenized by lowercasing and splitting on whitespace.
    • For the Wikipedia sample, use input_type = text and input_base = {SOURCE_DATA}/wikipedia.
    • For the MovieLens sample, use input_type = items and input_base = {SOURCE_DATA}/movielens_25m.
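
If you prefer to create the pipeline run programmatically instead of through the console, you can submit the same compiled template with the Vertex AI SDK for Python. The following is a minimal sketch, not part of the template itself; the display name, bucket paths, and parameter values are illustrative and assume the MovieLens sample:

    # Minimal sketch: submit the compiled Swivel template with the
    # Vertex AI SDK for Python (pip install google-cloud-aiplatform).
    # All names, paths, and values below are illustrative.
    from google.cloud import aiplatform

    aiplatform.init(
        project="PROJECT_ID",        # your project ID
        location="us-central1",      # the region you configured
    )

    job = aiplatform.PipelineJob(
        display_name="swivel-movielens",         # illustrative run name
        template_path="swivel_pipeline.json",    # generated by the script
        pipeline_root="gs://BUCKET_NAME/pipeline_root",
        parameter_values={
            "input_base": "gs://BUCKET_NAME/movielens_25m",
            "input_type": "items",
        },
    )

    # Runs synchronously; pass service_account= here if you configured one.
    job.run()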

Use your trained embedding model

After the pipeline run completes successfully, the trained TensorFlow SavedModel is exported to a Cloud Storage directory:

{BUCKET_NAME}/pipeline_root/{PROJECT_NUMBER}/swivel-{TIMESTAMP}/EmbTrainerComponent_-{SOME_NUMBER}/model/model/

You can find this path in the Vertex ML Metadata section of the console.
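
Because the export is a standard TensorFlow SavedModel, you can load it with TensorFlow to inspect what it exposes before you deploy it. The following is an exploratory sketch; the path is the placeholder pattern shown above, and the signature names and tensor shapes depend on the template version:

    # Sketch: load the exported SavedModel and list its serving signatures.
    import tensorflow as tf

    # Placeholder path; copy the real path from Vertex ML Metadata.
    model_dir = ("gs://BUCKET_NAME/pipeline_root/PROJECT_NUMBER/"
                 "swivel-TIMESTAMP/EmbTrainerComponent_-SOME_NUMBER/model/model/")

    model = tf.saved_model.load(model_dir)

    # The exact signatures vary by template version, so check them first.
    for name, fn in model.signatures.items():
        print(name)
        print("  inputs:", fn.structured_input_signature)
        print("  outputs:", fn.structured_outputs)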

Tutorial

To see how to train the Swivel pipeline template on Vertex AI, explore the trained embeddings, and deploy the embedding model for online serving, launch the example notebook in Colab.

What's next