Use Vertex AI Matching Engine

This guide explains how to configure and use Vertex AI Matching Engine to perform vector similarity searches.

Set up a VPC Network Peering connection

To reduce network latency for vector matching online queries, call the Vertex AI service endpoints from your Virtual Private Cloud (VPC) by using Private Service Access. For each Google Cloud project, only one VPC network can be peered with Matching Engine. If you already have a VPC with private services access configured, you can use that VPC to peer with Vertex AI Matching Engine.

Configuring a VPC Network Peering connection is a one-time setup task for each Google Cloud project. After this setup is done, you can make calls to the Matching Engine index from any client running inside your VPC.

The VPC Network Peering connection is required only for vector matching online queries. API calls to create, deploy, and delete indexes do not require a VPC Network Peering connection.

Your Cloud project administrator or network administrator must complete the following steps:

  1. To set up your Cloud projects, enable billing, and enable APIs, complete the following Before you begin steps.

  2. To avoid IP address collisions between your VPC network and the service producer's network, you must allocate an IP address range for the Matching Engine service in which your Matching Engine indexes are deployed. For more information, see Allocating IP address ranges.

    # Note: `prefix-length=16` means a CIDR block with mask /16 is reserved for
    # use by Google services. Make sure to enable the Service Networking API.
    gcloud compute addresses create $PEERING_RANGE_NAME \
        --global \
        --prefix-length=16 \
        --description="peering range for Matching Engine service" \
        --network=$NETWORK_NAME \
        --purpose=VPC_PEERING \
        --project=$PROJECT_ID
    
    # Create the VPC connection.
    gcloud services vpc-peerings connect \
        --service=servicenetworking.googleapis.com \
        --network=$NETWORK_NAME \
        --ranges=$PEERING_RANGE_NAME \
        --project=$PROJECT_ID
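
To verify that the peering connection was created, you can list the VPC peerings for your network:

gcloud services vpc-peerings list \
    --network=$NETWORK_NAME \
    --project=$PROJECT_ID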
    

After you create a private connection, you can make online calls to a Matching Engine index from any virtual machine (VM) instance running within the peered VPC.

Example notebook

After you complete the initial VPC Network Peering setup, you can create a user-managed notebooks instance within that VPC, and issue commands from the notebook.

Launch the example notebook in Vertex AI Workbench, or view the notebook on GitHub.

Access control with IAM

Vertex AI uses Identity and Access Management (IAM) to manage access to resources. To grant access to a resource, assign one or more roles to a user, group, or service account.

To use Matching Engine, assign these predefined roles to grant varying levels of access to resources at the project level.
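
For example, to grant the Vertex AI User role (roles/aiplatform.user) to a user at the project level, you can run a command like the following (the email address is a placeholder):

gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member="user:example-user@example.com" \
    --role="roles/aiplatform.user"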

Input data format and structure

To build a new index or update an existing index, provide vectors to Matching Engine in the format and structure described in the following sections.

Input data storage

Store your input data in a Cloud Storage bucket, in your Cloud project.
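
For example, you can upload a prepared batch directory with gsutil (the bucket name and path are illustrative):

gsutil cp -r ./batch_root gs://BUCKET_NAME/path/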

Input directory structure

Structure your input data directory as follows:

  • Batch root directory: Create a root directory for each batch of input data files. Use a single Cloud Storage directory as the root directory. In the example layout that follows this list, the root directory is named batch_root.
  • File naming: Place individual data files directly under batch_root and name them by using the suffix .csv, .json, or .avro, depending on which file format you use.

    • Matching Engine interprets each data file as a set of records.

      The format of the record is determined by the suffix of the file name and is described in one of the following sections.

    • Each record should have an ID and a feature vector, optionally with additional fields such as restricts and crowding.

  • Delete directory: You can create a delete subdirectory under batch_root. This directory is optional.

    • Each file directly under batch_root/delete is a text file of record IDs, with one ID in each line. Each ID must be a valid UTF-8 string.
  • All other directories and files are ignored.

  • All records from all data files, including those under delete, comprise a single batch of input. The relative ordering of records within a data file is immaterial.

  • A single ID can appear only once per batch.

    • Note: An ID cannot appear both in a regular data file and a delete data file.
  • All IDs from a data file under delete are removed from the next index version. Records from regular data files are included in the next version, potentially overwriting a value in an earlier index version.
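
For example, a batch directory that follows this structure might look like the following (file names are illustrative):

batch_root/
    vectors_000.csv
    vectors_001.csv
    delete/
        ids_to_remove.txt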

Data file formats

Data files can be in CSV, JSON, or Avro format.

CSV

  • Encode the file using UTF-8.
  • Make each line a valid CSV line, which is interpreted as a single record.
  • Make the first value the id, which must be a valid UTF-8 string.
  • Make the next N values the components of the feature vector, where N is the number of dimensions configured when you create the index. Make each value a floating point literal as defined in the Java language spec. An example record follows this list.
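
For example, a record with ID item_1 and a four-dimensional feature vector might look like the following (values are illustrative):

item_1,0.1,-0.2,0.33,0.05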

JSON

  • Encode the file using UTF-8.
  • Make each line a valid JSON object to be interpreted as a record.
  • Include in each record a field named id whose value is a valid UTF-8 string: the ID of the vector.
  • Include in each record a field named embedding whose value is an array of numbers: the feature vector. An example record follows this list.
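
For example, the same illustrative record in JSON format:

{"id": "item_1", "embedding": [0.1, -0.2, 0.33, 0.05]}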

Avro

  • Use a valid Avro file.
  • Make records that conform to the following schema:

    {
       "type": "record",
       "name": "FeatureVector",
       "fields": [
          {"name": "id", "type": "string"},
          {"name": "embedding",
           "type": {
              "type": "array",
              "items": "float"
            }
          }
       ]
    }
    

Manage indexes

The following sections describe how to create, delete, or update indexes. For more information, see the API docs on indexes.

Index metadata file

Before you create an index, you need to configure the parameters for your index.

For example, create a file named index_metadata.json:

{
  "contentsDeltaUri": "gs://BUCKET_NAME/path",
  "config": {
    "dimensions": 100,
    "approximateNeighborsCount": 150,
    "distanceMeasureType": "DOT_PRODUCT_DISTANCE",
    "algorithm_config": {
      "treeAhConfig": {
        "leafNodeEmbeddingCount": 500,
        "leafNodesToSearchPercent": 7
      }
    }
  }
}

You can find the definition for each of these fields in Configuring indexes, or view the definitions within the following schema:

title: NearestNeighborSearch
type: object
properties:
  contentsDeltaUri:
    type: string
    description: >
      Allows inserting, updating, or deleting the contents of the Matching Engine Index.
      The string must be a valid Cloud Storage directory path. If this
      field is set when calling IndexService.UpdateIndex, then no other
      Index field can be also updated as part of the same call.
      The expected structure and format of the files this URI points to is
      described at https://cloud.google.com/vertex-ai/docs/matching-engine/using-matching-engine#input-data-format
    writeOnly: true
  isCompleteOverwrite:
    type: boolean
    description: >
      If this field is set together with contentsDeltaUri when calling IndexService.UpdateIndex,
      then existing content of the Index will be replaced by the data from the contentsDeltaUri.
    default: false
  config:
    type: object
    description: >
      The configuration of the Matching Engine Index.
    required:
    - dimensions
    - algorithmConfig
    properties:
      dimensions:
        type: integer
        format: int32
        description: >
          The number of dimensions of the input vectors.
      approximateNeighborsCount:
        type: integer
        format: int32
        description: >
          The default number of neighbors to find via approximate search before exact reordering is
          performed. Exact reordering is a procedure where results returned by an
          approximate search algorithm are reordered via a more expensive distance computation.
          Required if tree-AH algorithm is used.
      distanceMeasureType:
        description: >
          The distance measure used in nearest neighbor search.
        oneOf:
        - enum: [SQUARED_L2_DISTANCE]
          description: >
            Euclidean (L_2) Distance
        - enum: [L1_DISTANCE]
          description: >
            Manhattan (L_1) Distance
        - enum: [COSINE_DISTANCE]
          description: >
            Cosine Distance. Defined as 1 - cosine similarity.
        - enum: [DOT_PRODUCT_DISTANCE]
          description: >
            Dot Product Distance. Defined as a negative of the dot product
        default: DOT_PRODUCT_DISTANCE
      featureNormType:
        description: >
          Type of normalization to be carried out on each vector.
        oneOf:
        - enum: [UNIT_L2_NORM]
          description: >
            Unit L2 normalization type.
        - enum: [NONE]
          description: >
            No normalization type is specified.
        default: NONE
      algorithmConfig:
        description: >
          The configuration with regard to the algorithms used for efficient search.
        oneOf:
        - type: object
          description: >
             Configuration options for using the tree-AH algorithm (Shallow tree + Asymmetric Hashing).
             Please refer to this paper for more details: https://arxiv.org/abs/1908.10396
          properties:
            type:
              type: string
              enum: [treeAhConfig]
            leafNodeEmbeddingCount:
              type: integer
              format: int64
              description: >
                 Number of embeddings on each leaf node. The default value is 1000 if not set.
            leafNodesToSearchPercent:
              type: number
              format: int32
              description: >
                 The default percentage of leaf nodes that any query may search. Must be in
                 the range 1-100, inclusive. The default value is 10 (meaning 10%) if not set.
        - type: object
          description: >
             Configuration options for using brute force search, which simply implements the
             standard linear search in the database for each query.
          properties:
            type:
              type: string
              enum: [bruteForceConfig]
        discriminator:
          propertyName: type

This metadata schema file is available to download from Cloud Storage.

Create an index

To create an index:

  1. Define your index metadata.
  2. Send the request by using gcloud beta ai indexes create:

    gcloud beta ai indexes create \
        --display-name=INDEX_NAME \
        --description=test \
        --metadata-file=LOCAL_PATH_TO_METADATA_FILE \
        --project=PROJECT_ID \
        --region=us-central1
    

Alternatively, you can use a curl API call with the index metadata passed directly:

curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexes \
-d '{
    displayName: "'${DISPLAY_NAME}'",
    description: "'${DESCRIPTION}'",
    metadata: {
       contentsDeltaUri: "'${INPUT_DIR}'",
       config: {
          dimensions: 100,
          approximateNeighborsCount: 150,
          distanceMeasureType: "DOT_PRODUCT_DISTANCE",
          algorithm_config: {
          treeAhConfig: {
            leafNodeEmbeddingCount: 500,
            leafNodesToSearchPercent: 7
            }
          }
       }
    }
}'

Here is the example output:

{
  "name": "projects/xxx/locations/us-central1/indexes/xxxx/operations/yyyy",
  "metadata": {...}
}

In the output, find the line with "name": "projects/xxxx/locations/us-central1/indexes/xxx/operations/yyyy". The yyyy part is your operation ID. Poll the operation until "done": true is in the response.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexes/xxxx/operations/yyyy

List indexes

To list indexes, run gcloud beta ai indexes list:

gcloud beta ai indexes list \
    --project=PROJECT_ID \
    --region=us-central1
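
To view the details of a single index, you can use the describe command (a sketch; INDEX_ID is the numeric index ID from the list output):

gcloud beta ai indexes describe INDEX_ID \
    --project=PROJECT_ID \
    --region=us-central1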

Update index content

To update the content of an existing index, use the IndexService.UpdateIndex method.

To replace the existing content of an Index:

  • Set Index.metadata.contentsDeltaUri to the Cloud Storage URI that includes the vectors you want to update.
  • Set isCompleteOverwrite to true.

If you set the contentsDeltaUri field when calling IndexService.UpdateIndex, then no other index fields (such as displayName, description, or userLabels) can also be updated as part of the same call.

Here is an example curl API call:

curl -X PATCH -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexes/${INDEX_ID} \
-d '{
   metadata: {
       contentsDeltaUri: "'${INPUT_DIR}'",
       isCompleteOverwrite: true
    },
}'

Poll the operation that is returned from the UpdateIndex call until it's done successfully.

If the Index has any associated deployments (see the Index.deployed_indexes field), then when certain changes to the original Index are done, the DeployedIndex is automatically updated asynchronously in the background to reflect these changes.

To check whether the change has been propagated, compare the update index operation finish time and the DeployedIndex.index_sync_time.
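
For example, you can fetch the IndexEndpoint and inspect the indexSyncTime of each DeployedIndex; the following sketch assumes the jq tool is installed:

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID} \
| jq '.deployedIndexes[] | {id, indexSyncTime}'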

Delete index

You cannot delete the Index until all its Index.deployed_indexes have been undeployed.

curl -X DELETE -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexes/${INDEX_ID}

Deploy and manage indexes

Deploying an index includes the following three tasks:

  1. Create an IndexEndpoint if needed, or reuse an existing IndexEndpoint.
  2. Get the IndexEndpoint ID.
  3. Deploy the Index to the IndexEndpoint.

Create an IndexEndpoint within your VPC network

Before you use an index to serve online vector matching queries, you must deploy the Index to an IndexEndpoint within your VPC Network Peering network. The first step is to create an IndexEndpoint. You can deploy more than one index to an IndexEndpoint that shares the same VPC network.
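
The network field in the request expects the full resource name of your VPC network, which uses the project number rather than the project ID. A sketch of constructing it (PROJECT_NUMBER and NETWORK_NAME are placeholders):

VPC_NETWORK_NAME="projects/${PROJECT_NUMBER}/global/networks/${NETWORK_NAME}"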

Here is an example curl API call:

curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints \
-d '{
    display_name: "'${DISPLAY_NAME}'",
    network: "'${VPC_NETWORK_NAME}'",
}'

Here is the example output:

{
  "name": "projects/xxx/locations/us-central1/indexEndpoints/xxxx/operations/yyyy",
  "metadata": {...}
}

In the output, find the line with "name": "projects/xxxx/locations/us-central1/indexEndpoints/xxx/operations/yyyy". The yyyy part is your operation ID. Poll the operation until "done": true is in the response.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/xxxx/operations/yyyy

Deploy an index

Deploy an index:

curl -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID}:deployIndex \
-d '{
  deployedIndex: {
    id: "'${DEPLOYED_INDEX_ID}'",
    index: "'${INDEX_RESOURCE_NAME}'",
    displayName: "'${DISPLAY_NAME}'"
  }
}'

As with other operations, get the operation ID from the response, and use it to poll the operation until it is done.

Enable autoscaling

Matching Engine supports autoscaling, which automatically resizes the number of nodes based on the demands of your workload. When demand is high, nodes are added to the node pool, up to the maximum size that you designate. When demand is low, the node pool scales back down to the minimum size that you designate. You can check the actual nodes in use and track changes by monitoring the current replicas.

To enable autoscaling, specify the maxReplicaCount and minReplicaCount when you deploy your index:

curl -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID}:deployIndex \
-d '{
  deployedIndex: {
    id: "'${DEPLOYED_INDEX_ID}'",
    index: "'${INDEX_RESOURCE_NAME}'",
    displayName: "'${DISPLAY_NAME}'",
    automaticResources: {
      minReplicaCount: 2,
      maxReplicaCount: 5
    }
  }
}'

  • If neither minReplicaCount nor maxReplicaCount is set, both default to 2.
  • If only maxReplicaCount is set, minReplicaCount defaults to 2.
  • If only minReplicaCount is set, maxReplicaCount is set to equal minReplicaCount.

List IndexEndpoints

To list your index endpoints (IndexEndpoints) and view the information of any associated deployed index instances (DeployedIndex), run the following code:

curl -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints

Here is the response:

{
  "indexEndpoints": [
    {
      "name": "projects//locations/us-central1/indexEndpoints/",
      "displayName": "...",
      "deployedIndexes": [
        {
          "id": "",
          "index": "projects//locations/us-central1/indexes/",
          "displayName": "demo",
          "createTime": "2021-06-18T00:19:13.242212Z",
          "privateEndpoints": {
            "matchGrpcAddress": "10.29.2.5"
          },
          "indexSyncTime": "2021-08-13T19:52:48.671205Z",
          "automaticResources": {
            "minReplicaCount": 1,
            "maxReplicaCount": 1
          }
        }
        ...
      ],
      "etag": "AMEw9yP9cMX3cjWFRuyLqI6YbB2UQcb-OU3tMwx9_B2p_MUiMlsMKPWX5KCphr1vbyiv",
      "createTime": "2021-06-18T00:16:59.320793Z",
      "updateTime": "2021-06-18T00:16:59.850034Z",
      "network": "projects//global/networks/"
    },
    ...
  ]
}

For more information, see the reference documentation for IndexEndpoint.

Undeploy an index

To undeploy an index, run the following code:

curl -H "Content-Type: application/json"  \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID}:undeployIndex \
-d '{
  deployed_index_id: "'${DEPLOYED_INDEX_ID}'",
}'

Delete an IndexEndpoint

Before you delete an IndexEndpoint, you must undeploy all the indexes associated with it.

curl -X DELETE -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID}

Query indexes to get nearest neighbors

Each DeployedIndex has a DEPLOYED_INDEX_SERVER_IP, which you can retrieve by listing IndexEndpoints. To query a DeployedIndex, connect to its DEPLOYED_INDEX_SERVER_IP at port 10000 and call the Match or BatchMatch method.
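
For example, you can retrieve the address of the first deployed index with curl; the following sketch assumes the jq tool is installed:

DEPLOYED_INDEX_SERVER_IP=$(curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID} \
| jq -r '.deployedIndexes[0].privateEndpoints.matchGrpcAddress')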

The following example uses grpc_cli:

./grpc_cli call ${DEPLOYED_INDEX_SERVER_IP}:10000 google.cloud.aiplatform.container.v1.MatchService.BatchMatch 'requests: [{deployed_index_id: "", requests: [{deployed_index_id: "", float_val: [-0.1,..]}, {deployed_index_id: "", float_val: [-0.1,..]}]}]'

You must make calls to these APIs from a client running in the same VPC that the service was peered with.

For more information about how to construct the queries, launch the sample notebook and run it in Vertex AI Workbench.

Tuning the index

Tuning the index requires setting the configuration parameters that impact the performance of deployed indexes, especially recall and latency. These parameters are set when the index is created. You can use brute force indexes to measure recall.

Configuration parameters that impact recall and latency

  1. distanceMeasureType

    The following values are supported:

    • SQUARED_L2_DISTANCE: Euclidean L2 distance
    • L1_DISTANCE: Manhattan L1 distance
    • COSINE_DISTANCE: Cosine distance defined as '1 - cosine similarity'
    • DOT_PRODUCT_DISTANCE: Dot product distance, defined as the negative of the dot product. This is the default value.

    In most cases, the embedding vectors used for similarity matching are computed by using metric learning models (also called Siamese networks or two-tower models). These models use a distance metric to compute the contrastive loss function. Ideally, the value of the distanceMeasureType parameter for the matching index matches the distance measure used by the model that produced the embedding vectors.

  2. approximateNeighborsCount

    The default number of neighbors to find by using approximate search before exact reordering is performed. Exact reordering is a procedure where results returned by an approximate search algorithm are reordered by a more expensive distance computation. Increasing this value increases recall, which can create a proportionate increase in latency.

  3. treeAhConfig.leafNodesToSearchPercent

    The percentage of leaves to be searched for each query. Increasing this value increases recall, which can also create a proportionate increase in latency. The default value is 10, meaning that 10% of the leaves are searched.

  4. treeAhConfig.leafNodeEmbeddingCount

    The number of embeddings for each leaf node. By default, this number is set to 1000.

    This parameter does not have a linear correlation to recall. Increasing or decreasing the value of the treeAhConfig.leafNodeEmbeddingCount parameter doesn't always increase or decrease recall. Experiment to find the optimal value. Changing the value of the treeAhConfig.leafNodeEmbeddingCount parameter generally has less effect than changing the value of the other parameters.

Using a brute force index to measure recall

To get the exact nearest neighbors, use indexes with the brute force algorithm. The brute force algorithm provides 100% recall at the expense of higher latency. Using a brute force index to measure recall is usually not a good choice for production serving, but you might find it useful for evaluating the recall of various indexing options offline.

To create an index with the brute force algorithm, specify brute_force_config in the index metadata:

curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/us-central1/indexes \
-d '{
    displayName: "'${DISPLAY_NAME}'",
    description: "'${DESCRIPTION}'",
    metadata: {
       contentsDeltaUri: "'${INPUT_DIR}'",
       config: {
          dimensions: 100,
          approximateNeighborsCount: 150,
          distanceMeasureType: "DOT_PRODUCT_DISTANCE",
          featureNormType: "UNIT_L2_NORM",
          algorithmConfig: {
             bruteForceConfig: {}
          }
       },
    },
}'

The example notebook shows how to use a brute force index to measure recall.

Monitor the IndexEndpoint

Google provides two metrics for monitoring the IndexEndpoint:

  • aiplatform.googleapis.com/matching_engine/current_shards

    The number of shards of the DeployedIndex. As data is added and deleted, Matching Engine automatically reshards the index to achieve optimal performance. This metric indicates the current number of shards of the deployed index.

  • aiplatform.googleapis.com/matching_engine/current_replicas

    The total number of active replica servers used by the DeployedIndex. To match query volume, Matching Engine automatically scales the number of replica servers up or down, based on the minimum and maximum replica settings specified when you deploy the index.

    If the index has multiple shards, each shard can be served by using a different number of replica servers. This metric is the total number of replica servers across all shards of the given index.

Learn how to select, query, and display these metrics in Metrics Explorer.
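
You can also read these metrics programmatically through the Cloud Monitoring API. The following sketch lists recent time series for the current_replicas metric (the time window values are placeholders):

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries?filter=metric.type%3D%22aiplatform.googleapis.com%2Fmatching_engine%2Fcurrent_replicas%22&interval.startTime=2021-08-13T00:00:00Z&interval.endTime=2021-08-13T01:00:00Z"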

Quotas

Learn about Vertex AI Matching Engine quotas and how to request quota increases.

FAQ

How many IP addresses should I reserve?

If there's no restriction on the IP range you can allocate, we recommend that you reserve a large IP range, such as /16, to avoid future IP exhaustion issues.

If you don't want to overallocate IP ranges, you can make a rough estimate based on your data size and traffic. Each shard can host about 20 GB of data in Avro format, and each replica of a shard can serve about 800 to 1,000 queries per second (QPS). The exact QPS each replica can serve depends on factors such as your embedding size, dimensions, and algorithm configuration. We strongly recommend that you run a load test to determine an accurate number.

The total number of deployed index nodes needed is (the number of shards * the number of replicas per shard). For example, if your data size is 30 GB and QPS is 1,200, you need at least 2 shards and 2 replicas per shard, which is a total of 4 deployed index nodes.

After estimating your total deployed index nodes, you can pick the IP range prefix based on the following table:

Total deployed index nodes    Recommended reserved IP prefix
1 - 10                        /21
11 - 25                       /20
26 - 50                       /19
51 - 120                      /18

How do I resolve an IP exhausted error?

To resolve an IP exhausted error, complete the following steps:

  1. Check for any unused DeployedIndexes and undeploy them to free up IP space.

  2. Expand existing reserved IP ranges, or allocate more IP ranges.

For more information, see IP address range exhaustion.

Why can't I reuse the deployed index ID when the previous DeployedIndex is undeployed?

UndeployIndex cleanup can take 10 to 20 minutes to complete, even after the operation reports success. We recommend that you either wait 10 to 20 minutes before reusing the same ID, or use a different ID.

Getting support

If you encounter a problem using Matching Engine, there are two ways to get support. In either case, include the following details in your communication:

  • Your project ID.
  • The command or code you ran that triggered the problem.
  • The environment in which you ran the command or code. For example, did you run it on a Compute Engine instance or on an on-premises computer?
  • The behavior you observed and how it differs from what you expected.

Create a Cloud Customer Care ticket

If you have a Customer Care package, you can file a support ticket. For information about getting a Cloud Customer Care package, see Customer Care.

  1. In the Google Cloud Console, navigate to the Cases page.

    Go to Cases

  2. Click Create case.

    • In the Title field, type Matching Engine serving.
    • In the Category field, select Machine Learning.
    • In the Component field, select Vertex: Other.
    • In the Description field, type Vertex AI Matching Engine at the beginning of your description, then provide the requested information and respond to the questions.
    • Click Submit.

Send an email

If you do not have a Customer Care package, you can send a message to the following email address:

gcp-ann-feedback@google.com

What's next