Use Vertex Matching Engine

This guide explains how to configure and use Vertex Matching Engine to perform vector similarity searches.

Set up VPC Network Peering connection

To reduce network latency for vector matching online queries, call the Vertex AI service endpoints from your VPC via Private Service Access. For each Google Cloud project, only one VPC network can be peered with Matching Engine. If you already have a VPC with private services access configured, you can use that VPC to peer with Vertex Matching Engine.

Configuring a VPC Network Peering connection is an initial task that is required only once per Google Cloud project. After this setup is done, you can make calls to a Matching Engine index from any client running inside your VPC.

The VPC Network Peering connection is required only for vector matching online queries. API calls to create, deploy, and delete indexes do not require a VPC Network Peering connection.

Have your Google Cloud project administrator or network administrator complete these steps:

  1. Follow these steps to set up your Google Cloud project(s), enable billing, and enable APIs.

  2. Before you create a private connection, you must allocate an IP address range for the Matching Engine service to use. This ensures that there are no IP address collisions between your VPC network and the service producer's network, where the Matching Engine indexes are deployed. For more information, see the private services access documentation.

    # NOTE: `prefix-length=16` means a CIDR block with mask /16 will be reserved
    # for use by Google services. Make sure to enable Service Networking API.
    gcloud compute addresses create $PEERING_RANGE_NAME \
      --global \
      --prefix-length=16 \
      --description="peering range for Matching Engine service" \
      --network=$NETWORK_NAME \
      --purpose=VPC_PEERING \
      --project=$PROJECT_ID
    
    # Create the VPC connection.
    gcloud services vpc-peerings connect \
      --service=servicenetworking.googleapis.com \
      --network=$NETWORK_NAME \
      --ranges=$PEERING_RANGE_NAME \
      --project=$PROJECT_ID
    

Example notebook

After you complete the initial VPC Network Peering setup, you can create a Notebooks instance within that VPC, and issue commands from the notebook.

Launch the example notebook in Notebooks, or view the notebook in GitHub.

Access control

Vertex AI uses Identity and Access Management (IAM) to manage access to resources. To grant access to a resource, assign one or more roles to a user, group, or service account.

To use Matching Engine, use these predefined roles to grant varying levels of access to resources at the project level.

Input data format and structure

This section describes the structure and format in which vectors should be provided to the Matching Engine, either to build a new index or to update an existing index.

Input data storage

Store your input data in a Cloud Storage bucket, in your Google Cloud project.

Input directory structure

This section explains how to structure your input data directory.

  • Batch root directory: Create a root directory for each batch of input data files. This should be a single Cloud Storage directory, named batch_root in this example.
  • File naming: Place individual data files directly under batch_root and name them with the following suffixes: .csv, .json, or .avro, depending on which file format you use.

    • Matching Engine interprets each data file as a set of records. The format of each record is determined by the file name suffix and is described in one of the following sections.

    • Each record should have an id and a feature vector, optionally with additional supported fields, such as restricts and crowding.

  • Delete directory: You can create a delete subdirectory under batch_root. This directory is optional.

    • Each file directly under batch_root/delete is a text file of record ids, with one id in each line. Each id must be a valid UTF-8 string.
  • All other directories and files are ignored.

  • All records from all data files, including those under delete, comprise a single batch of input. The relative ordering of records within a data file is immaterial.

  • A single id should only appear once in a batch.

    • Note: an id also cannot appear both in a regular data file and a delete data file.
  • Each id listed in a data file under delete is removed from the next index version. Records from regular data files are included in the next version, potentially overwriting values from an older index version.
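The id rules above can be checked mechanically before uploading a batch. Here is a minimal sketch; the function and its inputs are illustrative, not part of the Matching Engine tooling:

```python
def validate_batch(regular_ids, delete_ids):
    """Return the ids that violate the batch rules: an id that appears
    more than once across the regular data files, or an id that appears
    in both a regular data file and a delete file."""
    seen, bad = set(), set()
    for rid in regular_ids:
        if rid in seen:
            bad.add(rid)  # duplicate within the batch
        seen.add(rid)
    bad |= seen & set(delete_ids)  # present in both regular and delete files
    return sorted(bad)
```

An empty result means the batch satisfies the uniqueness rules.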

Data file formats

This section describes the format requirements for each individual data file. Data files can be CSV, JSON, or Avro files.

CSV

  • Encode the file using UTF-8.
  • Each line should be valid CSV, and is interpreted as a single record.
  • The first value should be the id. It should be a valid UTF-8 string.
  • The next N values should be the feature vector. N is the dimension of the feature vector and should be configured when creating an index. Each value should be a floating point literal as defined in the Java language spec.

JSON

  • File should be encoded using UTF-8.
  • Each line should be a valid JSON object, which is interpreted as a record.
  • Each record must have a field named id, which is the id of the vector. It should be a valid UTF-8 string.
  • Each record must have a field named embedding and it should be an array of numbers. This is the feature vector.
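The CSV and JSON record formats can be produced from the same data. Here is a minimal sketch, assuming ids and equal-length float vectors; the helper name is illustrative:

```python
import csv
import json

def write_records(ids, vectors, csv_path, json_path):
    """Write the same records in both formats described above:
    CSV rows of `id,v1,...,vN`, and newline-delimited JSON objects
    with `id` and `embedding` fields."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for rid, vec in zip(ids, vectors):
            writer.writerow([rid, *vec])
    with open(json_path, "w", encoding="utf-8") as f:
        for rid, vec in zip(ids, vectors):
            f.write(json.dumps({"id": rid, "embedding": list(vec)}) + "\n")
```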

AVRO

  • File should be a valid Avro file.
  • Avro records should have a structure similar to that of the JSON format. Specifically, each record should conform to the following schema:

    {
       "type": "record",
       "name": "FeatureVector",
       "fields": [
          {"name": "id", "type": "string"},
          {"name": "embedding",
           "type": {
              "type": "array",
              "items": "float"
            }
          }
       ]
    }
    

Manage indexes

This section describes how to create, delete or update indexes. Refer to the API docs on indexes.

Index metadata file

Before you create an index, you need to configure the parameters for your index.

For example, create a file named index_metadata.json:

{
  "contentsDeltaUri": "gs://BUCKET_NAME/path",
  "config": {
    "dimensions": 100,
    "approximateNeighborsCount": 150,
    "distanceMeasureType": "DOT_PRODUCT_DISTANCE",
    "algorithm_config": {
      "treeAhConfig": {
        "leafNodeEmbeddingCount": 500,
        "leafNodesToSearchPercent": 7
      }
    }
  }
}
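The metadata file can also be generated programmatically. The following sketch reproduces the example above; the default parameter values mirror that example and are not service defaults:

```python
import json

def write_index_metadata(path, bucket_path, dimensions,
                         neighbors=150, leaf_count=500, leaf_percent=7):
    """Write an index metadata file shaped like the example above."""
    metadata = {
        "contentsDeltaUri": bucket_path,
        "config": {
            "dimensions": dimensions,
            "approximateNeighborsCount": neighbors,
            "distanceMeasureType": "DOT_PRODUCT_DISTANCE",
            "algorithm_config": {
                "treeAhConfig": {
                    "leafNodeEmbeddingCount": leaf_count,
                    "leafNodesToSearchPercent": leaf_percent,
                }
            },
        },
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```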

You can find the definition for each of these fields in Configuring indexes, or view the definitions within the following schema:

title: NearestNeighborSearch
type: object
properties:
  contentsDeltaUri:
    type: string
    description: >
      Allows inserting, updating, or deleting the contents of the Matching Engine Index.
      The string must be a valid Cloud Storage directory path. If this
      field is set when calling IndexService.UpdateIndex, then no other
      Index field can be also updated as part of the same call.
      The expected structure and format of the files this URI points to is
      described at https://cloud.google.com/vertex-ai/docs/matching-engine/using-matching-engine#input-data-format
    writeOnly: true
  isCompleteOverwrite:
    type: boolean
    description: >
      If this field is set together with contentsDeltaUri when calling IndexService.UpdateIndex,
      then existing content of the Index will be replaced by the data from the contentsDeltaUri.
    default: false
  config:
    type: object
    description: >
      The configuration of the Matching Engine Index.
    required:
    - dimensions
    - algorithmConfig
    properties:
      dimensions:
        type: integer
        format: int32
        description: >
          The number of dimensions of the input vectors.
      approximateNeighborsCount:
        type: integer
        format: int32
        description: >
          The default number of neighbors to find via approximate search before exact reordering is
          performed. Exact reordering is a procedure where results returned by an
          approximate search algorithm are reordered via a more expensive distance computation.
          Required if tree-AH algorithm is used.
      distanceMeasureType:
        description: >
          The distance measure used in nearest neighbor search.
        oneOf:
        - enum: [SQUARED_L2_DISTANCE]
          description: >
            Euclidean (L_2) Distance
        - enum: [L1_DISTANCE]
          description: >
            Manhattan (L_1) Distance
        - enum: [COSINE_DISTANCE]
          description: >
            Cosine Distance. Defined as 1 - cosine similarity.
        - enum: [DOT_PRODUCT_DISTANCE]
          description: >
            Dot Product Distance. Defined as a negative of the dot product
        default: DOT_PRODUCT_DISTANCE
      featureNormType:
        description: >
          Type of normalization to be carried out on each vector.
        oneOf:
        - enum: [UNIT_L2_NORM]
          description: >
            Unit L2 normalization type.
        - enum: [NONE]
          description: >
            No normalization type is specified.
        default: NONE
      algorithmConfig:
        description: >
          The configuration with regard to the algorithms used for efficient search.
        oneOf:
        - type: object
          description: >
             Configuration options for using the tree-AH algorithm (Shallow tree + Asymmetric Hashing).
             Please refer to this paper for more details: https://arxiv.org/abs/1908.10396
          properties:
            type:
              type: string
              enum: [treeAhConfig]
            leafNodeEmbeddingCount:
              type: integer
              format: int64
              description: >
                 Number of embeddings on each leaf node. The default value is 1000 if not set.
            leafNodesToSearchPercent:
              type: number
              format: int32
              description: >
                 The default percentage of leaf nodes that any query may be searched. Must be in
                 range 1-100, inclusive. The default value is 10 (means 10%) if not set.
        - type: object
          description: >
             Configuration options for using brute force search, which simply implements the
             standard linear search in the database for each query.
          properties:
            type:
              type: string
              enum: [bruteForceConfig]
        discriminator:
          propertyName: type

This metadata schema file is available to download from Cloud Storage.

Create an index

To create an index:

  1. Define your index metadata
  2. Send the request using gcloud beta ai indexes create:

    gcloud beta ai indexes create \
      --display-name=INDEX_NAME \
      --description=test \
      --metadata-file=LOCAL_PATH_TO_METADATA_FILE \
      --project=PROJECT_ID \
      --region=us-central1
    

Alternatively, here is an example curl API call with the index metadata passed directly:

curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexes \
-d '{
    displayName: "'${DISPLAY_NAME}'",
    description: "'${DESCRIPTION}'",
    metadata: {
       contentsDeltaUri: "'${INPUT_DIR}'",
       config: {
          dimensions: 100,
          approximateNeighborsCount: 150,
          distanceMeasureType: "DOT_PRODUCT_DISTANCE",
          algorithm_config: {
          treeAhConfig: {
            leafNodeEmbeddingCount: 500,
            leafNodesToSearchPercent: 7
            }
          }
       }
    }
}'

Here is the example output:

{
  "name": "projects/xxx/locations/us-central1/indexes/xxxx/operations/yyyy",
  "metadata": {...}
}

In the operation output, find the line with "name": "projects/xxxx/locations/us-central1/indexes/xxx/operations/yyyy". The yyyy part is your operation ID. Poll the operation until the response contains "done": true.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexes/xxxx/operations/yyyy
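The polling loop can be sketched generically. Here, fetch is any callable that returns the decoded operation JSON (for example, a wrapper around the curl call above); it is a placeholder, not a Vertex AI client method:

```python
import time

def poll_operation(fetch, interval_sec=10, max_attempts=60):
    """Poll a long-running operation until its response contains
    `"done": true`, raising if the operation reports an error or
    does not finish within max_attempts polls."""
    for _ in range(max_attempts):
        op = fetch()
        if op.get("done"):
            if "error" in op:
                raise RuntimeError(f"operation failed: {op['error']}")
            return op
        time.sleep(interval_sec)
    raise TimeoutError("operation did not finish in time")
```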

List indexes

To list indexes, run gcloud beta ai indexes list:

gcloud beta ai indexes list \
  --project=PROJECT_ID \
  --region=us-central1

Update index content

To update the content of an existing index, use the IndexService.UpdateIndex method.

To replace the existing content of an Index:

  • Set Index.metadata.contentsDeltaUri to the Cloud Storage URI that includes the vectors you want to update.
  • Set isCompleteOverwrite to true.

If you set the contentsDeltaUri field when calling IndexService.UpdateIndex, then no other Index fields (such as displayName, description, or userLabels) can be also updated as part of the same call.

Here is an example curl API call:

curl -X PATCH -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexes/${INDEX_ID} \
-d '{
   metadata: {
       contentsDeltaUri: "'${INPUT_DIR}'",
       isCompleteOverwrite: <true|false>
    },
}'

As with creating an index, poll the operation returned from the UpdateIndex call until it finishes successfully.

If the Index has any associated deployments (see the Index.deployed_indexes field), then when certain changes are made to the original Index, the DeployedIndex is automatically updated asynchronously in the background to reflect those changes.

To check whether the change has been propagated, compare the update index operation finish time and the DeployedIndex.index_sync_time.

Delete index

You cannot delete the Index until all its Index.deployed_indexes have been undeployed.

curl -X DELETE -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexes/${INDEX_ID}

Deploy and manage indexes

Deploying an index consists of the following three tasks:

  1. Create an IndexEndpoint if needed or reuse an existing IndexEndpoint.
  2. Get the IndexEndpoint ID.
  3. Deploy the Index to the IndexEndpoint.

Create an IndexEndpoint within your VPC network

Before you use an index to serve online vector matching queries, you must deploy the Index to an IndexEndpoint within your VPC Network Peering network, so the first step is to create an IndexEndpoint. You can deploy more than one Index to an IndexEndpoint; all indexes deployed to an endpoint share its VPC network.

Here is an example curl API call:

curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints \
-d '{
    display_name: "'${DISPLAY_NAME}'",
    network: "'${VPC_NETWORK_NAME}'",
}'

Here is the example output:

{
  "name": "projects/xxx/locations/us-central1/indexEndpoints/xxxx/operations/yyyy",
  "metadata": {...}
}

In the operation output, find the line with "name": "projects/xxxx/locations/us-central1/indexEndpoints/xxx/operations/yyyy". The yyyy part is your operation ID. Poll the operation until the response contains "done": true.

curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/xxxx/operations/yyyy

Deploy an index

To deploy an index:

curl -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID}:deployIndex \
-d '{
  deployedIndex: {
    id: "'${DEPLOYED_INDEX_ID}'",
    index: "'${INDEX_RESOURCE_NAME}'",
    displayName: "'${DISPLAY_NAME}'"
  }
}'

As with other operations, get the operation id from the response, and use it to poll the operation until it is done.

Enable autoscaling

Matching Engine supports autoscaling, which automatically resizes the number of nodes based on the demands of your workload. When demand is high, nodes are added to the node pool, up to the maximum size that you designate. When demand is low, the node pool scales back down to the minimum size that you designate. You can check the actual nodes in use, and how they change over time, by monitoring the current replicas metric.

To enable autoscaling, specify the maxReplicaCount and minReplicaCount when you deploy your index:

curl -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID}:deployIndex \
-d '{
  deployedIndex: {
    id: "'${DEPLOYED_INDEX_ID}'",
    index: "'${INDEX_RESOURCE_NAME}'",
    displayName: "'${DISPLAY_NAME}'",
    automaticResources: {
      minReplicaCount: 2,
      maxReplicaCount: 5
    }
  }
}'
  • If neither minReplicaCount nor maxReplicaCount is set, both default to 1.
  • If only maxReplicaCount is set, minReplicaCount defaults to 1.
  • If only minReplicaCount is set, maxReplicaCount defaults to minReplicaCount.
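The defaulting rules above can be expressed as a small helper; this is illustrative, not part of any API:

```python
def resolve_replica_counts(min_replicas=None, max_replicas=None):
    """Apply the defaulting rules for automaticResources
    minReplicaCount / maxReplicaCount described above."""
    if min_replicas is None and max_replicas is None:
        return 1, 1
    if min_replicas is None:
        return 1, max_replicas
    if max_replicas is None:
        return min_replicas, min_replicas
    return min_replicas, max_replicas
```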

List IndexEndpoints

To list your IndexEndpoints and view the information of any associated DeployedIndex instances, run:

curl -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints

Here is the response:

{
  "indexEndpoints": [
    {
      "name": "projects/<ProjectId>/locations/us-central1/indexEndpoints/<IndexEndpoingId>",
      "displayName": "...",
      "deployedIndexes": [
        {
          "id": "<user specified deployed index id>",
          "index": "projects/<ProjectId>/locations/us-central1/indexes/<IndexId>",
          "displayName": "demo",
          "createTime": "2021-06-18T00:19:13.242212Z",
          "privateEndpoints": {
            "matchGrpcAddress": "10.29.2.5"
          },
          "indexSyncTime": "2021-08-13T19:52:48.671205Z",
          "automaticResources": {
            "minReplicaCount": 1,
            "maxReplicaCount": 1
          }
        }
        ...
      ],
      "etag": "AMEw9yP9cMX3cjWFRuyLqI6YbB2UQcb-OU3tMwx9_B2p_MUiMlsMKPWX5KCphr1vbyiv",
      "createTime": "2021-06-18T00:16:59.320793Z",
      "updateTime": "2021-06-18T00:16:59.850034Z",
      "network": "projects/<ProjectId>/global/networks/<NetworkId>"
    },
    ...
  ]
}

For more information, see the reference documentation for IndexEndpoint.

Undeploy an index

To undeploy an index, run:

curl -H "Content-Type: application/json"  \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID}:undeployIndex \
-d '{
  deployed_index_id: "'${DEPLOYED_INDEX_ID}'",
}'

Delete an IndexEndpoint

Before you delete an IndexEndpoint, you must undeploy all the indexes associated with it.

curl -X DELETE -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexEndpoints/${INDEX_ENDPOINT_ID}

Query indexes to get nearest neighbors

Each DeployedIndex has a DEPLOYED_INDEX_SERVER_IP, which you can retrieve by listing IndexEndpoints. To query a DeployedIndex, connect to its DEPLOYED_INDEX_SERVER_IP at port 10000 and call the Match or BatchMatch method.

The following example uses grpc_cli:

./grpc_cli call ${DEPLOYED_INDEX_SERVER_IP}:10000 google.cloud.aiplatform.container.v1beta1.MatchService.BatchMatch 'requests: [{deployed_index_id: "<deployed index id 1>", requests: [{deployed_index_id: "<deployed index id 1>", float_val: [-0.1,..<your query input>]}, {deployed_index_id: "<deployed index id 1>", float_val: [-0.1,..<your query input>]}]}]'

Calls to these APIs should be made from a client running in the same VPC that the service was peered with.

You can launch a sample notebook that has additional instructions on how to construct the queries, and run it in Notebooks.

Tuning the index

This section describes the configuration parameters that affect the performance, in particular the recall and latency, of deployed indexes. These parameters are set when the index is created. This section also describes how to use brute force indexes to measure recall.

Configuration parameters that impact recall and latency:

1) distanceMeasureType

The supported values are:

  • SQUARED_L2_DISTANCE – Euclidean L2 distance
  • L1_DISTANCE – Manhattan L1 distance
  • COSINE_DISTANCE – Cosine distance defined as '1 - cosine similarity'
  • DOT_PRODUCT_DISTANCE – Dot product distance, defined as the negative of the dot product. This is the default value.

In most cases, the embedding vectors used for similarity matching are computed via Metric Learning models (also called "Siamese Networks" or "Two-Tower Models"). These models use a distance metric to compute the contrastive loss function. Ideally, the distanceMeasureType parameter for the matching index should match the distance measure used by the model that produced the embedding vectors.
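For reference, the four distance measures can be computed directly from their definitions above. A pure-Python sketch with no Vertex AI dependency:

```python
import math

def distances(a, b):
    """Compute the four supported distance measures for two equal-length
    vectors: squared L2, L1, cosine distance (1 - cosine similarity),
    and dot product distance (negated dot product)."""
    dot = sum(x * y for x, y in zip(a, b))
    return {
        "SQUARED_L2_DISTANCE": sum((x - y) ** 2 for x, y in zip(a, b)),
        "L1_DISTANCE": sum(abs(x - y) for x, y in zip(a, b)),
        "COSINE_DISTANCE": 1 - dot / (math.hypot(*a) * math.hypot(*b)),
        "DOT_PRODUCT_DISTANCE": -dot,
    }
```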

2) approximateNeighborsCount

The default number of neighbors to find via approximate search before exact reordering is performed. Exact reordering is a procedure where results returned by an approximate search algorithm are reordered via a more expensive distance computation. Increasing this value will increase recall, but may also have a proportionate increase in latency.

3) treeAhConfig.leafNodesToSearchPercent

The percentage of leaf nodes to search for each query. Increasing this value increases recall, but may also proportionately increase latency. The default value is 10, which means that 10% of the leaves are searched.

4) treeAhConfig.leafNodeEmbeddingCount

The number of embeddings for each leaf node. By default, this is set to 1000.

This parameter does not have a linear correlation to recall. Increasing or decreasing its value will not always increase or decrease recall. For each use case, there is usually a sweet spot that may require some experimentation to find. However, generally, the impact of this parameter is lower than the impact of the other parameters.

Using a brute force index to measure recall

Indexes with the brute force algorithm can be used to obtain the exact nearest neighbors, which provides 100% recall at the expense of higher latency. This is usually not a good choice for production serving, but may be useful, for example, to evaluate the recall of various indexing options offline.

To create an index with the brute force algorithm, specify brute_force_config in index metadata:

curl -X POST -H "Content-Type: application/json" \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
https://us-central1-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/indexes \
-d '{
    displayName: "'${DISPLAY_NAME}'",
    description: "'${DESCRIPTION}'",
    metadata: {
       contentsDeltaUri: "'${INPUT_DIR}'",
       config: {
          dimensions: 100,
          approximateNeighborsCount: 150,
          distanceMeasureType: "DOT_PRODUCT_DISTANCE",
          featureNormType: "UNIT_L2_NORM",
          algorithmConfig: {
             bruteForceConfig: {}
          }
       },
    },
}'

The example notebook demonstrates how to use a 'Brute Force' index to measure recall.
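Measuring recall then reduces to comparing the neighbor ids returned by the two deployments for the same queries. A minimal computation, using a hypothetical helper:

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact nearest neighbors (from the brute force
    index) that the approximate (tree-AH) index also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```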

Monitor the IndexEndpoint

We provide two metrics for monitoring the IndexEndpoint:

  • aiplatform.googleapis.com/matching_engine/current_shards

    The number of shards of the DeployedIndex. As data is added and deleted, Matching Engine automatically reshards the index to achieve optimal performance. This metric indicates the current number of shards of the deployed index.

  • aiplatform.googleapis.com/matching_engine/current_replicas

    The number of active replicas that are used by the DeployedIndex. Matching Engine automatically turns up and down replica servers (subject to user-specified min and max replica settings when deploying the index) to match the query volume. This metric indicates the total number of replica servers. Note that if the index has multiple shards, each shard can be served by using a different number of replicas. This metric is the total number of replicas across all shards of the given index.

Learn how to select, query, and display these metrics in Metrics Explorer.

Quotas

Learn about Vertex Matching Engine quotas and how to request quota increases.

FAQ

How many IP addresses should I reserve?

If there's no restriction on the IP range you can allocate, we recommend reserving a large IP range, such as /16, to avoid IP exhaustion issues in the future.

However, if you don't want to over-allocate IP ranges, you can make a rough estimate based on your data size and traffic. Each shard can host about 20 GB of data in Avro format, and each replica of a shard can serve about 800-1,000 QPS. (The exact QPS each replica can serve depends on factors such as your embedding size, dimensions, and algorithm configuration. We strongly recommend running a load test to determine the accurate number.) The total number of deployed index nodes needed is the number of shards multiplied by the number of replicas per shard.

For example, if your data size is 30 GB and your query volume is 1,200 QPS, you need at least 2 shards and 2 replicas per shard, for a total of 4 deployed index nodes.

After estimating your total deployed index nodes, pick the IP range prefix based on the following table.

Total deployed index nodes   Recommended reserved IP prefix
1 - 10                       /21
11 - 25                      /20
26 - 50                      /19
51 - 120                     /18
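The sizing arithmetic and the table above can be combined into a rough estimator. The 20 GB-per-shard and 1,000-QPS-per-replica figures are the approximate numbers quoted in this FAQ; load test to refine them:

```python
import math

# Thresholds from the table above: (max deployed index nodes, prefix).
PREFIX_TABLE = [(10, "/21"), (25, "/20"), (50, "/19"), (120, "/18")]

def estimate_nodes(data_gb, qps, gb_per_shard=20, qps_per_replica=1000):
    """Rough node count: shards from data size, replicas per shard
    from query volume, nodes = shards * replicas per shard."""
    shards = math.ceil(data_gb / gb_per_shard)
    replicas_per_shard = math.ceil(qps / qps_per_replica)
    return shards * replicas_per_shard

def recommended_prefix(total_nodes):
    """Pick the reserved IP prefix recommended for a node count."""
    for max_nodes, prefix in PREFIX_TABLE:
        if total_nodes <= max_nodes:
            return prefix
    return "/16"  # fall back to a large range
```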

What can I do if I get an IP exhaustion error?

First, check whether there are any unused DeployedIndexes, and undeploy them to free up IP space.

You can also expand existing reserved IP ranges or allocate additional IP ranges. See the VPC documentation for detailed instructions.

What's next