Filtering in vector matching

When you perform vector matching in Vertex Matching Engine, you can restrict searches to a subset of the index by using Boolean rules. These Boolean predicates tell Matching Engine which vectors in the index to exclude from similarity searches.

Vector attributes

Consider an example application that runs a vector similarity search over a database of vectors. Each vector is described by zero or more attributes (or tokens) from each of several attribute categories (or namespaces).

In this example application, vectors are tagged with a color and a shape:

  • color and shape are namespaces.
  • red and blue are tokens from the color namespace.
  • square and circle are tokens from the shape namespace.

Specifying vector attributes

To specify vector attributes in the example application:

  • To specify a "red circle": {color: red}, {shape: circle}.
  • To specify a "red and blue square": {color: red, blue}, {shape: square}.
  • To specify an object with no color, simply omit the "color" namespace from restricts.

See the section "Specify namespaces and tokens in the input data" for the schema used to specify this data.

Queries

  • A query that specifies **{color: red, blue}, {shape: square, circle}** matches all database points that satisfy **(red || blue) && (square || circle)**. Queries express an AND across namespaces, and an OR within each namespace.
  • A query that specifies **{color: red}** matches all **red** objects of any kind, with no restriction on **shape**.
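The AND-across-namespaces, OR-within-namespace rule can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not Matching Engine's implementation; the `matches` function and the dictionary representation of restricts are hypothetical names chosen for this example.

```python
from typing import Dict, Set

# Hypothetical in-memory representation: namespace -> set of tokens.
Restricts = Dict[str, Set[str]]


def matches(query: Restricts, datapoint: Restricts) -> bool:
    """Return True if the datapoint satisfies the query (allow tokens only).

    AND across namespaces: every namespace in the query must be satisfied.
    OR within a namespace: sharing any one token is enough.
    """
    for namespace, wanted in query.items():
        if not wanted:
            continue  # a query namespace with no tokens places no restriction
        if not (wanted & datapoint.get(namespace, set())):
            return False
    return True


# Datapoints from the example application.
red_circle = {"color": {"red"}, "shape": {"circle"}}
red_blue_square = {"color": {"red", "blue"}, "shape": {"square"}}
colorless_square = {"shape": {"square"}}  # the "color" namespace is omitted

# (red || blue) && (square || circle)
q_color_and_shape = {"color": {"red", "blue"}, "shape": {"square", "circle"}}
# red objects of any shape
q_red_only = {"color": {"red"}}

for name, point in [("red_circle", red_circle),
                    ("red_blue_square", red_blue_square),
                    ("colorless_square", colorless_square)]:
    print(name, matches(q_color_and_shape, point), matches(q_red_only, point))
```

Because a namespace that the query does not mention is never inspected, the `q_red_only` query places no restriction on shape.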

Denylisting

To enable more advanced scenarios, we support a special form of negation, known as denylist tokens. When a query denylists a token, any datapoint that has the denylisted token is excluded from the matches. Note that if a query namespace has only denylist tokens, all points not explicitly denylisted match, in exactly the same way that an empty namespace matches all points.

Datapoints can also denylist a token, excluding matches with any query specifying that token.

Here are a few examples to clarify the filtering capabilities. Let's define the following data points with the specified tokens:

  1. {} // empty set matches everything
  2. {red} // just a 'red' token
  3. {blue} // just a 'blue' token
  4. {orange} // just an 'orange' token
  5. {red, blue} // multiple tokens
  6. {red, !blue} // deny the 'blue' token
  7. {red, blue, !blue} // a weird edge-case
  8. {!blue} // deny-only (similar to empty-set)

The filtering behavior follows the rules that most people intuitively expect:

  • Empty query namespaces are match-all wildcards. For example, Q:{} matches DB:{color:red}.
  • Empty datapoint namespaces are not. For example, Q:{color:red} does not match DB:{}.
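The denylist rules for a single namespace can be sketched as follows. This is a minimal illustration that assumes only the rules stated in this section; `Tokens`, `tokens`, and `ns_matches` are hypothetical helper names. How the service resolves edge cases that the text does not spell out (for example, a point that both allows and denies the same token) is not shown here and should not be inferred from this sketch.

```python
from typing import FrozenSet, NamedTuple


class Tokens(NamedTuple):
    """Allow and deny tokens of one namespace (hypothetical helper type)."""
    allow: FrozenSet[str] = frozenset()
    deny: FrozenSet[str] = frozenset()


def tokens(*items: str) -> Tokens:
    """Build Tokens from strings; a leading '!' marks a denylist token."""
    allow = frozenset(i for i in items if not i.startswith("!"))
    deny = frozenset(i[1:] for i in items if i.startswith("!"))
    return Tokens(allow, deny)


def ns_matches(query: Tokens, point: Tokens) -> bool:
    """Apply the stated rules to one namespace of a query and a datapoint."""
    # A token denylisted by the query excludes datapoints that carry it.
    if query.deny & point.allow:
        return False
    # A token denylisted by the datapoint excludes queries that specify it.
    if point.deny & query.allow:
        return False
    # An empty or deny-only query namespace matches every remaining point.
    if not query.allow:
        return True
    # Otherwise the datapoint must share at least one allowed token.
    return bool(query.allow & point.allow)


print(ns_matches(tokens(), tokens("red")))                  # Q:{} vs DB:{red}
print(ns_matches(tokens("red"), tokens()))                  # Q:{red} vs DB:{}
print(ns_matches(tokens("!blue"), tokens("red", "blue")))   # query denies blue
print(ns_matches(tokens("blue"), tokens("red", "!blue")))   # point denies blue
```

A full matcher would call `ns_matches` once per namespace and require every call to succeed, mirroring the AND across namespaces.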

[Figure: chart showing query and database points]

Specify namespaces and tokens in the input data

To learn how to structure your input data overall, refer to Input data format and structure.

This section describes how to specify the namespaces and tokens associated with each input vector.

JSON

  • For each vector's record, add a field called "restricts", which should contain an array of objects, each of which will be a namespace.

    • Each object must have a field named "namespace". This is the TokenNamespace.namespace.
    • The value of the field "allow", if present, should be an array of strings. This is the list of TokenNamespace.string_tokens.
    • The value of the field "deny", if present, should be an array of strings. This is the list of TokenNamespace.string_denylist_tokens.

The following are two example records in JSON format:

{"id": "42", "embedding": [0.5, 1.0], "restricts": [{"namespace": "class", "allow": ["cat", "pet"]}, {"namespace": "category", "allow": ["feline"]}]}
{"id": "43", "embedding": [0.6, 1.0], "restricts": [{"namespace": "class", "allow": ["dog", "pet"]}, {"namespace": "category", "allow": ["canine"]}]}
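Records in this shape can be generated with Python's standard `json` module. The sketch below is one possible way to do it; `make_record` and its parameters are hypothetical names, and the output is written as one JSON object per line on the assumption that this is how the input file is laid out.

```python
import json
from typing import Dict, List, Optional


def make_record(
    record_id: str,
    embedding: List[float],
    allow: Optional[Dict[str, List[str]]] = None,
    deny: Optional[Dict[str, List[str]]] = None,
) -> str:
    """Serialize one vector record with a "restricts" array as a JSON line."""
    allow = allow or {}
    deny = deny or {}
    restricts = []
    # One object per namespace; "allow" and "deny" appear only when given.
    for namespace in sorted(set(allow) | set(deny)):
        entry: Dict[str, object] = {"namespace": namespace}
        if namespace in allow:
            entry["allow"] = list(allow[namespace])
        if namespace in deny:
            entry["deny"] = list(deny[namespace])
        restricts.append(entry)
    return json.dumps(
        {"id": record_id, "embedding": embedding, "restricts": restricts}
    )


line = make_record(
    "42",
    [0.5, 1.0],
    allow={"class": ["cat", "pet"], "category": ["feline"]},
)
print(line)
```

Namespaces are emitted in sorted order here only to make the output deterministic; the field descriptions above do not say that order matters.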

Avro

Avro records should have a structure similar to the one defined for the JSON format. Specifically, they should conform to the following schema:

{
  "type": "record",
  "name": "FeatureVector",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "embedding", "type": {"type": "array", "items": "float"}},
    {
      "name": "restricts",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Restrict",
            "fields": [
              {"name": "namespace", "type": "string"},
              {"name": "allow", "type": ["null", {"type": "array", "items": "string"}]},
              {"name": "deny", "type": ["null", {"type": "array", "items": "string"}]}
            ]
          }
        }
      ]
    },
    {"name": "crowding_tag", "type": ["null", "string"]}
  ]
}
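The schema declares `restricts`, `allow`, `deny`, and `crowding_tag` as unions with `null` and gives them no default values. One way to prepare a JSON-style record for this schema is therefore to make every optional field explicit, using `None` where a value is absent. The sketch below does only that reshaping with the standard library; `to_avro_shape` is a hypothetical helper, and actually writing the Avro file requires an Avro library, which is not shown.

```python
from typing import Any, Dict


def to_avro_shape(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record with every schema field present.

    Optional fields that are missing become None, which corresponds to
    the "null" branch of the unions in the FeatureVector schema.
    """
    restricts = record.get("restricts")
    if restricts is not None:
        restricts = [
            {
                "namespace": r["namespace"],
                "allow": r.get("allow"),  # None when no allow tokens are given
                "deny": r.get("deny"),    # None when no deny tokens are given
            }
            for r in restricts
        ]
    return {
        "id": str(record["id"]),
        "embedding": [float(x) for x in record["embedding"]],
        "restricts": restricts,
        "crowding_tag": record.get("crowding_tag"),
    }


json_style = {
    "id": "42",
    "embedding": [0.5, 1.0],
    "restricts": [{"namespace": "class", "allow": ["cat", "pet"]}],
}
print(to_avro_shape(json_style))
```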

What's next