Input data format and structure

To build a new index or update an existing index, provide vectors to Vector Search in the format and structure described in the following sections.

Input data storage and file organization

Prerequisite

Store your input data in a Cloud Storage bucket, in your Google Cloud project.

Input data files should be organized as follows:

  • Each batch of input data files should be under a single Cloud Storage directory.
  • Place data files directly under batch_root and name each one with one of the following suffixes: .csv, .json, or .avro.
  • There is a limit of 5000 objects (files) in the batch root directory.
  • Each data file is interpreted as a set of records. The record format is determined by the filename suffix; the requirements for each format are described in Data file formats.
  • Each record should have an id, a feature vector, and any optional fields supported by Vertex AI Feature Store, such as restricts and crowding.
  • A subdirectory named delete may be present. Each file directly under batch_root/delete is treated as a text file of record IDs, with one ID per line.
  • No other subdirectories are allowed.
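
The layout rules above can be checked before an upload. The following is a minimal sketch, not an official tool: the function name check_batch_layout is hypothetical, the object names are assumed to be paths relative to the bucket, and the 5000-object limit is assumed to count only files directly under the batch root.

```python
from pathlib import PurePosixPath

ALLOWED_SUFFIXES = {".csv", ".json", ".avro"}
MAX_ROOT_OBJECTS = 5000  # assumption: counts files directly under batch_root


def check_batch_layout(object_names, batch_root="batch_root"):
    """Return a list of layout problems for the given object names."""
    problems = []
    root = PurePosixPath(batch_root)
    root_files = 0
    for name in object_names:
        try:
            rel = PurePosixPath(name).relative_to(root)
        except ValueError:
            problems.append(f"{name}: not under {batch_root}")
            continue
        parts = rel.parts
        if len(parts) == 1:
            # Data file directly under the batch root.
            root_files += 1
            if rel.suffix not in ALLOWED_SUFFIXES:
                problems.append(f"{name}: unsupported suffix")
        elif len(parts) == 2 and parts[0] == "delete":
            # File directly under batch_root/delete: a list of IDs.
            pass
        else:
            problems.append(f"{name}: subdirectory not allowed")
    if root_files > MAX_ROOT_OBJECTS:
        problems.append(f"{root_files} files exceed the {MAX_ROOT_OBJECTS} limit")
    return problems


if __name__ == "__main__":
    listing = [
        "batch_root/feature_file_1.csv",
        "batch_root/feature_file_2.csv",
        "batch_root/delete/delete_file.txt",
    ]
    print(check_batch_layout(listing))  # prints []
```

The function takes plain strings so that it can be fed from any listing source, such as the output of a Cloud Storage client.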

Input data processing

  • All records from all data files, including those under delete, constitute a single batch of input.
  • The relative ordering of records within a data file is not important.
  • An ID should appear only once in a batch. If an ID is duplicated, the duplicates are counted as a single vector.
  • An ID cannot appear in both a regular data file and a delete data file.
  • Every ID listed in a data file under delete is removed from the next index version.
  • Records from regular data files are included in the next index version, overwriting any value from an older index version.
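
The ID rules above lend themselves to a quick pre-flight check. The following is a minimal sketch; check_batch_ids is a hypothetical helper, and it assumes the IDs from the regular data files and from the delete files have already been read into two lists.

```python
from collections import Counter


def check_batch_ids(regular_ids, delete_ids):
    """Return (duplicates, conflicts) for one batch of input.

    duplicates: IDs that appear more than once in regular data files.
    conflicts:  IDs that appear in both regular and delete files.
    """
    counts = Counter(regular_ids)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    conflicts = sorted(set(regular_ids) & set(delete_ids))
    return duplicates, conflicts


if __name__ == "__main__":
    dups, conflicts = check_batch_ids(["1", "2", "2", "3"], ["3", "4"])
    print(dups)       # prints ['2']
    print(conflicts)  # prints ['3']
```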

The following are examples of dense, sparse, and hybrid embeddings:

  • Dense embeddings:

    {"id": "1", "embedding": [1,1,1]}
    {"id": "2", "embedding": [2,2,2]}
    
  • Sparse embeddings:

    {"id": "3", "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1, 4]}}
    {"id": "4", "sparse_embedding": {"values": [-0.4, 0.2, -1.3], "dimensions": [10, 20, 20]}}
    
  • Hybrid embeddings:

    {"id": "5", "embedding": [5, 5, -5], "sparse_embedding": {"values": [0.1], "dimensions": [500]}}
    {"id": "6", "embedding": [6, 7, -8.1], "sparse_embedding": {"values": [0.1, -0.2], "dimensions": [40, 901]}}
    

The following is an example of a valid input data file organization:

batch_root/
  feature_file_1.csv
  feature_file_2.csv
  delete/
    delete_file.txt

The feature_file_1.csv and feature_file_2.csv files contain records in CSV format. The delete_file.txt file contains a list of record IDs to be deleted from the next index version.

Data file formats

JSON

  • Encode the JSON file using UTF-8.
  • Each line of the JSON file is interpreted as a separate JSON object.
  • Each record must contain an id field to specify the ID of the vector.
  • Each record must contain at least one of embedding or sparse_embedding.
  • The embedding field is an array of N floating point numbers that represents the feature vector, where N is the dimension of the feature vector that was configured when the index was created. This field can be used for dense embeddings only.
    • The length of embedding must equal configs.dimensions, which is specified at index creation time. configs.dimensions applies only to embedding, not to sparse_embedding.
  • The sparse_embedding field is an object with values and dimensions fields. The values field is a list of floating point numbers that represents the feature vector and the dimensions field is a list of integers that represent the dimension in which the corresponding value is located. For example, a sparse embedding that looks like [0,0.1,0,0,0.2] can be represented as "sparse_embedding": {"values": [0.1, 0.2], "dimensions": [1,4]}. This field can be used for sparse embeddings only.
    • The length of sparse_embedding.values must be the same length as sparse_embedding.dimensions. They don't need to be the same length as configs.dimensions, which is specified at index creation time and doesn't apply to sparse_embedding.
  • An optional restricts field can be included that specifies an array of TokenNamespace objects in restricts. For each object:
    • Specify a namespace field that is the TokenNamespace.namespace.
    • An optional allow field can be set to an array of strings which are the list of TokenNamespace.string_tokens.
    • An optional deny field can be set to an array of strings which are the list of TokenNamespace.string_blacklist_tokens.
  • The value of the crowding_tag field, if present, must be a string.
  • An optional numeric_restricts field can be included that specifies an array of NumericRestrictNamespace. For each object:
    • Specify a namespace field that is the NumericRestrictNamespace.namespace.
    • Specify one of the value fields: value_int, value_float, or value_double.
    • Don't include a field named op. That field is only for queries.
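
The JSON rules above can be sketched as a small validator. This is an illustration only: dense_to_sparse and validate_record are hypothetical helpers, dense_dim stands in for configs.dimensions, and the check reads "one of the value fields" as exactly one per numeric restrict, which is an assumption.

```python
import json

VALUE_FIELDS = ("value_int", "value_float", "value_double")


def dense_to_sparse(vector):
    """Convert a mostly-zero list into the values/dimensions form."""
    dims = [i for i, v in enumerate(vector) if v != 0]
    return {"values": [vector[i] for i in dims], "dimensions": dims}


def validate_record(record, dense_dim):
    """Return a list of rule violations for one JSON record."""
    errors = []
    if "id" not in record:
        errors.append("missing id")
    dense = record.get("embedding")
    sparse = record.get("sparse_embedding")
    if dense is None and sparse is None:
        errors.append("need embedding or sparse_embedding")
    if dense is not None and len(dense) != dense_dim:
        errors.append("embedding length differs from configs.dimensions")
    if sparse is not None and len(sparse["values"]) != len(sparse["dimensions"]):
        errors.append("sparse values and dimensions differ in length")
    if "crowding_tag" in record and not isinstance(record["crowding_tag"], str):
        errors.append("crowding_tag must be a string")
    for restrict in record.get("numeric_restricts", []):
        if "op" in restrict:
            errors.append("numeric restrict must not contain op")
        if sum(f in restrict for f in VALUE_FIELDS) != 1:
            errors.append("numeric restrict needs one value field")
    return errors


if __name__ == "__main__":
    record = {"id": "3", "sparse_embedding": dense_to_sparse([0, 0.1, 0, 0, 0.2])}
    assert validate_record(record, dense_dim=3) == []
    # One JSON object per line, as the format requires.
    print(json.dumps(record))
```

Writing each record with json.dumps on its own line produces a file in which every line is a separate JSON object.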

Avro

  • Use a valid Avro file.
  • To represent a sparse-only datapoint, provide a sparse embedding in the sparse_embedding field and enter an empty list in the embedding field.
  • Create records that conform to the following schema:

    {
      "type": "record",
      "name": "FeatureVector",
      "fields": [
        {
          "name": "id",
          "type": "string"
        },
        {
          "name": "embedding",
          "type": {
            "type": "array",
            "items": "float"
          }
        },
        {
          "name": "sparse_embedding",
          "type": [
            "null",
            {
              "type": "record",
              "name": "sparse_embedding",
              "fields": [
                {
                  "name": "values",
                  "type": {
                    "type": "array",
                    "items": "float"
                  }
                },
                {
                  "name": "dimensions",
                  "type": {
                    "type": "array",
                    "items": "long"
                  }
                }
              ]
            }
          ]
        },
        {
          "name": "restricts",
          "type": [
            "null",
            {
              "type": "array",
              "items": {
                "type": "record",
                "name": "Restrict",
                "fields": [
                  {
                    "name": "namespace",
                    "type": "string"
                  },
                  {
                    "name": "allow",
                    "type": [
                      "null",
                      {
                        "type": "array",
                        "items": "string"
                      }
                    ]
                  },
                  {
                    "name": "deny",
                    "type": [
                      "null",
                      {
                        "type": "array",
                        "items": "string"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "name": "numeric_restricts",
          "type": [
            "null",
            {
              "type": "array",
              "items": {
                "name": "NumericRestrict",
                "type": "record",
                "fields": [
                  {
                    "name": "namespace",
                    "type": "string"
                  },
                  {
                    "name": "value_int",
                    "type": [ "null", "int" ],
                    "default": null
                  },
                  {
                    "name": "value_float",
                    "type": [ "null", "float" ],
                    "default": null
                  },
                  {
                    "name": "value_double",
                    "type": [ "null", "double" ],
                    "default": null
                  }
                ]
              }
            }
          ],
          "default": null
        },
        {
          "name": "crowding_tag",
          "type": [
            "null",
            "string"
          ]
        }
      ]
    }
    

CSV

  • Format: ID,N feature vector values,Any number of dimension:value sparse values,name=value lists
  • Encode the CSV file using UTF-8.
  • Each line of the CSV must contain exactly one record.
  • The first value in each line must be the vector ID, which must be a valid UTF-8 string.
  • Following the ID, at least one of dense embedding or sparse embedding must be specified.
  • For a dense embedding, the next N values represent the feature vector, where N is the dimension of the feature vector that was configured when the index was created.
  • For a sparse embedding, any number of dimension:value can be specified, in which value is parsed as a float and dimension is parsed as a long.
  • For a hybrid embedding that has both dense and sparse embeddings, dense embeddings must be specified before sparse embeddings.
  • Feature vector values must be floating point literals as defined in the Java language spec.
  • Additional values may be in the form name=value.
  • The name crowding_tag is interpreted as the crowding tag and may only appear once in the record.
  • All other name=value pairs are interpreted as token namespace restricts. The same name may be repeated if there are multiple values in a namespace.

    For example, color=red,color=blue represents this TokenNamespace:

    {
      "namespace": "color",
      "string_tokens": ["red", "blue"]
    }
    
  • If value starts with !, the rest of the string is interpreted as an excluded value.

    For example, color=!red represents this TokenNamespace:

    {
      "namespace": "color",
      "string_blacklist_tokens": ["red"]
    }
    
  • #name=numericValue pairs with a number type suffix are interpreted as numeric namespace restricts. The number type suffix is i for int, f for float, and d for double. Don't repeat the same name, because each namespace should have a single value.

    For example, #size=3i represents this NumericRestrictNamespace:

    {
      "namespace": "size",
      "value_int": 3
    }
    

    #ratio=0.1f represents this NumericRestrictNamespace:

    {
      "namespace": "ratio",
      "value_float": 0.1
    }
    

    #weight=0.3d represents this NumericRestrictNamespace:

    {
      "namespace": "weight",
      "value_double": 0.3
    }
    
  • The following example is a datapoint with id: "6", embedding: [7, -8.1], sparse_embedding: {values: [0.1, -0.2, 0.5], dimensions: [40, 901, 1111]}, crowding tag test, token allowlist of color: red, blue, token denylist of color: purple, and numeric restrict of ratio with float 0.1:

    6,7,-8.1,40:0.1,901:-0.2,1111:0.5,crowding_tag=test,color=red,color=blue,color=!purple,#ratio=0.1f
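
To show how the CSV rules combine, the following minimal sketch assembles one line from its parts. The helper to_csv_line and its parameters are hypothetical; numeric restricts are written with the # prefix and type suffix described above, and the sketch assumes that IDs and tokens contain no commas, because no escaping is applied.

```python
def to_csv_line(record_id, embedding=(), sparse=None, crowding_tag=None,
                allow=None, deny=None, numeric=None):
    """Build one CSV record: ID, dense values, sparse pairs, name=value lists."""
    fields = [str(record_id)]
    # Dense values come first, in order.
    fields += [str(v) for v in embedding]
    # Sparse values follow as dimension:value pairs.
    if sparse is not None:
        fields += [f"{d}:{v}" for d, v in zip(sparse["dimensions"], sparse["values"])]
    if crowding_tag is not None:
        fields.append(f"crowding_tag={crowding_tag}")
    # Token restricts: the name is repeated for each value in a namespace.
    for namespace, tokens in (allow or {}).items():
        fields += [f"{namespace}={t}" for t in tokens]
    for namespace, tokens in (deny or {}).items():
        fields += [f"{namespace}=!{t}" for t in tokens]
    # Numeric restricts: (namespace, value, suffix) with suffix i, f, or d.
    for namespace, value, suffix in (numeric or []):
        fields.append(f"#{namespace}={value}{suffix}")
    return ",".join(fields)


if __name__ == "__main__":
    line = to_csv_line(
        "6",
        embedding=[7, -8.1],
        sparse={"values": [0.1, -0.2, 0.5], "dimensions": [40, 901, 1111]},
        crowding_tag="test",
        allow={"color": ["red", "blue"]},
        deny={"color": ["purple"]},
        numeric=[("ratio", 0.1, "f")],
    )
    print(line)
    # prints 6,7,-8.1,40:0.1,901:-0.2,1111:0.5,crowding_tag=test,color=red,color=blue,color=!purple,#ratio=0.1f
```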
    

What's next