Input data format and structure

To build a new index or update an existing index, provide vectors to Vector Search in the format and structure described in the following sections.

Input data storage and file organization

Prerequisite

Store your input data in a Cloud Storage bucket in your Google Cloud project.

Input data files should be organized as follows:

  • Each batch of input data files should be under a single Cloud Storage directory, referred to here as batch_root.
  • Place data files directly under batch_root and name them with one of the following suffixes: .csv, .json, or .avro.
  • Each data file is interpreted as a set of records. The record format is determined by the file name suffix, and the requirements for each format are described in the sections that follow.
  • Each record should have an id, a feature vector, and, optionally, fields supported by Vector Search, such as restricts and crowding.
  • A subdirectory named delete may be present. Each file directly under batch_root/delete is treated as a text file of IDs, with one ID per line.
  • All other directories and files are ignored.
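
For example, a minimal batch with one data file and one delete file can be staged with the Cloud Storage Python client, as in the following sketch. The bucket name, batch_root prefix, and file names are placeholders, not required values.

    # Sketch: upload a minimal batch to Cloud Storage with the google-cloud-storage client.
    # The bucket name, batch_root prefix, and file names below are placeholders.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("my-vector-search-bucket")  # assumption: your bucket
    batch_root = "batch_root"                          # any prefix can serve as the batch root

    # One JSON data file directly under batch_root, with one record per line.
    data_lines = [
        '{"id": "1", "embedding": [1, 1, 1]}',
        '{"id": "2", "embedding": [2, 2, 2]}',
    ]
    bucket.blob(f"{batch_root}/feature_file_1.json").upload_from_string(
        "\n".join(data_lines), content_type="application/json"
    )

    # Optional delete file directly under batch_root/delete, one ID per line.
    bucket.blob(f"{batch_root}/delete/delete_file.txt").upload_from_string("3\n4\n")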

Input data processing

  • All records from all data files, including those under delete, comprise a single batch of input.
  • The relative ordering of records within a data file is not important.
  • An ID should appear only once in a batch. If the same ID appears more than once, it counts as a single vector.
  • An ID cannot appear both in a regular data file and a delete data file.
  • IDs listed in data files under delete cause the corresponding records to be removed from the next index version.
  • Records from regular data files are included in the next version, overwriting values from older index versions (see the sketch after this list).
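
The following sketch models these rules with an in-memory dictionary. It is only an illustration of the batch semantics described above, not of Vector Search internals; the IDs and vectors are made up.

    # Sketch: how one batch updates an index, modeled as an id -> vector dictionary.
    existing = {"1": [1, 1, 1], "2": [2, 2, 2], "3": [3, 3, 3]}

    batch_records = {"1": [9, 9, 9], "4": [4, 4, 4]}  # from regular data files
    batch_deletes = {"3"}                             # IDs from files under delete/

    next_version = dict(existing)
    next_version.update(batch_records)       # newer records overwrite older values
    for record_id in batch_deletes:
        next_version.pop(record_id, None)    # deleted IDs are removed

    # next_version == {"1": [9, 9, 9], "2": [2, 2, 2], "4": [4, 4, 4]}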

Here is a .json example, with one record per line:

    {"id": "1", "embedding": [1,1,1]}
    {"id": "2", "embedding": [2,2,2]}

The following is an example of a valid input data file organization:

batch_root/
  feature_file_1.csv
  feature_file_2.csv
  delete/
    delete_file.txt

The feature_file_1.csv and feature_file_2.csv files contain records in .csv format. The delete_file.txt file contains a list of record ids to be deleted from the next index version.
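
Once a batch is staged under batch_root, you pass that Cloud Storage URI when building or updating the index. The following sketch assumes the google-cloud-aiplatform Python SDK; the project, location, index resource name, and bucket are placeholders, and parameters can differ by SDK version.

    # Sketch: apply a staged batch to an existing index with the Vertex AI Python SDK.
    # Project, location, index resource name, and bucket are placeholders.
    from google.cloud import aiplatform

    aiplatform.init(project="my-project", location="us-central1")

    index = aiplatform.MatchingEngineIndex(
        index_name="projects/my-project/locations/us-central1/indexes/1234567890"
    )
    index.update_embeddings(
        contents_delta_uri="gs://my-vector-search-bucket/batch_root",
        is_complete_overwrite=False,  # treat the batch as a delta, not a full rebuild
    )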

Data file formats

CSV

  • Encode the file using UTF-8.
  • Make each line a valid CSV line; each line is interpreted as a single record.
  • Make the first value the id, which must be a valid UTF-8 string.
  • Make the next N values the feature vector, where N is the number of dimensions configured when the index is created. Make each value a floating point literal as defined in the Java language specification.
  • The rest of the values are of the form name=value. If name is crowding_tag, the value is interpreted as the crowding tag; crowding_tag can appear at most once in each record. All other names are interpreted as namespaces in restricts. The same name can be repeated when a namespace has multiple values (a complete record sketch follows this list). For example, color=red,color=blue represents the following TokenNamespace:

    {
      "namespace": "color",
      "string_tokens": ["red", "blue"]
    }
    

    If a value starts with "!", the rest of the string is interpreted as an excluded value. For example, "color=!red" represents the following TokenNamespace:

    {
      "namespace": "color",
      "string_blacklist_tokens": ["red"]
    }
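
As an illustration, the following sketch writes two records in this layout, one with allowed restrict values and a crowding tag and one with an excluded value. The file name, dimensions, and namespace are assumptions.

    # Sketch: write CSV records in the layout described above.
    # File name, embedding dimensions, and namespace values are illustrative.
    import csv

    rows = [
        # id, N embedding values, then name=value pairs
        ["1", 0.1, 0.2, 0.3, "color=red", "color=blue", "crowding_tag=store_1"],
        ["2", 0.4, 0.5, 0.6, "color=!red"],  # "!" marks an excluded value
    ]

    with open("feature_file_1.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)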
    

JSON

  • Encode the file using UTF-8.
  • Make each line a valid JSON object; each line is interpreted as a record.
  • Include in each record a field named id whose value is a valid UTF-8 string; this is the ID of the vector.
  • Include in each record a field named embedding whose value is an array of numbers; this is the feature vector.
  • The value of the field restricts, if present, should be an array of objects, each of which is turned into a TokenNamespace in restricts.
    • Each object must have a field named namespace. This is the TokenNamespace.namespace.
    • The value of the field allow, if present, should be an array of strings. This is the list of TokenNamespace.string_tokens.
    • The value of the field deny, if present, should be an array of strings. This is the list of TokenNamespace.string_blacklist_tokens.
  • The value of the field crowding_tag, if present, should be a string (a complete record sketch follows this list).
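
The following sketch writes two such records as newline-delimited JSON. The IDs, embeddings, and namespace values are illustrative.

    # Sketch: write newline-delimited JSON records with restricts and a crowding tag.
    # IDs, embeddings, and namespace values are illustrative.
    import json

    records = [
        {
            "id": "1",
            "embedding": [0.1, 0.2, 0.3],
            "restricts": [{"namespace": "color", "allow": ["red", "blue"]}],
            "crowding_tag": "store_1",
        },
        {
            "id": "2",
            "embedding": [0.4, 0.5, 0.6],
            "restricts": [{"namespace": "color", "deny": ["red"]}],
        },
    ]

    with open("feature_file_1.json", "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")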

Avro

  • Use a valid Avro file.
  • Make records that conform to the following schema:

    {
      "type": "record",
      "name": "FeatureVector",
      "fields": [
        {
          "name": "id",
          "type": "string"
        },
        {
          "name": "embedding",
          "type": {
            "type": "array",
            "items": "float"
          }
        },
        {
          "name": "restricts",
          "type": [
            "null",
            {
              "type": "array",
              "items": {
                "type": "record",
                "name": "Restrict",
                "fields": [
                  {
                    "name": "namespace",
                    "type": "string"
                  },
                  {
                    "name": "allow",
                    "type": [
                      "null",
                      {
                        "type": "array",
                        "items": "string"
                      }
                    ]
                  },
                  {
                    "name": "deny",
                    "type": [
                      "null",
                      {
                        "type": "array",
                        "items": "string"
                      }
                    ]
                  }
                ]
              }
            }
          ]
        },
        {
          "name": "crowding_tag",
          "type": [
            "null",
            "string"
          ]
        }
      ]
    }
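
As an illustration, the following sketch writes one conforming record with the fastavro library (an assumption; any standard Avro writer works). The file name and values are placeholders.

    # Sketch: write an Avro record conforming to the schema above using fastavro
    # (an assumed third-party library; any standard Avro writer works).
    from fastavro import parse_schema, writer

    schema = parse_schema({
        "type": "record",
        "name": "FeatureVector",
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "embedding", "type": {"type": "array", "items": "float"}},
            {"name": "restricts", "type": ["null", {"type": "array", "items": {
                "type": "record", "name": "Restrict", "fields": [
                    {"name": "namespace", "type": "string"},
                    {"name": "allow", "type": ["null", {"type": "array", "items": "string"}]},
                    {"name": "deny", "type": ["null", {"type": "array", "items": "string"}]},
                ]}}]},
            {"name": "crowding_tag", "type": ["null", "string"]},
        ],
    })

    records = [
        {
            "id": "1",
            "embedding": [0.1, 0.2, 0.3],
            "restricts": [{"namespace": "color", "allow": ["red"], "deny": None}],
            "crowding_tag": "store_1",
        },
    ]

    with open("feature_file_1.avro", "wb") as out:
        writer(out, schema, records)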
    

What's next