To build a new index or update an existing index, provide vectors to Vector Search in the format and structure described in the following sections.
Input data storage and file organization
Prerequisite
Store your input data in a Cloud Storage bucket in your Google Cloud project.
Input data files should be organized as follows:
- Each batch of input data files should be under a single Cloud Storage directory (referred to below as batch_root).
- Data files should be placed directly under batch_root and named with one of the following suffixes: .csv, .json, or .avro.
- Each data file is interpreted as a set of records. The format of each record is determined by the suffix of the file name; the format requirements are described below.
- Each record should have an id, a feature vector, and any optional fields supported by Vertex AI Feature Store, such as restricts and crowding.
- A subdirectory named delete may be present. Each file directly under batch_root/delete is treated as a text file of id records, with one id per line.
- All other directories and files are ignored.
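The layout rules above can be checked locally before uploading. The following is a minimal sketch, not part of any Google Cloud library: the function name classify and the default directory name batch_root are hypothetical, and treating suffixes as case-sensitive is an assumption.

```python
from pathlib import PurePosixPath

# Suffixes that mark a regular data file (case-sensitive matching is an assumption).
DATA_SUFFIXES = {".csv", ".json", ".avro"}


def classify(object_path: str, batch_root: str = "batch_root") -> str:
    """Return 'data', 'delete', or 'ignored' for an object path (hypothetical helper)."""
    path = PurePosixPath(object_path)
    try:
        relative = path.relative_to(batch_root)
    except ValueError:
        return "ignored"  # not inside the batch directory at all
    parts = relative.parts
    if len(parts) == 1 and path.suffix in DATA_SUFFIXES:
        return "data"  # data file placed directly under batch_root
    if len(parts) == 2 and parts[0] == "delete":
        return "delete"  # file placed directly under batch_root/delete
    return "ignored"  # nested directories and unrecognized files


for name in ("batch_root/feature_file_1.csv",
             "batch_root/delete/delete_file.txt",
             "batch_root/nested/extra.json"):
    print(name, "->", classify(name))
```

Running a check like this over a bucket listing makes it easy to spot files that would be silently ignored.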
Input data processing
- All records from all data files, including those under delete, comprise a single batch of input.
- The relative ordering of records within a data file is not important.
- A single ID should appear only once in a batch. If an ID is duplicated, the duplicates are counted as a single vector.
- An ID cannot appear both in a regular data file and a delete data file.
- Every ID listed in a data file under delete is removed from the next index version.
- Records from regular data files are included in the next index version, overwriting any value for the same ID in an older index version.
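The processing rules above can be modeled with plain dictionaries. This is only an illustrative sketch: find_batch_problems and next_version are hypothetical names, the index is simplified to an {id: embedding} mapping, and reporting duplicates as problems is a stricter local choice than the service's own behavior.

```python
from collections import Counter


def find_batch_problems(records, delete_ids):
    """List rule violations in one batch (hypothetical local pre-check)."""
    problems = []
    counts = Counter(record["id"] for record in records)
    for dup in sorted(i for i, n in counts.items() if n > 1):
        problems.append(f"duplicate id: {dup}")
    for both in sorted(set(counts) & set(delete_ids)):
        problems.append(f"id in both data and delete files: {both}")
    return problems


def next_version(previous, records, delete_ids):
    """Model the next index version as a new {id: embedding} dict."""
    updated = dict(previous)
    for record_id in delete_ids:
        updated.pop(record_id, None)  # IDs under delete drop out
    for record in records:
        updated[record["id"]] = record["embedding"]  # add or overwrite
    return updated


previous = {"1": [0, 0, 0], "3": [3, 3, 3]}
records = [{"id": "1", "embedding": [1, 1, 1]},
           {"id": "2", "embedding": [2, 2, 2]}]
print(find_batch_problems(records, ["3"]))
print(next_version(previous, records, ["3"]))
```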
Here is a .json example:
{"id": "1", "embedding": [1,1,1]}
{"id": "2", "embedding": [2,2,2]}
The following is an example of a valid input data file organization:
batch_root/
  csv/
  feature_file_1.csv
  feature_file_2.csv
  delete/
    delete_file.txt
The feature_file_1.csv and feature_file_2.csv files contain records in .csv format. The delete_file.txt file contains a list of record IDs to be deleted from the next index version.
Data file formats
CSV
- Encode the file using UTF-8.
- Make each line valid CSV, to be interpreted as a single record.
- Make the first value the id, which must be a valid UTF-8 string.
- Make the next N values the components of the feature vector, where N is the dimension configured when creating the index. Make each value a floating point literal as defined in the Java language spec.
- Make the rest of the values of the form name=value.
  - If name is crowding_tag, the value is interpreted as the crowding tag. crowding_tag is allowed to appear at most once in each record.
  - All other names are interpreted as a namespace in restricts. The same name may be repeated if there are multiple values in a namespace. For example, color=red,color=blue represents the TokenNamespace:
    { "namespace": "color", "string_tokens": ["red", "blue"] }
  - If value starts with "!", the rest of the string is interpreted as an excluded value. For example, color=!red represents the TokenNamespace:
    { "namespace": "color", "string_blacklist_tokens": ["red"] }
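To make the CSV rules concrete, here is a rough sketch of how one line could be split into an ID, a vector, restricts, and a crowding tag. parse_csv_record is a hypothetical helper, not the parser Vector Search uses. The output keys allow and deny simply mirror the JSON format on this page, and Python's float() accepts a slightly different set of literals than the Java language spec, so treat this as an approximation.

```python
import csv


def parse_csv_record(line: str, dimensions: int) -> dict:
    """Split one CSV line into id, embedding, restricts, and crowding tag."""
    values = next(csv.reader([line]))
    vector = values[1:1 + dimensions]
    if len(vector) != dimensions:
        raise ValueError(f"expected {dimensions} vector values, got {len(vector)}")
    record = {"id": values[0], "embedding": [float(v) for v in vector]}
    restricts = {}
    for item in values[1 + dimensions:]:
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {item!r}")
        if name == "crowding_tag":
            if "crowding_tag" in record:
                raise ValueError("crowding_tag may appear at most once")
            record["crowding_tag"] = value
            continue
        # Every other name is a restricts namespace; repeated names accumulate.
        namespace = restricts.setdefault(
            name, {"namespace": name, "allow": [], "deny": []})
        if value.startswith("!"):
            namespace["deny"].append(value[1:])  # "!" marks an excluded value
        else:
            namespace["allow"].append(value)
    if restricts:
        record["restricts"] = list(restricts.values())
    return record


print(parse_csv_record("42,0.1,0.2,0.3,color=red,color=!blue,crowding_tag=a", 3))
```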
JSON
- Encode the file using UTF-8.
- Make each line a valid JSON object, to be interpreted as a record.
- Include in each record a field named id whose value is a valid UTF-8 string. This is the ID of the vector.
- Include in each record a field named embedding whose value is an array of numbers. This is the feature vector.
- The value of the field restricts, if present, should be an array of objects, each of which is turned into a TokenNamespace in restricts.
  - Each object must have a field named namespace. This is the TokenNamespace.namespace.
  - The value of the field allow, if present, should be an array of strings. This is the list of TokenNamespace.string_tokens.
  - The value of the field deny, if present, should be an array of strings. This is the list of TokenNamespace.string_blacklist_tokens.
- The value of the field crowding_tag, if present, should be a string.
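As an illustration of the JSON rules above, the sketch below serializes one record per line using only the Python standard library. to_json_line is a hypothetical helper name. json.dumps emits ASCII-only output by default, which is also valid UTF-8, and without an indent argument it produces no line breaks.

```python
import json


def to_json_line(record_id, embedding, restricts=None, crowding_tag=None) -> str:
    """Serialize one record as a single-line JSON object (hypothetical helper)."""
    record = {"id": str(record_id), "embedding": [float(x) for x in embedding]}
    if restricts:
        record["restricts"] = restricts  # list of {"namespace", "allow", "deny"} objects
    if crowding_tag is not None:
        record["crowding_tag"] = str(crowding_tag)
    return json.dumps(record)


line = to_json_line(
    "1",
    [1, 1, 1],
    restricts=[{"namespace": "color", "allow": ["red"], "deny": ["blue"]}],
    crowding_tag="a",
)
print(line)
```

Writing each returned string followed by a newline to a file with the .json suffix yields one record per line.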
AVRO
- Use a valid Avro file.
- Make records that conform to the following schema:
{
  "type": "record",
  "name": "FeatureVector",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "embedding", "type": { "type": "array", "items": "float" } },
    {
      "name": "restricts",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "Restrict",
            "fields": [
              { "name": "namespace", "type": "string" },
              { "name": "allow", "type": [ "null", { "type": "array", "items": "string" } ] },
              { "name": "deny", "type": [ "null", { "type": "array", "items": "string" } ] }
            ]
          }
        }
      ]
    },
    { "name": "crowding_tag", "type": [ "null", "string" ] }
  ]
}
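Before handing records to an Avro writer, it can help to check that plain Python dictionaries have the shape the schema above describes. The following is a loose structural sketch using only the standard library. conforms is a hypothetical helper, it treats a missing optional field the same as null (an assumption; an Avro writer may require the key to be present), and it does not replace validation by a real Avro library.

```python
def conforms(record: dict) -> bool:
    """Loosely check a dict against the FeatureVector schema (hypothetical helper)."""

    def is_string_list(value) -> bool:
        return isinstance(value, list) and all(isinstance(s, str) for s in value)

    if not isinstance(record.get("id"), str):
        return False
    embedding = record.get("embedding")
    # Strict on purpose: only Python floats count as Avro "float" items here.
    if not (isinstance(embedding, list)
            and all(isinstance(x, float) for x in embedding)):
        return False
    restricts = record.get("restricts")
    if restricts is not None:
        if not isinstance(restricts, list):
            return False
        for restrict in restricts:
            if not isinstance(restrict, dict):
                return False
            if not isinstance(restrict.get("namespace"), str):
                return False
            for key in ("allow", "deny"):
                value = restrict.get(key)
                if value is not None and not is_string_list(value):
                    return False
    tag = record.get("crowding_tag")
    return tag is None or isinstance(tag, str)


example = {
    "id": "1",
    "embedding": [1.0, 1.0, 1.0],
    "restricts": [{"namespace": "color", "allow": ["red"], "deny": None}],
    "crowding_tag": None,
}
print(conforms(example))
```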
What's next
- Learn how to Create and manage your index