Improve search results with search tuning

Tuned search can give you better quality results than the default search.

Search tuning is particularly valuable if you have industry-specific or company-specific queries that general LLMs address less well. Tuning further trains the search model on the training data that you provide.

About training data

To tune a search model, you start by putting together training data.

The training data should contain queries that you expect your end users to ask, together with snippets of text, 250 to 500 words long, that contain the information needed to answer those queries. A query can be associated with multiple snippets as long as each snippet contains information that answers the query.

The training data should also contain snippets of text that are not paired with queries, but are otherwise similar to the answers in style and length. These snippets without associated queries provide random negatives to tune the model. Google recommends that you provide at least 10,000 of these snippets.

Here is some terminology to describe the training data that you'll need to provide:

  • Training queries: Queries that you anticipate your end users asking. Make sure to focus on queries with specific domain or company terminology.

    Provide at least 100.

  • Extractive segments: Snippets (typically multiple paragraphs) that are taken verbatim from the documents in the data store. All the documents together in the data store are referred to as "the corpus".

    You must provide two types of extractive segments:

    • Segments that contain relevant information needed to answer the training queries. These are segments that have positive matching with queries.

    • Segments that are not associated with any training queries. These segments are used as random negatives in the model tuning.

    A sentence or two is not sufficiently long to be an extractive segment; the segment needs to contain enough context for training. For example, in response to a query like "who founded Google," a short extract like "Larry Page" is insufficient. For examples of sufficiently long segments, see the following table of examples.

    Provide at least one extractive segment per query and at least 10,000 additional extractive segments.

  • Relevance scores: Relevance scores are non-negative integers that estimate how relevant the extractive segment is to the query. You provide a score value for each query and extractive segment pair. A score of 0 means that the extractive segment isn't relevant to the query at all. A score greater than zero indicates some relevance. For simple scoring, Google recommends 1 for all relevant segments and 0 for non-relevant ones. Alternatively, if you want to rank relevance, you can assign relevance scores of 0 to 10 (for example), with 10 for the most relevant segments and 0 for segments that are not at all relevant.

    Provide at least 100 scores for relevant pairs (scores greater than zero) and, optionally, additional scores of 0 for non-relevant pairs.
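The minimums listed above can be checked before you build any files. The following Python sketch is illustrative only: LabeledPair and check_minimums are hypothetical names chosen for this example, not part of any Vertex AI Search library, and the sketch assumes that your queries, segments, and scored pairs are already held in memory.

```python
from dataclasses import dataclass

# Minimums stated in the guidance above.
MIN_QUERIES = 100
MIN_RELEVANT_SCORES = 100
MIN_RANDOM_NEGATIVES = 10_000


@dataclass(frozen=True)
class LabeledPair:
    """One scored query and extractive segment pair (hypothetical helper type)."""
    query_id: str
    segment_id: str
    score: int  # non-negative integer; 0 means "not relevant"


def check_minimums(query_ids, segment_ids, pairs):
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []

    if len(set(query_ids)) < MIN_QUERIES:
        problems.append(f"need at least {MIN_QUERIES} training queries")

    if any(pair.score < 0 for pair in pairs):
        problems.append("relevance scores must be non-negative integers")

    relevant = [pair for pair in pairs if pair.score > 0]
    if len(relevant) < MIN_RELEVANT_SCORES:
        problems.append(f"need at least {MIN_RELEVANT_SCORES} relevant scores")

    # Every query needs at least one scored extractive segment.
    unlabeled = set(query_ids) - {pair.query_id for pair in pairs}
    if unlabeled:
        problems.append(f"{len(unlabeled)} queries have no extractive segment")

    # Segments that are never paired with a query serve as random negatives.
    negatives = set(segment_ids) - {pair.segment_id for pair in pairs}
    if len(negatives) < MIN_RANDOM_NEGATIVES:
        problems.append(f"need at least {MIN_RANDOM_NEGATIVES} unpaired segments")

    return problems
```

An empty result means that the data meets the stated minimums; it says nothing about the quality of the queries or segments themselves.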

Examples of query and extractive segment pairs

The following table provides some examples of query and extractive segment pairs. These general examples have been taken from Wikipedia. However, for useful tuning, you'll want to supply documents from proprietary data sets that contain information specific to your business and that are less easily found on the web.

The last pair in this table is an example of a zero-score pair, where the answer is not relevant to the query.

Training query | Extractive segment | Score
who founded Google? | Google was founded on September 4, 1998, by American computer scientists Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet Inc. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet. [...] On the list of most valuable brands, Google is ranked second by Forbes and fourth by Interbrand. | 1
where is blood pumped after it leaves the right ventricle? | Oxygenated blood leaves the lungs through pulmonary veins, which return it to the left part of the heart, completing the pulmonary cycle. This blood then enters the left atrium, which pumps it through the mitral valve into the left ventricle. From the left ventricle, the blood passes through the aortic valve to the aorta. The blood is then distributed to the body through the systemic circulation before returning again to the pulmonary circulation. Arteries Main article: Pulmonary artery From the right ventricle, blood is pumped through the semilunar pulmonary valve into the left and right main pulmonary artery (one for each lung), which branch into smaller pulmonary arteries that spread throughout the lungs. [...] Cardiac shunt is an unnatural connection between parts of the heart that leads to blood flow that bypasses the lungs. | 1
where is the bowling hall of fame located? | The World Bowling Writers ( WBW ) International Bowling Hall of Fame was established in 1993 and is located in the International Bowling Museum and Hall of Fame , on the International Bowling Campus in Arlington , Texas. History The International Bowling Museum and Hall of Fame was located at 11 Stadium Plaza, St. Louis, Missouri, USA, and shared the same building with the St. Louis Cardinals Hall of Fame Museum, until November 8, 2008. It moved to Arlington and reopened in early 2010. In 2012, the WBW was merged with the International Bowling Media Association. After the merger, the WBW Hall of Fame inductees became part of the IBMA Luby Hall of Fame. officers of the World Bowling Writers, which formed the Hall's Board.][...] The man and woman who receive the most votes are elected. | 1
why is the sky blue? | A "Hello, World!" program is generally a simple computer program which outputs (or displays) to the screen (often the console) a message similar to "Hello, World!" while ignoring any user input. A small piece of code in most general-purpose programming languages, this program is used to illustrate a language's basic syntax. A "Hello, World!" program is often the first written by a student of a new programming language, but such a program can also be used as a check to ensure that the computer software intended to compile or run source code is correctly installed, and that its operator understands how to use it. [...] The C-language version was preceded by Kernighan's own 1972 A Tutorial Introduction to the Language B, where the first known version of the program is found in an example used to illustrate external variables. | 0

About testing

After training, the tuned search is tested to determine if tuning improved the results. You can explicitly provide the queries that you want tested. If you don't provide test queries, then Vertex AI Search uses 20% of the training queries as test queries.

Training files

The training data needs to be uploaded in three (optionally four) specific files:

  • A corpus file that contains the extractive segments

  • A query file that contains only the queries

  • A training labels file that connects queries with segments and contains the relevance scores

  • Optional: A test labels file that is similar to the training labels files but is used for evaluating the tuned model instead of training it

The three training files (corpus file, query file, and training labels file) and the (optional) test labels file need to be in Cloud Storage. The paths of the files are defined by fields in the trainCustomModel call.

Corpus file

The corpus file contains extractive segments: segments that contain information to answer the queries in the query file and many additional segments to be used as random negatives when tuning the model. You should have at least 100 segments that contain query answers; queries can be answered by multiple segments. You should also have at least 10,000 random segments.

If the documents in your data store contain fewer than 500 words, you can use whole documents as segments. Otherwise, programmatically create random segments of 250–500 words from the documents in your data store and add those to the corpus file.
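One possible way to create such segments is sketched below in Python. This is a minimal illustration, not a prescribed method: the function name split_into_segments, the whitespace-based definition of a word, and the handling of a short final tail (the last segment is allowed to overlap the previous one) are all assumptions made for this example. Segments are sliced from the original string so that they remain verbatim.

```python
import random
import re


def split_into_segments(text, min_words=250, max_words=500, seed=0):
    """Cut text into verbatim segments of min_words to max_words words."""
    # Record the character span of every whitespace-delimited word so that
    # segments can be sliced from the original text without altering it.
    spans = [match.span() for match in re.finditer(r"\S+", text)]
    if not spans:
        return []
    if len(spans) < max_words:
        # Short document: use the whole document as a single segment.
        return [text[spans[0][0]:spans[-1][1]]]

    rng = random.Random(seed)  # seeded so that runs are repeatable
    segments = []
    start = 0
    while start < len(spans):
        end = min(start + rng.randint(min_words, max_words), len(spans))
        if end - start < min_words:
            # The remaining tail is too short. Back up so that the final
            # segment still has min_words words; it overlaps the previous one.
            start = end - min_words
        segments.append(text[spans[start][0]:spans[end - 1][1]])
        start = end
    return segments
```

Because each segment is a slice of the input, whitespace and punctuation inside a segment are exactly as they appear in the source document.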

The corpus file is a JSONL (JSON lines) file where each line has the fields _id and text with string values. The maximum size of the file is 250,000 lines.

For example:

  {"_id": "doc1", "text": "Google was founded on September 4, 1998, by American computer scientists Larry Page and Sergey Brin while they were PhD students at Stanford University in California. Together they own about 14% of its publicly listed shares and control 56% of its stockholder voting power through super-voting stock. The company went public via an initial public offering (IPO) in 2004. In 2015, Google was reorganized as a wholly owned subsidiary of Alphabet Inc. Google is Alphabet's largest subsidiary and is a holding company for Alphabet's internet properties and interests. Sundar Pichai was appointed CEO of Google on October 24, 2015, replacing Larry Page, who became the CEO of Alphabet. On December 3, 2019, Pichai also became the CEO of Alphabet. [...] On the list of most valuable brands, Google is ranked second by Forbes and fourth by Interbrand."}
  {"_id": "doc2", "text": "Oxygenated blood leaves the lungs through pulmonary veins, which return it to the left part of the heart, completing the pulmonary cycle. This blood then enters the left atrium, which pumps it through the mitral valve into the left ventricle. From the left ventricle, the blood passes through the aortic valve to the aorta. The blood is then distributed to the body through the systemic circulation before returning again to the pulmonary circulation. Arteries Main article: Pulmonary artery From the right ventricle, blood is pumped through the semilunar pulmonary valve into the left and right main pulmonary artery (one for each lung), which branch into smaller pulmonary arteries that spread throughout the lungs. [...] Cardiac shunt is an unnatural connection between parts of the heart that leads to blood flow that bypasses the lungs."}
  {"_id": "doc3", "text": "The World Bowling Writers ( WBW ) International Bowling Hall of Fame was established in 1993 and is located in the International Bowling Museum and Hall of Fame , on the International Bowling Campus in Arlington , Texas. History The International Bowling Museum and Hall of Fame was located at 11 Stadium Plaza, St. Louis, Missouri, USA, and shared the same building with the St. Louis Cardinals Hall of Fame Museum, until November 8, 2008. It moved to Arlington and reopened in early 2010. In 2012, the WBW was merged with the International Bowling Media Association. After the merger, the WBW Hall of Fame inductees became part of the IBMA Luby Hall of Fame.  officers of the World Bowling Writers, which formed the Hall's Board.][...] The man and woman who receive the most votes are elected."}
  {"_id": "doc4", "text": "A \"Hello, World!\" program is generally a simple computer program which outputs (or displays) to the screen (often the console) a message similar to \"Hello, World!\" while ignoring any user input. A small piece of code in most general-purpose programming languages, this program is used to illustrate a language's basic syntax. A \"Hello, World!\" program is often the first written by a student of a new programming language, but such a program can also be used as a check to ensure that the computer software intended to compile or run source code is correctly installed, and that its operator understands how to use it. [...] The C-language version was preceded by Kernighan's own 1972 A Tutorial Introduction to the Language B, where the first known version of the program is found in an example used to illustrate external variables."}
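Segments often contain double quotes or line breaks, which must be escaped for each line to be valid JSON. The following Python sketch shows one way to produce such lines with the standard json module. The function name to_jsonl and the (id, text) input shape are assumptions made for this example.

```python
import json

MAX_CORPUS_LINES = 250_000  # maximum corpus file size stated above


def to_jsonl(records):
    """Serialize (id, text) pairs as JSONL with the fields _id and text."""
    lines = []
    for record_id, text in records:
        # json.dumps escapes quotes, backslashes, and newlines in the text,
        # so every record stays on one line and parses as valid JSON.
        lines.append(json.dumps({"_id": record_id, "text": text},
                                ensure_ascii=False))
    if len(lines) > MAX_CORPUS_LINES:
        raise ValueError(f"{len(lines)} lines exceeds {MAX_CORPUS_LINES}")
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    sample = [("doc4", 'A "Hello, World!" program is a simple program.')]
    print(to_jsonl(sample), end="")
```

When you write the result to a file, open the file with encoding="utf-8" so that non-ASCII characters kept by ensure_ascii=False are stored correctly.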

Query file

The query file contains the example queries that will be used for tuning the model. Each query should have one or more corresponding extractive segments in the corpus file. You should provide at least 100 positive match queries. You can also provide non-relevant queries: these are queries that correspond to extractive segments with a relevance score of zero.

The query file is in JSONL format and has the same fields as the corpus file.

For example:

  {"_id": "query1", "text": "who founded Google?"}
  {"_id": "query2", "text": "where is blood pumped after it leaves the right ventricle?"}
  {"_id": "query3", "text": "where is the bowling hall of fame located?"}
  {"_id": "query4", "text": "why is the sky blue?"}

Training labels

The training labels file connects the queries with the extractive segments and scores each query and segment pair.

If the test labels file is not present, then 20% of the queries in the training labels file are reserved for evaluating the tuned model after training.

Each line of the file contains the ID of a query, the ID of its matching (or non-matching) extractive segment, and a score for the relevance of the segment to the query. There must be at least one line per query; if a query is answered by two segments, then there are two lines for that query. The score is a non-negative integer. Any score greater than zero indicates that the segment is related to the query, and larger numbers indicate greater relevance. If the score is omitted, the default value is 1.

The training labels file is a TSV (tab-separated values) file with a header. The file must have the columns query-id, corpus-id and score. The query-id is a string that matches the _id key from the query file, and the corpus-id is a string that matches the _id in the corpus file.

For example:

  query-id  corpus-id   score
  query1    doc1    1
  query2    doc2    1
  query3    doc3    1
  query3    doc9    1
  query4    doc4    0
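A labels file like the one above can be generated with Python's standard csv module. The sketch below is illustrative: labels_to_tsv is a hypothetical helper, and the checks it performs (known IDs, non-negative integer scores) simply mirror the requirements described in this section.

```python
import csv
import io


def labels_to_tsv(rows, query_ids, corpus_ids):
    """Render (query_id, corpus_id, score) rows as TSV text with a header."""
    known_queries = set(query_ids)
    known_segments = set(corpus_ids)

    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["query-id", "corpus-id", "score"])

    for query_id, corpus_id, score in rows:
        # Each ID must match an _id in the query file or the corpus file.
        if query_id not in known_queries:
            raise ValueError(f"unknown query-id: {query_id}")
        if corpus_id not in known_segments:
            raise ValueError(f"unknown corpus-id: {corpus_id}")
        if not isinstance(score, int) or score < 0:
            raise ValueError(f"score must be a non-negative integer: {score!r}")
        writer.writerow([query_id, corpus_id, score])

    return buffer.getvalue()
```

The same function can produce the optional test labels file, because both files share one format.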

Test labels

Like the training labels file, this optional file contains the IDs of queries and extractive segments, along with relevance scores. It contains fewer queries than the training labels file, and its queries are different from the training queries. If present, the query and extractive segment pairs in the file are used to evaluate the tuning. If the test labels file is not present, then query and extractive segment pairs from the training labels file are used for evaluating.

This file has the same format as the training labels file.

For example:

  query-id  corpus-id   score
  query200    doc200    1
  query201    doc201    1
  query202    doc202    1
  query203    doc203    1
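If you want to control which queries are used for evaluation, you can split your labeled pairs yourself. The following Python sketch holds out a fraction of the queries, keeping all rows for a given query on the same side of the split. The function name, the seeded shuffle, and the default fraction of 0.2 (chosen to mirror the 20% that is reserved automatically) are assumptions made for this example.

```python
import random


def split_labels_by_query(rows, test_fraction=0.2, seed=0):
    """Split (query_id, corpus_id, score) rows into (train, test) lists.

    No query appears in both lists, so the test queries differ from
    the training queries.
    """
    query_ids = sorted({row[0] for row in rows})
    if not query_ids:
        return [], []

    rng = random.Random(seed)  # seeded so that the split is repeatable
    rng.shuffle(query_ids)

    test_count = max(1, round(len(query_ids) * test_fraction))
    test_queries = set(query_ids[:test_count])

    train = [row for row in rows if row[0] not in test_queries]
    test = [row for row in rows if row[0] in test_queries]
    return train, test
```

Splitting by query rather than by row avoids evaluating on a query that the model already saw with a different segment during training.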

To tune a data store for search with your own training data, follow these steps.

  1. Prepare your training data (and optionally, your test data) files. Use the formats described in Training files.

  2. Place the files into a Cloud Storage bucket.

  3. Upload the files from the Cloud Storage bucket into Vertex AI Search by running the following curl command:

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID:trainCustomModel" \
    -d '{
    "gcsTrainingInput": {
     "corpusDataPath": "CORPUS_JSONL_GCS_PATH",
     "queryDataPath": "QUERY_JSONL_GCS_PATH",
     "trainDataPath": "TRAIN_JSONL_GCS_PATH",
     "testDataPath": "TEST_JSONL_GCS_PATH"
    },
    "modelType": "search-tuning"
    }'
    

    Replace the following:

    • PROJECT_ID: The project number or ID of your Google Cloud project.

    • DATA_STORE_ID: The ID of the data store that you want to tune.

    • CORPUS_JSONL_GCS_PATH: The corpus JSONL file path in Cloud Storage, for example, gs://my-bucket/corpus.jsonl.

    • QUERY_JSONL_GCS_PATH: The query JSONL file path in Cloud Storage, for example, gs://my-bucket/query.jsonl.

    • TRAIN_JSONL_GCS_PATH: The training labels TSV file path in Cloud Storage, for example, gs://my-bucket/train.tsv.

    • TEST_JSONL_GCS_PATH: An optional field. If you have a test labels TSV file, specify its path in Cloud Storage, for example, gs://my-bucket/test.tsv. If you don't have a test labels file, omit the testDataPath field and its value from the curl command.

    For general information about this method, see trainCustomModel.

    Tuning begins automatically after the data files are uploaded.

    The following is an example curl command and its response:

    curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/12345/locations/global/collections/default_collection/dataStores/my-data-store_4321:trainCustomModel" \
    -d '{
    "dataStore":
    "projects/12345/locations/global/collections/default_collection/dataStores/my-data-store_4321",
    "gcsTrainingInput": {
    "corpusDataPath": "gs://my-bucket/corpus.jsonl",
    "queryDataPath": "gs://my-bucket/query.jsonl",
    "trainDataPath": "gs://my-bucket/train.tsv"
    },
    "modelType": "search-tuning"
    }'

    {
    "name": "projects/12345/locations/global/collections/default_collection/dataStores/my-data-store_4321/operations/train-custom-model-6071430366161939774",
    "metadata": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.TrainCustomModelMetadata"
    },
    "response": {
    "@type": "type.googleapis.com/google.cloud.discoveryengine.v1alpha.TrainCustomModelResponse",
    "modelStatus": "in-progress"
    }
    }
    
  4. Make note of the name value returned by the trainCustomModel method, and follow the instructions in Get details about a long-running operation to see when the search-tuning operation is complete.

  5. After tuning is complete, to start using the tuned search, run the following curl command:

    curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/servingConfigs/default_search?updateMask=customFineTuningSpec.enableSearchAdaptor" \
    -d '{
    "customFineTuningSpec": {
     "enableSearchAdaptor": true
    }
    }'
    

    For general information about this method, see servingConfigs.patch.
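If you script these calls rather than typing curl commands, the request URLs and JSON bodies from steps 3 and 5 can be assembled as in the following Python sketch. The function names are hypothetical, and the sketch only builds the requests; sending them with an HTTP client and an Authorization header is left out. The URL patterns and field names are copied from the curl commands above.

```python
import json

API_ROOT = "https://discoveryengine.googleapis.com/v1alpha"


def data_store_path(project_id, data_store_id):
    """Resource path of a data store in the default collection."""
    return (f"projects/{project_id}/locations/global/collections/"
            f"default_collection/dataStores/{data_store_id}")


def train_request(project_id, data_store_id, corpus, queries, train, test=None):
    """Return (url, body) for the trainCustomModel call in step 3."""
    gcs_input = {
        "corpusDataPath": corpus,
        "queryDataPath": queries,
        "trainDataPath": train,
    }
    if test is not None:
        # Include testDataPath only when a test labels file exists.
        gcs_input["testDataPath"] = test
    url = f"{API_ROOT}/{data_store_path(project_id, data_store_id)}:trainCustomModel"
    return url, {"gcsTrainingInput": gcs_input, "modelType": "search-tuning"}


def serving_config_request(project_id, data_store_id, enabled):
    """Return (url, body) for the servingConfigs PATCH call in step 5."""
    url = (f"{API_ROOT}/{data_store_path(project_id, data_store_id)}"
           "/servingConfigs/default_search"
           "?updateMask=customFineTuningSpec.enableSearchAdaptor")
    return url, {"customFineTuningSpec": {"enableSearchAdaptor": enabled}}


if __name__ == "__main__":
    url, body = train_request(
        "12345", "my-data-store_4321",
        "gs://my-bucket/corpus.jsonl",
        "gs://my-bucket/query.jsonl",
        "gs://my-bucket/train.tsv",
    )
    print(url)
    print(json.dumps(body, indent=2))
```

Calling serving_config_request with enabled set to False produces the body that turns tuned search off again.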

Turn off tuned search

If you no longer want to use the tuned version of search (for example, because the results are no better, or are worse, than before tuning), you can disable the tuned search.

To stop using the tuned search, run a curl call similar to the preceding one, but set enableSearchAdaptor to false:

  1. Run the following curl command:

    curl -X PATCH \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -H "X-Goog-User-Project: PROJECT_ID" \
    "https://discoveryengine.googleapis.com/v1alpha/projects/PROJECT_ID/locations/global/collections/default_collection/dataStores/DATA_STORE_ID/servingConfigs/default_search?updateMask=customFineTuningSpec.enableSearchAdaptor" \
    -d '{
    "customFineTuningSpec": {
     "enableSearchAdaptor": false
    }
    }'
    

    Replace the following:

    • PROJECT_ID: The project number or ID of your Google Cloud project.

    • DATA_STORE_ID: The ID of the tuned data store for which you want to turn off tuned search.

    For general information about this method, see servingConfigs.patch.