RAG Engine overview

RAG Engine, formerly called LlamaIndex on Vertex AI and most recently Knowledge Engine, is a data framework for developing context-augmented large language model (LLM) applications. Context augmentation occurs when you apply an LLM to your data, which implements retrieval-augmented generation (RAG).

A common problem with LLMs is that they have no knowledge of private data, that is, your organization's data. With RAG Engine, you can enrich the LLM context with additional private information, which helps the model reduce hallucinations and answer questions more accurately.

Combining additional knowledge sources with the LLM's existing knowledge provides better context. The improved context, along with the query, enhances the quality of the LLM's response.

The following concepts are key to understanding RAG Engine. They are listed in the order of the retrieval-augmented generation (RAG) process, and a minimal code sketch of the end-to-end flow follows the list.

  1. Data ingestion: Intake data from different data sources. For example, local files, Cloud Storage, and Google Drive.

  2. Data transformation: Conversion of the data in preparation for indexing. For example, data is split into chunks.

  3. Embedding: Numerical representations of words or pieces of text. These numbers capture the semantic meaning and context of the text. Similar or related words or text tend to have similar embeddings, which means they are closer together in the high-dimensional vector space.

  4. Data indexing: RAG Engine creates an index called a corpus. The index structures the knowledge base so it's optimized for searching. For example, the index is like a detailed table of contents for a massive reference book.

  5. Retrieval: When a user asks a question or provides a prompt, the retrieval component in RAG Engine searches through its knowledge base to find information that is relevant to the query.

  6. Generation: The retrieved information becomes the context added to the original user query as a guide for the generative AI model to generate factually grounded and relevant responses.
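
The following minimal sketch walks through these steps with the Vertex AI SDK for Python. It assumes the vertexai.preview.rag preview module; the project ID, region, Cloud Storage path, and display names are placeholders.

  # A minimal end-to-end sketch of the RAG process above.
  # Assumes the Vertex AI SDK for Python with the preview rag module;
  # PROJECT_ID, the region, and the Cloud Storage path are placeholders.
  import vertexai
  from vertexai.preview import rag
  from vertexai.preview.generative_models import GenerativeModel, Tool

  vertexai.init(project="PROJECT_ID", location="us-central1")

  # Steps 1-4: ingest, transform, embed, and index documents into a corpus.
  rag_corpus = rag.create_corpus(display_name="my-corpus")
  rag.import_files(
      rag_corpus.name,
      ["gs://BUCKET_NAME/my-docs/"],
      chunk_size=1024,
      chunk_overlap=200,
  )

  # Step 5: expose retrieval from the corpus as a tool.
  rag_retrieval_tool = Tool.from_retrieval(
      retrieval=rag.Retrieval(
          source=rag.VertexRagStore(
              rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
              similarity_top_k=3,
          )
      )
  )

  # Step 6: generate a response grounded in the retrieved contexts.
  rag_model = GenerativeModel("gemini-1.5-flash-002", tools=[rag_retrieval_tool])
  response = rag_model.generate_content("What does my data say about X?")
  print(response.text)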

Generative AI models that support RAG

This section lists models that support RAG.

Gemini models

The following table lists the Gemini models and their versions that support RAG Engine:

Model                 | Version
Gemini 1.5 Flash      | gemini-1.5-flash-002, gemini-1.5-flash-001
Gemini 1.5 Pro        | gemini-1.5-pro-002, gemini-1.5-pro-001
Gemini 1.0 Pro        | gemini-1.0-pro-001, gemini-1.0-pro-002
Gemini 1.0 Pro Vision | gemini-1.0-pro-vision-001
Gemini                | gemini-experimental

Self-deployed models

RAG Engine supports all models in Model Garden.

Use RAG Engine with your self-deployed open model endpoints, as in the following example:

  # Assumes the Vertex AI SDK for Python and a rag_retrieval_tool built from
  # your corpus (see the end-to-end sketch earlier on this page).
  from vertexai.preview.generative_models import GenerativeModel

  # Create a model instance with your self-deployed open model endpoint
  rag_model = GenerativeModel(
      "projects/PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID",
      tools=[rag_retrieval_tool],
  )

Models with managed APIs on Vertex AI

The models with managed APIs on Vertex AI that support RAG Engine include Llama 3.1, which is used in the following examples.

The following code sample demonstrates how to use the Gemini GenerateContent API to create a generative model instance. The model ID, publishers/meta/models/llama-3.1-405b-instruct-maas, is found in the model card.

  # Assumes the Vertex AI SDK for Python and a rag_retrieval_tool built from
  # your corpus.
  from vertexai.preview.generative_models import GenerativeModel

  # Create a model instance with the Llama 3.1 MaaS endpoint
  rag_model = GenerativeModel(
      "projects/PROJECT_ID/locations/REGION/publishers/meta/models/llama-3.1-405b-instruct-maas",
      tools=[rag_retrieval_tool],
  )

The following code sample demonstrates how to use the OpenAI compatible ChatCompletions API to generate a model response.

  # Assumes an OpenAI client configured for the Vertex AI OpenAI-compatible
  # endpoint, and rag_corpus_resource set to your corpus resource name.
  # Generate a response with the Llama 3.1 MaaS endpoint
  response = client.chat.completions.create(
      model="meta/llama-3.1-405b-instruct-maas",
      messages=[{"role": "user", "content": "your-query"}],
      # The outer extra_body is the OpenAI client's pass-through into the
      # request JSON; the inner extra_body is a Vertex AI field.
      extra_body={
          "extra_body": {
              "google": {
                  "vertex_rag_store": {
                      "rag_resources": {
                          "rag_corpus": rag_corpus_resource
                      },
                      "similarity_top_k": 10
                  }
              }
          }
      },
  )

Embedding models

Embedding models are used to create a corpus and for search and retrieval during response generation. This section lists the supported embedding models.

  • textembedding-gecko@003
  • textembedding-gecko-multilingual@001
  • text-embedding-004 (default)
  • text-multilingual-embedding-002
  • textembedding-gecko@002 (fine-tuned versions only)
  • textembedding-gecko@001 (fine-tuned versions only)

For more information about tuning embedding models, see Tune text embeddings.

The following open embedding models are also supported. You can find them in Model Garden.

  • e5-base-v2
  • e5-large-v2
  • e5-small-v2
  • multilingual-e5-large
  • multilingual-e5-small
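
When you create a corpus, you can specify which supported embedding model it uses. A minimal sketch, assuming the vertexai.preview.rag preview SDK (text-embedding-004 is shown, which is also the default):

  from vertexai.preview import rag

  # Pin the corpus to a specific embedding model.
  embedding_model_config = rag.EmbeddingModelConfig(
      publisher_model="publishers/google/models/text-embedding-004"
  )
  rag_corpus = rag.create_corpus(
      display_name="my-corpus",
      embedding_model_config=embedding_model_config,
  )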

Document types supported for RAG

Only text documents are supported. The following table shows the file types and their file size limits:

File type                               | File size limit
Google documents                        | 10 MB when exported from Google Workspace
Google drawings                         | 10 MB when exported from Google Workspace
Google slides                           | 10 MB when exported from Google Workspace
HTML file                               | 10 MB
JSON file                               | 1 MB
Markdown file                           | 10 MB
Microsoft PowerPoint slides (PPTX file) | 10 MB
Microsoft Word documents (DOCX file)    | 50 MB
PDF file                                | 50 MB
Text file                               | 10 MB

Using RAG Engine with other document types is possible but can generate lower-quality responses.

Data sources supported for RAG

The following data sources are supported:

  • Upload a local file: A single-file upload using upload_file (up to 25 MB), which is a synchronous call.
  • Cloud Storage: Import files from Cloud Storage.
  • Google Drive: Import a directory from Google Drive.

    The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message displays. For more information on file size limits, see Supported document types.

    To authenticate and grant permissions, do the following:

    1. Go to the IAM page of your Google Cloud project.
    2. Select Include Google-provided role grants.
    3. Search for the Vertex AI RAG Data Service Agent service account.
    4. Click Share on your Google Drive folder or file, and share it with the service account.
    5. Grant the service account Viewer permission on the folder or file. The Google Drive resource ID can be found in the web URL.
  • Slack: Import files from Slack by using a data connector.

  • Jira: Import files from Jira by using a data connector.
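
The following sketch shows one way to bring data in from a local file and from Cloud Storage or Google Drive, assuming the vertexai.preview.rag preview SDK and an existing corpus (rag_corpus); the paths and IDs are placeholders.

  from vertexai.preview import rag

  # Upload a single local file (up to 25 MB); this call is synchronous.
  rag_file = rag.upload_file(
      corpus_name=rag_corpus.name,
      path="./my-doc.pdf",
      display_name="my-doc.pdf",
  )

  # Import files from Cloud Storage or a Google Drive folder.
  rag.import_files(
      rag_corpus.name,
      [
          "gs://BUCKET_NAME/my-docs/",
          "https://drive.google.com/drive/folders/FOLDER_ID",
      ],
  )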

For more information, see the RAG API reference.

Fine-tune your RAG transformations

After a document is ingested, RAG Engine runs a set of transformations to prepare the data for indexing. You can tune these transformations for your use case using the following parameters:

Parameter     | Description
chunk_size    | When documents are ingested into an index, they are split into chunks. The chunk_size parameter specifies the size of the chunk, in tokens. The default chunk size is 1,024 tokens.
chunk_overlap | By default, documents are split into chunks with a certain amount of overlap to improve relevance and retrieval quality. The default chunk overlap is 200 tokens.

A smaller chunk size produces more precise embeddings. A larger chunk size means that the embeddings might be more general but can miss specific details.

For example, if you convert 200 words as opposed to 1,000 words into an embedding array of the same dimension, you can lose details. Chunk size also interacts with the model's context length limit: a large chunk of text might not fit into a model with a small context window.
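
As a sketch, both parameters can be set when you import files, assuming the vertexai.preview.rag preview SDK and an existing corpus:

  from vertexai.preview import rag

  # Smaller chunks yield more precise embeddings; keep the overlap
  # roughly proportional (defaults are 1,024 and 200 tokens).
  rag.import_files(
      rag_corpus.name,
      ["gs://BUCKET_NAME/my-docs/"],
      chunk_size=512,
      chunk_overlap=100,
  )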

RAG quotas

For each service that performs retrieval-augmented generation (RAG) using RAG Engine, the following quotas apply, measured in requests per minute (RPM):

Service                         | Quota     | Metric
RAG Engine data management APIs | 60 RPM    | VertexRagDataService requests per minute per region
RetrieveContexts API            | 1,500 RPM | VertexRagService retrieve requests per minute per region
base_model: textembedding-gecko | 1,500 RPM | Online prediction requests per base model per minute per region

An additional filter that you can specify is base_model: textembedding-gecko.

The following limits apply:

Service                                            | Limit  | Metric
Concurrent ImportRagFiles requests                 | 3      | VertexRagService concurrent import requests per region
Maximum number of files per ImportRagFiles request | 10,000 | VertexRagService import rag files requests per region

For more rate limits and quotas, see Generative AI on Vertex AI rate limits.

Retrieval parameters

The following table includes the retrieval parameters:

Parameter                 | Description
similarity_top_k          | Controls the maximum number of contexts that are retrieved.
vector_distance_threshold | Only contexts with a distance smaller than the threshold are considered.
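
For example, both parameters can be passed to a direct retrieval query. A minimal sketch, assuming the vertexai.preview.rag preview SDK and an existing corpus:

  from vertexai.preview import rag

  # Retrieve up to 10 contexts, keeping only those closer than the threshold.
  response = rag.retrieval_query(
      rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
      text="What does my data say about X?",
      similarity_top_k=10,
      vector_distance_threshold=0.5,
  )
  print(response)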

Manage your RAG knowledge base (corpus)

This section describes how you can manage your corpus for RAG tasks by performing index management and file management.

Corpus management

A corpus, also referred to as an index, is a collection of documents or source of information. The corpus can be queried to retrieve relevant contexts for response generation. When creating a corpus for the first time, the process might take an additional minute.

The following corpus operations are supported:

Operation           | Description                                                       | Parameters        | Examples
Create a RAG corpus | Create an index to import or upload documents.                    | Create parameters | Create example
Update a RAG corpus | Update a previously created index to import or upload documents. | Update parameters | Update example
List RAG corpora    | List all of the indexes.                                          | List parameters   | List example
Get a RAG corpus    | Get the metadata describing the index.                            | Get parameters    | Get example
Delete a RAG corpus | Delete the index.                                                 | Delete parameters | Delete example

Concurrent operations on corpora aren't supported. For more information, see the RAG API reference.
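
A sketch of the list, get, and delete operations, assuming the vertexai.preview.rag preview SDK:

  from vertexai.preview import rag

  # List every corpus in the project and region.
  for corpus in rag.list_corpora():
      print(corpus.name, corpus.display_name)

  # Get the metadata for one corpus, then delete it.
  corpus = rag.get_corpus(name=rag_corpus.name)
  rag.delete_corpus(name=rag_corpus.name)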

File management

The following file operations are supported:

Operation         | Description                                                                                                                          | Parameters        | Examples
Upload a RAG file | Upload a file from local storage with additional information that provides context to the LLM to generate more accurate responses. | Upload parameters | Upload example
Import RAG files  | Import a set of files from another storage location, such as Cloud Storage or Google Drive, into the corpus.                        | Import parameters | Import example
Get a RAG file    | Get details about a RAG file for use by the LLM.                                                                                     | Get parameters    | Get example
Delete a RAG file | Delete a file from the corpus.                                                                                                       | Delete parameters | Delete example
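
A sketch of the list, get, and delete file operations, assuming the vertexai.preview.rag preview SDK; the file resource name is a placeholder:

  from vertexai.preview import rag

  # List the files in a corpus, then get and delete one of them.
  for f in rag.list_files(corpus_name=rag_corpus.name):
      print(f.name, f.display_name)

  rag_file = rag.get_file(name="RAG_FILE_RESOURCE_NAME")
  rag.delete_file(name=rag_file.name)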

For more information, see the RAG API reference.

What's next