Formerly called LlamaIndex on Vertex AI and, most recently, Knowledge Engine, RAG Engine is a data framework for developing context-augmented large language model (LLM) applications. Context augmentation occurs when you apply an LLM to your data. This implements retrieval-augmented generation (RAG).
A common problem with LLMs is that they don't understand private knowledge, that is, your organization's data. With RAG Engine, you can enrich the LLM context with additional private information, so that the model can reduce hallucinations and answer questions more accurately.
By combining additional knowledge sources with the existing knowledge that LLMs have, a better context is provided. The improved context along with the query enhances the quality of the LLM's response.
The following concepts are key to understanding RAG Engine. These concepts are listed in the order of the retrieval-augmented generation (RAG) process.
- Data ingestion: Intake data from different data sources. For example, local files, Cloud Storage, and Google Drive.
- Data transformation: Conversion of the data in preparation for indexing. For example, data is split into chunks.
- Embedding: Numerical representations of words or pieces of text. These numbers capture the semantic meaning and context of the text. Similar or related words or text tend to have similar embeddings, which means they are closer together in the high-dimensional vector space.
- Data indexing: RAG Engine creates an index called a corpus. The index structures the knowledge base so it's optimized for searching. For example, the index is like a detailed table of contents for a massive reference book.
- Retrieval: When a user asks a question or provides a prompt, the retrieval component in RAG Engine searches through its knowledge base to find information that is relevant to the query.
- Generation: The retrieved information becomes the context added to the original user query as a guide for the generative AI model to generate factually grounded and relevant responses.
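To make these steps concrete, the following minimal sketch uses the Vertex AI SDK for Python (the `vertexai.preview.rag` module) to create a corpus, import files, and retrieve contexts. The project ID, region, and Cloud Storage path are placeholders, and the exact module path and parameter names can differ between SDK versions.

```python
import vertexai
from vertexai.preview import rag

# Placeholder values -- replace with your own project, region, and data paths.
vertexai.init(project="PROJECT_ID", location="REGION")

# Data indexing: create a corpus, the index that backs the knowledge base.
rag_corpus = rag.create_corpus(display_name="my-rag-corpus")

# Data ingestion, transformation, and embedding: import and chunk documents.
rag.import_files(
    rag_corpus.name,
    paths=["gs://your-bucket/your-docs/"],  # Cloud Storage or Google Drive paths
)

# Retrieval: fetch the contexts that are most relevant to a query.
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
    text="your-query",
    similarity_top_k=10,
)
print(response)
```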
Generative AI models that support RAG
This section lists models that support RAG.
Gemini models
The following table lists the Gemini models and their versions that support RAG Engine:
Model | Version |
---|---|
Gemini 1.5 Flash | gemini-1.5-flash-002, gemini-1.5-flash-001 |
Gemini 1.5 Pro | gemini-1.5-pro-002, gemini-1.5-pro-001 |
Gemini 1.0 Pro | gemini-1.0-pro-001, gemini-1.0-pro-002 |
Gemini 1.0 Pro Vision | gemini-1.0-pro-vision-001 |
Gemini | gemini-experimental |
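The `rag_retrieval_tool` used in the following samples wraps a RAG corpus as a tool the model can call. A minimal sketch, assuming a corpus (`rag_corpus`) that was created earlier and placeholder parameter values, looks like this:

```python
from vertexai.preview import rag
from vertexai.preview.generative_models import GenerativeModel, Tool

# Wrap an existing corpus in a retrieval tool (rag_corpus is assumed to exist).
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
            similarity_top_k=10,
        ),
    )
)

# Ground a supported Gemini version on the corpus.
rag_model = GenerativeModel("gemini-1.5-flash-002", tools=[rag_retrieval_tool])
response = rag_model.generate_content("your-query")
print(response.text)
```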
Self-deployed models
RAG Engine supports all models in Model Garden.
Use RAG Engine with your self-deployed open model endpoints.
from vertexai.preview.generative_models import GenerativeModel

# Create a model instance with your self-deployed open model endpoint.
# rag_retrieval_tool is a Tool that wraps your RAG corpus, as shown earlier.
rag_model = GenerativeModel(
    "projects/PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID",
    tools=[rag_retrieval_tool],
)
Models with managed APIs on Vertex AI
The models with managed APIs on Vertex AI that support RAG Engine include the following:
- Llama 3.1
The following code sample demonstrates how to use the Gemini `GenerateContent` API to create a generative model instance. The model ID, `/publisher/meta/models/llama-3.1-405B-instruct-maas`, is found in the model card.
# Create a model instance with Llama 3.1 MaaS endpoint
rag_model = GenerativeModel(
    "projects/PROJECT_ID/locations/REGION/publisher/meta/models/llama-3.1-405B-instruct-maas",
    tools=[rag_retrieval_tool],
)
The following code sample demonstrates how to use the OpenAI-compatible `ChatCompletions` API to generate a model response.
# Generate a response with Llama 3.1 MaaS endpoint
response = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct-maas",
    messages=[{"role": "user", "content": "your-query"}],
    extra_body={
        "extra_body": {
            "google": {
                "vertex_rag_store": {
                    "rag_resources": {
                        "rag_corpus": rag_corpus_resource
                    },
                    "similarity_top_k": 10
                }
            }
        }
    },
)
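The `client` in the previous sample is assumed to be an OpenAI client pointed at the Vertex AI OpenAI-compatible endpoint. A minimal setup sketch, using Application Default Credentials and placeholder project and region values, might look like the following:

```python
import google.auth
import google.auth.transport.requests
import openai

# Obtain an access token from Application Default Credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# Point the OpenAI client at the Vertex AI OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url=(
        "https://REGION-aiplatform.googleapis.com/v1beta1/"
        "projects/PROJECT_ID/locations/REGION/endpoints/openapi"
    ),
    api_key=credentials.token,
)
```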
Embedding models
Embedding models are used to create a corpus and for search and retrieval during response generation. This section lists the supported embedding models.
- textembedding-gecko@003
- textembedding-gecko-multilingual@001
- text-embedding-004 (default)
- text-multilingual-embedding-002
- textembedding-gecko@002 (fine-tuned versions only)
- textembedding-gecko@001 (fine-tuned versions only)
For more information about tuning embedding models, see Tune text embeddings.
The following open embedding models are also supported. You can find them in Model Garden.
- e5-base-v2
- e5-large-v2
- e5-small-v2
- multilingual-e5-large
- multilingual-e5-small
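The embedding model is chosen when a corpus is created. As a minimal sketch, assuming the `vertexai.preview.rag` module and a placeholder display name:

```python
from vertexai.preview import rag

# Choose the embedding model for the corpus at creation time.
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-004"
)

rag_corpus = rag.create_corpus(
    display_name="my-rag-corpus",
    embedding_model_config=embedding_model_config,
)
```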
Document types supported for RAG
Only text documents are supported. The following table shows the file types and their file size limits:
File type | File size limit |
---|---|
Google documents | 10 MB when exported from Google Workspace |
Google drawings | 10 MB when exported from Google Workspace |
Google slides | 10 MB when exported from Google Workspace |
HTML file | 10 MB |
JSON file | 1 MB |
Markdown file | 10 MB |
Microsoft PowerPoint slides (PPTX file) | 10 MB |
Microsoft Word documents (DOCX file) | 50 MB |
PDF file | 50 MB |
Text file | 10 MB |
Using RAG Engine with other document types is possible but can generate lower-quality responses.
Data sources supported for RAG
The following data sources are supported:
- Upload a local file: A single-file upload using `upload_file` (up to 25 MB), which is a synchronous call.
- Cloud Storage: Import file(s) from Cloud Storage.
- Google Drive: Import a directory from Google Drive.

  The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message displays. For more information on file size limits, see Supported document types.

  To authenticate and grant permissions, do the following:

  1. Go to the IAM page of your Google Cloud project.
  2. Select Include Google-provided role grant.
  3. Search for the Vertex AI RAG Data Service Agent service account.
  4. Click Share on the Drive folder, and share with the service account.
  5. Grant Viewer permission to the service account on your Google Drive folder or file. The Google Drive resource ID can be found in the web URL.

- Slack: Import files from Slack by using a data connector.
- Jira: Import files from Jira by using a data connector.
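As a sketch of how these sources are used from the Vertex AI SDK (the corpus, bucket, and Drive folder here are placeholders, and parameter names can vary by SDK version), a local file can be uploaded synchronously and Cloud Storage or Google Drive paths can be imported in bulk:

```python
from vertexai.preview import rag

# Synchronous single-file upload from local storage (up to 25 MB).
rag_file = rag.upload_file(
    corpus_name=rag_corpus.name,
    path="./local-file.pdf",
    display_name="local-file.pdf",
    description="Uploaded from local storage",
)

# Bulk import from Cloud Storage and Google Drive (placeholder paths).
rag.import_files(
    rag_corpus.name,
    paths=[
        "gs://your-bucket/your-docs/",
        "https://drive.google.com/drive/folders/your-folder-id",
    ],
)
```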
For more information, see the RAG API reference.
Fine-tune your RAG transformations
After a document is ingested, RAG Engine runs a set of transformations to prepare the data for indexing. You can tune these transformations for your use case by using the following parameters:
Parameter | Description |
---|---|
chunk_size | When documents are ingested into an index, they are split into chunks. The chunk_size parameter (in tokens) specifies the size of the chunk. The default chunk size is 1,024 tokens. |
chunk_overlap | By default, documents are split into chunks with a certain amount of overlap to improve relevance and retrieval quality. The default chunk overlap is 200 tokens. |
A smaller chunk size means the embeddings are more precise. A larger chunk size means that the embeddings might be more general but can miss specific details.
For example, if you convert 1,000 words, as opposed to 200 words, into an embedding array of the same dimension, you can lose details. Chunk size is also worth weighing against the model's context length limit: a large chunk of text might not fit into a model with a small context window.
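As a minimal sketch, assuming the `vertexai.preview.rag` module and placeholder paths, these parameters are passed at import time:

```python
from vertexai.preview import rag

# Control chunking at import time: smaller chunks give more precise
# embeddings, larger chunks keep more surrounding context per chunk.
rag.import_files(
    rag_corpus.name,
    paths=["gs://your-bucket/your-docs/"],
    chunk_size=512,      # tokens per chunk (default is 1,024)
    chunk_overlap=100,   # overlapping tokens between chunks (default is 200)
)
```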
RAG quotas
For each service to perform retrieval-augmented generation (RAG) using RAG Engine, the following quotas apply, with the quota measured as requests per minute (RPM).
Service | Quota | Metric |
---|---|---|
RAG Engine data management APIs | 60 RPM | VertexRagDataService requests per minute per region |
RetrievalContexts API | 1,500 RPM | VertexRagService retrieve requests per minute per region |
base_model: textembedding-gecko | 1,500 RPM | Online prediction requests per base model per minute per region per base_model. An additional filter for you to specify is base_model: textembedding-gecko |

The following limits also apply:

Service | Limit | Metric |
---|---|---|
Concurrent ImportRagFiles requests | 3 RPM | VertexRagService concurrent import requests per region |
Maximum number of files per ImportRagFiles request | 10,000 | VertexRagService import rag files requests per region |
For more rate limits and quotas, see Generative AI on Vertex AI rate limits.
Retrieval parameters
The following table includes the retrieval parameters:
Parameter | Description |
---|---|
similarity_top_k | Controls the maximum number of contexts that are retrieved. |
vector_distance_threshold | Only contexts with a distance smaller than the threshold are considered. |
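As a minimal sketch, assuming the `vertexai.preview.rag` module and a corpus created earlier, both parameters can be set on a direct retrieval query:

```python
from vertexai.preview import rag

# Direct retrieval with both retrieval parameters set (placeholder values).
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
    text="your-query",
    similarity_top_k=10,            # return at most 10 contexts
    vector_distance_threshold=0.5,  # drop contexts with distance >= 0.5
)
print(response)
```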
Manage your RAG knowledge base (corpus)
This section describes how you can manage your corpus for RAG tasks by performing index management and file management.
Corpus management
A corpus, also referred to as an index, is a collection of documents or source of information. The corpus can be queried to retrieve relevant contexts for response generation. When creating a corpus for the first time, the process might take an additional minute.
The following corpus operations are supported:
Operation | Description | Parameters | Examples |
---|---|---|---|
Create a RAG corpus. | Create an index to import or upload documents. | Create parameters | Create example |
Update a RAG corpus. | Update a previously created index to import or upload documents. | Update parameters | Update example |
List RAG corpora. | List all of the indexes. | List parameters | List example |
Get a RAG corpus. | Get the metadata describing the index. | Get parameters | Get example |
Delete a RAG corpus. | Delete the index. | Delete parameters | Delete example |
Concurrent operations on corpora aren't supported. For more information, see the RAG API reference.
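A minimal sketch of these operations with the Vertex AI SDK, assuming the `vertexai.preview.rag` module (names are placeholders):

```python
from vertexai.preview import rag

# Create an index (corpus) to import or upload documents into.
rag_corpus = rag.create_corpus(display_name="my-rag-corpus")

# List all of the indexes in the project and region.
for corpus in rag.list_corpora():
    print(corpus.name)

# Get the metadata that describes one index.
corpus = rag.get_corpus(name=rag_corpus.name)

# Delete the index.
rag.delete_corpus(name=rag_corpus.name)
```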
File management
The following file operations are supported:
Operation | Description | Parameters | Examples |
---|---|---|---|
Upload a RAG file. | Upload a file from local storage with additional information that provides context to the LLM to generate more accurate responses. | Upload parameters | Upload example |
Import RAG files. | Import a set of files from another storage location, such as Cloud Storage or Google Drive, into an index. | Import parameters | Import example |
Get a RAG file. | Get details about a RAG file for use by the LLM. | Get parameters | Get example |
Delete a RAG file. | Delete a file from the index. | Delete parameters | Delete example |
For more information, see the RAG API reference.
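A minimal sketch of the read and delete operations with the Vertex AI SDK, assuming the `vertexai.preview.rag` module (the file resource name is a placeholder):

```python
from vertexai.preview import rag

# List the files that have been imported into a corpus.
for rag_file in rag.list_files(corpus_name=rag_corpus.name):
    print(rag_file.name)

# Get details about one file, then delete it (placeholder resource name).
rag_file_name = (
    "projects/PROJECT_ID/locations/REGION/"
    "ragCorpora/CORPUS_ID/ragFiles/FILE_ID"
)
rag_file = rag.get_file(name=rag_file_name)
rag.delete_file(name=rag_file_name)
```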
What's next
- To learn about the file size limits, see Supported document types.
- To learn about quotas related to RAG Engine, see RAG Engine quotas.
- To learn about customizing parameters, see Retrieval parameters.
- To learn more about the RAG API, see RAG Engine API.
- To learn more about grounding, see Grounding overview.
- To learn more about the difference between grounding and RAG, see Ground responses using RAG.
- To learn more about Generative AI on Vertex AI, see Overview of Generative AI on Vertex AI.
- To learn more about the RAG architecture, see Infrastructure for a RAG-capable generative AI application using Vertex AI.