Formerly called LlamaIndex on Vertex AI and, most recently, Knowledge Engine, RAG Engine is a data framework for developing context-augmented large language model (LLM) applications. Context augmentation occurs when you apply an LLM to your data. This implements retrieval-augmented generation (RAG).
A common problem with LLMs is that they don't understand private knowledge, that is, your organization's data. With RAG Engine, you can enrich the LLM context with additional private information, so that the model can reduce hallucinations and answer questions more accurately.
By combining additional knowledge sources with the existing knowledge that LLMs have, a better context is provided. The improved context along with the query enhances the quality of the LLM's response.
The following concepts are key to understanding RAG Engine. These concepts are listed in the order of the retrieval-augmented generation (RAG) process.
- Data ingestion: Intake data from different data sources. For example, local files, Cloud Storage, and Google Drive.
- Data transformation: Conversion of the data in preparation for indexing. For example, data is split into chunks.
- Embedding: Numerical representations of words or pieces of text. These numbers capture the semantic meaning and context of the text. Similar or related words or text tend to have similar embeddings, which means they are closer together in the high-dimensional vector space.
- Data indexing: RAG Engine creates an index called a corpus. The index structures the knowledge base so it's optimized for searching. For example, the index is like a detailed table of contents for a massive reference book.
- Retrieval: When a user asks a question or provides a prompt, the retrieval component in RAG Engine searches through its knowledge base to find information that is relevant to the query.
- Generation: The retrieved information becomes the context added to the original user query as a guide for the generative AI model to generate factually grounded and relevant responses.
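To make these steps concrete, the following minimal sketch uses the Vertex AI SDK for Python (the `vertexai.preview.rag` module) to create a corpus, import files, and retrieve contexts. The project ID, region, and Cloud Storage path are placeholders, and the exact module path and parameter names can differ between SDK versions.

```python
import vertexai
from vertexai.preview import rag

# Placeholder values -- replace with your own project, region, and data paths.
vertexai.init(project="PROJECT_ID", location="REGION")

# Data indexing: create a corpus, the index that backs the knowledge base.
rag_corpus = rag.create_corpus(display_name="my-rag-corpus")

# Data ingestion, transformation, and embedding: import and chunk documents.
rag.import_files(
    rag_corpus.name,
    paths=["gs://your-bucket/your-docs/"],  # Cloud Storage or Google Drive paths
)

# Retrieval: fetch the contexts that are most relevant to a query.
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
    text="your-query",
    similarity_top_k=10,
)
print(response)
```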
Generative AI models that support RAG
This section lists models that support RAG.
Gemini models
The following table lists the Gemini models and their versions that support RAG Engine:
Model | Version |
---|---|
Gemini 1.5 Flash | gemini-1.5-flash-002, gemini-1.5-flash-001 |
Gemini 1.5 Pro | gemini-1.5-pro-002, gemini-1.5-pro-001 |
Gemini 1.0 Pro | gemini-1.0-pro-001, gemini-1.0-pro-002 |
Gemini 1.0 Pro Vision | gemini-1.0-pro-vision-001 |
Gemini | gemini-experimental |
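The `rag_retrieval_tool` used in the following samples wraps a RAG corpus as a tool the model can call. A minimal sketch, assuming a corpus (`rag_corpus`) that was created earlier and placeholder parameter values, looks like this:

```python
from vertexai.preview import rag
from vertexai.preview.generative_models import GenerativeModel, Tool

# Wrap an existing corpus in a retrieval tool (rag_corpus is assumed to exist).
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
            similarity_top_k=10,
        ),
    )
)

# Ground a supported Gemini version on the corpus.
rag_model = GenerativeModel("gemini-1.5-flash-002", tools=[rag_retrieval_tool])
response = rag_model.generate_content("your-query")
print(response.text)
```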
Self-deployed models
RAG Engine supports all models in Model Garden.
Use RAG Engine with your self-deployed open model endpoints.
from vertexai.preview.generative_models import GenerativeModel

# Create a model instance with your self-deployed open model endpoint.
# rag_retrieval_tool is a Tool that wraps your RAG corpus, as shown earlier.
rag_model = GenerativeModel(
    "projects/PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID",
    tools=[rag_retrieval_tool],
)
Models with managed APIs on Vertex AI
The models with managed APIs on Vertex AI that support RAG Engine include the following:
- Llama 3.1
The following code sample demonstrates how to use the Gemini `GenerateContent` API to create a generative model instance. The model ID, `/publisher/meta/models/llama-3.1-405B-instruct-maas`, is found in the model card.
# Create a model instance with Llama 3.1 MaaS endpoint
rag_model = GenerativeModel(
    "projects/PROJECT_ID/locations/REGION/publisher/meta/models/llama-3.1-405B-instruct-maas",
    tools=[rag_retrieval_tool],
)
The following code sample demonstrates how to use the OpenAI-compatible `ChatCompletions` API to generate a model response.
# Generate a response with Llama 3.1 MaaS endpoint
response = client.chat.completions.create(
    model="meta/llama-3.1-405b-instruct-maas",
    messages=[{"role": "user", "content": "your-query"}],
    extra_body={
        "extra_body": {
            "google": {
                "vertex_rag_store": {
                    "rag_resources": {
                        "rag_corpus": rag_corpus_resource
                    },
                    "similarity_top_k": 10
                }
            }
        }
    },
)
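The `client` in the previous sample is assumed to be an OpenAI client pointed at the Vertex AI OpenAI-compatible endpoint. A minimal setup sketch, using Application Default Credentials and placeholder project and region values, might look like the following:

```python
import google.auth
import google.auth.transport.requests
import openai

# Obtain an access token from Application Default Credentials.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# Point the OpenAI client at the Vertex AI OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url=(
        "https://REGION-aiplatform.googleapis.com/v1beta1/"
        "projects/PROJECT_ID/locations/REGION/endpoints/openapi"
    ),
    api_key=credentials.token,
)
```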
Embedding models
Embedding models are used to create a corpus and for search and retrieval during response generation. This section lists the supported embedding models.
- textembedding-gecko@003
- textembedding-gecko-multilingual@001
- text-embedding-004 (default)
- text-multilingual-embedding-002
- textembedding-gecko@002 (fine-tuned versions only)
- textembedding-gecko@001 (fine-tuned versions only)
For more information about tuning embedding models, see Tune text embeddings.
The following open embedding models are also supported. You can find them in Model Garden.
- e5-base-v2
- e5-large-v2
- e5-small-v2
- multilingual-e5-large
- multilingual-e5-small
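The embedding model is chosen when a corpus is created. As a minimal sketch, assuming the `vertexai.preview.rag` module and a placeholder display name:

```python
from vertexai.preview import rag

# Choose the embedding model for the corpus at creation time.
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-004"
)

rag_corpus = rag.create_corpus(
    display_name="my-rag-corpus",
    embedding_model_config=embedding_model_config,
)
```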
Document types supported for RAG
Only text documents are supported. The following table shows the file types and their file size limits:
File type | File size limit |
---|---|
Google documents | 10 MB when exported from Google Workspace |
Google drawings | 10 MB when exported from Google Workspace |
Google slides | 10 MB when exported from Google Workspace |
HTML file | 10 MB |
JSON file | 1 MB |
Markdown file | 10 MB |
Microsoft PowerPoint slides (PPTX file) | 10 MB |
Microsoft Word documents (DOCX file) | 50 MB |
PDF file | 50 MB |
Text file | 10 MB |
Using RAG Engine with other document types is possible but can generate lower-quality responses.
Data sources supported for RAG
The following data sources are supported:
- Upload a local file: A single-file upload using `upload_file` (up to 25 MB), which is a synchronous call.
- Cloud Storage: Import file(s) from Cloud Storage.
- Google Drive: Import a directory from Google Drive.

  The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message displays. For more information on file size limits, see Supported document types.

  To authenticate and grant permissions, do the following:

  1. Go to the IAM page of your Google Cloud project.
  2. Select Include Google-provided role grant.
  3. Search for the Vertex AI RAG Data Service Agent service account.
  4. Click Share on the Drive folder, and share with the service account.
  5. Grant Viewer permission to the service account on your Google Drive folder or file. The Google Drive resource ID can be found in the web URL.

- Slack: Import files from Slack by using a data connector.
- Jira: Import files from Jira by using a data connector.
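As a sketch of how these sources are used from the Vertex AI SDK (the corpus, bucket, and Drive folder here are placeholders, and parameter names can vary by SDK version), a local file can be uploaded synchronously and Cloud Storage or Google Drive paths can be imported in bulk:

```python
from vertexai.preview import rag

# Synchronous single-file upload from local storage (up to 25 MB).
rag_file = rag.upload_file(
    corpus_name=rag_corpus.name,
    path="./local-file.pdf",
    display_name="local-file.pdf",
    description="Uploaded from local storage",
)

# Bulk import from Cloud Storage and Google Drive (placeholder paths).
rag.import_files(
    rag_corpus.name,
    paths=[
        "gs://your-bucket/your-docs/",
        "https://drive.google.com/drive/folders/your-folder-id",
    ],
)
```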
For more information, see the RAG API reference.
Fine-tune your RAG transformations
After a document is ingested, RAG Engine runs a set of transformations to prepare the data for indexing. You can tune these transformations for your use case by using the following parameters:
Parameter | Description |
---|---|
chunk_size | When documents are ingested into an index, they are split into chunks. The chunk_size parameter (in tokens) specifies the size of the chunk. The default chunk size is 1,024 tokens. |
chunk_overlap | By default, documents are split into chunks with a certain amount of overlap to improve relevance and retrieval quality. The default chunk overlap is 200 tokens. |
A smaller chunk size means the embeddings are more precise. A larger chunk size means that the embeddings might be more general but can miss specific details.
For example, if you convert 1,000 words, as opposed to 200 words, into an embedding array of the same dimension, you can lose details. Chunk size is also worth weighing against the model's context length limit: a large chunk of text might not fit into a model with a small context window.
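As a minimal sketch, assuming the `vertexai.preview.rag` module and placeholder paths, these parameters are passed at import time:

```python
from vertexai.preview import rag

# Control chunking at import time: smaller chunks give more precise
# embeddings, larger chunks keep more surrounding context per chunk.
rag.import_files(
    rag_corpus.name,
    paths=["gs://your-bucket/your-docs/"],
    chunk_size=512,      # tokens per chunk (default is 1,024)
    chunk_overlap=100,   # overlapping tokens between chunks (default is 200)
)
```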
RAG quotas
For each service to perform retrieval-augmented generation (RAG) using RAG Engine, the following quotas apply, with the quota measured as requests per minute (RPM).
Service | Quota | Metric |
---|---|---|
RAG Engine data management APIs | 60 RPM | VertexRagDataService requests per minute per region |
RetrievalContexts API | 1,500 RPM | VertexRagService retrieve requests per minute per region |
base_model: textembedding-gecko | 1,500 RPM | Online prediction requests per base model per minute per region per base_model. An additional filter for you to specify is base_model: textembedding-gecko |

The following limits also apply:

Service | Limit | Metric |
---|---|---|
Concurrent ImportRagFiles requests | 3 RPM | VertexRagService concurrent import requests per region |
Maximum number of files per ImportRagFiles request | 10,000 | VertexRagService import rag files requests per region |
For more rate limits and quotas, see Generative AI on Vertex AI rate limits.
Retrieval parameters
The following table includes the retrieval parameters:
Parameter | Description |
---|---|
similarity_top_k | Controls the maximum number of contexts that are retrieved. |
vector_distance_threshold | Only contexts with a distance smaller than the threshold are considered. |
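As a minimal sketch, assuming the `vertexai.preview.rag` module and a corpus created earlier, both parameters can be set on a direct retrieval query:

```python
from vertexai.preview import rag

# Direct retrieval with both retrieval parameters set (placeholder values).
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
    text="your-query",
    similarity_top_k=10,            # return at most 10 contexts
    vector_distance_threshold=0.5,  # drop contexts with distance >= 0.5
)
print(response)
```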
Manage your RAG knowledge base (corpus)
This section describes how you can manage your corpus for RAG tasks by performing index management and file management.
Corpus management
A corpus, also referred to as an index, is a collection of documents or source of information. The corpus can be queried to retrieve relevant contexts for response generation. When creating a corpus for the first time, the process might take an additional minute.
The following corpus operations are supported:
Operation | Description | Parameters | Examples |
---|---|---|---|
Create a RAG corpus. | Create an index to import or upload documents. | Create parameters | Create example |
Update a RAG corpus. | Update a previously created index to import or upload documents. | Update parameters | Update example |
List RAG corpora. | List all of the indexes. | List parameters | List example |
Get a RAG corpus. | Get the metadata describing the index. | Get parameters | Get example |
Delete a RAG corpus. | Delete the index. | Delete parameters | Delete example |
Concurrent operations on corpora aren't supported. For more information, see the RAG API reference.
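A minimal sketch of these operations with the Vertex AI SDK, assuming the `vertexai.preview.rag` module (names are placeholders):

```python
from vertexai.preview import rag

# Create an index (corpus) to import or upload documents into.
rag_corpus = rag.create_corpus(display_name="my-rag-corpus")

# List all of the indexes in the project and region.
for corpus in rag.list_corpora():
    print(corpus.name)

# Get the metadata that describes one index.
corpus = rag.get_corpus(name=rag_corpus.name)

# Delete the index.
rag.delete_corpus(name=rag_corpus.name)
```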
File management
The following file operations are supported:
Operation | Description | Parameters | Examples |
---|---|---|---|
Upload a RAG file. | Upload a file from local storage with additional information that provides context to the LLM to generate more accurate responses. | Upload parameters | Upload example |
Import RAG files. | Import a set of files from another storage location, such as Cloud Storage or Google Drive, into an index. | Import parameters | Import example |
Get a RAG file. | Get details about a RAG file for use by the LLM. | Get parameters | Get example |
Delete a RAG file. | Delete a file from the index. | Delete parameters | Delete example |
For more information, see the RAG API reference.
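A minimal sketch of the read and delete operations with the Vertex AI SDK, assuming the `vertexai.preview.rag` module (the file resource name is a placeholder):

```python
from vertexai.preview import rag

# List the files that have been imported into a corpus.
for rag_file in rag.list_files(corpus_name=rag_corpus.name):
    print(rag_file.name)

# Get details about one file, then delete it (placeholder resource name).
rag_file_name = (
    "projects/PROJECT_ID/locations/REGION/"
    "ragCorpora/CORPUS_ID/ragFiles/FILE_ID"
)
rag_file = rag.get_file(name=rag_file_name)
rag.delete_file(name=rag_file_name)
```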
What's next
- To learn about the file size limits, see Supported document types.
- To learn about quotas related to RAG Engine, see RAG Engine quotas.
- To learn about customizing parameters, see Retrieval parameters.
- To learn more about the RAG API, see RAG Engine API.
- To learn more about grounding, see Grounding overview.
- To learn more about the difference between grounding and RAG, see Ground responses using RAG.
- To learn more about Generative AI on Vertex AI, see Overview of Generative AI on Vertex AI.
- To learn more about the RAG architecture, see Infrastructure for a RAG-capable generative AI application using Vertex AI.