LlamaIndex is a data framework for building context-augmented large language model (LLM) applications. Context augmentation means supplementing an LLM with your own data, which is the basis of retrieval-augmented generation (RAG).
A common problem with LLMs is that they don't understand private knowledge, that is, your organization's data. With LlamaIndex on Vertex AI for RAG, you can enrich the LLM's context with additional private information, so the model hallucinates less and answers questions more accurately.
Combining additional knowledge sources with the LLM's existing knowledge produces a better context, and the improved context, along with the query, enhances the quality of the LLM's response.
The following concepts are key to understanding LlamaIndex on Vertex AI. These concepts are listed in the order of the retrieval-augmented generation (RAG) process.
- Data ingestion: Intake of data from different data sources, such as local files, Cloud Storage, and Google Drive.
- Data transformation: Conversion of the data in preparation for indexing. For example, data is split into chunks.
- Embedding: Numerical representations of words or pieces of text. These numbers capture the semantic meaning and context of the text. Similar or related words or text tend to have similar embeddings, which means they are closer together in the high-dimensional vector space.
- Data indexing: LlamaIndex on Vertex AI for RAG creates an index called a corpus. The index structures the knowledge base so that it's optimized for searching. For example, the index is like a detailed table of contents for a massive reference book.
- Retrieval: When a user asks a question or provides a prompt, the retrieval component in LlamaIndex on Vertex AI for RAG searches through its knowledge base to find information that is relevant to the query.
- Generation: The retrieved information becomes the context that is added to the original user query as a guide for the generative AI model to generate factually grounded and relevant responses.
This page gets you started with using LlamaIndex on Vertex AI for RAG and provides Python samples to demonstrate how to use the RAG API.
For information about the file size limits, see Supported document types. For information about quotas related to LlamaIndex on Vertex AI for RAG, see LlamaIndex on Vertex AI for RAG quotas. For information about customizing parameters, see Retrieval parameters.
Run LlamaIndex on Vertex AI for RAG using the Vertex AI SDK
To use LlamaIndex on Vertex AI for RAG, do the following:
Run this command in the Google Cloud console to set up your project. Replace PROJECT_ID with the ID of your Google Cloud project.
gcloud config set project PROJECT_ID
Run this command to authorize your login.
gcloud auth application-default login
Copy and paste this sample code into the Google Cloud console to run LlamaIndex on Vertex AI.
Python
To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
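The following is a minimal end-to-end sketch using the Vertex AI SDK for Python's preview `rag` module: it creates a corpus, imports files, runs a retrieval query, and generates a grounded response with Gemini. Module paths and parameter names reflect the preview API and may change; the project ID, corpus name, and file paths are placeholders.

```python
import vertexai
from vertexai.preview import rag
from vertexai.preview.generative_models import GenerativeModel, Tool

# TODO(developer): Update these placeholder values before running.
project_id = "PROJECT_ID"
display_name = "test_corpus"
# Cloud Storage and Google Drive paths are supported.
paths = ["gs://my_bucket/my_files_dir"]

# Initialize the Vertex AI SDK once per session.
vertexai.init(project=project_id, location="us-central1")

# Create a RagCorpus (the index).
rag_corpus = rag.create_corpus(display_name=display_name)

# Import files into the corpus.
rag.import_files(
    rag_corpus.name,
    paths,
    chunk_size=512,  # Optional
    chunk_overlap=100,  # Optional
)

# Direct context retrieval.
response = rag.retrieval_query(
    rag_corpora=[rag_corpus.name],
    text="What is RAG and why is it helpful?",
    similarity_top_k=10,  # Optional
)
print(response)

# Enhanced generation: wrap the corpus in a retrieval tool.
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_corpora=[rag_corpus.name],
            similarity_top_k=3,  # Optional
            vector_distance_threshold=0.5,  # Optional
        ),
    )
)

# Create a Gemini model instance that uses the retrieval tool.
rag_model = GenerativeModel(
    model_name="gemini-1.5-flash-001", tools=[rag_retrieval_tool]
)

# Generate a grounded response.
response = rag_model.generate_content("What is RAG and why is it helpful?")
print(response.text)
```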
Supported generation models
The following models and their versions support LlamaIndex on Vertex AI for RAG:
| Model | Version |
| --- | --- |
| Gemini 1.5 Flash | `gemini-1.5-flash-001` |
| Gemini 1.5 Pro | `gemini-1.5-pro-001` |
| Gemini 1.0 Pro | `gemini-1.0-pro-001`, `gemini-1.0-pro-002` |
| Gemini 1.0 Pro Vision | `gemini-1.0-pro-vision-001` |
| Gemini | `gemini-experimental` |
Supported embedding models
The following Google model versions are supported:
- `textembedding-gecko@003`
- `textembedding-gecko-multilingual@001`
- `text-embedding-004`
- `text-multilingual-embedding-002`
The following fine-tuned Google model versions are supported:
- `textembedding-gecko@003`
- `textembedding-gecko-multilingual@001`
- `text-embedding-004`
- `text-multilingual-embedding-002`
- `textembedding-gecko@002`
- `textembedding-gecko@001`
If no configuration is specified, the default behavior is to use `text-embedding-004` as the embedding model for the `RagCorpus`. For more information about tuning embedding models, see Tune text embeddings.
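As a minimal sketch, assuming the preview SDK's `EmbeddingModelConfig` helper, you can choose a different embedding model when you create the corpus:

```python
from vertexai.preview import rag

# Choose an embedding model for the corpus. If this configuration
# is omitted, text-embedding-004 is used by default.
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-multilingual-embedding-002"
)

rag_corpus = rag.create_corpus(
    display_name="multilingual_corpus",
    embedding_model_config=embedding_model_config,
)
```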
Supported document types
Text-only documents are supported, which include the following file types with their file size limits:
| File type | File size limit |
| --- | --- |
| Google documents | 10 MB when exported from Google Workspace |
| Google drawings | 10 MB when exported from Google Workspace |
| Google slides | 10 MB when exported from Google Workspace |
| HTML file | 10 MB |
| JSON file | 1 MB |
| Markdown file | 10 MB |
| Microsoft PowerPoint slides (PPTX file) | 10 MB |
| Microsoft Word documents (DOCX file) | 10 MB |
| PDF file | 50 MB |
| Text file | 10 MB |
Using LlamaIndex on Vertex AI for RAG with other document types is possible but can generate lower-quality responses.
Supported data sources
The following three data sources are supported:
- A single-file upload using `upload_file` (up to 25 MB), which is a synchronous call.
- Import files from Cloud Storage.
- Import a directory from Google Drive.
The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message displays. For more information on file size limits, see Supported document types.
To authenticate and grant permissions, do the following:
- Go to the IAM page of your Google Cloud project.
- Select the Include Google-provided role grants checkbox.
- Find the Vertex AI RAG Data Service Agent service account.
- Click Share on the Drive folder, and share with the service account.
- Grant the Viewer permission to the service account on your Google Drive folder or file. The Google Drive resource ID can be found in the web URL.
For more information, see the RAG API reference.
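A sketch of the three ingestion paths, assuming the preview SDK's `upload_file` and `import_files` helpers; the corpus, local path, bucket, and Drive folder ID are placeholders:

```python
from vertexai.preview import rag

corpus_name = rag_corpus.name  # From a previously created corpus.

# 1. Synchronous single-file upload from local storage (up to 25 MB).
rag_file = rag.upload_file(
    corpus_name=corpus_name,
    path="./my_document.pdf",
    display_name="my_document.pdf",
)

# 2. and 3. Import from Cloud Storage or a Google Drive directory.
rag.import_files(
    corpus_name,
    paths=[
        "gs://my_bucket/my_files_dir",  # Cloud Storage
        "https://drive.google.com/drive/folders/FOLDER_ID",  # Google Drive
    ],
)
```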
Supported data transformations
After a document is ingested, LlamaIndex on Vertex AI for RAG runs a set of transformations to prepare the data for indexing. You can control the following parameters for your use case:
| Parameter | Description |
| --- | --- |
| `chunk_size` | When documents are ingested into an index, they are split into chunks. The `chunk_size` parameter specifies the size of the chunk in tokens. The default chunk size is 1,024 tokens. |
| `chunk_overlap` | By default, documents are split into chunks with a certain amount of overlap to improve relevance and retrieval quality. The default chunk overlap is 200 tokens. |
A smaller chunk size produces more precise embeddings. A larger chunk size produces embeddings that might be more general but can miss specific details. For example, converting 200 words instead of 1,000 words into an embedding array of the same dimension can lose detail. Chunk size also interacts with the model's context length limit: a large chunk might not fit into a model with a small context window.
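For example, a sketch of overriding both defaults at import time, using the preview `import_files` call shown earlier:

```python
# Split documents into 512-token chunks with a 100-token overlap
# instead of the 1,024-token / 200-token defaults.
rag.import_files(
    rag_corpus.name,
    paths=["gs://my_bucket/my_files_dir"],
    chunk_size=512,
    chunk_overlap=100,
)
```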
Retrieval parameters
The following table includes the retrieval parameters:
| Parameter | Description |
| --- | --- |
| `similarity_top_k` | Controls the maximum number of contexts that are retrieved. |
| `vector_distance_threshold` | Only contexts with a distance smaller than the threshold are considered. |
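A sketch of setting both parameters on a direct retrieval query, assuming the preview `retrieval_query` signature used earlier:

```python
response = rag.retrieval_query(
    rag_corpora=[rag_corpus.name],
    text="What is RAG and why is it helpful?",
    similarity_top_k=5,  # Return at most 5 contexts.
    vector_distance_threshold=0.5,  # Skip contexts with distance >= 0.5.
)
print(response)
```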
Index management
A corpus, also referred to as an index, is a collection of documents or a source of information. The index can be queried to retrieve relevant contexts for LLM generation. When you create an index for the first time in a Google Cloud project, the process might take an additional minute; subsequent index creations in the same project take less time.
The following index operations are supported:
- Create a corpus: Create an index to import or upload documents.
- Delete a corpus: Delete the index.
- Fetch details of a corpus: Get the metadata describing the index.
- List corpora in a given project: List multiple indexes in a given project.
Concurrent operations on corpora aren't supported. For more information, see the RAG API reference.
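A sketch of the four corpus operations, assuming the preview SDK helpers `create_corpus`, `get_corpus`, `list_corpora`, and `delete_corpus`:

```python
from vertexai.preview import rag

# Create a corpus (index) to import or upload documents into.
corpus = rag.create_corpus(display_name="my_corpus")

# Fetch the metadata describing the index.
corpus = rag.get_corpus(name=corpus.name)

# List the indexes in the current project.
for c in rag.list_corpora():
    print(c.name, c.display_name)

# Delete the index.
rag.delete_corpus(name=corpus.name)
```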
File management
The following file operations are supported:
- Upload a file from local storage into a corpus: Upload a file along with additional information that provides context to the LLM for generating more accurate responses.
- Import files from other storage into a corpus: Import a set of files from a storage location, such as Cloud Storage or Google Drive.
- Fetch details of a file: Get the metadata describing a file in the index.
- List files in a corpus: Generate a list of the files in the index.
- Delete a file from a corpus: Remove a file from the index.
For more information, see the RAG API reference.
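A sketch of the file operations, assuming the preview SDK helpers `upload_file`, `import_files`, `get_file`, `list_files`, and `delete_file`:

```python
# Upload a local file into the corpus (synchronous, up to 25 MB).
# The description provides additional context to the LLM.
rag_file = rag.upload_file(
    corpus_name=corpus.name,
    path="./my_document.pdf",
    display_name="my_document.pdf",
    description="Quarterly report",
)

# Import files from other storage, such as Cloud Storage.
rag.import_files(corpus.name, paths=["gs://my_bucket/my_files_dir"])

# Fetch details of a file.
rag_file = rag.get_file(name=rag_file.name)

# List files in the corpus.
for f in rag.list_files(corpus_name=corpus.name):
    print(f.name, f.display_name)

# Delete a file from the corpus.
rag.delete_file(name=rag_file.name)
```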
What's next
- Learn more about generative AI on Vertex AI.
- Learn about grounding and how it relates to LlamaIndex on Vertex AI for RAG.
- Learn more about the Retrieval-Augmented Generation API.
- Learn more about the RAG architecture. See Infrastructure for a RAG-capable generative AI application using Vertex AI.