Overview of LlamaIndex on Vertex AI for RAG

LlamaIndex on Vertex AI for RAG is a data framework for developing context-augmented large language model (LLM) applications. Context augmentation occurs when you apply an LLM to your data. This implements retrieval-augmented generation (RAG).

A common problem with LLMs is that they don't understand private knowledge, that is, your organization's data. With LlamaIndex on Vertex AI for RAG, you can enrich the LLM context with additional private information, so that the model can reduce hallucinations and answer questions more accurately.

Combining additional knowledge sources with the LLM's existing knowledge provides better context. The improved context, along with the query, enhances the quality of the LLM's response.

The following concepts are key to understanding LlamaIndex on Vertex AI for RAG. These concepts are listed in the order of the retrieval-augmented generation (RAG) process.

  1. Data ingestion: Intake data from different data sources. For example, local files, Cloud Storage, and Google Drive.

  2. Data transformation: Conversion of the data in preparation for indexing. For example, data is split into chunks.

  3. Embedding: Numerical representations of words or pieces of text. These numbers capture the semantic meaning and context of the text. Similar or related words or text tend to have similar embeddings, which means they are closer together in the high-dimensional vector space.

  4. Data indexing: LlamaIndex on Vertex AI for RAG creates an index called a corpus. The index structures the knowledge base so it's optimized for searching. For example, the index is like a detailed table of contents for a massive reference book.

  5. Retrieval: When a user asks a question or provides a prompt, the retrieval component in LlamaIndex on Vertex AI for RAG searches through its knowledge base to find information that is relevant to the query.

  6. Generation: The retrieved information becomes the context added to the original user query as a guide for the generative AI model to generate factually grounded and relevant responses.
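
To make these steps concrete, the following is a minimal sketch of the flow using the Vertex AI SDK for Python. The project ID, Cloud Storage path, model version, and prompt are placeholder values, and the exact module and parameter names (for example, vertexai.preview.rag) can vary by SDK version.

  from vertexai.preview import rag
  from vertexai.preview.generative_models import GenerativeModel, Tool
  import vertexai

  # Initialize the Vertex AI SDK (PROJECT_ID is a placeholder).
  vertexai.init(project="PROJECT_ID", location="us-central1")

  # Data indexing: create a corpus (the index).
  rag_corpus = rag.create_corpus(display_name="my-rag-corpus")

  # Data ingestion and transformation: import files from Cloud Storage and
  # split them into chunks before they are embedded.
  rag.import_files(
      rag_corpus.name,
      ["gs://BUCKET_NAME/my-folder"],  # placeholder path
      chunk_size=1024,
      chunk_overlap=200,
  )

  # Retrieval and generation: wrap the corpus in a retrieval tool and
  # pass it to a supported Gemini model.
  rag_retrieval_tool = Tool.from_retrieval(
      retrieval=rag.Retrieval(
          source=rag.VertexRagStore(
              rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
              similarity_top_k=3,
              vector_distance_threshold=0.5,
          ),
      )
  )
  rag_model = GenerativeModel("gemini-1.5-flash-001", tools=[rag_retrieval_tool])
  response = rag_model.generate_content("What is retrieval-augmented generation?")
  print(response.text)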

Supported models

This section lists the Google models and open models that support LlamaIndex on Vertex AI for RAG.

Gemini models

The following table lists the Gemini models and their versions that support LlamaIndex on Vertex AI for RAG:

Model Version
Gemini 1.5 Flash gemini-1.5-flash-001
Gemini 1.5 Pro gemini-1.5-pro-001
Gemini 1.0 Pro gemini-1.0-pro-001, gemini-1.0-pro-002
Gemini 1.0 Pro Vision gemini-1.0-pro-vision-001
Gemini gemini-experimental

Open models

The Google-operated Llama 3.1 model-as-a-service (MaaS) endpoint and your self-deployed open model endpoints support LlamaIndex on Vertex AI for RAG.

The following code sample demonstrates how to use the Gemini API to create an open model instance. Replace PROJECT_ID, REGION, and ENDPOINT_ID with your own values.

  from vertexai.preview.generative_models import GenerativeModel

  # rag_retrieval_tool is a retrieval tool built from your corpus, as shown
  # in the sketch earlier on this page.

  # Create a model instance with the Llama 3.1 MaaS endpoint
  rag_model = GenerativeModel(
      "projects/PROJECT_ID/locations/REGION/publishers/meta/models/llama3-405B-instruct-maas",
      tools=[rag_retrieval_tool],
  )

  # Create a model instance with your self-deployed open model endpoint
  rag_model = GenerativeModel(
      "projects/PROJECT_ID/locations/REGION/endpoints/ENDPOINT_ID",
      tools=[rag_retrieval_tool],
  )

Supported embedding models

Embedding models are used to create a corpus and for search and retrieval during response generation. This section lists the supported embedding models.

  • textembedding-gecko@003
  • textembedding-gecko-multilingual@001
  • text-embedding-004 (default)
  • text-multilingual-embedding-002
  • textembedding-gecko@002 (fine-tuned versions only)
  • textembedding-gecko@001 (fine-tuned versions only)

For more information about tuning embedding models, see Tune text embeddings.

The following open embedding models are also supported. You can find them in Model Garden.

  • e5-base-v2
  • e5-large-v2
  • e5-small-v2
  • multilingual-e5-large
  • multilingual-e5-small
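
The embedding model is selected when a corpus is created. The following is a minimal sketch, assuming the vertexai.preview.rag module; the display name is a placeholder, and when no configuration is provided the default text-embedding-004 model is used.

  from vertexai.preview import rag

  # Specify the embedding model to use for the corpus. If omitted,
  # text-embedding-004 is used by default.
  embedding_model_config = rag.EmbeddingModelConfig(
      publisher_model="publishers/google/models/text-embedding-004"
  )

  rag_corpus = rag.create_corpus(
      display_name="my-rag-corpus",  # placeholder
      embedding_model_config=embedding_model_config,
  )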

Supported document types

Only text documents are supported. The following table shows the file types and their file size limits:

File type File size limit
Google documents 50 MB when exported from Google Workspace
Google drawings 10 MB when exported from Google Workspace
Google slides 10 MB when exported from Google Workspace
HTML file 10 MB
JSON file 1 MB
Markdown file 10 MB
Microsoft PowerPoint slides (PPTX file) 10 MB
Microsoft Word documents (DOCX file) 50 MB
PDF file 50 MB
Text file 10 MB

Using LlamaIndex on Vertex AI for RAG with other document types is possible but can generate lower-quality responses.

Supported data sources

The following data sources are supported:

  • Upload a local file: A single-file upload using upload_file (up to 25 MB), which is a synchronous call.
  • Cloud Storage: Import one or more files from Cloud Storage.
  • Google Drive: Import a directory from Google Drive. A sketch of these import calls follows this list.

    The service account must be granted the correct permissions to import files. Otherwise, no files are imported and no error message is displayed. For more information about file size limits, see Supported document types.

    To authenticate and grant permissions, do the following:

    1. Go to the IAM page of your Google Cloud project.
    2. Select Include Google-provided role grant.
    3. Search for the Vertex AI RAG Data Service Agent service account.
    4. Click Share on the Google Drive folder, and share it with the service account.
    5. Grant Viewer permission to the service account on your Google Drive folder or file. The Google Drive resource ID can be found in the web URL.
  • Slack: Import files from Slack by using a data connector.

  • Jira: Import files from Jira by using a data connector.
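
The following is a minimal sketch of bringing data from these sources into a corpus with the Vertex AI SDK for Python. The corpus name, bucket, file path, and Drive folder ID are placeholders; Slack and Jira imports require data-connector configuration that isn't shown here.

  from vertexai.preview import rag

  CORPUS_NAME = "projects/PROJECT_ID/locations/REGION/ragCorpora/CORPUS_ID"  # placeholder

  # Local file: a synchronous, single-file upload (up to 25 MB).
  rag.upload_file(
      corpus_name=CORPUS_NAME,
      path="./my_document.pdf",          # placeholder local path
      display_name="my_document.pdf",
  )

  # Cloud Storage and Google Drive: import one or more paths into the corpus.
  rag.import_files(
      CORPUS_NAME,
      [
          "gs://BUCKET_NAME/my-folder",                              # Cloud Storage
          "https://drive.google.com/drive/folders/DRIVE_FOLDER_ID",  # Google Drive
      ],
  )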

For more information, see the RAG API reference.

Supported data transformations

After a document is ingested, LlamaIndex on Vertex AI for RAG runs a set of transformations to prepare the data for indexing. You can control how your data is chunked for your use case by using the following parameters:

Parameter Description
chunk_size When documents are ingested into an index, they are split into chunks. The chunk_size parameter (in tokens) specifies the size of the chunk. The default chunk size is 1,024 tokens.
chunk_overlap By default, documents are split into chunks with a certain amount of overlap to improve relevance and retrieval quality. The default chunk overlap is 200 tokens.

A smaller chunk size means the embeddings are more precise. A larger chunk size means that the embeddings might be more general but can miss specific details.

For example, if you convert 200 words as opposed to 1,000 words into an embedding array of the same dimension, you can lose details. Chunk size also interacts with the model's context length limit: a large chunk of text might not fit into a model with a small context window.
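
As a sketch, both parameters are set when files are imported into the corpus; the corpus name and path below are placeholders.

  from vertexai.preview import rag

  # Split imported documents into 512-token chunks with 100 tokens of overlap
  # (the defaults are 1,024 and 200).
  rag.import_files(
      "projects/PROJECT_ID/locations/REGION/ragCorpora/CORPUS_ID",  # placeholder
      ["gs://BUCKET_NAME/my-folder"],                               # placeholder
      chunk_size=512,
      chunk_overlap=100,
  )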

Retrieval parameters

The following table includes the retrieval parameters:

Parameter Description
similarity_top_k Controls the maximum number of contexts that are retrieved.
vector_distance_threshold Only contexts with a distance smaller than the threshold are considered.
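
As a sketch, both parameters are passed to the direct retrieval call (the same options apply to the retrieval tool used during generation). The corpus name and query are placeholders, and parameter names can vary by SDK version.

  from vertexai.preview import rag

  # Retrieve up to 10 contexts whose vector distance to the query is below 0.5.
  response = rag.retrieval_query(
      rag_resources=[
          rag.RagResource(
              rag_corpus="projects/PROJECT_ID/locations/REGION/ragCorpora/CORPUS_ID"
          )
      ],
      text="What is retrieval-augmented generation?",
      similarity_top_k=10,
      vector_distance_threshold=0.5,
  )
  print(response)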

Manage your corpus

This section describes how you can manage your corpus for RAG tasks by performing index management and file management.

Corpus management

A corpus, also referred to as an index, is a collection of documents or source of information. The corpus can be queried to retrieve relevant contexts for response generation. When creating a corpus for the first time, the process might take an additional minute.

When creating a corpus, you can specify options such as a display name, a description, and the embedding model to use.

Concurrent operations on corpora aren't supported. For more information, see the RAG API reference.
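
Beyond creation, a corpus can be listed, fetched, and deleted. The following is a minimal sketch, assuming the vertexai.preview.rag module; the resource name is a placeholder.

  from vertexai.preview import rag

  # List all corpora in the project, fetch one by resource name, and delete it.
  corpora = rag.list_corpora()
  corpus = rag.get_corpus(
      name="projects/PROJECT_ID/locations/REGION/ragCorpora/CORPUS_ID"  # placeholder
  )
  rag.delete_corpus(name=corpus.name)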

File management

The following file operations are supported: uploading a local file, importing files, and getting, listing, and deleting files in a corpus.
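
Upload and import are shown in earlier sketches; the following is a minimal sketch of the list, get, and delete operations, assuming the vertexai.preview.rag module (the corpus and file resource names are placeholders).

  from vertexai.preview import rag

  CORPUS_NAME = "projects/PROJECT_ID/locations/REGION/ragCorpora/CORPUS_ID"  # placeholder

  # List the files in a corpus, fetch one by resource name, and delete it.
  files = rag.list_files(corpus_name=CORPUS_NAME)
  for f in files:
      print(f.display_name, f.name)

  rag_file = rag.get_file(name="FILE_RESOURCE_NAME")  # placeholder
  rag.delete_file(name=rag_file.name)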

For more information, see the RAG API reference.

What's next