Use Document AI layout parser with RAG Engine

This page introduces the Document AI layout parser and how it's used with RAG Engine.

Document AI

Document AI is a document-processing and document-understanding platform that extracts unstructured data from documents and transforms it into structured fields that are suitable for storage in a database. Structured data is easier to understand, analyze, and consume.

Document AI is built on top of products within Vertex AI with generative AI to help you create scalable, end-to-end, cloud-based document processing applications. No specialized machine-learning expertise is required to use these products.

Document AI layout parser

The layout parser extracts content elements from the document, such as text, tables, and lists. The layout parser then creates context-aware chunks that facilitate information retrieval in generative AI and discovery applications.

When it's used for retrieval and LLM generation, the document's layout is considered during the chunking process, which improves semantic coherence and reduces noise in the content. All text in a chunk comes from the same layout entity, such as the heading, subheading, or list.
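To illustrate the idea that all text in a chunk comes from the same layout entity, the following is a minimal sketch in plain Python. It is not the Document AI algorithm: the `LayoutBlock` type, the entity labels, and the `max_words` budget are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LayoutBlock:
    """One content element plus the layout entity it belongs to (hypothetical type)."""
    entity: str  # for example "heading:Setup" or "list:Requirements"
    text: str


def chunk_by_layout(blocks: List[LayoutBlock], max_words: int = 50) -> List[dict]:
    """Merge consecutive blocks into chunks without crossing entity boundaries."""
    chunks: List[dict] = []
    for block in blocks:
        words = len(block.text.split())
        last = chunks[-1] if chunks else None
        same_entity = last is not None and last["entity"] == block.entity
        if same_entity and last["words"] + words <= max_words:
            # Same layout entity and still within budget: extend the chunk.
            last["text"] += " " + block.text
            last["words"] += words
        else:
            # New entity (or budget exceeded): start a new chunk.
            chunks.append({"entity": block.entity, "text": block.text, "words": words})
    return chunks


blocks = [
    LayoutBlock("heading:Setup", "Install the tool."),
    LayoutBlock("heading:Setup", "Then configure it."),
    LayoutBlock("list:Requirements", "Python 3.9 or later"),
]
chunks = chunk_by_layout(blocks)
for chunk in chunks:
    print(chunk["entity"], "->", chunk["text"])
```

Because a chunk never mixes text from two entities, a retrieved chunk stays on one topic, which is the coherence benefit described above.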

For the file types that layout detection supports, see Layout detection per file type.

Use the layout parser in Vertex AI RAG

The ImportRagFiles API supports the layout parser; however, the following limitations apply:

  • The maximum input file size is 20 MB for all file types.
  • Each PDF file can contain a maximum of 500 pages.

The Document AI quotas and pricing apply.
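A local pre-check can catch oversized files before you upload them to Cloud Storage. The following sketch checks only the file size limit; the helper name `oversized_files` is hypothetical, and treating 1 MB as 1,000,000 bytes is a conservative assumption because the exact byte count isn't stated. Checking the 500-page limit would need a PDF library and is not shown.

```python
import os
import tempfile

# 20 MB limit from the list above. Treating 1 MB as 1,000,000 bytes is an
# assumption; it is the stricter of the two common interpretations.
MAX_FILE_BYTES = 20 * 1000 * 1000


def oversized_files(paths, max_bytes=MAX_FILE_BYTES):
    """Return the local paths whose size on disk exceeds max_bytes."""
    return [path for path in paths if os.path.getsize(path) > max_bytes]


# Demo with a small temporary file standing in for a real document.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
    handle.write(b"x" * 1024)
    sample_path = handle.name

too_big = oversized_files([sample_path])
print(too_big)  # the 1 KB file is under the limit, so the list is empty
os.remove(sample_path)
```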

Enable the Document AI API

The following sample code demonstrates how to enable advanced parsing using REST in a curl command and using the Vertex AI SDK for Python.

REST

This code sample demonstrates how you can import Cloud Storage files using the layout parser. For more configuration options, such as importing files from another source, refer to ImportRagFilesConfig.

Replace the variables used in the code sample:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process your request.
  • RAG_CORPUS_ID: The ID of the RAG corpus resource.
  • GCS_URIS: A list of Cloud Storage locations. For example: "gs://my-bucket1", "gs://my-bucket2".
  • LAYOUT_PARSER_PROCESSOR_NAME: The resource path to the layout parser processor that was created. For example: "projects/{project}/locations/{location}/processors/{processor_id}".
  • CHUNK_SIZE: Optional: The number of tokens that each chunk should have.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/ragCorpora/RAG_CORPUS_ID/ragFiles:import

Request JSON body:

{
  "import_rag_files_config": {
    "gcs_source": {
      "uris": [GCS_URIS]
    },
    "file_parsing_config": {
      "layout_parser": {
        "processor_name": "LAYOUT_PARSER_PROCESSOR_NAME"
      }
    },
    "rag_file_chunking_config": {
      "chunk_size": CHUNK_SIZE
    }
  }
}

To send your request, save the request body in a file named request.json, and execute the following command:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/ragCorpora/RAG_CORPUS_ID/ragFiles:import"

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

Replace the variables used in the code sample:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process your request.
  • RAG_CORPUS_ID: The ID of the RAG corpus resource.
  • GCS_URIS: A list of Cloud Storage locations. For example: "gs://my-bucket1", "gs://my-bucket2".
  • LAYOUT_PARSER_PROCESSOR_NAME: The resource path to the layout parser processor that was created. For example: "projects/{project}/locations/{location}/processors/{processor_id}".
  • CHUNK_SIZE: Optional: The number of tokens that each chunk should have.

from vertexai.preview import rag
import vertexai

PROJECT_ID = "PROJECT_ID"
corpus_name = f"projects/{PROJECT_ID}/locations/LOCATION/ragCorpora/RAG_CORPUS_ID"
# Supports Cloud Storage and Google Drive links
paths = ["https://drive.google.com/file/123", "GCS_URIS"]
# Full resource path of the layout parser processor
layout_parser_processor_name = "LAYOUT_PARSER_PROCESSOR_NAME"

# Initialize Vertex AI API once per session
vertexai.init(project=PROJECT_ID, location="LOCATION")

response = rag.import_files(
    corpus_name=corpus_name,
    paths=paths,
    chunk_size=512,  # Optional
    chunk_overlap=100,  # Optional
    max_embedding_requests_per_min=900,  # Optional
    layout_parser=rag.LayoutParserConfig(
        processor_name=layout_parser_processor_name,
        max_parsing_requests_per_min=120,  # Optional
    )
)
print(f"Imported {response.imported_rag_files_count} files.")
# Example response:
# Imported 2 files.
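Because LAYOUT_PARSER_PROCESSOR_NAME must be a full resource path rather than a bare processor ID, it can help to validate the value before you start an import. The following sketch checks only the format shown in this page; the helper name `parse_processor_name` is hypothetical, and other valid path forms, if any exist, would be rejected by it.

```python
import re

# Matches the format given above:
# projects/{project}/locations/{location}/processors/{processor_id}
_PROCESSOR_NAME = re.compile(
    r"^projects/([^/]+)/locations/([^/]+)/processors/([^/]+)$"
)


def parse_processor_name(name: str) -> dict:
    """Split a processor resource path into its parts, or raise ValueError."""
    match = _PROCESSOR_NAME.match(name)
    if match is None:
        raise ValueError(
            "expected projects/{project}/locations/{location}/"
            f"processors/{{processor_id}}, got: {name!r}"
        )
    project, location, processor_id = match.groups()
    return {"project": project, "location": location, "processor_id": processor_id}


parts = parse_processor_name("projects/my-project/locations/us/processors/abc123")
print(parts["location"], parts["processor_id"])
```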

Turn on your layout parser

The following sections and code samples show how to turn on the layout parser when you import files into a RAG corpus.

Your RAG knowledge base (corpus)

If you don't have a RAG corpus, then create a RAG corpus. For example, see Create a RAG corpus example.

If you already have a RAG corpus, existing files that were imported without the layout parser aren't re-imported when you import files using the layout parser. To use the layout parser with those files, delete them first. For example, see Delete a RAG file example.
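The delete-then-import order can be sketched as a small helper. To keep the example runnable without credentials, the three operations are passed in as callables and replaced with in-memory stand-ins. In a real project they would wrap SDK calls; `rag.import_files` appears in the samples on this page, while using `rag.list_files` and `rag.delete_file` for the other two is an assumption to verify against the SDK reference.

```python
from typing import Callable, Iterable, List


def reimport_with_layout_parser(
    list_file_names: Callable[[], Iterable[str]],
    delete_file: Callable[[str], None],
    import_files: Callable[[], None],
) -> List[str]:
    """Delete every listed RAG file, then run the import once.

    Returns the resource names that were deleted.
    """
    deleted = list(list_file_names())  # snapshot before changing the corpus
    for name in deleted:
        delete_file(name)
    import_files()
    return deleted


# In-memory stand-ins (hypothetical) so the sketch runs locally.
fake_corpus = ["ragFiles/1", "ragFiles/2"]
events = []


def fake_delete(name):
    fake_corpus.remove(name)
    events.append(("delete", name))


def fake_import():
    fake_corpus.append("ragFiles/3")
    events.append(("import", None))


deleted = reimport_with_layout_parser(lambda: fake_corpus, fake_delete, fake_import)
print(deleted, fake_corpus)
```

This helper deletes every file that the listing returns. If only some files need to be parsed again, narrow the listing to those files.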

Importing files using Layout Parser

Files and folders from various sources can be imported using the layout parser.

REST

The code sample shows how to import Cloud Storage files using the layout parser. For more configuration options, including importing files from another source, refer to the ImportRagFilesConfig reference.

Before using any of the request data, replace the following variables used in the code sample:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus resource.
  • GCS_URIS: A list of Cloud Storage locations. For example: "gs://my-bucket1", "gs://my-bucket2".
  • LAYOUT_PARSER_PROCESSOR_NAME: The resource path to the layout parser processor that was created. For example: "projects/{project}/locations/{location}/processors/{processor_id}".
  • CHUNK_SIZE: Optional: The number of tokens each chunk should have.

HTTP method and URL:

POST https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/ragCorpora/RAG_CORPUS_ID/ragFiles:import

Request JSON body:

{
  "import_rag_files_config": {
    "gcs_source": {
      "uris": [GCS_URIS]
    },
    "file_parsing_config": {
      "layout_parser": {
        "processor_name": "LAYOUT_PARSER_PROCESSOR_NAME"
      }
    },
    "rag_file_chunking_config": {
      "chunk_size": CHUNK_SIZE
    }
  }
}
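If you prefer to generate the request instead of editing the placeholders by hand, the URL and JSON body can be assembled with the Python standard library. This is a sketch: the function name `build_import_request` and the example values are hypothetical, and the field names are copied from the request body shown on this page.

```python
import json


def build_import_request(project_id, location, rag_corpus_id,
                         gcs_uris, processor_name, chunk_size=None):
    """Return (url, body) for the ragFiles:import request shown above."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1beta1/"
        f"projects/{project_id}/locations/{location}/"
        f"ragCorpora/{rag_corpus_id}/ragFiles:import"
    )
    config = {
        "gcs_source": {"uris": list(gcs_uris)},
        "file_parsing_config": {
            "layout_parser": {"processor_name": processor_name}
        },
    }
    if chunk_size is not None:
        # CHUNK_SIZE is optional, so the field is omitted when it isn't set.
        config["rag_file_chunking_config"] = {"chunk_size": chunk_size}
    return url, {"import_rag_files_config": config}


url, body = build_import_request(
    project_id="my-project",          # example value
    location="us-central1",           # example value
    rag_corpus_id="1234567890",       # example value
    gcs_uris=["gs://my-bucket1", "gs://my-bucket2"],
    processor_name="projects/my-project/locations/us/processors/abc123",
    chunk_size=512,
)
print(url)
print(json.dumps(body, indent=2))  # this output can be saved as request.json
```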

To send your request, save the request body in a file named request.json, and execute the following command:

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json; charset=utf-8" \
    -d @request.json \
    "https://LOCATION-aiplatform.googleapis.com/v1beta1/projects/PROJECT_ID/locations/LOCATION/ragCorpora/RAG_CORPUS_ID/ragFiles:import"

Python

To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.

Replace the following variables used in the code sample:

  • PROJECT_ID: Your project ID.
  • LOCATION: The region to process the request.
  • RAG_CORPUS_ID: The ID of the RAG corpus resource.
  • GCS_URIS: A list of Cloud Storage locations. For example: "gs://my-bucket1", "gs://my-bucket2".
  • LAYOUT_PARSER_PROCESSOR_NAME: The resource path to the layout parser processor that was created. For example: "projects/{project}/locations/{location}/processors/{processor_id}".
  • CHUNK_SIZE: Optional: The number of tokens each chunk should have.

from vertexai.preview import rag
import vertexai

PROJECT_ID = "PROJECT_ID"
corpus_name = f"projects/{PROJECT_ID}/locations/LOCATION/ragCorpora/RAG_CORPUS_ID"
# Supports Cloud Storage and Google Drive links
paths = ["https://drive.google.com/file/123", "gs://my_bucket/my_files_dir"]
# Full resource path of the layout parser processor
layout_parser_processor_name = "LAYOUT_PARSER_PROCESSOR_NAME"

# Initialize Vertex AI API once per session
vertexai.init(project=PROJECT_ID, location="LOCATION")

response = rag.import_files(
    corpus_name=corpus_name,
    paths=paths,
    chunk_size=512,  # Optional
    chunk_overlap=100,  # Optional
    max_embedding_requests_per_min=900,  # Optional
    layout_parser=rag.LayoutParserConfig(
        processor_name=layout_parser_processor_name,
        max_parsing_requests_per_min=120,  # Optional
    )
)
print(f"Imported {response.imported_rag_files_count} files.")
# Example response:
# Imported 2 files.

Retrieval query

When a user asks a question or provides a prompt, the retrieval component in RAG searches through its knowledge base to find information that is relevant to the query.
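As a rough illustration of the retrieval step, the following toy example ranks chunks by word-overlap cosine similarity with the query. It is only a sketch of the concept: a real RAG system typically compares embedding vectors rather than raw word counts, and the function names and sample chunks here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Return up to top_k chunks with a nonzero similarity, best match first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:top_k] if score > 0]


chunks = [
    "The layout parser extracts tables and lists.",
    "Pricing depends on the number of pages.",
    "Quotas apply per project.",
]
results = retrieve("How does the layout parser handle tables?", chunks, top_k=1)
print(results)
```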

For an example of retrieving RAG files from a corpus based on a query text, see Retrieval query.

Prediction

The prediction generates a grounded response using the retrieved contexts. For an example, see Generation.

What's next