Process documents with Layout Parser

Layout Parser extracts document content elements like text, tables, and lists, and creates context-aware chunks that facilitate information retrieval in generative AI and discovery applications.

Layout Parser features

  • Parse document layouts. You can input HTML, PDF, or DOCX files to Layout Parser to identify content elements such as text blocks, tables, and lists, and structural elements such as titles and headings. Together, these elements define a document's organization and hierarchy, which provides additional context for information retrieval and discovery.

  • Chunk documents. Layout Parser can break documents up into chunks that retain contextual information about the layout hierarchy of the original document. Answer-generating LLMs can use chunks to improve relevance and decrease computational load.

    Taking a document's layout into account during chunking improves semantic coherence and reduces noise in the content when it's used for retrieval and LLM generation. All text in a chunk comes from the same layout entity, such as a heading, subheading, or list.
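To make the idea of layout-aware chunking concrete, here is a minimal sketch in Python. It is not Layout Parser's algorithm: the `Block` class and the `chunk_blocks` function are hypothetical names invented for this illustration, and whitespace-separated words stand in for tokens. It only shows the property described above, that a chunk never mixes text from two layout entities.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Block:
    """A parsed layout element (hypothetical shape, for illustration only)."""
    kind: str  # for example "heading", "paragraph", or "list"
    text: str


def chunk_blocks(blocks: list[Block], chunk_size: int) -> list[str]:
    """Split each block into chunks of at most chunk_size words.

    A chunk never spans two blocks, so all text in a chunk comes from
    the same layout entity.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    chunks: list[str] = []
    for block in blocks:
        words = block.text.split()  # assumption: words approximate tokens
        for start in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks


blocks = [
    Block("heading", "Quarterly results"),
    Block("paragraph", "Revenue grew in every region this quarter"),
]
print(chunk_blocks(blocks, chunk_size=4))
# ['Quarterly results', 'Revenue grew in every', 'region this quarter']
```

A naive splitter that ignores block boundaries would instead produce a first chunk mixing the heading with the start of the paragraph, which is the kind of noise that layout-aware chunking avoids.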

Limitations

The following limitations apply:

  • Maximum PDF file size of 20 MB
  • Usage limit of 100 document files per project per day
  • Batch processing is not supported
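A client can check these limits before sending a file. The following sketch is one possible pre-flight check. The function name `can_submit` and the constants are hypothetical, and it assumes that "20 MB" means 20 × 1024 × 1024 bytes, which this page does not specify.

```python
MAX_PDF_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" interpreted as 20 MiB
DAILY_FILE_LIMIT = 100            # document files per project per day


def can_submit(pdf_size_bytes: int, files_sent_today: int) -> bool:
    """Return True if a PDF fits the documented size and daily usage limits."""
    within_size = pdf_size_bytes <= MAX_PDF_BYTES
    within_quota = files_sent_today < DAILY_FILE_LIMIT
    return within_size and within_quota


print(can_submit(5 * 1024 * 1024, files_sent_today=12))
print(can_submit(21 * 1024 * 1024, files_sent_today=12))
```

Tracking `files_sent_today` is left to the caller. The service enforces the actual limit, so this check only avoids requests that are likely to be rejected.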

Layout detection per file type

The following table lists the elements that Layout Parser can detect per document file type.

File type    Detected elements
HTML         paragraph, table, list, title, heading
PDF          paragraph, table, title, heading
DOCX         paragraph, table, list, title, heading, header, footnote
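If downstream code depends on a particular element type, the table can be encoded as a lookup. This is a small sketch. The names `DETECTED_ELEMENTS` and `is_detected` are hypothetical, and the element names are copied from the table above, not taken from an API enum.

```python
# Elements that Layout Parser can detect, keyed by file type (from the table).
DETECTED_ELEMENTS = {
    "HTML": {"paragraph", "table", "list", "title", "heading"},
    "PDF": {"paragraph", "table", "title", "heading"},
    "DOCX": {"paragraph", "table", "list", "title", "heading",
             "header", "footnote"},
}


def is_detected(file_type: str, element: str) -> bool:
    """Return True if the element is listed for the given file type."""
    return element in DETECTED_ELEMENTS.get(file_type.upper(), set())


print(is_detected("pdf", "list"))       # False: lists are not listed for PDF
print(is_detected("docx", "footnote"))  # True
```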

Process documents with Layout Parser

Use the following steps to parse and chunk documents with Layout Parser.

  1. Create a Layout Parser processor by following the instructions in Creating and managing processors.

    The processor type name is LAYOUT_PARSER_PROCESSOR.

  2. Enable Layout Parser by following the instructions in Enable a processor.

  3. Input a document to Layout Parser to parse and chunk it.

    To process a document, follow the instructions for online processing requests in Send a processing request, using API version v1beta3.

    Configure fields in ProcessOptions.layoutConfig in ProcessDocumentRequest.

    Input

    The following example JSON configures ProcessOptions.layoutConfig.

    "processOptions": {
      "layoutConfig": {
        "chunkingConfig": {
          "chunkSize": "CHUNK_SIZE",
          "includeAncestorHeadings": "INCLUDE_ANCESTOR_HEADINGS_BOOLEAN"
        }
      }
    }
    

    Replace the following:

    • CHUNK_SIZE: The chunk size, in tokens, to use when splitting documents.
    • INCLUDE_ANCESTOR_HEADINGS_BOOLEAN: Whether or not to include ancestor headings when splitting documents. Ancestor headings are the parents of subheadings and can provide a chunk with additional context about its position in the original document. Up to two levels of headings can be included with a chunk.

    Parsed and chunked documents are output as Document.documentLayout and Document.chunkedDocument in the response.
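The request configuration and the response handling can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not a reference client. The helper names `build_process_request` and `chunk_texts` are hypothetical. The `rawDocument` wrapper and the `document`, `chunks`, and `content` field names are assumptions about the REST JSON shape that you should verify against the API reference. Only `processOptions.layoutConfig.chunkingConfig`, `documentLayout`, and `chunkedDocument` come from this page.

```python
from __future__ import annotations

import base64
import json


def build_process_request(content: bytes, mime_type: str,
                          chunk_size: int,
                          include_ancestor_headings: bool) -> dict:
    """Build a JSON-serializable body for an online processing request."""
    return {
        # Assumption: the file is sent inline, base64-encoded, as rawDocument.
        "rawDocument": {
            "content": base64.b64encode(content).decode("ascii"),
            "mimeType": mime_type,
        },
        # Configuration described on this page.
        "processOptions": {
            "layoutConfig": {
                "chunkingConfig": {
                    "chunkSize": chunk_size,
                    "includeAncestorHeadings": include_ancestor_headings,
                }
            }
        },
    }


def chunk_texts(response: dict) -> list[str]:
    """Collect the text of each chunk from a parsed JSON response.

    Assumption: chunks are listed under document.chunkedDocument.chunks,
    each with a "content" string. Missing fields yield an empty result.
    """
    chunked = response.get("document", {}).get("chunkedDocument", {})
    return [chunk.get("content", "") for chunk in chunked.get("chunks", [])]


body = build_process_request(b"%PDF-1.7 sample bytes", "application/pdf",
                             chunk_size=500, include_ancestor_headings=True)
print(json.dumps(body["processOptions"], indent=2))
```

Sending the body (authentication, endpoint URL, and API version v1beta3) follows the instructions in Send a processing request and is intentionally left out of this sketch.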

What's next