Document splitter behavior
General splitter behavior
Splitter output contains split information for the input document, including a confidence score. The Document AI API outputs a Document JSON object, and the output format uses the entities field to represent document splits. Additional information depends on the specific type of splitter.
The splitter is not designed to split logical documents that are longer than 30 pages. A logical document that exceeds 30 pages (for example, a 40-page bank statement) might be split into two or more documents, each of which is classified separately.
Splitters identify page boundaries, but do not actually split the input document for you. To get separate files, you must physically split the PDF yourself by using the page boundaries reported in the output.
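The following is a minimal sketch of how the page boundaries could be used to split a PDF physically. It assumes the splitter output has already been loaded into a Python dict with the snake_case field names used in the output example on this page, and that the third-party pypdf package is available for the actual file writing. The helper names split_page_groups and split_pdf and the output file naming are illustrative and are not part of the Document AI API.

```python
from typing import Any, Dict, List


def split_page_groups(document: Dict[str, Any]) -> List[List[int]]:
    """Return one list of zero-based page indices per split entity."""
    groups = []
    for entity in document.get("entities", []):
        refs = entity.get("page_anchor", {}).get("page_refs", [])
        # Assumption: a page ref without a "page" key means page 0, because
        # JSON serialization may omit zero values. int() also accepts page
        # numbers that were serialized as strings.
        groups.append([int(ref.get("page", 0)) for ref in refs])
    return groups


def split_pdf(pdf_path: str, document: Dict[str, Any], prefix: str = "split") -> List[str]:
    """Write one PDF per split entity and return the output paths."""
    # Imported lazily so split_page_groups works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    paths = []
    for number, pages in enumerate(split_page_groups(document)):
        writer = PdfWriter()
        for index in pages:
            writer.add_page(reader.pages[index])
        out_path = f"{prefix}_{number}.pdf"  # hypothetical naming scheme
        with open(out_path, "wb") as handle:
            writer.write(handle)
        paths.append(out_path)
    return paths
```

Calling split_pdf("input.pdf", document) would write one file per entity in the splitter output. Keeping the page grouping in its own function makes it possible to inspect the proposed boundaries before any files are written.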
Output examples:
Document
{
  "text": "page1 page2 page3",
  "entities": [
    {
      "type": "",
      "confidence": 0.9,
      "text_anchor": {
        "text_segments": {
          "start_index": 0,
          "end_index": 12
        }
      },
      "page_anchor": {
        "page_refs": [
          { "page": 0 },
          { "page": 1 }
        ]
      }
    },
    {
      "type": "",
      "confidence": 0.8,
      "text_anchor": {
        "text_segments": {
          "start_index": 12,
          "end_index": 18
        }
      },
      "page_anchor": {
        "page_refs": [
          { "page": 2 }
        ]
      }
    }
  ]
}
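To illustrate how the fields in this output relate to one another, the sketch below reads each entity, slices the document text by using the text anchor indices, and marks splits whose confidence falls below a caller-chosen threshold. The function name summarize_splits, the review_below parameter, and the needs_review key are hypothetical and exist only for this example. The code accepts text_segments either as a single object, as shown above, or as a list.

```python
from typing import Any, Dict, List


def summarize_splits(document: Dict[str, Any], review_below: float = 0.85) -> List[Dict[str, Any]]:
    """Return the type, confidence, and text of each split entity."""
    text = document.get("text", "")
    summaries = []
    for entity in document.get("entities", []):
        segments = entity.get("text_anchor", {}).get("text_segments", [])
        # The output example shows one object here; wrap it so that a
        # single object and a list of objects are handled the same way.
        if isinstance(segments, dict):
            segments = [segments]
        pieces = [
            text[int(seg.get("start_index", 0)):int(seg.get("end_index", 0))]
            for seg in segments
        ]
        confidence = float(entity.get("confidence", 0.0))
        summaries.append({
            "type": entity.get("type", ""),
            "confidence": confidence,
            "text": "".join(pieces),
            # Flag rather than drop low-confidence splits, so that no pages
            # are silently lost; the threshold is an arbitrary example value.
            "needs_review": confidence < review_below,
        })
    return summaries
```

Flagging low-confidence splits instead of discarding them keeps every page accounted for, which matters when the boundaries are later used to cut the file.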