Splitters behavior
Splitter processor output contains split information for the input document, including a confidence score. The Document AI API outputs a Document JSON object, and the output format uses the entities field to represent document splits. Additional information depends on the specific type of splitter.
Entity.type specifies the document classification. For a full list of document types that can be identified, see the lists under Document types identified.
Entity.pageAnchor.pageRefs[] specifies the pages that contain each sub-document. Note that pageRefs[].page is zero-based and is the index into the document.pages[] field.
Here is a typical JSON splitter response for a recognized document, indicating a form_1040 class document on the second and third pages of the input file:
{
  "textAnchor": {
    "textSegments": [
      {
        "startIndex": "5543",
        "endIndex": "10470"
      }
    ]
  },
  "type": "form_1040",
  "confidence": 0.8983272,
  "pageAnchor": {
    "pageRefs": [
      {
        "page": "1",
        "confidence": 0.8983272
      },
      {
        "page": "2",
        "confidence": 0.9636311
      }
    ]
  }
}
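As a minimal sketch of how pageRefs[].page indexes into document.pages[], the following Python reads a Document that has been loaded from JSON into a dict. The helper names page_indexes and pages_for_split are hypothetical, and the sketch assumes that the JSON encoding carries page as a string and may omit it for the first page (index 0); verify this against your own output.

```python
def page_indexes(entity: dict) -> list[int]:
    """Return the zero-based page indexes referenced by one split entity."""
    refs = entity.get("pageAnchor", {}).get("pageRefs", [])
    # Assumption: "page" is a string in JSON and may be absent for page 0.
    return [int(ref.get("page", 0)) for ref in refs]


def pages_for_split(document: dict, entity: dict) -> list[dict]:
    """Look up the document.pages[] items that belong to one split."""
    pages = document.get("pages", [])
    return [pages[i] for i in page_indexes(entity)]


# Example: the form_1040 entity from the response above, placed in a
# four-page document. Only the fields used here are filled in.
document = {
    "pages": [{"pageNumber": n} for n in (1, 2, 3, 4)],
    "entities": [
        {
            "type": "form_1040",
            "confidence": 0.8983272,
            "pageAnchor": {"pageRefs": [{"page": "1"}, {"page": "2"}]},
        }
    ],
}

entity = document["entities"][0]
print(page_indexes(entity))  # [1, 2]
print([p["pageNumber"] for p in pages_for_split(document, entity)])  # [2, 3]
```

The zero-based indexes 1 and 2 select the pages whose one-based pageNumber values are 2 and 3, which matches the "second and third pages" reading of the response.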
Unlike custom classifiers, splitters return only one class for each split. They don't return multiple candidate classes with their confidence scores.
The splitter is not designed to split logical documents that are more than 30 pages long. Such documents (for example, a 40-page bank statement) may be split into two or more documents and classified separately.
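One way to spot this case when reviewing results is to look for neighboring splits that have the same type and sit on consecutive pages. The following is a heuristic sketch only, not something Document AI provides: possible_fragments and page_span are hypothetical helpers, the bank_statement type name is used purely for illustration, and the code assumes entities are listed in page order and that a missing page value means index 0.

```python
from typing import Optional


def page_span(entity: dict) -> Optional[tuple[int, int]]:
    """Return (first, last) zero-based page index, or None if no pages."""
    refs = entity.get("pageAnchor", {}).get("pageRefs", [])
    # Assumption: "page" may be omitted in JSON for the first page (index 0).
    indexes = [int(ref.get("page", 0)) for ref in refs]
    return (min(indexes), max(indexes)) if indexes else None


def possible_fragments(entities: list[dict]) -> list[tuple[int, int]]:
    """Return index pairs (i, i + 1) of same-type splits on adjacent pages."""
    pairs = []
    for i in range(len(entities) - 1):
        current, following = entities[i], entities[i + 1]
        span_a, span_b = page_span(current), page_span(following)
        if span_a is None or span_b is None:
            continue
        same_type = current.get("type") == following.get("type")
        adjacent = span_b[0] == span_a[1] + 1
        if same_type and adjacent:
            pairs.append((i, i + 1))
    return pairs


def make_entity(doc_type: str, first: int, last: int) -> dict:
    """Build a minimal split entity covering pages first..last (zero-based)."""
    refs = [{"page": str(p)} for p in range(first, last + 1)]
    return {"type": doc_type, "pageAnchor": {"pageRefs": refs}}


# A 40-page statement reported as two splits, followed by another form.
entities = [
    make_entity("bank_statement", 0, 29),
    make_entity("bank_statement", 30, 39),
    make_entity("form_1040", 40, 41),
]
print(possible_fragments(entities))  # [(0, 1)]
```

Two genuinely separate documents of the same type that happen to be adjacent would be flagged as well, so the result is a hint for a reviewer rather than a decision to merge.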
Splitters identify page boundaries, but do not actually split the input document for you. The Document AI Toolbox SDK provides utility functions that can split the input document based on output from a splitter processor.
It's highly recommended that humans review split predictions before any files are actually split, unless the predictions have been proven accurate enough for your business needs.
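A review step can be sketched as a confidence gate: splits that score below a threshold go to a person, and the rest proceed. The threshold value and the name needs_review are assumptions for illustration; choose a threshold by measuring accuracy on your own documents.

```python
REVIEW_THRESHOLD = 0.9  # Hypothetical value; tune it against labeled data.


def needs_review(entity: dict, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True if the split or any of its pages scores below threshold."""
    # A missing confidence is treated as 0.0, so it is always sent to review.
    scores = [entity.get("confidence", 0.0)]
    for ref in entity.get("pageAnchor", {}).get("pageRefs", []):
        scores.append(ref.get("confidence", 0.0))
    return min(scores) < threshold


entity = {
    "type": "form_1040",
    "confidence": 0.8983272,
    "pageAnchor": {
        "pageRefs": [
            {"page": "1", "confidence": 0.8983272},
            {"page": "2", "confidence": 0.9636311},
        ]
    },
}

print(needs_review(entity))                  # True: 0.8983272 is below 0.9
print(needs_review(entity, threshold=0.85))  # False
```

Checking the per-page scores as well as the entity score is a design choice in this sketch: a single weak page boundary is enough to route the split to a reviewer.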
Document types identified
This section details the document classes recognized by pretrained splitter processors.
[1] The corresponding parser does not support this document type. The splitter can identify and classify documents of this type, but Document AI does not provide a parser to extract information from them.
Output examples
Processors | Output samples |
---|---|
Code samples
Splitters identify page boundaries, but don't actually split the input document for you. You can use Document AI Toolbox to physically split a PDF file by using the page boundaries. The following code samples print the page ranges without splitting the PDF:
Java
For more information, see the Document AI Java API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
For more information, see the Document AI Node.js API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
For more information, see the Document AI Python API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
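As a minimal sketch of printing page ranges, the following Python works on a Document that a splitter processor has already returned and that has been loaded from JSON. It makes no API call, so it is not the client-library sample, and it needs no credentials. The function names are hypothetical, and the sketch assumes that a missing page value means the first page.

```python
import json


def describe_pages(page_refs: list[dict]) -> str:
    """Format zero-based pageRefs as a one-based, human-readable phrase."""
    # Assumption: "page" may be omitted in JSON for the first page (index 0).
    numbers = [str(int(ref.get("page", 0)) + 1) for ref in page_refs]
    if len(numbers) == 1:
        return f"page {numbers[0]}"
    return f"pages {', '.join(numbers)}"


def summarize_splits(document: dict) -> list[str]:
    """Build one summary line per split entity in a Document dict."""
    lines = []
    for entity in document.get("entities", []):
        refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        doc_type = entity.get("type", "unknown")
        confidence = entity.get("confidence", 0.0)
        lines.append(
            f"{doc_type}: {describe_pages(refs)} ({confidence:.1%} confidence)"
        )
    return lines


# Stand-in for a saved processor response; only the fields used are present.
document_json = """
{
  "entities": [
    {
      "type": "form_1040",
      "confidence": 0.8983272,
      "pageAnchor": {
        "pageRefs": [
          {"page": "1", "confidence": 0.8983272},
          {"page": "2", "confidence": 0.9636311}
        ]
      }
    }
  ]
}
"""

for line in summarize_splits(json.loads(document_json)):
    print(line)
# form_1040: pages 2, 3 (89.8% confidence)
```

Page numbers are shifted by one for display only; the underlying pageRefs values stay zero-based.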