Document splitters behavior
General splitter behavior
Splitter output contains split information for the input document, including a confidence score. The Document AI API outputs a Document JSON object, and the output format uses the entities field to represent document splits. Additional information depends on the specific type of splitter.
- Entity.type specifies the document classification. For a full list of document types that can be identified, see the lists under Document types identified.
- Entity.pageAnchor.pageRefs[] specifies the pages that contain each sub-document. Note that pageRefs[].page is zero-based and is the index into the document.pages[] field.
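As a minimal sketch of how these fields might be read, the following Python works on a Document JSON object that has already been loaded into a dictionary. The helper name `extract_splits`, the `Split` tuple, and the sample values are hypothetical and only illustrate the field layout described above. It also assumes that the JSON serialization may omit `page` when its value is 0 and may encode it as a string, so both cases are handled defensively.

```python
from typing import Any, Dict, List, NamedTuple


class Split(NamedTuple):
    """One sub-document reported by a splitter (hypothetical helper type)."""
    doc_type: str
    confidence: float
    pages: List[int]  # zero-based indices into document.pages[]


def extract_splits(document: Dict[str, Any]) -> List[Split]:
    """Collect type, confidence, and page indices from document["entities"]."""
    splits = []
    for entity in document.get("entities", []):
        page_refs = entity.get("pageAnchor", {}).get("pageRefs", [])
        # Assumption: `page` may be missing (treated as 0) or string-encoded.
        pages = [int(ref.get("page", 0)) for ref in page_refs]
        splits.append(
            Split(
                doc_type=entity.get("type", ""),
                confidence=float(entity.get("confidence", 0.0)),
                pages=pages,
            )
        )
    return splits


# Made-up sample shaped like splitter output; not real processor output.
sample_document = {
    "entities": [
        {
            "type": "w2_2020",
            "confidence": 0.98,
            "pageAnchor": {"pageRefs": [{}, {"page": "1"}]},
        },
        {
            "type": "payslip",
            "confidence": 0.91,
            "pageAnchor": {"pageRefs": [{"page": "2"}]},
        },
    ]
}

for split in extract_splits(sample_document):
    print(split.doc_type, split.confidence, split.pages)
```

Because the page values are zero-based, add 1 before showing them to users as page numbers.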
The splitter is not designed to split logical documents that are over 30 pages long. Logical documents that are more than 30 pages long (for example, a 40-page bank statement) may be split into two or more documents and classified separately.
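One way to spot such cases after processing is to look for neighboring sub-documents that share a type and sit on consecutive pages. The sketch below is only a heuristic under that assumption: the function name `find_possible_fragments` and the sample data are hypothetical, it cannot tell a fragmented long document from two genuinely separate documents of the same type, and it misses fragments that were given different types.

```python
from typing import List, Sequence, Tuple


def find_possible_fragments(
    splits: Sequence[Tuple[str, List[int]]]
) -> List[Tuple[int, int]]:
    """Return index pairs of adjacent splits that may belong together.

    Each split is (type, zero-based page indices). Two neighbors are
    flagged when they share a type and their pages are contiguous.
    """
    candidates = []
    for i in range(len(splits) - 1):
        type_a, pages_a = splits[i]
        type_b, pages_b = splits[i + 1]
        if not pages_a or not pages_b:
            continue  # nothing to compare without page information
        if type_a == type_b and max(pages_a) + 1 == min(pages_b):
            candidates.append((i, i + 1))
    return candidates


# Made-up example: a 40-page bank statement reported as two sub-documents.
splits = [
    ("account_statement_bank", list(range(0, 30))),
    ("account_statement_bank", list(range(30, 40))),
    ("payslip", [40]),
]
print(find_possible_fragments(splits))
```

Flagged pairs are candidates for manual review rather than automatic merging.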
Splitters identify page boundaries, but do not actually split the input document for you. The Document AI Toolbox SDK provides utility functions that can split the input document based on output from a splitter processor.
Document types identified
Lending Document Splitter & Classifier
- 1003 - Legacy Form (standard and customized versions)
  - Return type(s): 1003[1], 1003_2009
- 1040 - 2018, 2019, 2020 (standard and customized versions)
- 1040 Schedule C - 2018, 2019, 2020 (standard and customized versions)
- 1040 Schedule E - 2018, 2019, 2020 (standard and customized versions)
- 1065 - 2018, 2019, 2020 (standard and customized versions)
- 1099-DIV - 2018, 2019, 2020 (standard and customized versions)
  - Return type(s): 1099div[1], 1099div_2018, 1099div_2019, 1099div_2020
- 1099-G - 2018, 2019, 2020 (standard and customized versions)
- 1099-INT - 2018, 2019, 2020 (standard and customized versions)
  - Return type(s): 1099int[1], 1099int_2018, 1099int_2019, 1099int_2020
- 1099-MISC - 2018, 2019, 2020 (standard and customized versions)
  - Return type(s): 1099misc[1], 1099misc_2018, 1099misc_2019, 1099misc_2020
- 1099-NEC - 2020 (standard and customized versions)
  - Return type(s): 1099nec[1], 1099nec_2020
- 1099-R - 2018, 2019, 2020 (standard and customized versions)
  - Return type(s): 1099r[1], 1099r_2018, 1099r_2019, 1099r_2020
- 1120 - 2018, 2019, 2020 (standard and customized versions)
- 1120S - 2018, 2019, 2020 (standard and customized versions)
- Bank Statement
  - Return type(s): account_statement_bank
- Pay Slip
  - Return type(s): payslip
- SSA-1099 - 2018, 2019, 2020 (standard and customized versions)
- US Driver License
  - Return type(s): US_Driver_License
- US Passport
  - Return type(s): US_Passport
- W2 - 2018, 2019, 2020 (standard and customized versions)
  - Return type(s): w2[1], w2_2018, w2_2019, w2_2020
- W9 - Rev. 10-2018, Rev. 11-2017
  - Return type(s): w9[1], w9_2017, w9_2018
- If the splitter cannot identify the type of the document, it returns other.
Procurement Document Splitter & Classifier
- Utility statement: A bill or receipt issued by a utility company (telecommunications, gas, electric, cable service) that shows the amount owed by the customer for the services provided. This may also show the previous payments made by the customer for current or prior services.
  - Return type(s): utility_statement
- Debit note: A document issued by a business stating a monetary amount a client owes to the business.
  - Return type(s): debit_note
- Credit note: A document issued by a business that needs to provide a client with a discount or a refund, or to correct a previous invoicing error.
  - Return type(s): credit_note
- Credit Card Slip: A document that shows a payment made by credit card. It typically includes the total charge amount, a tip amount (mostly US documents), and a total payment. Tip amount and total payment are usually handwritten. This document is relevant for expense processing, but it is not suitable proof of expense.
  - Return type(s): credit_card_slip
- Restaurant statement: A document issued by a restaurant to a customer itemizing the specific items consumed, the taxes, total amount, tips, and amount paid.
  - Return type(s): restaurant_statement
- Air travel statement: A document issued by an airline to a customer itemizing the specific flight and non-flight charges, and the amount paid (if available).
  - Return type(s): air_travel_statement
- Hotel statement: A document issued by a hotel to a customer itemizing the specific charges related to a hotel stay, and the amount paid (if available).
  - Return type(s): hotel_statement
- Car rental statement: A document issued by a car rental company to a customer itemizing the specific charges related to a car rental, and the amount paid (if available).
  - Return type(s): car_rental_statement
- Ground transportation statement: A document issued by a ground transportation company (ride sharing, train/subway) to a customer itemizing the specific charges related to a trip, and the amount paid (if available).
  - Return type(s): ground_transportation_statement
- Invoice statement: A document sent by the seller to the customer that requests payments for products or services and (for the purpose of our taxonomy) is not covered by any other document type definition.
  - Return type(s): invoice_statement
- Receipt statement: A document that shows proof of payment which confirms that a customer has received the goods and services they paid a business for. Conversely, this can be a document showing the business was compensated for the goods or services they sold to a customer and (for the purpose of our taxonomy) is not covered by any other document type definition.
  - Return type(s): receipt_statement
- If the splitter cannot identify the type of the document, it returns other.
[1] The corresponding parser for this form does not support this doc type. This means that the splitter can identify and classify documents of this type, but Document AI does not provide a parser to extract information.
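Many of the lending return types listed above end in a four-digit year or revision suffix (for example, 1099div_2018 or w9_2017), while others have none (for example, w2 or payslip). The following sketch separates that suffix from the form name. The helper `parse_return_type` and the `ReturnType` tuple are hypothetical, and the naming pattern is inferred from the lists above rather than taken from a documented contract, so treat it as an assumption.

```python
import re
from typing import NamedTuple, Optional

# Matches values such as "1099div_2018": a form name, "_", then four digits.
_YEAR_SUFFIX = re.compile(r"^(?P<form>.+)_(?P<year>\d{4})$")


class ReturnType(NamedTuple):
    """A return type split into form name and optional year (hypothetical)."""
    form: str
    year: Optional[int]


def parse_return_type(value: str) -> ReturnType:
    """Split a trailing four-digit suffix from a splitter return type."""
    match = _YEAR_SUFFIX.match(value)
    if match:
        return ReturnType(match.group("form"), int(match.group("year")))
    # No suffix: generic types such as "w2", or names such as "payslip".
    return ReturnType(value, None)


for value in ["1099div_2018", "w2", "account_statement_bank", "other"]:
    print(value, "->", parse_return_type(value))
```

Grouping by the form name lets downstream code route, for example, every W2 variant to the same handler regardless of year.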
Output examples
Processors | Output samples
---|---
Lending Document Splitter & Classifier |
Procurement Document Splitter & Classifier |
Code Samples
Splitters identify page boundaries, but don't actually split the input document for you. You can use Document AI Toolbox to physically split a PDF file by using the page boundaries. The following code samples print the page ranges without splitting the PDF:
Java
For more information, see the Document AI Java API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Node.js
For more information, see the Document AI Node.js API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Python
For more information, see the Document AI Python API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
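As a dependency-free illustration of the printing step, the sketch below turns the zero-based page indices of one sub-document into a human-readable, one-based page range string. It does not call the Document AI client library. The function names `format_page_ranges` and `_format_span` are hypothetical, and the input is assumed to be the list of pageRefs[].page values already read from the response.

```python
from typing import Iterable, List, Optional


def _format_span(first: int, last: int) -> str:
    """Format a zero-based inclusive span as one-based page numbers."""
    first, last = first + 1, last + 1
    return str(first) if first == last else f"{first}-{last}"


def format_page_ranges(zero_based_pages: Iterable[int]) -> str:
    """Collapse page indices into ranges, for example [0, 1, 4] -> "1-2, 5"."""
    pages = sorted(set(zero_based_pages))
    ranges: List[str] = []
    start: Optional[int] = None
    prev: Optional[int] = None
    for page in pages:
        if start is None:
            start = prev = page
        elif page == prev + 1:
            prev = page  # still inside the current consecutive run
        else:
            ranges.append(_format_span(start, prev))
            start = prev = page
    if start is not None:
        ranges.append(_format_span(start, prev))
    return ", ".join(ranges)


# Made-up page lists for two sub-documents.
print("w2_2020: pages", format_page_ranges([0, 1, 2]))
print("payslip: pages", format_page_ranges([3]))
```

Converting to one-based numbers only at print time keeps the stored indices aligned with document.pages[].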
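The following code sample uses Document AI Toolbox to split a PDF file by using the page boundaries from a processed Document.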
Python
For more information, see the Document AI Python API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.