Document splitters behavior

General splitter behavior

Splitter output describes each split detected in the input document, including a confidence score for each split. The Document AI API returns a Document JSON object, and each split is represented as an item in its entities field. Any additional information depends on the specific type of splitter.

The splitter is not designed to handle logical documents longer than 30 pages. A longer logical document (for example, a 40-page bank statement) may be split into two or more documents, each of which is classified separately.

Splitters identify page boundaries but do not split the input document for you. To physically split a PDF file, apply the page boundaries from the splitter output to the file yourself.
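
As a minimal sketch of that step, the following Python assumes the third-party pypdf package and a splitter response already loaded as a dictionary in the snake_case shape shown in this section. The helper names (page_groups, split_pdf) and the output file naming are illustrative and are not part of the Document AI API.

```python
from pathlib import Path


def page_groups(document: dict) -> list[list[int]]:
    """Return the zero-based page indices of each split, in entity order."""
    groups = []
    for entity in document.get("entities", []):
        refs = entity.get("page_anchor", {}).get("page_refs", [])
        # Assumption: a page_ref without a "page" key refers to page 0,
        # and page numbers may arrive as strings, so int() normalizes them.
        groups.append([int(ref.get("page", 0)) for ref in refs])
    return groups


def split_pdf(pdf_path: str, document: dict, out_dir: str) -> list[Path]:
    """Write one PDF per split and return the paths of the files created."""
    # Imported here so page_groups() stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    written = []
    for i, pages in enumerate(page_groups(document)):
        writer = PdfWriter()
        for page_index in pages:
            writer.add_page(reader.pages[page_index])
        target = out / f"split_{i}.pdf"  # hypothetical naming scheme
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

With a response saved to disk, a call might look like split_pdf("input.pdf", json.loads(Path("response.json").read_text()), "splits"), where both file names are placeholders.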

Output example (Document JSON object):

{
  "text": "page1 page2 page3",
  "entities": [
    {
      "type": "",
      "confidence": 0.9,
      "text_anchor": {
        "text_segments": {
          "start_index": 0,
          "end_index": 12
        }
      },
      "page_anchor": {
        "page_refs": [
          {
            "page": 0
          },
          {
            "page": 1
          }
        ]
      }
    },
    {
      "type": "",
      "confidence": 0.8,
      "text_anchor": {
        "text_segments": {
          "start_index": 12,
          "end_index": 18
        }
      },
      "page_anchor": {
        "page_refs": [
          {
            "page": 2
          }
        ]
      }
    }
  ]
}
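
To read this output programmatically, a sketch along the following lines may help. It assumes the snake_case field names shown above. The Split dataclass and the read_splits function are hypothetical names introduced for illustration, and the code accepts text_segments as either a single object (as in the example) or a list of objects.

```python
from dataclasses import dataclass


@dataclass
class Split:
    type: str
    confidence: float
    pages: list[int]
    text: str


def read_splits(document: dict) -> list[Split]:
    """Turn each entity of a splitter response into a Split record."""
    full_text = document.get("text", "")
    splits = []
    for entity in document.get("entities", []):
        segments = entity.get("text_anchor", {}).get("text_segments", [])
        if isinstance(segments, dict):  # single object, as in the example
            segments = [segments]
        # Each segment is a [start_index, end_index) slice of the full text.
        text = "".join(
            full_text[int(s.get("start_index", 0)):int(s.get("end_index", 0))]
            for s in segments
        )
        refs = entity.get("page_anchor", {}).get("page_refs", [])
        splits.append(Split(
            type=entity.get("type", ""),
            confidence=float(entity.get("confidence", 0.0)),
            pages=[int(r.get("page", 0)) for r in refs],
            text=text,
        ))
    return splits


# The example output from this section, reduced to the fields used here.
example = {
    "text": "page1 page2 page3",
    "entities": [
        {"type": "", "confidence": 0.9,
         "text_anchor": {"text_segments": {"start_index": 0, "end_index": 12}},
         "page_anchor": {"page_refs": [{"page": 0}, {"page": 1}]}},
        {"type": "", "confidence": 0.8,
         "text_anchor": {"text_segments": {"start_index": 12, "end_index": 18}},
         "page_anchor": {"page_refs": [{"page": 2}]}},
    ],
}

for split in read_splits(example):
    print(split.pages, split.confidence, repr(split.text))
# [0, 1] 0.9 'page1 page2 '
# [2] 0.8 'page3'
```

Here the first split covers pages 0 and 1 and the second covers page 2, which matches the page_anchor values in the example.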