Document AI Warehouse pipeline connector

What is this?

This is a bulk pipeline connector that collects documents from a Cloud Storage bucket, optionally processes them through one or more Document AI processors, and indexes their content (extracted text and any extracted data) in Document AI Warehouse. The connector uses OCR by default, but you can configure it to use other specialized or custom Document AI extractors. You can then use Document AI Warehouse for tasks such as search, document management workflows, or simply testing Document AI outputs.

This connector handles large volumes of documents and migrates production workloads reliably. It helps you shorten the time needed to onboard Document AI Warehouse customers for proofs of concept or production workloads.

Why use this pipeline connector?

Building and managing a reliable pipeline is complex: by default, Document AI has a low throughput of about 10 queries per second (qps) and may involve long-running operations (LROs), whereas Document AI Warehouse supports up to 500 qps. This connector is available at no cost* and manages this complexity, accelerating your time-to-value or time-to-live with Document AI Warehouse.
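
As a rough illustration of why the throughput gap matters, the following Python sketch estimates how long each stage would take at the default rates quoted above. The corpus size and the function name are hypothetical, and the sketch assumes one request per document at a sustained rate; actual quotas and batching behavior depend on your project and processor.

```python
def stage_duration_hours(num_documents: int, qps: float) -> float:
    """Return the hours needed to submit num_documents at a sustained qps.

    Assumes one request per document (an illustrative simplification).
    """
    if qps <= 0:
        raise ValueError("qps must be positive")
    return num_documents / qps / 3600


NUM_DOCUMENTS = 1_000_000  # hypothetical corpus size

# Rates taken from the text: ~10 qps for Document AI, up to 500 qps for Warehouse.
docai_hours = stage_duration_hours(NUM_DOCUMENTS, 10)
warehouse_hours = stage_duration_hours(NUM_DOCUMENTS, 500)

print(f"Document AI stage: {docai_hours:.1f} hours")
print(f"Document AI Warehouse stage: {warehouse_hours:.2f} hours")
```

Under these assumptions the Document AI stage dominates the total run time, which is why queuing and flow control sit in front of it.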

This connector uses other Google Cloud components such as Workflows, Cloud Tasks and Cloud Functions, and handles the following:

  • Schema mapping - It uses a Cloud Function that maps entities from the doc.proto file (Document AI output) to key-value pairs of a Document AI Warehouse MAP property called "DocaiEntities".
  • Queue management - It handles queuing and flow control of the ingest pipeline.
  • Reliable, fault-tolerant - It performs LRO (long-running operation) completion checks and applies retry logic. It also maintains an audit log of all the successful and failed ingests at each step.
  • Extensible - It supports Document AI OCR extraction by default, but can be extended to other Document AI specialized or custom Workbench parsers. If you extend it, you need to transform the Document AI output (doc.proto) format to Document AI Warehouse schema properties.
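
The schema mapping step described in the list above can be sketched roughly as follows. This is a minimal Python illustration, not the connector's actual Cloud Function: it assumes the Document AI output is available as a JSON-style dictionary whose entities carry "type" and "mentionText" fields, and the function name and the policy of joining repeated entity types with "; " are hypothetical choices made for the example.

```python
from typing import Any


def entities_to_map_property(document: dict[str, Any]) -> dict[str, str]:
    """Flatten Document AI entities into key-value pairs for a MAP property.

    Assumption: repeated entity types are joined with "; ". The connector's
    real handling of duplicates is not described in this document.
    """
    pairs: dict[str, str] = {}
    for entity in document.get("entities", []):
        key = entity.get("type", "")
        value = entity.get("mentionText", "")
        if not key:
            continue  # skip entities without a type; they cannot become a key
        if key in pairs:
            pairs[key] = f"{pairs[key]}; {value}"
        else:
            pairs[key] = value
    return pairs


# Hypothetical Document AI output, reduced to the fields this sketch reads.
sample_document = {
    "entities": [
        {"type": "invoice_id", "mentionText": "INV-001"},
        {"type": "line_item", "mentionText": "Widget"},
        {"type": "line_item", "mentionText": "Gadget"},
    ]
}

# Key-value pairs that would populate the "DocaiEntities" MAP property.
docai_entities = entities_to_map_property(sample_document)
print(docai_entities)
```

If you extend the connector to a specialized or custom parser, a transformation along these lines is where you would decide which entity types become Document AI Warehouse schema properties.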

Next step