Custom extractor overview

Custom extractor extracts entities from documents of a particular type. For example, it can extract the items in a menu or the name and contact information from a resume.
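To make "extracting entities" concrete, the sketch below groups entities from a processed resume by type. It assumes the processor output is available as a Document in JSON form, where each item in `entities` carries `type`, `mentionText`, and `confidence`. The sample values, the entity type names (`name`, `email`, `phone`), and the helper `group_entities` are hypothetical and only illustrate the shape of the result.

```python
from collections import defaultdict

# Hypothetical, hand-written sample shaped like a processed Document in JSON form.
# The entity types and values are made up for illustration.
sample_document = {
    "text": "Jane Doe\njane.doe@example.com\n+1 555 0100\n",
    "entities": [
        {"type": "name", "mentionText": "Jane Doe", "confidence": 0.98},
        {"type": "email", "mentionText": "jane.doe@example.com", "confidence": 0.95},
        {"type": "phone", "mentionText": "+1 555 0100", "confidence": 0.62},
    ],
}


def group_entities(document, min_confidence=0.0):
    """Group entity mention text by entity type, skipping low-confidence entities."""
    grouped = defaultdict(list)
    for entity in document.get("entities", []):
        if entity.get("confidence", 0.0) >= min_confidence:
            grouped[entity["type"]].append(entity["mentionText"])
    return dict(grouped)


# Keep only entities the model is reasonably sure about (threshold is arbitrary).
resume_fields = group_entities(sample_document, min_confidence=0.8)
print(resume_fields)
```

With the sample above, the low-confidence phone entity is dropped and the name and email remain. The threshold is a choice for the caller, not a recommended value.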

Overview

The goal of the custom extractor is to enable Document AI users to build custom entity extraction solutions for new document types for which no pre-trained processors are available. Custom extractor includes a combination of layout-aware deep learning models (for generative AI and custom models) and template-based models.

Which training method should I use?

Custom extractor supports a wide range of use cases through three training methods.

| Training method | Document examples | Document layout variation | Free form text or paragraphs | Number of training documents for production-ready quality, depending on variability |
| --- | --- | --- | --- | --- |
| Fine tune and foundation model (generative AI) | Contract, terms of service, invoice, bank statement, bill of lading, payslips | High to low (preferred) | High | Medium: 0-50+ documents |
| Custom model | Similar forms with layout variation across years or vendors (for example, W9) | Low to medium | Low | High: 10-100+ documents |
| Template | Tax forms with a fixed layout (for example, Forms 941 and 709) | None | Low | Low (3 documents) |

Because foundation models typically require fewer training documents, they're recommended as the first option for all variable layouts.
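The guidance above can be read as a small lookup. The following is a minimal sketch, not part of any Document AI API: the `TrainingMethod` class, the `METHODS` tuple, and the `candidate_methods` helper are hypothetical names. They encode only the layout-variation column from this page and list the foundation model first for variable layouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingMethod:
    name: str
    layout_variation: tuple  # layout-variation levels the method is suited to
    free_form_text: str      # how much free-form text the method handles
    training_documents: str  # documents needed for production-ready quality


# Rows transcribed from the comparison on this page; the foundation model is
# listed first because it is the recommended first option for variable layouts.
METHODS = (
    TrainingMethod("Fine tune and foundation model (generative AI)",
                   ("low", "medium", "high"), "high", "0-50+"),
    TrainingMethod("Custom model", ("low", "medium"), "low", "10-100+"),
    TrainingMethod("Template", ("none",), "low", "3"),
)


def candidate_methods(layout_variation):
    """Return method names suited to a layout-variation level, in preference order."""
    level = layout_variation.lower()
    if level not in {"none", "low", "medium", "high"}:
        raise ValueError(f"unknown layout variation: {layout_variation!r}")
    return [m.name for m in METHODS if level in m.layout_variation]


print(candidate_methods("none"))
print(candidate_methods("medium"))
```

A fixed layout yields only the template method, while a low or medium variation yields the foundation model followed by the custom model. The sketch ignores the free-form text column when filtering; a real decision would weigh it as well.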