Custom-based extraction

Custom model training and extraction lets you to build your own model designed specifically for your documents without the use of generative AI. It's ideal if you don't want to use generative AI and want to control all aspects of the trained model.

Dataset configuration

A document dataset is required to train, up-train, or evaluate a processor version. Document AI processors learn from examples, just like humans. Dataset fuels processor stability in terms of performance.

Train dataset

To improve the model and its accuracy, train a dataset on your documents. The model is made up of documents with ground-truth.
  • For fine-tuning, you need a minimum of 10 documents to train a new model.
  • For a few-shot, 5 documents is recommended.
  • For zero-shot, only a schema is required.

Test dataset

The test dataset is what the model uses to generate an F1 score (accuracy). It is made up of documents with ground-truth. To see how often the model is right, the ground truth is used to compare the model's predictions (extracted fields from the model) with the correct answers. The test dataset should have at least 10 documents for fine-tuning.

Before getting started

If not done so already, enable billing and the Document AI API.

Build and evaluate a custom model

Begin by building and then evaluating a custom processor.

  1. Create a processor and define fields you want to extract, which is important because it impacts extraction quality.

  2. Set dataset location: Select the default option folder Google-managed. This might be done automatically shortly after creating the processor.

  3. Navigate to the Build tab and select Import Documents with auto-labeling enabled (see Auto-labeling with the foundation model). You need a minimum of 10 documents in the training set and 10 in the testing set to train a custom model.

  4. Train model:

    1. Select Train new version and name the processor version.
    2. Go to Show advanced options and select the Model based option.

    custom-based-extraction-1

  5. Evaluation:

    • Go to Evaluate & test, select the version you just trained, then select View full evaluation.

    custom-based-extraction-2

    • You now see metrics such as f1, precision, and recall for the entire document and each field.
    • Decide if performance meets your production goals. If it does not, then reevaluate training and testing sets, typically adding documents to the training test set that don't parse well.
  6. Set a new version as the default.

    1. Navigate to Manage versions.
    2. Navigate to the menu and then select Set as default.

    custom-based-extraction-3

Your model is now deployed and documents sent to this processor are now using your custom version. You want to evaluate the model's performance to check if it requires further training.

Evaluation reference

The evaluation engine can do both exact match or fuzzy matching. For an exact match, the extracted value must exactly match the ground truth or is counted as a miss.

Fuzzy matching extractions that had slight differences such as capitalization differences still count as a match. This can be changed at the Evaluation screen.

custom-based-extraction-4

Auto-labeling with the foundation model

The foundation model can accurately extract fields for a variety of document types, but you can also provide additional training data to improve the accuracy of the model for specific document structures.

Document AI uses the label names you define and previous annotations to label documents at scale with auto-labeling.

  1. When you've created a custom processor, go to the Get Started tab.
  2. Select Create new field.

custom-based-extraction-5

  1. Navigate to the Build tab, then select Import documents.

    custom-based-extraction-6

  2. Select the path of the documents and which set the documents should be imported into. Check the auto-labeling box, and select the foundation model.

  3. In the Build tab, select Manage Dataset. You should see your imported documents. Select one of your documents.

    custom-based-extraction-7

You now see the predictions from the model highlighted in purple.

  1. Review each label predicted by the model and ensure is correct. If there are missing fields, add those as well.

custom-based-extraction-8

  1. After the document has been reviewed, select Mark as labeled. The document is now ready to be used by the model. Make sure the document is in either the Testing or Training set.