Custom extractor mechanisms

You can create a custom extractor that is specifically suited to your documents, and train and evaluate it with your data. The resulting processor identifies and extracts entities from your documents. You can then use the trained processor on additional documents.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Document AI and Cloud Storage APIs.

    Enable the APIs

Create a processor

  1. In the Google Cloud console, in the Document AI section, go to the Workbench page.

    Workbench

  2. For custom extractor, select Create processor.

  3. In the Create processor menu, enter a name for your processor, such as my-custom-document-extractor.

  4. Select the region closest to you.

  5. Optional: Open Advanced options.

    • You have the option to let Google create a Cloud Storage bucket for you, or you can create your own. For this tutorial, select Google-managed storage.

    • You also have the option to use Google-managed or Customer-managed encryption keys (CMEK). For this tutorial, select Google-managed encryption key.

  6. Select Create to create your processor.
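
The console steps above can also be approximated with the Document AI client library. The following is a minimal sketch, assuming the `google-cloud-documentai` Python package is installed and that `CUSTOM_EXTRACTION_PROCESSOR` is the type string for a custom extractor. The helper names and all argument values are hypothetical, and the sketch does not set the storage or encryption options from step 5.

```python
def location_parent(project_id: str, location: str) -> str:
    """Build the parent resource name under which processors are created."""
    return f"projects/{project_id}/locations/{location}"


def regional_endpoint(location: str) -> str:
    """Build the regional API endpoint, for example for 'us' or 'eu'."""
    return f"{location}-documentai.googleapis.com"


def create_custom_extractor(project_id: str, location: str, display_name: str):
    """Create a custom extractor processor and return the API response."""
    # Imported here so the helpers above work without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=regional_endpoint(location))
    )
    return client.create_processor(
        parent=location_parent(project_id, location),
        processor=documentai.Processor(
            display_name=display_name,
            # Assumption: type string for the custom extractor.
            type_="CUSTOM_EXTRACTION_PROCESSOR",
        ),
    )
```

A call such as `create_custom_extractor("my-project", "us", "my-custom-document-extractor")` would need valid credentials and a real project ID.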

Define processor fields

You are now on the Processor overview page of the processor you just created.

You can specify the fields you want the processor to extract and begin labeling documents.

  1. Select the Get started tab. The fields menu appears.

  2. Select Create new field.

  3. Enter the name for the field. Select the Data type and the Occurrence.

  4. Select Create. Refer to Define processor schema for detailed instructions on creating and editing a schema.

  5. Create each of the following labels for the processor schema.

    Name                                 Data type  Occurrence
    control_number                       Number     Optional multiple
    employees_social_security_number     Number     Required multiple
    employer_identification_number       Number     Required multiple
    employers_name_address_and_zip_code  Address    Required multiple
    federal_income_tax_withheld          Money      Required multiple
    social_security_tax_withheld         Money      Required multiple
    social_security_wages                Money      Required multiple
    wages_tips_other_compensation        Money      Required multiple

    You can also create and use other types of labels in your processor schema, such as checkboxes and tabular entities. For example, the W-2 forms contain statutory employee, retirement plan, and third party sick pay check boxes that you could also add to the schema.
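
If you want to keep a copy of the schema alongside your code, for example to check labeled data against it later, the fields above can be recorded as plain data. This is a local sketch only: the `SchemaField` class and `required_field_names` helper are hypothetical and are not the format that the Document AI API uses for schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaField:
    """One row of the schema: label name, data type, and occurrence."""
    name: str
    data_type: str
    occurrence: str


# The eight W-2 labels defined in this tutorial.
W2_FIELDS = [
    SchemaField("control_number", "Number", "Optional multiple"),
    SchemaField("employees_social_security_number", "Number", "Required multiple"),
    SchemaField("employer_identification_number", "Number", "Required multiple"),
    SchemaField("employers_name_address_and_zip_code", "Address", "Required multiple"),
    SchemaField("federal_income_tax_withheld", "Money", "Required multiple"),
    SchemaField("social_security_tax_withheld", "Money", "Required multiple"),
    SchemaField("social_security_wages", "Money", "Required multiple"),
    SchemaField("wages_tips_other_compensation", "Money", "Required multiple"),
]


def required_field_names(fields):
    """Return the names of fields whose occurrence starts with 'Required'."""
    return [f.name for f in fields if f.occurrence.startswith("Required")]
```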

Upload a sample document

Test with a sample document.

  1. Select Upload sample document.

  2. In the sidebar, select Import documents from Cloud Storage.

  3. For this example, enter the following path in Source path. It points directly to a single document.

    cloud-samples-data/documentai/Custom/W2/PDF/W2_XL_input_clean_2950.pdf
    
  4. Select Import.

You are redirected to the labeling console.

Label a document

The process of selecting text in a document and applying labels is known as annotation or labeling.

  1. When you're at the labeling console, notice that many of the labels are already populated. This is because the default custom extractor model type is a foundation model, which can perform zero-shot prediction, that is, prediction without any training on your documents.

  2. To use the suggested labels, hold the pointer over each label in the side panel, and select the check mark to confirm the label is correct. Don't edit the text, even if the OCR reads the text incorrectly.

  3. In this example, the values at the bottom of the document were not identified automatically, so you need to label them manually.

  4. Use the icons in the toolbar above the document to label. Use the bounding box tool by default, or the Select text tool for multi-line values, to select the content and apply the label.

  5. After you select text, a drop-down menu lists all defined fields (entities). Select the one that applies. In this example, the value of wages_tips_other_compensation is selected with the bounding box tool, and that label is applied.

  6. Review the detected text values to ensure that they reflect the correct location of text for each field. When you finish, each field that appears on the W-2 document should have a confirmed label.

  7. If needed, you can select Create new field to add a new field to the schema from this page.

  8. Select Mark as labeled when you have finished annotating the document. You are redirected to the Get started tab.

Build processor version using foundation model

After labeling a single document, you can create a processor version using the pretrained foundation model to extract entities.

  1. Select the Build tab.

  2. Under Call foundation model, select Create new version.

  3. Enter a name for your processor version, such as w2-foundation-model.

  4. Select Create version. It takes a few minutes to create.

  5. Optional: select the Deploy & use tab. On this page, you can view the available processor versions and the deployment status of the new version.

Use generative AI to auto-label documents

The foundation model can accurately extract fields for a variety of document types, but you can also provide additional training data to improve the accuracy of the model for specific document structures.

Custom extractor uses the label names you define and previous annotations to make it quicker and easier to label documents at scale with auto-labeling.

  1. Go to the Build page.

  2. Select Import documents.

  3. In the sidebar, select Import documents from Cloud Storage.

  4. In Source path, enter the path of the bucket that contains your documents.

  5. From the Data split list, select Auto-split. This automatically splits the documents to have 80% in the training set and 20% in the test set.

  6. In the Auto-labeling section, select the Import with auto-labeling checkbox.

  7. Select the foundation model processor version to label the documents.

  8. Select Import and wait for the documents to import. You can leave this page and return later.

  9. You must verify the auto-labeled documents before you can use them for training or testing. Select Start labeling to view the auto-labeled documents.

  10. To use the suggested labels, hold the pointer over each annotation, and select the check mark to confirm the label is correct. For training purposes, don't edit the values if they don't match the document text. Only change the bounding box if the wrong text was selected.

  11. Select Mark as labeled when you have finished annotating the document.

  12. Repeat for each auto-labeled document.
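
To make the Auto-split ratio in step 5 concrete, the following sketch divides a list of documents 80/20. It illustrates the proportions only. How Document AI actually assigns documents to each set is not described here, so the shuffling, the seed, and the `auto_split` name are assumptions.

```python
import random


def auto_split(uris, train_fraction=0.8, seed=0):
    """Shuffle the documents and split them into (training, test) lists."""
    shuffled = list(uris)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 10 documents, this yields 8 training documents and 2 test documents.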

Import prelabeled training documents

  1. Go to the Build page.

  2. Select Import documents.

  3. In the sidebar, select Import documents from Cloud Storage.

  4. Enter your path in Source path containing your documents. This bucket should contain prelabeled documents in the Document JSON format.

  5. From the Data split list, select Auto-split. This automatically splits the documents to have 80% in the training set, and 20% in the test set. Leave Import with auto-labeling unchecked.

  6. Select Import. Import takes several minutes.
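
Before importing prelabeled files, it can help to check which labels they contain. The sketch below counts entity types in one Document JSON string, assuming the file carries a top-level `entities` list whose items have `type` and `mentionText` keys. The sample content and the `count_entity_types` helper are made up for illustration.

```python
import json
from collections import Counter

# Illustrative sample only; real Document JSON files carry many more fields.
SAMPLE_DOCUMENT_JSON = """
{
  "text": "Wages 52000.00",
  "entities": [
    {"type": "wages_tips_other_compensation", "mentionText": "52000.00"}
  ]
}
"""


def count_entity_types(document_json: str) -> Counter:
    """Count how many entities of each type a Document JSON string contains."""
    document = json.loads(document_json)
    return Counter(entity["type"] for entity in document.get("entities", []))
```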

Optional: View and manage dataset

  1. From the Build page, you can access the Manage dataset console to view and edit all documents and labels in the dataset.

Train custom model based processor

Training might take several hours. Make sure you have set up the processor with the appropriate data and labels before you begin training.

  1. Under Train a custom model, select Create new version. For information about the dataset requirements, select View full requirements. A custom model-based processor is not a generative AI model, and it requires at least 10 training instances and 10 test instances of each field.

  2. In the Version name field, enter a name for this processor version, such as w2-custom-model.

  3. Optional: select View label stats to find information about the document labels. These statistics can help you determine how well each label is covered. Select Close to return to the training setup.

  4. Under Model training method, select Model based.

  5. Select Start training. Training takes a few hours. You can close this page and come back later.

  6. Optional: select the Deploy & use tab. On this page, you can view the available processor versions and the training status of the new version.
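
The requirement in step 1, at least 10 training instances and 10 test instances of each field, can be checked before you start a training run that takes hours. The following sketch assumes you already have per-field label counts, for example copied from View label stats. The `fields_below_minimum` helper is hypothetical.

```python
def fields_below_minimum(train_counts, test_counts, field_names, minimum=10):
    """Return {field: (train, test)} for fields that miss the minimum in either set."""
    short = {}
    for name in field_names:
        train = train_counts.get(name, 0)
        test = test_counts.get(name, 0)
        if train < minimum or test < minimum:
            short[name] = (train, test)
    return short
```

An empty result means every listed field meets the minimum in both sets.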

Deploy the processor version

  1. After training is complete, select the Deploy & use tab.

  2. Select the checkbox on the left of the version you want to deploy, and select Deploy.

  3. Select Deploy from the dialog window. Deployment takes a few minutes.

  4. When the version is deployed, you can set it as the Default version, or you can provide the version ID when processing documents with the API.
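
Deployment and the default version can also be handled from code. The following is a rough sketch, assuming the `google-cloud-documentai` Python package and valid credentials. The function names and all IDs are placeholders, and both calls are treated as long-running operations that you wait on.

```python
def processor_version_name(project_id, location, processor_id, version_id):
    """Build the full resource name of a processor version."""
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


def deploy_and_set_default(project_id, location, processor_id, version_id):
    """Deploy a processor version, then make it the processor's default."""
    # Imported here so the helper above works without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    version = processor_version_name(project_id, location, processor_id, version_id)

    # result() blocks until the deployment finishes.
    client.deploy_processor_version(name=version).result()

    client.set_default_processor_version(
        request=documentai.SetDefaultProcessorVersionRequest(
            processor=f"projects/{project_id}/locations/{location}/processors/{processor_id}",
            default_processor_version=version,
        )
    ).result()
```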

Evaluate and test the processor

  1. Select the Evaluate tab to test the processor version. On this page, you can view evaluation metrics, including the F1 score, precision, and recall, both for the full document and for individual labels. For more information about evaluation and statistics, refer to Evaluate processor.

  2. Select the Version selector and select the version using the foundation model.

  3. Download a document that has not been involved in previous training or testing so that you can use it to evaluate the processor version. If using your own data, you would use a document set aside for this purpose.

    Download PDF

  4. Select Upload Test Document and select the document you just downloaded. The Custom Document Extractor analysis page opens. The output shows how well the processor extracted entities from the document.

  5. Test the document again using the version with a custom trained model.
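
The metrics on the Evaluate tab follow the standard definitions of precision, recall, and F1 score. As a worked illustration, the sketch below computes them from counts of correct, spurious, and missed entities. How the console matches predicted entities against labels is not covered here, so the counts are assumed inputs.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall, and F1 score from entity counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with 8 correct entities, 2 spurious ones, and 2 missed ones, all three metrics come out to 0.8.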

Use the processor

You have successfully created and trained a custom extractor processor.

You can manage your custom trained processor versions just like any other processor version. For more information, refer to Managing processor versions.

You can also send documents to the processor through the Document AI API.
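
The following is a minimal sketch of an online processing request, assuming the `google-cloud-documentai` Python package is installed and credentials are configured. The function names, the file path, and all IDs are placeholders. If you omit `version_id`, the request targets the processor itself, which uses its default version.

```python
def resource_name(project_id, location, processor_id, version_id=None):
    """Build the processor name, or the processor version name if one is given."""
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    if version_id:
        name += f"/processorVersions/{version_id}"
    return name


def process_document(project_id, location, processor_id, file_path,
                     mime_type="application/pdf", version_id=None):
    """Send one local file to the processor and return the extracted entities."""
    # Imported here so the helper above works without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    with open(file_path, "rb") as f:
        raw_document = documentai.RawDocument(content=f.read(), mime_type=mime_type)

    result = client.process_document(
        request=documentai.ProcessRequest(
            name=resource_name(project_id, location, processor_id, version_id),
            raw_document=raw_document,
        )
    )
    # Each entity carries the label, the matched text, and a confidence score.
    return [
        (entity.type_, entity.mention_text, entity.confidence)
        for entity in result.document.entities
    ]
```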

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, use the Google Cloud console to delete your processor and project if you don't need them.

If you created a new project to learn about Document AI and you no longer need the project, delete the project.

If you used an existing Google Cloud project, delete the resources you created to avoid incurring charges to your account:

  1. In the Google Cloud console navigation menu, select Document AI and select My Processors.

  2. Select More actions in the same row as the processor you want to delete.

  3. Select Delete processor, enter the processor name, then select Delete again to confirm.
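
If you prefer to clean up from code, the processor can be deleted with the client library. This is a sketch only, assuming the `google-cloud-documentai` Python package and valid credentials. The function names and IDs are placeholders, and deletion cannot be undone.

```python
def processor_name(project_id, location, processor_id):
    """Build the full resource name of a processor."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def delete_processor(project_id, location, processor_id):
    """Delete a processor and wait for the operation to finish."""
    # Imported here so the helper above works without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    client.delete_processor(
        name=processor_name(project_id, location, processor_id)
    ).result()
```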

What's next

For details, see Guides.