Custom extractor mechanisms

You can create custom extractors that are specifically suited to your documents, and trained and evaluated with your data. This processor identifies and extracts entities from your documents. You can then use this trained processor on additional documents.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

Enable the Document AI and Cloud Storage APIs.

Enable the APIs

Create a processor

In the Google Cloud console, in the Document AI section, go to the Workbench page.

Workbench
For custom extractor, select Create processor.
In the Create processor menu, enter a name for your processor, such as my-custom-document-extractor.
Select the region closest to you.
Optional: Open Advanced options.
- You have the option to let Google create a Cloud Storage bucket for you, or you can create your own. For this tutorial, select Google-managed storage.
- You also have the option to use Google-managed or Customer-managed encryption keys (CMEK). For this tutorial, select Google-managed encryption key.
Select Create to create your processor.

Define processor fields

You are now on the Processor overview page of the processor you just created.

You can specify the fields you want the processor to extract and begin labeling documents.

Select the Get started tab. The fields menu appears.
Select Create new field.
Enter the name for the field. Select the Data type and the Occurrence. Give the field a descriptive, distinct Description. The description lets you provide additional context, insights, and prior knowledge for each entity to improve extraction accuracy and performance.

Select Create. Refer to Define processor schema for detailed instructions on creating and editing a schema.

Create each of the following labels for the processor schema.

Name	Data Type	Occurrence
`control_number`	Number	Optional multiple
`employees_social_security_number`	Number	Required multiple
`employer_identification_number`	Number	Required multiple
`employers_name_address_and_zip_code`	Address	Required multiple
`federal_income_tax_withheld`	Money	Required multiple
`social_security_tax_withheld`	Money	Required multiple
`social_security_wages`	Money	Required multiple
`wages_tips_other_compensation`	Money	Required multiple

You can also create and use other types of labels in your processor schema, such as checkboxes and tabular entities. For example, the W-2 forms contain statutory employee, retirement plan, and third party sick pay check boxes that you could also add to the schema.
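
If you want to keep the schema under version control or review it before entering it in the console, you can mirror the table above in a small local data structure. The following is a minimal sketch, not part of the Document AI API: the `Field` class, the `validate_schema` helper, and the set of occurrence values are assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed set of occurrence choices; only the two used in the table above
# are confirmed by this tutorial.
OCCURRENCES = {
    "Optional once",
    "Optional multiple",
    "Required once",
    "Required multiple",
}


@dataclass(frozen=True)
class Field:
    """Local stand-in for one row of the processor schema table."""
    name: str
    data_type: str
    occurrence: str

    @property
    def required(self) -> bool:
        return self.occurrence.startswith("Required")


W2_SCHEMA = [
    Field("control_number", "Number", "Optional multiple"),
    Field("employees_social_security_number", "Number", "Required multiple"),
    Field("employer_identification_number", "Number", "Required multiple"),
    Field("employers_name_address_and_zip_code", "Address", "Required multiple"),
    Field("federal_income_tax_withheld", "Money", "Required multiple"),
    Field("social_security_tax_withheld", "Money", "Required multiple"),
    Field("social_security_wages", "Money", "Required multiple"),
    Field("wages_tips_other_compensation", "Money", "Required multiple"),
]


def validate_schema(fields):
    """Return a list of problems: duplicate names or unknown occurrences."""
    problems = []
    seen = set()
    for field in fields:
        if field.name in seen:
            problems.append(f"duplicate field name: {field.name}")
        seen.add(field.name)
        if field.occurrence not in OCCURRENCES:
            problems.append(f"unknown occurrence for {field.name}: {field.occurrence}")
    return problems


if __name__ == "__main__":
    print(validate_schema(W2_SCHEMA))
    print([f.name for f in W2_SCHEMA if not f.required])
```

A check like this catches duplicate or misspelled field names before you create a processor version, after which fields can no longer be changed or deleted.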


Upload a sample document

Test with a sample document.

Select Upload sample document.
In the sidebar, select Import documents from Cloud Storage.
For this example, enter the following path in Source path. It links directly to one document.
```
cloud-samples-data/documentai/Custom/W2/PDF/W2_XL_input_clean_2950.pdf
```
Select Import.

You are redirected to the labeling console.

Label a document

The process of selecting text in a document and applying labels is known as annotation or labeling.

When you're at the labeling console, notice that many of the labels are already populated. This is because the default custom extractor model type is a foundation model, which can perform zero-shot prediction, that is, without training.

Note: Your results might look slightly different than the sample image.
To use the suggested labels, hold the pointer over each label in the side panel, and select the check mark to confirm the label is correct. Don't edit the text, even if the OCR reads the text incorrectly.
In this example, the values at the bottom of the document were not identified automatically, so you need to label them manually.
Use the icons in the toolbar above the document to label. Use the bounding box tool by default, or the Select text tool for multi-line values, to select the content and apply the label.

Note: The select text tool does not work for all text values, so use the bounding box if appropriate. You can also select non-text fields such as checkboxes using the bounding box tool.
After you select the text, a drop-down menu appears with all defined fields (entities). Select the field to apply. In this example, the value of wages_tips_other_compensation is selected with the bounding box tool, and that label is applied.
Review the detected text values to ensure that they reflect the correct location of text for each field. The labeled W2 document should look like this when complete:

Note: If you are labeling for production review purposes, edit any incorrect OCR text.
If needed, you can select Create new field to add a new field to the schema from this page.
Select Mark as labeled when you have finished annotating the document. You are redirected to the Get started tab.

Build processor version using foundation model

After labeling a single document, you can create a processor version using the pretrained foundation model to extract entities.

Select the Build tab.
Under Call foundation model, select Create new version.
Enter a name for your processor version, such as w2-foundation-model.
Select Create version. It takes a few minutes to create.

Note: After you create a processor version, you can't change or delete fields you have created. You can disable them on the fields page if you no longer need them.
Optional: select the Deploy & use tab. On this page, you can view the available processor versions and the deployment status of the new version.

Use generative AI to auto-label documents

The foundation model can accurately extract fields for a variety of document types, but you can also provide additional training data to improve the accuracy of the model for specific document structures.

Custom extractor uses the label names you define and previous annotations to make it quicker and easier to label documents at scale with auto-labeling.

Go to the Build page.
Select Import documents.
In the sidebar, select Import documents from Google Cloud Storage.
In Source path, enter the name of the bucket that contains your documents.
From the Data split list, select Auto-split. This automatically splits the documents to have 80% in the training set and 20% in the test set.
In the Auto-labeling section, select the Import with auto-labeling checkbox.
Select the foundation model processor version to label the documents.
Select Import and wait for the documents to import. You can leave this page and return later.
You must verify the auto-labeled documents before you can use them for training or testing. Select Start labeling to view the auto-labeled documents.
To use the suggested labels, hold the pointer over each annotation, and select the check mark to confirm the label is correct. For training purposes, don't edit the values if they don't match the document text. Only change the bounding box if the wrong text was selected.
Select Mark as labeled when you have finished annotating the document.
Repeat for each auto-labeled document.
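
To make the Auto-split ratio concrete, the following sketch shows what an 80% training and 20% test split looks like for a list of document names. It is an illustration only: this page does not describe how Document AI assigns individual documents to each set, and the `auto_split` function, its `seed` parameter, and the file names are hypothetical.

```python
import random


def auto_split(doc_names, train_fraction=0.8, seed=0):
    """Shuffle document names and split them into (training, test) lists.

    Illustrative only; not the service's actual assignment logic.
    """
    docs = sorted(doc_names)              # stable starting order
    random.Random(seed).shuffle(docs)     # reproducible shuffle
    cut = round(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]


if __name__ == "__main__":
    names = [f"w2_{i:02d}.pdf" for i in range(10)]  # hypothetical file names
    training, test = auto_split(names)
    print(len(training), "training documents,", len(test), "test documents")
```

With 10 documents, the split yields 8 training documents and 2 test documents.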

Import prelabeled training documents

Go to the Build page.
Select Import documents.
In the sidebar, select Import documents from Cloud Storage.
Enter your path in Source path containing your documents. This bucket should contain prelabeled documents in the Document JSON format.
From the Data split list, select Auto-split. This automatically splits the documents to have 80% in the training set, and 20% in the test set. Leave Import with auto-labeling unchecked.
Select Import. Import takes several minutes.
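
Before importing prelabeled files, it can help to confirm which labels each file contains. The sketch below counts entity types in a Document JSON string using only the standard library. The sample JSON is a hand-written minimal fragment that assumes the `entities`, `type`, and `mentionText` keys of the Document JSON format; real exports contain many more fields, and the `label_counts` helper is hypothetical.

```python
import json
from collections import Counter


def label_counts(document_json: str) -> Counter:
    """Count how many times each entity type appears in one Document JSON."""
    doc = json.loads(document_json)
    return Counter(entity["type"] for entity in doc.get("entities", []))


# Hand-written fragment for illustration; values are made up.
SAMPLE = """
{
  "text": "...",
  "entities": [
    {"type": "control_number", "mentionText": "0001"},
    {"type": "social_security_wages", "mentionText": "1000.00"},
    {"type": "social_security_wages", "mentionText": "2000.00"}
  ]
}
"""

if __name__ == "__main__":
    for label, count in sorted(label_counts(SAMPLE).items()):
        print(label, count)
```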

Optional: View and manage dataset

From the Build page, you can access the Manage dataset console to view and edit all documents and labels in the dataset.

Train custom model based processor

Training might take several hours. Make sure you have set up the processor with the appropriate data and labels before you begin training.

Under Train a custom model, select Create new version. For information about the dataset requirements, select View full requirements. A custom model based processor is not a generative AI model, and it requires at least 10 training instances and 10 test instances of each field.
In the Version name field, enter a name for this processor version, such as w2-custom-model.
Optional: select View label stats to find information about the document labels. These statistics can help you determine your label coverage. Select Close to return to the training setup.
Under Model training method, select Model based.
Select Start training. Training takes a few hours. You can close this page and come back later.
Optional: select the Deploy & use tab. On this page, you can view the available processor versions and the training status of the new version.
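
The requirement of at least 10 training instances and 10 test instances of each field can be checked locally before you start a training run that takes hours. The following is a minimal sketch under the assumption that you already have, for each document, a list of the field names labeled in it. The `fields_below_minimum` helper and its input shape are hypothetical; View label stats in the console remains the authoritative view.

```python
from collections import Counter

MIN_INSTANCES = 10  # per field, per split, as stated in the requirements above


def count_labels(documents):
    """Total label instances per field across a list of per-document label lists."""
    counts = Counter()
    for labels in documents:
        counts.update(labels)
    return counts


def fields_below_minimum(schema_fields, train_docs, test_docs, minimum=MIN_INSTANCES):
    """Return {field: (train_count, test_count)} for fields that fall short."""
    train = count_labels(train_docs)
    test = count_labels(test_docs)
    return {
        field: (train[field], test[field])
        for field in schema_fields
        if train[field] < minimum or test[field] < minimum
    }


if __name__ == "__main__":
    fields = ["control_number", "social_security_wages"]
    train_docs = [["control_number", "social_security_wages"]] * 10
    test_docs = [["social_security_wages"]] * 10
    # control_number has no test instances, so it is reported as short.
    print(fields_below_minimum(fields, train_docs, test_docs))
```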

Deploy the processor version

After training is complete, select the Deploy & use tab.
Select the checkbox on the left of the version you want to deploy, and select Deploy.
Select Deploy from the dialog window. Deployment takes a few minutes.
When the version is deployed, you can set it as the Default version, or you can provide the version ID when processing documents with the API.
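
The difference between using the default version and providing a version ID shows up in the resource name you send to the API. The sketch below builds both forms of the name. The helper function and the example IDs are hypothetical placeholders.

```python
def processor_resource_name(project_id, location, processor_id, version_id=None):
    """Build a Document AI resource name.

    Without version_id the name targets the processor, which uses its
    default version. With version_id it targets that specific version.
    """
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    if version_id:
        name += f"/processorVersions/{version_id}"
    return name


if __name__ == "__main__":
    # Placeholder identifiers; replace with values from the Deploy & use tab.
    print(processor_resource_name("my-project", "us", "abc123"))
    print(processor_resource_name("my-project", "us", "abc123", "def456"))
```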

Evaluate and test the processor

Select the Evaluate tab to test the processor version. On this page, you can view evaluation metrics, including the F1 score, precision, and recall, for the full document and for individual labels. For more information about evaluation and statistics, refer to Evaluate processor.
Select the Version selector and select the version using the foundation model.
Download a document that has not been involved in previous training or testing so that you can use it to evaluate the processor version. If using your own data, you would use a document set aside for this purpose.

Download PDF
Select Upload Test Document and select the document you just downloaded. The Custom Document Extractor analysis page opens. The screen output demonstrates how well the document was extracted.
Test the document again using the version with a custom trained model.
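
When you compare the two versions, it helps to recall how the three metrics relate. The sketch below uses the standard definitions of precision, recall, and F1 from counts of true positives, false positives, and false negatives. How the Evaluate page decides whether a predicted entity counts as a match is not covered on this page, so the example counts are made up.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision, recall, and F1 score from raw counts.

    Returns 0.0 for a metric whose denominator would be zero.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts for one label: 8 correct, 2 spurious, 2 missed.
    precision, recall, f1 = precision_recall_f1(8, 2, 2)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A label with high precision but low recall is extracted correctly when found but is often missed, which usually suggests adding more labeled examples of that field.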

Use the processor

You have successfully created and trained a custom extractor processor.

You can manage your custom trained processor versions just like any other processor version. For more information, refer to Managing processor versions.

To use the Document AI API:

Follow the code samples in Send a processing request to use online or batch processing.
- Refer to Quotas and limits for the number of pages supported for online and batch processing.
Follow the custom extractor code sample in Handle the processing response to get the extracted entities from the processor.
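
As a rough illustration of those two steps, the following sketch sends one local file for online processing and then groups the returned entities by field name. It assumes the `google-cloud-documentai` Python client library is installed and that credentials are configured. The function names, parameters, and the offline stand-in document are hypothetical, and the linked code samples remain the reference.

```python
from types import SimpleNamespace


def process_document(project_id, location, processor_id, file_path,
                     version_id=None, mime_type="application/pdf"):
    """Send one file for online processing and return the Document."""
    # Imported here so entities_by_field works without the client library.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(
            api_endpoint=f"{location}-documentai.googleapis.com"
        )
    )
    if version_id:
        name = client.processor_version_path(
            project_id, location, processor_id, version_id
        )
    else:
        # Targets the processor, which uses its default version.
        name = client.processor_path(project_id, location, processor_id)

    with open(file_path, "rb") as handle:
        raw_document = documentai.RawDocument(
            content=handle.read(), mime_type=mime_type
        )
    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    return client.process_document(request=request).document


def entities_by_field(document):
    """Group (text, confidence) pairs by entity type."""
    grouped = {}
    for entity in document.entities:
        grouped.setdefault(entity.type_, []).append(
            (entity.mention_text, entity.confidence)
        )
    return grouped


if __name__ == "__main__":
    # Offline stand-in with made-up values, so no API call is needed here.
    fake = SimpleNamespace(entities=[
        SimpleNamespace(type_="control_number", mention_text="0001", confidence=0.9),
    ])
    print(entities_by_field(fake))
```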

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, use the Google Cloud console to delete your processor and project if you don't need them.

If you created a new project to learn about Document AI and you no longer need the project, delete the project.

If you used an existing Google Cloud project, delete the resources you created to avoid incurring charges to your account:

In the Google Cloud console navigation menu, select Document AI and select My Processors.
Select More actions in the same row as the processor you want to delete.
Select Delete processor, enter the processor name, then select Delete again to confirm.

What's next

For details, see Guides.

Custom extractor overview

Custom-based extraction
