Create a Custom Document Extractor in the Google Cloud console

You can create Custom Document Extractors (CDEs) that are suited to your documents and that are trained and evaluated with your data. A CDE processor identifies and extracts entities from your documents. You can then use the trained processor on additional documents. You would typically use a CDE on documents that are all of one type, such as your institution's enrollment forms.

This guide describes how to use Document AI Workbench to create and train a Custom Document Extractor that processes W-2 (US tax form) documents. Most of the document preparation work has been done so that you can focus on the other mechanics of creating a CDE.

The dataset used in these examples comes from Kaggle with a CC0: Public Domain License.

A typical workflow to create and use a CDE is as follows:

  1. Create a Custom Document Extractor in Document AI Workbench.
  2. Define and create the processor schema.
  3. Import documents.
  4. Assign documents to the Training and Test sets.
  5. Annotate documents manually in Document AI Workbench or with Labeling Tasks.
  6. Train the processor.
  7. Evaluate the processor.
  8. Deploy the processor.
  9. Test the processor.
  10. Configure Human-in-the-Loop (HITL) for review.
  11. Use the processor on your documents.

You can make your own configuration choices that suit your workflow.




Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Enable the Document AI and Cloud Storage APIs.

    Enable the APIs


Create a processor

  1. In the Google Cloud console, in the Document AI section, go to the Workbench page.

    Workbench

  2. For Custom Document Extractor, click Create processor.


  3. In the Create processor menu, enter a name for your processor, such as my-custom-document-extractor.


  4. Select the region closest to you.

  5. Optional: Open Advanced Options

    1. You have the option to let Google create a Cloud Storage bucket for you, or you can create your own. For this tutorial, select Google-managed storage.

    2. You also have the option to use Google-managed or Customer-managed encryption keys (CMEK). For this tutorial, select Google-managed encryption key.

  6. Click Create to create your processor.

Define processor fields

You are now on the Processor overview page of the processor you just created.


You can specify the fields you want the processor to extract and begin labeling documents.

  1. Click on the Get started tab. The Fields menu appears.

  2. Click Create New Field.

  3. Enter the name for the field. Select the Data type and the Occurrence. Click Create. Refer to Define processor schema for detailed instructions on creating and editing a schema.

  4. Create each of the following labels for the processor schema.

    Name                                 Data type  Occurrence
    control_number                       Number     Optional multiple
    employees_social_security_number     Number     Required multiple
    employer_identification_number       Number     Required multiple
    employers_name_address_and_zip_code  Address    Required multiple
    federal_income_tax_withheld          Money      Required multiple
    social_security_tax_withheld         Money      Required multiple
    social_security_wages                Money      Required multiple
    wages_tips_other_compensation        Money      Required multiple

    You can also create and use other types of labels in your processor schema, such as checkboxes and tabular entities. For example, W-2 forms contain Statutory employee, Retirement plan, and Third-party sick pay checkboxes that you could also add to the schema.
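
    Before entering the fields in the console, it can help to keep the schema as data and check it for typos. The following is a minimal sketch, not part of Document AI: the Field class, the validate helper, the snake_case naming rule, and the upper-cased occurrence names are all assumptions made for this illustration.

```python
import re
from dataclasses import dataclass

# Occurrence values mirror the console choices, written in upper case
# (an assumption for this sketch).
OCCURRENCES = {"OPTIONAL_ONCE", "OPTIONAL_MULTIPLE", "REQUIRED_ONCE", "REQUIRED_MULTIPLE"}


@dataclass(frozen=True)
class Field:
    """One row of the schema table (hypothetical helper type)."""
    name: str
    data_type: str
    occurrence: str


# The eight W-2 fields from the table in this tutorial.
W2_SCHEMA = [
    Field("control_number", "Number", "OPTIONAL_MULTIPLE"),
    Field("employees_social_security_number", "Number", "REQUIRED_MULTIPLE"),
    Field("employer_identification_number", "Number", "REQUIRED_MULTIPLE"),
    Field("employers_name_address_and_zip_code", "Address", "REQUIRED_MULTIPLE"),
    Field("federal_income_tax_withheld", "Money", "REQUIRED_MULTIPLE"),
    Field("social_security_tax_withheld", "Money", "REQUIRED_MULTIPLE"),
    Field("social_security_wages", "Money", "REQUIRED_MULTIPLE"),
    Field("wages_tips_other_compensation", "Money", "REQUIRED_MULTIPLE"),
]


def validate(schema):
    """Return a list of human-readable problems; empty if the schema looks consistent."""
    problems = []
    seen = set()
    for field in schema:
        # Naming rule assumed for this sketch: lowercase snake_case.
        if not re.fullmatch(r"[a-z][a-z0-9_]*", field.name):
            problems.append(f"{field.name!r}: expected lowercase snake_case")
        if field.occurrence not in OCCURRENCES:
            problems.append(f"{field.name!r}: unknown occurrence {field.occurrence!r}")
        if field.name in seen:
            problems.append(f"{field.name!r}: duplicate field name")
        seen.add(field.name)
    return problems


if __name__ == "__main__":
    print(validate(W2_SCHEMA) or "schema looks consistent")
```

    Keeping the list in one place also makes it easier to compare against the labels that appear later in the labeling console.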


Upload a sample document

Next, you upload a sample W-2 PDF and label it.

  1. Click Upload Sample Document.

  2. In the sidebar, click Import documents from Google Cloud Storage.

  3. For this example, enter the following path in Source path. The path points directly to a single document.

    cloud-samples-data/documentai/Custom/W2/PDF/W2_XL_input_clean_2950.pdf
    
  4. Click Import.

You are redirected to the labeling console.
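
The value entered in Source path is a bucket name followed by an object path. The following sketch shows how such a value breaks down into its two parts. The helper names split_source_path and to_gs_uri are hypothetical and exist only for this illustration.

```python
def split_source_path(source_path):
    """Split a Cloud Storage path into (bucket, object_name).

    Accepts the path with or without a leading gs:// scheme.
    """
    if source_path.startswith("gs://"):
        source_path = source_path[len("gs://"):]
    bucket, _, object_name = source_path.strip("/").partition("/")
    if not bucket:
        raise ValueError("source path must start with a bucket name")
    return bucket, object_name


def to_gs_uri(source_path):
    """Return the same location written as a gs:// URI."""
    bucket, object_name = split_source_path(source_path)
    return f"gs://{bucket}/{object_name}" if object_name else f"gs://{bucket}"


if __name__ == "__main__":
    sample = "cloud-samples-data/documentai/Custom/W2/PDF/W2_XL_input_clean_2950.pdf"
    print(split_source_path(sample))
    print(to_gs_uri(sample))
```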

Label a document

The process of selecting text in a document and applying labels to it is known as annotation.

  1. When you're at the labeling console, notice that many of the labels are already populated.

    Note: Your results might look slightly different from the sample image.

    GenAI Labeled W-2

  2. To use the suggested labels, hold the pointer over each label in the side panel, and click on the check mark to confirm the label is correct. You can edit the values if they do not match the document text.

  3. In this example, the values at the bottom of the document were not identified automatically, so you will need to label them manually.

  4. Use the Bounding box tool by default, or the Select text tool for multi-line values, to select the content and apply the label.

    Note: The Select text tool does not work for all text values, so use the Bounding box if appropriate. You can also select non-text fields such as checkboxes using the Bounding box tool.

  5. In this example, the value of wages_tips_other_compensation was selected with the Bounding box tool, and that label is applied.

    Select wages with bounding box

  6. Review the detected text values to ensure that they reflect the correct text from the document.

    The labeled W-2 document should look like this when complete:

    Labeled W-2 document

  7. If needed, you can click Create New Field to add a new field to the schema from this page.

  8. Click Mark as Labeled when you have finished annotating the document.

    You are redirected to the Get started tab.

Build processor version using foundation model

After labeling a single document, you can create a processor version using the pretrained foundation model to extract entities.

  1. Click on the Build tab.


  2. Under Call foundation model, click Create New Version.

  3. Enter a name for your processor version, such as w2-foundation-model.

  4. Click Create Version. It takes a few minutes to create.

  5. Optional: Click on the Deploy & Use tab. On this page, you can view the available processor versions and the deployment status of the new version.

You test and evaluate this version later in the tutorial.

Use generative AI to auto-label documents

The foundation model can accurately extract fields for a variety of document types, but you can also provide additional training data to improve the accuracy of the model for specific document structures.

Document AI Workbench uses the label names you define and previous annotations to make it quicker and easier to label documents at scale with auto-labeling.

  1. Go to the Build page.

  2. Click Import Documents.

  3. In the sidebar, click Import documents from Google Cloud Storage.

  4. Enter the following path in Source path. This folder contains unlabeled W-2 PDF files.

    cloud-samples-data/documentai/Custom/W2/AutoLabel
    
  5. From the Data split list, select Auto-split. This automatically splits the documents to have 80% in the training set and 20% in the test set.

  6. In the Auto-labeling section, select the Import with auto-labeling checkbox.

  7. Select the foundation model processor version to label the documents.

  8. Click Import and wait for the documents to import. You can leave this page and return later.

  9. You must verify the auto-labeled documents before you can use them for training or testing. Click Start Labeling to view the auto-labeled documents.

  10. To use the suggested labels, hold the pointer over each annotation, and click on the check mark to confirm the label is correct. You can edit the values if they do not match the document text.

  11. Click Mark as Labeled when you have finished annotating the document.

  12. Repeat for each auto-labeled document. For this tutorial, you can skip any documents that were not successfully auto-labeled.
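
To see what an 80/20 split means for a given number of documents, the following sketch divides a list of file names into training and test sets. It only illustrates the proportion: how Document AI actually chooses which documents go into each set is not described here, and the auto_split function, its seed parameter, and the file names are assumptions for this example.

```python
import random


def auto_split(doc_names, train_fraction=0.8, seed=0):
    """Shuffle document names reproducibly and split them into (train, test)."""
    names = sorted(doc_names)              # sort first so the result does not depend on input order
    random.Random(seed).shuffle(names)     # a seeded shuffle keeps the split repeatable
    cut = round(len(names) * train_fraction)
    return names[:cut], names[cut:]


if __name__ == "__main__":
    docs = [f"w2_{i:03d}.pdf" for i in range(10)]  # hypothetical file names
    train, test = auto_split(docs)
    print(len(train), "training documents,", len(test), "test documents")
```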

Import prelabeled training documents

In this guide, you are provided with prelabeled data. If working on your own project, you have to determine how to label your data. Refer to Labeling options for more details. In general, more training data produces higher accuracy.

  1. Go to the Build page.

  2. Click Import Documents.

  3. In the sidebar, click Import documents from Google Cloud Storage.

  4. Enter the following path in Source path. This folder contains prelabeled documents in the Document JSON format.

    cloud-samples-data/documentai/Custom/W2/JSON-2
    
  5. From the Data split list, select Auto-split. This automatically splits the documents to have 80% in the training set, and 20% in the test set. Leave Import with auto-labeling unchecked.

  6. Click Import. Import takes several minutes.
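
If you prepare your own prelabeled files, you might want to check which labels each file contains before importing it. The following sketch counts entities by label in a Document JSON string. The sample content is made up for illustration and shows only the text and entities fields; real files contain more fields. The label_counts and missing_labels helpers are hypothetical.

```python
import json
from collections import Counter

# A made-up, heavily trimmed Document JSON example (not taken from the sample bucket).
SAMPLE_DOCUMENT_JSON = """
{
  "text": "Wages 52000.00 Federal income tax withheld 6100.00",
  "entities": [
    {"type": "wages_tips_other_compensation", "mentionText": "52000.00"},
    {"type": "federal_income_tax_withheld", "mentionText": "6100.00"}
  ]
}
"""


def label_counts(document_json):
    """Count how many entities of each label a Document JSON string contains."""
    document = json.loads(document_json)
    return Counter(entity["type"] for entity in document.get("entities", []))


def missing_labels(required, counts):
    """Return the required labels that do not appear in the counts, in sorted order."""
    return sorted(label for label in required if counts[label] == 0)


if __name__ == "__main__":
    counts = label_counts(SAMPLE_DOCUMENT_JSON)
    print(dict(counts))
    print(missing_labels({"social_security_wages", "federal_income_tax_withheld"}, counts))
```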

Optional: View and manage dataset

  1. From the Build page, you can access the Manage Dataset console to view and edit all documents and labels in the dataset.

Train the processor

Now that you have sufficient training and test data, you can train the processor. Because training might take several hours, make sure you have set up the processor with the appropriate data and labels before you begin training.

  1. Under Train a custom model, click Create New Version.

    • If Create New Version cannot be clicked, click View Full Requirements for information about the dataset requirements.

  2. In the Version name field, enter a name for this processor version, such as w2-custom-model.

  3. Optional: Click View Label Stats to find information about the document labels. These statistics can help you judge how well each label is covered by your data. Click Close to return to the training setup.

  4. Under Model training method, select Model based.

  5. Under Choose base version, select Train from scratch.

  6. Click Start training. Training takes a few hours. You can close this page and come back later.

  7. Optional: Click on the Deploy & Use tab. On this page, you can view the available processor versions and the training status of the new version.

Deploy the processor version

  1. After training is complete, click on the Deploy & Use tab.

  2. Click the checkbox on the left of the version you want to deploy, and select Deploy.

  3. Select Deploy from the dialog window. Deployment takes a few minutes.

  4. When the version is deployed, you can set it as the Default version, or you can provide the version ID when processing documents with the API.
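
When you provide a version ID to the API, the request addresses the processor version by its full resource name. The following sketch builds that name from its parts. The processor_version_name helper and all four example values are placeholders made up for this illustration; substitute the IDs shown for your processor on the Deploy & Use tab.

```python
def processor_version_name(project_id, location, processor_id, version_id):
    """Build the full resource name of a specific processor version."""
    for label, value in (("project_id", project_id), ("location", location),
                         ("processor_id", processor_id), ("version_id", version_id)):
        # Reject empty parts and stray slashes, which would produce a malformed name.
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty ID without slashes")
    return (
        f"projects/{project_id}/locations/{location}"
        f"/processors/{processor_id}/processorVersions/{version_id}"
    )


if __name__ == "__main__":
    # All four values are placeholders.
    print(processor_version_name("my-project", "us", "my-processor-id", "my-version-id"))
```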

Evaluate and test the processor

  1. Click on the Evaluate tab to test the processor version.

    On this page, you can view evaluation metrics, including the F1 score, precision, and recall, for the full document and for individual labels. For more information about evaluation and statistics, refer to Evaluate processor.

  2. Click on the Version selector and select the version using the foundation model.

  3. Download a document that has not been involved in previous training or testing so that you can use it to evaluate the processor version. If using your own data, you would use a document set aside for this purpose.

    Download PDF

  4. Click Upload Test Document and select the document you just downloaded.

    The Custom Document Extractor analysis page opens. The screen output demonstrates how well the document was extracted.

  5. Test the document again using the version with a custom trained model.
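
The precision, recall, and F1 score shown on the Evaluate tab are related by standard formulas. The following sketch computes them from counts of true positives, false positives, and false negatives. The counts in the example are made up, and the precision_recall_f1 function is a hypothetical helper, not part of Document AI.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, f1) computed from raw counts.

    precision = TP / (TP + FP): the share of predicted entities that are correct.
    recall    = TP / (TP + FN): the share of labeled entities that were found.
    f1        = harmonic mean of precision and recall.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts for one label: 8 correct, 2 spurious, 2 missed.
    precision, recall, f1 = precision_recall_f1(8, 2, 2)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Comparing these values between the foundation model version and the custom trained version, label by label, shows where the additional training data helped.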

Use the processor

You have successfully created and trained a Custom Document Extractor processor.

You can manage your custom trained processor versions just like any other processor version. For more information, refer to Managing processor versions.

To use the Document AI API:
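
The following is a minimal sketch of an online processing request with the Python client library, assuming that the google-cloud-documentai package is installed and that application default credentials are configured. The function names, the IDs, and the file name are placeholders chosen for this example. The request is sent to the processor's default version.

```python
def api_endpoint(location):
    """Return the regional API endpoint for a processor location such as "us" or "eu"."""
    return f"{location}-documentai.googleapis.com"


def extract_entities(project_id, location, processor_id, pdf_path):
    """Send one local PDF to the processor and return (label, text, confidence) tuples."""
    # Imported here so that api_endpoint() can be used without the client library installed.
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai

    client = documentai.DocumentProcessorServiceClient(
        client_options=ClientOptions(api_endpoint=api_endpoint(location))
    )
    # Addresses the processor itself, so the default version handles the request.
    name = client.processor_path(project_id, location, processor_id)

    with open(pdf_path, "rb") as pdf_file:
        raw_document = documentai.RawDocument(
            content=pdf_file.read(), mime_type="application/pdf"
        )

    request = documentai.ProcessRequest(name=name, raw_document=raw_document)
    result = client.process_document(request=request)

    return [
        (entity.type_, entity.mention_text, entity.confidence)
        for entity in result.document.entities
    ]


if __name__ == "__main__":
    # Placeholder values: replace them with your own project, location, processor ID, and file.
    for label, text, confidence in extract_entities(
        "my-project", "us", "my-processor-id", "w2_sample.pdf"
    ):
        print(f"{label}: {text!r} ({confidence:.2f})")
```

Each returned tuple corresponds to one extracted entity, so the labels should match the field names defined in the processor schema.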

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, use the Google Cloud console to delete your processor and project if you no longer need them.

If you created a new project to learn about Document AI and you no longer need the project, delete the project.

If you used an existing Google Cloud project, delete the resources you created to avoid incurring charges to your account:

  1. In the Google Cloud console navigation menu, click Document AI and select My Processors.

  2. Click More actions in the same row as the processor you want to delete.

  3. Click Delete processor, type the processor name, then click Delete again to confirm.

What's next