Create custom document extractor in the Cloud Console

Creating and using a Custom Document Extractor

You can create Custom Document Extractors (CDE) that are specifically suited to your documents, and trained and evaluated with your data. This processor identifies and extracts entities from your documents. You can then use the trained processor on additional documents. You would typically use a CDE on documents that are all of one type, such as your institution's enrollment forms.

A typical workflow to create and use a CDE is as follows:

  1. Create a Custom Document Extractor in Document AI Workbench.
  2. Create a dataset using an empty Cloud Storage bucket.
  3. Define and create the processor schema.
  4. Import documents.
  5. Assign documents to the Training and Test sets.
  6. Annotate documents manually in Document AI Workbench or with Labeling Tasks.
  7. Train the processor.
  8. Evaluate the processor.
  9. Deploy the processor.
  10. Test the processor.
  11. Configure Human-in-the-Loop (HITL) for review.
  12. Use the processor on your documents.

You can make your own configuration choices that suit your workflow.

Quickstart: Create a W-2 Custom Processor

This guide describes how to use Document AI Workbench to create and train a Custom Document Extractor that processes W-2 (US tax form) documents. Most of the document preparation work has been done so that you can focus on the other mechanics of creating a CDE.

The dataset used in these examples comes from Kaggle with a CC0: Public Domain License.



Prerequisites

Follow the instructions in Set up the Document AI API to ensure that you have a valid Google Cloud project and access to Document AI.

Step 1: Create a processor

  1. Open the project in which you want to create the Custom Document Extractor.

  2. In the Google Cloud console, in the Document AI section, go to the Workbench page.

  3. For Custom Document Extractor, click Create processor.

    Select CDE processor

  4. Enter a name, such as my-custom-document-extractor.

    Create CDE processor

  5. Select a region.

  6. Click Create. The Processor Details tab appears.
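
If you prefer to script this step, the same processor can in principle be created through the Document AI REST API. The sketch below only builds the URL and JSON body for such a call and does not send anything. The project ID and location are hypothetical placeholders, and the processor type string `CUSTOM_EXTRACTION_PROCESSOR` is an assumption that you should check against the processor types available in your project.

```python
import json


def build_create_processor_request(project_id: str, location: str, display_name: str):
    """Return (url, json_body) for a processor-creation REST call.

    Nothing is sent; the caller decides how to authenticate and POST.
    """
    url = (
        f"https://{location}-documentai.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/processors"
    )
    body = {
        "displayName": display_name,
        # Assumed type identifier for a Custom Document Extractor.
        "type": "CUSTOM_EXTRACTION_PROCESSOR",
    }
    return url, json.dumps(body)


# Hypothetical project and region; the name matches the example in step 4.
url, body = build_create_processor_request(
    "my-project", "us", "my-custom-document-extractor"
)
print(url)
print(body)
```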

Step 2: Create a Cloud Storage bucket for the dataset

To create a dataset, you first create an empty Cloud Storage bucket, and then import documents into it.

  1. On your processor's Train tab, click Create Dataset. You are prompted to select or create an empty Cloud Storage bucket or folder.

    Create a bucket

  2. Click Browse to open Select folder. Click the Create a new bucket icon and follow the prompts to create a new bucket. For more information on creating a Cloud Storage bucket, see Cloud Storage buckets.

    After you create the bucket, the Select folder page appears for that bucket.

  3. On the Select folder page for your bucket, click the Select button at the bottom of the dialog box.

    Select bucket

  4. Make sure the destination path is populated with the bucket name you selected. Click Create Dataset. The dataset might take up to several minutes to create.

    Create dataset

Step 3: Import documents into a dataset

Next, import your documents into the dataset. The imported documents are stored in the empty bucket you created.

  1. On the Train tab, click Import documents. The Import documents: Add documents to your dataset page opens at the right.

    Import documents

  2. For this example, enter the following path in Source path. This folder contains one W-2 PDF file.

    cloud-samples-data/documentai/Custom/W2/PDF
    
  3. For Data split, select Unassigned. The document in this folder is not assigned to either the training or the test set.

  4. Click Import. Document AI reads the documents from the bucket into the dataset. It does not modify the import bucket or read from the bucket after the import is complete.
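
The console accepts the source path without a scheme, while command-line tools and client libraries usually expect a `gs://` URI. As a small convenience sketch (the helper name `to_gcs_uri` is hypothetical), the following splits a console-style path into its bucket, folder prefix, and full URI:

```python
def to_gcs_uri(source_path: str):
    """Split a console-style source path into (bucket, prefix, gs:// URI)."""
    path = source_path.strip()
    # Accept paths that already carry the gs:// scheme.
    if path.startswith("gs://"):
        path = path[len("gs://"):]
    path = path.strip("/")
    bucket, _, prefix = path.partition("/")
    if not bucket:
        raise ValueError("source path must start with a bucket name")
    uri = f"gs://{bucket}/{prefix}" if prefix else f"gs://{bucket}"
    return bucket, prefix, uri


bucket, prefix, uri = to_gcs_uri("cloud-samples-data/documentai/Custom/W2/PDF")
print(bucket, prefix, uri)
```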

When you import documents, you can assign them to the training or test set during the import, or assign them later.

If you want to delete a document or documents that you have imported, select them on the Train tab, and click Delete.

For more information about preparing your data for import, see Import documents.

Step 4: Define processor schema

You can create the processor schema either before or after you import documents into your dataset. The schema provides labels that you will use to annotate documents.

To create the schema, you create labels for it.

  1. Click Edit Schema in the lower left of the Train tab. The Manage labels page opens.

  2. Click Create label. Enter the name for the label. If you are creating a tabular entity, select the Parent label checkbox. Select the Data type and the Occurrence. Click Create. See Define processor schema for detailed instructions on creating and editing a schema.

  3. Create each of the following labels for the processor schema.

    Name                     Data type     Occurrence
    CONTROL_NUMBER           Number        Required multiple
    EMPL_SSN                 Plain Text    Required multiple
    EMPLR_ID_NUMBER          Plain Text    Required multiple
    EMPLR_NAME_ADDRESS       Address       Required multiple
    FEDERAL_INCOME_TAX_WH    Money         Required multiple
    SS_TAX_WH                Money         Required multiple
    SS_WAGES                 Money         Required multiple
    WAGES_TIPS_OTHER_COMP    Money         Required multiple

    You can also create and use other types of labels in your processor schema, such as checkboxes and tabular entities. For example, W-2 forms contain Statutory employee, Retirement plan, and Third-party sick pay checkboxes that you could also add to the schema.

  4. Click Save when the labels are complete.

    Manage labels console
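
It can help to keep the schema under version control as data, so that the labels you enter in the console can be reviewed or recreated later. The sketch below writes the eight labels from this step as a JSON structure. The field names (`entityTypes`, `properties`, `valueType`, `occurrenceType`), the lowercase value-type strings, and the entity type name `w2_document` are assumptions for illustration, loosely modeled on a document schema; they are not confirmed by this guide.

```python
import json

# Labels from this step: (name, console data type, console occurrence).
W2_LABELS = [
    ("CONTROL_NUMBER", "Number", "Required multiple"),
    ("EMPL_SSN", "Plain Text", "Required multiple"),
    ("EMPLR_ID_NUMBER", "Plain Text", "Required multiple"),
    ("EMPLR_NAME_ADDRESS", "Address", "Required multiple"),
    ("FEDERAL_INCOME_TAX_WH", "Money", "Required multiple"),
    ("SS_TAX_WH", "Money", "Required multiple"),
    ("SS_WAGES", "Money", "Required multiple"),
    ("WAGES_TIPS_OTHER_COMP", "Money", "Required multiple"),
]

# Assumed mapping from console data types to schema value-type strings.
VALUE_TYPES = {
    "Number": "number",
    "Plain Text": "string",
    "Address": "address",
    "Money": "money",
}


def build_schema(labels, entity_type_name="w2_document"):
    """Build a schema-like dict from (name, data type, occurrence) tuples."""
    properties = [
        {
            "name": name,
            "valueType": VALUE_TYPES[data_type],
            # "Required multiple" becomes "REQUIRED_MULTIPLE".
            "occurrenceType": occurrence.upper().replace(" ", "_"),
        }
        for name, data_type, occurrence in labels
    ]
    return {
        "entityTypes": [
            {
                "name": entity_type_name,  # hypothetical name
                "baseTypes": ["document"],
                "properties": properties,
            }
        ]
    }


print(json.dumps(build_schema(W2_LABELS), indent=2))
```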

Step 5: Label a document

The process of selecting text in a document and applying labels to it is known as annotation.

  1. On the Train tab, click a document to open the Label management console.

  2. Select the content that corresponds to a label by using the Select text tool or the Bounding box tool, and then apply the label. In this example, the value of WAGES_TIPS_OTHER_COMP is selected with the Bounding box tool, and that label is applied.

    Select wages with bounding box

    Apply wages label

  3. Review the detected text values to ensure that they reflect the correct text from the document.

    The labeled W-2 document should look like this when complete:

    Labeled W-2 document

  4. Click Mark as Labeled when you have finished annotating the document.

    On the Train tab, the left-hand panel shows that 1 document has been labeled.

Step 6: Assign annotated document to the training set

Now that you have labeled this example document, you can assign it to the training set.

  • On the Train tab, click the checkbox to select the document. From the Assign to Set dropdown list, select Training.

In the left-hand panel, you can see that 1 document has been assigned to the training set.

Step 7: Import pre-labeled data to the training and test sets

In this guide, you are provided with pre-labeled data.

If working on your own project, you will have to determine how to label your data. See Labeling options. Document AI Custom Processors require a minimum of 10 documents in both the training and test sets, along with 10 instances of each label in each set. We recommend that you have at least 50 documents in each set, with 50 instances of each label for best performance. In general, more training data produces higher accuracy.
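
When you prepare your own data, you can check these minimums before importing. The following is a minimal sketch, assuming you can list the label names that appear in each document of a set; the function name `check_set` and the input format are hypothetical.

```python
from collections import Counter

MINIMUM = 10      # required documents and label instances per set
RECOMMENDED = 50  # recommended documents and label instances per set


def check_set(docs, labels, threshold=MINIMUM):
    """Return a list of problems for one set (training or test).

    docs is a list with one entry per document; each entry is the list of
    label names annotated in that document.
    """
    problems = []
    if len(docs) < threshold:
        problems.append(f"only {len(docs)} documents (need {threshold})")
    counts = Counter(label for doc in docs for label in doc)
    for label in labels:
        if counts[label] < threshold:
            problems.append(
                f"label {label}: {counts[label]} instances (need {threshold})"
            )
    return problems


# Ten documents, each annotated with both labels, meet the minimum.
training = [["SS_WAGES", "EMPL_SSN"] for _ in range(10)]
print(check_set(training, ["SS_WAGES", "EMPL_SSN"]))
print(check_set(training, ["SS_WAGES", "EMPL_SSN"], threshold=RECOMMENDED))
```

An empty list means the set meets the chosen threshold. Run the check once for the training set and once for the test set.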

  1. Click Import Documents.

  2. Enter the following bucket in Source path:

    cloud-samples-data/documentai/Custom/W2/JSON
    
  3. From the Data split dropdown list, select Auto-split. This will automatically split the documents to have 80% in the training set, and 20% in the test set.

  4. Click Import. The import might take several minutes to complete.

When the import is finished, you will see the documents on the Train tab.
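
To make the 80/20 proportion concrete, here is an illustrative sketch of a random split. It is not the algorithm Document AI uses for Auto-split, which this guide does not describe; the function name, the seed, and the document IDs are hypothetical.

```python
import random


def auto_split(doc_ids, train_fraction=0.8, seed=0):
    """Shuffle document IDs and split them into (training, test) lists."""
    shuffled = list(doc_ids)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


training, test = auto_split([f"doc-{i}" for i in range(50)])
print(len(training), len(test))
```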

Step 8: Train the processor

Now that you have imported the training and test data, you can train the processor. Because training might take several hours, make sure you have set up the processor with the appropriate data and labels before you begin training.

  1. Click Train New Version. Enter a name for this processor version, such as my-cde-version-1.

  2. (Optional) Click View Label Stats to see information about the document labels. This information can help you judge how well each label is covered. Click Close to return to the training setup.

  3. Click Start training. You can see the status on the right-hand panel.

Step 9: Evaluate and test the processor

  1. Navigate to the Manage Versions tab. You can view details about the version you just trained.

  2. Click the three vertical dots on the right, and select Deploy version.

  3. Select Deploy from the popup window.

  4. When the version has completed deployment, navigate to the Evaluate & Test tab. View evaluation metrics including the F1 score, Precision and Recall for the full document as well as individual labels. See Evaluate processor.

  5. You will next download a document that has not been involved in previous training or testing so that you can use it to evaluate the processor version. If using your own data, you would use a document set aside for this purpose.

    Download PDF

  6. Click Upload Test Document. The screen output will demonstrate how well the labels are applied to the text entities.

You can also re-run the evaluation against a different test set or processor version.

For more information about evaluation and what the different statistics mean, see Evaluate the performance of processors.
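
As a reminder of how the three metrics relate, the sketch below computes them from counts of true positives, false positives, and false negatives using the standard definitions. How Document AI decides whether a predicted entity matches a labeled one is not covered here, so treat the counts as given inputs.

```python
def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, f1) from entity match counts."""
    predicted = true_pos + false_pos  # entities the processor returned
    actual = true_pos + false_neg     # entities in the labeled test data
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts for one label.
print(precision_recall_f1(true_pos=8, false_pos=2, false_neg=2))
```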

Step 10: Deploy and use the processor

You can deploy and manage your custom-trained processor versions just like any other processor version. For more information, see Managing processor versions.

Once deployed, you can Send a processing request to your custom processor, and the response can be handled the same as other specialized processors.
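
A processing request can be sketched with the Python standard library alone, as shown below. The code builds a REST `process` request for a PDF and reads entities out of a response; it does not send anything by itself. The project ID, location, processor ID, and access token are placeholders you must supply, and the helper names are hypothetical.

```python
import base64
import json
import urllib.request


def build_process_request(project_id, location, processor_id, pdf_bytes, access_token):
    """Build a POST request that asks a processor to process one PDF."""
    url = (
        f"https://{location}-documentai.googleapis.com/v1/projects/{project_id}"
        f"/locations/{location}/processors/{processor_id}:process"
    )
    body = {
        "rawDocument": {
            # The REST API expects the file content as base64 text.
            "content": base64.b64encode(pdf_bytes).decode("ascii"),
            "mimeType": "application/pdf",
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


def extract_entities(response_json):
    """Return (type, text, confidence) tuples from a process response."""
    entities = response_json.get("document", {}).get("entities", [])
    return [
        (e.get("type"), e.get("mentionText"), e.get("confidence"))
        for e in entities
    ]
```

To send the request, pass it to `urllib.request.urlopen`, parse the response body with `json.load`, and hand the result to `extract_entities`. One way to obtain a short-lived access token for testing is `gcloud auth print-access-token`.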

Optional: Auto-label newly imported documents

After deploying a trained processor version, you can use Auto-labeling to save time on labeling when importing new documents.

  1. On the Train tab, click Import documents.

  2. Enter the following Cloud Storage path. This directory contains 5 unlabeled W-2 PDFs.

    cloud-samples-data/documentai/Custom/W2/AutoLabel
    
  3. From the Data split list, select Training.

  4. In the Auto-labeling section, select the Import with auto-labeling checkbox.

  5. Select an existing processor version to label the documents, for example, 2af620b2fd4d1fcf.

  6. Click Import and wait for the documents to import. You can leave this page and return later.

    • When complete, the documents appear in the Train page in the Auto-labeled section.

  7. You cannot use auto-labeled documents for training or testing without marking them as labeled. Go to the Auto-labeled section to view the auto-labeled documents.

  8. Select the first document to enter the labeling console.

  9. Verify the labels, bounding boxes, and values to ensure they are correct. Label any values that were omitted.

  10. Select Mark as labeled when finished.

  11. Repeat the label verification for each auto-labeled document, then return to the Train page to use the data for training.