Label documents

A labeled dataset of documents is required to train, uptrain, or evaluate a processor version.

This page describes how to apply labels from your processor schema to imported documents in your dataset.

This page assumes you have already created a processor that supports training, uptraining, or evaluation. If your processor is supported, you will see the Train tab in the Google Cloud console. It also assumes you have created a dataset, imported documents, and defined a processor schema.

Name fields for generative AI extraction

The way fields are named influences how accurately fields are extracted using generative AI. We recommend the following best practices when naming fields:

  • Name the field with the same language used to describe it in the document: For example, if a document has a field described as Employer Address, then name the field employer_address. Do not use abbreviations such as emplr_addr.

  • Don't use spaces in field names: Spaces are not currently supported. Use an underscore (_) instead. For example, name a First Name field first_name.

  • Iterate on names to improve accuracy: Document AI doesn't currently let you rename a field. To test a different name, disable or delete the existing field and re-create it with the new name.
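
The naming recommendations above can be applied mechanically. The following is a minimal sketch, not part of Document AI: to_field_name is a hypothetical helper that keeps the wording found in the document, lowercases it, and replaces spaces with underscores.

```python
import re


def to_field_name(document_wording: str) -> str:
    """Turn the wording used in the document into a schema field name.

    Hypothetical helper for illustration only; not a Document AI API.
    """
    # Collapse each run of whitespace into one underscore, then lowercase.
    return re.sub(r"\s+", "_", document_wording.strip()).lower()


print(to_field_name("Employer Address"))  # employer_address
print(to_field_name("First Name"))        # first_name
```

The helper deliberately does not abbreviate anything, so the field name stays as close as possible to the language in the document.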

Labeling options

Here are your options for labeling documents:

Manually label in the Google Cloud console

In the Train tab, click a document to open the labeling tool.

From the list of schema labels on the left side of the labeling tool, click the Add symbol to select the Bounding box tool. Use the tool to highlight entities in the document and assign them to a label.

In the following screenshot, the EMPL_SSN, EMPLR_ID_NUMBER, EMPLR_NAME_ADDRESS, FEDERAL_INCOME_TAX_WH, SS_TAX_WH, SS_WAGES, and WAGES_TIPS_OTHER_COMP fields in the document have been assigned labels.

Labeled W-2 document

When you select a checkbox entity with the Bounding box tool, only select the checkbox itself, and not any associated text. Ensure that the checkbox entity shown on the left is either selected or de-selected to match what is in the document.

When you label parent/child entities, don't label the parent entities. The parent entities are just containers of the child entities. Only label the child entities. The parent entities are updated automatically.

When you label child entities, label the first child entity and then associate the related child entities with that line. The first time you label such entities, the association step appears at the second child entity. For example, with an invoice, labeling description works like labeling any other entity. However, when you label quantity next, you are prompted to pick the parent.

Repeat this step for each line item by clicking NEW PARENT ENTITY for each new line item.

Parent/child entities are a Preview feature and are only supported for tables with one layer of nesting.

Quick tables

When you label a table, labeling each row individually can be tedious. The quick tables tool replicates a row's entity structure so that you only label one row by hand. Note that this feature only works on horizontally aligned rows.

  1. First, label the first row as usual.
  2. Then, hover over the parent entity representing the row. Click on Add more rows.
    The row becomes a template to create more rows.

  3. Select the rest of the area of the table.

The tool guesses the annotations, and it usually works. For any tables it can't handle, annotate those manually.

Use keyboard shortcuts in console

To see the keyboard shortcuts that are available, click the three vertical dots at the upper right of the labeling console. It displays a list of keyboard shortcuts, as shown in the following table.

Action            Shortcut
Zoom in           Alt + = (Option + = on macOS)
Zoom out          Alt + - (Option + - on macOS)
Zoom to fit       Alt + 0 (Option + 0 on macOS)
Scroll to zoom    Alt + Scroll (Option + Scroll on macOS)
Panning           Scroll
Reversed panning  Shift + Scroll
Drag to pan       Space + Mouse drag
Undo              Ctrl + Z (Control + Z on macOS)
Redo              Ctrl + Shift + Z (Control + Shift + Z on macOS)

Auto-labeling

If an existing version of your processor is available, you can use it to start labeling. Note: Auto-labeling can populate a label only if the processor version supports that label. You can start auto-labeling in two ways:

  1. Auto-label can be initiated during import. All documents are annotated using the specified processor version.

  2. Auto-label can be initiated after import for documents in the unlabeled or auto-labeled category. All selected documents are annotated using the specified processor version.

You can't train or uptrain on auto-labeled documents, or use them in the test set, without marking them as labeled. Manually review and correct the auto-labeled annotations, then click Mark as Labeled to save the corrections. You can then assign the documents as appropriate.

Document labeling tasks

Labeling tasks let you outsource the labeling of your documents to a team of labeling specialists (either your own internal team or a third-party team) and manage their work.

Before creating a labeling task, make sure you have already defined a processor schema.

Create labeling tasks

  1. In the dataset tab, in the action panel, click Create labeling task.

  2. Specify the documents you want labeled in this task.

  3. Add labeling instructions. These instructions should cover details and corner cases that your specialists will encounter. See Best practices for labeling and sample labeling instructions. The labeling instructions must be stored in a publicly accessible storage bucket, or you will not be able to create the task.

  4. Click the Specialist pool drop-down list. If you do not already have a specialist pool that you want to use, click New Specialist Pool. Fill out the form with the pool name, the email addresses for the managers of the pool, and the email addresses for the specialists in the pool. Click Create Pool. An entry for the new specialist pool will appear in the Specialist pool drop-down list.

  5. If the specialist pool you want is in the Specialist pool drop-down list, select that option from the list. Select the specialists (the team of people who label the document). Consider the following factors when selecting the specialists:

    • Privacy: Do you want to control who has access to your documents?
    • Convenience: Do you already have a team of labeling specialists?

It may take several minutes for a newly created labeling task to appear in the list of labeling tasks on the Train tab.

Manage labeling tasks

Labeling managers receive an automated email with a link to the Labeling Manager UI. The console is similar to the one for Human in the Loop (HITL). The only difference is that the documents for labeling tasks are automatically assigned; the manager does not need to manually assign each document to a specialist. For more information, see Labeling Manager UI and Labeler Workbench Manager UI.

Specialists receive an automated email with a link to the Specialists (Worker) Console, where they can perform the labeling. The console for the specialists is the same as the one for HITL specialists. For more information, see step 6 in the HITL Codelab.

You can monitor the labeling task's status in the processor's Train tab.

When the labeling task is complete, the labeled documents are automatically updated in your dataset in Document AI Workbench, although this may take several minutes.

You can stop a labeling task directly from Document AI Workbench at any time. If documents in the stopped labeling task were already labeled, these documents are automatically updated in your dataset with the new labels.

Import pre-labeled documents

You can import JSON Document files. If the entity in the document matches the label in the processor schema, the entity is converted to a label instance by the importer. There are several ways you can get JSON Document files:

  • Human-in-the-loop.

  • Exporting a dataset from another processor. See Export dataset.

  • Sending a processing request to an existing processor.

  • Writing a script to convert existing labels from another system (for example, CSV labels) to JSON documents.
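
As an illustration of the last option, the following is a minimal sketch of such a conversion script. Several things here are assumptions rather than documented behavior: the CSV layout (label and value columns), the helper name csv_labels_to_document, and the exact set of JSON fields the importer needs. The keys used (text, entities, type, mentionText, textAnchor, textSegments, startIndex, endIndex) follow the JSON form of the Document resource as understood here; check them against the Document reference, and note that the importer may also need page layout information that this sketch does not produce.

```python
import csv
import io
import json


def csv_labels_to_document(text: str, csv_labels: str) -> dict:
    """Build a Document-style dict from CSV rows with "label" and "value" columns.

    Hypothetical helper: the CSV layout is an assumption for illustration.
    """
    entities = []
    for row in csv.DictReader(io.StringIO(csv_labels)):
        start = text.find(row["value"])
        if start == -1:
            # The value does not occur in the text; skip it rather than guess an offset.
            continue
        entities.append(
            {
                "type": row["label"],
                "mentionText": row["value"],
                "textAnchor": {
                    "textSegments": [
                        {"startIndex": start, "endIndex": start + len(row["value"])}
                    ]
                },
            }
        )
    return {"text": text, "entities": entities}


sample_text = "Invoice INV-001 total 42.00"
sample_csv = "label,value\ninvoice_id,INV-001\ntotal_amount,42.00\nsupplier_name,Acme\n"
document = csv_labels_to_document(sample_text, sample_csv)
print(json.dumps(document, indent=2))
```

The sketch anchors each label to the first occurrence of its value only. If a value appears more than once in a document, extend the loop to emit one entity per occurrence. The type of each entity must match a label in the processor schema for the importer to convert it to a label instance.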

Best practices for labeling documents

Consistent labeling is required to train a high quality processor. We recommend that you:

  • Create labeling instructions: Your instructions should include examples for both the common and corner cases. Some tips:

    • Explain which fields should be annotated and exactly how, so that labeling is consistent. For example, when labeling "amount", specify whether the currency symbol should be labeled. If the labels are not consistent, processor quality will be reduced.
    • Label all occurrences of an entity, even if the label type is REQUIRED_ONCE or OPTIONAL_ONCE. For example, if invoice_id appears two times in the document, label both occurrences.
    • Prefer labeling with the bounding box tool first. If that fails, use the select text tool. If the value of the label is not correctly detected by OCR, manually correct the value.

Here are some sample labeling instructions:

  • Train annotators: Make sure that annotators understand the guidelines and can follow them without systematic errors. One way to achieve this is to have different trainees annotate the same set of documents. The trainer can then check the quality of each trainee's annotation work. You might need to repeat this process until the trainees achieve a benchmark level of accuracy.

  • Annotation reviews: Given the laborious nature of annotation, even trained annotators may make mistakes. We recommend that annotations are checked by at least one more trained annotator.

Resync dataset

Resync keeps your dataset's Cloud Storage folder consistent with Document AI's internal index of metadata. This is useful if you've accidentally made changes to the Cloud Storage folder and would like to synchronize the data.

To resync:

In the Processor Details tab, next to the Storage location row, click and then click Re-sync Dataset.

Usage notes:

  • If you delete a document from the Cloud Storage folder, resync will remove it from the dataset.
  • If you add a document to the Cloud Storage folder, resync will not add it to the dataset. To add documents, import them.
  • If you modify document labels in the Cloud Storage folder, resync will update the document labels in the dataset.
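
The three usage notes can be summarized as a small model. This is an illustrative sketch only, not how Document AI implements resync: the resync function and the dictionaries that map a document name to its labels are hypothetical.

```python
def resync(dataset: dict, bucket: dict) -> dict:
    """Model the documented resync rules.

    Both arguments map a document name to its list of labels.
    Hypothetical model for illustration; not a Document AI API.
    """
    # Keep only documents that are already in the dataset and still in the
    # bucket (deleted documents drop out, new bucket documents are not added),
    # and take their labels from the bucket (label edits are picked up).
    return {name: bucket[name] for name in dataset if name in bucket}


dataset = {"a.json": ["total"], "b.json": ["date"]}
bucket = {"b.json": ["date", "invoice_id"], "c.json": ["total"]}
print(resync(dataset, bucket))  # {'b.json': ['date', 'invoice_id']}
```

In this example, a.json was deleted from the folder and is removed, b.json picks up its modified labels, and c.json is ignored because it was never imported.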

Migrate dataset

Import and export let you move all the documents in a dataset from one processor to another. This can be useful if you have processors in different regions or Google Cloud projects, if you have different processors for staging and production, or for general offline consumption.

Note that only the documents and their labels are exported. Dataset metadata, such as the processor schema, document assignments (training/test/unassigned), and document labeling status (labeled/unlabeled/auto-labeled), is not exported.

Export dataset

To export all documents as JSON Document files to a Cloud Storage folder, click Export Dataset.

A few important things to note:

  1. During export, three sub-folders will be created: Test, Train, and Unassigned. Your documents are placed into those sub-folders accordingly.

  2. A document's labeling status is not exported. If you later import the documents, they will not be marked auto-labeled.

  3. If your Cloud Storage is in a different Google Cloud project, make sure to grant access so that Document AI is allowed to write files to that location. Specifically, you must grant the Storage Object Creator role to Document AI's Core Service Agent service-{project-id}@gcp-sa-prod-dai-core.iam.gserviceaccount.com. For more information, see Google-managed service accounts.
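
For the cross-project case in the last note, the following minimal sketch builds the IAM binding you would add to the destination bucket's policy. The function name export_bucket_binding and the value my-project are placeholders. The role ID roles/storage.objectCreator corresponds to the Storage Object Creator role, and the serviceAccount: prefix is the standard IAM member format. The sketch only constructs the binding; it does not apply it.

```python
def export_bucket_binding(project_id: str) -> dict:
    """Build the IAM binding that lets Document AI write export files.

    Hypothetical helper; project_id fills the {project-id} placeholder
    in the service agent address given above.
    """
    member = (
        "serviceAccount:"
        f"service-{project_id}@gcp-sa-prod-dai-core.iam.gserviceaccount.com"
    )
    return {"role": "roles/storage.objectCreator", "members": [member]}


binding = export_bucket_binding("my-project")  # "my-project" is a placeholder
print(binding)
```

Add the resulting binding to the bucket's IAM policy through the Google Cloud console or your usual IAM tooling before you click Export Dataset.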

Import dataset

The procedure is the same as Import documents.

What's next

Train a processor

Evaluate processor performance