
Label documents

A labeled dataset of documents is required to train, uptrain, or evaluate a processor version.

This page describes how to apply labels from your processor schema to imported documents in your dataset.

This page assumes you have already created a processor that supports training, uptraining, or evaluation. If your processor is supported, you will see the Train tab in the Google Cloud console. It also assumes you have created a dataset, imported documents, and defined a processor schema.

Labeling options

Here are your options for labeling documents:

Manually label in the Google Cloud console

In the Train tab, click a document to open the labeling tool.

Use either the Select text or Bounding box tool to highlight entities in the document and assign them to a label.

In the following screenshot, the first_name, last_name, SSN, address, and city fields in the document have been assigned labels.

When you select a checkbox entity with the Bounding box tool, only select the checkbox itself, and not any associated text. Ensure that the checkbox entity shown on the left is either selected or de-selected to match what is in the document.

When you label parent/child entities, don't label the parent entities. The parent entities are just containers of the child entities. Only label the child entities. The parent entities are updated automatically.

When you label child entities, label the first child entity and then associate the related child entities with that line item. The first time you label such entities, you will notice this at the second child entity. For example, with an invoice, labeling description looks like labeling any other entity. However, when you next label quantity, you are prompted to pick the parent.

Repeat this step for each line item by clicking NEW PARENT ENTITY for each new line item.

Auto-labeling

If available, you can use an existing version of your processor to get a head start on labeling. Note that auto-labeling can populate labels only if the processor version supports that label.

  1. Auto-labeling can be initiated during import.

  2. Choose the desired processor version.

  3. You cannot train or uptrain on auto-labeled documents, or use them in the test set, without marking them as labeled. After you import auto-labeled documents, manually review and correct the auto-labeled documents. Then, click Save to save the corrections and mark the document as labeled. You can then assign the documents as appropriate.

Document labeling tasks

Labeling tasks let you outsource the labeling of your documents to a team of labeling specialists (either your own internal team or a third-party team) and manage their work.

Before creating a labeling task, make sure you have already defined a processor schema.

Create labeling tasks

  1. In the dataset tab, in the action panel, click Create labeling task.

  2. Specify the documents you want labeled in this task.

  3. Add labeling instructions. These instructions should cover details and corner cases that your specialists will encounter. See Best practices for labeling and sample labeling instructions. The labeling instructions must be stored in a publicly accessible storage bucket, or you will not be able to create the task.

  4. Click the Specialist pool drop-down list. If you do not already have a specialist pool that you want to use, click New Specialist Pool. Fill out the form with the pool name, the email addresses for the managers of the pool, and the email addresses for the specialists in the pool. Click Create Pool. An entry for the new specialist pool will appear in the Specialist pool drop-down list.

  5. If the specialist pool you want is in the Specialist pool drop-down list, select that option from the list. Select the specialists (the team of people who label the document). Consider the following factors when selecting the specialists:

    • Privacy: Do you want to control who has access to your documents?
    • Convenience: Do you already have a team of labeling specialists?

It may take several minutes for a newly created labeling task to appear in the list of labeling tasks on the Train tab.

Manage labeling tasks

Labeling managers will receive an automated email with a link to the Labeling Manager UI. The console is similar to that for Human in the Loop (HITL). The only difference is that the documents for labeling tasks are automatically assigned; the manager does not need to manually assign each document to a specialist. For more information, see Labeling Manager UI and Labeler Workbench Manager UI.

Specialists will receive an automated email with a link to the Specialists (Worker) Console where they can perform the labeling. The console for the specialists is the same as that for HITL specialists. For more information, see step 6 in the HITL Codelab.

You can monitor the labeling task's status in the processor's Train tab.

When the labeling task is complete, the labeled documents are automatically updated in your dataset in Document AI Workbench, although this may take several minutes.

You can stop a labeling task directly from Document AI Workbench at any time. If documents in the stopped labeling task were already labeled, these documents are automatically updated in your dataset with the new labels.

Import pre-labeled documents

You can import JSON Document files. If an entity in a document matches a label in the processor schema, the importer converts the entity to a label instance. There are several ways you can get JSON Document files:

  • Human-in-the-Loop (HITL) review.

  • Exporting a dataset from another processor. See Export dataset.

  • Sending a processing request to an existing processor.

  • Writing a script to convert existing labels from another system (for example, CSV labels) into JSON documents.
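As a sketch of the last option, the following script converts CSV-style labels into a minimal Document JSON dict. The CSV column names (`label`, `value`) and the substring search used to derive text anchors are illustrative assumptions, not a prescribed format; real conversion scripts typically need page anchors and more robust offset matching for repeated values.

```python
import csv
import io
import json

def csv_rows_to_document(text, rows):
    """Build a minimal Document JSON dict from labeled spans.

    `rows` is an iterable of dicts with hypothetical keys `label` and
    `value`. Each value is located in `text` by substring search to
    derive a text anchor (start/end offsets into the document text).
    """
    entities = []
    for row in rows:
        start = text.find(row["value"])
        if start == -1:
            continue  # value not present in the document text; skip it
        end = start + len(row["value"])
        entities.append({
            # The type must match a label in the processor schema for
            # the importer to convert it to a label instance.
            "type": row["label"],
            "mentionText": row["value"],
            "textAnchor": {
                # int64 offsets are serialized as strings in JSON
                "textSegments": [{"startIndex": str(start), "endIndex": str(end)}]
            },
        })
    return {"text": text, "entities": entities}

csv_data = "label,value\nfirst_name,Jane\nlast_name,Doe\n"
rows = list(csv.DictReader(io.StringIO(csv_data)))
doc = csv_rows_to_document("Jane Doe, 123 Main St.", rows)
print(json.dumps(doc, indent=2))
```

Each resulting JSON file can then be imported into the dataset like any other pre-labeled document.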

Best practices for labeling documents

Consistent labeling is required to train a high-quality processor.

Resync dataset

Resync keeps your dataset's Cloud Storage folder consistent with Document AI's internal metadata index. This is useful if you accidentally made changes to the Cloud Storage folder and want to synchronize the data.

To resync:

In the Processor Details tab, next to the Storage location row, click the more options menu, and then click Re-sync Dataset.

Usage notes:

  • If you delete a document from the Cloud Storage folder, resync will remove it from the dataset.
  • If you add a document to the Cloud Storage folder, resync will not add it to the dataset. To add documents, import them.
  • If you modify document labels in the Cloud Storage folder, resync will update the document labels in the dataset.

Migrate dataset

Import and export lets you move all the documents in a dataset from one processor to another. This can be useful if you have processors in different regions or Google Cloud projects, if you have different processors for staging and production, or for general offline consumption.

Note that only the documents and their labels are exported. Dataset metadata, such as the processor schema, document assignments (training/test/unassigned), and document labeling status (labeled/unlabeled/auto-labeled), is not exported.

Export dataset

To export all documents as JSON Document files to a Cloud Storage folder, click Export Dataset.

A few important things to note:

  1. During export, three sub-folders will be created: Test, Train, and Unassigned. Your documents are placed into those sub-folders accordingly.

  2. A document's labeling status is not exported. If you later import the documents, they will not be marked auto-labeled.

  3. If your Cloud Storage bucket is in a different Google Cloud project, make sure to grant access so that Document AI is allowed to write files to that location. Specifically, you must grant the Storage Object Creator role to Document AI's Core Service Agent service-{project-id}@gcp-sa-prod-dai-core.iam.gserviceaccount.com. For more information, see Google-managed service accounts.
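To illustrate the folder layout from step 1, here is a small sketch that counts the exported Document JSON files in each split sub-folder. It assumes the exported prefix has been copied to a local directory (for example, with a storage transfer or download); the function name and directory argument are illustrative.

```python
from pathlib import Path

def count_exported_documents(export_dir):
    """Count Document JSON files per split sub-folder of an export.

    Assumes the three sub-folders created by Export Dataset:
    Test, Train, and Unassigned.
    """
    counts = {}
    for split in ("Test", "Train", "Unassigned"):
        folder = Path(export_dir) / split
        # Missing folders count as zero rather than raising an error
        counts[split] = len(list(folder.glob("*.json"))) if folder.is_dir() else 0
    return counts
```

Because labeling status is not exported (step 2), a check like this only tells you how documents were assigned, not whether they were reviewed.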

Import dataset

The procedure is the same as Import documents.

What's next

Train a processor

Evaluate processor performance