Create a Custom Document Classifier in the Google Cloud console
You can create Custom Document Classifiers (CDC) that are specifically suited to your documents, and trained and evaluated with your data. This processor identifies classes of documents from a user-defined set of classes. You can then use this trained processor on additional documents. You typically would use a CDC on documents that are different types, then use the identification to pass the documents to an extraction processor to extract the entities.
A typical workflow to create and use a CDC is as follows:
- Create a Custom Document Classifier in Document AI Workbench.
- Create a dataset using an empty Cloud Storage bucket.
- Define and create the processor schema.
- Import documents.
- Assign documents to the Training and Test sets.
- Annotate documents manually in Document AI Workbench or with Labeling Tasks.
- Train the processor.
- Evaluate the processor.
- Deploy the processor.
- Test the processor.
- Configure Human-in-the-Loop (HITL) for review.
- Use the processor on your documents.
You can make your own configuration choices that suit your workflow.
To follow step-by-step guidance for this task directly in the Google Cloud console, click Guide me:
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Document AI, Cloud Storage APIs.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
Enable the Document AI, Cloud Storage APIs.
Create a processor
In the Google Cloud console, in the Document AI section, go to the Workbench page.
For Custom Document Classifier, click
Create processor .In the Create processor menu, enter a name for your processor, such as
my-custom-document-classifier
.Select the region closest to you.
Click Create. The Processor Details tab appears.
Create a Cloud Storage bucket for the dataset
In order to train this new processor, you must create a dataset with training and testing data to help the processor identify the documents that you want to classify.
This dataset requires a new Cloud Storage bucket. Do not use the same bucket where your documents are currently stored.
Go to your processor's
Train tab.Click
Set Dataset Location . You are prompted to select or create an empty Cloud Storage bucket or folder.Click
Browse to open Select folder.Click the
Create a new bucket icon and follow the prompts to create a new bucket. For more information on creating a Cloud Storage bucket, refer to Cloud Storage buckets.Note: A bucket is the top-level storage entity, in which you can nest folders. Instead of creating and selecting a bucket, you could also create and select an empty folder inside an existing bucket, if you prefer. Refer to Cloud Storage folders.
After you create the bucket, the Select folder page appears for that bucket.
On the Select folder page for your bucket, click the
Select button at the bottom of the dialog box.Make sure the destination path is populated with the bucket name you selected. Click
Create Dataset . The dataset might take up to several minutes to create.
Import documents into a dataset
Next, you will import your documents into your dataset.
On the Train tab, click
Import documents .For this example, enter this bucket name in
Source path . This links directly to one document.cloud-samples-data/documentai/Custom/Patents/PDF/computer_vision_20.pdf
For Data split, select Unassigned. The document in this folder will not be assigned to either the testing or training set. Leave Import with auto-labeling unchecked.
Click Import. Document AI reads the documents from the bucket into the dataset. It does not modify the import bucket or read from the bucket after the import is complete.
When you import documents, you can optionally assign the documents to the Training or Test set when imported, or wait to assign them later.
If you want to delete a document or documents that you have imported, select them on the Train tab, and click Delete.
For more information about preparing your data for import, refer to the Data preparation guide.
Define processor schema
You can create the processor schema either before or after you import documents into your dataset. The schema provides labels that you will use to annotate documents.
On the Train tab, click
Edit Schema in the lower left. The Manage labels page opens.Click
Create label .Enter the name for the label. Select the Data type. Click Create. Refer to Define processor schema for detailed instructions on creating and editing a schema.
Note: Labels cannot be deleted. Instead, you can disable any label you do not want to use.
Create each of the following labels for the processor schema.
Name Data Type computer_vision
Document type crypto
Document type med_tech
Document type other
Document type Click
Save when the labels are complete.
Label a document
The process of selecting text in a document, and applying labels is known as annotation.
Return to the Train tab, and click
a document to open the Label management console.In the
Document type dropdown , select the appropriate label for the document.If you're using the sample document provided, select
computer_vision
.The labeled document should look like this when complete:
Click
Mark as Labeled when you have finished annotating the document.On the Train tab, the left-hand panel shows that 1 document has been labeled.
Assign annotated document to the training set
Now that you have labeled this example document, you can assign it to the training set.
On the Train tab, select the
Select All checkbox.From the
Assign to Set list, select Training.
In the left-hand panel, you can find that 1 document has been assigned to the training set.
Import pre-labeled data to the training and test sets
In this guide, you are provided with pre-labeled data.
If working on your own project, you will have to determine how to label your data. Refer to Labeling options. Document AI Custom Processors require a minimum of 10 documents in both the training and test sets, along with 10 instances of each label in each set. We recommend that you have at least 50 documents in each set, with 50 instances of each label for best performance. In general, more training data produces higher accuracy.
Click
Import documents .Enter the following path in
Source path . This bucket contains pre-labeled documents in the Document JSON format.cloud-samples-data/documentai/Custom/Patents/JSON/Classification-InventionType
From the Data split list, select Auto-split. This automatically splits the documents to have 80% in the training set, and 20% in the test set. Ignore the Apply labels section.
Click Import. The import might take several minutes to complete.
When the import is finished, you will find the documents on the Train tab.
Optional: Batch label documents at import
After the schema has been configured, you can label all documents that are in a particular directory at import to save time with labeling.
Click
Import documents .Enter the following path in
Source path . This bucket contains unlabeled documents in PDF format.cloud-samples-data/documentai/Custom/Patents/PDF-CDC-BatchLabel
From the Data split list, select Auto-split. This automatically splits the documents to have 80% in the training set, and 20% in the test set.
In the Apply labels section, select Choose label.
For these sample documents, select
other
.Click Import and wait for the documents to import. You can leave this page and return later.
When the import is finished, you will find the documents on the Train tab with the label applied.
Train the processor
Now that you have imported the training and test data, you can train the processor. Because training might take several hours, make sure you have set up the processor with the appropriate data and labels before you begin training.
Click
Train New Version .In the
Version name field, enter a name for this processor version, such asmy-cdc-version-1
.(Optional) Click View Label Stats to find information about the document labels. That can help determine your coverage. Click Close to return to the training setup.
Click
Start training You can check the status on the right-hand panel.
Deploy the processor version
After training is complete, navigate to the
Manage Versions tab. You can view details about the version you just trained.Click the
three vertical dots on the right of the version you want to deploy, and select Deploy version.Select
Deploy from the popup window.Deployment takes a few minutes to complete.
Evaluate and test the processor
After deployment is complete, navigate to the
Evaluate & Test tab.On this page, you can view evaluation metrics including the F1 score, Precision and Recall for the full document, and individual labels. For more information about evaluation and statistics, refer to Evaluate processor.
Download a document that has not been involved in previous training or testing so that you can use it to evaluate the processor version. If using your own data, you would use a document set aside for this purpose.
Click
Upload Test Document and select the document you just downloaded.The Custom Document Classifier analysis page opens. The screen output will demonstrate how well the document was classified.
You can also re-run the evaluation against a different test set or processor version.
Optional: Auto-label newly imported documents
After deploying a trained processor version, you can use Auto-labeling to save time on labeling when importing new documents.
On the Train page,
Import documents .Copy and paste the following Cloud Storage path. This directory contains 5 unlabeled Patent PDFs. From the Data split dropdown list, select Training.
cloud-samples-data/documentai/Custom/Patents/PDF-CDC-AutoLabel
In the Apply labels section, select Auto-labeling.
Select an existing processor version to label the documents.
- For example:
2af620b2fd4d1fcf
- For example:
Click Import and wait for the documents to import. You can leave this page and return later.
- When complete, the documents appear in the Train page in the Auto-labeled section.
You cannot use auto-labeled documents for training or testing without marking them as labeled. Go to the
Auto-labeled section to view the auto-labeled documents.Select the first document to enter the labeling console.
Verify the label to ensure it is correct. Adjust if it is incorrect.
Select
Mark as Labeled when finished.Repeat the label verification for each auto-labeled document, then return to the Train page to use the data for training.
Use the processor
You have successfully created and trained a Custom Document Classifier processor.
You can manage your custom-trained processor versions just like any other processor version. For more information, refer to Managing processor versions.
You can Send a processing request to your custom processor, and the response can be handled the same as other classifier processors.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
To avoid unnecessary Google Cloud charges, use the Google Cloud console to delete your processor and project if you do not need them.
If you created a new project to learn about Document AI and you no longer need the project, delete the project.
If you used an existing Google Cloud project, delete the resources you created to avoid incurring charges to your account:
In the Google Cloud console navigation menu, click Document AI and select My Processors.
Click
More actions in the same row as the processor you want to delete.Click Delete processor, type the processor name, then click Delete again to confirm.