Evaluate performance
Document AI generates evaluation metrics, such as precision and recall, to help you determine the predictive performance of your processors.
These evaluation metrics are generated by comparing the entities returned by the processor (the predictions) against the annotations in the test documents. If your processor does not have a test set, then you must first create a dataset and label the test documents.
Run an evaluation
An evaluation is automatically run whenever you train or uptrain a processor version.
You can also manually run an evaluation. This is required to generate updated metrics after you've modified the test set, or if you are evaluating a pretrained processor version.
Web UI
In the Google Cloud console, go to the Processors page and choose your processor.
In the Evaluate & Test tab, select the Version of the processor to evaluate and then click Run new evaluation.
Once complete, the page contains evaluation metrics for all labels and for each individual label.
Python
For more information, see the Document AI Python API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
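The following is a minimal sketch of running an evaluation programmatically with the google-cloud-documentai Python client, assuming a library version that exposes evaluate_processor_version; the project, location, processor, and processor version IDs are placeholders you must replace.

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai  # package: google-cloud-documentai

# Placeholder values -- replace with your own resource IDs.
project_id = "your-project-id"
location = "us"  # or "eu"
processor_id = "your-processor-id"
processor_version_id = "your-processor-version-id"

# The API endpoint is regional, so point the client at the processor's location.
client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
)

processor_version_name = client.processor_version_path(
    project_id, location, processor_id, processor_version_id
)

# Start the evaluation. Omitting evaluation_documents evaluates the version
# against the processor's existing test set.
operation = client.evaluate_processor_version(
    request=documentai.EvaluateProcessorVersionRequest(
        processor_version=processor_version_name
    )
)

# Evaluation is a long-running operation; block until it finishes.
response = operation.result()
print(f"Evaluation created: {response.evaluation}")
```

The call returns a long-running operation, so the sketch blocks on operation.result() before reading the resource name of the newly created evaluation.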
Get results of an evaluation
Web UI
In the Google Cloud console, go to the Processors page and choose your processor.
In the Evaluate & Test tab, select the Version of the processor whose evaluation you want to view.
Once complete, the page contains evaluation metrics for all labels and for each individual label.
Python
For more information, see the Document AI Python API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
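A minimal sketch of fetching a single evaluation with the same client library; the evaluation ID is a placeholder (for example, taken from the response of evaluate_processor_version or from list_evaluations).

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# Placeholder values -- replace with your own resource IDs.
project_id = "your-project-id"
location = "us"
processor_id = "your-processor-id"
processor_version_id = "your-processor-version-id"
evaluation_id = "your-evaluation-id"

client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
)

# Evaluations are child resources of a processor version.
evaluation_name = (
    f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    f"/processorVersions/{processor_version_id}/evaluations/{evaluation_id}"
)

evaluation = client.get_evaluation(name=evaluation_name)

print(f"Evaluation: {evaluation.name}")
print(f"Create time: {evaluation.create_time}")
# Aggregate and per-label metrics are carried on the Evaluation message
# (for example, all_entities_metrics and entity_metrics).
print(evaluation.document_counters)
```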
List all evaluations for a processor version
Python
For more information, see the Document AI Python API reference documentation.
To authenticate to Document AI, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
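A minimal sketch of listing all evaluations for a processor version with the same client library; resource IDs are placeholders.

```python
from google.api_core.client_options import ClientOptions
from google.cloud import documentai

# Placeholder values -- replace with your own resource IDs.
project_id = "your-project-id"
location = "us"
processor_id = "your-processor-id"
processor_version_id = "your-processor-version-id"

client = documentai.DocumentProcessorServiceClient(
    client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
)

parent = client.processor_version_path(
    project_id, location, processor_id, processor_version_id
)

# list_evaluations returns a pager; iterating over it fetches every page.
for evaluation in client.list_evaluations(parent=parent):
    print(evaluation.name, evaluation.create_time)
```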
Evaluation metrics for all labels
Metrics for All labels are computed based on the number of true positives, false positives, and false negatives in the dataset across all labels, and thus, are weighted by the number of times each label appears in the dataset. For definitions of these terms, see Evaluation metrics for individual labels.
Precision: the proportion of predictions that match the annotations in the test set. Defined as
True Positives / (True Positives + False Positives)
Recall: the proportion of annotations in the test set that are correctly predicted. Defined as
True Positives / (True Positives + False Negatives)
F1 score: the harmonic mean of precision and recall, combining them into a single metric that gives equal weight to both. Defined as
2 * (Precision * Recall) / (Precision + Recall)
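As a quick worked example of these formulas (the counts below are illustrative, not output from a real evaluation):

```python
# Illustrative counts only -- not from a real evaluation.
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)                           # 80 / 100 = 0.80
recall = tp / (tp + fn)                              # 80 / 120 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.73

print(precision, recall, f1)
```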
Evaluation metrics for individual labels
- True Positives: the predicted entities that match an annotation in the test document. For more information, see matching behavior.
- False Positives: the predicted entities that don't match any annotation in the test document.
- False Negatives: the annotations in the test document that don't match any of the predicted entities.
  - False Negatives (Below Threshold): the annotations in the test document that would have matched a predicted entity, but the predicted entity's confidence value is below the specified confidence threshold.
Confidence threshold
The evaluation logic ignores any predictions with confidence below the specified Confidence Threshold, even if the prediction is correct. Document AI provides a list of False Negatives (Below Threshold), which are the annotations that would have a match if the confidence threshold were set lower.
Document AI automatically computes the optimal threshold, which maximizes the F1 score, and by default, sets the confidence threshold to this optimal value.
You can choose your own confidence threshold by moving the slider bar. In general, a higher confidence threshold results in:
- higher precision, because the predictions are more likely to be correct.
- lower recall, because there are fewer predictions.
Tabular entities
The metrics for a parent label are not calculated by directly averaging the child metrics, but rather, by applying the parent's confidence threshold to all of its child labels and aggregating the results.
The optimal threshold for the parent is the confidence threshold value that, when applied to all children, yields the maximum F1 score for the parent.
Matching behavior
A predicted entity matches an annotation if:
- the type of the predicted entity (entity.type) matches the annotation's label name
- the value of the predicted entity (entity.mention_text or entity.normalized_value.text) matches the annotation's text value, subject to fuzzy matching if it is enabled.
Note that only the type and text value are used for matching. Other information, such as text anchors and bounding boxes (with the exception of tabular entities, described below), is not used.
Single- versus multi-occurrence labels
Single-occurrence labels have one value per document (for example, invoice ID) even if that value is annotated multiple times in the same document (for example, the invoice ID appears on every page of the same document). Even if the multiple annotations have different text, they are considered equal. In other words, if a predicted entity matches any of the annotations, it counts as a match. The extra annotations are considered duplicate mentions and don't contribute towards any of the true positive, false positive, or false negative counts.
Multi-occurrence labels can have multiple, different values. Thus, each predicted entity and annotation is considered and matched separately. If a document contains N annotations for a multi-occurrence label, then there can be N matches with the predicted entities. Each predicted entity and annotation are independently counted as a true positive, false positive, or false negative.
Fuzzy Matching
The Fuzzy Matching toggle lets you tighten or relax some of the matching rules to decrease or increase the number of matches.
For example, without fuzzy matching, the string ABC does not match abc due to capitalization. But with fuzzy matching, they match.
When fuzzy matching is enabled, here are the rule changes:
- Whitespace normalization: removes leading and trailing whitespace and condenses consecutive intermediate whitespace (including newlines) into single spaces.
- Leading/trailing punctuation removal: removes the following leading and trailing punctuation characters: !,.:;-"?|
- Case-insensitive matching: converts all characters to lowercase.
- Money normalization: for labels with the data type money, removes leading and trailing currency symbols.
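The sketch below is only an approximation of what these rules amount to, useful for reasoning about which values would match; it is not Document AI's actual implementation, and the currency symbol set is an assumption made for the example.

```python
import re

# Illustrative only: approximates the fuzzy-matching normalization rules
# described above. Not Document AI's actual implementation.
PUNCT = '!,.:;-"?|'
CURRENCY = "$€£¥"  # assumption: a small set of currency symbols for the example


def normalize(text: str, is_money: bool = False) -> str:
    text = re.sub(r"\s+", " ", text).strip()  # whitespace normalization
    text = text.strip(PUNCT)                  # leading/trailing punctuation removal
    text = text.lower()                       # case-insensitive matching
    if is_money:
        text = text.strip(CURRENCY + " ")     # money normalization
    return text


# With these rules, "  ABC. " and "abc" normalize to the same value and would match.
assert normalize("  ABC. ") == normalize("abc")
```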
Tabular entities
Parent entities and annotations don't have text values and are matched based on the combined bounding boxes of their children. If there is only one predicted parent and one annotated parent, they are automatically matched, regardless of bounding boxes.
Once parents are matched, their children are matched as if they were non-tabular entities. If parents are not matched, Document AI won't attempt to match their children. This means that child entities can be considered incorrect, even with the same text contents, if their parent entities are not matched.
Parent / child entities are a Preview feature and only supported for tables with one layer of nesting.
Export evaluation metrics
In the Google Cloud console, go to the Processors page and choose your processor.
In the Evaluate & Test tab, click Download Metrics to download the evaluation metrics as a JSON file.