
Evaluate the performance of processors

Document AI generates evaluation metrics, such as precision and recall, to help you determine the predictive performance of your processors.

These evaluation metrics are generated by comparing the entities returned by the processor (the predictions) against the annotations in the test documents. If your processor does not have a test set, then you must first create a dataset and label the test documents.

Get evaluation metrics

An evaluation is automatically run whenever you train or uptrain a processor version.

You can also manually run an evaluation. This is required to generate updated metrics after you've modified the test set, or if you are evaluating a pretrained processor version.

  1. In the Google Cloud console, go to the Processors page and choose your processor.

  2. In the Evaluate & Test tab, select the Version of the processor to evaluate and then click Run new evaluation.

Once the evaluation is complete, the page displays evaluation metrics for all labels and for each individual label.

Evaluation metrics for all labels

Metrics for All labels are computed based on the number of true positives, false positives, and false negatives in the dataset across all labels, and thus, are weighted by the number of times each label appears in the dataset. For definitions of these terms, see Evaluation metrics for individual labels.

  • Precision: the proportion of predictions that match the annotations in the test set. Defined as True Positives / (True Positives + False Positives)

  • Recall: the proportion of annotations in the test set that are correctly predicted. Defined as True Positives / (True Positives + False Negatives)

  • F1 score: the harmonic mean of precision and recall, which combines precision and recall into a single metric, providing equal weight to both. Defined as 2 * (Precision * Recall) / (Precision + Recall)
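The three formulas above can be sketched in a few lines. This is a minimal illustration of the definitions, not the Document AI implementation:

```python
def evaluation_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from aggregate counts of
    true positives (tp), false positives (fp), and false negatives (fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 8 true positives, 2 false positives, 4 false negatives.
metrics = evaluation_metrics(tp=8, fp=2, fn=4)
# precision = 8/10 = 0.8, recall = 8/12 ≈ 0.667, f1 ≈ 0.727
```

Because the all-labels counts pool true positives, false positives, and false negatives across every label, labels that appear more often in the dataset contribute more to these aggregate metrics.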

Evaluation metrics for individual labels

  • True Positives: the predicted entities that match an annotation in the test document. For more information, see matching behavior.

  • False Positives: the predicted entities that do not match any annotation in the test document.

  • False Negatives: the annotations in the test document that do not match any of the predicted entities.

    • False Negatives (Below Threshold): the annotations in the test document that would have matched a predicted entity, but the predicted entity's confidence value is below the specified confidence threshold.

Confidence threshold

The evaluation logic ignores any predictions with confidence below the specified Confidence Threshold, even if the prediction is correct. Document AI provides a list of False Negatives (Below Threshold), which are the annotations that would have a match if the confidence threshold were set lower.

Document AI automatically computes the optimal threshold, which maximizes the F1 score, and by default, sets the confidence threshold to this optimal value.

You can choose your own confidence threshold by moving the slider bar. In general, a higher confidence threshold results in:

  • higher precision, because the predictions are more likely to be correct.
  • lower recall, because there are fewer predictions.
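One plausible way to picture how an optimal threshold is found (an illustrative sketch, not Document AI's actual algorithm): evaluate F1 at each distinct prediction confidence and keep the threshold that maximizes it. The data shapes here are hypothetical.

```python
def sweep_thresholds(predictions, annotations):
    """predictions: list of (text, confidence) pairs for one label;
    annotations: list of gold text values.
    Returns (best_threshold, best_f1) over the distinct confidence values."""
    def f1_at(threshold):
        # Predictions below the threshold are ignored, even if correct.
        kept = [text for text, conf in predictions if conf >= threshold]
        gold = list(annotations)
        tp = 0
        for text in kept:
            if text in gold:          # each annotation can match only once
                gold.remove(text)
                tp += 1
        fp = len(kept) - tp
        fn = len(gold)
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        return 2 * p * r / (p + r) if (p + r) else 0.0

    candidates = sorted({conf for _, conf in predictions})
    return max(((t, f1_at(t)) for t in candidates), key=lambda pair: pair[1])

preds = [("$12.00", 0.95), ("$99.99", 0.60), ("bogus", 0.40)]
gold = ["$12.00", "$99.99"]
best_t, best_f1 = sweep_thresholds(preds, gold)
# At 0.40 the wrong "bogus" prediction hurts precision; at 0.95 the missed
# "$99.99" hurts recall; 0.60 keeps exactly the correct predictions.
```

Raising the threshold in this sketch drops low-confidence predictions, which is why precision tends to rise and recall tends to fall, exactly as the bullets above describe.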

Tabular entities

The metrics for a parent label are not calculated by directly averaging the child metrics, but rather, by applying the parent's confidence threshold to all of its child labels and aggregating the results.

The optimal threshold for the parent is the confidence threshold value that, when applied to all children, yields the maximum F1 score for the parent.

Matching behavior

A predicted entity matches an annotation if the entity type and the text value of both are the same.

Only the type and text value are used for matching. Other information, such as text anchors and bounding boxes (with the exception of tabular entities, described below), is not used.
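In code terms, matching on type and text value only could look like the following. This is a minimal sketch; the dict shape is illustrative, not the Document AI entity format:

```python
def matches(predicted, annotation, normalize=lambda text: text):
    """A prediction matches an annotation when the entity type and the
    (optionally normalized) text value both agree; text anchors and
    bounding boxes play no role for non-tabular entities."""
    return (predicted["type"] == annotation["type"]
            and normalize(predicted["text"]) == normalize(annotation["text"]))

matches({"type": "invoice_id", "text": "INV-42"},
        {"type": "invoice_id", "text": "INV-42"})   # True
matches({"type": "total_amount", "text": "INV-42"},
        {"type": "invoice_id", "text": "INV-42"})   # False: types differ
```

The `normalize` hook is where the fuzzy-matching rules described below would plug in.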

Single-occurrence vs. multi-occurrence labels

Single-occurrence labels have one value per document (for example, an invoice ID), even if that value is annotated multiple times in the same document (for example, the invoice ID appears on every page). Even if the multiple annotations have different text, they are considered equal. In other words, if a predicted entity matches any of the annotations, it counts as a match. The extra annotations are considered duplicate mentions and do not contribute to the true positive, false positive, or false negative counts.

Multi-occurrence labels can have multiple, different values. Thus, each predicted entity and annotation is considered and matched separately. If a document contains N annotations for a multi-occurrence label, then there can be N matches with the predicted entities. Each predicted entity and annotation are independently counted as a true positive, false positive, and/or false negative.
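The two counting rules can be sketched as follows. This is illustrative only; the helper names and the simple text-equality matching are assumptions:

```python
def single_occurrence_counts(predicted, annotation_mentions):
    """Single-occurrence label: all annotations are mentions of one value, so
    a prediction matching any mention yields a single true positive, and
    duplicate mentions add nothing to the counts."""
    if predicted is None:
        return {"tp": 0, "fp": 0, "fn": 1 if annotation_mentions else 0}
    if predicted in annotation_mentions:
        return {"tp": 1, "fp": 0, "fn": 0}
    return {"tp": 0, "fp": 1, "fn": 1 if annotation_mentions else 0}

def multi_occurrence_counts(predicted, annotations):
    """Multi-occurrence label: every prediction and annotation is matched and
    counted independently."""
    remaining = list(annotations)
    tp = 0
    for value in predicted:
        if value in remaining:      # each annotation can match only once
            remaining.remove(value)
            tp += 1
    return {"tp": tp, "fp": len(predicted) - tp, "fn": len(remaining)}

single_occurrence_counts("INV-42", ["INV-42", "INV-42"])
# -> {'tp': 1, 'fp': 0, 'fn': 0}: the duplicate mention is ignored
multi_occurrence_counts(["a", "b", "b"], ["a", "b", "c"])
# -> {'tp': 2, 'fp': 1, 'fn': 1}: extra "b" is a false positive, "c" is missed
```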

Fuzzy Matching

The Fuzzy Matching toggle lets you tighten or relax some of the matching rules to decrease or increase the number of matches.

For example, without fuzzy matching, the string ABC will not match abc due to capitalization. But with fuzzy matching, they will match.

When fuzzy matching is enabled, here are the rule changes:

  • Whitespace normalization: removes leading/trailing whitespace and condenses consecutive intermediate whitespaces (including newlines) into single spaces.

  • Leading/trailing punctuation removal: removes the following leading/trailing punctuation characters: !,.:;?|

  • Case-insensitive matching: converts all characters to lowercase.

  • Money normalization: for labels with the data type money, removes leading/trailing currency symbols.
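Taken together, the rules above amount to a normalization step applied to both texts before comparison. A rough sketch follows; the currency-symbol set is an assumption, since the documentation does not enumerate it:

```python
import re

_EDGE_PUNCT = "!,.:;?|"   # punctuation stripped at the edges, per the list above
_CURRENCY = "$€£¥"        # assumed currency symbols; not specified in the docs

def fuzzy_normalize(text: str, is_money: bool = False) -> str:
    """Apply the fuzzy-matching normalizations to one text value."""
    text = re.sub(r"\s+", " ", text).strip()   # whitespace normalization
    text = text.strip(_EDGE_PUNCT)             # leading/trailing punctuation removal
    text = text.lower()                        # case-insensitive matching
    if is_money:
        text = text.strip(_CURRENCY).strip()   # money: drop edge currency symbols
    return text

fuzzy_normalize("  ABC\n corp. ")             # -> "abc corp"
fuzzy_normalize("$1,250.00", is_money=True)   # -> "1,250.00"
```

Note that only leading and trailing characters are stripped: internal punctuation such as the comma in "1,250.00" is preserved.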

Tabular entities

Parent entities and annotations do not have text values and are matched based on the combined bounding boxes of their children. If there is only one predicted parent and one annotated parent, they are automatically matched, regardless of bounding boxes.

Once parents are matched, their children are matched as if they were non-tabular entities. If parents are not matched, Document AI will not attempt to match their children. This means that child entities can be considered incorrect, even with the same text contents, if their parent entities are not matched.
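A combined bounding box is simply the smallest box enclosing all child boxes. The page does not specify the exact overlap criterion used to compare parent boxes, so the intersection-over-union helper below is only an assumed example:

```python
def combined_bbox(child_boxes):
    """Union bounding box over child boxes, each (x_min, y_min, x_max, y_max)."""
    x0s, y0s, x1s, y1s = zip(*child_boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))

def iou(a, b):
    """Intersection-over-union of two boxes; an assumed overlap criterion,
    since the documentation does not say how parent boxes are compared."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    def area(box):
        return (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

# Two table-row children spanning adjacent regions combine into one parent box.
combined_bbox([(0, 0, 2, 1), (1, 0, 3, 2)])   # -> (0, 0, 3, 2)
```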

Export evaluation metrics

  1. In the Google Cloud console, go to the Processors page and choose your processor.

  2. In the Evaluate & Test tab, click Download Metrics to download the evaluation metrics as a JSON file.
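Once downloaded, the file can be loaded like any other JSON document. The schema of the exported file is not documented on this page, so this sketch only parses it; the filename is hypothetical:

```python
import json

def load_exported_metrics(path):
    """Load the JSON file saved via Download Metrics. Inspect the result
    before relying on specific fields, since the exported schema is not
    documented here."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical filename for the downloaded file.
# metrics = load_exported_metrics("evaluation_metrics.json")
```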

What's next

Once you're ready to process documents, see Send a processing request.

Or, if the processor's quality is insufficient, refine your dataset and train a new processor version.