This page describes how to use the projects.locations.datasets.annotationStores.evaluate
method to evaluate the quality of annotation records generated by a machine
learning algorithm.
Overview
The evaluate method compares the annotation records in one annotation store (eval_store) to a manually annotated ground truth annotation store (golden_store) that describes the same resource. The annotated resource is defined in each store's AnnotationSource.
The annotation records in eval_store or golden_store can be generated individually with projects.locations.datasets.annotationStores.annotations.create, or by:
- Calling datasets.deidentify with an AnnotationConfig object
- Calling projects.locations.datasets.annotationStores.import
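For orientation, here is a minimal Python sketch of starting an evaluation with the discovery-based Google API client. It is not a verified sample: the placeholder resource names and the request body fields (goldenStore, bigqueryDestination.tableUri) are assumptions inferred from this page, so check EvaluateAnnotationStore for the authoritative request schema.

```python
# Minimal sketch, assuming the v1beta1 surface of the Cloud Healthcare API.
# The request body field names below are assumptions; see
# EvaluateAnnotationStore for the exact schema.
from googleapiclient import discovery

client = discovery.build("healthcare", "v1beta1")

# Placeholder resource names; substitute your project, location, dataset,
# and annotation store IDs.
eval_store = (
    "projects/my-project/locations/us-central1/datasets/my-dataset/"
    "annotationStores/eval-store"
)
golden_store = (
    "projects/my-project/locations/us-central1/datasets/my-dataset/"
    "annotationStores/golden-store"
)

body = {
    "goldenStore": golden_store,  # assumed field name for the ground truth store
    # Assumed field names for the BigQuery table that receives the output row.
    "bigqueryDestination": {"tableUri": "bq://my-project.my_dataset.eval_results"},
}

# evaluate returns a long-running operation (LRO); its name also appears in
# the opName column of the output row described under "Evaluation output".
operation = (
    client.projects()
    .locations()
    .datasets()
    .annotationStores()
    .evaluate(name=eval_store, body=body)
    .execute()
)
print("Started evaluation LRO:", operation["name"])
```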
Evaluation requirements
To perform evaluation, the following conditions must be met:
- In the eval_store, each annotated resource defined in AnnotationSource can have only one annotation record for each annotation type.
- SensitiveTextAnnotation must store the quotes obtained from the annotated resource. If you generated annotation records using datasets.deidentify, set store_quote in AnnotationConfig to true.
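If the annotation records were produced by datasets.deidentify, the relevant part of the request is the AnnotationConfig. Below is a hedged sketch of that fragment as a Python dict: storeQuote is the JSON form of the store_quote field mentioned above, while destinationDataset and annotationStoreName are assumed field names, shown only to give the fragment context.

```python
# Sketch of a datasets.deidentify request body that writes annotation records
# with quotes enabled. "annotationStoreName" and "destinationDataset" are
# assumed field names; "storeQuote" corresponds to store_quote above.
deidentify_body = {
    "destinationDataset": (
        "projects/my-project/locations/us-central1/datasets/deid-dataset"
    ),
    "config": {
        "annotation": {
            "annotationStoreName": (
                "projects/my-project/locations/us-central1/datasets/my-dataset/"
                "annotationStores/eval-store"
            ),
            # Required so SensitiveTextAnnotation records keep the quotes that
            # evaluation compares against the golden_store.
            "storeQuote": True,
        },
    },
}
```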
Evaluation output
The evaluate method reports the evaluation metrics to BigQuery. The method outputs a row in a specified BigQuery table with the following schema:
Field name | Type | Mode | Description |
---|---|---|---|
opTimestamp | TIMESTAMP | NULLABLE | Timestamp when the method was called |
opName | STRING | NULLABLE | Name of the evaluate long-running operation (LRO) |
evalStore | STRING | NULLABLE | Name of the eval_store |
goldenStore | STRING | NULLABLE | Name of the golden_store |
goldenCount | INTEGER | NULLABLE | Number of annotation records in the golden_store |
matchedCount | INTEGER | NULLABLE | Number of annotation records in the eval_store matched to the annotation records in the golden_store |
averageResults | RECORD | NULLABLE | Average results across all infoTypes |
averageResults.sensitiveTextMetrics | RECORD | NULLABLE | Average results for SensitiveTextAnnotation |
averageResults.sensitiveTextMetrics.truePositives | INTEGER | NULLABLE | Number of correct predictions |
averageResults.sensitiveTextMetrics.falsePositives | INTEGER | NULLABLE | Number of incorrect predictions |
averageResults.sensitiveTextMetrics.falseNegatives | INTEGER | NULLABLE | Number of predictions that were missed |
averageResults.sensitiveTextMetrics.precision | FLOAT | NULLABLE | truePositives / (truePositives + falsePositives); ranges from [0..1], where 1.0 indicates all correct predictions |
averageResults.sensitiveTextMetrics.recall | FLOAT | NULLABLE | truePositives / (truePositives + falseNegatives); ranges from [0..1], where 1.0 indicates no missed predictions |
averageResults.sensitiveTextMetrics.fScore | FLOAT | NULLABLE | 2 * precision * recall / (precision + recall); harmonic mean of precision and recall, ranges from [0..1], where 1.0 indicates perfect predictions |
infoResults | RECORD | REPEATED | Similar to averageResults, but broken down per infoType |
infoResults.sensitiveTextMetrics | RECORD | NULLABLE | Per-infoType results for SensitiveTextAnnotation |
infoResults.sensitiveTextMetrics.infoType | STRING | NULLABLE | infoType category |
infoResults.sensitiveTextMetrics.truePositives | INTEGER | NULLABLE | Number of correct predictions |
infoResults.sensitiveTextMetrics.falsePositives | INTEGER | NULLABLE | Number of incorrect predictions |
infoResults.sensitiveTextMetrics.falseNegatives | INTEGER | NULLABLE | Number of predictions that were missed |
infoResults.sensitiveTextMetrics.precision | FLOAT | NULLABLE | truePositives / (truePositives + falsePositives); ranges from [0..1], where 1.0 indicates all correct predictions |
infoResults.sensitiveTextMetrics.recall | FLOAT | NULLABLE | truePositives / (truePositives + falseNegatives); ranges from [0..1], where 1.0 indicates no missed predictions |
infoResults.sensitiveTextMetrics.fScore | FLOAT | NULLABLE | 2 * precision * recall / (precision + recall); harmonic mean of precision and recall, ranges from [0..1], where 1.0 indicates perfect predictions |
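To show how the per-infoType rows can be consumed, here is a sketch that reads them back with the google-cloud-bigquery client. The table ID is a placeholder; the column names follow the schema above, and because infoResults is a REPEATED RECORD it is flattened with UNNEST.

```python
# Sketch: read per-infoType precision/recall/fScore from the evaluation
# output table. The table ID is a placeholder; column names follow the
# schema documented above.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  opName,
  r.sensitiveTextMetrics.infoType AS infoType,
  r.sensitiveTextMetrics.precision AS precision,
  r.sensitiveTextMetrics.recall AS recall,
  r.sensitiveTextMetrics.fScore AS fScore
FROM `my-project.my_dataset.eval_results`,
  UNNEST(infoResults) AS r
ORDER BY opTimestamp DESC
"""

for row in client.query(query).result():
    print(row.infoType, row.precision, row.recall, row.fScore)
```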
Refer to EvaluateAnnotationStore for a detailed definition of the method.