Creating a continuous evaluation job

A continuous evaluation job defines how AI Platform Data Labeling Service performs continuous evaluation on a model version that you have deployed to AI Platform Prediction. When you create an evaluation job for a model version, two things start to happen:

  • As the model version serves online predictions, the input and output for some of these predictions get saved in a BigQuery table.
  • At a regular interval, the continuous evaluation job runs, performing the following tasks:
    1. The job creates a Data Labeling Service dataset with all the new rows added to the BigQuery table since the last run.
    2. (Optional) The job submits a labeling request to have human reviewers provide ground truth labels for the predictions.
    3. The job calculates a set of evaluation metrics, which you can view in the Google Cloud console.

Before you begin

Before you begin, you must deploy a model version to AI Platform Prediction that meets specific model requirements. You must also enable certain Google Cloud APIs. Read Before you begin continuous evaluation to learn how to meet these requirements.

Configure basic options

The following sections describe how to navigate to the job creation page and configure basic details for your evaluation job.

To create an evaluation job for a model version, navigate to the page for that model version in the Google Cloud console and open the job creation page:

  1. Open the AI Platform models page in the Google Cloud console:

    Go to the AI Platform models page

  2. Click the name of the model that contains the model version you want to create an evaluation job for.

  3. Click the name of the model version that you want to create an evaluation job for. It cannot already have an evaluation job attached to it.

  4. Click the Evaluation tab. Then click Set up evaluation job.

Specify description, model objective, labels, and sampling percentage

The following steps describe the basic configuration details that you must specify in the job creation form:

  1. Add a description for your evaluation job in the Job description field.

  2. Specify what type of task your machine learning model performs in the Model objective field. Learn more about the types of machine learning models supported by continuous evaluation.

    • If your model performs classification, specify whether it performs single label classification or multilabel classification in the Classification type field.
    • If your model performs image object detection, specify an intersection-over-union (IOU) minimum between 0 and 1. This threshold defines how closely a predicted bounding box must match a ground truth bounding box for the prediction to count as correct. (A worked example appears after this list.)
  3. In the Prediction label file path field, specify the path to a CSV file in Cloud Storage that contains the possible labels for your model's predictions. This file defines an annotation specification set for your model. Learn how to structure this file.

  4. In the Daily sample percentage field, specify the percentage of predictions served by your model version that you want exported to BigQuery and analyzed as part of continuous evaluation.

    Additionally, specify a Daily sample limit to set a maximum for the number of predictions you want sampled during any single evaluation period.

    For example, suppose you want to sample 10% of predictions for continuous evaluation, but on days with heavy traffic you never want to sample more than 100 predictions for that day's evaluation job run. (A large number of predictions can take human reviewers a long time to label and can incur more Data Labeling Service costs than you expect.) The following sketch illustrates both the IOU and the sampling calculations.
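
A minimal Python sketch of the two calculations described above. The box format (x_min, y_min, x_max, y_max) and the traffic numbers are assumptions for illustration, not a product API:

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0

# With an IOU minimum of 0.5, this prediction counts as correct:
print(iou((53, 22, 98, 150), (50, 25, 100, 145)))  # ~0.85

# Daily sampling: 10% of served predictions, capped at 100 per run.
predictions_served = 5000
sampled = min(round(predictions_served * 0.10), 100)
print(sampled)  # 100 (the daily sample limit applies)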

Specify your BigQuery table

In the BigQuery table field, you must specify the name of a BigQuery table where Data Labeling Service can store predictions sampled from your model version.

If you specify the name of a table that doesn't exist yet, Data Labeling Service creates a table with the correct schema for you.

You must provide the full name of the table in the following format: bq://YOUR_PROJECT_ID.YOUR_DATASET_NAME.YOUR_TABLE_NAME

  • YOUR_PROJECT_ID must be the ID of the project where you are currently creating an evaluation job.
  • YOUR_DATASET_NAME can be any valid BigQuery dataset name. The dataset does not need to exist yet.
  • YOUR_TABLE_NAME can be any valid BigQuery table name.

If the table you specify already exists, it must have the following schema for continuous evaluation:

Field name        Type        Mode
model             STRING      REQUIRED
model_version     STRING      REQUIRED
time              TIMESTAMP   REQUIRED
raw_data          STRING      REQUIRED
raw_prediction    STRING      NULLABLE
groundtruth       STRING      NULLABLE

The table must not have any additional columns besides these.
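
Data Labeling Service can create this table for you, but if you prefer to create it yourself, the following sketch shows one way to do so with the google-cloud-bigquery Python client library. The project, dataset, and table names are placeholders; note that the Python client uses the plain PROJECT.DATASET.TABLE form rather than the bq:// prefix.

from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

# The dataset must exist before you can create a table in it.
client.create_dataset("YOUR_DATASET_NAME", exists_ok=True)

schema = [
    bigquery.SchemaField("model", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("model_version", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("time", "TIMESTAMP", mode="REQUIRED"),
    bigquery.SchemaField("raw_data", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("raw_prediction", "STRING", mode="NULLABLE"),
    bigquery.SchemaField("groundtruth", "STRING", mode="NULLABLE"),
]

table = bigquery.Table(
    "YOUR_PROJECT_ID.YOUR_DATASET_NAME.YOUR_TABLE_NAME", schema=schema
)
client.create_table(table)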

Specify prediction keys

You must specify the keys to certain fields in your input so that Data Labeling Service can extract the necessary information from the raw prediction input and output JSON stored in your BigQuery table. Make sure your model version accepts input and returns predictions in the required format, then provide the relevant keys (a sketch of how these keys are resolved follows this list):

  • Data key: The key to a field in your model version's prediction input that contains the data used for prediction. If you enable human labeling, Data Labeling Service gives this data to human reviewers so they can provide ground truth labels. It also uses this data to show side-by-side comparisons when you view evaluation metrics in the Google Cloud console.

    If your model version performs text classification or general classification, you must provide this key. If your model version performs image classification or image object detection and accepts base64-encoded images as prediction input, you must also provide this key.

  • Data reference key: The key to a field in your model version's prediction input that contains a Cloud Storage path to an image. Data Labeling Service loads this image and uses it for the same purposes as it uses Data key.

    Only provide this key if your model version performs image classification or image object detection and accepts paths to images in Cloud Storage as prediction input. At least one of Data key and Data reference key is required.

  • Prediction label key: The key to a field in your model version's prediction output that contains an array of predicted labels. Data Labeling Service compares these values to ground truth values in order to calculate evaluation metrics like confusion matrices.

    This field is required.

  • Prediction score key: The key to a field in your model version's prediction output that contains an array of predicted scores. Data Labeling Service uses these values together with prediction labels and ground truth labels in order to calculate evaluation metrics like precision-recall curves.

    This field is required.

  • Bounding box key: The key to a field in your model version's prediction output that contains an array of bounding boxes. This is required to evaluate image object detection.

    Only provide this key if your model version performs image object detection.
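
To make the key semantics concrete, here is a hedged Python sketch of how a key might be resolved against the JSON stored in the raw_data and raw_prediction columns. The "/" separator for nested fields is taken from the image_bytes/b64 examples below; resolve_key is a hypothetical helper for illustration, not a product API.

import json

def resolve_key(obj, key):
    """Walk nested dicts using '/'-separated path segments."""
    for segment in key.split("/"):
        obj = obj[segment]
    return obj

raw_data = json.loads('{"image_bytes": {"b64": "iVBORw0KGgo..."}}')
print(resolve_key(raw_data, "image_bytes/b64")[:10])  # iVBORw0KGg

raw_prediction = json.loads('{"sentiments": ["happy"], "confidence": [0.8]}')
print(resolve_key(raw_prediction, "sentiments"))  # ['happy']
print(resolve_key(raw_prediction, "confidence"))  # [0.8]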

Prediction key examples

The following sections show how to provide prediction keys for several types of models:

Image classification

Base64-encoded example

Suppose your model version can accept the following input:

{
  "instances": [
    {
      "image_bytes": {
        "b64": "iVBORw0KGgoAAAANSUhEUgAAAAYAAAAGCAYAAADgzO9IAAAAhUlEQVR4AWOAgZeONnHvHcXiGJDBqyDTXa+dVC888oy51F9+eRdY8NdWwYz/RyT//znEsAjEt277+syt5VMJw989DM/+H2MI/L8tVBQk4d38xcWp7ctLhi97ZCZ0rXV6yLA4b6dH59sjTq3fnji1fp4AsWS5j7PXstRg+/b3gU7N351AQgA8+jkf43sjaQAAAABJRU5ErkJggg=="
      }
    }
  ]
}

And suppose it returns the following output:

{
  "predictions": [
    {
      "sentiments": [
        "happy"
      ],
      "confidence": [
        "0.8"
      ]
    }
  ]
}

Then provide the following keys:

  • Data key: image_bytes/b64
  • Prediction label key: sentiments
  • Prediction score key: confidence

Cloud Storage reference example

Suppose your model version can accept the following input:

{
  "instances": [
    {
      "image_path": "gs://cloud-samples-data/datalabeling/image/flower_1.jpeg"
    }
  ]
}

And suppose it returns the following output:

{
  "predictions": [
    {
      "sentiments": [
        "happy"
      ],
      "confidence": [
        "0.8"
      ]
    }
  ]
}

Then provide the following keys:

  • Data reference key: image_path
  • Prediction label key: sentiments
  • Prediction score key: confidence

Text classification

Suppose your model version can accept the following input:

{
  "instances": [
    {
      "text": "If music be the food of love, play on;"
    }
  ]
}

And suppose it returns the following output:

{
  "predictions": [
    {
      "sentiments": [
        "happy"
      ],
      "confidence": [
        "0.8"
      ]
    }
  ]
}

Then provide the following keys:

  • Data key: text
  • Prediction label key: sentiments
  • Prediction score key: confidence

General classification

Suppose your model version can accept the following input:

{
  "instances": [
    {
      "weather": [
        "sunny",
        72,
        0.22
      ]
    }
  ]
}

And suppose it returns the following output:

{
  "predictions": [
    {
      "sentiments": [
        "happy"
      ],
      "confidence": [
        "0.8"
      ]
    }
  ]
}

Then provide the following keys:

  • Data key: weather
  • Prediction label key: sentiments
  • Prediction score key: confidence

Image object detection

Base64-encoded example

Suppose your model version can accept the following input:

{
  "instances": [
    {
      "image_bytes": {
        "b64": "iVBORw0KGgoAAAANSUhEUgAAAAYAAAAGCAYAAADgzO9IAAAAhUlEQVR4AWOAgZeONnHvHcXiGJDBqyDTXa+dVC888oy51F9+eRdY8NdWwYz/RyT//znEsAjEt277+syt5VMJw989DM/+H2MI/L8tVBQk4d38xcWp7ctLhi97ZCZ0rXV6yLA4b6dH59sjTq3fnji1fp4AsWS5j7PXstRg+/b3gU7N351AQgA8+jkf43sjaQAAAABJRU5ErkJggg=="
      }
    }
  ]
}

And suppose it returns the following output:

{
  "predictions": [
    {
      "bird_locations": [
        {
          "top_left": {
            "x": 53,
            "y": 22
          },
          "bottom_right": {
            "x": 98,
            "y": 150
          }
        }
      ],
      "species": [
        "rufous hummingbird"
      ],
      "probability": [
        0.77
      ]
    }
  ]
}

Then provide the following keys:

  • Data key: image_bytes/b64
  • Prediction label key: species
  • Prediction score key: probability
  • Bounding box key: bird_locations

Cloud Storage reference example

Suppose your model version can accept the following input:

{
  "instances": [
    {
      "image_path": "gs://cloud-samples-data/datalabeling/image/flower_1.jpeg"
    }
  ]
}

And suppose it returns the following output:

{
  "predictions": [
    {
      "bird_locations": [
        {
          "top_left": {
            "x": 53,
            "y": 22
          },
          "bottom_right": {
            "x": 98,
            "y": 150
          }
        }
      ],
      "species": [
        "rufous hummingbird"
      ],
      "probability": [
        0.77
      ]
    }
  ]
}

Then provide the following keys:

  • Data reference key: image_path
  • Prediction label key: species
  • Prediction score key: probability
  • Bounding box key: bird_locations

Specify ground truth method

Continuous evaluation works by comparing your machine learning model's predictions with ground truth labels annotated by humans. Select how you want to create ground truth labels by clicking your preferred Ground truth method:

  • Google-managed labeling service: If you select this option, every time the evaluation job runs, Data Labeling Service sends all the new sampled data to human reviewers to label with ground truth. Data Labeling Service pricing applies. If you choose this option, you must provide PDF instructions for labeling your prediction input. Learn how to write good instructions.

  • Provide your own labels: If you select this option, you must add ground truth labels to your evaluation job's BigQuery table yourself, covering any new prediction input sampled before the next run of the evaluation job. By default, the evaluation job runs daily at 10:00 AM UTC, so you must add ground truth labels for any new rows in the BigQuery table before that time each day. Otherwise, that data is not evaluated and you see an error in the Google Cloud console. (A sketch of adding labels appears after this list.)

    This is the only option if your model version performs general classification.
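
If you provide your own labels, one way to add them is with a DML statement through the google-cloud-bigquery Python client, as in the following sketch. The table name and WHERE clause are placeholders, and the groundtruth string here is assumed to mirror the label structure of the raw_prediction column (as in the text classification example above); confirm the expected format for your model objective.

from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

query = """
UPDATE `YOUR_PROJECT_ID.YOUR_DATASET_NAME.YOUR_TABLE_NAME`
SET groundtruth = '{"sentiments": ["happy"]}'
WHERE groundtruth IS NULL
  AND raw_data LIKE '%play on%'  -- identify the row(s) you reviewed
"""
client.query(query).result()  # wait for the DML statement to finish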

Create your job

Click the Create button to create your evaluation job. Data Labeling Service begins sampling prediction input and output from your model version into your BigQuery table immediately.
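
To confirm that sampling has started, you can count the rows landing in your table. A minimal sketch with the google-cloud-bigquery Python client follows; the table name is a placeholder.

from google.cloud import bigquery

client = bigquery.Client(project="YOUR_PROJECT_ID")

query = """
SELECT model_version, COUNT(*) AS sampled_rows
FROM `YOUR_PROJECT_ID.YOUR_DATASET_NAME.YOUR_TABLE_NAME`
WHERE time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
GROUP BY model_version
"""
for row in client.query(query).result():
    print(row.model_version, row.sampled_rows)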

What's next

Learn how to view evaluation metrics.