Continuous evaluation regularly samples prediction input and output from trained machine learning models that you have deployed to AI Platform Prediction. AI Platform Data Labeling Service then assigns human reviewers to provide ground truth labels for your prediction input; alternatively, you can provide your own ground truth labels. Data Labeling Service compares your models' predictions with the ground truth labels to provide continual feedback on how your model is performing over time.
How it works
To use continuous evaluation, you must have already deployed a trained machine learning model to AI Platform Prediction as a model version. Then you can create an evaluation job for the model version.
Creating an evaluation job for a model version causes two things to happen:
- As the model version serves online predictions, the input and output for some of these predictions get saved in a BigQuery table. Sampling happens whenever your model version serves predictions. You can customize how much data gets sampled.
- Intermittently, the evaluation job runs, generating evaluation metrics.
You can view the resulting evaluation metrics in the Google Cloud console.
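The sampling step can be pictured with a small sketch. This is only a conceptual illustration, not how AI Platform Prediction implements sampling: the names `should_sample`, `serve_and_sample`, and `sample_percentage` are hypothetical, and a plain Python list stands in for the BigQuery table.

```python
import random


def should_sample(sample_percentage: float, rng: random.Random) -> bool:
    """Decide whether to save one prediction (hypothetical helper).

    sample_percentage is a fraction between 0.0 and 1.0.
    """
    return rng.random() < sample_percentage


def serve_and_sample(requests, predict, sample_percentage, rng):
    """Serve every request and keep the input and output of a sampled subset.

    The returned list stands in for rows saved to the BigQuery table.
    """
    sampled_rows = []
    for request in requests:
        output = predict(request)  # Every request is still served.
        if should_sample(sample_percentage, rng):
            sampled_rows.append({"input": request, "output": output})
    return sampled_rows


# Usage: a toy "model" that doubles its input, sampling every prediction.
rng = random.Random(0)
all_rows = serve_and_sample([1, 2, 3], lambda x: x * 2, 1.0, rng)
no_rows = serve_and_sample([1, 2, 3], lambda x: x * 2, 0.0, rng)
```

A lower sampling fraction saves fewer rows, which also means less data to label before each evaluation job run.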
Ground truth labels are what humans determine to be the correct labels for your machine learning task. Continuous evaluation uses these as an answer key and calculates metrics by comparing your model version's predictions to the ground truth labels. When you create an evaluation job, you must decide how you want to generate ground truth labels for your prediction data. There are two options:
- Data Labeling Service can assign human reviewers to provide ground truth labels for your prediction data.
- You can provide ground truth labels yourself.
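To make the "answer key" idea concrete, here is a minimal sketch of comparing predictions to ground truth labels for a classification task. This section does not list which metrics continuous evaluation reports, so accuracy, precision, and recall are used here only as common examples; the function names are hypothetical.

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the ground truth label."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must be the same length")
    if not predictions:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return correct / len(predictions)


def precision_recall(predictions, ground_truth, label):
    """Precision and recall for one label, treating it as the positive class."""
    pairs = list(zip(predictions, ground_truth))
    true_pos = sum(p == label and g == label for p, g in pairs)
    false_pos = sum(p == label and g != label for p, g in pairs)
    false_neg = sum(p != label and g == label for p, g in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Usage: four sampled predictions and their ground truth labels.
predicted = ["cat", "dog", "cat", "dog"]
truth = ["cat", "cat", "cat", "dog"]
acc = accuracy(predicted, truth)
cat_precision, cat_recall = precision_recall(predicted, truth, "cat")
```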
Evaluation job runs
By default, evaluation jobs run daily at 10:00 AM UTC. What occurs during a run depends on how you decide to generate ground truth labels.
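If you need to reason about when the next default run occurs (for example, to know your deadline for adding labels), a short sketch like the following may help. It assumes the default daily 10:00 AM UTC schedule described above; `next_default_run` is a hypothetical helper and not part of any Google Cloud client library.

```python
from datetime import datetime, time, timedelta, timezone


def next_default_run(now: datetime) -> datetime:
    """Return the next 10:00 AM UTC strictly after `now`.

    `now` should be timezone-aware; a naive datetime would be
    interpreted as local time by astimezone().
    """
    now_utc = now.astimezone(timezone.utc)
    candidate = datetime.combine(now_utc.date(), time(10, 0), tzinfo=timezone.utc)
    if candidate <= now_utc:
        # Today's run time has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


# Usage: one hour before, and exactly at, the scheduled time.
before = next_default_run(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc))
at_run = next_default_run(datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc))
```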
If Data Labeling Service provides ground truth labels
When the evaluation job runs, Data Labeling Service creates a dataset with all the new rows added to the BigQuery table since the last run. These rows contain the prediction input and output sampled from your model version.
Then, Data Labeling Service submits a labeling request based on this dataset to have human reviewers provide ground truth labels.
When the labeling request is complete, Data Labeling Service calculates evaluation metrics using the new ground truth labels and your model version's prediction output.
Human labelers can take more than a day to complete your labeling request, especially if you sample a lot of prediction data into your BigQuery table. If this happens, your evaluation job still runs again the next day, according to its schedule, so multiple runs of the job can be in progress in parallel. Each run applies to a separate sample of prediction data: your model version's predictions from the day before the run. Regardless of how long the runs take, each one produces evaluation metrics for a specific day of predictions.
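The following sketch illustrates why overlapping runs never share data: each run only takes rows sampled since the previous run. It is a simplified model, not the service's implementation. The `sampled_at` field name and the choice to include rows stamped exactly at the run time are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def rows_for_run(rows, previous_run, current_run):
    """Select rows sampled after the previous run, up to the current run.

    The inclusive upper bound is an assumption for this sketch.
    """
    return [
        row for row in rows
        if previous_run < row["sampled_at"] <= current_run
    ]


# Three consecutive daily run times at 10:00 AM UTC.
run1 = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
run2 = run1 + timedelta(days=1)
run3 = run2 + timedelta(days=1)

rows = [
    {"id": "a", "sampled_at": run1 + timedelta(hours=3)},
    {"id": "b", "sampled_at": run2 + timedelta(hours=5)},
]

# Even if labeling for run 2 is still in progress when run 3 starts,
# the two runs work on disjoint sets of rows.
run2_rows = rows_for_run(rows, run1, run2)
run3_rows = rows_for_run(rows, run2, run3)
```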
If you provide ground truth labels yourself
As in the previous section, when the evaluation job runs, Data Labeling Service creates a dataset with all the new rows added to the BigQuery table since the last run. But in this case, you must have already added ground truth labels to the table's groundtruth column before the run occurs. The evaluation job skips any rows without a ground truth label, and these rows cannot be included in a future evaluation job run.
After creating the dataset, Data Labeling Service immediately calculates evaluation metrics.
If you use this method, you must add ground truth labels for new predictions every day, before the evaluation job runs.
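Because unlabeled rows are skipped permanently, it can be useful to check for them before each run. The sketch below shows the idea on plain Python dictionaries standing in for table rows. Only the `groundtruth` column name comes from this page; the helper name and the rule that `None` or an empty string counts as "no label" are assumptions.

```python
def split_by_ground_truth(rows):
    """Separate rows that have a ground truth label from rows that lack one.

    Assumes a missing label appears as None or an empty string
    in the "groundtruth" column.
    """
    labeled, unlabeled = [], []
    for row in rows:
        label = row.get("groundtruth")
        if label is None or label == "":
            unlabeled.append(row)
        else:
            labeled.append(row)
    return labeled, unlabeled


# Usage: one labeled row, two rows that a run would skip.
rows = [
    {"id": 1, "groundtruth": "cat"},
    {"id": 2, "groundtruth": None},
    {"id": 3},
]
labeled, unlabeled = split_by_ground_truth(rows)
```

Any rows that end up in `unlabeled` at run time would be left out of that run's metrics and of all later runs, so labeling them beforehand is the only way to include them.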
If Data Labeling Service assigns human reviewers to provide ground truth labels, Data Labeling Service pricing applies. Apart from this, there are no costs specific to continuous evaluation.
Read about what you need to do before you can create an evaluation job.