Evaluating models

This page describes how to use evaluation metrics for your model after it is trained, and provides some basic suggestions for ways you might be able to improve model performance.

Introduction

After training a model, AutoML Tables uses the test dataset to evaluate the quality and accuracy of the new model, and provides an aggregate set of evaluation metrics indicating how well the model performed on the test dataset.

How you use the evaluation metrics to assess the quality of your model depends on your business need and the problem your model is trained to solve. For example, there might be a higher cost for false positives than for false negatives, or vice versa. For regression models, does the size of the delta between the prediction and the correct answer matter? These kinds of questions affect how you interpret your model evaluation metrics.

If you included a weight column in your training data, it does not affect evaluation metrics. Weights are considered only during the training phase.

Evaluation metrics for classification models

Classification models provide the following metrics:

  • AUC PR: The area under the precision-recall (PR) curve. This value ranges from zero to one, where a higher value indicates a higher-quality model.

  • AUC ROC: The area under the receiver operating characteristic (ROC) curve. This ranges from zero to one, where a higher value indicates a higher-quality model.

  • Accuracy: The fraction of classification predictions produced by the model that were correct.

  • Log loss: The cross-entropy between the model predictions and the target values. This ranges from zero to infinity, where a lower value indicates a higher-quality model.

  • F1 score: The harmonic mean of precision and recall. F1 is a useful metric if you're looking for a balance between precision and recall and there's an uneven class distribution.

  • Precision: The fraction of positive classification predictions produced by the model that were correct; that is, of the rows the model predicted to have this label, the fraction that actually have it.

  • Recall: The fraction of rows with this label that the model correctly predicted. Also called "True positive rate".

  • False positive rate: The fraction of rows that do not have this label that the model incorrectly predicted as having it (false positives).
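
To make these definitions concrete, here is a minimal sketch in plain Python (no AutoML dependency, with made-up counts) showing how precision, recall, F1 score, and false positive rate relate to the raw true/false positive/negative counts for a single target value:

# Hypothetical counts for one target value.
tp, fp, fn, tn = 80, 20, 40, 860   # true/false positives, false/true negatives

precision = tp / (tp + fp)               # correct positive predictions / all positive predictions
recall = tp / (tp + fn)                  # correct positive predictions / all rows with this label
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
false_positive_rate = fp / (fp + tn)     # rows without the label that were predicted to have it
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} "
      f"fpr={false_positive_rate:.3f} accuracy={accuracy:.3f}")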

These metrics are returned for every distinct value of the target column. For multi-class classifications, these metrics are micro-averaged and returned as the summary metrics. For binary classifications, the metrics for the minority class are used as the summary metrics. The micro-averaged metrics are the expected value of each metric on a random sample from your dataset.

In addition to the above metrics, AutoML Tables provides two other ways to understand your classification model, the confusion matrix and a feature importance graph.

  • Confusion matrix: The confusion matrix helps you understand where misclassifications occur (which classes get "confused" with each other). Each row is a predicted class and each column is an observed class. The cells of the table indicate how often each classification prediction coincides with each observed class.

    Confusion matrices are provided only for classification models with 10 or fewer values for the target column.

    AutoML Tables evaluate page

  • Feature importance: AutoML Tables tells you what features it found to be most important for building this model in the Feature importance graph. Feature importance is computed by measuring the impact that each feature has on the prediction, when perturbed across a wide spectrum of values sampled from the dataset. You should review this information to ensure that all of the most important features make sense for your data.

    AutoML Tables evaluate page

How micro-averaged precision is calculated

The micro-averaged precision is calculated by adding together the number of true positives (TP) for each potential value of the target column and dividing it by the sum of the true positives (TP) and false positives (FP) for each potential value.

\[ precision_{micro} = \dfrac{TP_1 + \ldots + TP_n} {TP_1 + \ldots + TP_n + FP_1 + \ldots + FP_n} \]

where

  • \(TP_1 + \ldots + TP_n\) is the sum of the true positives for each of n classes
  • \(FP_1 + \ldots + FP_n\) is the sum of false positives for each of n classes
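
The same calculation as a short Python sketch, using hypothetical per-class counts:

# Hypothetical true-positive and false-positive counts for each class.
true_positives = {"blue": 3900, "red": 270}
false_positives = {"blue": 250, "red": 150}

# Pool the counts across all classes, then divide.
tp_total = sum(true_positives.values())
fp_total = sum(false_positives.values())
micro_precision = tp_total / (tp_total + fp_total)

print(f"micro-averaged precision: {micro_precision:.4f}")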

Score threshold

The score threshold is a number that ranges from 0 to 1. It provides a way to specify the minimum confidence level at which a given prediction value should be treated as true. For example, if you have a class that is quite unlikely to be the actual value, you would want to lower the threshold for that class; using a threshold of 0.5 or higher would result in that class being predicted extremely rarely (or never).

As you move the threshold from 0 to 1, you can see how the threshold affects precision and recall. A higher threshold results in an increase in precision (because the model never makes a prediction unless it is extremely sure) but the recall (the percentage of positive examples that the model gets right) decreases.
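
You can reproduce this effect offline if you have per-row scores and labels, for example from the exported test-dataset predictions described later on this page. A minimal sketch with made-up scores:

# Hypothetical predicted scores for one class and the corresponding true labels (1 = has the label).
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    predicted = [1 if score >= threshold else 0 for score in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predicted, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")

As the threshold rises, precision climbs toward 1.0 while recall falls, which is the trade-off described above.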

Evaluation metrics for regression models

Regression models provide the following metrics:

  • MAE: The mean absolute error (MAE) is the average absolute difference between the target values and the predicted values. This metric ranges from zero to infinity; a lower value indicates a higher quality model.

  • RMSE: The root-mean-square error metric is a frequently used measure of the differences between the values predicted by a model or an estimator and the values observed. This metric ranges from zero to infinity; a lower value indicates a higher quality model.

  • RMSLE: The root-mean-squared logarithmic error metric is similar to RMSE, except that it uses the natural logarithm of the predicted and actual values plus 1. RMSLE penalizes under-prediction more heavily than over-prediction. It can also be a good metric when you don't want to penalize differences for large prediction values more heavily than for small prediction values. This metric ranges from zero to infinity; a lower value indicates a higher quality model. The RMSLE evaluation metric is returned only if all label and predicted values are non-negative.

  • R^2: R squared (R^2), also known as the coefficient of determination, is the square of the Pearson correlation coefficient between the labels and predicted values. This metric ranges between zero and one; a higher value indicates a higher quality model.

  • MAPE: Mean absolute percentage error (MAPE) is the average absolute percentage difference between the labels and the predicted values. This metric ranges between zero and infinity; a lower value indicates a higher quality model.

  • Feature importance: AutoML Tables tells you what features it found to be most important for building this model in the Feature importance graph. Feature importance is computed by measuring the impact that each feature has on the prediction, when perturbed across a wide spectrum of values sampled from the dataset. You should review this information to ensure that all of the most important features make sense for your data.

    AutoML Tables evaluate page
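
The regression metrics listed above are standard formulas; the following sketch computes each of them in plain Python for a handful of made-up label/prediction pairs (it illustrates the definitions only, not how AutoML Tables computes them internally):

import math

# Hypothetical labels (actual values) and model predictions.
labels = [3.0, 5.0, 2.5, 7.0, 4.0]
predictions = [2.5, 5.5, 2.0, 8.0, 3.5]
n = len(labels)

mae = sum(abs(y - p) for y, p in zip(labels, predictions)) / n
rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)
# RMSLE uses log(value + 1), so it is only defined when labels and predictions are non-negative.
rmsle = math.sqrt(sum((math.log1p(p) - math.log1p(y)) ** 2
                      for y, p in zip(labels, predictions)) / n)
mape = 100 * sum(abs((y - p) / y) for y, p in zip(labels, predictions)) / n

# R^2 as the squared Pearson correlation between labels and predictions.
mean_y = sum(labels) / n
mean_p = sum(predictions) / n
cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(labels, predictions))
var_y = sum((y - mean_y) ** 2 for y in labels)
var_p = sum((p - mean_p) ** 2 for p in predictions)
r_squared = cov ** 2 / (var_y * var_p)

print(f"MAE={mae:.3f} RMSE={rmse:.3f} RMSLE={rmsle:.3f} MAPE={mape:.1f}% R^2={r_squared:.3f}")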

Getting the evaluation metrics for your model

To evaluate how well your model did on the test dataset, you inspect the evaluation metrics for your model.

Console

To see your model's evaluation metrics using the Google Cloud Platform Console:

  1. Go to the AutoML Tables page in the Google Cloud Platform Console.

    Go to the AutoML Tables page

  2. Select the Models tab in the left navigation pane, and select the model you want to get the evaluation metrics for.

  3. Open the Evaluate tab.

    The summary evaluation metrics are displayed across the top of the screen. For binary classification models, the summary metrics are the metrics of the minority class. For multi-class classification models, the summary metrics are the micro-averaged metrics.

    For classification metrics, you can click on individual target values to see the metrics for that value.

    Evaluation metrics for a trained model

curl command

# Replace model-id with your model ID, and make sure the PROJECT_ID environment variable is set.
curl -X GET \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json" \
  https://automl.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/us-central1/models/model-id/modelEvaluations

For a classification model, the results contain the total number of evaluated rows, the micro-averaged metrics (for multi-class classifications), and metrics for each target column value.

The metrics are provided for each combination of confidence threshold range and position threshold. The confidence threshold is the threshold at which you want to declare a specific value to be true. It is equivalent to the score threshold provided by the GCP Console. The position threshold determines how many outcomes are considered for a prediction. Generally, you should use the metrics returned for a position threshold value of one.

Classification model evaluation results look similar to the following example:

  "modelEvaluation": [
    {
      "name": "projects/1234/locations/us-central1/models/TBL5678/modelEvaluations/2398935",
      "createTime": "2019-05-15T00:20:03.305056Z",
      "evaluatedExampleCount": 4546,
      "classificationEvaluationMetrics": {
        "auPrc": 0.9773522,
        "confidenceMetricsEntry": [
          {
            "recall": 1,
            "precision": 0.5,
            "f1Score": 0.6666667,
            "falsePositiveRate": 1,
            "truePositiveCount": "4546",
            "falsePositiveCount": "4546",
            "positionThreshold": 2147483647
          },
          {
            "confidenceThreshold": 8.994935e-08,
            "recall": 1,
            "precision": 0.500055,
            "f1Score": 0.66671556,
            "falsePositiveRate": 0.99978,
            "truePositiveCount": "4546",
            "falsePositiveCount": "4545",
            "trueNegativeCount": "1",
            "positionThreshold": 2147483647
          },
          {
            "confidenceThreshold": 6.5273593e-06,
            "recall": 1,
            "precision": 0.50232047,
            "f1Score": 0.6687261,
            "falsePositiveRate": 0.9907611,
            "truePositiveCount": "4546",
            "falsePositiveCount": "4504",
            "trueNegativeCount": "42",
            "positionThreshold": 2147483647
          },
          ...
       ],
        "confusionMatrix": {
          "row": [
            {
              "exampleCount": [
                3868,
                153
              ]
            },
            {
              "exampleCount": [
                256,
                269
              ]
            }
          ]
        },
        "auRoc": 0.9767073,
        "logLoss": 0.19408731
      }
    },
    ...
          {
            "confidenceThreshold": 1,
            "precision": "NaN",
            "f1Score": "NaN",
            "falseNegativeCount": "525",
            "trueNegativeCount": "4021",
            "positionThreshold": 1
          }
        ],
        "auRoc": 0.73951846,
        "logLoss": 0.997382
      },
      "displayName": "blue"
    },
    {
     "name": "projects/1234/locations/us-central1/models/TBL5678/modelEvaluations/5739933",
      "annotationSpecId": "not available",
      "createTime": "2019-05-15T00:20:03.305056Z",
      "evaluatedExampleCount": 525,
      "classificationEvaluationMetrics": {
        "auPrc": 0.37806523,
        "confidenceMetricsEntry": [
...
       ],
        "auRoc": 0.9349546,
        "logLoss": 0.08920552
      },
      "displayName": "red"
    }
  ]
}

For a regression model, you should see output similar to the following example:

{
  "modelEvaluation": [
    {
      "name": "projects/1234/locations/us-central1/models/TBL2345/modelEvaluations/68066093",
      "createTime": "2019-05-15T22:33:06.471561Z",
      "evaluatedExampleCount": 418
    },
    {
      "name": "projects/1234/locations/us-central1/models/TBL2345/modelEvaluations/852167724",
      "createTime": "2019-05-15T22:33:06.471561Z",
      "evaluatedExampleCount": 418,
      "regressionEvaluationMetrics": {
        "rootMeanSquaredError": 1.9845301,
        "meanAbsoluteError": 1.48482,
        "meanAbsolutePercentageError": 15.155516,
        "rSquared": 0.6057632,
        "rootMeanSquaredLogError": 0.16848126
      }
    }
  ]
}
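
If you save the curl response to a file, a short sketch like the following (assuming a hypothetical filename of evaluations.json) keeps only the classification entries evaluated at a position threshold of one, as recommended above, and derives overall accuracy from the confusion matrix:

import json

# Load the response body saved from the curl call above (hypothetical filename).
with open("evaluations.json") as f:
    response = json.load(f)

for evaluation in response.get("modelEvaluation", []):
    metrics = evaluation.get("classificationEvaluationMetrics")
    if not metrics:
        continue

    # Keep only the entries evaluated at a position threshold of 1.
    for entry in metrics.get("confidenceMetricsEntry", []):
        if entry.get("positionThreshold") == 1:
            print(entry.get("confidenceThreshold", 0),
                  entry.get("precision"), entry.get("recall"))

    # Overall accuracy: correctly classified rows (the diagonal) over all rows.
    matrix = metrics.get("confusionMatrix")
    if matrix:
        rows = [r["exampleCount"] for r in matrix["row"]]
        correct = sum(rows[i][i] for i in range(len(rows)))
        total = sum(sum(r) for r in rows)
        print(f"accuracy from confusion matrix: {correct / total:.4f}")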

Java

/**
 * Demonstrates using the AutoML client to list model evaluations.
 *
 * @param projectId the Id of the project.
 * @param computeRegion the Region name. (e.g., "us-central1")
 * @param modelId the Id of the model.
 * @param filter the Filter expression.
 * @throws IOException
 */
public static void listModelEvaluations(
    String projectId, String computeRegion, String modelId, String filter) throws IOException {
  // Instantiates a client.
  AutoMlClient client = AutoMlClient.create();

  // Get the full path of the model.
  ModelName modelFullId = ModelName.of(projectId, computeRegion, modelId);

  // Create list model evaluations request.
  ListModelEvaluationsRequest modelEvaluationsRequest =
      ListModelEvaluationsRequest.newBuilder()
          .setParent(modelFullId.toString())
          .setFilter(filter)
          .build();

  // List all the model evaluations in the model by applying filter.
  for (ModelEvaluation element :
      client.listModelEvaluations(modelEvaluationsRequest).iterateAll()) {
    // Display the model evaluations information.
    System.out.println(String.format("Model evaluation name: %s", element.getName()));
    System.out.println(
        String.format(
            "Model evaluation Id: %s",
            element.getName().split("/")[element.getName().split("/").length - 1]));
    System.out.println(
        String.format("Model evaluation annotation spec Id: %s", element.getAnnotationSpecId()));
    System.out.println(
        String.format("Model evaluation example count: %s", element.getEvaluatedExampleCount()));
    System.out.println(
        String.format("Model evaluation display name: %s", element.getDisplayName()));

    DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ");
    String createTime =
        dateFormat.format(new java.util.Date(element.getCreateTime().getSeconds() * 1000));
    System.out.println(String.format("Model evaluation create time: %s", createTime));

    int regressionLength = element.getRegressionEvaluationMetrics().toString().length();
    int classificationLength = element.getClassificationEvaluationMetrics().toString().length();

    if (classificationLength > 0) {
      ClassificationEvaluationMetrics classificationMetrics =
          element.getClassificationEvaluationMetrics();
      List<ConfidenceMetricsEntry> confidenceMetricsEntries =
          classificationMetrics.getConfidenceMetricsEntryList();

      // Display tables classification model information.
      System.out.println("Table classification evaluation metrics:");
      System.out.println(String.format("\tModel au_prc: %f ", classificationMetrics.getAuPrc()));
      System.out.println(String.format("\tModel au_roc: %f ", classificationMetrics.getAuRoc()));
      System.out.println(
          String.format("\tModel log loss: %f ", classificationMetrics.getLogLoss()));

      if (!confidenceMetricsEntries.isEmpty()) {
        System.out.println("\tConfidence metrics entries:");
        // Showing table classification evaluation metrics.
        for (ConfidenceMetricsEntry confidenceMetricsEntry : confidenceMetricsEntries) {
          System.out.println(
              String.format(
                  "\t\tModel confidence threshold: %.2f ",
                  confidenceMetricsEntry.getConfidenceThreshold()));
          System.out.println(
              String.format(
                      "\t\tModel precision: %.2f ", confidenceMetricsEntry.getPrecision() * 100)
                  + '%');
          System.out.println(
              String.format("\t\tModel recall: %.2f ", confidenceMetricsEntry.getRecall() * 100)
                  + '%');
          System.out.println(
              String.format(
                      "\t\tModel f1 score: %.2f ", confidenceMetricsEntry.getF1Score() * 100)
                  + '%');
          System.out.println(
              String.format(
                      "\t\tModel precision@1: %.2f ",
                      confidenceMetricsEntry.getPrecisionAt1() * 100)
                  + '%');
          System.out.println(
              String.format(
                      "\t\tModel recall@1: %.2f ", confidenceMetricsEntry.getRecallAt1() * 100)
                  + '%');
          System.out.println(
              String.format(
                      "\t\tModel f1 score@1: %.2f ", confidenceMetricsEntry.getF1ScoreAt1() * 100)
                  + '%');
          System.out.println(
              String.format(
                  "\t\tModel false positive rate: %.2f ",
                  confidenceMetricsEntry.getFalsePositiveRate()));
          System.out.println(
              String.format(
                  "\t\tModel true positive count: %s ",
                  confidenceMetricsEntry.getTruePositiveCount()));
          System.out.println(
              String.format(
                  "\t\tModel false positive count: %s ",
                  confidenceMetricsEntry.getFalsePositiveCount()));
          System.out.println(
              String.format(
                  "\t\tModel true negative count: %s ",
                  confidenceMetricsEntry.getTrueNegativeCount()));
          System.out.println(
              String.format(
                  "\t\tModel false negative count: %s ",
                  confidenceMetricsEntry.getFalseNegativeCount()));
          System.out.println(
              String.format(
                      "\t\tModel position threshold: %s ",
                      confidenceMetricsEntry.getPositionThreshold())
                  + '\n');
        }
      }
    } else if (regressionLength > 0) {
      RegressionEvaluationMetrics regressionMetrics = element.getRegressionEvaluationMetrics();
      System.out.println("Table regression evaluation metrics:");
      // Showing tables regression evaluation metrics
      System.out.println(
          String.format(
              "\tModel root mean squared error: %f",
              regressionMetrics.getRootMeanSquaredError()));
      System.out.println(
          String.format(
              "\tModel mean absolute error: %f", regressionMetrics.getMeanAbsoluteError()));
      System.out.println(
          String.format(
              "\tModel mean absolute percentage error: %f",
              regressionMetrics.getMeanAbsolutePercentageError()));
      System.out.println(
          String.format("\tModel r^2: %f", regressionMetrics.getRSquared()) + '\n');
    }
  }
}

Node.js

const automl = require(`@google-cloud/automl`);
const math = require(`mathjs`);
const client = new automl.v1beta1.AutoMlClient();

/**
 * Demonstrates using the AutoML client to list model evaluations.
 * TODO(developer): Uncomment the following lines before running the sample.
 */
// const projectId = '[PROJECT_ID]' e.g., "my-gcloud-project";
// const computeRegion = '[REGION_NAME]' e.g., "us-central1";
// const modelId = '[MODEL_ID]' e.g., "TBL4704590352927948800";
// const filter = '[FILTER_EXPRESSIONS]' e.g., "tablesModelMetadata:*";

// Get the full path of the model.
const modelFullId = client.modelPath(projectId, computeRegion, modelId);

// List all the model evaluations in the model by applying filter.
client
  .listModelEvaluations({parent: modelFullId, filter: filter})
  .then(responses => {
    const element = responses[0];
    console.log(`List of model evaluations:`);
    for (let i = 0; i < element.length; i++) {
      const classMetrics = element[i].classificationEvaluationMetrics;
      const regressionMetrics = element[i].regressionEvaluationMetrics;
      // The evaluation ID is the last path segment of the resource name.
      const evaluationId = element[i].name.split(`/`)[7];

      console.log(`Model evaluation name: ${element[i].name}`);
      console.log(`Model evaluation Id: ${evaluationId}`);
      console.log(
        `Model evaluation annotation spec Id: ${element[i].annotationSpecId}`
      );
      console.log(`Model evaluation display name: ${element[i].displayName}`);
      console.log(
        `Model evaluation example count: ${element[i].evaluatedExampleCount}`
      );

      if (classMetrics) {
        const confidenceMetricsEntries = classMetrics.confidenceMetricsEntry;

        console.log(`Table classification evaluation metrics:`);
        console.log(`\tModel auPrc: ${math.round(classMetrics.auPrc, 6)}`);
        console.log(`\tModel auRoc: ${math.round(classMetrics.auRoc, 6)}`);
        console.log(
          `\tModel log loss: ${math.round(classMetrics.logLoss, 6)}`
        );

        if (confidenceMetricsEntries.length > 0) {
          console.log(`\tConfidence metrics entries:`);

          for (const confidenceMetricsEntry of confidenceMetricsEntries) {
            console.log(
              `\t\tModel confidence threshold: ${math.round(
                confidenceMetricsEntry.confidenceThreshold,
                6
              )}`
            );
            console.log(
              `\t\tModel position threshold: ${math.round(
                confidenceMetricsEntry.positionThreshold,
                4
              )}`
            );
            console.log(
              `\t\tModel recall: ${math.round(
                confidenceMetricsEntry.recall * 100,
                2
              )} %`
            );
            console.log(
              `\t\tModel precision: ${math.round(
                confidenceMetricsEntry.precision * 100,
                2
              )} %`
            );
            console.log(
              `\t\tModel false positive rate: ${confidenceMetricsEntry.falsePositiveRate}`
            );
            console.log(
              `\t\tModel f1 score: ${math.round(
                confidenceMetricsEntry.f1Score * 100,
                2
              )} %`
            );
            console.log(
              `\t\tModel recall@1: ${math.round(
                confidenceMetricsEntry.recallAt1 * 100,
                2
              )} %`
            );
            console.log(
              `\t\tModel precision@1: ${math.round(
                confidenceMetricsEntry.precisionAt1 * 100,
                2
              )} %`
            );
            console.log(
              `\t\tModel false positive rate@1: ${confidenceMetricsEntry.falsePositiveRateAt1}`
            );
            console.log(
              `\t\tModel f1 score@1: ${math.round(
                confidenceMetricsEntry.f1ScoreAt1 * 100,
                2
              )} %`
            );
            console.log(
              `\t\tModel true positive count: ${confidenceMetricsEntry.truePositiveCount}`
            );
            console.log(
              `\t\tModel false positive count: ${confidenceMetricsEntry.falsePositiveCount}`
            );
            console.log(
              `\t\tModel false negative count: ${confidenceMetricsEntry.falseNegativeCount}`
            );
            console.log(
              `\t\tModel true negative count: ${confidenceMetricsEntry.trueNegativeCount}`
            );
            console.log(`\n`);
          }
        }
        console.log(
          `\tModel annotation spec Id: ${classMetrics.annotationSpecId}`
        );
      } else if (regressionMetrics) {
        console.log(`Table regression evaluation metrics:`);
        console.log(
          `\tModel root mean squared error: ${regressionMetrics.rootMeanSquaredError}`
        );
        console.log(
          `\tModel mean absolute error: ${regressionMetrics.meanAbsoluteError}`
        );
        console.log(
          `\tModel mean absolute percentage error: ${regressionMetrics.meanAbsolutePercentageError}`
        );
        console.log(`\tModel rSquared: ${regressionMetrics.rSquared}`);
      }
      console.log(`\n`);
    }
  })
  .catch(err => {
    console.error(err);
  });

Python

# TODO(developer): Uncomment and set the following variables
# project_id = 'PROJECT_ID_HERE'
# compute_region = 'COMPUTE_REGION_HERE'
# model_id = 'MODEL_ID_HERE'
# filter_ = 'filter expression here'

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# Get the full path of the model.
model_full_id = client.model_path(project_id, compute_region, model_id)

# List all the model evaluations in the model by applying filter.
response = client.list_model_evaluations(model_full_id, filter_)

print("List of model evaluations:")
for evaluation in response:
    print("Model evaluation name: {}".format(evaluation.name))
    print("Model evaluation id: {}".format(evaluation.name.split("/")[-1]))
    print("Model evaluation example count: {}".format(
        evaluation.evaluated_example_count))
    print("Model evaluation time:")
    print("\tseconds: {}".format(evaluation.create_time.seconds))
    print("\tnanos: {}".format(evaluation.create_time.nanos))
    print("\n")

Troubleshooting model issues

Model evaluation metrics should be good, but not perfect. Poor model performance and perfect model performance are both indications that something went wrong with the training process.

Poor performance

If your model is not performing as well as you would like, here are some things to try.

  • Review your schema.

    Make sure all your columns have the correct type, and that you excluded from training any columns that were not predictive, such as ID columns.

  • Review your data.

    A missing value in a non-nullable column causes that row to be ignored during training. Make sure your data does not have too many errors.

  • Export the test dataset and examine it.

    By inspecting the data and analyzing when the model is making incorrect predictions, you might determine that you need more training data for a particular outcome, or that your training data introduced leakage.

  • Increase the amount of training data.

    If you don't have enough training data, model quality suffers. Make sure your training data is as unbiased as possible.

  • Increase the training time.

    If you had a short training time, you might get a higher-quality model by allowing it to train for a longer period of time.

Perfect performance

If your model returned near-perfect evaluation metrics, something might be wrong with your training data. Here are some things to look for:

  • Target leakage

    Target leakage happens when the training data includes a feature that is based on the outcome and that will not be available at prediction time. For example, if you included a Frequent Buyer number for a model trained to decide whether a first-time user would make a purchase, that model would have very high evaluation metrics, but would perform poorly on real data, because the Frequent Buyer number cannot be known for a first-time user at prediction time.

    To check for target leakage, review the Feature importance graph on the EVALUATE tab for your model. Make sure the columns with high importance are truly predictive and are not leaking information about the target.

  • Time column

    If the time your data was collected matters, make sure you used a Time column or a manual split based on time. Not doing so can skew your evaluation metrics. Learn more.

Downloading your test dataset to BigQuery

You can download your test dataset, including the target column, along with the model's result for each row. Inspecting the rows that the model got wrong can provide clues for how to improve the model.

  1. Open AutoML Tables in the GCP Console.

    Go to the AutoML Tables page

  2. Select Models in the left navigation pane and click your model.

  3. Open the Evaluate tab and click Export predictions on test dataset to BigQuery.

  4. After the export completes, click View your evaluation results in BigQuery to see your data.
