BigQuery ML Model Evaluation Overview

Introduction

Machine learning model evaluation metrics are used to 1) assess the quality of the fit between the model and the data, 2) compare different models, and 3) in the context of model selection, predict how accurately each model can be expected to perform on a specific dataset.

Supervised and unsupervised learning model evaluations work differently.

  • For supervised learning models, model evaluation is well-defined. An evaluation set, which contains data not seen by the model, is typically held out from the training data and used to measure model performance. We recommend that you not use the training set for evaluation, because the model tends to memorize the training data and then generalizes poorly to new data. This is known as "overfitting". For a minimal illustration of holding out an evaluation set, see the sketch after this list.
  • For unsupervised learning models, model evaluation is less defined and typically varies from model to model. As unsupervised learning models do not reserve an evaluation set, the evaluation metrics are calculated using the whole input dataset.
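
For illustration, the following is a minimal sketch of holding out an evaluation set by hashing a key column; the table names and the id column are hypothetical placeholders, and BigQuery ML can also split the data for you through the data split training options described later in this page.

    -- Hypothetical example: reserve roughly 20% of rows for evaluation by
    -- hashing a unique key column, so those rows are never used for training.
    CREATE OR REPLACE TABLE `mydataset.training_data` AS
    SELECT *
    FROM `mydataset.input_data`
    WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 8;

    CREATE OR REPLACE TABLE `mydataset.evaluation_data` AS
    SELECT *
    FROM `mydataset.input_data`
    WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) >= 8;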

For information about the supported model types of each SQL statement and function, and all of the supported SQL statements and functions for each model type, read the End-to-end user journey for each model.

Model evaluation offerings in BigQuery ML

BigQuery ML supports the following functions that calculate evaluation metrics for machine learning models:

The functions are grouped below by model category; each entry lists the supported model types, the evaluation functions that apply to them, and what each function reports.

Supervised learning

  • Linear regression, Boosted Trees regressor, DNN regressor, Wide-and-Deep regressor, and AutoML Tables regressor models
      • ML.EVALUATE: Reports the mean absolute error, the mean squared error, the mean squared log error, the median absolute error, the R2 score, and the explained variance.
  • Logistic regression, Boosted Trees classifier, DNN classifier, Wide-and-Deep classifier, and AutoML Tables classifier models
      • ML.EVALUATE: Reports the precision, the recall, the accuracy, the F1 score, the log loss, and the ROC AUC.
      • ML.CONFUSION_MATRIX: Reports the confusion matrix.
      • ML.ROC_CURVE: Reports metrics for different threshold values, including the recall, the false positive rate, true positives, false positives, true negatives, and false negatives. Applies only to binary classification models.

Unsupervised learning

  • K-means models
      • ML.EVALUATE: Reports the Davies-Bouldin index, and the mean squared distance between data points and the centroids of the assigned clusters.
  • Matrix factorization models
      • ML.EVALUATE: For explicit feedback-based models, reports the mean absolute error, the mean squared error, the mean squared log error, the median absolute error, the R2 score, and the explained variance. For implicit feedback-based models, reports the mean average precision, the mean squared error, the normalized discounted cumulative gain, and the average rank.
  • PCA models
      • ML.EVALUATE: Reports the total explained variance ratio.
  • Autoencoder models
      • ML.EVALUATE: Reports the mean absolute error, the mean squared error, and the mean squared log error.

Time series models

  • ARIMA_PLUS models
      • ML.EVALUATE and ML.ARIMA_EVALUATE¹: Report metrics (including log_likelihood, AIC, and variance) for all ARIMA candidate models characterized by different (p, d, q, has_drift) tuples, as well as other information about seasonality, holiday effects, and spikes-and-dips outliers.

¹ ML.ARIMA_EVALUATE is an extended version of ML.EVALUATE for ARIMA_PLUS models.
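
For illustration, the following queries sketch how some of these functions are typically called; the model names are hypothetical placeholders.

    -- Confusion matrix and ROC curve metrics for a binary classification model.
    SELECT *
    FROM ML.CONFUSION_MATRIX(MODEL `mydataset.my_classification_model`);

    SELECT *
    FROM ML.ROC_CURVE(MODEL `mydataset.my_classification_model`);

    -- Candidate-model metrics (log_likelihood, AIC, variance, and so on)
    -- for an ARIMA_PLUS time series model.
    SELECT *
    FROM ML.ARIMA_EVALUATE(MODEL `mydataset.my_arima_model`);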

Automatic evaluation in CREATE MODEL statements

BigQuery ML supports automatic evaluation during model creation. Depending on the model type, the data split training options, and whether the training uses hyperparameter tuning, the evaluation metrics are calculated against the reserved evaluation dataset, the reserved test dataset, or the entire input dataset.

  • For K-means, PCA, Autoencoder, and ARIMA_PLUS models, all of the input data is used as training data, and the evaluation metrics are calculated against the entire input dataset.

  • For linear, Boosted Trees, DNN, Wide-and-Deep, and Matrix factorization models, evaluation metrics are calculated against the dataset determined by the data split training options specified in the CREATE MODEL statement, including DATA_SPLIT_METHOD, DATA_SPLIT_EVAL_FRACTION, and DATA_SPLIT_COL (see the example after this list). When these models are trained with automatic hyperparameter tuning, the DATA_SPLIT_TEST_FRACTION training option also helps define the dataset that the evaluation metrics are calculated against. See data split for more information.

  • For AutoML Tables models, see how data splits are used for training and evaluation.
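
As a sketch of how the data split options shape automatic evaluation, the following CREATE MODEL statement trains a hypothetical linear regression model while reserving a random 20% of the input data as the evaluation set; the model, table, and label column names are placeholders.

    CREATE OR REPLACE MODEL `mydataset.my_regression_model`
    OPTIONS (
      model_type = 'LINEAR_REG',
      input_label_cols = ['label'],
      -- Randomly reserve 20% of the input rows for evaluation; the metrics
      -- stored with the model are calculated against this reserved split.
      data_split_method = 'RANDOM',
      data_split_eval_fraction = 0.2
    ) AS
    SELECT *
    FROM `mydataset.training_data`;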

For evaluation metrics calculated during model creation, you can use evaluation functions such as ML.EVALUATE on the model with no input data specified. See this query for an example of model evaluation without specified input data.
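
For illustration, a query of this form returns the metrics computed on the data reserved during training; the model name is a placeholder.

    -- No input data is specified, so BigQuery ML returns the evaluation
    -- metrics calculated during model creation.
    SELECT *
    FROM ML.EVALUATE(MODEL `mydataset.my_regression_model`);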

Evaluation with new dataset

You can specify new datasets for evaluation after model creation. To provide a new dataset, use an evaluation function such as ML.EVALUATE on the model with input data specified. See this query for an example of model evaluation with a new dataset.
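
For illustration, a query of this form evaluates the model against new labeled data; the model and table names are placeholders, and the input query must produce the same label and feature columns that were used during training.

    -- The second argument supplies new data, so the evaluation metrics are
    -- calculated against this dataset instead of the reserved split.
    SELECT *
    FROM ML.EVALUATE(
      MODEL `mydataset.my_regression_model`,
      (SELECT * FROM `mydataset.new_labeled_data`));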