BigQuery ML model evaluation overview

This document describes how BigQuery ML supports machine learning (ML) model evaluation.

Overview of model evaluation

You can use ML model evaluation metrics for the following purposes:

  • To assess the quality of the fit between the model and the data.
  • To compare different models.
  • To estimate how accurately you can expect each model to perform on a specific dataset, in the context of model selection.

Supervised and unsupervised learning model evaluations work differently:

  • For supervised learning models, model evaluation is well defined. An evaluation set, which is data that the model hasn't seen during training, is typically held out from the training set and then used to measure model performance. We recommend that you don't evaluate on the training set, because a model can fit its training data closely yet generalize poorly to new data. This outcome is known as overfitting, and metrics calculated on the training set don't reveal it; see the sketch after this list for one way to hold out an evaluation set.
  • For unsupervised learning models, model evaluation is less well defined and typically varies from one model type to another. Because unsupervised learning models don't reserve an evaluation set, the evaluation metrics are calculated over the entire input dataset.
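
For example, the following is a minimal sketch of holding out an evaluation set before training. It assumes a hypothetical table `mydataset.input_data` with a unique `id` column; all names here are placeholders:

```sql
-- Deterministic ~80/20 split of a hypothetical input table, based on a
-- hash of each row's unique `id` column.
CREATE OR REPLACE VIEW `mydataset.training_data` AS
SELECT *
FROM `mydataset.input_data`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) < 8;   -- ~80% of rows

CREATE OR REPLACE VIEW `mydataset.evaluation_data` AS
SELECT *
FROM `mydataset.input_data`
WHERE MOD(ABS(FARM_FINGERPRINT(CAST(id AS STRING))), 10) >= 8;  -- ~20% of rows
```

Because the split is based on a hash rather than RAND(), the same rows land in the same split every time the views are queried.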

For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.

Model evaluation offerings

BigQuery ML provides the following functions to calculate evaluation metrics for ML models:

| Model category | Model types | Model evaluation functions | What the function does |
| --- | --- | --- | --- |
| Supervised learning | Linear regression, boosted trees regressor, random forest regressor, DNN regressor, wide-and-deep regressor, AutoML Tables regressor | ML.EVALUATE | Reports mean absolute error, mean squared error, mean squared log error, median absolute error, r2 score, and explained variance. |
| Supervised learning | Logistic regression, boosted trees classifier, random forest classifier, DNN classifier, wide-and-deep classifier, AutoML Tables classifier | ML.EVALUATE | Reports precision, recall, accuracy, F1 score, log loss, and roc auc. |
| | | ML.CONFUSION_MATRIX | Reports the confusion matrix. |
| | | ML.ROC_CURVE | Reports metrics for different threshold values, including recall, false positive rate, true positives, false positives, true negatives, and false negatives. Only applies to binary-class classification models. |
| Unsupervised learning | K-means | ML.EVALUATE | Reports the Davies-Bouldin index, and the mean squared distance between data points and the centroids of the assigned clusters. |
| Unsupervised learning | Matrix factorization | ML.EVALUATE | For explicit feedback-based models, reports mean absolute error, mean squared error, mean squared log error, median absolute error, r2 score, and explained variance. For implicit feedback-based models, reports mean average precision, mean squared error, normalized discounted cumulative gain, and average rank. |
| Unsupervised learning | PCA | ML.EVALUATE | Reports the total explained variance ratio. |
| Unsupervised learning | Autoencoder | ML.EVALUATE | Reports mean absolute error, mean squared error, and mean squared log error. |
| Time series | ARIMA_PLUS | ML.EVALUATE | Reports mean absolute error, mean squared error, mean absolute percentage error, and symmetric mean absolute percentage error. This function requires new data as input. |
| Time series | ARIMA_PLUS | ML.ARIMA_EVALUATE | Reports log_likelihood, AIC, and variance for all ARIMA candidate models characterized by different (p, d, q, has_drift) tuples. It also reports other information about seasonality, holiday effects, and spikes-and-dips outliers. This function doesn't require new data as input. |
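
For example, the classification evaluation functions in the preceding table can be called as follows. This is a minimal sketch; the model and table names are placeholders for a binary classifier and a held-out evaluation table:

```sql
-- Confusion matrix at the default classification threshold.
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `mydataset.my_classifier`,
  (SELECT * FROM `mydataset.evaluation_data`));

-- Per-threshold metrics such as recall and false positive rate;
-- applies only to binary-class classification models.
SELECT *
FROM ML.ROC_CURVE(
  MODEL `mydataset.my_classifier`,
  (SELECT * FROM `mydataset.evaluation_data`));
```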

Automatic evaluation in CREATE MODEL statements

BigQuery ML supports automatic evaluation during model creation. Depending on the model type, the data split training options, and whether you use hyperparameter tuning, the evaluation metrics are calculated against the reserved evaluation dataset, the reserved test dataset, or the entire input dataset.

  • For k-means, PCA, autoencoder, and ARIMA_PLUS models, BigQuery ML uses all of the input data as training data, and evaluation metrics are calculated against the entire input dataset.

  • For linear and logistic regression, boosted tree, random forest, DNN, wide-and-deep, and matrix factorization models, evaluation metrics are calculated against the dataset that's specified by the following CREATE MODEL options:

      • DATA_SPLIT_METHOD
      • DATA_SPLIT_EVAL_FRACTION
      • DATA_SPLIT_COL

    When you train these types of models using hyperparameter tuning, the DATA_SPLIT_TEST_FRACTION option also helps define the dataset that the evaluation metrics are calculated against. For more information, see Data split. A minimal example of these options appears after this list.

  • For AutoML Tables models, see how data splits are used for training and evaluation.
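
For example, the following is a minimal sketch of the data split options on a logistic regression model, assuming a hypothetical training table `mydataset.training_data` with a `label` column:

```sql
-- Randomly reserve 20% of the input rows as the evaluation set; the
-- evaluation metrics computed at creation time use those reserved rows.
CREATE OR REPLACE MODEL `mydataset.my_classifier`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['label'],
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.2
) AS
SELECT *
FROM `mydataset.training_data`;
```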

To retrieve the evaluation metrics that were calculated during model creation, use evaluation functions such as ML.EVALUATE on the model with no input data specified. For an example, see ML.EVALUATE with no input data specified.
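
For example, the following is a minimal sketch that returns the metrics computed at creation time; the model name is a placeholder:

```sql
-- No input data: returns the evaluation metrics that were calculated
-- against the data split reserved during model creation.
SELECT *
FROM ML.EVALUATE(MODEL `mydataset.my_classifier`);
```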

Evaluation with a new dataset

After model creation, you can specify new datasets for evaluation. To provide a new dataset, use evaluation functions like ML.EVALUATE on the model with input data specified. For an example, see ML.EVALUATE with a custom threshold and input data.
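
For example, the following is a minimal sketch that evaluates a binary classifier against rows it wasn't trained on, using a custom classification threshold; all names are placeholders:

```sql
-- New input data plus an optional custom threshold for binary classifiers.
SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.my_classifier`,
  (SELECT * FROM `mydataset.new_data`),
  STRUCT(0.55 AS threshold));
```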