Machine learning model evaluation metrics are used to 1) assess the quality of fit between the model and the data, 2) compare different models, and 3) in the context of model selection, predict how accurately each model can be expected to perform on a specific data set.
Supervised and unsupervised learning model evaluations work differently.
- For supervised learning models, model evaluation is well-defined. An evaluation set, which is data not seen by the model during training, is typically held out from the training set and used to evaluate model performance. We recommend that you don't use the training set for evaluation, because a model can fit its training data very closely and still generalize poorly to new data, so metrics calculated on the training set tend to overstate performance. This problem is known as "overfitting".
- For unsupervised learning models, model evaluation is less well-defined and typically varies from model to model. Because unsupervised learning models do not reserve an evaluation set, the evaluation metrics are calculated using the whole input data set.
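The reason for holding out an evaluation set can be illustrated with a minimal Python sketch. It is not BigQuery ML code: `split_holdout`, `MemorizingModel`, and the synthetic rows are hypothetical names and data chosen only for illustration. The toy model memorizes its training rows, which is an extreme case of overfitting, so its error on the training set is zero while its error on the held-out rows is not.

```python
import random


def split_holdout(rows, eval_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the evaluation set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]  # (training set, evaluation set)


class MemorizingModel:
    """Toy model that memorizes training labels (hypothetical, for illustration)."""

    def fit(self, rows):
        self.lookup = {x: y for x, y in rows}
        # For unseen inputs, fall back to the mean training label.
        self.fallback = sum(y for _, y in rows) / len(rows)
        return self

    def predict(self, x):
        return self.lookup.get(x, self.fallback)


def mean_absolute_error(model, rows):
    return sum(abs(model.predict(x) - y) for x, y in rows) / len(rows)


# Synthetic data: 50 rows with label = 2 * feature + 1.
rows = [(x, 2.0 * x + 1.0) for x in range(50)]
train_rows, eval_rows = split_holdout(rows)

model = MemorizingModel().fit(train_rows)
train_mae = mean_absolute_error(model, train_rows)
eval_mae = mean_absolute_error(model, eval_rows)

print(f"MAE on training set:   {train_mae:.2f}")
print(f"MAE on evaluation set: {eval_mae:.2f}")
```

The training-set error suggests a perfect model, while the evaluation-set error shows how it behaves on rows it has not seen.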
For information about the supported model types of each SQL statement and function, and all of the supported SQL statements and functions for each model type, read the End-to-end user journey for each model.
Model evaluation offerings in BigQuery ML
BigQuery ML supports the following functions that calculate evaluation metrics for machine learning models:
| Model category | Model types | Model evaluation functions | What the function does |
| --- | --- | --- | --- |
| Supervised learning | Linear regression, Boosted Trees regressor, Random Forest regressor, DNN regressor, Wide-and-Deep regressor, AutoML Tables regressor | ML.EVALUATE | Reports the mean absolute error, the mean squared error, the mean squared log error, the median absolute error, the R2 score, and the explained variance. |
| Supervised learning | Logistic regression, Boosted Trees classifier, Random Forest classifier, DNN classifier, Wide-and-Deep classifier, AutoML Tables classifier | ML.EVALUATE | Reports the precision, the recall, the accuracy, the F1 score, the log loss, and the ROC AUC. |
| Supervised learning | Same classifier model types as the previous row | ML.CONFUSION_MATRIX | Reports the confusion matrix. |
| Supervised learning | Same classifier model types as the previous row | ML.ROC_CURVE | Reports metrics for different threshold values, including recall, false positive rate, true positives, false positives, true negatives, and false negatives. Only applies to binary-class classification models. |
| Unsupervised learning | K-means | ML.EVALUATE | Reports the Davies-Bouldin index and the mean squared distance between data points and the centroids of the assigned clusters. |
| Unsupervised learning | Matrix Factorization (explicit feedback) | ML.EVALUATE | Reports the mean absolute error, the mean squared error, the mean squared log error, the median absolute error, the R2 score, and the explained variance. |
| Unsupervised learning | Matrix Factorization (implicit feedback) | ML.EVALUATE | Reports the mean average precision, the mean squared error, the normalized discounted cumulative gain, and the average rank. |
| Unsupervised learning | PCA | ML.EVALUATE | Reports the total explained variance ratio. |
| Unsupervised learning | Autoencoder | ML.EVALUATE | Reports the mean absolute error, the mean squared error, and the mean squared log error. |
| Time series models | ARIMA_PLUS | ML.EVALUATE | Reports forecasting accuracy metrics, including the mean absolute error, the mean squared error, the mean absolute percentage error, and the symmetric mean absolute percentage error. This requires new data as input. |
| Time series models | ARIMA_PLUS | ML.ARIMA_EVALUATE | Reports metrics (including log_likelihood, AIC, and variance) for all ARIMA candidate models characterized by different (p, d, q, has_drift) tuples, as well as other information about seasonality, holiday effects, and spikes-and-dips outliers. This doesn't require new data as input. |
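To make the classification metrics in the table more concrete, the following Python sketch computes them by hand for a small binary-classification example. It is an illustration under stated assumptions, not BigQuery ML's implementation: the labels and scores are made up, the helper names are hypothetical, and a row is treated as positive when its score is greater than or equal to the threshold, which is a convention chosen for this sketch.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives at a given score threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted_positive = score >= threshold  # assumed convention
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(labels, scores, threshold=0.5):
    """Derive common metrics from the confusion counts at one threshold."""
    tp, fp, tn, fn = confusion_counts(labels, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    accuracy = (tp + tn) / len(labels)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "threshold": threshold,
        "precision": precision,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "accuracy": accuracy,
        "f1_score": f1,
    }


# Made-up true labels (1 = positive) and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.7, 0.1]

at_default = classification_metrics(labels, scores, threshold=0.5)
print(at_default)

# Sweeping the threshold yields one (false positive rate, recall) point per
# threshold, which is the kind of per-threshold row an ROC curve is built from.
roc_points = [classification_metrics(labels, scores, t)
              for t in (0.25, 0.5, 0.75)]
for point in roc_points:
    print(point["threshold"], point["false_positive_rate"], point["recall"])
```

Raising the threshold trades recall for a lower false positive rate, which is why threshold-dependent metrics are reported separately from the single-threshold summary.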
Automatic evaluation in CREATE MODEL statements
BigQuery ML supports automatic evaluation during model creation. Depending on the model type, the data split training options, and whether the training uses hyperparameter tuning, the evaluation metrics are calculated against the reserved evaluation dataset, the reserved test dataset, or the entire input dataset.
For K-means, PCA, Autoencoder, and ARIMA_PLUS models, all of the input data is used as training data, and evaluation metrics are calculated against the entire input dataset.
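As an illustration of metrics calculated against the entire input dataset, the following Python sketch computes the two k-means metrics named earlier, the mean squared distance and the Davies-Bouldin index, over every input point. It uses the textbook definitions with Euclidean distance. The points, centroids, and function names are hypothetical, and BigQuery ML's own computation may differ in details such as the distance type.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def mean_squared_distance(points, centroids):
    """Mean squared distance between each point and its assigned centroid."""
    total = 0.0
    for point in points:
        centroid = centroids[nearest_centroid(point, centroids)]
        total += math.dist(point, centroid) ** 2
    return total / len(points)


def davies_bouldin_index(points, centroids):
    """Textbook Davies-Bouldin index; assumes every cluster has a point."""
    k = len(centroids)
    members = [[] for _ in range(k)]
    for point in points:
        members[nearest_centroid(point, centroids)].append(point)
    # Scatter: average distance from a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, centroids[i]) for p in members[i]) / len(members[i])
        for i in range(k)
    ]
    # For each cluster, take its worst (largest) similarity to another cluster.
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst) / k


# Made-up input: two well-separated clusters and their centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

msd = mean_squared_distance(points, centroids)
dbi = davies_bouldin_index(points, centroids)
print(f"mean squared distance: {msd:.2f}")
print(f"Davies-Bouldin index:  {dbi:.2f}")
```

Lower values of both metrics indicate tighter clusters, and a lower Davies-Bouldin index also indicates better separation between clusters.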
For linear models, Boosted Trees, Random Forest, DNN, Wide-and-Deep, and Matrix Factorization models, evaluation metrics are calculated against the dataset determined by the following training options in the CREATE MODEL statement: DATA_SPLIT_METHOD, DATA_SPLIT_EVAL_FRACTION, and DATA_SPLIT_COL. When these models are trained with automatic hyperparameter tuning, the DATA_SPLIT_TEST_FRACTION training option also helps define the dataset that the evaluation metrics are calculated against. See data split for more information.
For AutoML Tables models, see how data splits are used for training and evaluation.
To retrieve the evaluation metrics calculated during model creation, call an evaluation function such as ML.EVALUATE on the model with no input data specified. See this query for an example of model evaluation without specified input data.
Evaluation with a new dataset
You can specify new datasets for evaluation after model creation. To provide a new dataset, call an evaluation function such as ML.EVALUATE on the model with input data specified. See this query for an example of model evaluation with a new dataset.
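The difference between the two ways of calling ML.EVALUATE can be sketched with a small Python helper that only assembles the query text. `build_evaluate_query` is a hypothetical helper, and `mydataset.mymodel` and `mydataset.new_data` are placeholder names. The sketch does not run anything against BigQuery; the resulting string would be submitted through whichever BigQuery interface you normally use.

```python
def build_evaluate_query(model_name, input_query=None):
    """Build an ML.EVALUATE query string (hypothetical helper).

    Without input_query, the query asks for the metrics calculated during
    model creation. With input_query, the model is evaluated against the
    rows returned by that query.
    """
    if input_query is None:
        return f"SELECT * FROM ML.EVALUATE(MODEL `{model_name}`)"
    return f"SELECT * FROM ML.EVALUATE(MODEL `{model_name}`, ({input_query}))"


# Placeholder names; replace them with your own dataset, model, and table.
creation_time_query = build_evaluate_query("mydataset.mymodel")
new_data_query = build_evaluate_query(
    "mydataset.mymodel",
    "SELECT * FROM `mydataset.new_data`",
)

print(creation_time_query)
print(new_data_query)
```

For a supervised model, the rows returned by the input query need to include the label column as well as the feature columns, so that predictions can be compared with the actual values.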