The ML.EVALUATE function

This document describes the ML.EVALUATE function, which lets you evaluate model metrics.

Syntax

ML.EVALUATE(
  MODEL `project_id.dataset.model`
  [, { TABLE `project_id.dataset.table` | (query_statement) }]
    STRUCT(
      [threshold_value AS threshold]
      [, perform_aggregation AS perform_aggregation]
      [, horizon_value AS horizon]
      [, confidence_level AS confidence_level]
      [, trial_id AS trial_id]))

Arguments

ML.EVALUATE takes the following arguments:

  • project_id: Your project ID.
  • dataset: The BigQuery dataset that contains the model.
  • model: The name of the model.
  • table: The name of the input table that contains the evaluation data.

    If table is specified, the input column names in the table must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules. The input must have a column that matches the label column name provided during training. This value is provided using the input_label_cols option. If input_label_cols is unspecified, the column named label in the training data is used.

  • query_statement: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the query_statement clause in GoogleSQL, see Query syntax.

    If query_statement is specified, the input column names from the query must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules. The input must have a column that matches the label column name provided during training. This value is provided using the input_label_cols option. If input_label_cols is unspecified, the column named label in the training data is used. The extra columns are ignored.

    If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then only the input columns present in the TRANSFORM clause must appear in query_statement.

  • threshold: a FLOAT64 value that specifies a custom threshold for the binary-class classification model to use for evaluation. The default value is 0.5.

    A 0 value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.

    If both table_name and query_statement are unspecified, you can't use a threshold.

    You can only use threshold with binary-class classification models.

  • perform_aggregation: a BOOL value that indicates the level of evaluation for forecasting accuracy. If you specify TRUE, then the forecasting accuracy is on the time series level. If you specify FALSE, the forecasting accuracy is on the timestamp level. The default value is TRUE.

  • horizon: an INT64 value that specifies the number of forecasted time points against which the evaluation metrics are computed. The default value is the horizon value specified in the CREATE MODEL statement for the time series model, or 1000 if unspecified. When evaluating multiple time series at the same time, this parameter applies to each time series.

    You can only use horizon when the model type is ARIMA_PLUS and either table_name or query_statement is specified.

  • confidence_level: a FLOAT64 value that specifies the percentage of the future values that fall in the prediction interval. The default value is 0.95. The valid input range is [0, 1).

    You can only use confidence_level when the model type is ARIMA_PLUS, either table_name or query_statement is specified, and perform_aggregation is set to FALSE. The value of confidence_level affects the upper_bound and lower_bound values in the output.

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.

Output

ML.EVALUATE returns a single row containing metrics applicable to the type of model specified.

precision, recall, f1_score, log_loss, and roc_auc are macro-averaged for all of the class labels. For a macro-average, metrics are calculated for each label and then an unweighted average is taken of those values. accuracy is computed as a global total or micro-average. For a micro-average, the metric is calculated globally by counting the total number of correctly predicted rows.

Regression models

Regression models include the following:

  • Linear regression
  • Boosted tree regressor
  • Random forest regressor
  • Deep neural network (DNN) regressor
  • Wide & Deep regressor
  • AutoML Tables regressor

ML.EVALUATE returns the following columns for regression models:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply for AutoML Tables models.
  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
  • median_absolute_error: a FLOAT64 value that contains the median absolute error for the model.
  • r2_score: a FLOAT64 value that contains the R2 score for the model.
  • explained_variance: a FLOAT64 value that contains the explained variance for the model.

Classification models

Classification models include the following:

  • Logistic regressor
  • Boosted tree classifier
  • Random forest classifier
  • DNN classifier
  • Wide & Deep classifier
  • AutoML Tables classifier

ML.EVALUATE returns the following columns for classification models:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply for AutoML Tables models.
  • precision: a FLOAT64 value that contains the precision for the model.
  • recall: a FLOAT64 value that contains the recall for the model.
  • accuracy: a FLOAT64 value that contains the accuracy for the model.
  • f1_score: a FLOAT64 value that contains the F1 score for the model.
  • log_loss: a FLOAT64 value that contains the logistic loss for the model.
  • roc_auc: a FLOAT64 value that contains the area under the receiver operating characteristic curve for the model.

K-means models

ML.EVALUATE returns the following columns for k-means models:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
  • davies_bouldin_index: a FLOAT64 value that contains the Davies-Bouldin Index for the model.
  • mean_squared_distance: a FLOAT64 value that contains the mean squared distance for the model, which is the average of the distances between training data points to their closest centroid.

Matrix factorization models

ML.EVALUATE returns the following columns for matrix factorization models with implicit feedback:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
  • recall: a FLOAT64 value that contains the recall for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • normalized_discounted_cumulative_gain: a FLOAT64 value that contains the normalized discounted cumulative gain for the model.
  • average_rank: a FLOAT64 value that contains the average rank (PDF download) for the model.

ML.EVALUATE returns the following columns for matrix factorization models with explicit feedback:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model.
  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.
  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • r2_score: a FLOAT64 value that contains the R2 score for the model.
  • explained_variance: a FLOAT64 value that contains the explained variance for the model.

PCA models

ML.EVALUATE returns the following column for PCA models:

  • total_explained_variance_ratio: a FLOAT64 value that contains the percentage of the cumulative variance explained by all the returned principal components. For more information, see the ML.PRINCIPAL_COMPONENT_INFO function.

Time series models

ML.EVALUATE returns the following columns for ARIMA_PLUS or ARIMA_PLUS_XREG models when input data is provided and perform_aggregation is FALSE:

  • time_series_id_col or time_series_id_cols: a value that contains the identifiers of a time series. time_series_id_col can be an INT64 or STRING value. time_series_id_cols can be an ARRAY<INT64> or ARRAY<STRING> value. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement. ARIMA_PLUS_XREG models don't support this column.
  • time_series_timestamp_col: a STRING value that contains the timestamp column for a time series. The column name and type are inherited from the TIME_SERIES_TIMESTAMP_COL option as specified in the CREATE MODEL statement.
  • time_series_data_col: a STRING value that contains the data column for a time series. The column name and type are inherited from the TIME_SERIES_DATA_COL option as specified in the CREATE MODEL statement.
  • forecasted_time_series_data_col: a STRING value that contains the same data as time_series_data_col but with forecasted_ prefixed to the column name.
  • lower_bound: a FLOAT64 value that contains the lower bound of the prediction interval.
  • upper_bound: a FLOAT64 value that contains the upper bound of the prediction interval.
  • absolute_error: a FLOAT64 value that contains the absolute value of the difference between the forecasted value and the actual data value.
  • absolute_percentage_error: a FLOAT64 value that contains the absolute value of the absolute error divided by the actual value.

ML.EVALUATE returns the following columns for ARIMA_PLUS or ARIMA_PLUS_XREG models when input data is provided and perform_aggregation is TRUE:

  • time_series_id_col or time_series_id_cols: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement. ARIMA_PLUS_XREG models don't support this column.
  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • root_mean_squared_error: a FLOAT64 value that contains the root mean squared error for the model.
  • mean_absolute_percentage_error: a FLOAT64 value that contains the mean absolute percentage error for the model.
  • symmetric_mean_absolute_percentage_error: a FLOAT64 value that contains the symmetric mean absolute percentage error for the model.

ML.EVALUATE returns the following columns for an ARIMA_PLUS model when input data isn't provided:

  • time_series_id_col or time_series_id_cols: the identifiers of a time series. Only present when forecasting multiple time series at once. The column names and types are inherited from the TIME_SERIES_ID_COL option as specified in the CREATE MODEL statement.
  • non_seasonal_p: an INT64 value that contains the order for the autoregressive model. For more information, see Autoregressive integrated moving average.
  • non_seasonal_d: an INT64 that contains the degree of differencing for the non-seasonal model. For more information, see Autoregressive integrated moving average.
  • non_seasonal_q: an INT64 that contains the order for the moving-average model. For more information, see Autoregressive integrated moving average.
  • has_drift: a BOOL value that indicates whether the model includes a linear drift term.
  • log_likelihood: a FLOAT64 value that contains the log likelihood for the model.
  • aic: a FLOAT64 value that contains the Akaike information criterion for the model.
  • variance: a FLOAT64 value that measures how far the observed value differs from the predicted value mean.
  • seasonal_periods: a STRING value that contains the seasonal period for the model.
  • has_holiday_effect: a BOOL value that indicates whether the model includes any holiday effects.
  • has_spikes_and_dips: a BOOL value that indicates whether the model performs automatic spikes and dips detection and cleanup.
  • has_step_changes: a BOOL value that indicates whether the model has step changes.

Autoencoder models

ML.EVALUATE returns the following columns for autoencoder models:

  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.
  • mean_squared_log_error: a FLOAT64 value that contains the mean squared logarithmic error for the model. The mean squared logarithmic error measures the distance between the actual and predicted values.

Remote models

ML.EVALUATE returns the following column for remote models:

  • remote_eval_metrics: a JSON column containing appropriate metrics for the model type.

Limitations

ML.EVALUATE is subject to the following limitations:

Examples

The following examples show how to use ML.EVALUATE.

ML.EVALUATE with no input data specified

The following query evaluates a model with no input data specified:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.mymodel`)

ML.EVALUATE with a custom threshold and input data

The following query evaluates s model with input data and a custom threshold of 0.55:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.mymodel`,
    (
    SELECT
      custom_label,
      column1,
      column2
    FROM
      `mydataset.mytable`),
    STRUCT(0.55 AS threshold))

ML.EVALUATE to calculate forecasting accuracy of a time series

The following query evaluates the 30-point forecasting accuracy for a time series model:

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_arima_model`,
    (
    SELECT
      timeseries_date,
      timeseries_metric
    FROM
      `mydataset.mytable`),
    STRUCT(TRUE AS perform_aggregation, 30 AS horizon))

ML.EVALUATE to calculate ARIMA_PLUS forecasting accuracy for each forecasted timestamp

The following query evaluates the forecasting accuracy for each of the 30 forecasted points of a time series model. It also computes the prediction interval based on a confidence level of 0.9.

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_arima_model`,
    (
    SELECT
      timeseries_date,
      timeseries_metric
    FROM
      `mydataset.mytable`),
    STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level,
    30 AS horizon))

ML.EVALUATE to calculate ARIMA_PLUS_XREG forecasting accuracy for each forecasted timestamp

The following query evaluates the forecasting accuracy for each of the 30 forecasted points of a time series model. It also computes the prediction interval based on a confidence level of 0.9. Note that you need to include the side features for the evaluation data.

SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_arima_xreg_model`,
    (
    SELECT
      timeseries_date,
      timeseries_metric,
      feature1,
      feature2
    FROM
      `mydataset.mytable`),
    STRUCT(FALSE AS perform_aggregation, 0.9 AS confidence_level,
    30 AS horizon))

What's next