The ML.EVALUATE function

This document describes the ML.EVALUATE function, which lets you evaluate model metrics.

Syntax

ML.EVALUATE(
  MODEL `project_id.dataset.model`
  [, { TABLE `project_id.dataset.table` | (query_statement) }]
  [, STRUCT(
      [threshold_value AS threshold]
      [, perform_aggregation AS perform_aggregation]
      [, horizon_value AS horizon]
      [, confidence_level AS confidence_level]
      [, trial_id AS trial_id])])
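
For example, a minimal call that evaluates a model against a table of evaluation data might look like the following; the project, dataset, model, and table names are placeholders:

SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `myproject.mydataset.mymodel`,
    TABLE `myproject.mydataset.eval_data`);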

Arguments

ML.EVALUATE takes the following arguments:

  • project_id: Your project ID.
  • dataset: The BigQuery dataset that contains the model.
  • model: The name of the model.
  • table: The name of the input table that contains the evaluation data.

    If you specify table, the input column names in the table must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules. The input must have a column that matches the label column name that was provided during training. You provide this value by using the input_label_cols option in the CREATE MODEL statement; if input_label_cols is unspecified, the column named label in the training data is used.

  • query_statement: a GoogleSQL query that is used to generate the evaluation data. For the supported SQL syntax of the query_statement clause in GoogleSQL, see Query syntax.

    If you specify query_statement, the input column names from the query must match the column names in the model, and their types must be compatible according to BigQuery implicit coercion rules. The input must have a column that matches the label column name that was provided during training. You provide this value by using the input_label_cols option in the CREATE MODEL statement; if input_label_cols is unspecified, the column named label in the training data is used. Any extra columns are ignored.

    If you used the TRANSFORM clause in the CREATE MODEL statement that created the model, then only the input columns present in the TRANSFORM clause must appear in query_statement.

  • threshold: a FLOAT64 value that specifies a custom threshold for the binary-class classification model to use for evaluation. The default value is 0.5.

    A 0 value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.

    If neither table nor query_statement is specified, you can't use threshold.

    You can only use threshold with binary-class classification models; the first query after this list shows this option in use.

  • perform_aggregation: a BOOL value that indicates the level of evaluation for forecasting accuracy. If you specify TRUE, then forecasting accuracy is computed at the time series level. If you specify FALSE, then forecasting accuracy is computed at the timestamp level. The default value is TRUE.

  • horizon: an INT64 value that specifies the number of forecasted time points against which the evaluation metrics are computed. The default value is the horizon value specified in the CREATE MODEL statement for the time series model, or 1000 if unspecified. When evaluating multiple time series at the same time, this parameter applies to each time series.

    You can only use horizon when the model type is ARIMA_PLUS and either table or query_statement is specified.

  • confidence_level: a FLOAT64 value that specifies the percentage of the future values that fall in the prediction interval. The default value is 0.95. The valid input range is [0, 1).

    You can only use confidence_level when the model type is ARIMA_PLUS, either table or query_statement is specified, and perform_aggregation is set to FALSE; the second query after this list shows these options in use. The value of confidence_level affects the upper_bound and lower_bound values in the output.

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial that you want the function to evaluate. The function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
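
The following queries sketch how these options are passed. All model and table names are placeholders; the first query assumes a binary-class classification model, and the second assumes an ARIMA_PLUS model:

-- Evaluate a binary-class classification model with a custom threshold.
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `myproject.mydataset.my_classifier`,
    TABLE `myproject.mydataset.eval_data`,
    STRUCT(0.55 AS threshold));

-- Evaluate an ARIMA_PLUS model at the timestamp level over a 30-point horizon.
SELECT
  *
FROM
  ML.EVALUATE(
    MODEL `myproject.mydataset.my_forecaster`,
    TABLE `myproject.mydataset.new_actuals`,
    STRUCT(30 AS horizon, 0.9 AS confidence_level, FALSE AS perform_aggregation));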

Output

ML.EVALUATE returns a single row containing metrics applicable to the type of model specified.

For classification models, precision, recall, f1_score, log_loss, and roc_auc are macro-averaged across all of the class labels. For a macro-average, the metric is calculated for each label and then an unweighted average of those values is taken. accuracy is computed as a global total, or micro-average. For a micro-average, the metric is calculated globally by counting the total number of correctly predicted rows.
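
For example, if a model's per-label precision values were 0.9, 0.6, and 0.3 for three class labels, the macro-averaged precision would be (0.9 + 0.6 + 0.3) / 3 = 0.6, regardless of how many rows belong to each label.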

Regression models

Regression models include the following:

  • Linear regression
  • Boosted tree regressor
  • Random forest regressor
  • Deep neural network (DNN) regressor
  • Wide & Deep regressor
  • AutoML Tables regressor

ML.EVALUATE returns the following columns for regression models:

  • trial_id: an INT64 value that identifies the hyperparameter tuning trial. This column is only returned if you ran hyperparameter tuning when creating the model. This column doesn't apply for AutoML Tables models.
  • mean_absolute_error: a FLOAT64 value that contains the mean absolute error for the model.
  • mean_squared_error: a FLOAT64 value that contains the mean squared error for the model.