The ML.PREDICT function

ML.PREDICT function

The ML.PREDICT function is used to predict outcomes using the model.

For information about model inference in BigQuery ML, see Model inference overview.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

CREATE MODEL syntax

Prediction can be done during model creation, after model creation, or after a failure (as long as at least 1 iteration is finished). ML.PREDICT always uses the model weights from the last successful iteration.

The output of the ML.PREDICT function has as many rows as the input table, and it includes all columns from the input table and all output columns from the model. The output column names for the model are predicted_<label_column_name> and (for classification models) predicted_<label_column_name>_probs. In both columns, label_column_name is the name of the input label column used during training.

  • For linear regression, boosted tree regressor, random forest regressor and DNN regressor models:

    • The predicted_<label_column_name> output column is the predicted value of the label.
  • For binary class logistic regression, boosted tree classifier, random forest classifier and DNN classifier models:

    • The predicted_<label_column_name> output column is one of the two input labels, depending on which label has the higher predicted probability.
    • The predicted_<label_column_name>_probs output column is an array of STRUCTs of type [<label, prob>] that contains the predicted probability of each label.
  • For multiclass logistic regression, boosted tree classifier, random forest classifier and DNN classifier models:

    • The predicted_<label_column_name> output column is the label with the highest predicted probability score.
    • The predicted_<label_column_name>_probs output column is the probability for each class label calculated using a softmax function.
  • For k-means models:

    • Returns columns labeled centroid_id and nearest_centroids_distance. nearest_centroids_distance contains an ARRAY of STRUCTs named nearest_centroids_distance, which contains the distances to the nearest k clusters, where k is equal to the lesser of num_clusters or 5. If this model was created with the option standardize_features set to TRUE, then the model computes these distances using standardized features; otherwise, it computes these distances on non-standardized features.
  • For PCA models:

    • The principal_component_<index> output columns represent the projection of the input data onto each principal component. They can also be considered as embedded low-dimensional features in the space that is spanned by the principal components.
    • The original input columns are appended if keep_original_columns is set to true.
  • For Autoencoder models:

    • The latent_col_<index> output columns represent dimensions of the latent space.
    • Original input columns are appended after the latent space columns as a reference.
  • For TensorFlow Lite models:

    • The input must be convertible to the type expected by the model.
    • The output is the output of the TensorFlow Lite model's predict method.
  • For remote models:

    • The input columns must contain all Vertex AI endpoint input fields.
    • The output columns contain all Vertex AI endpoint output fields and a remote_model_status field which contains status messages from Vertex AI endpoint.
  • For ONNX models:

    • The input must be convertible to the type expected by the model.
    • The output is the output of the onnx model's predict method.
  • For XGBoost models:

    • The input must be convertible to the type expected by the model.
    • The output is the output of the XGBoost model's predict method.

ML.PREDICT permissions

Both bigquery.models.create and bigquery.models.getData are required to run ML.PREDICT.

ML.PREDICT syntax

ML.PREDICT(MODEL model_name,
          {TABLE table_name | (query_statement)}
          [, STRUCT<threshold FLOAT64,
          keep_original_columns BOOL> settings)])

model_name

model_name is the name of the model you're evaluating. If you do not have a default project configured, prepend the project ID to the model name in following format: `[PROJECT_ID].[DATASET].[MODEL]` (including the backticks); for example, `myproject.mydataset.mymodel`.

table_name

table_name is the name of the input table that contains the evaluation data. If you do not have a default project configured, prepend the project ID to the table name in following format: `[PROJECT_ID].[DATASET].[TABLE]` (including the backticks); for example, `myproject.mydataset.mytable`.

The input column names in the table must contain the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.

If there are unused columns from the table, they will be passed through to output columns.

query_statement

The query_statement clause specifies the GoogleSQL query that is used to generate the evaluation data. See the GoogleSQL Query Syntax page for the supported SQL syntax of the query_statement clause.

The input column names from the query must contain the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.

If there are unused columns from the query, they will be passed through to output columns.

If you are running inference on image data from an object table, you must use the ML.DECODE_IMAGE function to convert image bytes to a multi-dimensional ARRAY representation. You can use ML.DECODE_IMAGE output directly in an ML.PREDICT statement, or you can write the results from ML.DECODE_IMAGE to a table column and reference that column when you call ML.PREDICT. For more information, see Predict an outcome from image data with an imported TensorFlow model.

If the TRANSFORM clause was present in the CREATE MODEL statement that created model_name, then only the input columns present in theTRANSFORM clause must be present in query_statement.

threshold

(Optional) Threshold is a custom threshold for your binary logistic regression model and is used as the cutoff between the two labels. Predictions above the threshold are treated as positive prediction. Predictions below the threshold are negative predictions. The threshold value is type FLOAT64 and is part of the settings STRUCT. The default value is 0.5.

keep_original_columns

(Optional) If true, the columns from the input table are output from this function. The default value is false.

Missing data imputation

In statistics, imputation is used to replace missing data with substituted values. When you train a model in BigQuery ML, NULL values are treated as missing data. When you predict outcomes in BigQuery ML, missing values can occur when BigQuery ML encounters a NULL value or a previously unseen value. BigQuery ML handles missing data differently, based on the type of data in the column.

Column type Imputation method
Numeric In both training and prediction, NULL values in numeric columns are replaced with the mean value of the given column, as calculated by the feature column in the original input data.
One-hot/Multi-hot encoded In both training and prediction, NULL values in the encoded columns are mapped to an additional category that is added to the data. Previously unseen data is assigned a weight of 0 during prediction.
TIMESTAMP TIMESTAMP columns use a mixture of imputation methods from both standardized and one-hot encoded columns. For the generated Unix time column, BigQuery ML replaces values with the mean Unix time across the original columns. For other generated values, BigQuery ML assigns them to the respective NULL category for each extracted feature.
STRUCT In both training and prediction, each field of the STRUCT is imputed according to its type.

ML.PREDICT examples

The following examples assume your model and input table are in your default project.

Predict an outcome

The following query uses the ML.PREDICT function to predict an outcome. The query returns these columns:

  • predicted_label
  • label
  • column1
  • column2
SELECT
  *
FROM
  ML.PREDICT(MODEL `mydataset.mymodel`,
    (
    SELECT
      label,
      column1,
      column2
    FROM
      `mydataset.mytable`))

Compare predictions from two different models

In this example, the following query is used to create the first model.

CREATE MODEL
  `mydataset.mymodel1`
OPTIONS
  (model_type='linear_reg',
    input_label_cols=['label'],
  ) AS
SELECT
  label,
  input_column1
FROM
  `mydataset.mytable`

The following query is used to create the second model.

CREATE MODEL
  `mydataset.mymodel2`
OPTIONS
  (model_type='linear_reg',
    input_label_cols=['label'],
  ) AS
SELECT
  label,
  input_column2
FROM
  `mydataset.mytable`

The following query uses the ML.PREDICT function to compare the output of the two models.

SELECT
  label,
  predicted_label1,
  predicted_label AS predicted_label2
FROM
  ML.PREDICT(MODEL `mydataset.mymodel2`,
    (
    SELECT
      * EXCEPT (predicted_label),
          predicted_label AS predicted_label1
    FROM
      ML.PREDICT(MODEL `mydataset.mymodel1`,
        TABLE `mydataset.mytable`)))

Specify a custom threshold

The following query uses the ML.PREDICT function by specifying input data and a custom threshold of 0.55.

SELECT
  *
FROM
  ML.PREDICT(MODEL `mydataset.mymodel`,
    (
    SELECT
      custom_label,
      column1,
      column2
    FROM
      `mydataset.mytable`),
    STRUCT(0.55 AS threshold))

Predict an outcome from structured data with an imported TensorFlow model

The following query uses the ML.PREDICT function to predict outcomes using an imported TensorFlow model. The input_data table contains inputs in the schema expected by my_model. See the CREATE MODEL statement for TensorFlow models for more information.

SELECT *
FROM ML.PREDICT(MODEL `my_project`.my_dataset.my_model,
               (SELECT * FROM input_data))