The ML.PREDICT function
This document describes the ML.PREDICT
function, which you can use to
predict outcomes by using a model. ML.PREDICT
works with the following models:
- Linear and logistic regression models
- K-means models
- Principal component analysis (PCA) models
- Autoencoder models
- Imported models:
- Vertex AI hosted models
For PCA and autoencoder models, you can use the
ML.GENERATE_EMBEDDING
function as an alternative to the ML.PREDICT
function.
ML.GENERATE_EMBEDDING
generates the same embedding data as ML.PREDICT
as an array in a single column, rather than in a series of columns. Having all
of the embeddings in a single column lets you directly use the
VECTOR_SEARCH
function
on theML.GENERATE_EMBEDDING
output.
You can run prediction during model creation, after model creation, or after a
failure (as long as at least one iteration is finished). ML.PREDICT
always uses
the model weights from the last successful iteration.
Syntax
ML.PREDICT( MODEL `project_id.dataset.model`, { TABLE `project_id.dataset.table` | (query_statement) }, STRUCT( [threshold_value AS threshold] [, keep_columns AS keep_original_columns] [, trial_id AS trial_id]) )
Arguments
ML.PREDICT
takes the following arguments:
project_id
: Your project ID.dataset
: The BigQuery dataset that contains the model.model
: The name of the model.table
: The name of the input table that contains the evaluation data.If
table
is specified, the input column names in the table must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.For TensorFlow Lite, Open Neural Network Exchange (ONNX), and XGBoost models, the input must be convertible to the type expected by the model.
For remote models, the input columns must contain all Vertex AI endpoint input fields.
If there are unused columns from the table, they are passed through as output columns.
query_statement
: The GoogleSQL query that is used to generate the evaluation data. See the GoogleSQL query syntax page for the supported SQL syntax of thequery_statement
clause.If
query_statement
is specified, the input column names from the query must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules.For TensorFlow Lite, ONNX, and XGBoost models, the input must be convertible to the type expected by the model.
For remote models, the input columns must contain all Vertex AI endpoint input fields.
If there are unused columns from the query, they are passed through as output columns.
If you used the
TRANSFORM
clause in theCREATE MODEL
statement that created the model, then only the input columns present in theTRANSFORM
clause must appear inquery_statement
.If you are running inference on image data from an object table, you must use the
ML.DECODE_IMAGE
function to convert image bytes to a multi-dimensionalARRAY
representation. You can useML.DECODE_IMAGE
output directly in anML.PREDICT
statement, or you can write the results fromML.DECODE_IMAGE
to a table column and reference that column when you callML.PREDICT
. For more information, see Predict an outcome from image data with an imported TensorFlow model.threshold
: aFLOAT64
value that specifies a custom threshold for a binary classification model. It is used as the cutoff between the two labels. Predictions above the threshold are positive predictions. Predictions below the threshold are negative predictions. The default value is0.5
.keep_original_columns
: aBOOL
value that specifies whether to output the input table columns. IfTRUE
, the columns from the input table are output. The default value isFALSE
.keep_original_columns
only applies to principal component analysis (PCA) models.trial_id
: anINT64
value that identifies the hyperparameter tuning trial that you want the function to evaluate. The function uses the optimal trial by default. Only specify this argument if you ran hyperparameter tuning when creating the model.
Output
The output of the ML.PREDICT
function has as many rows as the input table, and
it includes all columns from the input table and all output columns from the
model. The output column names for the model are predicted_<label_column_name>
and, for classification models, predicted_<label_column_name>_probs
. In
both columns, label_column_name
is the name of the input label column that's
used during training.
Regression models
For the following types of regression models:
- Linear regression
- Boosted tree regressor
- Random forest regressor
- Deep neural network (DNN) regressor
The following column is returned:
predicted_<label_column_name>
: aSTRING
value that contains the predicted value of the label.
Classification models
For the following types of binary-class classification models:
- Logistic regression
- Boosted tree classifier
- Random forest classifier
- DNN classifier
The following columns are returned:
- The
predicted_<label_column_name>
: aSTRING
value that contains one of the two input labels, depending on which label has the higher predicted probability. - The
predicted_<label_column_name>_probs
: anARRAY<STRUCT>
value in the form[<label, probability>]
that contains the predicted probability of each label.
For the following types of multiclass classification models:
- Logistic regression
- Boosted tree classifier
- Random forest classifier
- DNN classifier
The following columns are returned:
- The
predicted_<label_column_name>
: aSTRING
value that contains the label with the highest predicted probability score. - The
predicted_<label_column_name>_probs
: aFLOAT64
value that contains the probability for each class label, calculated using a softmax function.
K-means models
For k-means models, the following columns are returned:
centroid_id
: anINT64
value that identifies the centroid.nearest_centroids_distance
: anARRAY<STRUCT>
value that contains the distances to the nearestk
clusters, wherek
is equal to the lesser ofnum_clusters
or5
. If the model was created with thestandardize_features
option set toTRUE
, then the model computes these distances using standardized features; otherwise, it computes these distances using non-standardized features.
PCA models
For PCA models, the following columns are returned:
principal_component_<index>
: anINT64
value that represents the projection of the input data onto each principal component. These values can also be considered as embedded low-dimensional features in the space that is spanned by the principal components.
The original input columns are appended if the keep_original_columns
argument is set to TRUE
.
Autoencoder models
For autoencoder models, the following columns are returned:
latent_col_<index>
: anINT64
value that represents the dimensions of the latent space.
The original input columns are appended after the latent space columns.
Imported models
For TensorFlow Lite models, the output is the output of the TensorFlow Lite model's predict method.
For ONNX models, the output is the output of the ONNX model's predict method.
For XGBoost models, the output is the output of the XGBoost model's predict method.
Remote models
For remote models, the output columns contain all Vertex AI endpoint
output fields, and also a remote_model_status
field that contains status
messages from Vertex AI endpoint.
Missing data imputation
In statistics, imputation is used to replace missing data with substituted
values. When you train a model in BigQuery ML, NULL
values are
treated as missing data. When you predict outcomes in BigQuery ML,
missing values can occur when BigQuery ML encounters a NULL
value or a previously unseen value. BigQuery ML handles missing
data differently, based on the type of data in the column.
Column type | Imputation method |
---|---|
Numeric | In both training and prediction, NULL values in numeric
columns are replaced with the mean value of the given column, as calculated
by the feature column in the original input data. |
One-hot/Multi-hot encoded | In both training and prediction, NULL values in the
encoded columns are mapped to an additional category that is added to
the data. Previously unseen data is assigned a weight of 0 during
prediction. |
TIMESTAMP |
TIMESTAMP columns use a mixture of imputation methods
from both standardized and one-hot encoded columns. For the generated Unix
time column, BigQuery ML replaces values with the mean Unix
time across the original columns. For other generated values,
BigQuery ML assigns them to the respective NULL
category for each extracted feature. |
STRUCT |
In both training and prediction, each field of the STRUCT
is imputed according to its type. |
Permissions
You must have the bigquery.models.getData
Identity and Access Management (IAM) permission
in order to run ML.PREDICT
.
Examples
The following examples assume your model and input table are in your default project.
Predict an outcome
The following example predicts an outcome and returns the following columns:
predicted_label
label
column1
column2
SELECT * FROM ML.PREDICT(MODEL `mydataset.mymodel`, ( SELECT label, column1, column2 FROM `mydataset.mytable`))
Compare predictions from two different models
The following example creates two models and then compares their output:
Create the first model:
CREATE MODEL `mydataset.mymodel1` OPTIONS (model_type='linear_reg', input_label_cols=['label'], ) AS SELECT label, input_column1 FROM `mydataset.mytable`
Create the second model:
CREATE MODEL `mydataset.mymodel2` OPTIONS (model_type='linear_reg', input_label_cols=['label'], ) AS SELECT label, input_column2 FROM `mydataset.mytable`
Compare the output of the two models:
SELECT label, predicted_label1, predicted_label AS predicted_label2 FROM ML.PREDICT(MODEL `mydataset.mymodel2`, ( SELECT * EXCEPT (predicted_label), predicted_label AS predicted_label1 FROM ML.PREDICT(MODEL `mydataset.mymodel1`, TABLE `mydataset.mytable`)))
Specify a custom threshold
The following example runs prediction with input data and
a custom threshold of 0.55
:
SELECT * FROM ML.PREDICT(MODEL `mydataset.mymodel`, ( SELECT custom_label, column1, column2 FROM `mydataset.mytable`), STRUCT(0.55 AS threshold))
Predict an outcome from structured data with an imported TensorFlow model
The following query predicts outcomes using an imported
TensorFlow model. The input_data
table contains inputs in the
schema expected by my_model
. See
the CREATE MODEL
statement for TensorFlow models
for more information.
SELECT * FROM ML.PREDICT(MODEL `my_project.my_dataset.my_model`, (SELECT * FROM input_data))
Predict an outcome from image data with an imported TensorFlow model
If you are running inference on image data from an
object table, you must use the
ML.DECODE_IMAGE
function
to convert image bytes to a multi-dimensional ARRAY
representation. You can
use ML.DECODE_IMAGE
output directly in an ML.PREDICT
function,
or you can write the results from ML.DECODE_IMAGE
to a table column and
reference that column when you call ML.PREDICT
. You can also pass
ML.DECODE_IMAGE
output to another image processing function for
additional preprocessing during either of these procedures.
You can join the object table to standard BigQuery tables to limit the data used in inference, or to provide additional input to the model.
The following examples show different ways you can use the ML.PREDICT
function with image data.
Example 1
The following example uses the ML.DECODE_IMAGE
function directly in the
ML.PREDICT
function. It returns the inference results for all images in the
object table, for a model with an input field of input
and an output
field of feature
:
SELECT * FROM ML.PREDICT( MODEL `my_dataset.vision_model`, (SELECT uri, ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data), 480, 480, FALSE) AS input FROM `my_dataset.object_table`) );
Example 2
The following example uses the ML.DECODE_IMAGE
function directly in the
ML.PREDICT
function, and uses the ML.CONVERT_COLOR_SPACE
function in the
ML.PREDICT
function to convert
the image color space from RBG
to YIQ
. It also shows how to
use object table fields to filter the objects included in inference.
It returns the inference results for all JPG images in the
object table, for a model with an input field of input
and an output
field of feature
:
SELECT * FROM ML.PREDICT( MODEL `my_dataset.vision_model`, (SELECT uri, ML.CONVERT_COLOR_SPACE(ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data), 224, 280, TRUE), 'YIQ') AS input FROM `my_dataset.object_table` WHERE content_type = 'image/jpeg') );
Example 3
The following example uses results from ML.DECODE_IMAGE
that have been
written to a table column but not processed any further. It uses
ML.RESIZE_IMAGE
and ML.CONVERT_IMAGE_TYPE
in the ML.PREDICT
function to
process the image data. It returns the inference results for all images in the
decoded images table, for a model with an input field of input
and an output
field of feature
.
Create the decoded images table:
CREATE OR REPLACE TABLE `my_dataset.decoded_images` AS (SELECT ML.DECODE_IMAGE(data) AS decoded_image FROM `my_dataset.object_table`);
Run inference on the decoded images table:
SELECT * FROM ML.PREDICT( MODEL`my_dataset.vision_model`, (SELECT uri, ML.CONVERT_IMAGE_TYPE(ML.RESIZE_IMAGE(decoded_image, 480, 480, FALSE)) AS input FROM `my_dataset.decoded_images`) );
Example 4
The following example uses results from ML.DECODE_IMAGE
that have been
written to a table column and preprocessed using
ML.RESIZE_IMAGE
. It returns the inference results for all images in the
decoded images table, for a model with an input field of input
and an output
field of feature
.
Create the table:
CREATE OR REPLACE TABLE `my_dataset.decoded_images` AS (SELECT ML.RESIZE_IMAGE(ML.DECODE_IMAGE(data) 480, 480, FALSE) AS decoded_image FROM `my_dataset.object_table`);
Run inference on the decoded images table:
SELECT * FROM ML.PREDICT( MODEL `my_dataset.vision_model`, (SELECT uri, decoded_image AS input FROM `my_dataset.decoded_images`) );
Example 5
The following example uses the ML.DECODE_IMAGE
function directly in the
ML.PREDICT
function. In this example, the model has an output field of
embeddings
and two input fields: one that expects an
image, f_img
, and one that expects a string, f_txt
. The image
input comes from the object table and the string input comes from a
standard BigQuery table that is joined with the object table
by using the uri
column.
SELECT * FROM ML.PREDICT( MODEL `my_dataset.mixed_model`, (SELECT uri, ML.RESIZE_IMAGE(ML.DECODE_IMAGE(my_dataset.my_object_table.data), 224, 224, FALSE) AS f_img, my_dataset.image_description.description AS f_txt FROM `my_dataset.object_table` JOIN `my_dataset.image_description` ON object_table.uri = image_description.uri) );
Predict an outcome with a model trained with the TRANSFORM
clause
The following example trains a model using the
TRANSFORM
clause:
CREATE MODEL `mydataset.mymodel` TRANSFORM(f1 + f2 as c, label) OPTIONS(...) AS SELECT f1, f2, f3, label FROM t;
Because the f3
column doesn't appear in the TRANSFORM
clause,
the following prediction query omits that column in the query_statement
:
SELECT * FROM ML.PREDICT( MODEL `mydataset.mymodel`, (SELECT f1, f2 FROM t1));
If f3
is provided in the SELECT
statement, it isn't used for calculating
predictions but is instead passed through for use in the rest of the
SQL statement.
Predict dimensionality reduction results (latent space) with an autoencoder model
The following example runs prediction against a previously built autoencoder model, where the input was 4 dimensional (4 input columns) and the dimensionality reduction had 2 dimensions (2 output columns):
SELECT * FROM ML.PREDICT( MODEL `mydataset.mymodel`, (SELECT f1, f2, f3, f4 FROM t1));
What's next
- For information about model inference, see Model inference overview.
- For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.