The ML.CONFUSION_MATRIX function
Use the ML.CONFUSION_MATRIX function to return a confusion matrix for the
input classification model and input data.
For information about model evaluation in BigQuery ML, see Model evaluation overview.
For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
ML.CONFUSION_MATRIX syntax
ML.CONFUSION_MATRIX(MODEL model_name [, {TABLE table_name | (query_statement)}] [, STRUCT(<T> AS threshold)])
model_name
model_name is the name of the model you're evaluating. If you do not have a
default project configured, prepend the project ID to the model name in the
following format: `[PROJECT_ID].[DATASET].[MODEL]` (including the backticks);
for example, `myproject.mydataset.mymodel`.
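As a minimal sketch of the fully qualified form, the query below assumes a project named myproject that contains mydataset.mymodel; these are the placeholder names used above, not real resources.

```sql
-- Placeholder names: replace myproject, mydataset, and mymodel with your own.
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `myproject.mydataset.mymodel`)
```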
table_name
(Optional) table_name is the name of the input table that contains the
evaluation data. If you do not have a default project configured, prepend the
project ID to the table name in the following format:
`[PROJECT_ID].[DATASET].[TABLE]` (including the backticks); for example,
`myproject.mydataset.mytable`.
If table_name is specified, the input column names in the table must match the
column names in the model, and their types should be compatible according to
BigQuery implicit coercion rules.
The input must have a column that matches the label column name provided during
training. This value is provided using the input_label_cols option. If
input_label_cols is unspecified, the column named "label" in the training data
is used.
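The table form can be sketched as follows. This assumes that mydataset.mytable sits in your default project and already contains the model's input columns together with the label column.

```sql
-- Evaluate against an existing table; it must include the label column.
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `mydataset.mymodel`,
  TABLE `mydataset.mytable`)
```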
If neither table_name nor query_statement is specified, ML.CONFUSION_MATRIX
computes the confusion matrix results as follows:
- If the data is split during training, the split evaluation data is used to compute the confusion matrix results.
- If the data is not split during training, the entire training input is used to compute the confusion matrix results.
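When no input data is supplied, the call reduces to the model argument alone. A minimal sketch, again assuming the model is in your default project:

```sql
-- No table or query: the evaluation split is used if one exists,
-- otherwise the entire training input is used.
SELECT *
FROM ML.CONFUSION_MATRIX(MODEL `mydataset.mymodel`)
```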
query_statement
(Optional) The query_statement clause specifies the GoogleSQL query that is
used to generate the evaluation data. See the GoogleSQL Query Syntax page for
the supported SQL syntax of the query_statement clause.
If query_statement is specified, the input column names from the query must
match the column names in the model, and their types should be compatible
according to BigQuery implicit coercion rules.
The input must have a column that matches the label column name provided during
training. This value is provided using the input_label_cols option. If
input_label_cols is unspecified, the column named "label" in the training data
is used. Any extra columns are ignored.
If the TRANSFORM clause was present in the CREATE MODEL statement that created
model_name, then only the input columns present in the TRANSFORM clause must
appear in query_statement.
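To illustrate the TRANSFORM case, the sketch below assumes a model whose TRANSFORM clause reads two raw columns. The column names raw_col_a and raw_col_b, and the label column named label, are hypothetical and stand in for whatever your CREATE MODEL statement used.

```sql
-- Hypothetical columns: raw_col_a and raw_col_b are the inputs that the
-- model's TRANSFORM clause reads; label is the label column.
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `mydataset.mymodel`,
  (
    SELECT raw_col_a, raw_col_b, label
    FROM `mydataset.mytable`
  ))
```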
threshold
(Optional) threshold is the custom threshold used to evaluate your binary
classification model. The default value is 0.5. The threshold value that is
supplied must be of type STRUCT.
A zero value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.
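A custom threshold is passed as a STRUCT in the third argument. In the sketch below, the value 0.6 is an arbitrary illustration, not a recommended setting.

```sql
-- Use 0.6 instead of the default 0.5 as the binary classification threshold.
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `mydataset.mymodel`,
  (
    SELECT *
    FROM `mydataset.mytable`
  ),
  STRUCT(0.6 AS threshold))
```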
ML.CONFUSION_MATRIX output
The output columns of the ML.CONFUSION_MATRIX function depend on the model.
The first output column is always expected_label. There are N additional
columns, one for each class in the trained model. The names of the additional
columns depend on the class labels used to train the model.
If the training class labels all conform to BigQuery column naming rules, the
labels are used as the column names. Labels that do not conform to the naming
rules are altered so that the resulting column names conform and are unique.
For example, if the labels are 0 and 1, the output column names are _0 and _1.
The columns are ordered based on the class labels in ascending order. If the labels in the evaluation data match those in the training data, the True Positives are shown on the diagonal from top left to bottom right. The expected (or actual) labels are listed one per row, and the predicted labels are listed one per column.
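Because each row holds one expected label and the diagonal holds the correct predictions, per-class recall can be derived directly from the output. The sketch below assumes a binary model trained with INT64 labels 0 and 1, so that the count columns are named _0 and _1; adjust the column names for other label sets.

```sql
-- Assumes INT64 class labels 0 and 1, so the count columns are _0 and _1.
SELECT
  expected_label,
  _0,
  _1,
  -- The diagonal cell divided by the row total is the recall for that class.
  SAFE_DIVIDE(IF(expected_label = 0, _0, _1), _0 + _1) AS recall
FROM ML.CONFUSION_MATRIX(
  MODEL `mydataset.mymodel`,
  TABLE `mydataset.mytable`)
ORDER BY expected_label
```

SAFE_DIVIDE returns NULL rather than raising an error when a class has no rows in the evaluation data.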
The values in the expected_label column are the exact values and type passed
into ML.CONFUSION_MATRIX in the label column of the evaluation data. This is
true even if they do not exactly match the values or type used during training.
ML.CONFUSION_MATRIX example
The following example assumes your model and input table are in your default project.
The following query is used to return the confusion matrix for a logistic regression model.
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `mydataset.mymodel`,
  (
    SELECT *
    FROM `mydataset.mytable`
  ))
ML.CONFUSION_MATRIX limitations
The ML.CONFUSION_MATRIX function is subject to the following limitations:
- ML.CONFUSION_MATRIX does not support imported TensorFlow models.