The ML.CONFUSION_MATRIX Function

Use the ML.CONFUSION_MATRIX function to return a confusion matrix for the given logistic regression model and input data. The ML.CONFUSION_MATRIX function can be used with both logistic regression and multiclass logistic regression models. ML.CONFUSION_MATRIX does not support linear regression models.

The output columns of the ML.CONFUSION_MATRIX function depend on the model. The first output column is always expected_label. There are N additional columns, one for each class in the trained model. The names of the additional columns depend on the class labels used to train the model.

If the training class labels all conform to BigQuery column naming rules, the labels are used as the column names. Labels that do not conform to the naming rules are altered so that the resulting column names conform and are unique. For example, if the labels are 0 and 1, the output column names are _0 and _1.
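As a minimal sketch of the naming rule above, the following query selects the output columns by name. It assumes a binary model trained on the labels 0 and 1, so that the generated columns are _0 and _1; the model and table names are the same placeholders used elsewhere on this page.

```sql
-- Assumption: the model was trained on labels 0 and 1,
-- so the predicted-label columns are named _0 and _1.
SELECT
  expected_label,
  _0,
  _1
FROM
  ML.CONFUSION_MATRIX(MODEL `mydataset.mymodel`,
  (
    SELECT
      *
    FROM
      `mydataset.mytable`))
ORDER BY
  expected_label
```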

The columns are ordered by class label in ascending order. The expected (or actual) labels are listed one per row, and the predicted labels are listed one per column. If the labels in the evaluation data match those in the training data, the true positives (the correctly predicted examples for each class) are shown on the diagonal from top left to bottom right.
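Because the correct predictions sit on the diagonal, overall accuracy can be derived from the matrix by dividing the diagonal sum by the total count. The query below is a sketch under stated assumptions: a binary model whose labels are the INT64 values 0 and 1 (so the columns are _0 and _1), evaluation labels of the same type, and a hypothetical output alias named accuracy.

```sql
-- Assumption: labels are the INT64 values 0 and 1, giving columns _0 and _1.
-- For each row, pick the cell on the diagonal (predicted label = expected label),
-- then divide the diagonal total by the total number of evaluated rows.
SELECT
  SAFE_DIVIDE(
    SUM(IF(expected_label = 0, _0, _1)),
    SUM(_0 + _1)) AS accuracy
FROM
  ML.CONFUSION_MATRIX(MODEL `mydataset.mymodel`,
  (
    SELECT
      *
    FROM
      `mydataset.mytable`))
```

SAFE_DIVIDE is used so that an empty evaluation set yields NULL rather than a division-by-zero error.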

The values in the expected_label column are the exact values and type passed into ML.CONFUSION_MATRIX in the label column of the evaluation data. This is true even if they do not exactly match the values or type used during training.

ML.CONFUSION_MATRIX syntax

ML.CONFUSION_MATRIX(MODEL model_name,
           {TABLE table_name | (query_statement)}
           [, STRUCT(threshold_value AS threshold)])

model_name

model_name is the name of the model you're evaluating. If you do not have a default project configured, prepend the project ID to the model name in the following format: `[PROJECT_ID].[DATASET].[MODEL]` (including the backticks); for example, `myproject.mydataset.mymodel`.

The input column names and data types in the model must match the column names and data types in the table.

table_name

table_name is the name of the input table that contains the evaluation data. If you do not have a default project configured, prepend the project ID to the table name in the following format: `[PROJECT_ID].[DATASET].[TABLE]` (including the backticks); for example, `myproject.mydataset.mytable`.

The input column names and data types in the model must match the column names and data types in the table. The input must have a column that matches the label column name provided during training. This value is provided using the input_label_cols option. If input_label_cols is unspecified, the column named "label" in the training data is used.
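When the evaluation table already has the same columns as the training data, including the label column, it can be passed directly with the TABLE keyword instead of a subquery. The sketch below assumes that is the case and includes the project ID in both names; `myproject`, `mydataset`, `mymodel`, and `mytable` are placeholders.

```sql
-- Assumption: mytable contains every input column the model was trained on,
-- plus the label column (named by input_label_cols, or "label" by default).
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `myproject.mydataset.mymodel`,
    TABLE `myproject.mydataset.mytable`)
```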

query_statement

The query_statement clause specifies the standard SQL query that is used to generate the evaluation data. See the Standard SQL Query Syntax page for the supported SQL syntax of the query_statement clause.

All columns referenced by the query_statement are used as inputs to the model.

threshold

(Optional) threshold is a custom threshold for your logistic regression model to be used for evaluation. The default value is 0.5. The threshold value that is supplied must be of type STRUCT.
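For illustration, a custom threshold can be passed as a single-field STRUCT in the third argument. The value 0.55 below is an arbitrary choice for the sketch, not a recommendation, and the model and table names are placeholders.

```sql
-- The threshold is supplied as a STRUCT whose field is named threshold.
-- 0.55 is an arbitrary example value; the default is 0.5.
SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `mydataset.mymodel`,
  (
    SELECT
      *
    FROM
      `mydataset.mytable`),
  STRUCT(0.55 AS threshold))
```

Running the same query with several threshold values shows how counts move between the predicted-label columns as the cutoff changes.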

A zero value for precision or recall means that the selected threshold produced no true positive labels. A NaN value for precision means that the selected threshold produced no positive labels, neither true positives nor false positives.

ML.CONFUSION_MATRIX example

The following query returns the confusion matrix for a logistic regression model. It assumes that your model and input table are in your default project.

SELECT
  *
FROM
  ML.CONFUSION_MATRIX(MODEL `mydataset.mymodel`,
  (
    SELECT
      *
    FROM
      `mydataset.mytable`))