
The ML.ROC_CURVE function

Use the ML.ROC_CURVE function to evaluate metrics specific to binary-class classification models.

For information about model evaluation in BigQuery ML, see Model evaluation overview.

For information about the supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, see End-to-end user journey for each model.

ML.ROC_CURVE syntax

ML.ROC_CURVE(MODEL model_name
            [, {TABLE table_name | (query_statement)}]
            [, GENERATE_ARRAY(thresholds)])

model_name

model_name is the name of the model that you're evaluating. If you do not have a default project configured, prepend the project ID to the model name in the following format: `[PROJECT_ID].[DATASET].[MODEL]` (including the backticks); for example, `myproject.mydataset.mymodel`.

table_name

(Optional) table_name is the name of the input table that contains the evaluation data. If you do not have a default project configured, prepend the project ID to the table name in the following format: `[PROJECT_ID].[DATASET].[TABLE]` (including the backticks); for example, `myproject.mydataset.mytable`.

If table_name is specified, the input column names in the model must match the column names in the table, and their types should be compatible according to BigQuery implicit coercion rules. The input must have a column that matches the label column name provided during training. This value is provided using the input_label_cols option. If input_label_cols is unspecified, the column named "label" in the training data is used.

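For example, a model whose label column was set with the input_label_cols option might have been created as follows. This is a minimal sketch; the model, table, and column names are hypothetical:

CREATE OR REPLACE MODEL `mydataset.mymodel`
  OPTIONS (
    model_type = 'logistic_reg',      -- binary classification
    input_label_cols = ['purchased']  -- evaluation data must contain a column named "purchased"
  ) AS
SELECT
  *
FROM
  `mydataset.training_table`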

query_statement

(Optional) The query_statement clause specifies the GoogleSQL query that is used to generate the evaluation data. See the GoogleSQL Query Syntax page for the supported SQL syntax of the query_statement clause.

If query_statement is specified, the input column names from the query must match the column names in the model, and their types should be compatible according to BigQuery implicit coercion rules. The input must have a column that matches the label column name provided during training. This value is provided using the input_label_cols option. If input_label_cols is unspecified, the column named "label" in the training data is used. The extra columns are ignored.

If the TRANSFORM clause was present in the CREATE MODEL statement that created model_name, then only the input columns present in the TRANSFORM clause must appear in query_statement.

If neither table_name nor query_statement is specified, ML.ROC_CURVE computes the ROC curve results as follows:

  • If the data was split during training, the split evaluation data is used to compute the ROC curve results.
  • If the data was not split during training, the entire training input is used to compute the ROC curve results.
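For example, you can supply a filtered subquery as the evaluation data. This is a minimal sketch; the holdout column is hypothetical:

SELECT
  *
FROM
  ML.ROC_CURVE(MODEL `mydataset.mymodel`,
    (
      SELECT
        *
      FROM
        `mydataset.mytable`
      WHERE
        holdout = TRUE))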

thresholds

(Optional) thresholds are the percentile values of the prediction output, supplied using the GENERATE_ARRAY function. The supplied values must be of type DOUBLE.

If you do not specify the thresholds to use, they are chosen automatically based on the values of the prediction output using 100 approximate quantiles: APPROX_QUANTILES(predicted_label, 100).
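As a standalone illustration of how GENERATE_ARRAY behaves (unrelated to any particular model), the following expression produces the evenly spaced values 0.25, 0.5, and 0.75:

SELECT GENERATE_ARRAY(0.25, 0.75, 0.25) AS thresholds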

ML.ROC_CURVE output

The output of the ML.ROC_CURVE function includes multiple rows with metrics for different threshold values for the model. The metrics include:

  • threshold
  • recall
  • false_positive_rate
  • true_positives
  • false_positives
  • true_negatives
  • false_negatives
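The rate columns are derived from the count columns: recall is true_positives / (true_positives + false_negatives), and false_positive_rate is false_positives / (false_positives + true_negatives). The following sketch recomputes both from the counts, assuming the example model and table used in the next section:

SELECT
  threshold,
  recall,
  true_positives / (true_positives + false_negatives) AS recall_from_counts,
  false_positive_rate,
  false_positives / (false_positives + true_negatives) AS fpr_from_counts
FROM
  ML.ROC_CURVE(MODEL `mydataset.mymodel`,
    TABLE `mydataset.mytable`)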

ML.ROC_CURVE examples

The following examples assume your model and input table are in your default project.

Evaluate the ROC curve of a binary class logistic regression model

The following query returns these columns: threshold, recall, false_positive_rate, true_positives, false_positives, true_negatives, and false_negatives. The recall and false positive rate can be graphed to produce an ROC curve. The threshold values returned are chosen based on the percentile values of the prediction output.

SELECT
  *
FROM
  ML.ROC_CURVE(MODEL `mydataset.mymodel`,
    TABLE `mydataset.mytable`)

Evaluate an ROC curve with custom thresholds

The following query returns the same columns as the previous example. The threshold values returned are chosen based on the output of the GENERATE_ARRAY function.

SELECT
  *
FROM
  ML.ROC_CURVE(MODEL `mydataset.mymodel`,
    TABLE `mydataset.mytable`,
    GENERATE_ARRAY(0.4,0.6,0.01))
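
Here, GENERATE_ARRAY(0.4, 0.6, 0.01) supplies thresholds from 0.4 through 0.6 in increments of 0.01, so the output contains one row for each of those threshold values.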

Evaluate the precision-recall curve

Instead of producing an ROC curve (recall versus false positive rate), the following query calculates a precision-recall curve by deriving precision from the true positive and false positive counts.

SELECT
  recall,
  true_positives / (true_positives + false_positives) AS precision
FROM
  ML.ROC_CURVE(MODEL `mydataset.mymodel`,
    TABLE `mydataset.mytable`)
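
At a threshold where the model makes no positive predictions, true_positives + false_positives is zero and the division above fails. A variant that guards against this uses GoogleSQL's SAFE_DIVIDE function, which returns NULL instead of raising a division-by-zero error:

SELECT
  recall,
  SAFE_DIVIDE(true_positives, true_positives + false_positives) AS precision
FROM
  ML.ROC_CURVE(MODEL `mydataset.mymodel`,
    TABLE `mydataset.mytable`)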