The CREATE MODEL statement for boosted tree models using XGBoost

This document describes the CREATE MODEL statement for creating boosted tree models in BigQuery. Boosted tree models are trained using the XGBoost library. Boosted tree models support hyperparameter tuning.

For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
OPTIONS(model_option_list)
AS query_statement

model_option_list:
MODEL_TYPE = { 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' }
    [, APPROX_GLOBAL_FEATURE_CONTRIB = { TRUE | FALSE } ]
    [, CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING` | 'DUMMY_ENCODING' | 'LABEL_ENCODING' } ]
    [, BOOSTER_TYPE = { {'GBTREE' | 'DART'} | HPARAM_CANDIDATES([candidates]) } ]
    [, DART_NORMALIZE_TYPE = { {'TREE' | 'FOREST'} | HPARAM_CANDIDATES([candidates]) } ]
    [, NUM_PARALLEL_TREE = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, MAX_TREE_DEPTH = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, DROPOUT = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, L1_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, L2_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, LEARN_RATE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, TREE_METHOD = { {'AUTO' | 'EXACT' | 'APPROX' | 'HIST'} | HPARAM_CANDIDATES([candidates]) } ]
    [, MIN_TREE_CHILD_WEIGHT = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, COLSAMPLE_BYTREE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, COLSAMPLE_BYLEVEL = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, COLSAMPLE_BYNODE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, MIN_SPLIT_LOSS = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, SUBSAMPLE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ]
    [, INSTANCE_WEIGHT_COL = string_value ]
    [, XGBOOST_VERSION = { '0.9' | '1.1' } ]
    [, AUTO_CLASS_WEIGHTS = { TRUE | FALSE } ]
    [, CLASS_WEIGHTS = struct_array ]
    [, ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE } ]
    [, EARLY_STOP = { TRUE | FALSE } ]
    [, MIN_REL_PROGRESS = float64_value ]
    [, INPUT_LABEL_COLS = string_array ]
    [, MAX_ITERATIONS = int64_value ]
    [, DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' } ]
    [, DATA_SPLIT_EVAL_FRACTION = float64_value ]
    [, DATA_SPLIT_TEST_FRACTION = float64_value ]
    [, DATA_SPLIT_COL = string_value ]
    [, NUM_TRIALS = int64_value ]
    [, MAX_PARALLEL_TRIALS = int64_value ]
    [, HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' } ]
    [, HPARAM_TUNING_OBJECTIVES = { 'ROC_AUC' | 'R2_SCORE' | ... } ]

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model doesn't exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

MODEL_TYPE

Syntax

MODEL_TYPE = { 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' }

Description

Specifies the model type. This option is required.

APPROX_GLOBAL_FEATURE_CONTRIB

Syntax

APPROX_GLOBAL_FEATURE_CONTRIB = { TRUE | FALSE }

Description

Enables fast approximation for feature contributions. This functionality is provided by the XGBoost library; BigQuery ML only passes this option through to it. For more information, see Package 'xgboost' and search for approxcontrib.

In order to use the fast approximation for feature contribution computations, you need to set both ENABLE_GLOBAL_EXPLAIN and APPROX_GLOBAL_FEATURE_CONTRIB to TRUE.

Arguments

A BOOL value. The default value is FALSE.

CATEGORY_ENCODING_METHOD

Syntax

CATEGORY_ENCODING_METHOD = { 'ONE_HOT_ENCODING' | 'DUMMY_ENCODING' | 'LABEL_ENCODING' }

Description

Specifies which encoding method to use on non-numeric features. For more information about supported encoding methods, see BigQuery ML auto preprocessing.

Arguments

This option accepts the following values:

  • LABEL_ENCODING. This is the default.
  • DUMMY_ENCODING
  • ONE_HOT_ENCODING

BOOSTER_TYPE

Syntax

BOOSTER_TYPE = { { 'GBTREE' | 'DART'} | HPARAM_CANDIDATES([candidates]) }

Description

Specifies the booster type to use.

Arguments

This option accepts the following values:

If you are running hyperparameter training, you can provide more than one value for this option by using HPARAM_CANDIDATES and specifying an array. For example, BOOSTER_TYPE = HPARAM_CANDIDATES(['GBTREE', 'DART']).

DART_NORMALIZE_TYPE

Syntax

DART_NORMALIZE_TYPE = { { 'TREE' | 'FOREST'} | HPARAM_CANDIDATES([candidates]) }

Description

The type of normalization algorithm to use if you are using the DART booster.

Arguments

This option accepts the following values:

  • TREE: New trees have the same weight of each of the dropped trees. This is the default value.
  • FOREST: New trees have the same weight of the sum of the dropped trees (forest).

If you are running hyperparameter training, you can provide more than one value for this option by using HPARAM_CANDIDATES and specifying an array. For example, DART_NORMALIZE_TYPE = HPARAM_CANDIDATES(['TREE', 'FOREST']).

NUM_PARALLEL_TREE

Syntax

NUM_PARALLEL_TREE = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description

The number of parallel trees constructed during each iteration. To train a boosted random forest model, set this value to larger than 1.

Arguments

If you aren't running hyperparameter tuning, then you can specify an INT64 value. The default value is 1.

If you are running hyperparameter tuning, use one of the following options:

  • The HPARAM_RANGE keyword and two INT64 values that define the range of the hyperparameter. For example, NUM_PARALLEL_TREE = HPARAM_RANGE(1, 5).
  • The HPARAM_CANDIDATES keyword and an array of INT64 values that provide discrete values to use for the hyperparameter. For example, NUM_PARALLEL_TREE = HPARAM_CANDIDATES([0, 1, 3, 5]).

When running hyperparameter tuning, the valid range is (1, ∞], there is no default range, and the scale type is LINEAR.

MAX_TREE_DEPTH

Syntax

MAX_TREE_DEPTH = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description

The maximum depth of a tree.

Arguments

If you aren't running hyperparameter tuning, then you can specify an INT64 value. The default value is 6.

If you are running hyperparameter tuning, use one of the following options:

  • The HPARAM_RANGE keyword and two INT64 values that define the range of the hyperparameter. For example, MAX_TREE_DEPTH = HPARAM_RANGE(0, 4).
  • The HPARAM_CANDIDATES keyword and an array of INT64 values that provide discrete values to use for the hyperparameter. For example, MAX_TREE_DEPTH = HPARAM_CANDIDATES([1, 5, 10, 15]).

When running hyperparameter tuning, the valid range is (1, 20], the default range is (1, 10], and the scale type is LINEAR.

DROPOUT

Syntax

DROPOUT = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description

Specifies the dropout rate, which is the fraction of previous trees to drop during the dropout.

Arguments

If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value between 0 and 1.0. The default value is 0.

If you are running hyperparameter tuning, use one of the following options:

  • The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, DROPOUT = HPARAM_RANGE(0, 0.6).
  • The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, DROPOUT = HPARAM_CANDIDATES([0, 0.1, 0.2, 0.6]).

When running hyperparameter tuning, the valid range is (0, 1.0], there is no default range, and the scale type is LINEAR.

L1_REG

Syntax

L1_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description

The amount of L1 regularization applied.

Arguments

If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value. The default value is 0.

If you are running hyperparameter tuning, then you can use one of the following options:

  • The HPARAM_RANGE keyword and two FLOAT64 values that define the range to use for the hyperparameter. For example, L1_REG = HPARAM_RANGE(0, 5.0).
  • The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, L1_REG = HPARAM_CANDIDATES([0, 1.0, 3.0, 5.0]).

When running hyperparameter tuning, the valid range is (0, ∞), the default range is (0, 10.0], and the scale type is LOG.

L2_REG

Syntax

L2_REG = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description

The amount of L2 regularization applied.

Arguments

If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value. The default value is 1.0.

If you are running hyperparameter tuning, then you can use one of the following options:

  • The HPARAM_RANGE keyword and two FLOAT64 values that define the range to use for the hyperparameter. For example, L2_REG = HPARAM_RANGE(1.5, 5.0).
  • The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, L2_REG = HPARAM_CANDIDATES([0, 1.0, 3.0, 5.0]).

When running hyperparameter tuning, the valid range is (0, ∞), the default range is (0, 10.0], and the scale type is LOG.

LEARN_RATE

Syntax

LEARN_RATE = { float64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }

Description

LEARN_RATE is the step size shrinkage used in updates to prevents overfitting. After each boosting step, LEARN_RATE shrinks the feature weights to make the boosting process more conservative.

Arguments

If you aren't running hyperparameter tuning, then you can specify a FLOAT64 value. The default value is 0.3.

If you are running hyperparameter tuning, use one of the following options:

  • The HPARAM_RANGE keyword and two FLOAT64 values that define the range of the hyperparameter. For example, LEARN_RATE = HPARAM_RANGE(0, 0.5).
  • The HPARAM_CANDIDATES keyword and an array of FLOAT64 values that provide discrete values to use for the hyperparameter. For example, LEARN_RATE = HPARAM_CANDIDATES([0, 0.1, 0.3, 0.5]).

When running hyperparameter tuning, the valid range is (0, ∞], the default range is (0, 1.0], and the scale type is LINEAR.

TREE_METHOD

Syntax

TREE_METHOD = { { 'AUTO' | 'EXACT' | 'APPROX' | 'HIST'} | HPARAM_CANDIDATES([candidates]) }

Description

The type of tree construction algorithm.

HIST is recommended for large datasets in order to increase training speed and reduce resource consumption. For more information, see tree booster.

Arguments

This option accepts the fo