The CREATE MODEL statement for Boosted Tree models using XGBoost

CREATE MODEL statement for Boosted Tree models using XGBoost

To create a Boosted Tree model in BigQuery, use the BigQuery ML CREATE MODEL statement with the BOOSTED_TREE_CLASSIFIER or BOOSTED_TREE_REGRESSOR model types. The model is trained using the XGBoost library.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[OPTIONS(MODEL_TYPE = { 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' },
         BOOSTER_TYPE = {'GBTREE' | 'DART'},
         NUM_PARALLEL_TREE = int64_value,
         DART_NORMALIZE_TYPE = {'TREE' | 'FOREST'},
         TREE_METHOD = {'AUTO' | 'EXACT' | 'APPROX' | 'HIST'},
         MIN_TREE_CHILD_WEIGHT = int64_value,
         COLSAMPLE_BYTREE = float64_value,
         COLSAMPLE_BYLEVEL = float64_value,
         COLSAMPLE_BYNODE = float64_value,
         MIN_SPLIT_LOSS = float64_value,
         MAX_TREE_DEPTH = int64_value,
         SUBSAMPLE = float64_value,
         AUTO_CLASS_WEIGHTS = { TRUE | FALSE },
         CLASS_WEIGHTS = struct_array,
         L1_REG = float64_value,
         L2_REG = float64_value,
         EARLY_STOP = { TRUE | FALSE },
         LEARN_RATE = float64_value,
         INPUT_LABEL_COLS = string_array,
         MAX_ITERATIONS = int64_value,
         MIN_REL_PROGRESS = float64_value,
         DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' },
         DATA_SPLIT_EVAL_FRACTION = float64_value,
         DATA_SPLIT_COL = string_value
)];

CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.

model_name

model_name is the name of the BigQuery ML model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:

  • Up to 1,024 characters
  • Letters of either case, numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example:

`myproject.mydataset.mymodel`

CREATE MODEL for Boosted Tree models supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = { 'BOOSTED_TREE_CLASSIFIER' | 'BOOSTED_TREE_REGRESSOR' }

Description

Specifies the model type. This option is required.

AUTO_CLASS_WEIGHTS

Syntax

AUTO_CLASS_WEIGHTS = { TRUE | FALSE }

Description

Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class.

Use only with the BOOSTED_TREE_CLASSIFIER model.

By default, the training data that is used to create a Boosted Tree Classifier model is unweighted. If the labels in the training data are imbalanced, the model might learn to predict the most popular class of labels more heavily, which might not be desired.

To balance every class, use the following formula:

total_input_rows / (input_rows_for_class_n * number_of_unique_classes)

Arguments

Accepts a BOOL. The default value is FALSE.

CLASS_WEIGHTS

Syntax

CLASS_WEIGHTS = struct_array

Description

The weights to use for each class label. This option cannot be specified if AUTO_CLASS_WEIGHTS is TRUE.

Arguments

struct_array is an ARRAY of STRUCTs; each STRUCT contains a STRING that contains the class label and a FLOAT64 containing the weight for that class label. A weight must be present for every class label. The weights are not required to add up to one. For example:

CLASS_WEIGHTS = [STRUCT('example_label', .2)]

EARLY_STOP

Syntax

EARLY_STOP = { TRUE | FALSE }

Description

Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.

Arguments

The value is a BOOL. The default value is TRUE.

INPUT_LABEL_COLS

Syntax

INPUT_LABEL_COLS = string_array

Description

The label column name(s) in the training data.

Arguments

string_array is an ARRAY of STRINGs. Boosted Tree model types only support string_array values that contain one element.

L1_REG

Syntax

L1_REG = float64_value

Description

The amount of L1 regularization applied.

Arguments

float64_value is a FLOAT64. The default value is 0.

L2_REG

Syntax

L2_REG = float64_value

Description

The amount of L2 regularization applied.

Arguments

float64_value is a FLOAT64. The default value is 1.0.

LEARN_RATE

Syntax

LEARN_RATE = float64_value

Description

Learn_rate is the step size shrinkage used in update to prevents overfitting. After each boosting step, learn_rate shrinks the feature weights to make the boosting process more conservative.

Arguments

float64_value is a FLOAT64. The default value is 0.3.

MAX_ITERATIONS

Syntax

MAX_ITERATIONS = int64_value

Description

The maximum number of rounds for boosting.

Arguments

int64_value is an INT64. The default value is 20.

MIN_REL_PROGRESS

Syntax

MIN_REL_PROGRESS = float64_value

Description

The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

Arguments

float64_value is a FLOAT64. The default value is 0.01.

BOOSTER_TYPE

Syntax

BOOSTER_TYPE =  { 'GBTREE' | 'DART'}

Description

For boosted tree models, specify the booster type to use, with default value gbtree.

Arguments

The following options are available:

NUM_PARALLEL_TREE

Syntax

NUM_PARALLEL_TREE = int64_value

Description

Number of parallel trees constructed during each iteration. Default value is 1. To train a boosted random forest, set this value larger than 1.

DART_NORMALIZE_TYPE

Syntax

DART_NORMALIZE_TYPE = { 'TREE' | 'FOREST'}

Description

Type of normalization algorithm for DART booster. Default value is tree.

TREE_METHOD

Syntax

TREE_METHOD = { 'AUTO' | 'EXACT' | 'APPROX' | 'HIST'}

Description

Type of tree construction algorithm. Default value is AUTO.

HIST is recommended for large datasets in order to achieve faster training speed and lower resource consumption. For more information, see tree booster.

MIN_TREE_CHILD_WEIGHT

Syntax

MIN_TREE_CHILD_WEIGHT = int64_value

Description

Minimum sum of instance weight needed in a child for further partitioning. If the tree partition step results in a leaf node with the sum of instance weight less than min_tree_child_weight, then the building process will give up further partitioning. The larger min_tree_child_weight is, the more conservative the algorithm will be. The value should be greater than or equal to 0, with default value 1.

COLSAMPLE_BYTREE

Syntax

COLSAMPLE_BYTREE = float64_value

Description

Subsample ratio of columns when constructing each tree. Subsampling occurs once for every tree constructed. The value should be between 0 and 1, with default value 1.

COLSAMPLE_BYLEVEL

Syntax

COLSAMPLE_BYLEVEL = float64_value

Description

Subsample ratio of columns for each level. Subsampling occurs once for every new depth level reached in a tree. Columns are subsampled from the set of columns chosen for the current tree. The value should be between 0 and 1, with default value 1.

COLSAMPLE_BYNODE

Syntax

COLSAMPLE_BYNODE = float64_value

Description

Subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level. The value should be between 0 and 1, with default value 1.

MIN_SPLIT_LOSS

Syntax

MIN_SPLIT_LOSS = float64_value

Description

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger min_split_loss is, the more conservative the algorithm will be. Default value is 0.

MAX_TREE_DEPTH

Syntax

MAX_TREE_DEPTH = int64_value

Description

Maximum depth of a tree. Default value is 6.

SUBSAMPLE

Syntax

SUBSAMPLE = float64_value

Description

Subsample ratio of the training instances. Setting this value to 0.5 means that training randomly samples half of the training data prior to growing trees, which prevents overfitting. Subsampling will occur once in every boosting iteration. This is independent of the training-test data split used in the training options (80/20 random by default). The test data is not used in any iteration irrespective of subsample; subsample is only applied to the training data. The value should be between 0 and 1, with default value 1.0 (use all of the training data in each iteration).

DATA_SPLIT_METHOD

Syntax

DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }

Description

The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting due to early stopping.

Arguments

Accepts the following values:

'AUTO_SPLIT' The automatic split strategy is as follows:

  • When there are fewer than 500 rows in the input data, all rows are used as training data.
  • When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a RANDOM split.
  • When there are more than 50,000 rows in the input data, only 10,000 rows are used as evaluation data in a RANDOM split.

'RANDOM' Split data randomly. A random split is deterministic: different training runs produce the same split results if the underlying training data remains the same.

'CUSTOM' Split data using a customer-provided column of type BOOL. The rows with a value of TRUE are used as evaluation data. The rows with a value of FALSE are used as training data.

'SEQ' Split data sequentially using a customer-provided column. The column can have any orderable data type: NUMERIC, STRING, or TIMESTAMP. All rows with split values smaller than the threshold are used as training data. The remaining rows including NULL values are used as evaluation data.

'NO_SPLIT' Use all data as training data.

DATA_SPLIT_EVAL_FRACTION

Syntax

DATA_SPLIT_EVAL_FRACTION = float64_value

Description

This option is used with 'RANDOM' and 'SEQ' splits. It specifies the fraction of the data used for evaluation, accurate to two decimal places.

Arguments

float64_value is a FLOAT64. The default value is 0.2.

DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL = string_value

Description

Identifies the column used to split the data. This column cannot be used as a feature or label, and will be excluded from features automatically.

  • When the value of DATA_SPLIT_METHOD is 'CUSTOM', the corresponding column should be of type BOOL. The rows with TRUE or NULL values are used as evaluation data. Rows with FALSE values are used as training data.

  • When the value of DATA_SPLIT_METHOD is 'SEQ', the last n rows from smallest to largest in the corresponding column are used as evaluation data, where n is the value specified for DATA_SPLIT_EVAL_FRACTION. The first rows are used as training data.

Example

The following example trains a Boosted Tree classifier model against 'mytable' with 'mylabel' as the label column.

CREATE MODEL project_id:mydataset.mymodel
OPTIONS(MODEL_TYPE='BOOSTED_TREE_CLASSIFIER',
        BOOSTER_TYPE = 'GBTREE',
        NUM_PARALLEL_TREE = 1,
        MAX_ITERATIONS = 50,
        TREE_METHOD = 'HIST',
        EARLY_STOP = FALSE,
        SUBSAMPLE = 0.85,
        INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM project_id:mydataset.mytable;

Supported regions

Training Boosted Tree models is not supported in all BigQuery ML regions. For a complete list of supported regions and multi-regions, see Regional locations.