# The CREATE MODEL statement for Wide-and-Deep models

## CREATE MODEL statement for Wide-and-Deep models

To create a Wide-and-Deep model in BigQuery, use the BigQuery ML CREATE MODEL statement with the DNN_LINEAR_COMBINED_CLASSIFIER or DNN_LINEAR_COMBINED_REGRESSOR model types. These models are built using TensorFlow estimators.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

## CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[OPTIONS(MODEL_TYPE = { 'DNN_LINEAR_COMBINED_CLASSIFIER' | 'DNN_LINEAR_COMBINED_REGRESSOR' },
ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' },
AUTO_CLASS_WEIGHTS = { TRUE | FALSE },
BATCH_SIZE = int64_value,
CLASS_WEIGHTS = struct_array,
DROPOUT = float64_value,
EARLY_STOP = { TRUE | FALSE },
HIDDEN_UNITS = int_array,
L1_REG = float64_value,
L2_REG = float64_value,
LEARN_RATE = float64_value,
INPUT_LABEL_COLS = string_array,
MAX_ITERATIONS = int64_value,
MIN_REL_PROGRESS = float64_value,
OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' },
WARM_START = { TRUE | FALSE },
DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' },
DATA_SPLIT_EVAL_FRACTION = float64_value,
DATA_SPLIT_COL = string_value,
ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE },
INTEGRATED_GRADIENTS_NUM_STEPS = int64_value,
)];


### CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. If a model with the same name already exists, CREATE MODEL returns an error.

### CREATE MODEL IF NOT EXISTS

Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.

### CREATE OR REPLACE MODEL

Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.

### Model Options

BigQuery ML supports the following options. The model_name and MODEL_TYPE options are required; all others are optional.

#### model_name

model_name is the name of the BigQuery ML model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:

• Up to 1,024 characters
• Letters of either case, numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in the following format, including the backticks:

 `[PROJECT_ID].[DATASET].[MODEL]` 

For example:

 `myproject.mydataset.mymodel` 

#### MODEL_TYPE

Syntax

MODEL_TYPE = { 'DNN_LINEAR_COMBINED_CLASSIFIER' | 'DNN_LINEAR_COMBINED_REGRESSOR' }


Description

Specifies the model type. This option is required.

#### ACTIVATION_FN

Syntax

ACTIVATION_FN =  { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' }


Description

For Wide-and-Deep model types, specifies the activation function of the neural network.

Arguments

The supported activation functions are 'RELU', 'RELU6', 'CRELU', 'ELU', 'SELU', 'SIGMOID', and 'TANH'.

The default value is 'RELU'.

#### AUTO_CLASS_WEIGHTS

Syntax

AUTO_CLASS_WEIGHTS = { TRUE | FALSE }


Description

Whether to balance class labels using weights for each class in inverse proportion to the frequency of that class.

Use only with the DNN_LINEAR_COMBINED_CLASSIFIER model.

By default, the training data that is used to create a multiclass Wide-and-Deep Classifier model is unweighted. If the labels in the training data are imbalanced, the model might learn to predict the most popular class of labels more heavily, which might not be desired.

When this option is enabled, the weight for each class is calculated with the following formula, so that every class is balanced:

total_input_rows / (input_rows_for_class_n * number_of_unique_classes)


Arguments

Accepts a BOOL. The default value is FALSE.
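As an illustration, a classifier trained on an imbalanced label can enable automatic weighting as in the following sketch. The table, model, and label names are placeholders.

```sql
-- Sketch: let BigQuery ML weight classes inversely to their frequency.
-- `mydataset.mytable`, `my_weighted_model`, and `mylabel` are placeholder names.
CREATE OR REPLACE MODEL `mydataset.my_weighted_model`
OPTIONS(MODEL_TYPE = 'DNN_LINEAR_COMBINED_CLASSIFIER',
        AUTO_CLASS_WEIGHTS = TRUE,
        INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `mydataset.mytable`;
```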

#### BATCH_SIZE

Syntax

BATCH_SIZE = int64_value


Description

For Wide-and-Deep model types, specifies the mini-batch size of samples that are fed to the neural network.

Arguments

A positive INT64 value that is no greater than 8192.

The default value is 32 or the number of samples, whichever is smaller.

#### CLASS_WEIGHTS

Syntax

CLASS_WEIGHTS = struct_array

Description

The weights to use for each class label. This option cannot be specified if AUTO_CLASS_WEIGHTS is TRUE.

Arguments

struct_array is an ARRAY of STRUCTs; each STRUCT contains a STRING that contains the class label and a FLOAT64 containing the weight for that class label. A weight must be present for every class label. The weights are not required to add up to one. For example:

CLASS_WEIGHTS = [STRUCT('example_label', .2)]


#### DROPOUT

Syntax

DROPOUT = float64_value


Description

For Wide-and-Deep model types, specifies the dropout rate of units in the neural network.

Arguments

The value must be non-negative and less than 1.0. The default value is 0.0.

#### EARLY_STOP

Syntax

EARLY_STOP = { TRUE | FALSE }


Description

Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.

Arguments

The value is a BOOL. The default value is TRUE.

#### HIDDEN_UNITS

Syntax

HIDDEN_UNITS = int_array


Description

For Wide-and-Deep model types, specifies the hidden layers of the neural network.

Arguments

An array of integers that represents the architecture of the hidden layers. If not specified, BigQuery ML applies a single hidden layer that contains no more than 128 units. The number of units is calculated based on a variety of aspects, such as feature columns and number of categorical values.

Example

HIDDEN_UNITS = [256, 128, 64]


This example represents an architecture of 3 hidden layers with 256, 128, and 64 nodes respectively.

#### INPUT_LABEL_COLS

Syntax

 INPUT_LABEL_COLS = string_array 

Description

The label column name(s) in the training data.

Arguments

string_array is an ARRAY of STRINGs. The Wide-and-Deep model types only support string_array values that contain one element. Defaults to 'label'.

#### L1_REG

Syntax

 L1_REG = float64_value 

Description

The L1 regularization strength of the OPTIMIZER. This option is only available when OPTIMIZER is set to one of the following values: 'ADAGRAD', 'FTRL', 'RMSPROP', or 'SGD'.

Arguments

float64_value is a FLOAT64. The default value is 0.

#### L2_REG

Syntax

 L2_REG = float64_value 

Description

The L2 regularization strength of the OPTIMIZER. This option is only available when OPTIMIZER is set to one of the following values: 'ADAGRAD', 'FTRL', 'RMSPROP', or 'SGD'.

Arguments

float64_value is a FLOAT64. The default value is 0.
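As a sketch, both regularization options can be combined with an optimizer that supports them; the model, table, and label names below are placeholders, and the regularization strengths are arbitrary illustrative values.

```sql
-- Sketch: apply L1 and L2 regularization with the FTRL optimizer.
-- Names and regularization strengths are placeholder values.
CREATE OR REPLACE MODEL `mydataset.my_regularized_model`
OPTIONS(MODEL_TYPE = 'DNN_LINEAR_COMBINED_REGRESSOR',
        OPTIMIZER = 'FTRL',
        L1_REG = 0.1,
        L2_REG = 0.01,
        INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `mydataset.mytable`;
```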

#### LEARN_RATE

Syntax

 LEARN_RATE = float64_value | struct_array 

Description

The initial learn rate for training.

Arguments

LEARN_RATE is a polymorphic option that can be specified as a single value (used for the entire architecture), or an array of structs (used for different parts of the model).

• float64_value is a FLOAT64 that specifies the initial learn rate for both the linear and DNN parts.

For example:

LEARN_RATE = 0.01

• struct_array is an ARRAY of STRUCTs. Each STRUCT contains a STRING and a FLOAT64, where the STRING identifies the part of the model ('linear' or 'dnn') and the FLOAT64 is the initial learn rate for that part.

For example:

LEARN_RATE = [STRUCT('dnn', 0.001), STRUCT('linear', 0.01)]


The default value is 0.001 for both linear and dnn parts.

#### MAX_ITERATIONS

Syntax

 MAX_ITERATIONS = int64_value 

Description

The maximum number of training iterations, where one iteration represents a single pass of the entire training data.

Arguments

int64_value is an INT64. The default value is 20.

#### MIN_REL_PROGRESS

Syntax

MIN_REL_PROGRESS = float64_value

Description

The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

Arguments

float64_value is a FLOAT64. The default value is 0.01.
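The early-stopping options work together, as in the following sketch: training stops as soon as an iteration improves loss by less than 5%, and never runs more than 30 iterations. Model, table, and label names are placeholders.

```sql
-- Sketch: stop when an iteration improves loss by less than 5% (0.05),
-- with a hard cap of 30 iterations. Names are placeholder values.
CREATE OR REPLACE MODEL `mydataset.my_early_stop_model`
OPTIONS(MODEL_TYPE = 'DNN_LINEAR_COMBINED_REGRESSOR',
        EARLY_STOP = TRUE,
        MIN_REL_PROGRESS = 0.05,
        MAX_ITERATIONS = 30,
        INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `mydataset.mytable`;
```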

#### OPTIMIZER

Syntax

 OPTIMIZER = string_value | struct_array 

Description

For Wide-and-Deep model types, specifies the optimizer for training the model.

Arguments

OPTIMIZER is a polymorphic option that can be specified as a single value (used for the entire architecture) or as an array of structs (used for different parts of the model):

• string_value is a STRING with one of the optimizer values shown above, and applies to both the linear and DNN parts.

For example:

OPTIMIZER = 'ADAGRAD'

• struct_array is an ARRAY of STRUCTs. Each STRUCT contains two STRINGs, where the first string identifies the part of the model ('linear' or 'dnn') and the second is one of the optimizer values shown above.

For example:

OPTIMIZER = [STRUCT('dnn', 'ADAGRAD'), STRUCT('linear', 'SGD')]


The default value of OPTIMIZER is [STRUCT('dnn', 'ADAM'), STRUCT('linear', 'FTRL')].

#### WARM_START

Syntax

WARM_START = { TRUE | FALSE }


Description

Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.

In a warm start run, the iteration numbers are reset to start from zero. The TRAINING_RUN number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.

The values of the MODEL_TYPE, LABELS, and HIDDEN_UNITS options, and the training data schema, must remain the same as in the previous training job in a warm start. The WARM_START option is only supported for retraining LINEAR_REG, LOGISTIC_REG, KMEANS, DNN_REGRESSOR, DNN_CLASSIFIER, DNN_LINEAR_COMBINED_REGRESSOR, DNN_LINEAR_COMBINED_CLASSIFIER, and AUTOENCODER models.

Arguments

Accepts a BOOL. The default value is FALSE.
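A warm-start retrain on refreshed data might look like the following sketch; the model, table, and label names are placeholders, and the options other than WARM_START must match the original training run where required.

```sql
-- Sketch: retrain an existing model on refreshed data while keeping
-- MODEL_TYPE, HIDDEN_UNITS, the labels, and the schema unchanged.
-- All names are placeholder values.
CREATE OR REPLACE MODEL `mydataset.mymodel`
OPTIONS(MODEL_TYPE = 'DNN_LINEAR_COMBINED_CLASSIFIER',
        HIDDEN_UNITS = [64, 32],
        WARM_START = TRUE,
        INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `mydataset.mytable_refreshed`;
```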

#### DATA_SPLIT_METHOD

Syntax

DATA_SPLIT_METHOD = { 'AUTO_SPLIT' | 'RANDOM' | 'CUSTOM' | 'SEQ' | 'NO_SPLIT' }


Description

The method to split input data into training and evaluation sets. Training data is used to train the model. Evaluation data is used to avoid overfitting through early stopping.

Arguments

Accepts the following values:

'AUTO_SPLIT' The automatic split strategy is as follows:

• When there are fewer than 500 rows in the input data, all rows are used as training data.
• When there are between 500 and 50,000 rows in the input data, 20% of the data is used as evaluation data in a RANDOM split.
• When there are more than 50,000 rows in the input data, only 10,000 rows are used as evaluation data in a RANDOM split.

'RANDOM' Split data randomly. A random split is deterministic: different training runs produce the same split results if the underlying training data remains the same.

'CUSTOM' Split data using a column of type BOOL that you provide. The rows with a value of TRUE are used as evaluation data. The rows with a value of FALSE are used as training data.

'SEQ' Split data sequentially using a column that you provide. The column can have any orderable data type: NUMERIC, BIGNUMERIC, STRING, or TIMESTAMP. All rows with split values smaller than the threshold are used as training data, and the remaining rows, including rows with NULL values, are used as evaluation data.

'NO_SPLIT' Use all data as training data.

#### DATA_SPLIT_EVAL_FRACTION

Syntax

 DATA_SPLIT_EVAL_FRACTION = float64_value 

Description

This option is used with 'RANDOM' and 'SEQ' splits. It specifies the fraction of the data used for evaluation, accurate to two decimal places.

Arguments

float64_value is a FLOAT64. The default value is 0.2.

#### DATA_SPLIT_COL

Syntax

DATA_SPLIT_COL = string_value

Description

Identifies the column used to split the data. This column cannot be used as a feature or label, and is excluded from features automatically.

• When the value of DATA_SPLIT_METHOD is 'CUSTOM', the corresponding column should be of type BOOL. The rows with TRUE or NULL values are used as evaluation data. Rows with FALSE values are used as training data.

• When the value of DATA_SPLIT_METHOD is 'SEQ', the last n rows from smallest to largest in the corresponding column are used as evaluation data, where the number of rows n is determined by the fraction specified in DATA_SPLIT_EVAL_FRACTION. The first rows are used as training data.
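The split options combine as in the following sketch, which holds out roughly the most recent 20% of rows by a timestamp column for evaluation. The model, table, label, and timestamp column names are placeholders.

```sql
-- Sketch: sequential split on a timestamp column; the column is
-- excluded from the features automatically. Names are placeholders.
CREATE OR REPLACE MODEL `mydataset.my_seq_split_model`
OPTIONS(MODEL_TYPE = 'DNN_LINEAR_COMBINED_CLASSIFIER',
        DATA_SPLIT_METHOD = 'SEQ',
        DATA_SPLIT_COL = 'event_ts',
        DATA_SPLIT_EVAL_FRACTION = 0.2,
        INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `mydataset.mytable`;
```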

#### ENABLE_GLOBAL_EXPLAIN

Syntax

 ENABLE_GLOBAL_EXPLAIN = { TRUE | FALSE } 

Description

Specifies whether to compute global explanations using explainable AI to evaluate global feature importance to the model.

Arguments

The value is a BOOL. The default value is FALSE.

#### INTEGRATED_GRADIENTS_NUM_STEPS

Syntax

 INTEGRATED_GRADIENTS_NUM_STEPS = int64_value 

Description

Specifies the number of steps to sample between the example being explained and its baseline for approximating the integral in integrated gradients attribution methods.

Arguments

The value is an INT64. The default value is 50.
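Putting the explainability options together: the following sketch trains with global explanations enabled and then queries global feature importance with the ML.GLOBAL_EXPLAIN function. The model, table, and label names are placeholders.

```sql
-- Sketch: train with global explanations enabled, then inspect
-- global feature importance. Names are placeholder values.
CREATE OR REPLACE MODEL `mydataset.my_explained_model`
OPTIONS(MODEL_TYPE = 'DNN_LINEAR_COMBINED_CLASSIFIER',
        ENABLE_GLOBAL_EXPLAIN = TRUE,
        INPUT_LABEL_COLS = ['mylabel'])
AS SELECT * FROM `mydataset.mytable`;

SELECT * FROM ML.GLOBAL_EXPLAIN(MODEL `mydataset.my_explained_model`);
```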

#### INTERNAL DEFAULT OPTIONS

BigQuery ML also uses the following default values internally when building Wide-and-Deep models:

loss_reduction = losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE

batch_norm = False


## Example

The following example trains a Wide-and-Deep classifier model against 'mytable' with 'mylabel' as the label column.

CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(MODEL_TYPE = 'DNN_LINEAR_COMBINED_CLASSIFIER',
        ACTIVATION_FN = 'RELU',
        BATCH_SIZE = 16,
        DROPOUT = 0.1,
        EARLY_STOP = FALSE,
        HIDDEN_UNITS = [128, 128, 128],
        INPUT_LABEL_COLS = ['mylabel'],
        LEARN_RATE = 0.001,
        MAX_ITERATIONS = 50)
AS SELECT * FROM `project_id.mydataset.mytable`;