# The CREATE MODEL statement for Autoencoder models

To create an Autoencoder model in BigQuery, use the BigQuery ML CREATE MODEL statement with the AUTOENCODER model type. A sample query can be found in the Example section below.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

## CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[OPTIONS(MODEL_TYPE = { 'AUTOENCODER' },
ACTIVATION_FN = { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' },
BATCH_SIZE = int64_value,
DROPOUT = float64_value,
EARLY_STOP = { TRUE | FALSE },
HIDDEN_UNITS = int_array,
L1_REG_ACTIVATION = float64_value,
LEARN_RATE = float64_value,
MAX_ITERATIONS = int64_value,
MIN_REL_PROGRESS = float64_value,
OPTIMIZER = { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' },
WARM_START = { TRUE | FALSE }
)];


### CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. If a model with the same name already exists, CREATE MODEL returns an error.

### CREATE MODEL IF NOT EXISTS

Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.

### CREATE OR REPLACE MODEL

Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.

### Model options

BigQuery ML supports the following options. model_name and MODEL_TYPE are required; all other options are optional.

#### model_name

model_name is the name of the BigQuery ML model that you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

• Contain up to 1,024 characters
• Contain letters (uppercase or lowercase), numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, then prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example:

`myproject.mydataset.mymodel`

#### MODEL_TYPE

Syntax

MODEL_TYPE = { 'AUTOENCODER' }


Description

Specifies the model type. This option is required.

#### ACTIVATION_FN

Syntax

ACTIVATION_FN =  { 'RELU' | 'RELU6' | 'CRELU' | 'ELU' | 'SELU' | 'SIGMOID' | 'TANH' }


Description

For Autoencoder models, specifies the activation function of the neural network.

Arguments

One of the following activation functions: 'RELU', 'RELU6', 'CRELU', 'ELU', 'SELU', 'SIGMOID', or 'TANH'.

The default value is 'RELU'.

#### BATCH_SIZE

Syntax

BATCH_SIZE = int64_value


Description

For Autoencoder models, specifies the mini batch size of samples that are fed to the neural network.

Arguments

A positive INT64 value with a maximum value of 8192.

The default value is 32 or the number of samples, whichever is smaller.
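The default rule can be sketched as follows; `effective_batch_size` is a hypothetical helper for illustration, not a BigQuery ML API:

```python
def effective_batch_size(num_samples, batch_size=None):
    """Sketch of the documented BATCH_SIZE behavior (illustrative only)."""
    if batch_size is None:
        # Default: 32 or the number of samples, whichever is smaller.
        return min(32, num_samples)
    if not 1 <= batch_size <= 8192:
        raise ValueError("BATCH_SIZE must be a positive value of at most 8192")
    return batch_size
```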

#### DROPOUT

Syntax

DROPOUT = float64_value


Description

For Autoencoder models, specifies the dropout rate of units in the neural network.

Arguments

A valid value must be non-negative and not greater than 1.0. The default value is 0.0.
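As a rough illustration of what the dropout rate controls, the following sketch zeroes each unit's activation with probability `rate` during training; `apply_dropout` is a hypothetical helper, not part of BigQuery ML:

```python
import random

def apply_dropout(values, rate, rng=None):
    """Training-time dropout sketch: each unit is zeroed with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("DROPOUT must be between 0.0 and 1.0")
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < rate else v for v in values]
```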

#### EARLY_STOP

Syntax

EARLY_STOP = { TRUE | FALSE }


Description

Specifies whether training stops after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.

Arguments

The value is a BOOL. The default value is TRUE.

#### HIDDEN_UNITS

Syntax

HIDDEN_UNITS = int_array


Description

For Autoencoder models, specifies the hidden layers of the neural network.

Arguments

An array of integers that represents the architecture of the hidden layers, where all layers are fully connected. If not specified, BigQuery ML applies a single hidden layer with the number of nodes calculated as [min(128, num_samples / (10 * (num_input_units + num_output_units)))]. The upper bound of this rule helps ensure that the model doesn't overfit.

The number in the middle of the array defines the shape of the latent space. For example, hidden_units=[128, 64, 4, 64, 128] defines a four-dimensional latent space.

The number of layers in hidden_units must be odd, and we recommend that the sequence be symmetrical.

Example

HIDDEN_UNITS = [128, 64, 8, 64, 128]


This example presents an architecture of five hidden layers, where the encoder layers are [128, 64] in sequence, the decoder layers are [64, 128], and the latent space has a dimensionality of 8.
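The two rules above — the middle entry sets the latent dimensionality, and the default single-layer size follows the documented formula — can be sketched as follows. Both helper names are illustrative, not BigQuery ML APIs:

```python
def latent_dimension(hidden_units):
    """Return the latent-space size: the middle entry of the hidden-layer array."""
    if len(hidden_units) % 2 == 0:
        raise ValueError("hidden_units must contain an odd number of layers")
    return hidden_units[len(hidden_units) // 2]

def default_hidden_layer_size(num_samples, num_input_units, num_output_units):
    """Sketch of the documented default: min(128, num_samples / (10 * (in + out)))."""
    return min(128, num_samples // (10 * (num_input_units + num_output_units)))
```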

#### L1_REG_ACTIVATION

Syntax

L1_REG_ACTIVATION = float64_value

Description

The L1 regularizer, which applies a penalty to the activations in the latent space. This is a common approach to achieving high sparsity in the data representations after dimensionality reduction.

Arguments

float64_value is a FLOAT64. The default value is 0.

#### LEARN_RATE

Syntax

LEARN_RATE = float64_value

Description

The initial learn rate for training.

Arguments

float64_value is a FLOAT64. The default value is 0.001.

#### MAX_ITERATIONS

Syntax

MAX_ITERATIONS = int64_value

Description

The maximum number of training iterations, where one iteration represents a single pass of all of the training data.

Arguments

int64_value is an INT64. The default value is 20.

#### MIN_REL_PROGRESS

Syntax

MIN_REL_PROGRESS = float64_value

Description

The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to TRUE. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

Arguments

float64_value is a FLOAT64. The default value is 0.01.
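The early-stopping check described above can be sketched as a simple comparison; `should_stop` is a hypothetical helper that mirrors the documented behavior, not a BigQuery ML function:

```python
def should_stop(prev_loss, curr_loss, min_rel_progress=0.01):
    """Return True when relative loss improvement falls below the threshold."""
    rel_improvement = (prev_loss - curr_loss) / prev_loss
    return rel_improvement < min_rel_progress
```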

#### OPTIMIZER

Syntax

OPTIMIZER =  { 'ADAGRAD' | 'ADAM' | 'FTRL' | 'RMSPROP' | 'SGD' }


Description

For Autoencoder models, specifies the optimizer for training the model.

Arguments

One of the following optimizers: 'ADAGRAD', 'ADAM', 'FTRL', 'RMSPROP', or 'SGD'.

The default value is 'ADAM'.

#### WARM_START

Syntax

WARM_START = { TRUE | FALSE }


Description

Retrains a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.

In a warm start run, the iteration numbers are reset to start from zero. The TRAINING_RUN number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.

Also, in a warm start, the values of the MODEL_TYPE and HIDDEN_UNITS options and the schema of the training data must remain the same as they were in the previous training job. The warm_start option is only supported when retraining LINEAR_REG, LOGISTIC_REG, KMEANS, DNN_REGRESSOR, DNN_CLASSIFIER, and AUTOENCODER models.

Arguments

Accepts a BOOL. The default value is FALSE.

#### Internal default options

BigQuery ML uses the following default values internally when building Autoencoder models:

loss_reduction = losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE

batch_norm = False


## Supported machine learning functions

The following machine learning functions are supported for Autoencoder models:

• ML.EVALUATE
• ML.PREDICT
• ML.DETECT_ANOMALIES

## Example

The following example trains an Autoencoder model against 'mytable'.

CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(MODEL_TYPE='AUTOENCODER',
ACTIVATION_FN = 'RELU',
BATCH_SIZE = 16,
DROPOUT = 0.1,
EARLY_STOP = FALSE,
HIDDEN_UNITS = [128, 64, 8, 64, 128],
LEARN_RATE=0.001,
MAX_ITERATIONS = 50,
OPTIMIZER = 'ADAM')
AS SELECT * FROM `project_id.mydataset.mytable`;
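A statement like this can also be assembled and submitted programmatically. The following is a minimal sketch: `build_autoencoder_ddl` and the project, dataset, and table names are illustrative placeholders, not part of any Google client library.

```python
def build_autoencoder_ddl(model_path, table_path, options):
    """Assemble a CREATE MODEL statement string (sketch; names are placeholders)."""
    opts = ",\n        ".join(f"{key} = {value}" for key, value in options.items())
    return (
        f"CREATE OR REPLACE MODEL `{model_path}`\n"
        f"OPTIONS(MODEL_TYPE = 'AUTOENCODER',\n        {opts})\n"
        f"AS SELECT * FROM `{table_path}`"
    )

ddl = build_autoencoder_ddl(
    "myproject.mydataset.mymodel",
    "myproject.mydataset.mytable",
    {"HIDDEN_UNITS": "[128, 64, 8, 64, 128]", "MAX_ITERATIONS": 50},
)

# To execute the statement (requires google-cloud-bigquery and credentials):
#   from google.cloud import bigquery
#   bigquery.Client().query(ddl).result()
```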