The CREATE MODEL statement for K-means models

CREATE MODEL statement for K-means models

To create a K-means model in BigQuery, use the BigQuery ML CREATE MODEL statement with the KMEANS model type.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
[OPTIONS(MODEL_TYPE = { 'KMEANS' },
    NUM_CLUSTERS = int64_value,
    KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' },
    KMEANS_INIT_COL = string_value,
    DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' },
    STANDARDIZE_FEATURES = { TRUE | FALSE },
    MAX_ITERATIONS = int64_value,
    EARLY_STOP = { TRUE | FALSE },
    MIN_REL_PROGRESS = float64_value,
    WARM_START = { TRUE | FALSE }
)];

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

model_name is the name of the model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]` ; for example, `myproject.mydataset.mymodel`.

model_option_list

In the model_option_list, the model_type option is required. All others are optional.

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = { 'KMEANS' }

Description

Specify the model type. This option is required.

Arguments

'KMEANS' K-means clustering for data segmentation; for example, identifying customer segments. K-means is an unsupervised learning technique, so model training does not require labels nor split data for training or evaluation.

NUM_CLUSTERS

Syntax

NUM_CLUSTERS = int64_value

Description

For a k-means model, the number of clusters to identify in the input data.

Arguments

int64_value is an INT64. Allowed values are 2-100. The default value is log10(n), where n is the number of training examples.

KMEANS_INIT_METHOD

Syntax

KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' }

Description

For a k-means model, the method of initializing the clusters.

To use the same centroids in repeated CREATE MODEL queries, specify the option 'CUSTOM'.

Arguments

'RANDOM': Initializes the centroids by randomly selecting NUM_CLUSTERS data points from the input data.

'KMEANS++': Initializes NUM_CLUSTERS centroids by using the KMEANS++ algorithm. 'KMEANS++' trains a better model than 'RANDOM' cluster initialization.

'CUSTOM': Initializes the centroids using a provided column of type BOOL. This column is specified by the option 'KMEANS_INIT_COL'. BigQuery ML uses the rows with a value of TRUE as the initial centroids. When this option is present and the values in 'KMEANS_INIT_COL' are constant, repeated CREATE MODEL queries use the same centroids.

The default value is 'RANDOM'

KMEANS_INIT_COL

Syntax

KMEANS_INIT_COL = string_value

Description

For a k-means model, identifies the column used to initialize the centroids. If this column contains a value of TRUE for a given row, then BigQuery ML uses that row as an initial centroid.

This option can only be specified when 'KMEANS_INIT_METHOD' has the value 'CUSTOM'. The corresponding column must be of type BOOL. Model option NUM_CLUSTERS must be present in the query and its value must equal the total number of TRUE rows in this column. BigQuery ML cannot use this column as a feature and excludes it from features automatically.

Arguments

string_value is a STRING.

DISTANCE_TYPE

Syntax

DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' }

Description

For a k-means model, the type of metric to compute the distance between two points.

Arguments

Accepts the following values:

'EUCLIDEAN' Use the following equation to calculate the distance between point x and y:

$$ \lVert x-y\rVert_{2} $$

'COSINE' Use the following equation calculates the distance:

$$ \sqrt{1-\frac{x \cdot y}{\lVert x\rVert_{2}\lVert y\rVert_{2}}} $$

where \( \lVert x\rVert_{2} \) represents the L2 norm for x.

The default value is 'EUCLIDEAN'.

STANDARDIZE_FEATURES

Syntax

STANDARDIZE_FEATURES = { TRUE | FALSE }

Description

For a kmeans model, whether to standardize numerical features.

Arguments

Accepts a BOOL. The default value is TRUE.

MAX_ITERATIONS

Syntax

MAX_ITERATIONS = int64_value

Description

The maximum number of training iterations or steps.

Arguments

int64_value is an INT64. The default value is 20.

EARLY_STOP

Syntax

EARLY_STOP = { TRUE | FALSE }

Description

Whether training should stop after the first iteration in which the relative loss improvement is less than the value specified for MIN_REL_PROGRESS.

Arguments

The value is a BOOL. The default value is TRUE.

MIN_REL_PROGRESS

Syntax

MIN_REL_PROGRESS = float64_value

Description

The minimum relative loss improvement that is necessary to continue training when EARLY_STOP is set to true. For example, a value of 0.01 specifies that each iteration must reduce the loss by 1% for training to continue.

Arguments

float64_value is a FLOAT64. The default value is 0.01.

WARM_START

Syntax

WARM_START = { TRUE | FALSE }

Description

Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.

In a warm start run, the iteration numbers are reset to start from zero. The TRAINING_RUN number or the TIMESTAMP columns can be used to distinguish the warm start run from the original run.

The value of the MODEL_TYPE and the training data schema must remain constant in a warm start models retrain.

Arguments

Accepts a BOOL. The default value is FALSE.

query_statement

The AS query_statement clause specifies the GoogleSQL query that is used to generate the training data. See the GoogleSQL Query Syntax page for the supported SQL syntax of the query_statement clause.

CREATE MODEL examples

The following examples create models named mymodel in mydataset in your default project.

Train a k-means model

This example creates a k-means model with four clusters using the default distance_type value of euclidean_distance.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='KMEANS',
    NUM_CLUSTERS=4 ) AS
SELECT
  *
FROM `mydataset.mytable`

Train a k-means model with random cluster initialization method.

This example creates a k-means model with three clusters using the random cluster initialization method.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='KMEANS',
    NUM_CLUSTERS=3,
    KMEANS_INIT_METHOD='RANDOM') AS
SELECT
  *
FROM
  `mydataset.mytable`

Train a k-means model with custom cluster initialization method

This example creates a k-means model with three clusters using the custom cluster initialization method. init_col identifies the column of type BOOL that contains the values which specify whether a given row is an initial centroid. This column should only contain three rows with the value TRUE.

CREATE MODEL
  `mydataset.mymodel`
OPTIONS
  ( MODEL_TYPE='KMEANS',
    NUM_CLUSTERS=3,
    KMEANS_INIT_METHOD='CUSTOM',
    KMEANS_INIT_COL='init_col') AS
SELECT
  init_col,
  features
FROM