The CREATE MODEL statement for K-means models
This document describes the CREATE MODEL
statement for creating
k-means
models in BigQuery. K-means models
support hyperparameter tuning.
You can use k-means models with the
ML.PREDICT
function
to cluster data, and you can use k-means models with the
ML.DETECT_ANOMALIES
function
to perform anomaly detection.
For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name OPTIONS(model_option_list) AS query_statement model_option_list: MODEL_TYPE = { 'KMEANS' }, [, NUM_CLUSTERS = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) } ] [, KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' } ] [, KMEANS_INIT_COL = string_value ] [, DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' } ] [, STANDARDIZE_FEATURES = { TRUE | FALSE } ] [, MAX_ITERATIONS = int64_value ] [, EARLY_STOP = { TRUE | FALSE } ] [, MIN_REL_PROGRESS = float64_value ] [, WARM_START = { TRUE | FALSE } ] [, NUM_TRIALS = int64_value ] [, MAX_PARALLEL_TRIALS = int64_value ] [, HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' } ] [, HPARAM_TUNING_OBJECTIVES = 'DAVIES_BOULDIN_INDEX' ] [, MODEL_REGISTRY = { 'VERTEX_AI' } ] [, VERTEX_AI_MODEL_ID = string_value ] [, VERTEX_AI_MODEL_VERSION_ALIASES = string_array ] [, KMS_KEY_NAME = string_value ]
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name
is not case-sensitive.
If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example, `myproject.mydataset.mymodel`.
MODEL_TYPE
Syntax
MODEL_TYPE = { 'KMEANS' }
Description
Specify the model type. This option is required.
Arguments
Specify KMEANS
to use k-means clustering for data segmentation; for example,
identifying customer segments. K-means is an unsupervised learning technique, so
model training does not require labels or split data for training or
evaluation.
NUM_CLUSTERS
Syntax
NUM_CLUSTERS = { int64_value | HPARAM_RANGE(range) | HPARAM_CANDIDATES([candidates]) }
Description
The number of clusters to identify in the input data.
Arguments
If you aren't running hyperparameter tuning, then you can specify an INT64
value between 2
and 100
. The default value is
log10(n)
, where n
is the number of training examples.
If you are running hyperparameter tuning, use one of the following options:
- The
HPARAM_RANGE
keyword and twoINT64
values that define the range of the hyperparameter. For example,NUM_CLUSTERS = HPARAM_RANGE(2, 25)
. - The
HPARAM_CANDIDATES
keyword and an array ofINT64
values that provide discrete values to use for the hyperparameter. For example,NUM_CLUSTERS = HPARAM_CANDIDATES([5, 10, 50, 100])
.
When running hyperparameter tuning, the valid range is [2, 100]
,
the default range is [2, 10]
, and the scale type is LINEAR
.
KMEANS_INIT_METHOD
Syntax
KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' }
Description
The method of initializing the clusters.
To use the same centroids in repeated CREATE MODEL
queries, specify the option
'CUSTOM'
.
Arguments
This option accepts the following values:
RANDOM
: Initializes the centroids by randomly selecting a number of data points equal to theNUM_CLUSTERS
value from the input data. This is the default value.KMEANS++
: Initializes a number of centroids equal to theNUM_CLUSTERS
value by using the k-means++ algorithm. Using this approach usually trains a better model than using random cluster initialization.CUSTOM
: Initializes the centroids using a provided column of typeBOOL
. BigQuery ML uses the rows with a value ofTRUE
as the initial centroids. You specify the column to use by using theKMEANS_INIT_COL
option.When you use this option, if the values in the column identified by
'KMEANS_INIT_COL'
remain constant, then repeatedCREATE MODEL
queries use the same centroids.
KMEANS_INIT_COL
Syntax
KMEANS_INIT_COL = string_value
Description
The name of the column to use to initialize the centroids. This column must have
a type of BOOL
. If this column contains a value of TRUE
for a given row,
then BigQuery ML uses that row as an initial centroid. The number
of TRUE
rows in this column must be equal to the value you have specified
for the NUM_CLUSTERS
option.
You can only use this option if you have specified CUSTOM
for the
KMEANS_INIT_METHOD
option.
You can't use this column as a feature; BigQuery ML automatically excludes it.
Arguments
A STRING
value.
DISTANCE_TYPE
Syntax
DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' }
Description
The type of metric to use to compute the distance between two points.
Arguments
This option accepts the following values:
EUCLIDEAN
: Use the following equation to calculate the distance between pointsx
andy
:$$ \lVert x-y\rVert_{2} $$This is the default value.
COSINE
: Use the following equation to calculate the distance between pointsx
andy
:$$ \sqrt{1-\frac{x \cdot y}{\lVert x\rVert_{2}\lVert y\rVert_{2}}} $$where \( \lVert x\rVert_{2} \) represents the L2 norm for
x
.
STANDARDIZE_FEATURES
Syntax
STANDARDIZE_FEATURES = { TRUE | FALSE }
Description
Determines whether to standardize numerical features. This setting doesn't affect automatic preprocessing of non-numerical features.
Arguments
A BOOL
value. The default value is TRUE
.
MAX_ITERATIONS
Syntax
MAX_ITERATIONS = int64_value
Description
The maximum number of training iterations, where one iteration represents a single pass of the entire training data.
Arguments
An INT64
value. The default value is 20
.
EARLY_STOP
Syntax
EARLY_STOP = { TRUE | FALSE }
Description
Determines whether training should stop after the first iteration in which the
relative loss improvement is less than the value specified for
MIN_REL_PROGRESS
.
Arguments
A BOOL
value. The default value is TRUE
.
MIN_REL_PROGRESS
Syntax
MIN_REL_PROGRESS = float64_value
Description
The minimum relative loss improvement that is necessary to continue training
when EARLY_STOP
is set to TRUE
. For example, a value of
0.01
specifies that each iteration must reduce the loss by 1% for training
to continue.
Arguments
A FLOAT64
value. The default value is 0.01
.
WARM_START
Syntax
WARM_START = { TRUE | FALSE }
Description
Determines whether to train a model with new training data, new model options, or both. Unless you explicitly override them, the initial options used to train the model are used for the warm start run.
In a warm start run, the iteration numbers are reset to start from zero. Use
the training run or iteration information returned by the
ML.TRAINING_INFO
function
to distinguish the warm start run from the original run.
The value of the MODEL_TYPE
and the training data schema must remain
constant in a warm start models retrain.
Arguments
A BOOL
value. The default value is FALSE
.
NUM_TRIALS
Syntax
NUM_TRIALS = int64_value
Description
The maximum number of submodels to train. The tuning stops when NUM_TRIALS
submodels are trained, or when the hyperparameter search space is exhausted.
You must specify this option in order to use hyperparameter tuning.
Arguments
An INT64
value between 1
and 100
, inclusive.
MAX_PARALLEL_TRIALS
Syntax
MAX_PARALLEL_TRIALS = int64_value
Description
The maximum number of trials to run at the same time. If you specify a value
for this option, you must also specify a value for
NUM_TRIALS
.
Arguments
An INT64
value between 1
and 5
, inclusive. The default value is 1
.
HPARAM_TUNING_ALGORITHM
Syntax
HPARAM_TUNING_ALGORITHM = { 'VIZIER_DEFAULT' | 'RANDOM_SEARCH' | 'GRID_SEARCH' }
Description
The algorithm used to tune the hyperparameters. If you specify a value
for this option, you must also specify a value for NUM_TRIALS
.
Arguments
Specify one of the following values:
VIZIER_DEFAULT
: Use the default algorithm in Vertex AI Vizier to tune hyperparameters. This algorithm is the most powerful algorithm of those offered. It performs a mixture of advanced search algorithms, including Bayesian optimization with Gaussian processes. It also uses transfer learning to take advantage of previously tuned models. This is the default, and also the recommended approach.RANDOM_SEARCH
: Use random search to explore the search space.GRID_SEARCH
: Use grid search to explore the search space. You can only use this algorithm when every hyperparameter's search space is discrete.
HPARAM_TUNING_OBJECTIVES
Syntax
HPARAM_TUNING_OBJECTIVES = { 'DAVIES_BOULDIN_INDEX' }
Description
The hyperparameter tuning objective for the model. If you specify a value
for this option, you must also specify a value for NUM_TRIALS
.
Arguments
The possible objectives are a subset of the
model evaluation metrics
for the model type. If you aren't running hyperparameter tuning, or if you are
and you don't specify an objective, the DAVIES_BOULDIN_INDEX
objective is
used.
MODEL_REGISTRY
The MODEL_REGISTRY
option specifies the model registry destination.
VERTEX_AI
is the only supported model registry destination. To learn more, see
Register a BigQuery ML model.
VERTEX_AI_MODEL_ID
The VERTEX_AI_MODEL_ID
option specifies the Vertex AI model ID
to register the model with.
You can only set the VERTEX_AI_MODEL_ID
option when the MODEL_REGISTRY
option is set to VERTEX_AI
. To learn more, see
Add a Vertex AI model ID.
VERTEX_AI_MODEL_VERSION_ALIASES
The VERTEX_AI_MODEL_VERSION_ALIASES
option specifies the
Vertex AI model alias to register the model with.
You can only set the VERTEX_AI_MODEL_VERSION_ALIASES
option when the
MODEL_REGISTRY
option is set to VERTEX_AI
. To learn more, see
Add a Vertex AI model ID.
KMS_KEY_NAME
Syntax
KMS_KEY_NAME = string_value
Description
The Cloud Key Management Service customer-managed encryption key (CMEK) to use to encrypt the model.
Arguments
A STRING
value containing the fully-qualified name of the CMEK. For example,
'projects/my_project/locations/my_location/keyRings/my_ring/cryptoKeys/my_key'
query_statement
The AS query_statement
clause specifies the
GoogleSQL query used to generate the training data. See the
GoogleSQL query syntax
page for the supported SQL syntax of the query_statement
clause.
CREATE MODEL
examples
The following examples create models named mymodel
in mydataset
in your
default project.
Train a k-means model
This example creates a k-means model with four clusters using the default
distance_type
value of euclidean_distance
.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='KMEANS', NUM_CLUSTERS=4 ) AS SELECT * FROM `mydataset.mytable`
Train a k-means model with random cluster initialization method.
This example creates a k-means model with three clusters using the random cluster initialization method.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='KMEANS', NUM_CLUSTERS=3, KMEANS_INIT_METHOD='RANDOM') AS SELECT * FROM `mydataset.mytable`
Train a k-means model with custom cluster initialization method
This example creates a k-means model with three clusters using the custom
cluster initialization method. init_col
identifies the column of type BOOL
that contains the values which specify whether a given row is an initial
centroid. This column should only contain three rows with the value TRUE
.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='KMEANS', NUM_CLUSTERS=3, KMEANS_INIT_METHOD='CUSTOM', KMEANS_INIT_COL='init_col') AS SELECT init_col, features FROM `mydataset.mytable`