The CREATE MODEL statement for K-means models
CREATE MODEL
statement for K-means models
To create a K-means model in BigQuery, use the
BigQuery ML CREATE MODEL
statement with the KMEANS
model type.
For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name [OPTIONS(MODEL_TYPE = { 'KMEANS' }, NUM_CLUSTERS = int64_value, KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' }, KMEANS_INIT_COL = string_value, DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' }, STANDARDIZE_FEATURES = { TRUE | FALSE }, MAX_ITERATIONS = int64_value, EARLY_STOP = { TRUE | FALSE }, MIN_REL_PROGRESS = float64_value, WARM_START = { TRUE | FALSE } )];
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
model_name
is the name of the model you're creating or replacing. The model
name must be unique per dataset: no other model or table can have the same name.
The model name must follow the same naming rules as a BigQuery
table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name
is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]` ; for example, `myproject.mydataset.mymodel`.
model_option_list
In the model_option_list
, the model_type
option is required. All others
are optional.
CREATE MODEL
supports the following options:
MODEL_TYPE
Syntax
MODEL_TYPE = { 'KMEANS' }
Description
Specify the model type. This option is required.
Arguments
'KMEANS'
K-means clustering for data segmentation; for example,
identifying customer segments. K-means is an unsupervised learning technique, so
model training does not require labels nor split data for training or
evaluation.
NUM_CLUSTERS
Syntax
NUM_CLUSTERS = int64_value
Description
For a k-means model, the number of clusters to identify in the input data.
Arguments
int64_value
is an INT64
. Allowed values are 2-100. The default
value is log10(n)
, where n
is the number of training
examples.
KMEANS_INIT_METHOD
Syntax
KMEANS_INIT_METHOD = { 'RANDOM' | 'KMEANS++' | 'CUSTOM' }
Description
For a k-means model, the method of initializing the clusters.
To use the same centroids in repeated CREATE MODEL
queries, specify the option
'CUSTOM'
.
Arguments
'RANDOM'
: Initializes the centroids by randomly selecting
NUM_CLUSTERS
data points from the input data.
'KMEANS++'
: Initializes NUM_CLUSTERS
centroids by using
the KMEANS++
algorithm. 'KMEANS++'
trains a better model than 'RANDOM'
cluster
initialization.
'CUSTOM'
: Initializes the centroids using a provided column of type BOOL
.
This column is specified by the option 'KMEANS_INIT_COL'
.
BigQuery ML uses the rows with a value of TRUE
as the initial centroids.
When this option is present and the values in 'KMEANS_INIT_COL'
are constant,
repeated CREATE MODEL
queries use the same centroids.
The default value is 'RANDOM'
KMEANS_INIT_COL
Syntax
KMEANS_INIT_COL = string_value
Description
For a k-means model, identifies the column used to initialize the centroids.
If this column contains a value of TRUE
for a given row, then BigQuery ML
uses that row as an initial centroid.
This option can only be specified when
'KMEANS_INIT_METHOD'
has the value 'CUSTOM'
. The
corresponding column must be of type BOOL
. Model option
NUM_CLUSTERS
must be present in the query and its value must
equal the total number of TRUE
rows in this column. BigQuery ML cannot
use this column as a feature and excludes it from features automatically.
Arguments
string_value
is a STRING
.
DISTANCE_TYPE
Syntax
DISTANCE_TYPE = { 'EUCLIDEAN' | 'COSINE' }
Description
For a k-means model, the type of metric to compute the distance between two points.
Arguments
Accepts the following values:
'EUCLIDEAN'
Use the following equation to calculate the distance between point
x
and y
:
'COSINE'
Use the following equation calculates the distance:
where \( \lVert x\rVert_{2} \) represents the L2 norm for x
.
The default value is 'EUCLIDEAN'
.
STANDARDIZE_FEATURES
Syntax
STANDARDIZE_FEATURES = { TRUE | FALSE }
Description
For a kmeans
model, whether to
standardize numerical
features.
Arguments
Accepts a BOOL
. The default value is TRUE
.
MAX_ITERATIONS
Syntax
MAX_ITERATIONS = int64_value
Description
The maximum number of training iterations or steps.
Arguments
int64_value
is an INT64
. The default value is 20.
EARLY_STOP
Syntax
EARLY_STOP = { TRUE | FALSE }
Description
Whether training should stop after the first iteration in which the relative
loss improvement is less than the value specified for
MIN_REL_PROGRESS
.
Arguments
The value is a BOOL
. The default value is TRUE
.
MIN_REL_PROGRESS
Syntax
MIN_REL_PROGRESS = float64_value
Description
The minimum relative loss improvement that is necessary to continue training
when EARLY_STOP
is set to true. For example, a value of 0.01 specifies
that each iteration must reduce the loss by 1% for training to continue.
Arguments
float64_value
is a FLOAT64
. The default value is 0.01.
WARM_START
Syntax
WARM_START = { TRUE | FALSE }
Description
Retrain a model with new training data, new model options, or both. Unless explicitly overridden, the initial options used to train the model are used for the warm start run.
In a warm start run, the iteration numbers are reset to start from zero. The
TRAINING_RUN
number or the TIMESTAMP
columns can be
used to distinguish the warm start run from the original run.
The value of the MODEL_TYPE
and the training data schema must remain constant
in a warm start models retrain.
Arguments
Accepts a BOOL
. The default value is FALSE
.
query_statement
The AS query_statement
clause specifies the GoogleSQL query that is used to
generate the training data. See the
GoogleSQL Query Syntax
page for the supported SQL syntax of the query_statement
clause.
CREATE MODEL
examples
The following examples create models named mymodel
in mydataset
in your
default project.
Train a k-means model
This example creates a k-means model with four clusters using the default
distance_type
value of euclidean_distance
.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='KMEANS', NUM_CLUSTERS=4 ) AS SELECT * FROM `mydataset.mytable`
Train a k-means model with random cluster initialization method.
This example creates a k-means model with three clusters using the random cluster initialization method.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='KMEANS', NUM_CLUSTERS=3, KMEANS_INIT_METHOD='RANDOM') AS SELECT * FROM `mydataset.mytable`
Train a k-means model with custom cluster initialization method
This example creates a k-means model with three clusters using the custom
cluster initialization method. init_col
identifies the column of type BOOL
that contains the values which specify whether a given row is an initial
centroid. This column should only contain three rows with the value TRUE
.
CREATE MODEL `mydataset.mymodel` OPTIONS ( MODEL_TYPE='KMEANS', NUM_CLUSTERS=3, KMEANS_INIT_METHOD='CUSTOM', KMEANS_INIT_COL='init_col') AS SELECT init_col, features FROM