The CREATE MODEL statement for PCA models
To create a principal component analysis (PCA) model in BigQuery, use the BigQuery ML CREATE MODEL statement with the PCA model type.
For information about the supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, see End-to-end user journey for each model.
CREATE MODEL syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
[OPTIONS(MODEL_TYPE = { 'PCA' },
  NUM_PRINCIPAL_COMPONENTS = int64_value,
  PCA_EXPLAINED_VARIANCE_RATIO = float64_value,
  SCALE_FEATURES = { TRUE | FALSE },
  PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' })];
CREATE MODEL
Creates and trains a new model in the specified dataset. If a model with the same name already exists, CREATE MODEL returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
model_name is the name of the model you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in the following format, including backticks: `[PROJECT_ID].[DATASET].[MODEL]`; for example, `myproject.mydataset.mymodel`.
model_option_list
In the model_option_list, the MODEL_TYPE option is required. All others are optional.
CREATE MODEL supports the following options:
MODEL_TYPE
Syntax
MODEL_TYPE = { 'PCA' }
Description
Specify the model type. This option is required.
Arguments
'PCA'
Principal component analysis (PCA) is the
process of computing the principal components and using them to perform a change
of basis on the data. It is commonly used for dimensionality reduction by
projecting each data point onto only the first few principal components to
obtain lower-dimensional data while preserving as much of the data's variation
as possible. The first principal component can equivalently be defined as a
direction that maximizes the variance of the projected data.
PCA is an unsupervised learning technique, so model training does not require labels or a data split for training and evaluation.
NUM_PRINCIPAL_COMPONENTS
Syntax
NUM_PRINCIPAL_COMPONENTS = int64_value
Description
Number of principal components to keep.
Arguments
int64_value is an INT64 value. It cannot be larger than the total number of rows or the total feature cardinality (after one-hot encoding the categorical features).
PCA_EXPLAINED_VARIANCE_RATIO
Syntax
PCA_EXPLAINED_VARIANCE_RATIO = float64_value
Description
The number of principal components is selected such that the percentage of variance explained by the principal components is greater than the ratio specified by this argument.
Arguments
float64_value is a FLOAT64 value. The value must be within the range (0, 1).
SCALE_FEATURES
Syntax
SCALE_FEATURES = { TRUE | FALSE }
Description
Whether or not to scale the numerical features to unit variance. Note that the input numerical features are always centered to have zero mean value. Separately, categorical features are one-hot encoded.
Arguments
Accepts a BOOL value. The default value is TRUE.
PCA_SOLVER
Syntax
PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' }
Description
The solver that is used to calculate the principal components.
Arguments
'FULL': Run a full eigendecomposition algorithm. In this case, the maximum allowed feature cardinality (after one-hot encoding the categorical features) is dynamically estimated. The primary factor that determines this value is the length of the feature names; it is unrelated to the values of NUM_PRINCIPAL_COMPONENTS or PCA_EXPLAINED_VARIANCE_RATIO. As a guideline, this maximum allowed feature cardinality typically falls between 1,000 and 1,500. If the total feature cardinality of the input data exceeds the estimated maximum value, then an invalid query error is returned.
'RANDOMIZED': Run a randomized PCA algorithm. In this case, the maximum allowed feature cardinality is restricted to 10,000. If the feature cardinality of the input data is less than 10,000, then there is a dynamically determined cap on the number of principal components to compute, resulting from resource constraints.
- If you specify NUM_PRINCIPAL_COMPONENTS, then the value must not be larger than the cap; otherwise, an invalid query error is returned.
- If you specify PCA_EXPLAINED_VARIANCE_RATIO, then all principal components under the cap are computed. If their total explained variance ratio is less than the specified value, then they are all returned; otherwise, a subset is returned.
'AUTO': The solver is selected by a default policy based on the input data. Typically, when the feature cardinality (after one-hot encoding all the categorical features) is less than a threshold, the exact full eigendecomposition is computed; otherwise, randomized PCA is performed. The threshold is dynamically determined but typically falls between 1,000 and 1,500. The number of rows in the input data is not considered when deciding the solver.
The default value is 'AUTO'.
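For example, the following statement selects the randomized solver explicitly. This is an illustrative sketch, not part of the original reference; mydataset.mymodel and mydataset.mytable are placeholder names.

-- Placeholder names; the randomized solver allows a feature
-- cardinality of up to 10,000 (after one-hot encoding).
CREATE MODEL `mydataset.mymodel`
OPTIONS (
  MODEL_TYPE='PCA',
  NUM_PRINCIPAL_COMPONENTS=10,
  PCA_SOLVER='RANDOMIZED'
) AS
SELECT * FROM `mydataset.mytable`;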
query_statement
The AS query_statement clause specifies the GoogleSQL query that is used to generate the training data. See the GoogleSQL Query Syntax page for the supported SQL syntax of the query_statement clause.
CREATE MODEL examples
The following examples create models named mymodel in mydataset in your default project.
Train a PCA model using the NUM_PRINCIPAL_COMPONENTS option
This example creates a PCA model with four principal components.
CREATE MODEL `mydataset.mymodel`
OPTIONS (
  MODEL_TYPE='PCA',
  NUM_PRINCIPAL_COMPONENTS=4
) AS
SELECT * FROM `mydataset.mytable`
Train a PCA model using the PCA_EXPLAINED_VARIANCE_RATIO option
This example creates a PCA model, where the number of principal components is selected such that the percentage of variance explained by them is greater than 0.8.
CREATE MODEL `mydataset.mymodel`
OPTIONS (
  MODEL_TYPE='PCA',
  PCA_EXPLAINED_VARIANCE_RATIO=0.8
) AS
SELECT * FROM `mydataset.mytable`
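After training, you can check how many principal components were kept and how much variance each one explains by querying the model with the ML.PRINCIPAL_COMPONENT_INFO function. This follow-up query is a minimal sketch added for illustration; it is not part of the original example.

-- Inspect per-component explained variance for the trained PCA model.
SELECT *
FROM ML.PRINCIPAL_COMPONENT_INFO(MODEL `mydataset.mymodel`);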
Dimensionality reduction with a PCA model example
The following sample creates the mydataset.iris_pca PCA model with input features.
CREATE MODEL `mydataset.iris_pca`
OPTIONS (
  MODEL_TYPE='PCA',
  NUM_PRINCIPAL_COMPONENTS=2,
  SCALE_FEATURES=FALSE
) AS
SELECT
  sepal_length,
  sepal_width,
  petal_length,
  petal_width
FROM `bigquery-public-data.ml_datasets.iris`;
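To look at the lower-dimensional representation directly, you can run ML.PREDICT against the PCA model on its own. This standalone query is an illustrative sketch, assuming that ML.PREDICT on a PCA model returns the projected principal component columns for each input row.

-- Project the iris features onto the two principal components.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.iris_pca`,
  (
    SELECT sepal_length, sepal_width, petal_length, petal_width
    FROM `bigquery-public-data.ml_datasets.iris`
  )
);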
The following sample uses the mydataset.iris_pca model to transform the input features into a lower-dimensional space, which is then used to train the mydataset.iris_logistic model. mydataset.iris_logistic may be a better ML model if the original input features suffer from the curse of dimensionality.
CREATE MODEL `mydataset.iris_logistic`
OPTIONS (
  MODEL_TYPE='LOGISTIC_REG',
  INPUT_LABEL_COLS=['species']
) AS
SELECT *
FROM ML.PREDICT(
  MODEL mydataset.iris_pca,
  (
    SELECT
      sepal_length,
      sepal_width,
      petal_length,
      petal_width,
      species
    FROM `bigquery-public-data.ml_datasets.iris`
  )
);
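To classify rows with mydataset.iris_logistic, the same PCA transform has to be applied to the features first. One way to do that is to nest the two ML.PREDICT calls; the following query is a sketch under the assumption that the inner call produces exactly the principal component columns the classifier was trained on.

-- Chain the PCA transform and the classifier at prediction time.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.iris_logistic`,
  (
    SELECT *
    FROM ML.PREDICT(
      MODEL mydataset.iris_pca,
      (
        SELECT sepal_length, sepal_width, petal_length, petal_width
        FROM `bigquery-public-data.ml_datasets.iris`
      )
    )
  )
);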