The CREATE MODEL statement for PCA models

This document describes the CREATE MODEL statement for creating principal component analysis (PCA) models in BigQuery.

For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(model_option_list)
AS query_statement

model_option_list:
MODEL_TYPE = { 'PCA' },
    { NUM_PRINCIPAL_COMPONENTS = int64_value | PCA_EXPLAINED_VARIANCE_RATIO = float64_value }
    [, SCALE_FEATURES = { TRUE | FALSE } ]
    [, PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' } ]

CREATE MODEL

Creates and trains a new model in the specified dataset. If a model with the same name already exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model doesn't exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model, replacing any existing model with the same name in the specified dataset.

model_name

The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.
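
As an illustration, the following sketch creates a PCA model under a fully qualified model name. The project, dataset, and table names (myproject, mydataset, mytable) and the choice of three components are placeholders assumed for this example, not values prescribed by this document.

```sql
-- Hypothetical names: replace myproject, mydataset, mymodel, and mytable
-- with your own project, dataset, model, and training table.
CREATE MODEL IF NOT EXISTS `myproject.mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 3  -- arbitrary illustrative value
  ) AS
SELECT
  *
FROM
  `myproject.mydataset.mytable`;
```

Because the statement uses IF NOT EXISTS, it trains a model only when no model named mymodel already exists in mydataset.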

MODEL_TYPE

Syntax

MODEL_TYPE = { 'PCA' }

Description

Specify the model type. This option is required.

Arguments

Principal component analysis computes principal components and uses them to perform a change of basis on the data. This approach is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components. This lets the model obtain lower-dimensional data while preserving as much of the data's variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data.
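
As a sketch of that last definition, assume the n input rows are stacked into a matrix X whose columns have already been centered to zero mean. The symbols X, n, and w are notation introduced here for illustration and are not part of the BigQuery syntax. The first principal component direction can then be written as:

```latex
w_1 = \underset{\lVert w \rVert_2 = 1}{\arg\max}\; \operatorname{Var}(Xw)
    = \underset{\lVert w \rVert_2 = 1}{\arg\max}\; \frac{1}{n-1}\, w^\top X^\top X\, w
```

Each subsequent principal component maximizes the same quantity subject to being orthogonal to the components before it.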

PCA is an unsupervised learning technique, so model training doesn't require either labels or input data that is split into sets for training and evaluation.

NUM_PRINCIPAL_COMPONENTS

Syntax

NUM_PRINCIPAL_COMPONENTS = int64_value

Description

The number of principal components to keep.

You must specify either NUM_PRINCIPAL_COMPONENTS or PCA_EXPLAINED_VARIANCE_RATIO, but not both.

Arguments

An INT64 value. This value can't exceed the total number of input rows, and it can't exceed the total feature cardinality after the categorical features are one-hot encoded.

PCA_EXPLAINED_VARIANCE_RATIO

Syntax

PCA_EXPLAINED_VARIANCE_RATIO = float64_value

Description

The ratio for the explained variance. The number of principal components is selected so that the fraction of the total variance that they explain is greater than the ratio specified by this argument.

You must specify either PCA_EXPLAINED_VARIANCE_RATIO or NUM_PRINCIPAL_COMPONENTS, but not both.

Arguments

A FLOAT64 value in the range (0, 1).
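
The following minimal sketch shows this option in place of NUM_PRINCIPAL_COMPONENTS. The model and table names and the ratio 0.9 are assumptions chosen for illustration only.

```sql
-- Hypothetical names and ratio; adjust them to your own data.
CREATE OR REPLACE MODEL `mydataset.mymodel`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    -- Keep enough components to explain more than 90% of the variance.
    -- NUM_PRINCIPAL_COMPONENTS must not be set in the same statement.
    PCA_EXPLAINED_VARIANCE_RATIO = 0.9
  ) AS
SELECT
  *
FROM
  `mydataset.mytable`;
```

With this form, the number of components is not fixed in advance: it depends on how the variance of the training data is distributed across the components.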

SCALE_FEATURES

Syntax

SCALE_FEATURES = { TRUE | FALSE }

Description

Determines whether to scale the numerical features to unit variance. Regardless of this setting, the input numerical features are always centered to have zero mean. Separately, categorical features are one-hot encoded.

Arguments

A BOOL value. The default value is TRUE.
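
As a sketch of overriding the default, the following statement turns off unit-variance scaling. The model and table names and the component count are hypothetical placeholders.

```sql
-- Hypothetical names; NUM_PRINCIPAL_COMPONENTS = 2 is an arbitrary choice.
CREATE OR REPLACE MODEL `mydataset.mymodel_unscaled`
  OPTIONS (
    MODEL_TYPE = 'PCA',
    NUM_PRINCIPAL_COMPONENTS = 2,
    -- Numerical features are still centered to zero mean,
    -- but they are not rescaled to unit variance.
    SCALE_FEATURES = FALSE
  ) AS
SELECT
  *
FROM
  `mydataset.mytable`;
```

Turning scaling off may be reasonable when the numerical features already share comparable units. That is a general PCA consideration rather than guidance stated in this document.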

PCA_SOLVER

Syntax

PCA_SOLVER = { 'FULL' | 'RANDOMIZED' | 'AUTO' }

Description

The solver to use to calculate the principal components.

Arguments

This option accepts the following values:

  • FULL: Run a full eigendecomposition algorithm. In this case, the maximum allowed feature cardinality after one-hot encoding the categorical features is dynamically estimated. The primary factor that determines this maximum is the length of the feature names; the maximum isn't affected by the values of the NUM_PRINCIPAL_COMPONENTS or PCA_EXPLAINED_VARIANCE_RATIO options. As a guideline, the maximum allowed feature cardinality typically falls between 1,000 and 1,500. If the total feature cardinality of the input data exceeds the estimated maximum, then an invalid query error is returned.
  • RANDOMIZED: Run a randomized PCA algorithm. In this case, the maximum allowed fea