The CREATE MODEL statement for the ARIMA_PLUS_XREG
model
CREATE MODEL
statement for multivariate time series models
To create multivariate time series models in BigQuery, use the
BigQuery ML CREATE MODEL
statement and specify MODEL_TYPE
to be
'ARIMA_PLUS_XREG'
.
For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
BigQuery ML time series modeling pipeline
The multivariate time series model ARIMA_PLUS_XREG
is an ARIMA_PLUS
model with linear external
regressors. The model diagram is as follows:
The details of the ARIMA_PLUS Pipeline
in the above diagram is shown in BigQuery ML time series modeling pipeline.
The ARIMA_PLUS
pipeline for the BigQuery ML time series includes the
following functionalities:
- Infer the data frequency of the time series.
- Handle irregular time intervals.
- Handle duplicated timestamps by taking the mean value.
- Interpolate missing data using local linear interpolation.
- Detect and clean spike and dip outliers.
- Detect and adjust abrupt step (level) changes.
- Detect and adjust holiday effect.
- Detect multiple seasonal patterns within a single time series via Seasonal and Trend decomposition using Loess (STL), and extrapolate seasonality via double exponential smoothing (ETS).
- Detect and model the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name OPTIONS(MODEL_TYPE = 'ARIMA_PLUS_XREG' [, TIME_SERIES_TIMESTAMP_COL = string_value ] [, TIME_SERIES_DATA_COL = string_value ] [, HORIZON = int64_value ] [, AUTO_ARIMA = { TRUE | FALSE } ] [, AUTO_ARIMA_MAX_ORDER = int64_value ] [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ] [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ] [, INCLUDE_DRIFT = { TRUE | FALSE } ] [, HOLIDAY_REGION = string_value | string_array ] [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ] [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ] [, TIME_SERIES_LENGTH_FRACTION = float64_value ] [, MIN_TIME_SERIES_LENGTH = int64_value ] [, MAX_TIME_SERIES_LENGTH = int64_value ] [, TREND_SMOOTHING_WINDOW_SIZE = int64_value ]) AS query_statement
CREATE MODEL
Creates a new BigQuery ML model in the specified dataset. If the model
name exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.
model_name
model_name
is the name of the BigQuery ML model that you're creating or
replacing. The model name must be unique per dataset: no other model or table
can have the same name. The model name must follow the same naming rules as a
BigQuery table. A model name can contain the following:
- Up to 1,024 characters
- Letters of either case, numbers, and underscores
model_name
is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example:
`myproject.mydataset.mymodel`
CREATE MODEL
supports the following options:
MODEL_TYPE
Syntax
MODEL_TYPE = 'ARIMA_PLUS_XREG'
Description
Specifies the model type. To create a multivariate time series model, set
model_type
to 'ARIMA_PLUS_XREG'
.
model_option_list
In the model_option_list
, the options that are always required include
model_type
, time_series_timestamp_col
, time_series_data_col
. Other options
are only required in certain scenarios. See more details below.
Time series models support the following options:
TIME_SERIES_TIMESTAMP_COL
Syntax
TIME_SERIES_TIMESTAMP_COL = string_value
Description
The timestamp column name for time series models.
Arguments
string_value
is a 'STRING'
.
TIME_SERIES_DATA_COL
Syntax
TIME_SERIES_DATA_COL = string_value
Description
The data column name for time series models.
Arguments
string_value
is a 'STRING'
.
HORIZON
Syntax
HORIZON = int64_value
Description
The number of time points to forecast.
Arguments
The value is a INT64
. The default value is 1000. The maximum value is 10,000.
AUTO_ARIMA
Syntax
AUTO_ARIMA = { TRUE | FALSE }
Description
Whether the training process should use auto.ARIMA or not. If true, training
will automatically find the best non-seasonal order (i.e., the p, d, q tuple)
and decide whether or not to include a linear drift term when d is 1. If false,
you must specify the NON_SEASONAL_ORDER
tuple in the query.
Arguments
The value is a BOOL
. The default value is TRUE
.
AUTO_ARIMA_MAX_ORDER
Syntax
AUTO_ARIMA_MAX_ORDER = int64_value
Description
The maximum value for the sum of non-sesonal p and q. It controls the parameter
search space in the auto.ARIMA algorithm. Currently, the allowed values are
(1, 2, 3, 4, 5). As a reference, for each value there are (3, 6, 10, 15, 21) candidate
models to evaluate if non-seasonal d is determined to be 0 or 2. If non-seasonal
d is determined to be 1, the number of candidate models to evaluate doubles as
there is an additional drift term to consider for all the existing candidate
models. This option is disabled when AUTO_ARIMA
is set to false.
Arguments
The value is a INT64
. The default value is 5. The minimum value is 1 and the
maximum value is 5.
NON_SEASONAL_ORDER
Syntax
NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value)
Description
The tuple of non-seasonal p, d, q for the ARIMA_PLUS
model. There are no
default values and you must specify all of them. You must explicitly specify
auto_arima to false to use this option. Currently, p and q are restricted to
[0, 1, 2, 3, 4, 5] and d is restricted to [0, 1, 2].
Arguments
(int64_value, int64_value, int64_value)
is a tuple of
three 'INT64'
.
DATA_FREQUENCY
Syntax
DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }
Description
The data frequency of the input time series. The finest supported granularity is
'PER_MINUTE'
.
Arguments
Accepts the following values:
'AUTO_FREQUENCY'
: the training process automatically infers the data
frequency, which can be one of the values listed below.
'PER_MINUTE'
: per-minute time series
'HOURLY'
: hourly time series
'DAILY'
: daily time series
'WEEKLY'
: weekly time series
'MONTHLY'
: monthly time series
'QUARTERLY'
: querterly time series
'YEARLY'
: yearly time series
The default value is 'AUTO_FREQUENCY'
.
INCLUDE_DRIFT
Syntax
INCLUDE_DRIFT = { TRUE | FALSE }
Description
Whether the underlying ARIMA_PLUS
model includes a linear drift term
or not. The drift term is applicable when non-seasonal d is 1.
When auto-arima is set to false, this argument is default to false. It can be set to true only when non-seasonal d is 1, otherwise it will return an invalid query error.
When auto-arima is set to true, it will automatically decide whether or not to include a linear drift term. Therefore, this option is disabled for auto-ARIMA.
Arguments
The value is a BOOL
. The default value is FALSE
for auto_arima is disabled.
HOLIDAY_REGION
Syntax
HOLIDAY_REGION = string_value | string_array
Description
The geographical region based on which the holiday effect is applied in modeling. By default, holiday effect modeling is disabled. To turn it on, specify the holiday region using this option. The value can be a single region string or a list of region strings. If you include more than one region string, the union of the holidays in all the provided regions will be taken into modeling.
Arguments
HOLIDAY_REGION
is a polymorphic option that can be defined by a single
string or an array of strings.
string_value is a type
STRING
.For example:
HOLIDAY_REGION = 'GLOBAL'
string_array is an
ARRAY