The CREATE MODEL statement for the ARIMA_PLUS_XREG model

CREATE MODEL statement for multivariate time series models

To create multivariate time series models in BigQuery, use the BigQuery ML CREATE MODEL statement and specify MODEL_TYPE to be 'ARIMA_PLUS_XREG'.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

BigQuery ML time series modeling pipeline

The multivariate time series model ARIMA_PLUS_XREG is an ARIMA_PLUS model with linear external regressors. The model diagram is as follows:

ARIMA_PLUS_XREG_DIAGRAM

The details of the ARIMA_PLUS Pipeline in the above diagram is shown in BigQuery ML time series modeling pipeline.

The ARIMA_PLUS pipeline for the BigQuery ML time series includes the following functionalities:

  • Infer the data frequency of the time series.
  • Handle irregular time intervals.
  • Handle duplicated timestamps by taking the mean value.
  • Interpolate missing data using local linear interpolation.
  • Detect and clean spike and dip outliers.
  • Detect and adjust abrupt step (level) changes.
  • Detect and adjust holiday effect.
  • Detect multiple seasonal patterns within a single time series via Seasonal and Trend decomposition using Loess (STL), and extrapolate seasonality via double exponential smoothing (ETS).
  • Detect and model the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(MODEL_TYPE = 'ARIMA_PLUS_XREG'
  [, TIME_SERIES_TIMESTAMP_COL = string_value ]
  [, TIME_SERIES_DATA_COL = string_value ]
  [, HORIZON = int64_value ]
  [, AUTO_ARIMA = { TRUE | FALSE } ]
  [, AUTO_ARIMA_MAX_ORDER = int64_value ]
  [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
  [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]
  [, INCLUDE_DRIFT = { TRUE | FALSE } ]
  [, HOLIDAY_REGION = string_value | string_array ]
  [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
  [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
  [, TIME_SERIES_LENGTH_FRACTION = float64_value ]
  [, MIN_TIME_SERIES_LENGTH = int64_value ]
  [, MAX_TIME_SERIES_LENGTH = int64_value ]
  [, TREND_SMOOTHING_WINDOW_SIZE = int64_value ])
AS query_statement

CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.

model_name

model_name is the name of the BigQuery ML model that you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:

  • Up to 1,024 characters
  • Letters of either case, numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example:

`myproject.mydataset.mymodel`

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = 'ARIMA_PLUS_XREG'

Description

Specifies the model type. To create a multivariate time series model, set model_type to 'ARIMA_PLUS_XREG'.

model_option_list

In the model_option_list, the options that are always required include model_type, time_series_timestamp_col, time_series_data_col. Other options are only required in certain scenarios. See more details below.

Time series models support the following options:

TIME_SERIES_TIMESTAMP_COL

Syntax

TIME_SERIES_TIMESTAMP_COL = string_value

Description

The timestamp column name for time series models.

Arguments

string_value is a 'STRING'.

TIME_SERIES_DATA_COL

Syntax

TIME_SERIES_DATA_COL = string_value

Description

The data column name for time series models.

Arguments

string_value is a 'STRING'.

HORIZON

Syntax

HORIZON = int64_value

Description

The number of time points to forecast.

Arguments

The value is a INT64. The default value is 1000. The maximum value is 10,000.

AUTO_ARIMA

Syntax

AUTO_ARIMA = { TRUE | FALSE }

Description

Whether the training process should use auto.ARIMA or not. If true, training will automatically find the best non-seasonal order (i.e., the p, d, q tuple) and decide whether or not to include a linear drift term when d is 1. If false, you must specify the NON_SEASONAL_ORDER tuple in the query.

Arguments

The value is a BOOL. The default value is TRUE.

AUTO_ARIMA_MAX_ORDER

Syntax

AUTO_ARIMA_MAX_ORDER = int64_value

Description

The maximum value for the sum of non-sesonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. Currently, the allowed values are (1, 2, 3, 4, 5). As a reference, for each value there are (3, 6, 10, 15, 21) candidate models to evaluate if non-seasonal d is determined to be 0 or 2. If non-seasonal d is determined to be 1, the number of candidate models to evaluate doubles as there is an additional drift term to consider for all the existing candidate models. This option is disabled when AUTO_ARIMA is set to false.

Arguments

The value is a INT64. The default value is 5. The minimum value is 1 and the maximum value is 5.

NON_SEASONAL_ORDER

Syntax

NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value)

Description

The tuple of non-seasonal p, d, q for the ARIMA_PLUS model. There are no default values and you must specify all of them. You must explicitly specify auto_arima to false to use this option. Currently, p and q are restricted to [0, 1, 2, 3, 4, 5] and d is restricted to [0, 1, 2].

Arguments

(int64_value, int64_value, int64_value) is a tuple of three 'INT64'.

DATA_FREQUENCY

Syntax

DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }

Description

The data frequency of the input time series. The finest supported granularity is 'PER_MINUTE'.

Arguments

Accepts the following values:

'AUTO_FREQUENCY': the training process automatically infers the data frequency, which can be one of the values listed below.

'PER_MINUTE': per-minute time series

'HOURLY': hourly time series

'DAILY': daily time series

'WEEKLY': weekly time series

'MONTHLY': monthly time series

'QUARTERLY': querterly time series

'YEARLY': yearly time series

The default value is 'AUTO_FREQUENCY'.

INCLUDE_DRIFT

Syntax

INCLUDE_DRIFT = { TRUE | FALSE }

Description

Whether the underlying ARIMA_PLUS model includes a linear drift term or not. The drift term is applicable when non-seasonal d is 1.

  • When auto-arima is set to false, this argument is default to false. It can be set to true only when non-seasonal d is 1, otherwise it will return an invalid query error.

  • When auto-arima is set to true, it will automatically decide whether or not to include a linear drift term. Therefore, this option is disabled for auto-ARIMA.

Arguments

The value is a BOOL. The default value is FALSE for auto_arima is disabled.

HOLIDAY_REGION

Syntax

HOLIDAY_REGION = string_value | string_array

Description

The geographical region based on which the holiday effect is applied in modeling. By default, holiday effect modeling is disabled. To turn it on, specify the holiday region using this option. The value can be a single region string or a list of region strings. If you include more than one region string, the union of the holidays in all the provided regions will be taken into modeling.

Arguments

HOLIDAY_REGION is a polymorphic option that can be defined by a single string or an array of strings.

  • string_value is a type STRING.

    For example:

    HOLIDAY_REGION = 'GLOBAL'
    
  • string_array is an ARRAY