The CREATE MODEL statement for the ARIMA_PLUS_XREG model

Stay organized with collections Save and categorize content based on your preferences.

CREATE MODEL statement for multivariate time series models

To create multivariate time series models in BigQuery, use the BigQuery ML CREATE MODEL statement and specify MODEL_TYPE to be 'ARIMA_PLUS_XREG'.

For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.

BigQuery ML time series modeling pipeline

The multivariate time series model ARIMA_PLUS_XREG is an ARIMA_PLUS model with linear external regressors. The model diagram is as follows:

ARIMA_PLUS_XREG_DIAGRAM

The details of the ARIMA_PLUS Pipeline in the above diagram is shown in BigQuery ML time series modeling pipeline

The ARIMA_PLUS pipeline for the BigQuery ML time series includes the following functionalities:

  • Infer the data frequency of the time series.
  • Handle irregular time intervals.
  • Handle duplicated timestamps by taking the mean value.
  • Interpolate missing data using local linear interpolation.
  • Detect and clean spike and dip outliers.
  • Detect and adjust abrupt step (level) changes.
  • Detect and adjust holiday effect.
  • Detect multiple seasonal patterns within a single time series via Seasonal and Trend decomposition using Loess (STL), and extrapolate seasonality via double exponential smoothing (ETS).
  • Detect and model the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(MODEL_TYPE = 'ARIMA_PLUS_XREG'
  [, TIME_SERIES_TIMESTAMP_COL = string_value ]
  [, TIME_SERIES_DATA_COL = string_value ]
  [, HORIZON = int64_value ]
  [, AUTO_ARIMA = { TRUE | FALSE } ]
  [, AUTO_ARIMA_MAX_ORDER = int64_value ]
  [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
  [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]
  [, INCLUDE_DRIFT = { TRUE | FALSE } ]
  [, HOLIDAY_REGION = string_value | string_array ]
  [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
  [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
  [, TIME_SERIES_LENGTH_FRACTION = float64_value ]
  [, MIN_TIME_SERIES_LENGTH = int64_value ]
  [, MAX_TIME_SERIES_LENGTH = int64_value ]
  [, TREND_SMOOTHING_WINDOW_SIZE = int64_value ])
AS query_statement

CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.

model_name

model_name is the name of the BigQuery ML model that you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:

  • Up to 1,024 characters
  • Letters of either case, numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example:

`myproject.mydataset.mymodel`

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = 'ARIMA_PLUS_XREG'

Description

Specifies the model type. To create a multivariate time series model, set model_type to 'ARIMA_PLUS_XREG'.

model_option_list

In the model_option_list, the options that are always required include model_type, time_series_timestamp_col, time_series_data_col. Other options are only required in certain scenarios. See more details below.

Time series models support the following options:

TIME_SERIES_TIMESTAMP_COL

Syntax

TIME_SERIES_TIMESTAMP_COL = string_value

Description

The timestamp column name for time series models.

Arguments

string_value is a 'STRING'.

TIME_SERIES_DATA_COL

Syntax

TIME_SERIES_DATA_COL = string_value

Description

The data column name for time series models.

Arguments

string_value is a 'STRING'.

HORIZON

Syntax

HORIZON = int64_value

Description

The number of time points to forecast.

Arguments

The value is a INT64. The default value is 1000. The maximum value is 10,000.

AUTO_ARIMA

Syntax

AUTO_ARIMA = { TRUE | FALSE }

Description

Whether the training process should use auto.ARIMA or not. If true, training will automatically find the best non-seasonal order (i.e., the p, d, q tuple) and decide whether or not to include a linear drift term when d is 1. If false, you must specify the NON_SEASONAL_ORDER tuple in the query.

Arguments

The value is a BOOL. The default value is TRUE.

AUTO_ARIMA_MAX_ORDER

Syntax

AUTO_ARIMA_MAX_ORDER = int64_value

Description

The maximum value for the sum of non-sesonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. Currently, the allowed values are (1, 2, 3, 4, 5). As a reference, for each value there are (3, 6, 10, 15, 21) candidate models to evaluate if non-seasonal d is determined to be 0 or 2. If non-seasonal d is determined to be 1, the number of candidate models to evaluate doubles as there is an additional drift term to consider for all the existing candidate models. This option is disabled when AUTO_ARIMA is set to false.

Arguments

The value is a INT64. The default value is 5. The minimum value is 1 and the maximum value is 5.

NON_SEASONAL_ORDER

Syntax

NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value)

Description

The tuple of non-seasonal p, d, q for the ARIMA_PLUS model. There are no default values and you must specify all of them. You must explicitly specify auto_arima to false to use this option. Currently, p and q are restricted to [0, 1, 2, 3, 4, 5] and d is restricted to [0, 1, 2].

Arguments

(int64_value, int64_value, int64_value) is a tuple of three 'INT64'.

DATA_FREQUENCY

Syntax

DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }

Description

The data frequency of the input time series. The finest supported granularity is 'PER_MINUTE'.

Arguments

Accepts the following values:

'AUTO_FREQUENCY': the training process automatically infers the data frequency, which can be one of the values listed below.

'PER_MINUTE': per-minute time series

'HOURLY': hourly time series

'DAILY': daily time series

'WEEKLY': weekly time series

'MONTHLY': monthly time series

'QUARTERLY': querterly time series

'YEARLY': yearly time series

The default value is 'AUTO_FREQUENCY'.

INCLUDE_DRIFT

Syntax

INCLUDE_DRIFT = { TRUE | FALSE }

Description

Whether the underlying ARIMA_PLUS model includes a linear drift term or not. The drift term is applicable when non-seasonal d is 1.

  • When auto-arima is set to false, this argument is default to false. It can be set to true only when non-seasonal d is 1, otherwise it will return an invalid query error.

  • When auto-arima is set to true, it will automatically decide whether or not to include a linear drift term. Therefore, this option is disabled for auto-ARIMA.

Arguments

The value is a BOOL. The default value is FALSE for auto_arima is disabled.

HOLIDAY_REGION

Syntax

HOLIDAY_REGION = string_value | string_array

Description

The geographical region based on which the holiday effect is applied in modeling. By default, holiday effect modeling is disabled. To turn it on, specify the holiday region using this option. The value can be a single region string or a list of region strings. If you include more than one region string, the union of the holidays in all the provided regions will be taken into modeling.

Arguments

HOLIDAY_REGION is a polymorphic option that can be defined by a single string or an array of strings.

  • string_value is a type STRING.

    For example:

    HOLIDAY_REGION = 'GLOBAL'
    
  • string_array is an ARRAY of type STRINGs, where each STRING is one of the following supported region strings.

    For example:

    HOLIDAY_REGION = ['US', 'UK']
    

Accepts the following values:

Top level: global

  • 'GLOBAL'

Second level: continental regions

  • 'NA': North America
  • 'JAPAC': Japan and Asia Pacific
  • 'EMEA': Europe, the Middle East and Africa
  • 'LAC': Latin America and the Caribbean

Third level: countries/regions

  • 'AE': United Arab Emirates
  • 'AR': Argentina
  • 'AT': Austria
  • 'AU': Australia
  • 'BE': Belgium
  • 'BR': Brazil
  • 'CA': Canada
  • 'CH': Switzerland
  • 'CL': Chile
  • 'CN': China
  • 'CO': Colombia
  • 'CZ': Czechia
  • 'DE': Germany
  • 'DK': Denmark
  • 'DZ': Algeria
  • 'EC': Ecuador
  • 'EE': Estonia
  • 'EG': Egypt
  • 'ES': Spain
  • 'FI': Finland
  • 'FR': France
  • 'GB': United Kingdom
  • 'GR': Greece
  • 'HK': Hong Kong
  • 'HU': Hungary
  • 'ID': Indonesia
  • 'IE': Ireland
  • 'IL': Israel
  • 'IN': India
  • 'IR': Iran
  • 'IT': Italy
  • 'JP': Japan
  • 'KR': South Korea
  • 'LV': Latvia
  • 'MA': Morocco
  • 'MX': Mexico
  • 'MY': Malaysia
  • 'NG': Nigeria
  • 'NL': Netherlands
  • 'NO': Norway
  • 'NZ': New Zealand
  • 'PE': Peru
  • 'PH': Philippines
  • 'PK': Pakistan
  • 'PL': Poland
  • 'PT': Portugal
  • 'RO': Romania
  • 'RS': Serbia
  • 'RU': Russia
  • 'SA': Saudi Arabia
  • 'SE': Sweden
  • 'SG': Singapore
  • 'SI': Slovenia
  • 'SK': Slovakia
  • 'TH': Thailand
  • 'TR': Turkey
  • 'TW': Taiwan
  • 'UA': Ukraine
  • 'US': United States
  • 'VE': Venezuela
  • 'VN': Vietnam
  • 'ZA': South Africa

Holiday data

Below is the holiday data we used in the US region for the year 2022-2023.

  1. region specifies which geographic region the holiday applies, as in the list above.
  2. holiday_name contains the name of the holiday.
  3. primary_date specifies the date of the holiday. For holidays that span multiple days, this is usually the first day of the holiday.
  4. preholiday_days describes the number of days the holiday effect starts before the primary_date.
  5. postholiday_days describes the number of days the holiday effect ends after the primary_date.
region holiday_name primary_date preholiday_days postholiday_days
US Christmas 2022-12-25 10 1
US Christmas 2023-12-25 10 1
US MothersDay 2022-05-08 6 1
US MothersDay 2023-05-14 6 1
US NewYear 2022-01-01 5 3
US NewYear 2023-01-01 5 3
US DaylightSavingEnd 2022-11-06 1 1
US DaylightSavingEnd 2023-11-05 1 1
US DaylightSavingStart 2022-03-13 1 1
US DaylightSavingStart 2023-03-12 1 1
US Thanksgiving 2022-11-24 3 5
US Thanksgiving 2023-11-23 3 5
US Valentine 2022-02-14 3 1
US Valentine 2023-02-14 3 1
US EasterMonday 2022-04-18 8 1
US EasterMonday 2023-04-10 8 1
US Halloween 2022-10-31 1 1
US Halloween 2023-10-31 1 1
US StPatrickDay 2022-03-17 1 1
US StPatrickDay 2023-03-17 1 1
US ColumbusDay 2022-10-10 1 1
US ColumbusDay 2023-10-09 1 1
US IndependenceDay 2022-07-04 1 1
US IndependenceDay 2023-07-04 1 1
US Juneteenth 2022-06-19 1 1
US Juneteenth 2023-06-19 1 1
US LaborDay 2022-09-05 1 1
US LaborDay 2023-09-04 1 1
US MemorialDay 2022-05-30 1 1
US MemorialDay 2023-05-29 1 1
US MLKDay 2022-01-17 1 1
US MLKDay 2023-01-16 1 1
US PresidentDay 2022-02-21 1 1
US PresidentDay 2023-02-20 1 1
US Superbowl 2022-02-13 1 1
US Superbowl 2023-02-05 1 1
US VeteranDay 2022-11-11 1 1
US VeteranDay 2023-11-11 1 1

CLEAN_SPIKES_AND_DIPS

Syntax

CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE }

Description

Whether or not to perform automatic spikes and dips detection and cleanup in the underlying ARIMA_PLUS model training pipeline. The spikes and dips are replaced with local linear interpolated values when they are detected.

Arguments

The value is a BOOL. The default value is TRUE.

ADJUST_STEP_CHANGES

Syntax

ADJUST_STEP_CHANGES = { TRUE | FALSE }

Description

Whether or not to perform automatic step change detection and adjustment in the ARIMA_PLUS model training pipeline.

Arguments

The value is a BOOL. The default value is TRUE.

TIME_SERIES_LENGTH_FRACTION

Syntax

TIME_SERIES_LENGTH_FRACTION = float64_value

Description

The fraction of the interpolated length of the time series that is used to model the time series trend component. All of the time points of the time series are used to model the non-trend component. For example, if the time series has 100 time points, then specifying a TIME_SERIES_LENGTH_FRACTION of 0.5 uses the most recent 50 time points for trend modeling. This training option accelerates modeling training without sacrificing much forecasting accuracy.

Arguments

float64_value is a FLOAT64. The value must be within (0, 1). The default behavior is using all the points in the time series.

MIN_TIME_SERIES_LENGTH

Syntax

MIN_TIME_SERIES_LENGTH = int64_value

Description

The minimum number of time points in a time series that are used in modeling the trend component of the time series. MIN_TIME_SERIES_LENGTH requires TIME_SERIES_LENGTH_FRACTION is present. This training option prevents too few time points from being used in trend modeling when TIME_SERIES_LENGTH_FRACTION is used.

Arguments

int64_value is an INT64. The default value is 20. The minimum value is 4.

MAX_TIME_SERIES_LENGTH

Syntax

MAX_TIME_SERIES_LENGTH = int64_value

Description

The maximum number of time points in a time series that can be used in modeling the trend component of the time series.

Arguments

int64_value is an INT64. It doesn't have a default value and the minimum value is 4. It's recommended to try 30 as a starting value.

TREND_SMOOTHING_WINDOW_SIZE

Syntax

TREND_SMOOTHING_WINDOW_SIZE = int64_value

Description

Smoothing window size for the trend component. When a positive value is specified, a center moving average smoothing is applied on the history trend. When the smoothing window is out of the boundary at the beginning or the end of the trend, the first element or the last element is padded to fill the smoothing window before the average is applied.

Arguments

int64_value is a type INT64. There is no default value. A positive value must be specified to smooth the trend.

query_statement

The AS query_statement clause specifies the GoogleSQL query that is used to generate the training data. For information about the supported SQL syntax of the query_statement clause, see GoogleSQL query syntax.

For multivariate time series models, the query_statement is expected to contain time_series_timestamp_col, time_series_data_col and feature columns.

Supported inputs

The CREATE MODEL statement supports the following data types for the time series input columns.

Supported data types for time series model inputs

BigQuery ML supports different GoogleSQL data types for the input columns for time series models. Supported data types for each respective column include:

time series input column Supported types
time_series_timestamp_col TIMESTAMP
DATE
DATETIME
time_series_data_col INT64
NUMERIC
BIGNUMERIC
FLOAT64

Known limitations

CREATE MODEL statements for time series models must comply with the following rules:

  • For the input time series, the maximum length is 1,000,000 time points and the minimum length is 3 time points.
  • The maximum time points to forecast, which is specified using horizon, is 10,000.
  • The maximum cardinality of training features is 10,000.
  • Holiday effect modeling is effective only for approximately 5 years.
  • The BigQuery ML training option warm_start is not supported by ARIMA_PLUS_XREG models.

CREATE MODEL examples

The following example creates models named mymodel in mydataset in your default project.

Training a time series model to forecast a single time series

CREATE MODEL `project_id.mydataset.mymodel`
 OPTIONS(MODEL_TYPE='ARIMA_PLUS_XREG',
         time_series_timestamp_col='date',
         time_series_data_col='transaction') AS
SELECT
  date,
  transaction,
  feature1,
  feature2
FROM
  `mydataset.mytable`

Train a time series model using a fraction of the time points for speed-up

CREATE MODEL `project_id.mydataset.mymodel`
 OPTIONS(MODEL_TYPE='ARIMA_PLUS_XREG',
         time_series_timestamp_col='date',
         time_series_data_col='transaction',
         time_series_length_fraction=0.5,
         min_time_series_length=30) AS
SELECT
  date,
  transaction,
  feature1,
  feature2
FROM
  `mydataset.mytable`

What's next