The CREATE MODEL statement for the ARIMA_PLUS_XREG model

This document describes the CREATE MODEL statement for creating multivariate time series models in BigQuery.

Forecasting takes place when you create the model. You can use the ML.FORECAST and ML.EXPLAIN_FORECAST functions to retrieve the forecasting values and compute the prediction intervals.

For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.

Time series modeling pipeline

The multivariate ARIMA_PLUS_XREG time series model is an ARIMA_PLUS model with linear external regressors. The following diagram shows the model pipeline:

ARIMA_PLUS_XREG_DIAGRAM

For a diagram that shows the details of the ARIMA_PLUS Pipeline section in the preceding diagram, see Time series modeling pipeline.

The modeling pipeline for the ARIMA_PLUS time series models performs the following functions:

  • Infer the data frequency of the time series.
  • Handle irregular time intervals.
  • Handle duplicated timestamps by taking the mean value.
  • Interpolate missing data using local linear interpolation.
  • Detect and clean spike and dip outliers.
  • Detect and adjust abrupt step (level) changes.
  • Detect and adjust holiday effect.
  • Detect multiple seasonal patterns within a single time series by using Seasonal and Trend decomposition using Loess (STL), and extrapolate seasonality by using double exponential smoothing (ETS).
  • Detect and model the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The model with the lowest Akaike information criterion (AIC) is selected as the best model.

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(model_option_list)
AS { query_statement |
  (
    training_data AS (query_statement),
    custom_holiday AS (holiday_statement)
  )
}

model_option_list:
MODEL_TYPE = 'ARIMA_PLUS_XREG'
    [, TIME_SERIES_TIMESTAMP_COL = string_value ]
    [, TIME_SERIES_DATA_COL = string_value ]
    [, HORIZON = int64_value ]
    [, AUTO_ARIMA = { TRUE | FALSE } ]
    [, AUTO_ARIMA_MAX_ORDER = int64_value ]
    [, AUTO_ARIMA_MIN_ORDER = int64_value ]
    [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
    [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]
    [, INCLUDE_DRIFT = { TRUE | FALSE } ]
    [, HOLIDAY_REGION = string_value | string_array ]
    [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
    [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
    [, TIME_SERIES_LENGTH_FRACTION = float64_value ]
    [, MIN_TIME_SERIES_LENGTH = int64_value ]
    [, MAX_TIME_SERIES_LENGTH = int64_value ]
    [, TREND_SMOOTHING_WINDOW_SIZE = int64_value ]
    [, L2_REG = float64_value ]

CREATE MODEL

Creates and trains a new model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates and trains a new model only if the model doesn't exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates and trains a model and replaces an existing model with the same name in the specified dataset.

model_name

The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:

  • Contain up to 1,024 characters
  • Contain letters (upper or lower case), numbers, and underscores

model_name is not case-sensitive.

If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example, `myproject.mydataset.mymodel`.

MODEL_TYPE

Syntax

MODEL_TYPE = 'ARIMA_PLUS_XREG'

Description

Specifies the model type. This option is required.

TIME_SERIES_TIMESTAMP_COL

Syntax

TIME_SERIES_TIMESTAMP_COL = string_value

Description

The name of the column that provides the time points used in training the model. The column must be of one of the following data types:

  • TIMESTAMP
  • DATE
  • DATETIME

Arguments

A STRING value.

TIME_SERIES_DATA_COL

Syntax

TIME_SERIES_DATA_COL = string_value

Description

The name of the column that contains the data to forecast. The column must be of one of the following data types:

  • INT64
  • NUMERIC
  • BIGNUMERIC
  • FLOAT64

Arguments

A STRING value.

HORIZON

Syntax

HORIZON = int64_value

Description

The number of time points to forecast.

Arguments

An INT64 value. The default value is 1,000. The maximum value is 10,000.

AUTO_ARIMA

Syntax

AUTO_ARIMA = { TRUE | FALSE }

Description

Determines whether the training process uses auto.ARIMA or not. If TRUE, training automatically finds the best non-seasonal order (that is, the p, d, q tuple) and decides whether or not to include a linear drift term when d is 1. If FALSE, you must specify the NON_SEASONAL_ORDER option.

Arguments

A BOOL value. The default value is TRUE.

AUTO_ARIMA_MAX_ORDER

Syntax

AUTO_ARIMA_MAX_ORDER = int64_value

Description

The maximum value for the sum of non-seasonal p and q. This value determines the parameter search space in the auto.ARIMA algorithm, in combination with the AUTO_ARIMA_MIN_ORDER value. This option is disabled when the AUTO_ARIMA value is FALSE.

Arguments

An INT64 value between 1 and 5, inclusive. The default value is 5.

If non-seasonal d is determined to be 0 or 2, the number of candidate models evaluated for each supported value is as follows:

  • 1: 3 candidate models
  • 2: 6 candidate models
  • 3: 10 candidate models
  • 4: 15 candidate models
  • 5: 21 candidate models

If non-seasonal d is determined to be 1, the number of candidate models to evaluate is doubled, because there's an additional drift term to consider for all of the existing candidate models.

AUTO_ARIMA_MIN_ORDER

Syntax

AUTO_ARIMA_MIN_ORDER = int64_value

Description

The minimum value for the sum of non-seasonal p and q. This value determines the parameter search space in the auto.ARIMA algorithm, in combination with the AUTO_ARIMA_MAX_ORDER value. Setting this option to 1 or greater lets the model exclude some flat forecasting results. This option is disabled when AUTO_ARIMA is FALSE.

Arguments

The value is a INT64. The default value is 0.

NON_SEASONAL_ORDER

Syntax

NON_SEASONAL_ORDER = (p_value, d_value, q_value)

Description

The tuple of non-seasonal p, d, q for the ARIMA_PLUS model. There are no default values, and you must specify all three values. p and q must be a value between 0 and 5, inclusive. d must be a value between 0 and 2, inclusive.

The AUTO_ARIMA value must be FALSE to use this option.

Arguments

A tuple of three INT64 values. For example, (1, 2, 1).

DATA_FREQUENCY

Syntax

DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }

Description

The data frequency of the input time series. The finest supported granularity is PER_MINUTE.

Arguments

This option accepts the following values:

  • AUTO_FREQUENCY: This is the default. The training process automatically infers the data frequency, which can be any of the other supported values for this option.
  • PER_MINUTE
  • HOURLY
  • DAILY
  • WEEKLY
  • MONTHLY
  • QUARTERLY
  • YEARLY

INCLUDE_DRIFT

Syntax

INCLUDE_DRIFT = { TRUE | FALSE }

Description

Determines whether the ARIMA_PLUS model should include a linear drift term or not. The drift term is applicable when non-seasonal d is 1.

  • When the AUTO_ARIMA value is FALSE , this argument defaults to FALSE. You can set it to TRUE only when non-seasonal d is 1. Otherwise the CREATE MODEL statement returns an invalid query error.
  • When the AUTO_ARIMA value is TRUE, BigQuery ML automatically determines whether or not to include a linear drift term, so you can't use this option.

Arguments

A BOOL value. The default value is FALSE.

HOLIDAY_REGION

Syntax

HOLIDAY_REGION = string_value | string_array

Description

The geographical region based on which the holiday effect is applied in modeling. By default, holiday effect modeling isn't used. To use it, specify one or more holiday regions using this option. If you include more than one region string, the union of the holidays in all the provided regions are taken into account when modeling.

Holiday effect modeling is only applicable when the time series is daily or weekly, and longer than a year. If the input time series doesn't meet these requirements, holiday effect modeling isn't used even if you specify this option.

For more information about the holidays included in each region, see Holiday data.

Arguments

A STRING or ARRAY<STRING> value.

Use a single string value to identify one region. For example:

HOLIDAY_REGION = 'GLOBAL'

Use an array of string values to identify multiple regions. For example:

HOLIDAY_REGION = ['US', 'GB']

This option accepts the following values:

Global

  • GLOBAL

Continental regions

  • NA: North America
  • JAPAC: Japan and Asia Pacific
  • EMEA: Europe, the Middle East and Africa
  • LAC: Latin America and the Caribbean

Countries

  • AE: United Arab Emirates
  • AR: Argentina
  • AT: Austria
  • AU: Australia
  • BE: Belgium
  • BR: Brazil
  • CA: Canada
  • CH: Switzerland
  • CL: Chile
  • CN: China
  • CO: Colombia
  • CZ: Czechia
  • DE: Germany
  • DK: Denmark
  • DZ: Algeria
  • EC: Ecuador
  • EE: Estonia
  • EG: Egypt
  • ES: Spain
  • FI: Finland
  • FR: France
  • GB: United Kingdom
  • GR: Greece
  • HK: Hong Kong
  • HU: Hungary
  • ID: Indonesia
  • IE: Ireland
  • IL: Israel
  • IN: India
  • IR: Iran
  • IT: Italy
  • JP: Japan
  • KR: South Korea
  • LV: Latvia
  • MA: Morocco
  • MX: Mexico
  • MY: Malaysia
  • NG: Nigeria
  • NL: Netherlands
  • NO: Norway
  • NZ: New Zealand
  • PE: Peru
  • PH: Philippines
  • PK: Pakistan
  • PL: Poland
  • PT: Portugal
  • RO: Romania
  • RS: Serbia
  • RU: Russia
  • SA: Saudi Arabia
  • SE: Sweden
  • SG: Singapore
  • SI: Slovenia
  • SK: Slovakia
  • TH: Thailand
  • TR: Turkey
  • TW: Taiwan
  • UA: Ukraine
  • US: United States
  • VE: Venezuela
  • VN: Vietnam
  • ZA: South Africa

CLEAN_SPIKES_AND_DIPS

Syntax

CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE }

Description

Determines whether or not to perform automatic spikes and dips detection and cleanup in the ARIMA_PLUS model training pipeline. The spikes and dips are replaced with local linear interpolated values when they're detected.

Arguments

A BOOL value. The default value is TRUE.

ADJUST_STEP_CHANGES

Syntax

ADJUST_STEP_CHANGES = { TRUE | FALSE }

Description

Determines whether or not to perform automatic step change detection and adjustment in the ARIMA_PLUS model training pipeline.

Arguments

A BOOL value. The default value is TRUE.

TIME_SERIES_LENGTH_FRACTION

Syntax

TIME_SERIES_LENGTH_FRACTION = float64_value

Description

The fraction of the interpolated length of the time series that's used to model the time series trend component. All of the time points of the time series are used to model the non-trend component. For example, if the time series has 100 time points, then specifying a TIME_SERIES_LENGTH_FRACTION of 0.5 uses the most recent 50 time points for trend modeling. This training option accelerates modeling training without sacrificing much forecasting accuracy.

You can use the TIME_SERIES_LENGTH_FRACTION option with the MIN_TIME_SERIES_LENGTH option, but not with the MAX_TIME_SERIES_LENGTH option.

Arguments

A FLOAT64 value in the range (0, 1). The default behavior is to use all the points in the time series.

MIN_TIME_SERIES_LENGTH

Syntax

MIN_TIME_SERIES_LENGTH = int64_value

Description

The minimum number of time points that are used in modeling the trend component of the time series. If you use this option, you must also specify a value for the TIME_SERIES_LENGTH_FRACTION option. For example, if you use TIME_SERIES_ID_COL to forecast two time series, one with 100 time points and another with 30 time points, then setting TIME_SERIES_LENGTH_FRACTION to 0.5 and MIN_TIME_SERIES_LENGTH to 20 results in the last 50 points of first time series being used for trend modeling. For the second time series, the last 20 points rather than the last 15 points (30 * 0.5) are used in trend