The CREATE MODEL statement for the ARIMA_PLUS_XREG model
This document describes the CREATE MODEL
statement for creating multivariate
time series models in BigQuery.
Forecasting takes place when you create the model. You can use the
ML.FORECAST
and
ML.EXPLAIN_FORECAST
functions to retrieve the forecasting values and compute the prediction
intervals.
For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.
Time series modeling pipeline
The multivariate ARIMA_PLUS_XREG
time series model is an
ARIMA_PLUS
model
with linear external regressors. The following diagram shows the model pipeline:
For a diagram that shows the details of the ARIMA_PLUS Pipeline section in the preceding diagram, see Time series modeling pipeline.
The modeling pipeline for the ARIMA_PLUS
time series models performs the
following functions:
- Infer the data frequency of the time series.
- Handle irregular time intervals.
- Handle duplicated timestamps by taking the mean value.
- Interpolate missing data using local linear interpolation.
- Detect and clean spike and dip outliers.
- Detect and adjust abrupt step (level) changes.
- Detect and adjust holiday effect.
- Detect multiple seasonal patterns within a single time series by using Seasonal and Trend decomposition using Loess (STL), and extrapolate seasonality by using double exponential smoothing (ETS).
- Detect and model the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The model with the lowest Akaike information criterion (AIC) is selected as the best model.
CREATE MODEL
syntax
{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name OPTIONS(model_option_list) AS { query_statement | ( training_data AS (query_statement), custom_holiday AS (holiday_statement) ) } model_option_list: MODEL_TYPE = 'ARIMA_PLUS_XREG' [, TIME_SERIES_TIMESTAMP_COL = string_value ] [, TIME_SERIES_DATA_COL = string_value ] [, HORIZON = int64_value ] [, AUTO_ARIMA = { TRUE | FALSE } ] [, AUTO_ARIMA_MAX_ORDER = int64_value ] [, AUTO_ARIMA_MIN_ORDER = int64_value ] [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ] [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ] [, INCLUDE_DRIFT = { TRUE | FALSE } ] [, HOLIDAY_REGION = string_value | string_array ] [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ] [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ] [, TIME_SERIES_LENGTH_FRACTION = float64_value ] [, MIN_TIME_SERIES_LENGTH = int64_value ] [, MAX_TIME_SERIES_LENGTH = int64_value ] [, TREND_SMOOTHING_WINDOW_SIZE = int64_value ] [, L2_REG = float64_value ]
CREATE MODEL
Creates and trains a new model in the specified dataset. If the model name
exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates and trains a new model only if the model doesn't exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates and trains a model and replaces an existing model with the same name in the specified dataset.
model_name
The name of the model you're creating or replacing. The model name must be unique in the dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can:
- Contain up to 1,024 characters
- Contain letters (upper or lower case), numbers, and underscores
model_name
is not case-sensitive.
If you don't have a default project configured, then you must prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example, `myproject.mydataset.mymodel`.
MODEL_TYPE
Syntax
MODEL_TYPE = 'ARIMA_PLUS_XREG'
Description
Specifies the model type. This option is required.
TIME_SERIES_TIMESTAMP_COL
Syntax
TIME_SERIES_TIMESTAMP_COL = string_value
Description
The name of the column that provides the time points used in training the model. The column must be of one of the following data types:
TIMESTAMP
DATE
DATETIME
Arguments
A STRING
value.
TIME_SERIES_DATA_COL
Syntax
TIME_SERIES_DATA_COL = string_value
Description
The name of the column that contains the data to forecast. The column must be of one of the following data types:
INT64
NUMERIC
BIGNUMERIC
FLOAT64
Arguments
A STRING
value.
HORIZON
Syntax
HORIZON = int64_value
Description
The number of time points to forecast.
Arguments
An INT64
value. The default value is 1,000
. The maximum value is 10,000
.
AUTO_ARIMA
Syntax
AUTO_ARIMA = { TRUE | FALSE }
Description
Determines whether the training process uses auto.ARIMA or not. If TRUE
,
training automatically finds the best non-seasonal order (that is, the p, d,
q tuple) and decides whether or not to include a linear drift term when d is 1.
If FALSE
, you must specify the NON_SEASONAL_ORDER
option.
Arguments
A BOOL
value. The default value is TRUE
.
AUTO_ARIMA_MAX_ORDER
Syntax
AUTO_ARIMA_MAX_ORDER = int64_value
Description
The maximum value for the sum of non-seasonal p and q. This value determines
the parameter search space in the auto.ARIMA algorithm, in combination with the
AUTO_ARIMA_MIN_ORDER
value. This option is disabled when the AUTO_ARIMA
value is FALSE
.
Arguments
An INT64
value between 1
and 5
, inclusive. The default value is 5
.
If non-seasonal d is determined to be 0 or 2, the number of candidate models evaluated for each supported value is as follows:
1
: 3 candidate models2
: 6 candidate models3
: 10 candidate models4
: 15 candidate models5
: 21 candidate models
If non-seasonal d is determined to be 1, the number of candidate models to evaluate is doubled, because there's an additional drift term to consider for all of the existing candidate models.
AUTO_ARIMA_MIN_ORDER
Syntax
AUTO_ARIMA_MIN_ORDER = int64_value
Description
The minimum value for the sum of non-seasonal p and q. This value determines
the parameter search space in the auto.ARIMA algorithm, in combination with the
AUTO_ARIMA_MAX_ORDER
value. Setting this option to 1
or greater lets the
model exclude some flat forecasting results. This option is disabled when
AUTO_ARIMA
is FALSE
.
Arguments
The value is a INT64
. The default value is 0
.
NON_SEASONAL_ORDER
Syntax
NON_SEASONAL_ORDER = (p_value, d_value, q_value)
Description
The tuple of non-seasonal p, d, q for the ARIMA_PLUS
model. There are no
default values, and you must specify all three values. p and q must be a value
between 0
and 5
, inclusive. d must be a value between 0
and 2
,
inclusive.
The AUTO_ARIMA
value must be FALSE
to use this option.
Arguments
A tuple of three INT64
values. For example, (1, 2, 1)
.
DATA_FREQUENCY
Syntax
DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }
Description
The data frequency of the input time series. The finest supported granularity is
PER_MINUTE
.
Arguments
This option accepts the following values:
AUTO_FREQUENCY
: This is the default. The training process automatically infers the data frequency, which can be any of the other supported values for this option.PER_MINUTE
HOURLY
DAILY
WEEKLY
MONTHLY
QUARTERLY
YEARLY
INCLUDE_DRIFT
Syntax
INCLUDE_DRIFT = { TRUE | FALSE }
Description
Determines whether the ARIMA_PLUS
model should include a linear drift term or
not. The drift term is applicable when non-seasonal d is 1.
- When the
AUTO_ARIMA
value isFALSE
, this argument defaults toFALSE
. You can set it toTRUE
only when non-seasonal d is 1. Otherwise theCREATE MODEL
statement returns an invalid query error. - When the
AUTO_ARIMA
value isTRUE
, BigQuery ML automatically determines whether or not to include a linear drift term, so you can't use this option.
Arguments
A BOOL
value. The default value is FALSE
.
HOLIDAY_REGION
Syntax
HOLIDAY_REGION = string_value | string_array
Description
The geographical region based on which the holiday effect is applied in modeling. By default, holiday effect modeling isn't used. To use it, specify one or more holiday regions using this option. If you include more than one region string, the union of the holidays in all the provided regions are taken into account when modeling.
Holiday effect modeling is only applicable when the time series is daily or weekly, and longer than a year. If the input time series doesn't meet these requirements, holiday effect modeling isn't used even if you specify this option.
For more information about the holidays included in each region, see Holiday data.
Arguments
A STRING
or ARRAY<STRING>
value.
Use a single string value to identify one region. For example:
HOLIDAY_REGION = 'GLOBAL'
Use an array of string values to identify multiple regions. For example:
HOLIDAY_REGION = ['US', 'GB']
This option accepts the following values:
Global
GLOBAL
Continental regions
NA
: North AmericaJAPAC
: Japan and Asia PacificEMEA
: Europe, the Middle East and AfricaLAC
: Latin America and the Caribbean
Countries
AE
: United Arab EmiratesAR
: ArgentinaAT
: AustriaAU
: AustraliaBE
: BelgiumBR
: BrazilCA
: CanadaCH
: SwitzerlandCL
: ChileCN
: ChinaCO
: ColombiaCZ
: CzechiaDE
: GermanyDK
: DenmarkDZ
: AlgeriaEC
: EcuadorEE
: EstoniaEG
: EgyptES
: SpainFI
: FinlandFR
: FranceGB
: United KingdomGR
: GreeceHK
: Hong KongHU
: HungaryID
: IndonesiaIE
: IrelandIL
: IsraelIN
: IndiaIR
: IranIT
: ItalyJP
: JapanKR
: South KoreaLV
: LatviaMA
: MoroccoMX
: MexicoMY
: MalaysiaNG
: NigeriaNL
: NetherlandsNO
: NorwayNZ
: New ZealandPE
: PeruPH
: PhilippinesPK
: PakistanPL
: PolandPT
: PortugalRO
: RomaniaRS
: SerbiaRU
: RussiaSA
: Saudi ArabiaSE
: SwedenSG
: SingaporeSI
: SloveniaSK
: SlovakiaTH
: ThailandTR
: TurkeyTW
: TaiwanUA
: UkraineUS
: United StatesVE
: VenezuelaVN
: VietnamZA
: South Africa
CLEAN_SPIKES_AND_DIPS
Syntax
CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE }
Description
Determines whether or not to perform automatic spikes and dips detection and
cleanup in the ARIMA_PLUS
model training pipeline. The spikes and dips are
replaced with local linear interpolated values when they're detected.
Arguments
A BOOL
value. The default value is TRUE
.
ADJUST_STEP_CHANGES
Syntax
ADJUST_STEP_CHANGES = { TRUE | FALSE }
Description
Determines whether or not to perform automatic step change detection and
adjustment in the ARIMA_PLUS
model training pipeline.
Arguments
A BOOL
value. The default value is TRUE
.
TIME_SERIES_LENGTH_FRACTION
Syntax
TIME_SERIES_LENGTH_FRACTION = float64_value
Description
The fraction of the interpolated length of the time series that's used to
model the time series trend component. All of the time points of the time series
are used to model the non-trend component. For example, if the time series has
100 time points, then specifying a TIME_SERIES_LENGTH_FRACTION
of 0.5
uses
the most recent 50 time points for trend modeling. This training option
accelerates modeling training without sacrificing much forecasting accuracy.
You can use the TIME_SERIES_LENGTH_FRACTION
option with the
MIN_TIME_SERIES_LENGTH
option, but not with the MAX_TIME_SERIES_LENGTH
option.
Arguments
A FLOAT64
value in the range (0, 1)
. The default behavior is to use all
the points in the time series.
MIN_TIME_SERIES_LENGTH
Syntax
MIN_TIME_SERIES_LENGTH = int64_value
Description
The minimum number of time points that are used in modeling the
trend component of the time series. If you use this option, you must also
specify a value for the TIME_SERIES_LENGTH_FRACTION
option. For example,
if you use TIME_SERIES_ID_COL
to forecast two time series, one with 100 time
points and another with 30 time points, then setting
TIME_SERIES_LENGTH_FRACTION
to 0.5
and MIN_TIME_SERIES_LENGTH
to 20
results in the last 50 points of first time series being used for trend
modeling. For the second time series, the last 20 points rather than the last
15 points (30 * 0.5
) are used in trend