The CREATE MODEL statement for time series models

CREATE MODEL statement for time series models

To create time series models in BigQuery, use the BigQuery ML CREATE MODEL statement and specify MODEL_TYPE to be 'ARIMA_PLUS'.

BigQuery ML time series modeling pipeline

The BigQuery ML time series modeling pipeline includes multiple modules. The ARIMA model is the most computationally expensive, which is why the model is named ARIMA_PLUS.

SINGLE_TIME_SERIES_DIAGRAM

The modeling pipeline for the BigQuery ML time series includes the following functionalities:

  • Infer the data frequency of the time series.
  • Handle irregular time intervals.
  • Handle duplicated timestamps by taking the mean value.
  • Interpolate missing data using local linear interpolation.
  • Detect and clean spike and dip outliers.
  • Detect and adjust abrupt step (level) changes.
  • Detect and adjust holiday effect.
  • Detect multiple seasonal patterns within a single time series via Seasonal and Trend decomposition using Loess (STL), and extrapolate seasonality via double exponential smoothing (ETS).
  • Detect and model the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyper-parameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).

BigQuery ML large-scale time series

Up to 500,000 time series can be forecasted simultaneously with a single query using the TIME_SERIES_ID_COL option. Different modeling pipelines run in parallel, assuming that enough slots are available. The following diagram shows this process:

MULTIPLE_TIME_SERIES_DIAGRAM

CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL}
model_name
OPTIONS(MODEL_TYPE = 'ARIMA_PLUS'
  [, TIME_SERIES_TIMESTAMP_COL = string_value ]
  [, TIME_SERIES_DATA_COL = string_value ]
  [, TIME_SERIES_ID_COL = { string_value | string_array } ]
  [, HORIZON = int64_value ]
  [, AUTO_ARIMA = { TRUE | FALSE } ]
  [, AUTO_ARIMA_MAX_ORDER = int64_value ]
  [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
  [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]
  [, INCLUDE_DRIFT = { TRUE | FALSE } ]
  [, HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... } ]
  [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
  [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
  [, DECOMPOSE_TIME_SERIES = { TRUE | FALSE } ])
AS query_statement

CREATE MODEL

Creates a new BigQuery ML model in the specified dataset. If the model name exists, CREATE MODEL returns an error.

CREATE MODEL IF NOT EXISTS

Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.

CREATE OR REPLACE MODEL

Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.

model_name

model_name is the name of the BigQuery ML model that you're creating or replacing. The model name must be unique per dataset: no other model or table can have the same name. The model name must follow the same naming rules as a BigQuery table. A model name can contain the following:

  • Up to 1,024 characters
  • Letters of either case, numbers, and underscores

model_name is not case-sensitive.

If you do not have a default project configured, prepend the project ID to the model name in following format, including backticks:

`[PROJECT_ID].[DATASET].[MODEL]`

For example:

`myproject.mydataset.mymodel`

CREATE MODEL supports the following options:

MODEL_TYPE

Syntax

MODEL_TYPE = 'ARIMA_PLUS'

Description

Specifies the model type. To create a time series model, set model_type to 'ARIMA_PLUS'.

model_option_list

In the model_option_list, the options that are always required include model_type, time_series_timestamp_col, time_series_data_col. Other options are only required in certain scenarios. See more details below.

Time series models support the following options:

TIME_SERIES_TIMESTAMP_COL

Syntax

TIME_SERIES_TIMESTAMP_COL = string_value

Description

The timestamp column name for time series models.

Arguments

string_value is a 'STRING'.

TIME_SERIES_DATA_COL

Syntax

TIME_SERIES_DATA_COL = string_value

Description

The data column name for time series models.

Arguments

string_value is a 'STRING'.

TIME_SERIES_ID_COL

Syntax

TIME_SERIES_ID_COL = { string_value | string_array }

Description

The ID column names for time series models. These columns are used when the user wants to fit and forecast multiple time series using a single query. Different IDs indicate different time series.

Arguments

This can be either of the following:

  • string_value: 'STRING'
  • string_array: an array of 'STRING'

HORIZON

Syntax

HORIZON = int64_value

Description

The number of time points to forecast. When forecasting multiple time-series at once, this parameter applies to each time series.

Arguments

The value is a INT64. The default value is 1000. The maximum value is 10,000.

AUTO_ARIMA

Syntax

AUTO_ARIMA = { TRUE | FALSE }

Description

Whether the training process should use auto.ARIMA or not. If true, training will automatically find the best non-seasonal order (i.e., the p, d, q tuple) and decide whether or not to include a linear drift term when d is 1. If false, the user must specify non_seasonal_order in the query. When forecasting multiple time-series at the same time, the auto.ARIMA algorithm must be used for each time series, so this option must not be set to false.

Arguments

The value is a BOOL. The default value is TRUE.

AUTO_ARIMA_MAX_ORDER

Syntax

AUTO_ARIMA_MAX_ORDER = int64_value

Description

The maximum value for the sum of non-sesonal p and q. It controls the parameter search space in the auto.ARIMA algorithm. Currently, the allowed values are (2, 3, 4, 5). As a reference, for each value there are (6, 10, 15, 21) candidate models to evaluat if non-seasonal d is determined to be 0 or 2. If non-seasonal d is determined to be 1, the number of candidate models to evaluate doubles as there is an additional drift term to consider for all the existing candidate models. This option is disabled when AUTO_ARIMA is set to false.

Arguments

The value is a INT64. The default value is 5. The minimum value is 2 and the maximum value is 5.

NON_SEASONAL_ORDER

Syntax

NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value)

Description

The tuple of non-seasonal p, d, q for the ARIMA_PLUS model. There are no default values and you must specify all of them. You must explicitly specify auto_arima to false to use this option. Currently, p and q are restricted to [0, 1, 2, 3, 4, 5] and d is restricted to [0, 1, 2]. When forecasting multiple time-series at the same time, because the auto.ARIMA algorithm must be used for each time series, this option is disabled.

Arguments

(int64_value, int64_value, int64_value) is a tuple of three 'INT64'.

DATA_FREQUENCY

Syntax

DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }

Description

The data frequency of the input time series. The finest supported granularity is 'PER_MINUTE'. When forecasting multiple time-series at once, this argument applies to all individual time series.

Arguments

Accepts the following values:

'AUTO_FREQUENCY': the training process automatically infers the data frequency, which can be one of the values listed below.

'PER_MINUTE': per-minute time series

'HOURLY': hourly time series

'DAILY': daily time series

'WEEKLY': weekly time series

'MONTHLY': monthly time series

'QUARTERLY': querterly time series

'YEARLY': yearly time series

The default value is 'AUTO_FREQUENCY'.

INCLUDE_DRIFT

Syntax

INCLUDE_DRIFT = { TRUE | FALSE }

Description

Should the ARIMA_PLUS model include a linear drift term or not. The drift term is applicable when non-seasonal d is 1.

  • When auto-arima is set to false, this argument is default to false. It can be set to true only when non-seasonal d is 1, otherwise it will return an invalid query error.

  • When auto-arima is set to true, it will automatically decide whether or not to include a linear drift term. Therefore, this option is disabled for auto-ARIMA.

Arguments

The value is a BOOL. The default value is FALSE for auto_arima is disabled.

HOLIDAY_REGION

Syntax

HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... }

Description

The geographical region based on which the holiday effect is applied in modeling. By default, holiday effect modeling is disabled. To turn it on, specify the holiday region using this option.

Arguments

Accepts the following values:

Top level: global

  • 'GLOBAL'

Second level: continental regions

  • 'NA': North America
  • 'JAPAC': Japan and Asia Pacific
  • 'EMEA': Europe, the Middle East and Africa
  • 'LAC': Latin America and the Caribbean

Third level: countries/regions

  • 'AE': United Arab Emirates
  • 'AR': Argentina
  • 'AT': Austria
  • 'AU': Australia
  • 'BE': Belgium
  • 'BR': Brazil
  • 'CA': Canada
  • 'CH': Switzerland
  • 'CL': Chile
  • 'CN': China
  • 'CO': Colombia
  • 'CZ': Czechia
  • 'DE': Germany
  • 'DK': Denmark
  • 'DZ': Algeria
  • 'EC': Ecuador
  • 'EE': Estonia
  • 'EG': Egypt
  • 'ES': Spain
  • 'FI': Finland
  • 'FR': France
  • 'GB': United Kingdom
  • 'GR': Greece
  • 'HK': Hong Kong
  • 'HU': Hungary
  • 'ID': Indonesia
  • 'IE': Ireland
  • 'IL': Israel
  • 'IN': India
  • 'IR': Iran
  • 'IT': Italy
  • 'JP': Japan
  • 'KR': South Korea
  • 'LV': Latvia
  • 'MA': Morocco
  • 'MX': Mexico
  • 'MY': Malaysia
  • 'NG': Nigeria
  • 'NL': Netherlands
  • 'NO': Norway
  • 'NZ': New Zealand
  • 'PE': Peru
  • 'PH': Philippines
  • 'PK': Pakistan
  • 'PL': Poland
  • 'PT': Portugal
  • 'RO': Romania
  • 'RS': Serbia
  • 'RU': Russia
  • 'SA': Saudi Arabia
  • 'SE': Sweden
  • 'SG': Singapore
  • 'SI': Slovenia
  • 'SK': Slovakia
  • 'TH': Thailand
  • 'TR': Turkey
  • 'TW': Taiwan
  • 'UA': Ukraine
  • 'US': United States
  • 'VE': Venezuela
  • 'VN': Vietnam
  • 'ZA': South Africa

CLEAN_SPIKES_AND_DIPS

Syntax

CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE }

Description

Whether or not to perform automatic spikes and dips detection and cleanup in the ARIMA_PLUS model training pipeline. The spikes and dips are replaced with local linear interpolated values when they are detected.

Arguments

The value is a BOOL. The default value is TRUE.

ADJUST_STEP_CHANGES

Syntax

ADJUST_STEP_CHANGES = { TRUE | FALSE }

Description

Whether or not to perform automatic step change detection and adjustment in the ARIMA_PLUS model training pipeline.

Arguments

The value is a BOOL. The default value is TRUE.

DECOMPOSE_TIME_SERIES

Syntax

DECOMPOSE_TIME_SERIES = { TRUE | FALSE }

Description

Whether the separate components of both the history and forecast parts of the time series (such as holiday effect and seasonal components) should be saved in the ARIMA_PLUS model.

Arguments

The value is a BOOL. The default value is TRUE.

query_statement

The AS query_statement clause specifies the standard SQL query that is used to generate the training data. For information about the supported SQL syntax of the query_statement clause, see Standard SQL query syntax.

For time series models, the query_statement is expected to contain either 2 or 3 columns, depending on whether the user wants to forecast a single time-series or multiple time-series. In both cases, time_series_timestamp_col and time_series_data_col are required. Additional time_series_id_col or time_series_id_cols columns are required for forecasting multiple time series.

Supported inputs

The CREATE MODEL statement supports the following data types for the time series input columns.

Supported data types for time series model inputs

BigQuery ML supports different standard SQL data types for the input columns for time series models. Supported data types for each respective column include:

time series input column Supported types
time_series_timestamp_col TIMESTAMP
DATE
DATETIME
time_series_data_col INT64
NUMERIC
BIGNUMERIC
FLOAT64
time_series_id_col STRING
INT64

Known limitations

CREATE MODEL statements for time series models must comply with the following rules:

  • The maximum length for the input time series is 1,000,000. When forecasting multiple time-series at the same time, the limit applies to each time series.
  • The maximum number of time series to forecast simultaneously using the ID columns is 500,000.
  • When forecasting multiple time-series simultaneously using the ID column, those invalid time series that fail the model fitting will be ignored and won't appear in the results of forecast. Examples are single point time series. A warning message is shown in this case, and you can use the ML.ARIMA_EVALUATE function to retrieve the possible error message.
  • The maximum time points to forecast, which is specified using horizon, is 10,000.
  • Holiday effect modeling is effective only for approximately 5 years.
  • The BigQuery ML training option warm_start is not supported by ARIMA_PLUS models.

How to avoid long-running queries

Forecasting many time series simultaneously using the ID columns can lead to long-running queries.

  • When many time series are forecasted simultaneously using the ID columns, they won't be forecasted completely in parallel because of the slot capacity. As a result, the query can take a very long time to complete when there are a lot of time series to forecast. The query runtime will depend on your slot capacity, the properties of your time series such as length, and automatically determined non_seasonal_d. When you have a large number (for example, 100,000) of time series to forecast, we highly recommended that you first forecast a small batch of time series (for example, 1000) to see how long the query takes. Then, you can roughly estimate how long it will take for you to forecast your whole time series.
  • You can use the auto_arima_max_order option to balance between the query runtime and forecast accuracy. For example, if you use 4 instead of the default value 5 for this option, the query runtime can be reduced by at least 30%. However, the forecast accuracy might slightly drop for some of the time series.
  • If you want to avoid a single long-running query, you can also use BigQuery scripting.

CREATE MODEL examples

The following example creates models named mymodel in mydataset in your default project.

Training a time series model to forecast a single time series

This example creates a time series model.

CREATE MODEL `project_id.mydataset.mymodel`
 OPTIONS(MODEL_TYPE='ARIMA_PLUS',
         time_series_timestamp_col='date'
         time_series_data_col='transaction') AS
SELECT
  date,
  transaction
FROM
  `mydataset.mytable`

Training multiple time-series models for multiple time-series at the same time

This example creates multiple time-series models, one for each input time series.

CREATE MODEL `project_id.mydataset.mymodel`
 OPTIONS(MODEL_TYPE='ARIMA_PLUS',
         time_series_timestamp_col='date'
         time_series_data_col='transaction',
         time_series_id_col='company_name') AS
SELECT
  date,
  transaction,
  company_name
FROM
  `mydataset.mytable`

Training multiple time-series models for multiple time series at the same time using multiple time-series ID columns

This example creates multiple time-series models, one for each input time series.

CREATE MODEL `project_id.mydataset.mymodel`
 OPTIONS(MODEL_TYPE='ARIMA_PLUS',
         time_series_timestamp_col='date'
         time_series_data_col='transaction',
         time_series_id_col=['company_name', 'department_name']) AS
SELECT
  date,
  transaction,
  company_name,
  department_name
FROM
  `mydataset.mytable`

What's next