CREATE MODEL statement for time series models
To create time series models in BigQuery, use the BigQuery ML CREATE MODEL statement and specify MODEL_TYPE to be 'ARIMA_PLUS'.
For information about supported model types of each SQL statement and function, and all supported SQL statements and functions for each model type, read End-to-end user journey for each model.
BigQuery ML time series modeling pipeline
The BigQuery ML time series modeling pipeline includes multiple modules. The ARIMA model is the most computationally expensive module, which is why the model is named ARIMA_PLUS.
The modeling pipeline for the BigQuery ML time series includes the following functionalities:
- Infer the data frequency of the time series.
- Handle irregular time intervals.
- Handle duplicated timestamps by taking the mean value.
- Interpolate missing data using local linear interpolation.
- Detect and clean spike and dip outliers.
- Detect and adjust abrupt step (level) changes.
- Detect and adjust holiday effect.
- Detect multiple seasonal patterns within a single time series via Seasonal and Trend decomposition using Loess (STL), and extrapolate seasonality via double exponential smoothing (ETS).
- Detect and model the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel. The best model comes with the lowest Akaike information criterion (AIC).
BigQuery ML large-scale time series
Up to 100,000,000 time series can be forecast simultaneously with a single query using the TIME_SERIES_ID_COL option. Different modeling pipelines run in parallel, assuming that enough slots are available.
Large-scale time series forecasting best practices
To learn more, see Scalable forecasting with millions of time series in BigQuery.
- When you have a large number (for example, 100,000) of time series to forecast, we highly recommend that you first forecast a small batch of time series (for example, 1,000) to see how long the query takes. You can then estimate how long your entire time series forecast will take.
- You can use the AUTO_ARIMA_MAX_ORDER option to balance between the query runtime and forecast accuracy. Increasing AUTO_ARIMA_MAX_ORDER expands the hyperparameter search space to try more complex ARIMA models, that is, models with higher non-seasonal p and q. Decreasing the value of AUTO_ARIMA_MAX_ORDER can therefore accelerate the query. For example, if you use 3 instead of the default value 5 for this option, the query runtime is reduced by at least 50%. The forecast accuracy might drop slightly for some of the time series. If a shorter training time is important to your case, use a smaller value for AUTO_ARIMA_MAX_ORDER.
- The model training time for each time series has a linear relationship to its length, that is, the number of data points. The longer the time series, the longer the training takes. On the other hand, for time series forecasting, not all data points contribute equally to the model fitting process. Instead, the more recent a data point is, the more it contributes. Therefore, if you have a long time series, for example ten years of daily data, you don't need to train a time series model using all the data points. The most recent two or three years of data points are enough.
- You can use the TIME_SERIES_LENGTH_FRACTION, MIN_TIME_SERIES_LENGTH, and MAX_TIME_SERIES_LENGTH training options to enable smart, fast model training with little or no loss of forecasting accuracy. The idea behind this is that while periodic modeling such as seasonality requires a certain number of time points, trend modeling doesn't need many time points. Meanwhile, trend modeling is much more computationally expensive than the other time series components. By using these training options, you can efficiently model the trend component with a subset of the time series, while the other time series components use the entire time series.
- To avoid a single long-running query, you can also use BigQuery multi-statement queries.
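The small-batch trial described in the first best practice can be sketched as follows. This is an illustrative sketch: the model, table, and column names are placeholders borrowed from the examples later on this page, and the batch of roughly 1,000 series is selected with a subquery over the ID column.

```sql
-- Sketch: time a trial run on about 1,000 of the time series
-- before launching the full forecast. All names are placeholders.
CREATE OR REPLACE MODEL `mydataset.mymodel_trial`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
        time_series_timestamp_col='date',
        time_series_data_col='transaction',
        time_series_id_col='company_name',
        auto_arima_max_order=3)  -- smaller search space for a faster query
AS
SELECT date, transaction, company_name
FROM `mydataset.mytable`
WHERE company_name IN (
  -- a small, arbitrary batch of IDs
  SELECT DISTINCT company_name FROM `mydataset.mytable` LIMIT 1000)
```

The trial query's runtime then gives a rough per-batch estimate for sizing the full job.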
CREATE MODEL syntax

{CREATE MODEL | CREATE MODEL IF NOT EXISTS | CREATE OR REPLACE MODEL} model_name
OPTIONS(MODEL_TYPE = 'ARIMA_PLUS'
  [, TIME_SERIES_TIMESTAMP_COL = string_value ]
  [, TIME_SERIES_DATA_COL = string_value ]
  [, TIME_SERIES_ID_COL = { string_value | string_array } ]
  [, HORIZON = int64_value ]
  [, AUTO_ARIMA = { TRUE | FALSE } ]
  [, AUTO_ARIMA_MAX_ORDER = int64_value ]
  [, NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value) ]
  [, DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' } ]
  [, INCLUDE_DRIFT = { TRUE | FALSE } ]
  [, HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... } ]
  [, CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE } ]
  [, ADJUST_STEP_CHANGES = { TRUE | FALSE } ]
  [, DECOMPOSE_TIME_SERIES = { TRUE | FALSE } ]
  [, TIME_SERIES_LENGTH_FRACTION = float64_value ]
  [, MIN_TIME_SERIES_LENGTH = int64_value ]
  [, MAX_TIME_SERIES_LENGTH = int64_value ])
AS query_statement
CREATE MODEL
Creates a new BigQuery ML model in the specified dataset. If the model
name exists, CREATE MODEL
returns an error.
CREATE MODEL IF NOT EXISTS
Creates a new BigQuery ML model only if the model does not currently exist in the specified dataset.
CREATE OR REPLACE MODEL
Creates a new BigQuery ML model and replaces any existing model with the same name in the specified dataset.
model_name
model_name
is the name of the BigQuery ML model that you're creating or
replacing. The model name must be unique per dataset: no other model or table
can have the same name. The model name must follow the same naming rules as a
BigQuery table. A model name can contain the following:
- Up to 1,024 characters
- Letters of either case, numbers, and underscores
model_name
is not case-sensitive.
If you do not have a default project configured, prepend the project ID to the model name in the following format, including backticks:
`[PROJECT_ID].[DATASET].[MODEL]`
For example:
`myproject.mydataset.mymodel`
CREATE MODEL
supports the following options:
MODEL_TYPE
Syntax
MODEL_TYPE = 'ARIMA_PLUS'
Description
Specifies the model type. To create a time series model, set model_type
to 'ARIMA_PLUS'
.
model_option_list
In the model_option_list, the options that are always required include model_type, time_series_timestamp_col, and time_series_data_col. Other options are only required in certain scenarios. See more details below.
Time series models support the following options:
TIME_SERIES_TIMESTAMP_COL
Syntax
TIME_SERIES_TIMESTAMP_COL = string_value
Description
The timestamp column name for time series models.
Arguments
string_value
is a 'STRING'
.
TIME_SERIES_DATA_COL
Syntax
TIME_SERIES_DATA_COL = string_value
Description
The data column name for time series models.
Arguments
string_value
is a 'STRING'
.
TIME_SERIES_ID_COL
Syntax
TIME_SERIES_ID_COL = { string_value | string_array }
Description
The ID column names for time series models. These columns are used when the user wants to fit and forecast multiple time series using a single query. Different IDs indicate different time series.
Arguments
This can be either of the following:
string_value
:'STRING'
string_array
: an array of'STRING'
HORIZON
Syntax
HORIZON = int64_value
Description
The number of time points to forecast. When forecasting multiple time-series at once, this parameter applies to each time series.
Arguments
The value is an INT64. The default value is 1000. The maximum value is 10,000.
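For illustration, a model limited to forecasting 30 future time points might be created like this; the table and column names are placeholders:

```sql
CREATE MODEL `mydataset.mymodel`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
        time_series_timestamp_col='date',
        time_series_data_col='transaction',
        HORIZON=30) AS
SELECT date, transaction
FROM `mydataset.mytable`
```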
AUTO_ARIMA
Syntax
AUTO_ARIMA = { TRUE | FALSE }
Description
Whether the training process should use auto.ARIMA or not. If true, training will automatically find the best non-seasonal order (i.e., the p, d, q tuple) and decide whether or not to include a linear drift term when d is 1. If false, the user must specify non_seasonal_order in the query. When forecasting multiple time-series at the same time, the auto.ARIMA algorithm must be used for each time series, so this option must not be set to false.
Arguments
The value is a BOOL
. The default value is TRUE
.
AUTO_ARIMA_MAX_ORDER
Syntax
AUTO_ARIMA_MAX_ORDER = int64_value
Description
The maximum value for the sum of non-seasonal p and q. It controls the parameter
search space in the auto.ARIMA algorithm. Currently, the allowed values are
(1, 2, 3, 4, 5). As a reference, for each value there are (3, 6, 10, 15, 21) candidate
models to evaluate if non-seasonal d is determined to be 0 or 2. If non-seasonal
d is determined to be 1, the number of candidate models to evaluate doubles as
there is an additional drift term to consider for all the existing candidate
models. This option is disabled when AUTO_ARIMA
is set to false.
Arguments
The value is an INT64. The default value is 5. The minimum value is 1 and the maximum value is 5.
NON_SEASONAL_ORDER
Syntax
NON_SEASONAL_ORDER = (int64_value, int64_value, int64_value)
Description
The tuple of non-seasonal p, d, q for the ARIMA_PLUS model. There are no default values and you must specify all of them. You must explicitly set auto_arima to false to use this option. Currently, p and q are restricted to [0, 1, 2, 3, 4, 5] and d is restricted to [0, 1, 2]. When forecasting multiple time series at the same time, because the auto.ARIMA algorithm must be used for each time series, this option is disabled.
Arguments
(int64_value, int64_value, int64_value)
is a tuple of
three 'INT64'
.
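For example, to fit a single ARIMA(2, 1, 2) model rather than search over candidates, auto.ARIMA must be disabled explicitly. This is a sketch with placeholder table and column names:

```sql
CREATE MODEL `mydataset.mymodel`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
        time_series_timestamp_col='date',
        time_series_data_col='transaction',
        AUTO_ARIMA=FALSE,                 -- required when NON_SEASONAL_ORDER is set
        NON_SEASONAL_ORDER=(2, 1, 2)) AS  -- non-seasonal p, d, q
SELECT date, transaction
FROM `mydataset.mytable`
```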
DATA_FREQUENCY
Syntax
DATA_FREQUENCY = { 'AUTO_FREQUENCY' | 'PER_MINUTE' | 'HOURLY' | 'DAILY' | 'WEEKLY' | 'MONTHLY' | 'QUARTERLY' | 'YEARLY' }
Description
The data frequency of the input time series. The finest supported granularity is
'PER_MINUTE'
. When forecasting multiple time-series at once, this argument
applies to all individual time series.
Arguments
Accepts the following values:
'AUTO_FREQUENCY'
: the training process automatically infers the data
frequency, which can be one of the values listed below.
'PER_MINUTE'
: per-minute time series
'HOURLY'
: hourly time series
'DAILY'
: daily time series
'WEEKLY'
: weekly time series
'MONTHLY'
: monthly time series
'QUARTERLY'
: quarterly time series
'YEARLY'
: yearly time series
The default value is 'AUTO_FREQUENCY'
.
INCLUDE_DRIFT
Syntax
INCLUDE_DRIFT = { TRUE | FALSE }
Description
Whether the ARIMA_PLUS model should include a linear drift term. The drift term is applicable when non-seasonal d is 1.
When auto_arima is set to false, this argument defaults to false. It can be set to true only when non-seasonal d is 1; otherwise an invalid query error is returned.
When auto_arima is set to true, the training process automatically decides whether to include a linear drift term. Therefore, this option is disabled for auto-ARIMA.
Arguments
The value is a BOOL. The default value is FALSE when AUTO_ARIMA is disabled.
HOLIDAY_REGION
Syntax
HOLIDAY_REGION = { 'GLOBAL' | 'NA' | 'JAPAC' | 'EMEA' | 'LAC' | 'AE' | ... }
Description
The geographical region based on which the holiday effect is applied in modeling. By default, holiday effect modeling is disabled. To turn it on, specify the holiday region using this option.
Arguments
Accepts the following values:
Top level: global
- 'GLOBAL'
Second level: continental regions
- 'NA': North America
- 'JAPAC': Japan and Asia Pacific
- 'EMEA': Europe, the Middle East and Africa
- 'LAC': Latin America and the Caribbean
Third level: countries/regions
- 'AE': United Arab Emirates
- 'AR': Argentina
- 'AT': Austria
- 'AU': Australia
- 'BE': Belgium
- 'BR': Brazil
- 'CA': Canada
- 'CH': Switzerland
- 'CL': Chile
- 'CN': China
- 'CO': Colombia
- 'CZ': Czechia
- 'DE': Germany
- 'DK': Denmark
- 'DZ': Algeria
- 'EC': Ecuador
- 'EE': Estonia
- 'EG': Egypt
- 'ES': Spain
- 'FI': Finland
- 'FR': France
- 'GB': United Kingdom
- 'GR': Greece
- 'HK': Hong Kong
- 'HU': Hungary
- 'ID': Indonesia
- 'IE': Ireland
- 'IL': Israel
- 'IN': India
- 'IR': Iran
- 'IT': Italy
- 'JP': Japan
- 'KR': South Korea
- 'LV': Latvia
- 'MA': Morocco
- 'MX': Mexico
- 'MY': Malaysia
- 'NG': Nigeria
- 'NL': Netherlands
- 'NO': Norway
- 'NZ': New Zealand
- 'PE': Peru
- 'PH': Philippines
- 'PK': Pakistan
- 'PL': Poland
- 'PT': Portugal
- 'RO': Romania
- 'RS': Serbia
- 'RU': Russia
- 'SA': Saudi Arabia
- 'SE': Sweden
- 'SG': Singapore
- 'SI': Slovenia
- 'SK': Slovakia
- 'TH': Thailand
- 'TR': Turkey
- 'TW': Taiwan
- 'UA': Ukraine
- 'US': United States
- 'VE': Venezuela
- 'VN': Vietnam
- 'ZA': South Africa
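For example, to model United States holiday effects in a daily series, you might specify the following; this is a sketch with placeholder table and column names:

```sql
CREATE MODEL `mydataset.mymodel`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
        time_series_timestamp_col='date',
        time_series_data_col='transaction',
        DATA_FREQUENCY='DAILY',
        HOLIDAY_REGION='US') AS
SELECT date, transaction
FROM `mydataset.mytable`
```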
CLEAN_SPIKES_AND_DIPS
Syntax
CLEAN_SPIKES_AND_DIPS = { TRUE | FALSE }
Description
Whether or not to perform automatic spikes and dips detection and cleanup in the
ARIMA_PLUS
model training pipeline. The spikes and dips are replaced with
local linear interpolated values when they are detected.
Arguments
The value is a BOOL
. The default value is TRUE
.
ADJUST_STEP_CHANGES
Syntax
ADJUST_STEP_CHANGES = { TRUE | FALSE }
Description
Whether or not to perform automatic step change detection and adjustment in the
ARIMA_PLUS
model training pipeline.
Arguments
The value is a BOOL
. The default value is TRUE
.
DECOMPOSE_TIME_SERIES
Syntax
DECOMPOSE_TIME_SERIES = { TRUE | FALSE }
Description
Whether the separate components of both the history and forecast parts of the
time series (such as holiday effect and seasonal components) should be saved in
the ARIMA_PLUS
model.
Arguments
The value is a BOOL
. The default value is TRUE
.
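When DECOMPOSE_TIME_SERIES is TRUE, the saved components can later be inspected, for example with the ML.EXPLAIN_FORECAST function. A minimal sketch, assuming a trained model named mydataset.mymodel and illustrative horizon and confidence_level values:

```sql
SELECT *
FROM ML.EXPLAIN_FORECAST(MODEL `mydataset.mymodel`,
                         STRUCT(30 AS horizon, 0.9 AS confidence_level))
```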
TIME_SERIES_LENGTH_FRACTION
Syntax
TIME_SERIES_LENGTH_FRACTION = float64_value
Description
The fraction of the interpolated length of the time series that is used to
model the time series trend component. All of the time points of the time series
are used to model the non-trend component. For example, if the time series has
100 time points, then specifying a TIME_SERIES_LENGTH_FRACTION
of 0.5 uses the
most recent 50 time points for trend modeling. This training option accelerates
modeling training without sacrificing much forecasting accuracy.
Arguments
float64_value is a FLOAT64. The value must be within (0, 1). The default behavior is to use all the points in the time series.
MIN_TIME_SERIES_LENGTH
Syntax
MIN_TIME_SERIES_LENGTH = int64_value
Description
The minimum number of time points in a time series that are used in modeling the trend component of the time series. MIN_TIME_SERIES_LENGTH requires that TIME_SERIES_LENGTH_FRACTION also be specified. For example, suppose the query forecasts two time series simultaneously by using time_series_id_col, and one time series has 100 time points while the other has 30. With TIME_SERIES_LENGTH_FRACTION set to 0.5 and MIN_TIME_SERIES_LENGTH set to 20, the first time series's last 50 points are used for trend modeling. For the second time series, because MIN_TIME_SERIES_LENGTH is 20, the last 20 points rather than 15 (30 * 0.5) points are used in trend modeling. This training option prevents too few time points from being used in trend modeling when TIME_SERIES_LENGTH_FRACTION is used, particularly when forecasting multiple time series in a single query using time_series_id_col.
Arguments
int64_value
is an INT64
. The default value is 20. The
minimum value is 4.
MAX_TIME_SERIES_LENGTH
Syntax
MAX_TIME_SERIES_LENGTH = int64_value
Description
The maximum number of time points in a time series that can be used in modeling the trend component of the time series. For example, if the query forecasts two time series simultaneously by using TIME_SERIES_ID_COL, where one time series has 100 time points while the other has 50, then setting MAX_TIME_SERIES_LENGTH to 30 makes both time series use the last 30 points for trend modeling.
Arguments
int64_value
is an INT64
. It doesn't have a default
value and the minimum value is 4. It's recommended to try 30
as a starting
value.
query_statement
The AS query_statement
clause specifies the standard SQL query that is used to
generate the training data. For information about the supported SQL syntax of
the query_statement
clause, see
Standard SQL query syntax.
For time series models, the query_statement is expected to contain either 2 or 3
columns, depending on whether the user wants to forecast a single time-series or
multiple time-series. In both cases, time_series_timestamp_col
and
time_series_data_col
are required. Additional time_series_id_col
columns are
required for forecasting multiple time series.
Supported inputs
The CREATE MODEL
statement supports the following data types for the
time series input columns.
Supported data types for time series model inputs
BigQuery ML supports different standard SQL data types for the input columns for time series models. Supported data types for each respective column include:
Time series input column | Supported types
---|---
time_series_timestamp_col | TIMESTAMP, DATE, DATETIME
time_series_data_col | INT64, NUMERIC, BIGNUMERIC, FLOAT64
time_series_id_col | STRING, INT64, ARRAY of STRING or INT64
Known limitations
CREATE MODEL
statements for time series models must comply with the following
rules:
- For the input time series, the maximum length is 1,000,000 time points and the minimum length is 3 time points. When forecasting multiple time series at the same time, the limit applies to each time series.
- The maximum number of time series to forecast simultaneously using the ID columns is 100,000,000.
- When forecasting multiple time series simultaneously using the ID column, invalid time series that fail the model fitting, such as single-point time series, are ignored and don't appear in the forecast results. A warning message is shown in this case, and you can use the ML.ARIMA_EVALUATE function to retrieve the possible error message.
- The maximum number of time points to forecast, which is specified using horizon, is 10,000.
- Holiday effect modeling is effective only for approximately 5 years.
- After a multiple time-series model is trained, the evaluation tab on the BigQuery page in the Google Cloud console only shows the evaluation metrics for the first 100 time series. To see the evaluation metrics for all of the time series, use ML.ARIMA_EVALUATE.
- The BigQuery ML training option warm_start is not supported by ARIMA_PLUS models.
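Evaluation metrics and per-series error information can be retrieved with the ML.ARIMA_EVALUATE function. A minimal sketch, assuming a trained model named mydataset.mymodel:

```sql
SELECT *
FROM ML.ARIMA_EVALUATE(MODEL `mydataset.mymodel`,
                       STRUCT(FALSE AS show_all_candidate_models))
```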
CREATE MODEL
examples
The following examples create models named mymodel in mydataset in your default project.
Training a time series model to forecast a single time series
This example creates a time series model.
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
time_series_timestamp_col='date',
time_series_data_col='transaction') AS
SELECT
date,
transaction
FROM
`mydataset.mytable`
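After training, forecasts can be retrieved with the ML.FORECAST function; for example (the horizon and confidence_level values here are illustrative):

```sql
SELECT *
FROM ML.FORECAST(MODEL `project_id.mydataset.mymodel`,
                 STRUCT(30 AS horizon, 0.8 AS confidence_level))
```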
Training multiple time-series models for multiple time-series at the same time
This example creates multiple time-series models, one for each input time series.
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
time_series_timestamp_col='date',
time_series_data_col='transaction',
time_series_id_col='company_name') AS
SELECT
date,
transaction,
company_name
FROM
`mydataset.mytable`
Training multiple time-series models for multiple time series at the same time using multiple time-series ID columns
This example creates multiple time-series models, one for each input time series.
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
time_series_timestamp_col='date',
time_series_data_col='transaction',
time_series_id_col=['company_name', 'department_name']) AS
SELECT
date,
transaction,
company_name,
department_name
FROM
`mydataset.mytable`
Train multiple time-series models for multiple time series using a subset defined as a fraction of the time points for speed-up
This example creates multiple time-series models, one for each input time series.
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
time_series_timestamp_col='date',
time_series_data_col='transaction',
time_series_id_col=['company_name', 'department_name'],
time_series_length_fraction=0.5,
min_time_series_length=30) AS
SELECT
date,
transaction,
company_name,
department_name
FROM
`mydataset.mytable`
Train multiple time-series models for multiple time series using a subset defined as a maximum number of time points
This example creates multiple time-series models, one for each input time series.
CREATE MODEL `project_id.mydataset.mymodel`
OPTIONS(MODEL_TYPE='ARIMA_PLUS',
time_series_timestamp_col='date',
time_series_data_col='transaction',
time_series_id_col=['company_name', 'department_name'],
max_time_series_length=50) AS
SELECT
date,
transaction,
company_name,
department_name
FROM
`mydataset.mytable`
What's next
- Walk through our tutorials that use the time series model in BigQuery ML: