The CREATE MODEL statement for ARIMA_PLUS models

This document describes the CREATE MODEL statement for creating univariate time series models in BigQuery.

Forecasting takes place when you create the model. You can then use the ML.FORECAST and ML.EXPLAIN_FORECAST functions to retrieve the forecast values and compute the prediction intervals.
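
As an illustrative sketch of this workflow (the dataset, table, and column names `mydataset.sales_history`, `my_arima_model`, `sale_date`, and `num_sold` are placeholders, not part of this document), you might train a model and then query its forecast:

```sql
-- Train a univariate ARIMA_PLUS model; forecasting happens at CREATE MODEL time.
-- Table and column names below are hypothetical.
CREATE OR REPLACE MODEL `mydataset.my_arima_model`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'sale_date',
    TIME_SERIES_DATA_COL = 'num_sold'
  ) AS
SELECT sale_date, num_sold
FROM `mydataset.sales_history`;

-- Retrieve forecast values and prediction intervals for the next 30 points.
SELECT *
FROM ML.FORECAST(MODEL `mydataset.my_arima_model`,
                 STRUCT(30 AS horizon, 0.9 AS confidence_level));
```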

For information about the supported SQL statements and functions for each model type, see End-to-end user journey for each model.

Time series modeling pipeline

The BigQuery ML time series modeling pipeline includes multiple modules. The ARIMA model is the most computationally expensive module, which is why the model is named ARIMA_PLUS.

(Diagram: the modeling pipeline for a single time series.)

The modeling pipeline for the ARIMA_PLUS time series models performs the following functions:

  • Infers the data frequency of the time series.
  • Handles irregular time intervals.
  • Handles duplicated timestamps by taking their mean value.
  • Interpolates missing data using local linear interpolation.
  • Detects and cleans spike and dip outliers.
  • Detects and adjusts abrupt step (level) changes.
  • Detects and adjusts holiday effects.
  • Detects multiple seasonal patterns within a single time series by using Seasonal and Trend decomposition using Loess (STL), and extrapolates seasonality by using double exponential smoothing (ETS).
  • Detects and models the trend using the ARIMA model and the auto.ARIMA algorithm for automatic hyperparameter tuning. In auto.ARIMA, dozens of candidate models are trained and evaluated in parallel, and the model with the lowest Akaike information criterion (AIC) is selected as the best model.
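
Several of the pipeline stages above correspond to training options that you can toggle. As a sketch (table and column names are placeholders; the options shown — CLEAN_SPIKES_AND_DIPS, ADJUST_STEP_CHANGES, and HOLIDAY_REGION — are BigQuery ML training options):

```sql
-- Toggle individual pipeline stages through training options
-- (hypothetical table and column names).
CREATE OR REPLACE MODEL `mydataset.my_arima_model`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'sale_date',
    TIME_SERIES_DATA_COL = 'num_sold',
    CLEAN_SPIKES_AND_DIPS = TRUE,  -- spike and dip outlier cleaning
    ADJUST_STEP_CHANGES = TRUE,    -- abrupt step (level) change adjustment
    HOLIDAY_REGION = 'US'          -- holiday-effect modeling for a region
  ) AS
SELECT sale_date, num_sold
FROM `mydataset.sales_history`;
```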

Large-scale time series

You can forecast up to 100,000,000 time series simultaneously with a single query by using the TIME_SERIES_ID_COL option. With this option, different modeling pipelines run in parallel, as long as enough slots are available. The following diagram shows this process:

(Diagram: parallel modeling pipelines for multiple time series.)
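
As a sketch of this option (placeholder names again), adding TIME_SERIES_ID_COL trains one model object that covers every distinct value of the ID column, each with its own pipeline:

```sql
-- Forecast one time series per distinct store_id in a single query
-- (hypothetical table and column names).
CREATE OR REPLACE MODEL `mydataset.my_arima_models`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'sale_date',
    TIME_SERIES_DATA_COL = 'num_sold',
    TIME_SERIES_ID_COL = 'store_id'
  ) AS
SELECT sale_date, num_sold, store_id
FROM `mydataset.sales_history`;
```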

Large-scale time series forecasting best practices

Forecasting many time series simultaneously can lead to long-running queries, because query processing isn't completely parallel due to limited slot capacity. The following best practices can help you avoid long-running queries when forecasting many time series simultaneously:

  • When you have a large number (for example, 100,000) of time series to forecast, first forecast a small number of time series (for example, 1,000) to see how long the query takes. You can then estimate how long your entire time series forecast will take.
  • You can use the AUTO_ARIMA_MAX_ORDER option to balance query run time against forecast accuracy. Increasing AUTO_ARIMA_MAX_ORDER expands the hyperparameter search space to try more complex ARIMA models, that is, models with higher non-seasonal p and q values. This increases forecast accuracy but also increases query run time; decreasing the value has the opposite effect. For example, if you specify a value of 3 instead of the default value of 5, the query run time is reduced by at least 50%, while forecast accuracy might drop slightly for some of the time series. If a shorter training time is important for your use case, use a smaller value for AUTO_ARIMA_MAX_ORDER.
  • The model training time for each time series is roughly linear in its length, that is, in the number of data points: the longer the time series, the longer the training takes. However, not all data points contribute equally to the model fitting process; the more recent a data point is, the more it contributes. Therefore, if you have a long time series, for example, ten years of daily data, you don't need to train a time series model using all of the data points. The most recent two or three years of data points are enough.
  • You can use the TIME_SERIES_LENGTH_FRACTION, MIN_TIME_SERIES_LENGTH, and MAX_TIME_SERIES_LENGTH training options to enable fast model training with little to no loss of forecasting accuracy. The idea behind these options is that while periodic modeling, such as seasonality, requires a certain number of time points, trend modeling doesn't need many time points, yet trend modeling is much more computationally expensive than the other time series components. With these options, the trend is modeled using only the most recent points of the time series, while the other components, such as seasonality, use the entire time series. This shortens training time without sacrificing much forecasting accuracy.
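
The best practices above can be combined in a single statement. As an illustrative sketch (placeholder names; the option values are examples, not recommendations):

```sql
-- Trade a little accuracy for faster training: a smaller auto.ARIMA search
-- space plus length-capping options for trend modeling
-- (hypothetical table and column names; values are illustrative).
CREATE OR REPLACE MODEL `mydataset.my_fast_arima_model`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS',
    TIME_SERIES_TIMESTAMP_COL = 'sale_date',
    TIME_SERIES_DATA_COL = 'num_sold',
    TIME_SERIES_ID_COL = 'store_id',
    AUTO_ARIMA_MAX_ORDER = 3,           -- smaller hyperparameter search space
    TIME_SERIES_LENGTH_FRACTION = 0.5,  -- fit the trend on the most recent half
    MIN_TIME_SERIES_LENGTH = 100        -- but on no fewer than 100 points
  ) AS
SELECT sale_date, num_sold, store_id
FROM `mydataset.sales_history`;
```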