Data Analytics

How to do multivariate time series forecasting in BigQuery ML

January 19, 2023

Haoming Chen

Software Engineer, BigQuery ML

Yan Sun

Generative AI Solution Architect, Google Could

Try Google Cloud

Start building on Google Cloud with $300 in free credits and 20+ always free products.

Free trial

Companies across industries rely heavily on time series forecasting to project product demand, forecast sales, project online subscription/cancellation, and for many other use cases. This makes time series forecasting one of the most popular models in BigQuery ML.

What is multivariate time series forecasting? For example, if you want to forecast ice cream sales, it is helpful to forecast using the external covariant “weather” along with the target metric “past sales.” Multivariate time series forecasting in BigQuery lets you create more accurate forecasting models without having to move data out of BigQuery.

When it comes to time series forecasting, covariates or features besides the target time series are often used to provide better forecasting. Up until now, BigQuery ML has only supported univariate time series modeling using the ARIMA_PLUS model (documentation). It is one of the most popular BigQuery ML models.

While ARIMA_PLUS is widely used, forecasting using only the target variable is sometimes not sufficient. Some patterns inside the time series strongly depend on other features. We see strong customer demand for multivariate time series forecasting support that allows you to forecast using covariate and features.

We recently announced the public preview of multivariate time series forecasting with external regressors. We are introducing a new model type ARIMA_PLUS_XREG, where the XREG refers to external regressors or side features. You can use the SELECT statement to choose side features with the target time series. This new model leverages the BigQuery ML linear regression model to include the side features and the BigQuery ML ARIMA_PLUS model to model the linear regression residuals.

The ARIMA_PLUS_XREG model supports the following capabilities:

Automatic feature engineering for numerical, categorical, and array features.
All the model capabilities of the ARIMA_PLUS model, such as detecting seasonal trends, holidays, etc.

Headlight, an AI-powered ad agency, is using a multivariate forecasting model to determine conversion volumes for down-funnel metrics like subscriptions, cancellations, etc. based on cohort age. You can check out the customer video and demo here.

The following sections show some examples of the new ARIMA_PLUS_XREG model in BigQuery ML. In this example, we explore the bigquery-public-data.epa_historical_air_quality dataset, which has daily air quality and weather information. We use the model to forecast the PM_2.5¹ , based on its historical data and some covariates, such as temperature and wind speed.

An example: forecast Seattle’s air quality with weather information

Step 1. Create the dataset

The PM_2.5, temperature, and wind speed data are in separate tables. To simplify the queries, create a new table by joining those tables into a new table “bqml_test.seattle_air_quality_daily,” with the following columns:

date: the date of the observation
PM_2.5: the average PM_2.5 value for each day
wind_speed: the average wind speed for each day
temperature: the highest temperature for each day

The new table has daily data from 2009-08-11 to 2022-01-31.

CREATE TABLE `bqml_test.seattle_air_quality_daily`
AS
WITH
  pm25_daily AS (
    SELECT
      avg(arithmetic_mean) AS pm25, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary`
    WHERE
      city_name = 'Seattle'
      AND parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass'
    GROUP BY date_local
  ),
  wind_speed_daily AS (
    SELECT
      avg(arithmetic_mean) AS wind_speed, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.wind_daily_summary`
    WHERE
      city_name = 'Seattle' AND parameter_name = 'Wind Speed - Resultant'
    GROUP BY date_local
  ),
  temperature_daily AS (
    SELECT
      avg(first_max_value) AS temperature, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary`
    WHERE
      city_name = 'Seattle' AND parameter_name = 'Outdoor Temperature'
    GROUP BY date_local
  )
SELECT
  pm25_daily.date AS date, pm25, wind_speed, temperature
FROM pm25_daily
JOIN wind_speed_daily USING (date)
JOIN temperature_daily USING (date)

Here is a preview of the data:

https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Multivariate_Time_Series_Forecasting.max-1000x1000.jpg

Step 2. Create Model

The “CREATE MODEL” query of the new multivariate model, ARIMA_PLUS_XREG, is very similar to the current ARIMA_PLUS model. The major differences are the MODEL_TYPE and inclusion of feature columns in the SELECT statement.

Step 3. Forecast the future data

With the created model, you can use the ML.FORECAST function to forecast the future data. Compared to the ARIMA_PLUS model, you have to specify the future covariates as an input.

After running the above query, you can see the forecasting results:

https://storage.googleapis.com/gweb-cloudblog-publish/images/2_Multivariate_Time_Series_Forecasting.max-800x800.jpg

Step 4. Evaluate the model

You can use the ML.EVALUATE function to evaluate the forecasting errors. You can set perform_aggregation to “TRUE” to get the aggregated error metric or “FALSE” to see the per timestamp errors.

The evaluation result of ARIMA_PLUS_XREG is as follows:

https://storage.googleapis.com/gweb-cloudblog-publish/images/3_Multivariate_Time_Series_Forecasting.max-2000x2000.jpg

As a comparison, we also show the univariate forecasting ARIMA_PLUS result in the following table:

https://storage.googleapis.com/gweb-cloudblog-publish/images/4_Multivariate_Time_Series_Forecasting.max-2000x2000.jpg

Compared to ARIMA_PLUS, ARIMA_PLUS_XREG performs better on all measured metrics on this specific dataset and date range.

Conclusion

In the previous example, we demonstrated how to create a multivariate time series forecasting model, forecast future values using the model, and evaluate the forecasted results. The ML.ARIMA_EVALUATE and ML.ARIMA_COEFFICIENTS table value functions are also helpful for investigating your model. Based on the feedback from users, the model does the following to improve user productivity.

It shortens the time spent preprocessing data and lets users keep their data in BigQuery when doing machine learning.
It reduces overhead for the users who know SQL to do machine learning work in BigQuery.

For more information about the ARIMA_PLUS_XREG model, please see the documentation here.

What’s Next?

In this blogpost, we described the BigQuery ML Multivariate Time Series Forecast model, which is now available for public preview. We also showed a code demo for a data scientist, data engineer, or data analyst to enable the multivariate time series forecast model.

The following features are coming soon:

Large-scale multivariate time series, i.e., training millions of models for millions of multivariate time series in a single CREATE MODEL statement
Multivariate time series anomaly detection

^{Thanks to Xi Cheng, Honglin Zheng, Jiashang Liu, Amir Hormati, Mingge Deng and Abhinav Khushraj from the BigQuery ML team. Also thanks to Weijie Shen from the Google Resource Efficiency Data Science team.}

^{1. A measure of air pollution from fine particulate matter}

Posted in