Scalable forecasting with millions of time-series in BigQuery

In this tutorial, you will learn how to significantly accelerate training a set of time series models to perform multiple time-series forecasts with a single query. You will also learn how to evaluate forecasting accuracy.

For all steps but the last one, you will use the new_york.citibike_trips data. This data contains information about Citi Bike trips in New York City. This dataset only contains a few hundred time series. It is used to illustrate various strategies to accelerate model training. For the last step, you will use iowa_liquor_sales.sales data to forecast more than 1 million time series.

Before reading this tutorial, you should read Perform multiple time-series forecasting with a single query from NYC Citi Bike trips data. You should also read Large-scale time series forecasting best practices.

Objectives

In this tutorial, you use the following:

The CREATE MODEL statement creates a time-series model or a set of time-series models.
The ML.EVALUATE function evaluats the forecasting accuracy.
The AUTO_ARIMA_MAX_ORDER, TIME_SERIES_LENGTH_FRACTION, MIN_TIME_SERIES_LENGTH, and MAX_TIME_SERIES_LENGTH training options: to significantly reduce the model training time.

For simplicity, this tutorial doesn't cover how to use ML.FORECAST or ML.EXPLAIN_FORECAST to generate (explainable) forecasts. To learn how to use those functions, see Performing multiple time-series forecasting with a single query from NYC Citi Bike trips data.

Costs

This tutorial uses billable components of Google Cloud, including:

BigQuery
BigQuery ML

For more information about costs, see the BigQuery pricing page and the BigQuery ML pricing page.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Make sure that billing is enabled for your Google Cloud project.

BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Enable the API

Step one: Create dataset

Create a BigQuery dataset to store your ML model:

In the Google Cloud console, go to the BigQuery page.

Go to the BigQuery page
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter bqml_tutorial.
- For Location type, select Multi-region, and then select US (multiple regions in United States).
  
  The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.
- Leave the remaining default settings as they are, and click Create dataset.

Step two: Create the time series to forecast

In the following query, the FROM bigquery-public-data.new_york.citibike_trips clause indicates that you are queries the citibike_trips table in the new_yorkdataset.

CREATE OR REPLACE TABLE
  `bqml_tutorial.nyc_citibike_time_series` AS
WITH input_time_series AS
(
  SELECT
    start_station_name,
    EXTRACT(DATE FROM starttime) AS date,
    COUNT(*) AS num_trips
  FROM
    `bigquery-public-data.new_york.citibike_trips`
  GROUP BY
    start_station_name, date
)
SELECT table_1.*
FROM input_time_series AS table_1
INNER JOIN (
  SELECT start_station_name,  COUNT(*) AS num_points
  FROM input_time_series
  GROUP BY start_station_name) table_2
ON
  table_1.start_station_name = table_2.start_station_name
WHERE
  num_points > 400

To run the query, use the following steps:

In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.

The SELECT statement in the query uses EXTRACT function to extract the date information from the starttime column. The query uses the COUNT(*) clause to get the daily total number of Citi Bike trips.

table_1 has 679 time series. The query uses additional INNER JOIN logic to select all those time series that have more than 400 time points, resulting in a total of 383 times series.

Step three: Simultaneously forecast multiple time-series with default parameters

In this step you forecast the daily total number of trips starting from different Citi Bike stations. To do this, you must forecast many time series. You could write multiple CREATE MODEL queries but that can be a tedious and time consuming process, especially when you have a large number of time series.

To improve this process, BigQuery ML lets you create a set of time series models to forecast multiple time series using a single query. Additionally, all time series models are fit simultaneously.

In the following GoogleSQL query, the CREATE MODEL clause creates and trains a set of models named bqml_tutorial.nyc_citibike_arima_model_default.

CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_default`
OPTIONS
  (model_type = 'ARIMA_PLUS',
   time_series_timestamp_col = 'date',
   time_series_data_col = 'num_trips',
   time_series_id_col = 'start_station_name'
  ) AS
SELECT *
FROM bqml_tutorial.nyc_citibike_time_series
WHERE date < '2016-06-01'

To run the CREATE MODEL query to create and train your model, use the following steps:

In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.

The query takes about 14 minutes 25 seconds to complete.

The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating a set of ARIMA-based time-series ARIMA_PLUS models. In addition to time_series_timestamp_col and time_series_data_col, you must specify time_series_id_col, which is used to annotate different input time series.

This example leaves out the time points in the time series after 2016-06-01 so that those time points can be used to evaluate the forecasting accuracy later by using the ML.EVALUATE function.

Step four: Evaluate forecasting accuracy for each time series

In this step, you evaluate the forecasting accuracy for each time series by using the following ML.EVALUATE query.

SELECT *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_default`,
              TABLE `bqml_tutorial.nyc_citibike_time_series`,
              STRUCT(7 AS horizon, TRUE AS perform_aggregation))

To run the above query, use the following steps:

In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run. This query reports several forecasting metrics, including:
The results should look like the following:

ML.EVALUATE takes the ARIMA_PLUS model that was trained in the previous step as its first argument.

The second argument is a data table containing the ground truth data. These forecasting results are compared to the ground truth data to compute accuracy metrics. In this case, the nyc_citibike_time_series contains both the time series points that are before 2016-06-01 and after 2016-06-01. The points after 2016-06-01 are the ground truth data. The points before 2016-06-01 are used to train the model to generate forecasts after that date. Only the points after 2016-06-01 are necessary to compute the metrics. The points before 2016-06-01 are ignored in metrics calculation.

The third argument is a STRUCT which contains two parameters. The horizon is 7, which means the query is calculating the forecasting accuracy based on the 7 point forecast. Note that if the ground truth data has less than 7 points for the comparison, then accuracy metrics is computed based on the available points only. perform_aggregation has a value of TRUE, which means that the forecasting accuracy metrics are aggregated over the metrics on the time point basis. If you specify perform_aggregation as FALSE, forecasting accuracy is returned for each forecasted time point.

Step five: Evaluate the overall forecasting accuracy for all the time series

In this step, you evaluate the forecasting accuracy for the entire 383 time series using the following query:

SELECT
  AVG(mean_absolute_percentage_error) AS MAPE,
  AVG(symmetric_mean_absolute_percentage_error) AS sMAPE
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_default`,
              TABLE `bqml_tutorial.nyc_citibike_time_series`,
              STRUCT(7 AS horizon, TRUE AS perform_aggregation))

Of the forecasting metrics returned by ML.EVALUATE, only mean absolute percentage error and symmetric mean absolute percentage error are time series value independent. Therefore, to evaluate the entire forecasting accuracy of the set of time series, only the aggregate of these two metrics is meaningful.

This query returns the following results: MAPE is 0.3471, sMAPE is 0.2563.

Step six: Forecast many time-series simultaneously using a smaller hyperparameter search space

In step three, we used the default values for all of the training options, including the auto_arima_max_order. This option controls the search space for hyperparameter tuning in the auto.ARIMA algorithm.

In this step, you use a smaller search space for the hyperparameters.

CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model_max_order_2
OPTIONS
  (model_type = 'ARIMA_PLUS',
   time_series_timestamp_col = 'date',
   time_series_data_col = 'num_trips',
   time_series_id_col = 'start_station_name',
   auto_arima_max_order = 2
  ) AS
SELECT *
FROM bqml_tutorial.nyc_citibike_time_series
WHERE date < '2016-06-01'

This query reduces auto_arima_max_order from 5 (the default value) to 2.

To run the query, use the following steps:

In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.

The query takes about 1 minutes 45 seconds to complete. Recall that it takes 14 min 25 sec for the query to complete if auto_arima_max_order is 5. So the speed gain is around 7x by setting auto_arima_max_order to 2. If you wonder why the speed gain is not 5/2=2.5x, this is because when increasing the order of auto_arima_max_order, not only the number of candidate models increase, but also the complexity hence the training time of the models increases.

Step seven: Evaluate forecast accuracy based on a smaller hyperparameter search space

SELECT
  AVG(mean_absolute_percentage_error) AS MAPE,
  AVG(symmetric_mean_absolute_percentage_error) AS sMAPE
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2`,
              TABLE `bqml_tutorial.nyc_citibike_time_series`,
              STRUCT(7 AS horizon, TRUE AS perform_aggregation))

This query returns the following results: MAPE is 0.3337 and sMAPE is 0.2337.

In step five, using a larger hyperparameter search space, auto_arima_max_order = 5, resulted in MAPE of 0.3471 and sMAPE 0.2563. Therefore, in this case a smaller hyperparameter search space actually gives higher forecasting accuracy. One reason is that the auto.ARIMA algorithm only performs hyperparameter tuning for the trend module of the entire modeling pipeline. The best ARIMA model selected by auto.ARIMA algorithm might not generate the best forecasting results for the entire pipeline.

Step eight: Forecast many time-series simultaneously with a smaller hyperparameter search space and smart fast training strategies

In this step, you use both a smaller hyperparameter search space and the smart fast training strategy using one or more of the max_time_series_length, max_time_series_length, or time_series_length_fraction training options.

While periodic modeling such as seasonality requires a certain number of time points, trend modeling requires fewer time points. Meanwhile, trend modeling is much more computationally expensive than other time series components such as seasonality. By using the fast training options above, you can efficiently model the trend component with a subset of the time series, while the other time series components use the entire time series.

This example uses max_time_series_length to achieve fast training.

CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2_fast_training`
OPTIONS
  (model_type = 'ARIMA_PLUS',
   time_series_timestamp_col = 'date',
   time_series_data_col = 'num_trips',
   time_series_id_col = 'start_station_name',
   auto_arima_max_order = 2,
   max_time_series_length = 30
  ) AS
SELECT *
FROM `bqml_tutorial.nyc_citibike_time_series`
WHERE date < '2016-06-01'

The max_time_series_length option has a value of 30, so for each of the 383 time series, only the 30 most recent time points are used to model the trend component. All time series are still used to model the non-trend components.

To run the query, use the following steps:

In the Google Cloud console, click the Compose new query button.
Enter the above GoogleSQL query in the Query editor text area.
Click Run.

The query takes about 35 seconds to complete. This is 3x faster compared to the training query which doesn't use fast training strategy (i.e., takes 1 minute 45 seconds). Note that due to the constant time overhead for the non-training part of the query, such as data preprocessing etc, the speed gain will be much higher when the number of time series is much larger than this case. For a million time series, the speed gain will approach the ratio of the time series length and the value of max_time_series_length. In this case, the speed gain will be greater than 10x.

Step nine: Evaluate forecasting accuracy for a model with smaller hyperparameter search space and smart fast training strategies

SELECT
  AVG(mean_absolute_percentage_error) AS MAPE,
  AVG(symmetric_mean_absolute_percentage_error) AS sMAPE
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_max_order_2_fast_training`,
              TABLE `bqml_tutorial.nyc_citibike_time_series`,
              STRUCT(7 AS horizon, TRUE AS perform_aggregation))

This query returns the following results: MAPE is 0.3515, and sMAPE is 0.2473.

Recall that without the use of fast training strategies, the forecasting accuracy results are MAPE is 0.3337 and sMAPE is 0.2337. The difference between the two set of metric values are within 3%, which is statistically insignificant.

In short, you have used a smaller hyperparameter search space and smart fast training strategies to make your model training more than 20x faster without sacrificing forecasting accuracy. As mentioned earlier, with more time series, the speed gain by the smart fast training strategies can be significantly higher. Additionally, the underlying ARIMA library used by ARIMA_PLUS has been optimized to run 5x faster than before. Together, these gains enable the forecasting of millions of time series within hours.

Step ten: Forecast over a million time series

In this step, you forecast liquor sales for over 1 million liquor products in different stores using the public Iowa liquor sales data.

CREATE OR REPLACE MODEL
  `bqml_tutorial.liquor_forecast_by_product`
OPTIONS(
  MODEL_TYPE = 'ARIMA_PLUS',
  TIME_SERIES_TIMESTAMP_COL = 'date',
  TIME_SERIES_DATA_COL = 'total_bottles_sold',
  TIME_SERIES_ID_COL = ['store_number', 'item_description'],
  HOLIDAY_REGION = 'US',
  AUTO_ARIMA_MAX_ORDER = 2,
  MAX_TIME_SERIES_LENGTH = 30
) AS
SELECT
  store_number,
  item_description,
  date,
  SUM(bottles_sold) as total_bottles_sold
FROM
  `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date BETWEEN DATE("2015-01-01") AND DATE("2021-12-31")
GROUP BY store_number, item_description, date

The model training still uses a small hyperparameter search space as well as the smart fast training strategy. The query takes about 1 hour 16 minutes to complete.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

You can delete the project you created.
Or you can keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

If necessary, open the BigQuery page in the Google Cloud console.

Go to the BigQuery page
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset to delete the dataset, the table, and all of the data.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

To learn more about machine learning, see the Machine learning crash course.
For an overview of BigQuery ML, see Introduction to BigQuery ML.
To learn more about the Google Cloud console, see Using the Google Cloud console.