In this tutorial, you will learn how to create a set of time series models to
perform multiple time-series forecasts with a single query. You will also learn
how to apply different fast training strategies to significantly speed up the
query and how to evaluate forecasting accuracy. You will use the
new_york.citibike_trips
data.
This data contains information about Citi Bike trips in New York City.
Before reading this tutorial, we highly recommend that you read Single time-series forecasting from Google Analytics data.
Objectives
In this tutorial, you use the following:
- The
CREATE MODEL
statement: to create a time-series model or a set of time-series models. - The
ML.ARIMA_EVALUATE
function: to evaluate the model. - The
ML.ARIMA_COEFFICIENTS
function: to inspect the model coefficients. - The
ML.EXPLAIN_FORECAST
function: to retrieve various components of the time series (such as seasonality and trend) that can be used to explain the forecast results. - Looker Studio: to visualize the forecasting results.
- Optional: The
ML.FORECAST
function: to forecast daily total visits.
Costs
This tutorial uses billable components of Google Cloud, including:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
For more information about BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create dataset.On the Create dataset page, do the following:
For Dataset ID, enter
bqml_tutorial
.For Location type, select Multi-region, and then select US (multiple regions in United States).
The public datasets are stored in the
US
multi-region. For simplicity, store your dataset in the same location.Leave the remaining default settings as they are, and click Create dataset.
Step two (optional): Visualize the time series you want to forecast
Before creating the model, it is beneficial to see how your input time series looks. You can achieve this by using Looker Studio.
In the following query, the FROM bigquery-public-data.new_york.citibike_trips
clause indicates that you are querying the citibike_trips
table in the
new_york
dataset.
In the SELECT
statement, the query uses the
EXTRACT function
to extract the date information from the starttime
column. The query uses
the COUNT(*)
clause to get the daily total number of Citi Bike trips.
#standardSQL SELECT EXTRACT(DATE from starttime) AS date, COUNT(*) AS num_trips FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY date
To run the query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT EXTRACT(DATE from starttime) AS date, COUNT(*) AS num_trips FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY date
Click Run.
After the query runs, the output is similar to the following screenshot. In the screenshot, you can see that this time series has 1184 daily data points, which span more than 4 years. Click the Explore data button and then Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.
In the Chart panel, choose Time series chart:
In the Data panel, below the Chart panel, go to the Metric section. Add the num_trips field and remove the default metric Record Count:
After you complete the above step, the following plot appears. The plot shows that the input time series has both weekly and yearly patterns. The time series is trending up as well.
Step three: Create your time series model to perform single time-series forecasting
Next, create a time series model using the NYC Citi Bike trips data.
The following GoogleSQL query creates a model used to forecast daily total
bike trips. The CREATE MODEL
clause creates and trains a model named bqml_tutorial.nyc_citibike_arima_model
.
#standardSQL CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model` OPTIONS (model_type = 'ARIMA_PLUS', time_series_timestamp_col = 'date', time_series_data_col = 'num_trips' ) AS SELECT EXTRACT(DATE from starttime) AS date, COUNT(*) AS num_trips FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY date
The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...)
clause indicates that you are creating an
ARIMA-based
time-series model. By default,
auto_arima=TRUE
,
so the auto.ARIMA
algorithm automatically tunes the hyperparameters in
ARIMA_PLUS
models. The algorithm fits dozens of candidate models and chooses
the best one with the lowest
Akaike information criterion (AIC).
Additionally, because the default is
data_frequency='AUTO_FREQUENCY'
, the training process automatically infers
the data frequency of the input time series. Lastly, the CREATE MODEL
statement uses
decompose_time_series=TRUE
by default, and users can further understand how the time series is forecasted
by fetching the separate time-series components such as seasonal periods and
holiday effect.
Run the CREATE MODEL
query to create and train your model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model` OPTIONS (model_type = 'ARIMA_PLUS', time_series_timestamp_col = 'date', time_series_data_col = 'num_trips' ) AS SELECT EXTRACT(DATE from starttime) AS date, COUNT(*) AS num_trips FROM `bigquery-public-data.new_york.citibike_trips` GROUP BY date
Click Run.
The query takes approximately 17 seconds to complete, after which your model (
nyc_citibike_arima_model
) appears in the navigation panel. Because the query uses aCREATE MODEL
statement to create a model, you do not see query results.
Step four: Forecast the time series and visualize the results
To explain how the time series is forecasted, visualize all the sub-time series
components, such as seasonality and trend, using the
ML.EXPLAIN_FORECAST
function.
To achieve this, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT * FROM ML.EXPLAIN_FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model`, STRUCT(365 AS horizon, 0.9 AS confidence_level))
Click Run.
After the query completes, click the Explore data button, and then click Explore with Looker Studio. A new tab opens in the browser. Then, in the Chart panel, find the Time series chart icon and click it, as shown in the following screenshot.
In the Data panel, do the following:
- In the Date Range Dimension section, select
time_series_timestamp (Date)
. - In the Dimension section, select
time_series_timestamp (Date)
. - In the Metric section, remove the default metric
Record Count
, and add the following:time_series_data
prediction_interval_lower_bound
prediction_interval_upper_bound
trend
seasonal_period_yearly
seasonal_period_weekly
spikes_and_dips
step_changes
- In the Date Range Dimension section, select
In the Style panel, scroll down to the Missing Data option and use Line Breaks instead of Line to Zero.
After you complete these steps, the following plot appears in the left panel.
Step five: Forecast multiple time-series simultaneously
Next, you might want to forecast the daily total number of trips starting from
different Citi Bike stations. To do this, you must forecast many time series.
You can write multiple
CREATE MODEL
queries but that can be a tedious and time consuming process, especially when
you have a large number of time series.
To improve this process, BigQuery ML lets you create a set of time series models to forecast multiple time-series using a single query. Additionally, all time series models are fitted simultaneously.
In the following GoogleSQL query, the
CREATE MODEL
clause creates and trains a set of models named
bqml_tutorial.nyc_citibike_arima_model_group
.
#standardSQL CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_group` OPTIONS (model_type = 'ARIMA_PLUS', time_series_timestamp_col = 'date', time_series_data_col = 'num_trips', time_series_id_col = 'start_station_name', auto_arima_max_order = 5 ) AS SELECT start_station_name, EXTRACT(DATE from starttime) AS date, COUNT(*) AS num_trips FROM `bigquery-public-data.new_york.citibike_trips` WHERE start_station_name LIKE '%Central Park%' GROUP BY start_station_name, date
The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...)
clause indicates that you are creating a set of
ARIMA-based
time-series ARIMA_PLUS
models. In addition to time_series_timestamp_col
and
time_series_data_col
, you must specify time_series_id_col
, which is used to
annotate different input time series. The auto_arima_max_order
option controls the
search space for hyperparameter tuning in the auto.ARIMA algorithm.
Lastly, the CREATE MODEL
statement uses
decompose_time_series=TRUE
by default, and users can further understand how the time series is analyzed in
the training pipeline by fetching the decomposition results.
The SELECT ... FROM ... GROUP BY ...
clause indicates that you form
multiple time series; each one is associated with a different
start_station_name
. For simplicity, use the WHERE ... LIKE ...
clause to
limit start stations to those with Central Park
in their names.
To run the CREATE MODEL
query to create and train your model, use the
following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_group` OPTIONS (model_type = 'ARIMA_PLUS', time_series_timestamp_col = 'date', time_series_data_col = 'num_trips', time_series_id_col = 'start_station_name', auto_arima_max_order = 5 ) AS SELECT start_station_name, EXTRACT(DATE from starttime) AS date, COUNT(*) AS num_trips FROM `bigquery-public-data.new_york.citibike_trips` WHERE start_station_name LIKE '%Central Park%' GROUP BY start_station_name, date
Click Run.
The query takes approximately 24 seconds to complete, after which your model (
nyc_citibike_arima_model_group
) appears in the navigation panel. Because the query uses aCREATE MODEL
statement, you do not see query results.
Step six: Inspect the evaluation metrics of the set of time series models
After creating your models, you can use the
ML.ARIMA_EVALUATE
function to see the evaluation metrics of all the created models.
In the following GoogleSQL query, the FROM
clause uses the
ML.ARIMA_EVALUATE
function against your model, bqml_tutorial.nyc_citibike_arima_model_group
. The
evaluation metrics only depend on the training input, so your model is the
only input.
To run the ML.ARIMA_EVALUATE
query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT * FROM ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`)
Click Run.
The query takes less than a second to complete. When the query is complete, click the Results tab below the query text area. The results should look like the following:
The results include the following columns:
start_station_name
non_seasonal_p
non_seasonal_d
non_seasonal_q
has_drift
log_likelihood
AIC
variance
seasonal_periods
has_holiday_effect
has_spikes_and_dips
has_step_changes
error_message
start_station_name
, the first column, annotates the time series that each time-series model is fitted against. It is the same as that specified bytime_series_id_col
.The following four columns (
non_seasonal_p
,non_seasonal_d
,non_seasonal_q
, andhas_drift
) define an ARIMA model in the training pipeline. The three metrics after that (log_likelihood
,AIC
, andvariance
) are relevant to the ARIMA model fitting process. The fitting process determines the best ARIMA model by using theauto.ARIMA
algorithm, one for each time series.The
has_holiday_effect
,has_spikes_and_dips
, andhas_step_changes
columns are only populated when thedecompose_time_series=TRUE
.The
seasonal_periods
column is the seasonal pattern inside the input time series. Each time series can have different seasonal patterns. For example, from the figure, you can see that one time series has a yearly pattern, while others don't.
Step seven: Inspect the coefficients of your models
The ML.ARIMA_COEFFICIENTS
function is used to retrieve the model coefficients of your ARIMA_PLUS
model,
bqml_tutorial.nyc_citibike_arima_model_group
. ML.ARIMA_COEFFICIENTS
takes
the model as the only input.
To run the ML.ARIMA_COEFFICIENTS
query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT * FROM ML.ARIMA_COEFFICIENTS(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`)
Click Run.
The query takes less than a second to complete. The results should look similar to the following screenshot:
The results include the following columns:
start_station_name
ar_coefficients
ma_coefficients
intercept_or_drift
start_station_name
, the first column, annotates the time series that each time series model is fitted against.ar_coefficients
shows the model coefficients of the autoregressive (AR) part of the ARIMA model. Similarly,ma_coefficients
shows the model coefficients of moving-average (MA) part. They are both arrays, whose lengths are equal tonon_seasonal_p
andnon_seasonal_q
, respectively. Theintercept_or_drift
is the constant term in the ARIMA model.
Step eight: Use your model to forecast multiple time series simultaneously with explanations
The ML.EXPLAIN_FORECAST
function forecasts future time series values with a prediction interval using
your model, bqml_tutorial.nyc_citibike_arima_model_group
, and at the same time
returns all the separate components of the time series.
The STRUCT(3 AS horizon, 0.9 AS confidence_level)
clause indicates that the
query forecasts three future time points and generates a prediction interval
with 90% confidence. The ML.EXPLAIN_FORECAST
function takes the model, as well
as a couple of optional arguments.
To run the ML.EXPLAIN_FORECAST
query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT * FROM ML.EXPLAIN_FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`, STRUCT(3 AS horizon, 0.9 AS confidence_level))
Click Run.
The query takes less than a second to complete. The results should look like the following:
The results include the following columns:
start_station_name
time_series_timestamp
time_series_type
time_series_data
time_series_adjusted_data
standard_error
confidence_level
prediction_interval_lower_bound
prediction_interval_lower_bound
trend
seasonal_period_yearly
seasonal_period_quarterly
seasonal_period_monthly
seasonal_period_weekly
seasonal_period_daily
holiday_effect
spikes_and_dips
step_changes
The output rows are ordered by
start_station_name
, and for eachstart_station_name
, the output rows are in the chronological order oftime_series_timestamp
. Different components are listed as columns of the output. For more information, see the definition ofML.EXPLAIN_FORECAST
.
(Optional) Step nine: Use your model to forecast multiple time series simultaneously
The ML.FORECAST
function can also be used to forecast future time series values with a
prediction interval using your model,
bqml_tutorial.nyc_citibike_arima_model_group
.
Like ML.EXPLAIN_FORECAST
, the STRUCT(3 AS horizon, 0.9 AS confidence_level)
clause indicates that for each time series, the query forecasts three future
time points, and generates a prediction interval with 90% confidence.
The ML.FORECAST
function takes the model, as well as a couple of optional
arguments.
To run the ML.FORECAST
query, use the following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT * FROM ML.FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`, STRUCT(3 AS horizon, 0.9 AS confidence_level))
Click Run.
The query takes less than a second to complete. The results should look like the following:
The results include the following columns:
start_station_name
forecast_timestamp
forecast_value
standard_error
confidence_level
prediction_interval_lower_bound
prediction_interval_upper_bound
confidence_interval_lower_bound
(soon to be deprecated)confidence_interval_upper_bound
(soon to be deprecated)
The first column,
start_station_name
, annotates the time series that each time series model is fitted against. Eachstart_station_name
has a horizon number of rows for its forecasting results.For each
start_station_name
, output rows are ordered in the chronological order offorecast_timestamp
. In time series forecasting, the prediction interval, which is captured by the lower and upper bounds, is as important as theforecast_value
. Theforecast_value
is the middle point of the prediction interval. The prediction interval depends on thestandard_error
andconfidence_level
.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset to delete the dataset, the table, and all of the data.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (
bqml_tutorial
) and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- Learn how to accelerate ARIMA_PLUS to enable forecast 1 million time series within hours
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.