Multiple time-series forecasting with a single query for NYC Citi Bike trips

In this tutorial, you will learn how to create a set of time series models to perform multiple time-series forecasts with a single query. You will use the new_york.citibike_trips data. This data contains information about Citi Bike trips in New York City.

Before reading this tutorial, we highly recommend that you read Single time-series forecasting from Google Analytics data.

Objectives

In this tutorial, you use the following:

  • The CREATE MODEL statement: to create a time series model or a set of time series models.
  • The ML.EVALUATE function: to evaluate the model.
  • The ML.ARIMA_COEFFICIENTS function: to inspect the model coefficients.
  • The ML.FORECAST function: to forecast the time series.
  • Google Data Studio: to visualize the forecasting results.

Costs

This tutorial uses billable components of Google Cloud, including:

  • BigQuery
  • BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see the BigQuery ML pricing page.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Cloud Console, on the project selector page, select or create a Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.

    Enable the API

Step one: Create your dataset

The first step is to create a BigQuery dataset to store your ML model. To create your dataset:

  1. In the Cloud Console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Resources section, click your project name.

  3. In the details panel, click Create dataset.

    Create dataset.

  4. On the Create dataset page:

    • For Dataset ID, enter bqml_tutorial.
    • For Data location, choose United States (US). Currently, the public datasets are stored in the US multi-region location. For simplicity, you should place your dataset in the same location.

      Create dataset page.

  5. Leave all other default settings in place and click Create dataset.

(Optional) Step two: Visualize the time series you want to forecast

Before creating the model, it is beneficial to see how your input time series looks. You can achieve this by using Google Data Studio.

In the following query, the FROM bigquery-public-data.new_york.citibike_trips clause indicates that you are querying the citibike_trips table in the new_yorkdataset.

In the SELECT statement, the query uses the EXTRACT function to extract the date information from the starttime column. The query uses the COUNT(*) clause to get the daily total number of Citi Bike trips.

#standardSQL
SELECT
   EXTRACT(DATE from starttime) AS date,
   COUNT(*) AS num_trips
FROM
  `bigquery-public-data`.new_york.citibike_trips
GROUP BY date

To run the query, use the following steps:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
       EXTRACT(DATE from starttime) AS date,
       COUNT(*) AS num_trips
    FROM
      `bigquery-public-data`.new_york.citibike_trips
    GROUP BY date
    
  3. Click Run.

    After the query runs, the output is similar to the following screenshot. In the screenshot, you can see that this time series has 1184 daily data points, which span more than 4 years. Click the Explore data button and then Explore with Data Studio. Google Data Studio opens in a new tab. Complete the following steps in the new tab.

    Query output.

    In the Chart panel, choose Time series chart:

    Time_series_chart.

    In the Data panel, below the Chart panel, go to the Metric section. Add the num_trips field and remove the default metric Record Count:

    Time_series_data_fields.

    After you complete the above step, the following plot appears. The plot shows that the input time series has both weekly and yearly patterns. The time series is trending up as well.

    Result_visualization.

Step three: Create your time series model to perform single time-series forecasting

Next, create a time series model using the NYC Citi Bike trips data.

The following standard SQL query creates a model used to forecast daily total bike trips. The CREATE MODEL clause creates and trains a model named bqml_tutorial.nyc_citibike_arima_model.

The OPTIONS(model_type='ARIMA', time_series_timestamp_col='date', ...) clause indicates that you are creating a ARIMA-based time series model. By default, auto_arima=TRUE, so the auto.ARIMA algorithm automatically tunes the hyper-parameters in ARIMA models. The algorithm fits dozens of candidate models and chooses the best one with the lowest AIC. Additionally, because the default is data_frequency='AUTO_FREQUENCY', the training process automatically infers the data frequency of the input time series.

#standardSQL
CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model
OPTIONS
  (model_type = 'ARIMA',
   time_series_timestamp_col = 'date',
   time_series_data_col = 'num_trips'
  ) AS
SELECT
   EXTRACT(DATE from starttime) AS date,
   COUNT(*) AS num_trips
FROM
  `bigquery-public-data`.new_york.citibike_trips
GROUP BY date

Run the CREATE MODEL query to create and train your model:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model
    OPTIONS
      (model_type = 'ARIMA',
       time_series_timestamp_col = 'date',
       time_series_data_col = 'num_trips'
      ) AS
    SELECT
       EXTRACT(DATE from starttime) AS date,
       COUNT(*) AS num_trips
    FROM
      `bigquery-public-data`.new_york.citibike_trips
    GROUP BY date
    
  3. Click Run.

    The query takes about 44 seconds to complete, after which your model (nyc_citibike_arima_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.

Step four: Forecast the time series and visualize the results

To intuitively evaluate the time series forecasting results, visualize the forecasted time series together with the history time series. To achieve this, concatenate the history time series and the forecasted time series as shown in the following example query.

#standardSQL
SELECT
 date,
 num_trips AS history_value,
 NULL AS forecast_value,
 NULL AS prediction_interval_lower_bound,
 NULL AS prediction_interval_upper_bound
FROM
 (
  SELECT
     EXTRACT(DATE from starttime) AS date,
     COUNT(*) AS num_trips
  FROM
    `bigquery-public-data`.new_york.citibike_trips
  GROUP BY date
 )
UNION ALL
SELECT
 EXTRACT(DATE from forecast_timestamp) AS date,
 NULL AS history_value,
 forecast_value,
 prediction_interval_lower_bound,
 prediction_interval_upper_bound
FROM
 ML.FORECAST(MODEL bqml_tutorial.nyc_citibike_arima_model,
             STRUCT(365 AS horizon, 0.9 AS confidence_level))

Query details

The SQL before the UNION ALL clause forms the history time series. The SQL after the UNION ALL clause uses ML.FORECAST to generate the forecasted time series as well as the prediction interval. This example uses different fields for history_value and forecasted_value to plot them in different colors.

Run the query

To run the above query:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
     date AS timestamp,
     num_trips AS history_value,
     NULL AS forecast_value,
     NULL AS prediction_interval_lower_bound,
     NULL AS prediction_interval_upper_bound
    FROM
     (
      SELECT
         EXTRACT(DATE from starttime) AS date,
         COUNT(*) AS num_trips
      FROM
        `bigquery-public-data`.new_york.citibike_trips
      GROUP BY date
     )
    UNION ALL
    SELECT
     EXTRACT(DATE from forecast_timestamp) AS timestamp,
     NULL AS history_value,
     forecast_value,
     prediction_interval_lower_bound,
     prediction_interval_upper_bound
    FROM
     ML.FORECAST(MODEL bqml_tutorial.nyc_citibike_arima_model,
                 STRUCT(365 AS horizon, 0.9 AS confidence_level))
    
  3. Click Run.

    After the query completes, click the Explore data button, and then click Explore with Data Studio. A new tab opens in the browser. Then, in the Chart panel, find the Time series chart icon and click it, as shown in the following screenshot.

    Time_series_chart.

    Under the Chart panel, in the Data panel, find the Metric section. Add the following metrics: history_value, forecast_value, prediction_interval_lower_bound, and prediction_interval_upper_bound. Then, remove the default metric Record Count as shown in the following screenshot.

    Time_series_chart.

    In the Style panel, scroll down to the Missing Data option and use Line Breaks instead of Line to Zero.

    Style_section.

    After you complete these steps, the following plot appears in the left panel. The input history time series is in blue, while the forecasted series is in green. The prediction interval is the region between the lower bound series and the upper bound series.

    Result_visualization.

Step five: Forecast multiple time-series simultaneously

Next, you might want to forecast the daily total number of trips starting from different Citi Bike stations. To do this, you must forecast many time series. You can write multiple CREATE MODEL queries but that can be a tedious and time consuming process, especially when you have a large number of time series.

To improve this process, BigQuery ML lets you create a set of time series models to forecast multiple time-series using a single query. Additionally, all time series models are fitted simultaneously.

In the following standard SQL query, the CREATE MODEL clause creates and trains a set of models named bqml_tutorial.nyc_citibike_arima_model_group.

The OPTIONS(model_type='ARIMA', time_series_timestamp_col='date', ...) clause indicates that you are creating a set of ARIMA-based time series models. In addition to time_series_timestamp_col and time_series_data_col, you must specify time_series_id_col, which is used to annotate different input time series. Both auto_arima and data_frequency options must be the default values, so you don't need to specify them in the query.

The SELECT ... FROM ... GROUP BY ... clause indicates that you form multiple time series; each one is associated with a different start_station_name. For simplicity, we use the WHERE ... LIKE ... clause to limit start stations to those with Central Park in their names.

#standardSQL
CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model_group
OPTIONS
  (model_type = 'ARIMA',
   time_series_timestamp_col = 'date',
   time_series_data_col = 'num_trips',
   time_series_id_col = 'start_station_name'
  ) AS
SELECT
   start_station_name,
   EXTRACT(DATE from starttime) AS date,
   COUNT(*) AS num_trips
FROM
  `bigquery-public-data`.new_york.citibike_trips
WHERE start_station_name LIKE '%Central Park%'
GROUP BY start_station_name, date

To run the CREATE MODEL query to create and train your model, use the following steps:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE MODEL bqml_tutorial.nyc_citibike_arima_model_group
    OPTIONS
      (model_type = 'ARIMA',
       time_series_timestamp_col = 'date',
       time_series_data_col = 'num_trips',
       time_series_id_col = 'start_station_name'
      ) AS
    SELECT
       start_station_name,
       EXTRACT(DATE from starttime) AS date,
       COUNT(*) AS num_trips
    FROM
      `bigquery-public-data`.new_york.citibike_trips
    WHERE start_station_name LIKE '%Central Park%'
    GROUP BY start_station_name, date
    
  3. Click Run.

    The query takes about 90 seconds to complete, after which your model (nyc_citibike_arima_model_group) appears in the navigation panel. Because the query uses a CREATE MODEL statement, you do not see query results.

Step six: Inspect the evaluation metrics of the set of time series models

After creating your models, you can use the ML.EVALUATE function to see the evaluation metrics of all the created models.

In the following standard SQL query, the FROM clause uses the ML.EVALUATE function against your model, bqml_tutorial.nyc_citibike_arima_model_group. The evaluation metrics only depend on the training input, so your model is the only input to the ML.EVALUATE function.

#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`)

To run the ML.EVALUATE query that evaluates the model, use the following steps:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.EVALUATE(MODEL bqml_tutorial.nyc_citibike_arima_model_group)
    
  3. Click Run.

  4. The query takes less than a second to complete. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.EVALUATE output.

    The results include the following columns:

    • start_station_name
    • non_seasonal_p
    • non_seasonal_d
    • non_seasonal_q
    • has_drift
    • log_likelihood
    • AIC
    • variance
    • seasonal_periods

    start_station_name, the first column, annotates the time series that each time series model is fitted against. It is the same as that specified by time_series_id_col.

    The following four columns (non_seasonal_{p,d,q} and has_drift) define an ARIMA model. The three metrics after that (log_likelihood, AIC, and variance) are relevant to the ARIMA model fitting process. The fitting process determines the best ARIMA model by using the auto.ARIMA algorithm, one for each time series.

    seasonal_periods is the seasonal pattern inside the input time series. Each time series can have different seasonal patterns. For example, from the figure, you can see that one time series has a yearly pattern, while others do not.

Step seven: Inspect the coefficients of your models

The ML.ARIMA_COEFFICIENTS function is used to retrieve the model coefficients of your ARIMA model, bqml_tutorial.nyc_citibike_arima_model_group. ML.ARIMA_COEFFICIENTS takes the model as the only input.

#standardSQL
SELECT
  *
FROM
  ML.ARIMA_COEFFICIENTS(MODEL bqml_tutorial.nyc_citibike_arima_model_group)

To run the ML.ARIMA_COEFFICIENTS query, use the following steps:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.ARIMA_COEFFICIENTS(MODEL bqml_tutorial.nyc_citibike_arima_model_group)
    
  3. Click Run.

    The query takes less than a second to complete. The results should look similar to the following screenshot:

    ML.ARIMA_COEFFICIENTS output.

    The results include the following columns:

    • start_station_name
    • ar_coefficients
    • ma_coefficients
    • intercept_or_drift

    start_station_name, the first column, annotates the time series that each time series model is fitted against.

    ar_coefficients shows the model coefficients of the autoregressive (AR) part of the ARIMA model. Similarly, ma_coefficients shows the model coefficients of moving-average (MA) part. They are both arrays, whose lengths are equal to non_seasonal_p and non_seasonal_q, respectively. The intercept_or_drift is the constant term in the ARIMA model.

Step eight: Use your model to forecast multiple time-series simultaneously

The ML.FORECAST function forecasts future time series values with a prediction interval using your model, bqml_tutorial.nyc_citibike_arima_model_group.

The STRUCT(3 AS horizon, 0.9 AS confidence_level) clause indicates that for each time series, forecast 3 future time points, and generate a prediction interval with a 90% confidence level.

ML.FORECAST takes the model, as well as a couple of optional arguments, as shown in the following example.

#standardSQL
SELECT
  *
FROM
  ML.FORECAST(MODEL bqml_tutorial.nyc_citibike_arima_model_group,
              STRUCT(3 AS horizon, 0.9 AS confidence_level))

To run the ML.FORECAST query, use the following steps:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.FORECAST(MODEL bqml_tutorial.nyc_citibike_arima_model_group,
                 STRUCT(3 AS horizon, 0.9 AS confidence_level))
    
  3. Click Run.

    The query takes less than a second to complete. The results should look like the following:

    ML.FORECAST output.

    The results include the following columns:

    • start_station_name
    • forecast_timestamp
    • forecast_value
    • standard_error
    • confidence_level
    • prediction_interval_lower_bound
    • prediction_interval_upper_bound
    • confidence_interval_lower_bound (will be deprecated soon)
    • confidence_interval_upper_bound (will be deprecated soon)

    start_station_name, the first column, annotates the time series that each time series model is fitted against. Each start_station_name has a horizon number of rows for its forecasting results.

    For each start_station_name, output rows are ordered in the chronological order of forecast_timestamp. In time series forecasting, the prediction interval, which is captured by the lower and upper bounds, is as important as the forecast_value. The forecast_value is the middle point of the prediction interval. The prediction interval depends on the standard_error and confidence_level.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Deleting your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Cloud Console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset to delete the dataset, the table, and all of the data.

  4. In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Deleting your project

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete .
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next