Multiple time-series forecasting with a single query for NYC Citi Bike trips


In this tutorial, you will learn how to create a set of time series models to perform multiple time-series forecasts with a single query. You will also learn how to apply different fast training strategies to significantly speed up the query and how to evaluate forecasting accuracy. You will use the new_york.citibike_trips data. This data contains information about Citi Bike trips in New York City.

Before reading this tutorial, we highly recommend that you read Single time-series forecasting from Google Analytics data.

Objectives

In this tutorial, you use the following:

  • The CREATE MODEL statement: to create a time-series model or a set of time-series models.
  • The ML.ARIMA_EVALUATE function: to evaluate the model.
  • The ML.ARIMA_COEFFICIENTS function: to inspect the model coefficients.
  • The ML.EXPLAIN_FORECAST function: to retrieve various components of the time series (such as seasonality and trend) that can be used to explain the forecast results.
  • Looker Studio: to visualize the forecasting results.
  • Optional: The ML.FORECAST function: to forecast daily total visits.

Costs

This tutorial uses billable components of Google Cloud, including:

  • BigQuery
  • BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  5. Make sure that billing is enabled for your Google Cloud project.

  6. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to

    Enable the BigQuery API.

    Enable the API

Step one: Create your dataset

Create a BigQuery dataset to store your ML model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

    Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select US (multiple regions in United States).

      The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

      Create dataset page.

Step two (optional): Visualize the time series you want to forecast

Before creating the model, it is beneficial to see how your input time series looks. You can achieve this by using Looker Studio.

In the following query, the FROM bigquery-public-data.new_york.citibike_trips clause indicates that you are querying the citibike_trips table in the new_yorkdataset.

In the SELECT statement, the query uses the EXTRACT function to extract the date information from the starttime column. The query uses the COUNT(*) clause to get the daily total number of Citi Bike trips.

#standardSQL
SELECT
   EXTRACT(DATE from starttime) AS date,
   COUNT(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
GROUP BY date

To run the query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
       EXTRACT(DATE from starttime) AS date,
       COUNT(*) AS num_trips
    FROM
      `bigquery-public-data.new_york.citibike_trips`
    GROUP BY date
    
  3. Click Run.

    After the query runs, the output is similar to the following screenshot. In the screenshot, you can see that this time series has 1184 daily data points, which span more than 4 years. Click the Explore data button and then Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.

    Query output.

  4. In the Chart panel, choose Time series chart:

    Time_series_chart.

  5. In the Data panel, below the Chart panel, go to the Metric section. Add the num_trips field and remove the default metric Record Count:

    Time_series_data_fields.

    After you complete the above step, the following plot appears. The plot shows that the input time series has both weekly and yearly patterns. The time series is trending up as well.

    Result_visualization.

Step three: Create your time series model to perform single time-series forecasting

Next, create a time series model using the NYC Citi Bike trips data.

The following GoogleSQL query creates a model used to forecast daily total bike trips. The CREATE MODEL clause creates and trains a model named bqml_tutorial.nyc_citibike_arima_model.

#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model`
OPTIONS
  (model_type = 'ARIMA_PLUS',
   time_series_timestamp_col = 'date',
   time_series_data_col = 'num_trips'
  ) AS
SELECT
   EXTRACT(DATE from starttime) AS date,
   COUNT(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
GROUP BY date

The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating an ARIMA-based time-series model. By default, auto_arima=TRUE, so the auto.ARIMA algorithm automatically tunes the hyperparameters in ARIMA_PLUS models. The algorithm fits dozens of candidate models and chooses the best one with the lowest Akaike information criterion (AIC). Additionally, because the default is data_frequency='AUTO_FREQUENCY', the training process automatically infers the data frequency of the input time series. Lastly, the CREATE MODEL statement uses decompose_time_series=TRUE by default, and users can further understand how the time series is forecasted by fetching the separate time-series components such as seasonal periods and holiday effect.

Run the CREATE MODEL query to create and train your model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'date',
       time_series_data_col = 'num_trips'
      ) AS
    SELECT
       EXTRACT(DATE from starttime) AS date,
       COUNT(*) AS num_trips
    FROM
      `bigquery-public-data.new_york.citibike_trips`
    GROUP BY date
    
  3. Click Run.

    The query takes approximately 17 seconds to complete, after which your model (nyc_citibike_arima_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.

Step four: Forecast the time series and visualize the results

To explain how the time series is forecasted, visualize all the sub-time series components, such as seasonality and trend, using the ML.EXPLAIN_FORECAST function.

To achieve this, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
      *
    FROM
      ML.EXPLAIN_FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model`,
                          STRUCT(365 AS horizon, 0.9 AS confidence_level))
    
  3. Click Run.

  4. After the query completes, click the Explore data button, and then click Explore with Looker Studio. A new tab opens in the browser. Then, in the Chart panel, find the Time series chart icon and click it, as shown in the following screenshot.

    Time_series_chart.

  5. In the Data panel, do the following:

    1. In the Date Range Dimension section, select time_series_timestamp (Date).
    2. In the Dimension section, select time_series_timestamp (Date).
    3. In the Metric section, remove the default metric Record Count, and add the following:
      • time_series_data
      • prediction_interval_lower_bound
      • prediction_interval_upper_bound
      • trend
      • seasonal_period_yearly
      • seasonal_period_weekly
      • spikes_and_dips
      • step_changes

    Time_series_chart.

  6. In the Style panel, scroll down to the Missing Data option and use Line Breaks instead of Line to Zero.

    Style_section.

    After you complete these steps, the following plot appears in the left panel.

    Result_visualization.

Step five: Forecast multiple time-series simultaneously

Next, you might want to forecast the daily total number of trips starting from different Citi Bike stations. To do this, you must forecast many time series. You can write multiple CREATE MODEL queries but that can be a tedious and time consuming process, especially when you have a large number of time series.

To improve this process, BigQuery ML lets you create a set of time series models to forecast multiple time-series using a single query. Additionally, all time series models are fitted simultaneously.

In the following GoogleSQL query, the CREATE MODEL clause creates and trains a set of models named bqml_tutorial.nyc_citibike_arima_model_group.

 #standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_group`
OPTIONS
  (model_type = 'ARIMA_PLUS',
   time_series_timestamp_col = 'date',
   time_series_data_col = 'num_trips',
   time_series_id_col = 'start_station_name',
   auto_arima_max_order = 5
  ) AS
SELECT
   start_station_name,
   EXTRACT(DATE from starttime) AS date,
   COUNT(*) AS num_trips
FROM
  `bigquery-public-data.new_york.citibike_trips`
WHERE start_station_name LIKE '%Central Park%'
GROUP BY start_station_name, date

The OPTIONS(model_type='ARIMA_PLUS', time_series_timestamp_col='date', ...) clause indicates that you are creating a set of ARIMA-based time-series ARIMA_PLUS models. In addition to time_series_timestamp_col and time_series_data_col, you must specify time_series_id_col, which is used to annotate different input time series. The auto_arima_max_order option controls the search space for hyperparameter tuning in the auto.ARIMA algorithm. Lastly, the CREATE MODEL statement uses decompose_time_series=TRUE by default, and users can further understand how the time series is analyzed in the training pipeline by fetching the decomposition results.

The SELECT ... FROM ... GROUP BY ... clause indicates that you form multiple time series; each one is associated with a different start_station_name. For simplicity, use the WHERE ... LIKE ... clause to limit start stations to those with Central Park in their names.

To run the CREATE MODEL query to create and train your model, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE MODEL `bqml_tutorial.nyc_citibike_arima_model_group`
    OPTIONS
      (model_type = 'ARIMA_PLUS',
       time_series_timestamp_col = 'date',
       time_series_data_col = 'num_trips',
       time_series_id_col = 'start_station_name',
       auto_arima_max_order = 5
      ) AS
    SELECT
       start_station_name,
       EXTRACT(DATE from starttime) AS date,
       COUNT(*) AS num_trips
    FROM
      `bigquery-public-data.new_york.citibike_trips`
    WHERE start_station_name LIKE '%Central Park%'
    GROUP BY start_station_name, date
    
  3. Click Run.

    The query takes approximately 24 seconds to complete, after which your model (nyc_citibike_arima_model_group) appears in the navigation panel. Because the query uses a CREATE MODEL statement, you do not see query results.

Step six: Inspect the evaluation metrics of the set of time series models

After creating your models, you can use the ML.ARIMA_EVALUATE function to see the evaluation metrics of all the created models.

In the following GoogleSQL query, the FROM clause uses the ML.ARIMA_EVALUATE function against your model, bqml_tutorial.nyc_citibike_arima_model_group. The evaluation metrics only depend on the training input, so your model is the only input.

To run the ML.ARIMA_EVALUATE query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`)
    
  3. Click Run.

  4. The query takes less than a second to complete. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.EVALUATE output.

    The results include the following columns:

    • start_station_name
    • non_seasonal_p
    • non_seasonal_d
    • non_seasonal_q
    • has_drift
    • log_likelihood
    • AIC
    • variance
    • seasonal_periods
    • has_holiday_effect
    • has_spikes_and_dips
    • has_step_changes
    • error_message

    start_station_name, the first column, annotates the time series that each time-series model is fitted against. It is the same as that specified by time_series_id_col.

    The following four columns (non_seasonal_p, non_seasonal_d, non_seasonal_q, and has_drift) define an ARIMA model in the training pipeline. The three metrics after that (log_likelihood, AIC, and variance) are relevant to the ARIMA model fitting process. The fitting process determines the best ARIMA model by using the auto.ARIMA algorithm, one for each time series.

    The has_holiday_effect, has_spikes_and_dips, and has_step_changes columns are only populated when the decompose_time_series=TRUE.

    The seasonal_periods column is the seasonal pattern inside the input time series. Each time series can have different seasonal patterns. For example, from the figure, you can see that one time series has a yearly pattern, while others don't.

Step seven: Inspect the coefficients of your models

The ML.ARIMA_COEFFICIENTS function is used to retrieve the model coefficients of your ARIMA_PLUS model, bqml_tutorial.nyc_citibike_arima_model_group. ML.ARIMA_COEFFICIENTS takes the model as the only input.

To run the ML.ARIMA_COEFFICIENTS query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.ARIMA_COEFFICIENTS(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`)
    
  3. Click Run.

    The query takes less than a second to complete. The results should look similar to the following screenshot:

    ML.ARIMA_COEFFICIENTS output.

    The results include the following columns:

    • start_station_name
    • ar_coefficients
    • ma_coefficients
    • intercept_or_drift

    start_station_name, the first column, annotates the time series that each time series model is fitted against.

    ar_coefficients shows the model coefficients of the autoregressive (AR) part of the ARIMA model. Similarly, ma_coefficients shows the model coefficients of moving-average (MA) part. They are both arrays, whose lengths are equal to non_seasonal_p and non_seasonal_q, respectively. The intercept_or_drift is the constant term in the ARIMA model.

Step eight: Use your model to forecast multiple time series simultaneously with explanations

The ML.EXPLAIN_FORECAST function forecasts future time series values with a prediction interval using your model, bqml_tutorial.nyc_citibike_arima_model_group, and at the same time returns all the separate components of the time series.

The STRUCT(3 AS horizon, 0.9 AS confidence_level) clause indicates that the query forecasts three future time points and generates a prediction interval with 90% confidence. The ML.EXPLAIN_FORECAST function takes the model, as well as a couple of optional arguments.

To run the ML.EXPLAIN_FORECAST query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.EXPLAIN_FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`,
                         STRUCT(3 AS horizon, 0.9 AS confidence_level))
    
  3. Click Run.

    The query takes less than a second to complete. The results should look like the following:

    ML.EXPLAIN_FORECAST output1. ML.EXPLAIN_FORECAST output2.

    The results include the following columns:

    • start_station_name
    • time_series_timestamp
    • time_series_type
    • time_series_data
    • time_series_adjusted_data
    • standard_error
    • confidence_level
    • prediction_interval_lower_bound
    • prediction_interval_lower_bound
    • trend
    • seasonal_period_yearly
    • seasonal_period_quarterly
    • seasonal_period_monthly
    • seasonal_period_weekly
    • seasonal_period_daily
    • holiday_effect
    • spikes_and_dips
    • step_changes
    • residual

    The output rows are ordered by start_station_name, and for each start_station_name, the output rows are in the chronological order of time_series_timestamp. Different components are listed as columns of the output. For more information, see the definition of ML.EXPLAIN_FORECAST.

(Optional) Step nine: Use your model to forecast multiple time series simultaneously

The ML.FORECAST function can also be used to forecast future time series values with a prediction interval using your model, bqml_tutorial.nyc_citibike_arima_model_group.

Like ML.EXPLAIN_FORECAST, the STRUCT(3 AS horizon, 0.9 AS confidence_level) clause indicates that for each time series, the query forecasts three future time points, and generates a prediction interval with 90% confidence.

The ML.FORECAST function takes the model, as well as a couple of optional arguments.

To run the ML.FORECAST query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following GoogleSQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.FORECAST(MODEL `bqml_tutorial.nyc_citibike_arima_model_group`,
                 STRUCT(3 AS horizon, 0.9 AS confidence_level))
    
  3. Click Run.

    The query takes less than a second to complete. The results should look like the following:

    ML.FORECAST output.

    The results include the following columns:

    • start_station_name
    • forecast_timestamp
    • forecast_value
    • standard_error
    • confidence_level
    • prediction_interval_lower_bound
    • prediction_interval_upper_bound
    • confidence_interval_lower_bound (soon to be deprecated)
    • confidence_interval_upper_bound (soon to be deprecated)

    The first column, start_station_name, annotates the time series that each time series model is fitted against. Each start_station_name has a horizon number of rows for its forecasting results.

    For each start_station_name, output rows are ordered in the chronological order of forecast_timestamp. In time series forecasting, the prediction interval, which is captured by the lower and upper bounds, is as important as the forecast_value. The forecast_value is the middle point of the prediction interval. The prediction interval depends on the standard_error and confidence_level.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset to delete the dataset, the table, and all of the data.

  4. In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next