Multivariate time-series forecasting from Seattle air quality data


In this tutorial, you learn how to create a multivariate time series model (ARIMA_PLUS_XREG) and use it to perform time-series forecasting, using sample tables from the epa_historical_air_quality dataset.

The epa_historical_air_quality dataset contains daily PM2.5, temperature, and wind speed measurements collected from multiple US cities.

Objectives

In this tutorial, you use the CREATE MODEL statement and the ML.ARIMA_EVALUATE, ML.ARIMA_COEFFICIENTS, ML.FORECAST, and ML.EVALUATE functions.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see the BigQuery ML pricing page.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.

    Enable the API

Step one: Create your dataset

The first step is to create a BigQuery dataset to store your ML model. To create your dataset:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Resources section, click your project name.

  3. In the details panel, click Create dataset.

  4. On the Create dataset page:

    • For Dataset ID, enter bqml_tutorial.
    • For Data location, choose United States (US). Currently, the public datasets are stored in the US multi-region location. For simplicity, you should place your dataset in the same location.

  5. Leave all other default settings in place and click Create dataset.

Step two: Create time series with extra features

The PM2.5, temperature, and wind speed data are in separate tables. To simplify the following queries, you can create a new table bqml_tutorial.seattle_air_quality_daily by joining those tables, with the following columns:

  • date: the date of the observation
  • pm25: the average PM2.5 value for each day
  • wind_speed: the average wind speed for each day
  • temperature: the highest temperature for each day

The new table has daily data from 2009-08-11 to 2022-01-31.

In the following standard SQL query, the FROM clauses reference the pm25_nonfrm_daily_summary, wind_daily_summary, and temperature_daily_summary tables in the epa_historical_air_quality dataset. These tables are partitioned tables.

#standardSQL
CREATE TABLE `bqml_tutorial.seattle_air_quality_daily`
AS
WITH
  pm25_daily AS (
    SELECT
      avg(arithmetic_mean) AS pm25, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.pm25_nonfrm_daily_summary`
    WHERE
      city_name = 'Seattle'
      AND parameter_name = 'Acceptable PM2.5 AQI & Speciation Mass'
    GROUP BY date_local
  ),
  wind_speed_daily AS (
    SELECT
      avg(arithmetic_mean) AS wind_speed, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.wind_daily_summary`
    WHERE
      city_name = 'Seattle' AND parameter_name = 'Wind Speed - Resultant'
    GROUP BY date_local
  ),
  temperature_daily AS (
    SELECT
      avg(first_max_value) AS temperature, date_local AS date
    FROM
      `bigquery-public-data.epa_historical_air_quality.temperature_daily_summary`
    WHERE
      city_name = 'Seattle' AND parameter_name = 'Outdoor Temperature'
    GROUP BY date_local
  )
SELECT
  pm25_daily.date AS date, pm25, wind_speed, temperature
FROM pm25_daily
JOIN wind_speed_daily USING (date)
JOIN temperature_daily USING (date)

To run the query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the above standard SQL query in the Query editor text area.

  3. Click Run.

(Optional) Step three: Visualize the time series you want to forecast

Before creating the model, it is useful to see what your input time series looks like. You can do this by using Looker Studio.

In the following standard SQL query, the FROM bqml_tutorial.seattle_air_quality_daily clause indicates that you are querying the seattle_air_quality_daily table in the bqml_tutorial dataset you just created.

#standardSQL
SELECT
  *
FROM
  `bqml_tutorial.seattle_air_quality_daily`

To run the query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the above standard SQL query in the Query editor text area.

  3. Click Run.

    After this query runs, the output is similar to the following screenshot. In the screenshot, you can see that this time series has 3960 data points. Click the Explore data button and then Explore with Looker Studio. Looker Studio opens in a new tab. Complete the following steps in the new tab.

    Query output

    In the Chart panel, choose Time series chart:

    Time series chart

    In the SETUP panel, below the Chart panel, go to the Metric section. Add the pm25, temperature, and wind_speed fields, and then remove the default metric Record Count. You can also set a custom date range, for example January 1, 2019 to December 31, 2021, to make the time series shorter. This is shown in the following figure.

    Time series data fields

    After you complete these steps, the following plot appears. The plot shows that the input time series has a weekly seasonal pattern.

    Result visualization

Step four: Create your time series model

Next, create a time series model using the above air quality data. The following standard SQL query creates a model used to forecast pm25.

The CREATE MODEL clause creates and trains a model named bqml_tutorial.seattle_pm25_xreg_model.

#standardSQL
CREATE OR REPLACE
  MODEL
    `bqml_tutorial.seattle_pm25_xreg_model`
  OPTIONS (
    MODEL_TYPE = 'ARIMA_PLUS_XREG',
    time_series_timestamp_col = 'date',
    time_series_data_col = 'pm25')
AS
SELECT
  date,
  pm25,
  temperature,
  wind_speed
FROM
  `bqml_tutorial.seattle_air_quality_daily`
WHERE
  date
  BETWEEN DATE('2012-01-01')
  AND DATE('2020-12-31')

The OPTIONS(model_type='ARIMA_PLUS_XREG', time_series_timestamp_col='date', ...) clause indicates that you are creating an ARIMA model with external regressors. By default, auto_arima=TRUE, so the auto.ARIMA algorithm automatically tunes the hyperparameters in ARIMA_PLUS_XREG models. The algorithm fits dozens of candidate models and chooses the one with the lowest Akaike information criterion (AIC). Additionally, because the default is data_frequency='AUTO_FREQUENCY', the training process automatically infers the data frequency of the input time series.
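As an illustration of that selection rule, here is a small Python sketch that scores each candidate by AIC = 2k − 2 ln(L̂) and keeps the lowest. The candidate orders and log-likelihood values below are hypothetical, not taken from any real training run.

```python
import math

def aic(num_params, log_likelihood):
    """Akaike information criterion: 2k - 2*ln(L-hat)."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical candidate ARIMA models, each with a (p, d, q) order,
# a parameter count k, and a fitted log-likelihood.
candidates = [
    {"order": (0, 1, 5), "num_params": 6, "log_likelihood": -6464.1},
    {"order": (1, 1, 2), "num_params": 4, "log_likelihood": -6471.8},
    {"order": (2, 1, 1), "num_params": 4, "log_likelihood": -6470.2},
]
for c in candidates:
    c["aic"] = aic(c["num_params"], c["log_likelihood"])

# auto.ARIMA keeps the candidate with the lowest AIC.
best = min(candidates, key=lambda c: c["aic"])
print(best["order"], best["aic"])
```

The trade-off captured here is that a higher log-likelihood lowers the AIC, while every extra parameter raises it, which penalizes overfitting.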

Run the CREATE MODEL query to create and train your model:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the above standard SQL query in the Query editor text area.

  3. Click Run.

    The query takes about 20 seconds to complete, after which your model (seattle_pm25_xreg_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.

Step five: Inspect the evaluation metrics of all evaluated models

After creating your model, you can use the ML.ARIMA_EVALUATE function to see the evaluation metrics of all the candidate models evaluated during the process of automatic hyperparameter tuning.

In the following standard SQL query, the FROM clause uses the ML.ARIMA_EVALUATE function against your model, bqml_tutorial.seattle_pm25_xreg_model. By default this query returns the evaluation metrics of all the candidate models.

To run the ML.ARIMA_EVALUATE query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
      *
    FROM
      ML.ARIMA_EVALUATE(MODEL `bqml_tutorial.seattle_pm25_xreg_model`)
    
  3. Click Run.

  4. When the query is complete, click the Results tab below the query text area. The results should look similar to the following screenshot:

    ML.ARIMA_EVALUATE output.

    The results include the following columns:

    • non_seasonal_p
    • non_seasonal_d
    • non_seasonal_q
    • has_drift
    • log_likelihood
    • AIC
    • variance
    • seasonal_periods
    • has_holiday_effect
    • has_spikes_and_dips
    • has_step_changes
    • error_message

    The first four columns (non_seasonal_{p,d,q} and has_drift) define an ARIMA model in the training pipeline. The next three metrics (log_likelihood, AIC, and variance) are relevant to the ARIMA model fitting process.

    The auto.ARIMA algorithm first uses the KPSS test to decide the best value for non_seasonal_d; here it is 1. When non_seasonal_d is 1, auto.ARIMA trains 42 candidate ARIMA models in parallel (when non_seasonal_d is not 1, it trains 21). In this example, all 42 candidate models are valid, so the output contains 42 rows, one per candidate ARIMA model. For some time series, some candidate models are invalid because they are either non-invertible or non-stationary; these are excluded from the output, leaving fewer than 42 rows. The candidate models are ordered by AIC in ascending order: the model in the first row has the lowest AIC and is considered the best model. The best model is saved as the final model and is used when you call ML.FORECAST, ML.EVALUATE, and ML.ARIMA_COEFFICIENTS in the following steps.

    The seasonal_periods column describes the seasonal pattern detected in the input time series. It has nothing to do with the ARIMA modeling, so it has the same value across all output rows. It reports a weekly pattern, consistent with the visualization in step three above.

    The has_holiday_effect, has_spikes_and_dips, and has_step_changes columns are only populated when decompose_time_series=TRUE. They describe the holiday effect, spikes and dips, and step changes in the input time series, which are not related to the ARIMA modeling. Therefore, they are the same across all output rows, except for failed models.

    The error_message column shows any error incurred during the auto.ARIMA fitting process. A possible cause is that the selected non_seasonal_p, non_seasonal_d, non_seasonal_q, and has_drift values are not able to stabilize the time series. To retrieve the error messages of all the candidate models, set show_all_candidate_models=TRUE.

Step six: Inspect the coefficients of your model

The ML.ARIMA_COEFFICIENTS function retrieves the model coefficients of your ARIMA_PLUS_XREG model, bqml_tutorial.seattle_pm25_xreg_model. ML.ARIMA_COEFFICIENTS takes the model as its only input.

Run the ML.ARIMA_COEFFICIENTS query:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
      *
    FROM
      ML.ARIMA_COEFFICIENTS(MODEL `bqml_tutorial.seattle_pm25_xreg_model`)
    
  3. Click Run.

    The results should look like the following:

    ML.ARIMA_COEFFICIENTS output.

    The results include the following columns:

    • ar_coefficients
    • ma_coefficients
    • intercept_or_drift
    • processed_input
    • weight
    • category_weights.category
    • category_weights.weight

    ar_coefficients shows the model coefficients of the autoregressive (AR) part of the ARIMA model, and ma_coefficients shows the model coefficients of the moving-average (MA) part. Both are arrays, whose lengths are equal to non_seasonal_p and non_seasonal_q, respectively. In the output of ML.ARIMA_EVALUATE, the best model (the top row) has a non_seasonal_p of 0 and a non_seasonal_q of 5, so ar_coefficients is an empty array and ma_coefficients is a length-5 array. intercept_or_drift is the constant term in the ARIMA model.

    The processed_input column and the corresponding weight and category_weights columns show the weight for each feature and the intercept in the linear regression model. If the feature is numerical, the weight is in the weight column. If the feature is categorical, category_weights is an ARRAY of STRUCT values, where each STRUCT contains a category name and its weight.
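To make the two coefficient groups concrete, here is a rough Python sketch of how they could combine in an ARIMA_PLUS_XREG-style prediction: the linear-regression weights score the external features, and the MA terms (an MA(5), matching a non_seasonal_q of 5) model the residual series. Every coefficient value below is made up for illustration, not taken from real ML.ARIMA_COEFFICIENTS output, and the sketch is a simplification of the actual model.

```python
# Hypothetical values standing in for ML.ARIMA_COEFFICIENTS output.
ma_coefficients = [0.6, 0.3, 0.1, 0.05, 0.02]   # MA(5): non_seasonal_q = 5
ar_coefficients = []                             # empty: non_seasonal_p = 0
intercept_or_drift = 0.4
feature_weights = {"temperature": 0.03, "wind_speed": -0.8}
regression_intercept = 7.5

def regression_part(features):
    """Linear-regression contribution from the external features."""
    return regression_intercept + sum(
        feature_weights[name] * value for name, value in features.items())

def ma_part(past_errors):
    """MA(5) contribution: weighted sum of the last 5 one-step errors."""
    return intercept_or_drift + sum(
        theta * e for theta, e in zip(ma_coefficients, past_errors))

def predict(features, past_errors):
    # The regression models the feature-driven level; the ARIMA part
    # models the remaining time-series structure.
    return regression_part(features) + ma_part(past_errors)

print(predict({"temperature": 60.0, "wind_speed": 3.0},
              [0.5, -0.2, 0.1, 0.0, 0.3]))
```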

Step seven: Use your model to forecast the time series

The ML.FORECAST function forecasts future time series values with a prediction interval using your model bqml_tutorial.seattle_pm25_xreg_model and future feature values.

In the following standard SQL query, the STRUCT(30 AS horizon, 0.8 AS confidence_level) clause indicates that the query forecasts 30 future time points and generates a prediction interval with an 80% confidence level. ML.FORECAST takes the model and the future feature values, as well as a couple of optional arguments.

To run the ML.FORECAST query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
      *
    FROM
      ML.FORECAST(
        MODEL `bqml_tutorial.seattle_pm25_xreg_model`,
        STRUCT(30 AS horizon, 0.8 AS confidence_level),
        (
          SELECT
            date,
            temperature,
            wind_speed
          FROM
            `bqml_tutorial.seattle_air_quality_daily`
          WHERE
            date > DATE('2020-12-31')
        ))
    
  3. Click Run.

    The results should look like the following:

    ML.FORECAST output.

    The results include the following columns:

    • forecast_timestamp
    • forecast_value
    • standard_error
    • confidence_level
    • prediction_interval_lower_bound
    • prediction_interval_upper_bound

    The output rows are ordered chronologically by forecast_timestamp. In time series forecasting, the prediction interval, captured by the lower and upper bounds, is as important as the forecast_value. The forecast_value is the midpoint of the prediction interval. The prediction interval depends on the standard_error and confidence_level.
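As a sketch of that relationship, assuming normally distributed forecast errors (a common construction; BigQuery ML's exact computation may differ), the bounds can be derived from the forecast value, the standard error, and the confidence level. All numeric values below are made up.

```python
from statistics import NormalDist

def prediction_interval(forecast_value, standard_error, confidence_level):
    """Two-sided normal prediction interval around the forecast value."""
    # For an 80% level, 10% of probability mass sits in each tail,
    # so the z multiplier is the 90th percentile of the standard normal.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return (forecast_value - z * standard_error,
            forecast_value + z * standard_error)

# Hypothetical forecast_value and standard_error for one horizon step.
lower, upper = prediction_interval(7.3, 1.9, 0.8)
print(round(lower, 2), round(upper, 2))
```

Note that the forecast value sits exactly halfway between the two bounds, matching the "midpoint" property described above.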

Step eight: Evaluate forecasting accuracy with actual data

To evaluate forecasting accuracy against the actual data, use the ML.EVALUATE function with your model, bqml_tutorial.seattle_pm25_xreg_model, and the actual data table.

To run the ML.EVALUATE query, use the following steps:

  1. In the Google Cloud console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
      *
    FROM
      ML.EVALUATE(
        MODEL `bqml_tutorial.seattle_pm25_xreg_model`,
        (
          SELECT
            date,
            pm25,
            temperature,
            wind_speed
          FROM
            `bqml_tutorial.seattle_air_quality_daily`
          WHERE
            date > DATE('2020-12-31')
        ),
        STRUCT(
          TRUE AS perform_aggregation,
          30 AS horizon))
    

    The second argument is the actual data together with the future feature values; the features are used to forecast values that are then compared with the actual pm25 data. The third argument is a STRUCT of parameters for the function.

  3. Click Run.

    The results should look like the following:

    ML.EVALUATE output.
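With perform_aggregation=TRUE, ML.EVALUATE returns forecasting error metrics aggregated over the horizon. As a rough illustration of what such metrics measure, the following Python sketch computes mean absolute error and mean absolute percentage error over made-up actual and forecast values; it is not BigQuery ML's implementation.

```python
def mean_absolute_error(actual, forecast):
    """Average of the absolute differences between actual and forecast."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mean_absolute_percentage_error(actual, forecast):
    """Average absolute error as a percentage of the actual values."""
    return 100 * sum(
        abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical daily pm25 values: observed vs. forecast.
actual = [5.0, 6.2, 4.8, 7.1]
forecast = [5.5, 6.0, 5.0, 6.5]

print(mean_absolute_error(actual, forecast))
print(mean_absolute_percentage_error(actual, forecast))
```

Lower values mean the forecast tracks the actual series more closely; percentage-based metrics are easier to compare across series with different scales.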

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Deleting your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Deleting your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next