Using the BigQuery ML Hyperparameter Tuning to improve model performance

This tutorial introduces data analysts to BigQuery ML. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. This tutorial introduces Hyperparameter Tuning by specifying num_trials training option.

In this tutorial, you use the tlc_yellow_trips_2018 sample table to create a model that predicts the tip of a taxi trip. You will see a ~40% performance (r2_score) improvement with hyperparameter tuning.

Objectives

In this tutorial, you use:

  • BigQuery ML to create a linear regression model using the CREATE MODEL statement with the num_trials set to 20.
  • The ML.TRIAL_INFO function to check the overview of all 20 trials
  • The ML.EVALUATE function to evaluate the ML model
  • The ML.PREDICT function to make predictions using the ML model

Costs

This tutorial uses billable components of Google Cloud, including:

  • BigQuery
  • BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.

    Enable the API

Step one: Create your training dataset

The first step is to create a BigQuery dataset to store your training data and ML model. To create your dataset:

  1. In the Google Cloud Console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the navigation panel, under the Resources section, click your project name.

  3. On the right side, in the details panel, click Create dataset.

    Create dataset.

  4. On the Create dataset page:

    • For Dataset ID, enter bqml_tutorial.
    • For Data location, choose United States (US). Currently, the public datasets are stored in the US multiregional location. For simplicity, place your dataset in the same location.

      Create dataset page.

  5. Leave all of the other default settings in place and click Create dataset.

Step two: Create your training input table

In this step, you will materialize the training input table with 100k rows.

  1. View the schema of the source table tlc_yellow_trips_2018.

    Table schema.

  2. Create the training input data table.

CREATE TABLE `bqml_tutorial.taxi_tip_input` AS
SELECT
  * EXCEPT(tip_amount), tip_amount AS label
FROM
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
WHERE
  tip_amount IS NOT NULL
LIMIT 100000

Step three: Create your model

Next, create a linear regression model with hyperparameter tuning using the tlc_yellow_trips_2018 sample table in BigQuery. The following standard SQL query is used to create the model with hyperparameter tuning.

CREATE MODEL `bqml_tutorial.hp_taxi_tip_model`
OPTIONS
  (model_type='linear_reg',
   num_trials=20,
   max_parallel_trials=2) AS
SELECT
  *
FROM
  `bqml_tutorial.taxi_tip_input`

Query details

The LINEAR_REG model has two tunable hyperparameters: l1_reg and l2_reg. The above query uses the default search space. You can also specify the search space explicitly:

OPTIONS
  (...
    l1_reg=hparam_range(0, 20),
    l2_reg=hparam_candidates([0, 0.1, 1, 10]))

In addition, these other hyperparameter tuning training options also use their default values:

  • hparam_tuning_algorithm: "VIZIER_DEFAULT"

  • hparam_tuning_objectives: ["r2_score"]

max_parallel_trials is set to 2 to accelerate the tuning process. With 2 trials running at any time, the whole tuning should take roughly as long as 10 serial training jobs instead of 20. Note, however, that the two concurrent trials cannot benefit from each other's training results.

Run the CREATE MODEL query

To run the CREATE MODEL query to create and train your model:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the above standard SQL query in the Query editor text area.

  3. Click Run.

    The query takes about 17 minutes to complete. You can track the tuning progress in execution details under Stages:

    Tuning progress.

Step four: Get trials information

To see the overview of all trials including their hyperparameters, objectives, status, and the optimal trial, you can use the ML.TRIAL_INFO function, and you can view the result in the Cloud Console after running the SQL.

SELECT *
FROM
  ML.TRIAL_INFO(MODEL `bqml_tutorial.hp_taxi_tip_model`)

You can run this SQL query as soon as one trial is done. If the tuning is stopped in the middle, all already-completed trials will remain available to use.

Step five: Evaluate your model

After creating your model, you can get the evaluation metrics of all trials by either using the ML.EVALUATE function or through Google Cloud Console.

Run ML.EVALUATE

SELECT *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.hp_taxi_tip_model`)

This SQL fetches evaluation metrics for all trials calculated from the TEST data. Check Data Split section to see the difference between ML.TRIAL_INFO objectives and ML.EVALUATE evaluation metrics.

You can also evaluate a specific trial by providing your own data. Please check ML.EVALUATE for more details.

Check evaluation metrics through Google Cloud Console

You can also check evaluation metrics by selecting the EVALUATION tab.

Tuning evaluation.

Step six: Use your model to predict taxi tips

Now that you have evaluated your model, the next step is to use it to predict the taxi tip.

The query used to predict the tip is as follows:

SELECT
  *
FROM
  ML.PREDICT(MODEL `bqml_tutorial.hp_taxi_tip_model`,
    (
    SELECT
      *
    FROM
      `bqml_tutorial.taxi_tip_input`
    LIMIT 10))

Query details

The top-most SELECT statement retrieves all columns including the predicted_label column. This column is generated by the ML.PREDICT function. When you use the ML.PREDICT function, the output column name for the model is predicted_label_column_name.

The prediction is made against the optimal trial by default. You can select other trial by specifying trial_id parameter.

SELECT
  *
FROM
  ML.PREDICT(MODEL `bqml_tutorial.hp_taxi_tip_model`,
    (
    SELECT
      *
    FROM
      `bqml_tutorial.taxi_tip_input`
    LIMIT
      10),
    STRUCT(3 AS trial_id))

Please check ML.PREDICT for more details on how to use model serving functions.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Deleting your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Cloud Console.

    Go to the BigQuery page

  2. In the navigation panel, click the bqml_tutorial dataset you created.

  3. On the right side of the window, click Delete dataset. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Deleting your project

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next