This tutorial teaches you how to use hyperparameter tuning in BigQuery ML to tune a machine learning model and improve its performance.
You perform hyperparameter tuning by specifying the
NUM_TRIALS
option
of the CREATE MODEL
statement, in combination with other model-specific
options. When you set these options, BigQuery ML trains
multiple versions, or trials of the model, each with slightly different
parameters, and returns the trial that performs the best.
This tutorial uses the public
tlc_yellow_trips_2018
sample table, which contains information about taxi trip in New York City
in 2018.
Objectives
This tutorial guides you through completing the following tasks:
- Using the
CREATE MODEL
statement to create a baseline linear regression model. - Evaluating the baseline model by using the
ML.EVALUATE
function. - Using the
CREATE MODEL
statement with hyperparameter tuning options to train twenty trials of a linear regression model. - Reviewing the trials by using the
ML.TRIAL_INFO
function. - Evaluating the trials by using the
ML.EVALUATE
function. - Get predictions about taxi trips from the optimal model among the trials by
using the
ML.PREDICT
function.
Costs
This tutorial uses billable components of Google Cloud, including:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Required permissions
- To create the dataset, you need the
bigquery.datasets.create
IAM permission. To create the connection resource, you need the following permissions:
bigquery.connections.create
bigquery.connections.get
To create the model, you need the following permissions:
bigquery.jobs.create
bigquery.models.create
bigquery.models.getData
bigquery.models.updateData
bigquery.connections.delegate
To run inference, you need the following permissions:
bigquery.models.getData
bigquery.jobs.create
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Create a dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create dataset.On the Create dataset page, do the following:
For Dataset ID, enter
bqml_tutorial
.For Location type, select Multi-region, and then select US (multiple regions in United States).
The public datasets are stored in the
US
multi-region. For simplicity, store your dataset in the same location.Leave the remaining default settings as they are, and click Create dataset.
Create a table of training data
Create a table of training data, based on a subset of the
tlc_yellow_trips_2018
table data.
Follow these steps to create the table:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE OR REPLACE TABLE `bqml_tutorial.taxi_tip_input` AS SELECT * EXCEPT (tip_amount), tip_amount AS label FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018` WHERE tip_amount IS NOT NULL LIMIT 100000;
Create a baseline linear regression model
Create a linear regression model without hyperparameter tuning and train it on
the taxi_tip_input
table data.
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE OR REPLACE MODEL `bqml_tutorial.baseline_taxi_tip_model` OPTIONS ( MODEL_TYPE = 'LINEAR_REG' ) AS SELECT * FROM `bqml_tutorial.taxi_tip_input`;
The query takes about 2 minutes to complete.
Evaluate the baseline model
Evaluate the performance of the model by using the ML.EVALUATE
function.
The ML.EVALUATE
function evaluates the predicted content ratings returned by
the model against the evaluation metrics calculated during model training.
Follow these steps to evaluate the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.baseline_taxi_tip_model`);
The results look similar to the following:
+---------------------+--------------------+------------------------+-----------------------+---------------------+---------------------+ | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance | +---------------------+--------------------+------------------------+-----------------------+---------------------+---------------------+ | 2.5853895559690323 | 23760.416358496139 | 0.017392406523370374 | 0.0044248227819481123 | -1934.5450533482465 | -1934.3513857946277 | +---------------------+--------------------+------------------------+-----------------------+---------------------+---------------------+
The r2_score
value for the baseline model is negative, which indicates a
poor fit for the data; the closer the
R2 score
is to 1, the better the model fit is.
Create a linear regression model with hyperparameter tuning
Create a linear regression model with hyperparameter tuning and train it on
the taxi_tip_input
table data.
You use the following hyperparameter tuning options in the CREATE MODEL
statement:
- The
NUM_TRIALS
option to set the number of trials to twenty. - The
MAX_PARALLEL_TRIALS
option to run two trials in each training job, for a total of ten jobs and twenty trials. This reduces the training time needed. However, the two concurrent trials don't benefit from each other's training results. - The
L1_REG
option to try different L1 regularization values in the different trials. L1 regularization removes irrelevant features from the model, which helps prevent overfitting.
The other hyperparameter tuning options supported by the model use their default values, as follows:
L1_REG
:0
HPARAM_TUNING_ALGORITHM
:'VIZIER_DEFAULT'
HPARAM_TUNING_OBJECTIVES
:['R2_SCORE']
Follow these steps to create the model:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
CREATE OR REPLACE MODEL `bqml_tutorial.hp_taxi_tip_model` OPTIONS ( MODEL_TYPE = 'LINEAR_REG', NUM_TRIALS = 20, MAX_PARALLEL_TRIALS = 2, L1_REG = HPARAM_RANGE(0, 5)) AS SELECT * FROM `bqml_tutorial.taxi_tip_input`;
The query takes approximately 20 minutes to complete.
Get information about the training trials
Get information about all of the trials, including their hyperparameter values,
objectives, and status, by using the ML.TRIAL_INFO
function. This function
also returns information about which trial has the best performance, based on
this information.
Follow these steps to get trial information:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.TRIAL_INFO(MODEL `bqml_tutorial.hp_taxi_tip_model`) ORDER BY is_optimal DESC;
The results look similar to the following:
+----------+-------------------------------------+-----------------------------------+--------------------+--------------------+-----------+---------------+------------+ | trial_id | hyperparameters | hparam_tuning_evaluation_metrics | training_loss | eval_loss | status | error_message | is_optimal | +----------+-------------------------------------+-----------------------------------+--------------------+--------------------+-----------+---------------+------------+ | 7 | {"l1_reg":"4.999999999999985"} | {"r2_score":"0.653653627638174"} | 4.4677841296238165 | 4.478469742512195 | SUCCEEDED | NULL | true | | 2 | {"l1_reg":"2.402163664510254E-11"} | {"r2_score":"0.6532493667964732"} | 4.457692508421795 | 4.483697081650438 | SUCCEEDED | NULL | false | | 3 | {"l1_reg":"1.2929452948742316E-7"} | {"r2_score":"0.653249366811995"} | 4.45769250849513 | 4.483697081449748 | SUCCEEDED | NULL | false | | 4 | {"l1_reg":"2.5787102060628228E-5"} | {"r2_score":"0.6532493698925899"} | 4.457692523040582 | 4.483697041615808 | SUCCEEDED | NULL | false | | ... | ... | ... | ... | ... | ... | ... | ... | +----------+-------------------------------------+-----------------------------------+--------------------+--------------------+-----------+---------------+------------+
The
is_optimal
column value indicates that trial 7 is the optimal model returned by the tuning.
Evaluate the tuned model trials
Evaluate the performance of the trials by using the ML.EVALUATE
function.
The ML.EVALUATE
function evaluates the predicted content ratings returned by
the model against the evaluation metrics calculated during training for all
trials.
Follow these steps to evaluate the model trials:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.hp_taxi_tip_model`) ORDER BY r2_score DESC;
The results look similar to the following:
+----------+---------------------+--------------------+------------------------+-----------------------+--------------------+--------------------+ | trial_id | mean_absolute_error | mean_squared_error | mean_squared_log_error | median_absolute_error | r2_score | explained_variance | +----------+---------------------+--------------------+------------------------+-----------------------+--------------------+--------------------+ | 7 | 1.151814398002232 | 4.109811493266523 | 0.4918733252641176 | 0.5736103414025084 | 0.6652110305659145 | 0.6652144696114834 | | 19 | 1.1518143358927102 | 4.109811921460791 | 0.4918672150119582 | 0.5736106106914161 | 0.6652109956848206 | 0.6652144346901685 | | 8 | 1.152747850702547 | 4.123625876152422 | 0.4897808307399327 | 0.5731702310239184 | 0.6640856984144734 | 0.664088410199906 | | 5 | 1.152895108945439 | 4.125775524878872 | 0.48939088205957937 | 0.5723300569616766 | 0.6639105860807425 | 0.6639132416838652 | | ... | ... | ... | ... | ... | ... | ... | +----------+---------------------+--------------------+------------------------+-----------------------+--------------------+--------------------+
The
r2_score
value for the optimal model, which is trial 7, is0.66521103056591446
, which shows significant improvement over the baseline model.
You can evaluate a specific trial by specifying the TRIAL_ID
argument in the ML.EVALUATE
function.
For more information about the difference between ML.TRIAL_INFO
objectives and ML.EVALUATE
evaluation metrics, see
Model serving functions.
Use the tuned model to predict taxi tips
Use the optimal model returned by tuning to predict tips for different taxi
trips. The optimal model is automatically used by the ML.PREDICT
function,
unless you select a different trial by specifying the TRIAL_ID
argument. The
predictions are returned in the predicted_label
column.
Follow these steps to get predictions:
In the Google Cloud console, go to the BigQuery page.
In the query editor, paste in the following query and click Run:
SELECT * FROM ML.PREDICT( MODEL `bqml_tutorial.hp_taxi_tip_model`, ( SELECT * FROM `bqml_tutorial.taxi_tip_input` LIMIT 5 ));
The results look similar to the following:
+----------+--------------------+-----------+---------------------+---------------------+-----------------+---------------+-----------+--------------------+--------------+-------------+-------+---------+--------------+---------------+--------------+--------------------+---------------------+----------------+-----------------+-------+ | trial_id | predicted_label | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | rate_code | store_and_fwd_flag | payment_type | fare_amount | extra | mta_tax | tolls_amount | imp_surcharge | total_amount | pickup_location_id | dropoff_location_id | data_file_year | data_file_month | label | +----------+--------------------+-----------+---------------------+---------------------+-----------------+---------------+-----------+--------------------+--------------+-------------+-------+---------+--------------+---------------+--------------+--------------------+---------------------+----------------+-----------------+-------+ | 7 | 1.343367839584448 | 2 | 2018-01-15 18:55:15 | 2018-01-15 18:56:18 | 1 | 0 | 1 | N | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 193 | 193 | 2018 | 1 | 0 | | 7 | -1.176072791783461 | 1 | 2018-01-08 10:26:24 | 2018-01-08 10:26:37 | 1 | 0 | 5 | N | 3 | 0.01 | 0 | 0 | 0 | 0.3 | 0.31 | 158 | 158 | 2018 | 1 | 0 | | 7 | 3.839580104168765 | 1 | 2018-01-22 10:58:02 | 2018-01-22 12:01:11 | 1 | 16.1 | 1 | N | 1 | 54.5 | 0 | 0.5 | 0 | 0.3 | 55.3 | 140 | 91 | 2018 | 1 | 0 | | 7 | 4.677393985230036 | 1 | 2018-01-16 10:14:35 | 2018-01-16 11:07:28 | 1 | 18 | 1 | N | 2 | 54.5 | 0 | 0.5 | 0 | 0.3 | 55.3 | 138 | 67 | 2018 | 1 | 0 | | 7 | 7.938988937253062 | 2 | 2018-01-16 07:05:15 | 2018-01-16 08:06:31 | 1 | 17.8 | 1 | N | 1 | 54.5 | 0 | 0.5 | 0 | 0.3 | 66.36 | 132 | 255 | 2018 | 1 | 11.06 | +----------+--------------------+-----------+---------------------+---------------------+-----------------+---------------+-----------+--------------------+--------------+-------------+-------+---------+--------------+---------------+--------------+--------------------+---------------------+----------------+-----------------+-------+
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation panel, click the bqml_tutorial dataset you created.
On the right side of the window, click Delete dataset. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (
bqml_tutorial
) and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.