This tutorial introduces data analysts to BigQuery ML.
BigQuery ML enables users to create and execute machine learning models
in BigQuery using SQL queries. This tutorial introduces Hyperparameter
Tuning by specifying
num_trials training option.
In this tutorial, you use the
tlc_yellow_trips_2018 sample table
to create a model that predicts the tip of a taxi trip. You will see a ~40%
performance (r2_score) improvement with hyperparameter tuning.
In this tutorial, you use:
- BigQuery ML to create a linear regression model using the
CREATE MODELstatement with the
num_trialsset to 20.
ML.TRIAL_INFOfunction to check the overview of all 20 trials
ML.EVALUATEfunction to evaluate the ML model
ML.PREDICTfunction to make predictions using the ML model
This tutorial uses billable components of Google Cloud, including:
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Step one: Create your training dataset
The first step is to create a BigQuery dataset to store your training data and ML model. To create your dataset:
In the Google Cloud Console, go to the BigQuery page.
In the navigation panel, under the Resources section, click your project name.
On the right side, in the details panel, click Create dataset.
On the Create dataset page:
- For Dataset ID, enter
For Data location, choose United States (US). Currently, the public datasets are stored in the
USmultiregional location. For simplicity, place your dataset in the same location.
- For Dataset ID, enter
Leave all of the other default settings in place and click Create dataset.
Step two: Create your training input table
In this step, you will materialize the training input table with 100k rows.
View the schema of the source table
Create the training input data table.
CREATE TABLE `bqml_tutorial.taxi_tip_input` AS SELECT * EXCEPT(tip_amount), tip_amount AS label FROM `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018` WHERE tip_amount IS NOT NULL LIMIT 100000
Step three: Create your model
Next, create a linear regression model with hyperparameter tuning using the
tlc_yellow_trips_2018 sample table in BigQuery. The following
standard SQL query is used to create the model with hyperparameter tuning.
CREATE MODEL `bqml_tutorial.hp_taxi_tip_model` OPTIONS (model_type='linear_reg', num_trials=20, max_parallel_trials=2) AS SELECT * FROM `bqml_tutorial.taxi_tip_input`
LINEAR_REG model has two tunable hyperparameters:
The above query uses the default search space. You can also specify the search
OPTIONS (... l1_reg=hparam_range(0, 20), l2_reg=hparam_candidates([0, 0.1, 1, 10]))
In addition, these other hyperparameter tuning training options also use their default values:
max_parallel_trials is set to 2 to accelerate the tuning process. With 2
trials running at any time, the whole tuning should take roughly as long as 10
serial training jobs instead of 20. Note, however, that the two concurrent trials cannot
benefit from each other's training results.
Run the CREATE MODEL query
To run the
CREATE MODEL query to create and train your model:
In the Cloud Console, click the Compose new query button.
Enter the above standard SQL query in the Query editor text area.
The query takes about 17 minutes to complete. You can track the tuning progress in execution details under Stages:
Step four: Get trials information
To see the overview of all trials including their hyperparameters, objectives,
status, and the optimal trial, you can use the
function, and you can view the result in the Cloud Console after running
SELECT * FROM ML.TRIAL_INFO(MODEL `bqml_tutorial.hp_taxi_tip_model`)
You can run this SQL query as soon as one trial is done. If the tuning is stopped in the middle, all already-completed trials will remain available to use.
Step five: Evaluate your model
After creating your model, you can get the evaluation metrics of all trials
by either using the
function or through Google Cloud Console.
SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.hp_taxi_tip_model`)
This SQL fetches evaluation metrics for all trials calculated from the TEST
data. Check Data Split
section to see the difference between
ML.TRIAL_INFO objectives and
ML.EVALUATE evaluation metrics.
You can also evaluate a specific trial by providing your own data. Please check ML.EVALUATE for more details.
Check evaluation metrics through Google Cloud Console
You can also check evaluation metrics by selecting the
Step six: Use your model to predict taxi tips
Now that you have evaluated your model, the next step is to use it to predict the taxi tip.
The query used to predict the tip is as follows:
SELECT * FROM ML.PREDICT(MODEL `bqml_tutorial.hp_taxi_tip_model`, ( SELECT * FROM `bqml_tutorial.taxi_tip_input` LIMIT 10))
SELECT statement retrieves all columns including the
predicted_label column. This column is generated by the
When you use the
ML.PREDICT function, the output column name for the model is
The prediction is made against the optimal trial by default. You can select
other trial by specifying
SELECT * FROM ML.PREDICT(MODEL `bqml_tutorial.hp_taxi_tip_model`, ( SELECT * FROM `bqml_tutorial.taxi_tip_input` LIMIT 10), STRUCT(3 AS trial_id))
Please check ML.PREDICT for more details on how to use model serving functions.
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Deleting your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Cloud Console.
In the navigation panel, click the bqml_tutorial dataset you created.
On the right side of the window, click Delete dataset. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (
bqml_tutorial) and then click Delete.
Deleting your project
To delete the project:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.