This tutorial uses BigQuery ML to predict three point field goal attempts in basketball. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.
In this tutorial, you'll use the sample NCAA basketball data public dataset for BigQuery to create a model that predicts three point field goal attempts.
Objectives
In this tutorial, you use:
- BigQuery ML to create a linear regression model using the CREATE MODEL statement
- The ML.EVALUATE function to evaluate the ML model
- The ML.PREDICT function to make predictions using the ML model
Costs
This tutorial uses billable components of Cloud Platform, including:
- BigQuery
- BigQuery ML
- Datalab
For more information on BigQuery costs, see the BigQuery pricing page.
For more information on BigQuery ML costs, see the BigQuery ML pricing page.
For more information on Datalab costs, see the Datalab pricing page.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Step one: Set up Datalab
In this tutorial, you set up Datalab using Cloud Shell. Datalab provides a Jupyter-based notebook environment you use to work with BigQuery ML and BigQuery.
Before you set up Datalab, you must enable the Compute Engine API and the Cloud Source Repositories API. For information on enabling these APIs, see Enabling and disabling services.
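For example, you can enable both APIs from Cloud Shell with a single gcloud command (this assumes the Google Cloud CLI is already configured to use your project):
gcloud services enable compute.googleapis.com sourcerepo.googleapis.com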
To set up Datalab on a Compute Engine VM:
Go to the Cloud console.
Click Activate Cloud Shell.
In the Cloud Shell window, enter the following command to configure the Google Cloud CLI to use us-central1-a as your default Compute Engine zone. This zone will contain your Datalab instance.
gcloud config set compute/zone us-central1-a
Enter the following command to create a Compute Engine VM (named mltutorial) used to run the Datalab container. This command also creates an SSH connection to your VM and maps the remote Datalab web interface to localhost port 8081. If prompted, enter the number corresponding to your default Compute Engine zone.
datalab create mltutorial
Once you see the message "The connection to Datalab is now open and will remain until this command is killed", you are connected to the instance. The command line interface may no longer produce output once the SSH connection is established.
In the Cloud Shell window, click the Web preview icon.
Choose Change port.
In the Change Preview Port dialog, enter 8081 and then click Change and Preview. This opens Datalab in a new tab.
If your connection to the instance is lost, you can reestablish the SSH tunnel that maps the Datalab web interface to local port 8081 by entering the following command:
gcloud compute ssh --quiet --zone "us-central1-a" --ssh-flag="-N" --ssh-flag="-L" --ssh-flag="localhost:8081:localhost:8080" "${USER}@mltutorial"
Click Notebook to create a new notebook. The notebook opens in a new tab.
In the title bar, click Untitled Notebook to change the notebook's name.
In the Rename Notebook dialog, enter BigQuery ML tutorial and then click Ok.
In the first code cell, enter the following to update to the latest version of the BigQuery Python client library.
!pip install --upgrade google-cloud-bigquery
Click Run > Run from this cell. The command output appears below the code block. The installation is complete when you see the following message:
Successfully installed ...
The code produces quite a bit of output. To hide the output, click the cell menu and choose Collapse to collapse the cell.
Click Add Code to create a new code cell.
Enter the following code to import the BigQuery Python client library and initialize a client. The BigQuery client is used to send and receive messages from the BigQuery API.
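A minimal version of this cell looks like the following; the client picks up the project and credentials configured in the Datalab environment.
# Import the BigQuery Python client library and create a client
# using the notebook's default project and credentials.
from google.cloud import bigquery
client = bigquery.Client()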
Click Run > Run from this cell. This command produces no output.
Step two: Create your dataset
Next, you create a BigQuery dataset to store your ML model. To create your dataset:
Click Add Code to create a new code cell.
Enter the following command to create your dataset.
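One way to create the dataset is with the bq command-line tool, which is available in the Datalab environment; you could also create it with the Python client's create_dataset method. For example:
# Create a dataset named bqml_tutorial in your default project
!bq mk bqml_tutorial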
Click Run > Run from this cell. The command output should look like the following:
Dataset '[project_ID]:bqml_tutorial' successfully created.
Step three: Generate your training and feature data
Next, you generate your training and feature data. First you generate the input
features of the model by querying the
bigquery-public-data.ncaa_basketball.mbb_teams_games_sr
table in the NCAA
Basketball Data public dataset. After generating the input features, you
generate your training data.
Generate the input features (feature engineering)
The input features include the mean and standard deviation of previous game statistics for both home and away teams, computed over different time windows: the 10, 5, and 1 games before the current game. The team-id column (for both home and away teams) and the season column are also used as one-hot encoded features, which contribute average per-team and per-season statistics to the model.
Because of the length of the query that is used to generate your input features, you use a saved query in the classic BigQuery web UI to create the table.
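The full saved query is too long to reproduce here, but its core building block is a windowed aggregate over each team's previous games. The following sketch illustrates the idea for a single statistic and a single 10-game window; the column names are illustrative, and the saved query computes many more statistics and windows.
#standardSQL
-- Illustrative sketch only: mean three point attempts over a team's previous 10 games
SELECT
  team_id,
  game_id,
  season,
  AVG(three_points_att) OVER (
    PARTITION BY team_id
    ORDER BY scheduled_date
    ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING) AS three_points_att_avg_last_10
FROM
  `bigquery-public-data.ncaa_basketball.mbb_teams_games_sr`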
To generate your input feature data:
Click the button below to open the input data query.
As time permits, examine the query.
Below the query window, click Show Options.
In the Destination Table section, click Select Table.
In the Select Destination Table dialog:
- For Project, choose your project.
- For Dataset, choose bqml_tutorial.
- In the Table ID field, enter cume_games.
- Click OK.
Click Run query. This creates a query job that writes the query results to the table you specified.
Alternatively, if you forget to specify a destination table before running your query, you can copy the temporary table to a permanent table by clicking the Save as Table button in the results window.
The query will take a few seconds to complete. After the query runs, your table (cume_games) appears in the navigation panel of the BigQuery web UI.
Generate your training data
After generating the input features, you generate your training data. The
training data is written to the bqml_tutorial.wide_games
table. When you
generate your training data, you exclude a game to use for predicting outcomes.
In this query, you self-join the cume_games
table you created by running the
previous query. You do this to generate both the home team and away team
statistics before the current game.
Because of the length of the query that is used to generate your training data, you use a saved query in the classic BigQuery web UI to create the table.
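Again, the saved query is long, but the self-join at its heart looks roughly like the following sketch. The column names are illustrative; the real query carries every engineered feature through for both teams and sums the two teams' attempts into the total_three_points_att label.
#standardSQL
-- Illustrative sketch only: pair the home-team and away-team rows for each game
SELECT
  home.game_id,
  home.season,
  home.three_points_att_avg_last_10 AS home_three_points_att_avg_last_10,
  away.three_points_att_avg_last_10 AS away_three_points_att_avg_last_10,
  home.three_points_att + away.three_points_att AS total_three_points_att
FROM
  `bqml_tutorial.cume_games` AS home
JOIN
  `bqml_tutorial.cume_games` AS away
ON
  home.game_id = away.game_id
  -- a home/away indicator in the real query keeps one row per game
  AND home.team_id != away.team_id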
To generate your training data:
Click the button below to open the input data query.
As time permits, examine the query.
Below the query window, click Show Options.
In the Destination Table section, click Select Table.
In the Select Destination Table dialog:
- For Project, choose your project.
- For Dataset, choose bqml_tutorial.
- In the Table ID field, enter wide_games.
- Click OK.
Click Run query. This creates a query job that writes the query results to the table you specified.
Alternatively, if you forget to specify a destination table before running your query, you can copy the temporary table to a permanent table by clicking the Save as Table button in the results window.
The query will take a few seconds to complete. After the query runs, your table (wide_games) appears in the navigation panel of the BigQuery web UI.
Step four: Create your model
Next, you create a linear regression model. The model is used to predict the combined three point field goal attempts based on the previous game statistics. The standard SQL query uses a CREATE MODEL statement to create and train the model.
To run the CREATE MODEL query to create your model:
Switch to your Datalab notebook.
Click Add Code to create a new code cell.
The BigQuery Python client library provides a magic command that allows you to run queries with minimal code. To load the magic commands from the client library, enter the following code.
%load_ext google.cloud.bigquery
Click Run > Run from this cell. This command produces no output.
Click Add Code to create a new code cell.
The BigQuery client library provides a cell magic, %%bigquery, which runs a SQL query and returns the results as a Pandas DataFrame. Enter the following standard SQL query in the cell. The #standardSQL prefix is not required for the client library; standard SQL is the default query syntax.
%%bigquery
CREATE OR REPLACE MODEL `bqml_tutorial.ncaa_model`
OPTIONS(
  model_type='linear_reg',
  max_iteration=50) AS
SELECT
  * EXCEPT(
    game_id, season, scheduled_date,
    total_three_points_made, total_three_points_att),
  total_three_points_att AS label
FROM
  `bqml_tutorial.wide_games`
WHERE
  # remove the game to predict
  game_id != 'f1063e80-23c7-486b-9a5e-faa52beb2d83'
Click Run > Run from this cell.
The query takes several minutes to complete. After the first iteration is complete, your model (ncaa_model) appears in the navigation panel of the BigQuery web UI. Because the query uses a CREATE MODEL statement, you do not see query results. The output is an empty string.
Step five: Get training statistics
To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the BigQuery web UI. In this tutorial, you use the ML.TRAINING_INFO function.
A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.
Loss is the penalty for a bad prediction — a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
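For example, with mean squared error as the loss (the metric that linear regression minimizes), a prediction of 43 three point attempts for a game that actually had 50 contributes (50 - 43)^2 = 49 to the loss for that example; training searches for the weights that make this penalty as small as possible on average.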
To see the model training statistics that were generated when you ran the CREATE MODEL query:
Click Add Code to create a new code cell.
Enter the following standard SQL query in the cell.
%%bigquery
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `bqml_tutorial.ncaa_model`)
Click Run > Run from this cell.
When the query is complete, the results appear below the query.
The Training Data Loss column represents the loss metric calculated after the given iteration on the training dataset. Since you performed a linear regression, this column is the mean squared error. The Evaluation Data Loss column is the same loss metric calculated on the holdout dataset (data that is held back from training to validate the model).
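If you want just the most recent iteration, you can also filter the function output yourself. In the raw ML.TRAINING_INFO output these metrics are returned in the loss and eval_loss columns, so a query like the following (a minimal example) returns the final iteration's losses:
%%bigquery
SELECT
  iteration,
  loss AS training_data_loss,
  eval_loss AS evaluation_data_loss
FROM
  ML.TRAINING_INFO(MODEL `bqml_tutorial.ncaa_model`)
ORDER BY
  iteration DESC
LIMIT 1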
For more details on the ML.TRAINING_INFO function, see the BigQuery ML Syntax Reference.
Step six: Evaluate your model
After creating your model, you evaluate the performance of the model using the ML.EVALUATE function.
To run the ML.EVALUATE query that evaluates the model:
Click Add Code to create a new code cell.
Enter the following standard SQL query in the cell.
%%bigquery
WITH eval_table AS (
  SELECT
    *,
    total_three_points_att AS label
  FROM
    `bqml_tutorial.wide_games`)
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.ncaa_model`, TABLE eval_table)
Click Run > Run from this cell.
When the query is complete, the results appear below the query.
Because you performed a linear regression, the results include the following columns:
- mean_absolute_error
- mean_squared_error
- mean_squared_log_error
- median_absolute_error
- r2_score
- explained_variance
An important metric in the evaluation results is the mean absolute error. Mean absolute error is the average difference between the predicted label and the actual label. In this case it is about 7.
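As a sanity check, you can reproduce this metric directly from the model's predictions. The following cell is a sketch that computes the same mean absolute error using ML.PREDICT over the evaluation data:
%%bigquery
WITH eval_table AS (
  SELECT
    *,
    total_three_points_att AS label
  FROM
    `bqml_tutorial.wide_games`)
SELECT
  AVG(ABS(predicted_label - label)) AS mean_absolute_error
FROM
  ML.PREDICT(MODEL `bqml_tutorial.ncaa_model`, TABLE eval_table)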
Step seven: Use your model to predict outcomes
Now that you have evaluated your model, the next step is to use the ML.PREDICT function to predict the total three point field goal attempts in the 2018 NCAA final game: Michigan versus Villanova.
To run the query that uses the ML.PREDICT function:
Click Add Code to create a new code cell.
Enter the following standard SQL query in the cell.
%%bigquery
WITH game_to_predict AS (
  SELECT
    *
  FROM
    `bqml_tutorial.wide_games`
  WHERE
    game_id = 'f1063e80-23c7-486b-9a5e-faa52beb2d83')
SELECT
  truth.game_id AS game_id,
  total_three_points_att,
  predicted_total_three_points_att
FROM (
  SELECT
    game_id,
    predicted_label AS predicted_total_three_points_att
  FROM
    ML.PREDICT(MODEL `bqml_tutorial.ncaa_model`, TABLE game_to_predict)) AS predict
JOIN (
  SELECT
    game_id,
    total_three_points_att AS total_three_points_att
  FROM
    game_to_predict) AS truth
ON
  predict.game_id = truth.game_id
Click Run > Run from this cell.
When the query is complete, the results appear below the query. The results should look like the following. Because model training is not deterministic, your results may differ.
The total_three_points_att value is the actual number of three point field goal attempts in the final game: 50. The model's prediction is 43.41.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the Datalab VM.
Deleting your Datalab VM
Deleting your project removes the Datalab VM. If you do not want to delete the Cloud Platform project, you can delete the Datalab VM.
To delete the Datalab VM:
Open the Compute Engine VM Instances page.
Check the mltutorial instance and then click Delete.
When prompted, click Delete.
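Alternatively, you can delete the instance from Cloud Shell with the gcloud CLI (this assumes the VM was created in the us-central1-a zone, as earlier in this tutorial):
gcloud compute instances delete mltutorial --zone us-central1-a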
Deleting your project
To delete the project:
- In the Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about Datalab, see the Datalab documentation.