Using BigQuery ML to predict basketball outcomes

This tutorial uses BigQuery ML to predict three point field goal attempts in basketball. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.

In this tutorial, you'll use the sample NCAA Basketball Data public dataset for BigQuery to create a model that predicts three point field goal attempts.

Objectives

In this tutorial, you use:

  • BigQuery ML to create a linear regression model using the CREATE MODEL statement
  • The ML.EVALUATE function to evaluate the ML model
  • The ML.PREDICT function to make predictions using the ML model

Costs

This tutorial uses billable components of Cloud Platform, including:

  • Google BigQuery
  • BigQuery ML
  • Cloud Datalab

You incur charges for:

  • Storing your ML model and training data in BigQuery
    • The first 10 GB of storage is free each month.
  • Querying data in BigQuery
    • The first 1 TB is free each month.
    • If you are using flat-rate pricing, query costs are included in the monthly flat-rate price.
  • Running BigQuery ML SQL statements
  • Running Cloud Datalab on a Compute Engine VM
    • For more information, see the Cloud Datalab Pricing page.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.

    Enable the API

Audience

This is an advanced tutorial intended for Data Scientists.

A Data Scientist uses BigQuery standard SQL and BigQuery ML to:

  • Explore, analyze, and visualize data
  • Clean data sets for use in statistical models
  • Create statistical and ML models
  • Create, train, evaluate, and improve ML algorithms and models
  • Explore data in a data warehouse such as BigQuery

A Data Scientist uses various tools including:

  • The BigQuery command-line tool (to query data using standard SQL)
  • Statistical software such as RStudio and Matlab
  • Cloud Datalab (Jupyter notebooks)
  • Cloud Machine Learning Engine
  • Programming languages such as Python and Java
  • ML libraries such as MLlib and TensorFlow
  • Visualization tools such as Cloud Datalab and Data Studio

Because this tutorial is intended for Data Scientists, many details related to ML concepts are not explained. To see a more detailed introductory tutorial that uses the BigQuery web UI, go to Getting Started with BigQuery ML for Data Analysts.

To see descriptions of the concepts presented here, see the BigQuery ML documentation.

Step one: Set up Cloud Datalab

In this tutorial, you set up Cloud Datalab using Google Cloud Shell. Cloud Datalab provides a Jupyter-based notebook environment you use to work with BigQuery ML and BigQuery.

Before you set up Cloud Datalab, you must enable the Compute Engine API and the Cloud Source Repositories API. For information on enabling these APIs, see Enabling and Disabling Services.

To set up Cloud Datalab on a Compute Engine VM:

  1. Go to the Cloud Platform Console.

    Cloud Platform Console

  2. Click Activate Cloud Shell.

    Activate Cloud Shell icon

  3. In the Cloud Shell window, enter the following command to configure the gcloud tool to use us-central1-a as your default zone for the Cloud SDK. This zone will contain your Cloud Datalab instance.

    gcloud config set compute/zone us-central1-a
    
  4. Enter the following command to create a Compute Engine VM (named mltutorial) used to run the Cloud Datalab container. This command also creates an SSH connection to your VM and maps the remote Cloud Datalab web interface to localhost port 8081. If prompted, enter the number corresponding to your default Compute Engine zone.

    datalab create mltutorial
    

    Once you see the message The connection to Datalab is now open and will remain until this command is killed, you are connected to the instance. The command line interface may no longer produce output once the SSH connection is established.

  5. In the Cloud Shell window, click the web preview icon Web preview icon.

  6. Choose Change port.

  7. In the Change Preview Port dialog, enter 8081 and then click Change and Preview. This opens Cloud Datalab in a new tab.

    If your SSH session times out, you can re-establish the connection by entering the following command:

    gcloud compute ssh --quiet --zone "us-central1-a" --ssh-flag="-N" --ssh-flag="-L" --ssh-flag="localhost:8081:localhost:8080" "${USER}@mltutorial"
    
  8. Click Notebook to create a new notebook. The notebook opens in a new tab.

  9. In the title bar, click Untitled Notebook to change the notebook's name.

  10. In the Rename Notebook dialog, enter BigQuery ML Tutorial and then click Ok.

  11. In the first code cell, enter the following to update to the latest version of the BigQuery Python Client Library.

    !pip install --upgrade google-cloud-bigquery
    
  12. Click Run > Run from this cell. The command output appears below the code block. The installation is complete when you see the following message:

    Successfully installed ...

  13. The code produces quite a bit of output. To hide the output, click the cell menu and choose Collapse to collapse the cell.

    Cell menu

  14. Click Add Code to create a new code cell.

  15. Enter the following code to import the BigQuery Python Client Library and initialize a client. The BigQuery client is used to send and receive messages from the BigQuery API.

    from google.cloud import bigquery
    client = bigquery.Client()
  16. Click Run > Run from this cell. This command produces no output.

Step two: Create your dataset

Next, you create a BigQuery dataset to store your ML model. To create your dataset:

  1. Click Add Code to create a new code cell.

  2. Enter the following command to create your dataset.

    dataset = bigquery.Dataset(client.dataset('bqml_tutorial'))
    dataset.location = 'US'
    client.create_dataset(dataset)
  3. Click Run > Run from this cell. The command output should look like the following:

    Dataset '[project_ID]:bqml_tutorial' successfully created.

Step three: Generate your training and feature data

Next, you generate your training and feature data. First you generate the input features of the model by querying the bigquery-public-data.ncaa_basketball.mbb_teams_games_sr table in the NCAA Basketball Data public dataset. After generating the input features, you generate your training data.

Generate the input features (feature engineering)

The input features include the mean and standard deviation of previous game statistics for both home teams and away teams over different time windows: the 10, 5, and 1 games before the current game. The team ID column (for both home and away teams) and the season column are also used as one-hot encoded features, which contribute the average statistics to the model.
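The windowed statistics described above can be sketched with pandas. The single-team frame and the bare `three_points_att` column are illustrative assumptions, not the saved query's actual schema; `shift(1)` keeps each window strictly before the current game.

```python
import pandas as pd

# Hypothetical per-game stats for one team, in chronological order.
games = pd.DataFrame({
    "three_points_att": [20, 25, 18, 30, 22, 27],
})

# For each game, compute the mean and std of the PREVIOUS n games.
# shift(1) excludes the current game, mirroring "before the current game".
for n in [10, 5, 1]:
    prev = games["three_points_att"].shift(1).rolling(n, min_periods=1)
    games[f"mean_prev_{n}"] = prev.mean()
    games[f"std_prev_{n}"] = prev.std()

print(games)
```

The saved query produces the same kind of columns with SQL analytic functions, using window frames along the lines of `ROWS BETWEEN 10 PRECEDING AND 1 PRECEDING`.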

Because of the length of the query that is used to generate your input features, you use a saved query in the Classic BigQuery web UI to create the table.

To generate your input feature data:

  1. Click the button below to open the input data query.

    Open the feature input query

    As time permits, examine the query.

  2. Below the query window, click Show Options.

  3. In the Destination Table section, click Select Table.

  4. In the Select Destination Table dialog:

    1. For Project, choose your project.
    2. For Dataset, choose bqml_tutorial.
    3. In the Table ID field, enter cume_games.
    4. Click OK.
  5. Click Run query. This creates a query job that writes the query results to the table you specified.

    Alternatively, if you forget to specify a destination table before running your query, you can copy the temporary table to a permanent table by clicking the Save as Table button in the results window.

  6. The query will take a few seconds to complete. After the query runs, your table (cume_games) appears in the navigation panel of the BigQuery web UI.

Generate your training data

After generating the input features, you generate your training data. The training data is written to the bqml_tutorial.wide_games table. When you generate your training data, you exclude a game to use for predicting outcomes.

In this query, you self-join the cume_games table you created by running the previous query. You do this to generate both the home team and away team statistics before the current game.
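In miniature, the self-join works like the following pandas sketch; `game_id`, `is_home`, and `stat` are stand-in column names for illustration, not the table's real schema.

```python
import pandas as pd

# Toy per-team cumulative stats: one row per (game, team).
cume = pd.DataFrame({
    "game_id": ["g1", "g1", "g2", "g2"],
    "team":    ["A",  "B",  "A",  "C"],
    "is_home": [True, False, False, True],
    "stat":    [10,   12,    11,   9],
})

# Join home-team rows to away-team rows of the same game, so each output
# row carries both teams' pre-game statistics side by side.
home = cume[cume["is_home"]].rename(columns={"team": "home_team", "stat": "home_stat"})
away = cume[~cume["is_home"]].rename(columns={"team": "away_team", "stat": "away_stat"})
wide = home.merge(away, on="game_id", suffixes=("_h", "_a"))
print(wide[["game_id", "home_team", "home_stat", "away_team", "away_stat"]])
```

The saved query does the equivalent in SQL by joining cume_games to itself on the game ID, filtering one side to home teams and the other to away teams.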

Because of the length of the query that is used to generate your training data, you use a saved query in the BigQuery web UI to create the table.

To generate your training data:

  1. Click the button below to open the input data query.

    Open the training data query

    As time permits, examine the query.

  2. Below the query window, click Show Options.

  3. In the Destination Table section, click Select Table.

  4. In the Select Destination Table dialog:

    1. For Project, choose your project.
    2. For Dataset, choose bqml_tutorial.
    3. In the Table ID field, enter wide_games.
    4. Click OK.
  5. Click Run query. This creates a query job that writes the query results to the table you specified.

    Alternatively, if you forget to specify a destination table before running your query, you can copy the temporary table to a permanent table by clicking the Save as Table button in the results window.

  6. The query will take a few seconds to complete. After the query runs, your table (wide_games) appears in the navigation panel of the BigQuery web UI.

Step four: Create your model

Next, you create a linear regression model. The model is used to predict the combined three point field goal attempts based on the previous game statistics. The standard SQL query uses a CREATE MODEL statement, which both creates and trains the model.
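As background for what CREATE MODEL fits here: a linear regression finds the weights that minimize the squared error over the training examples. A minimal numpy sketch with toy data (not the tutorial's actual features):

```python
import numpy as np

# Toy feature matrix (think: windowed team stats) and label
# (three point attempts). The column of 1s acts as the bias term.
X = np.array([[1.0, 20.0],
              [1.0, 25.0],
              [1.0, 30.0]])
y = np.array([40.0, 50.0, 60.0])

# Ordinary least squares: weights minimizing ||X @ w - y||^2,
# which is what a linear regression model learns.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # [bias, slope]
```

BigQuery ML solves the same kind of minimization at scale, iterating up to the configured maximum number of iterations.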

To run the CREATE MODEL query to create your model:

  1. Switch to your Cloud Datalab notebook.

  2. Click Add Code to create a new code cell.

  3. The BigQuery Python Client Library provides a magic command that allows you to run queries with minimal code. To load the magic commands from the client library, enter the following code.

    %load_ext google.cloud.bigquery
    
  4. Click Run > Run from this cell. This command produces no output.

  5. Click Add Code to create a new code cell.

  6. The BigQuery client library provides a cell magic, %%bigquery, which runs a SQL query and returns the results as a Pandas DataFrame. Enter the following standard SQL query in the cell. The #standardSQL prefix is not required for the Client Library. Standard SQL is the default query syntax.

Jupyter

%%bigquery
CREATE OR REPLACE MODEL bqml_tutorial.ncaa_model OPTIONS(
  model_type='linear_reg', max_iterations=50) AS
SELECT
  * EXCEPT(
    game_id, season, scheduled_date,
    total_three_points_made,
    total_three_points_att),
  total_three_points_att as label
FROM
  bqml_tutorial.wide_games
WHERE
  # remove the game to predict
  game_id != 'f1063e80-23c7-486b-9a5e-faa52beb2d83'

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

sql = """
    CREATE OR REPLACE MODEL `bqml_tutorial.ncaa_model`
    OPTIONS (
        model_type='linear_reg',
        max_iterations=50 ) AS
    SELECT
        * EXCEPT (
            game_id, season, scheduled_date,
            total_three_points_made,
            total_three_points_att),
        total_three_points_att as label
    FROM
        `bqml_tutorial.wide_games`
    WHERE
        # remove the game to predict
        game_id != 'f1063e80-23c7-486b-9a5e-faa52beb2d83'
"""
df = client.query(sql).to_dataframe()
print(df)

  7. Click Run > Run from this cell.

  8. The query takes several minutes to complete. After the first iteration is complete, your model (ncaa_model) appears in the navigation panel of the BigQuery web UI. Because the query uses a CREATE MODEL statement, you do not see query results. The output is an empty string.

Step five: Get training statistics

To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the BigQuery web UI. (This functionality is not currently available in the classic BigQuery web UI.) In this tutorial, you use the ML.TRAINING_INFO function.

A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.

Loss is the penalty for a bad prediction — a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
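For the linear regression used in this tutorial, the loss on a single example is the squared error, and training minimizes its average over the dataset. A toy sketch with made-up numbers:

```python
# Squared-error loss for a single example, and its average over a dataset
# (the mean squared error that linear regression minimizes).
def squared_loss(predicted, actual):
    return (predicted - actual) ** 2

labels      = [50, 44, 61]   # actual three point attempts (toy values)
predictions = [47, 44, 55]   # model outputs (toy values)

losses = [squared_loss(p, y) for p, y in zip(predictions, labels)]
mse = sum(losses) / len(losses)
print(losses, mse)  # [9, 0, 36] 15.0
```

A perfect prediction (the middle example) contributes zero loss; training searches for weights that drive the average toward its minimum.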

To see the model training statistics that were generated when you ran the CREATE MODEL query:

  1. Click Add Code to create a new code cell.

  2. Enter the following standard SQL query in the cell.

Jupyter

%%bigquery
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `bqml_tutorial.ncaa_model`)

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

sql = """
    SELECT
        *
    FROM
        ML.TRAINING_INFO(MODEL `bqml_tutorial.ncaa_model`)
"""
df = client.query(sql).to_dataframe()
print(df)

  3. Click Run > Run from this cell.

  4. When the query is complete, the results appear below the query. The results should look like the following:

    ML.TRAINING_INFO output

    The Training Data Loss column represents the loss metric calculated after the given iteration on the training dataset. Since you performed a linear regression, this column is the mean squared error. The Evaluation Data Loss column is the same loss metric calculated on the holdout dataset (data that is held back from training to validate the model).

    For more details on the ML.TRAINING_INFO function, see the BigQuery ML Syntax Reference.

Step six: Evaluate your model

After creating your model, you evaluate the performance of the model using the ML.EVALUATE function.

To run the ML.EVALUATE query that evaluates the model:

  1. Click Add Code to create a new code cell.

  2. Enter the following standard SQL query in the cell.

Jupyter

%%bigquery
WITH eval_table AS (
SELECT
  *,
  total_three_points_att AS label
FROM
  `bqml_tutorial.wide_games` )
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.ncaa_model`,
    TABLE eval_table)

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

sql = """
    WITH eval_table AS (
        SELECT
            *,
            total_three_points_att AS label
        FROM
            `bqml_tutorial.wide_games` )
    SELECT
        *
    FROM
        ML.EVALUATE(MODEL `bqml_tutorial.ncaa_model`,
            TABLE eval_table)
"""
df = client.query(sql).to_dataframe()
print(df)

  3. Click Run > Run from this cell.

  4. When the query is complete, the results appear below the query. The results should look like the following:

    ML.EVALUATE output

    Because you performed a linear regression, the results include the following columns:

    • mean_absolute_error
    • mean_squared_error
    • mean_squared_log_error
    • median_absolute_error
    • r2_score
    • explained_variance

An important metric in the evaluation results is the mean absolute error. Mean absolute error is the average difference between the predicted label and the actual label. In this case it is about 7.
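Mean absolute error is straightforward to compute by hand; the values below are toy numbers, not the tutorial's actual evaluation output:

```python
# Mean absolute error: average |predicted - actual| across all examples.
labels      = [50, 44, 61, 38]   # actual three point attempts (toy values)
predictions = [43, 49, 55, 46]   # model outputs (toy values)

mae = sum(abs(p - y) for p, y in zip(predictions, labels)) / len(labels)
print(mae)  # 6.5
```

An MAE of about 7 means the model's predicted attempt total is, on average, about 7 attempts away from the true total.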

Step seven: Use your model to predict outcomes

Now that you have evaluated your model, the next step is to use the ML.PREDICT function to predict the total three point field goal attempts in the 2018 NCAA final game: Michigan versus Villanova.

To run the query that uses the ML.PREDICT function:

  1. Click Add Code to create a new code cell.

  2. Enter the following standard SQL query in the cell.

Jupyter

%%bigquery
WITH game_to_predict AS (
  SELECT
    *
  FROM
    `bqml_tutorial.wide_games`
  WHERE
    game_id = 'f1063e80-23c7-486b-9a5e-faa52beb2d83' )
SELECT
  truth.game_id AS game_id,
  total_three_points_att,
  predicted_total_three_points_att
FROM (
  SELECT
    game_id,
    predicted_label AS predicted_total_three_points_att
  FROM
    ML.PREDICT(MODEL `bqml_tutorial.ncaa_model`,
      TABLE game_to_predict) ) AS predict
JOIN (
  SELECT
    game_id,
    total_three_points_att AS total_three_points_att
  FROM
    game_to_predict) AS truth
ON
  predict.game_id = truth.game_id

Python

Before trying this sample, follow the Python setup instructions in the BigQuery Quickstart Using Client Libraries. For more information, see the BigQuery Python API reference documentation.

sql = """
    WITH game_to_predict AS (
        SELECT
            *
        FROM
            `bqml_tutorial.wide_games`
        WHERE
            game_id='f1063e80-23c7-486b-9a5e-faa52beb2d83' )
    SELECT
        truth.game_id AS game_id,
        total_three_points_att,
        predicted_total_three_points_att
    FROM (
        SELECT
            game_id,
            predicted_label AS predicted_total_three_points_att
        FROM
            ML.PREDICT(MODEL `bqml_tutorial.ncaa_model`,
            TABLE game_to_predict) ) AS predict
    JOIN (
        SELECT
            game_id,
            total_three_points_att AS total_three_points_att
        FROM
            game_to_predict) AS truth
    ON
        predict.game_id = truth.game_id
"""
df = client.query(sql).to_dataframe()
print(df)

  3. Click Run > Run from this cell.

  4. When the query is complete, the results appear below the query. The results should look like the following. Because model training is not deterministic, your results may differ.

    ML.PREDICT output

The total_three_points_att value is the actual number of three point field goal attempts in the final game: 50. The model's prediction is 43.41.
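That single-game error can be checked against the mean absolute error of about 7 from the evaluation step:

```python
actual = 50        # three point attempts in the 2018 final (from the table)
predicted = 43.41  # value returned by ML.PREDICT

error = abs(actual - predicted)
print(round(error, 2))  # 6.59, in line with the model's MAE of about 7
```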

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  • You can delete the project you created.
  • Or you can keep the project and delete the Cloud Datalab VM.

Deleting your Cloud Datalab VM

Deleting your project removes the Cloud Datalab VM. If you do not want to delete the Cloud Platform project, you can delete the Cloud Datalab VM.

To delete the Cloud Datalab VM:

  1. Open the Compute Engine VM Instances page.

    Go to the VM instances page

  2. Check the mltutorial instance and then click Delete.

  3. When prompted, click Delete.

Deleting your project

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.
