Getting Started with BigQuery ML for Data Scientists

This tutorial introduces data scientists to BigQuery ML. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.

In this tutorial, you use the Google Analytics sample dataset for BigQuery to create a model that predicts whether a website visitor will make a transaction. For information on the schema of the Analytics dataset, see BigQuery Export schema in the Google Analytics Help Center.

Objectives

In this tutorial, you use:

  • BigQuery ML to create a binary logistic regression model using the CREATE MODEL statement
  • The ML.EVALUATE function to evaluate the ML model
  • The ML.PREDICT function to make predictions using the ML model

Costs

This tutorial uses billable components of Cloud Platform, including:

  • Google BigQuery
  • BigQuery ML
  • Cloud Datalab

You incur charges for:

  • Storing your ML model and training data in BigQuery
    • The first 10 GB of storage is free each month.
  • Querying data in BigQuery
    • The first 1 TB is free each month.
    • If you are using flat-rate pricing, query costs are included in the monthly flat-rate price.
  • Running BigQuery ML SQL statements
  • Running Cloud Datalab on a Compute Engine VM
    • For more information, see the Cloud Datalab Pricing page.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the Manage resources page

  3. Make sure that billing is enabled for your project.

    Learn how to enable billing

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.

    Enable the API

Audience

This is an introductory tutorial that is intended for Data Scientists.

A Data Scientist uses BigQuery standard SQL and BigQuery ML to:

  • Explore, analyze, and visualize data
  • Clean data sets for use in statistical models
  • Create statistical and ML models
  • Create, train, evaluate, and improve ML algorithms and models
  • Explore data in a data warehouse such as BigQuery

A Data Scientist uses various tools including:

  • The BigQuery command-line tool (to query data using standard SQL)
  • Statistical software such as RStudio and Matlab
  • Cloud Datalab (Jupyter notebooks)
  • Cloud Machine Learning Engine
  • Programming languages such as Python and Java
  • ML libraries such as MLlib and TensorFlow
  • Visualization tools such as Cloud Datalab and Data Studio

Because this tutorial is intended for Data Scientists, many details related to ML concepts are not explained. To see a more detailed version of this tutorial that uses the BigQuery web UI, go to Getting Started with BigQuery ML for Data Analysts.

To see descriptions of the concepts presented here, see the BigQuery ML documentation.

Step one: Set up Cloud Datalab

In this tutorial, you set up Cloud Datalab using Google Cloud Shell. Cloud Datalab provides a Jupyter-based notebook environment you use to work with BigQuery ML and BigQuery.

Before you set up Cloud Datalab, you must enable the Compute Engine API and the Cloud Source Repositories API. For information on enabling these APIs, see Enabling and Disabling Services.

To set up Cloud Datalab on a Compute Engine VM:

  1. Go to the Cloud Platform Console.

    Cloud Platform Console

  2. Click Activate Google Cloud Shell.

    Activate Cloud Shell icon

  3. In the Cloud Shell window, enter the following command to configure the gcloud tool to use us-central1-a as your default zone for the Cloud SDK. This zone will contain your Cloud Datalab instance.

    gcloud config set compute/zone us-central1-a
    
  4. Enter the following command to create a Compute Engine VM (named mltutorial) used to run the Cloud Datalab container. This command also creates an SSH connection to your VM and maps the remote Cloud Datalab web interface to localhost port 8081. If prompted, enter the number corresponding to your default Compute Engine zone.

    datalab create mltutorial
    

    Once you see the message The connection to Datalab is now open and will remain until this command is killed, you are connected to the instance. The command line interface may no longer produce output once the SSH connection is established.

  5. In the Cloud Shell window, click the web preview icon Web preview icon.

  6. Choose Change port.

  7. In the Change Preview Port dialog, enter 8081 and then click Change and Preview. This opens Cloud Datalab in a new tab.

    If your connection to the Cloud Datalab instance is interrupted, you can re-establish the SSH tunnel by running the following command in Cloud Shell:

    gcloud compute ssh --quiet --zone "us-central1-a" --ssh-flag="-N" --ssh-flag="-L" --ssh-flag="localhost:8081:localhost:8080" "${USER}@mltutorial"
    
  8. Click Notebook to create a new notebook. The notebook opens in a new tab.

  9. In the title bar, click Untitled Notebook to change the notebook's name.

  10. In the Rename Notebook dialog, enter BigQuery ML Tutorial and then click Ok.

  11. In the first code cell, enter the following to update to the latest version of the BigQuery Python Client Library.

    !pip install --upgrade google-cloud-bigquery
    
  12. Click Run > Run from this cell. The command output appears below the code block. The installation is complete when you see the following message:

    Successfully installed ...

  13. The code produces quite a bit of output. To hide the output, click the cell menu and choose Collapse to collapse the cell.

    Cell menu

  14. Click Add Code to create a new code cell.

  15. Enter the following code to import the BigQuery Python Client Library and initialize a client. The BigQuery client is used to send and receive messages from the BigQuery API.

    from google.cloud import bigquery
    client = bigquery.Client()
    
  16. Click Run > Run from this cell. This command produces no output.

Step two: Create your dataset

Next, you create a BigQuery dataset to store your ML model. To create your dataset:

  1. Click Add Code to create a new code cell.

  2. Enter the following command to create your dataset.

    dataset = bigquery.Dataset(client.dataset('bqml_tutorial'))
    dataset.location = 'US'
    client.create_dataset(dataset)
    
  3. Click Run > Run from this cell. The command output should look like the following:

    Dataset '[project_ID]:bqml_tutorial' successfully created.

Step three: Create your model

Next, you create a logistic regression model using the Google Analytics sample dataset for BigQuery. The model is used to predict whether a website visitor will make a transaction. The standard SQL query uses a CREATE MODEL statement to create and train the model.

To run the CREATE MODEL query to create and train your model:

  1. Click Add Code to create a new code cell.

  2. The BigQuery Python Client Library provides a magic command that allows you to run queries with minimal code. To load the magic commands from the client library, enter the following code.

    %load_ext google.cloud.bigquery
    
  3. Click Run > Run from this cell. This command produces no output.

  4. Click Add Code to create a new code cell.

  5. The BigQuery client library provides a cell magic, %%bigquery, which runs a SQL query and returns the results as a Pandas DataFrame. Enter the following standard SQL query in the cell. The #standardSQL prefix is not required for the Client Library. Standard SQL is the default query syntax.

    %%bigquery
    CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
    OPTIONS(model_type='logistic_reg') AS
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
    

  6. Click Run > Run from this cell.

  7. The query takes several minutes to complete. After the first iteration is complete, your model (sample_model) appears in the navigation panel of the BigQuery web UI. Because the query uses a CREATE MODEL statement to create a model, you do not see query results. The output is an empty string.

Step four: Get training statistics

To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the BigQuery web UI. This functionality is not currently available in the BigQuery Classic web UI. In this tutorial, you use the ML.TRAINING_INFO function.

A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.

Loss is the penalty for a bad prediction — a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights that have low loss, on average, across all examples.
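The log loss used for logistic regression can be illustrated with a short Python sketch. This is for intuition only; BigQuery ML computes the loss internally during training.

```python
import math

def log_loss(label, predicted_prob):
    """Log loss for a single example: near zero when the prediction is
    confident and correct, growing without bound when it is confident
    and wrong."""
    # Clip the probability to avoid log(0) for extreme predictions.
    p = min(max(predicted_prob, 1e-15), 1 - 1e-15)
    return -math.log(p) if label == 1 else -math.log(1 - p)

# A confident, correct prediction has near-zero loss ...
print(round(log_loss(1, 0.99), 4))  # → 0.0101
# ... while a confident, wrong prediction is penalized heavily.
print(round(log_loss(1, 0.01), 4))  # → 4.6052
```

Training searches for the weights that minimize the average of this quantity over all training examples.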

To see the model training statistics that were generated when you ran the CREATE MODEL query:

  1. Click Add Code to create a new code cell.

  2. Enter the following standard SQL query in the cell.

    %%bigquery
    SELECT
      *
    FROM
      ML.TRAINING_INFO(MODEL `bqml_tutorial.sample_model`)
    

  3. Click Run > Run from this cell.

  4. When the query is complete, the results appear below the query. The results should look like the following:

    ML.TRAINING_INFO output

    The loss column represents the loss metric calculated after the given iteration on the training dataset. Since you performed a logistic regression, this column is the log loss. The eval_loss column is the same loss metric calculated on the holdout dataset (data that is held back from training to validate the model).

    For more details on the ML.TRAINING_INFO function, see the BigQuery ML Syntax Reference.

Step five: Evaluate your model

After creating your model, you evaluate the performance of the classifier using the ML.EVALUATE function. You can also use the ML.ROC_CURVE function for logistic-regression-specific metrics.

A class is one of a set of enumerated target values for a label. For example, in this tutorial you are using a binary classification model that detects transactions. The two classes are the values in the label column: 0 (no transaction) and 1 (transaction made).
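For a binary model, the predicted class comes from thresholding the model's predicted probability. A minimal sketch of that mapping, assuming the common default threshold of 0.5 (BigQuery ML lets you override the threshold in ML.PREDICT):

```python
def predicted_label(prob_of_transaction, threshold=0.5):
    """Map a predicted probability to one of the two classes:
    1 (transaction made) or 0 (no transaction)."""
    return 1 if prob_of_transaction >= threshold else 0

print(predicted_label(0.82))  # → 1
print(predicted_label(0.17))  # → 0
```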

To run the ML.EVALUATE query that evaluates the model:

  1. Click Add Code to create a new code cell.

  2. Enter the following standard SQL query in the cell.

    %%bigquery
    SELECT
      *
    FROM ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
      SELECT
        IF(totals.transactions IS NULL, 0, 1) AS label,
        IFNULL(device.operatingSystem, "") AS os,
        device.isMobile AS is_mobile,
        IFNULL(geoNetwork.country, "") AS country,
        IFNULL(totals.pageviews, 0) AS pageviews
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
    

  3. Click Run > Run from this cell.

  4. When the query is complete, the results appear below the query. The results should look like the following:

    ML.EVALUATE output

    Because you performed a logistic regression, the results include the following columns:

    • precision
    • recall
    • accuracy
    • f1_score
    • log_loss
    • roc_auc
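To build intuition for these metric columns, here is a plain-Python sketch that derives them from confusion-matrix counts. The counts are hypothetical, for illustration only; ML.EVALUATE computes the real values from your holdout data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts
    (true positives, false positives, false negatives, true negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1_score": f1}

# Hypothetical counts for a transaction classifier.
print(binary_metrics(tp=80, fp=20, fn=40, tn=860))
```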

Step six: Use your model to predict outcomes

Now that you have evaluated your model, the next step is to use it to predict outcomes. You use your model to predict the number of transactions made by website visitors from each country, and then to predict purchases per user.

To run the query that uses the model to predict the number of transactions:

  1. Click Add Code to create a new code cell.

  2. Enter the following standard SQL query in the cell.

    %%bigquery
    SELECT
      country,
      SUM(predicted_label) AS total_predicted_purchases
    FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
      SELECT
        IFNULL(device.operatingSystem, "") AS os,
        device.isMobile AS is_mobile,
        IFNULL(totals.pageviews, 0) AS pageviews,
        IFNULL(geoNetwork.country, "") AS country
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
      GROUP BY country
      ORDER BY total_predicted_purchases DESC
      LIMIT 10
    

  3. Click Run > Run from this cell.

  4. When the query is complete, the results appear below the query. The results should look like the following. Because model training is not deterministic, your results may differ.

    ML.PREDICT output one
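Conceptually, the outer query sums the per-session predicted labels within each country. The same aggregation can be sketched in plain Python, using hypothetical per-session predictions in place of the rows ML.PREDICT returns:

```python
from collections import defaultdict

# Hypothetical (country, predicted_label) pairs, one per session,
# standing in for the rows ML.PREDICT returns before aggregation.
sessions = [
    ("United States", 1), ("United States", 0), ("United States", 1),
    ("Canada", 1), ("Canada", 0), ("Japan", 0),
]

# SUM(predicted_label) ... GROUP BY country, in plain Python.
totals = defaultdict(int)
for country, label in sessions:
    totals[country] += label

# ORDER BY total_predicted_purchases DESC
for country, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(country, total)
```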

In the next example, you try to predict the number of transactions each website visitor will make. This query is similar to the previous one, except that it selects fullVisitorId instead of country and uses GROUP BY fullVisitorId to group the results by visitor ID.

To run the query that predicts purchases per user:

  1. Click Add Code to create a new code cell.

  2. Enter the following standard SQL query in the cell.

    %%bigquery
    SELECT
      fullVisitorId,
      SUM(predicted_label) AS total_predicted_purchases
    FROM ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
      SELECT
        IFNULL(device.operatingSystem, "") AS os,
        device.isMobile AS is_mobile,
        IFNULL(totals.pageviews, 0) AS pageviews,
        IFNULL(geoNetwork.country, "") AS country,
        fullVisitorId
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_*`
      WHERE
        _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
      GROUP BY fullVisitorId
      ORDER BY total_predicted_purchases DESC
      LIMIT 10
    

  3. Click Run > Run from this cell.

  4. When the query is complete, the results appear below the query. The results should look like the following:

    ML.PREDICT output two

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  • You can delete the project you created.
  • Or you can keep the project and delete the Cloud Datalab VM.

Deleting your Cloud Datalab VM

Deleting your project removes the Cloud Datalab VM. If you do not want to delete the Cloud Platform project, you can delete the Cloud Datalab VM.

To delete the Cloud Datalab VM:

  1. Open the Compute Engine VM Instances page.

    Go to the VM instances page

  2. Check the mltutorial instance and then click Delete.

  3. When prompted, click Delete.

Deleting your project

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the checkbox next to the project you want to delete, and then click Delete project.

  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next
