Create machine learning models in BigQuery ML
This tutorial shows you how to create a logistic regression model by using BigQuery ML in the Google Cloud console.
BigQuery ML lets you create and train machine learning models in BigQuery by using SQL queries. This helps make machine learning more approachable by letting you use familiar tools like the BigQuery SQL editor, and also increases development speed by removing the need to move data into a separate machine learning environment.
In this tutorial, you use the Google Analytics sample dataset for BigQuery to create a model that predicts whether a website visitor will make a transaction. For information on the schema of the Analytics dataset, see BigQuery export schema in the Analytics Help Center.
Objectives
This tutorial shows you how to perform the following tasks:
- Use the `CREATE MODEL` statement to create a binary logistic regression model.
- Use the `ML.EVALUATE` function to evaluate the model.
- Use the `ML.PREDICT` function to make predictions by using the model.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For more information on BigQuery costs, see the BigQuery pricing page.
For more information on BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Create a dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.

On the Create dataset page, do the following:

- For Dataset ID, enter `bqml_tutorial`.
- For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the `US` multi-region. For simplicity, store your dataset in the same location.
- Leave the remaining default settings as they are, and click Create dataset.
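If you prefer to stay in the query editor, you can also create the dataset with a DDL statement instead of the console UI. A minimal sketch, assuming your default project is selected:

```sql
-- Create the bqml_tutorial dataset in the US multi-region by using DDL
CREATE SCHEMA IF NOT EXISTS `bqml_tutorial`
OPTIONS(location = 'US');
```

Either approach produces the same dataset; the DDL form is convenient when you script your setup.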
Create a logistic regression model
Create a logistic regression model using the Analytics sample dataset for BigQuery.
SQL
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
```sql
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
```
The query takes several minutes to complete. After the first iteration is complete, your model (`sample_model`) appears in the navigation panel. Because the query uses a `CREATE MODEL` statement to create a model, you don't see query results.
Query details
The `CREATE MODEL` statement creates the model and then trains the model using the data retrieved by your query's `SELECT` statement.

The `OPTIONS(model_type='logistic_reg')` clause creates a logistic regression model. A logistic regression model splits input data into two classes, and then estimates the probability that the data is in one of the classes. What you are trying to detect, such as whether an email is spam, is represented by 1, and other values are represented by 0. The likelihood of a given value belonging to the class you are trying to detect is indicated by a value between 0 and 1. For example, if an email receives a probability estimate of 0.9, then there is a 90% probability that the email is spam.
This query's `SELECT` statement retrieves the following columns that are used by the model to predict the probability that a customer will complete a transaction:

- `totals.transactions`: the total number of ecommerce transactions within the session. If the number of transactions is `NULL`, the value in the `label` column is set to `0`. Otherwise, it is set to `1`. These values represent the possible outcomes. Creating an alias named `label` is an alternative to setting the `input_label_cols=` option in the `CREATE MODEL` statement.
- `device.operatingSystem`: the operating system of the visitor's device.
- `device.isMobile`: indicates whether the visitor's device is a mobile device.
- `geoNetwork.country`: the country from which the sessions originated, based on the IP address.
- `totals.pageviews`: the total number of page views within the session.
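As a sketch of the alternative mentioned above, the same model could be declared without the `label` alias by naming the label column explicitly through `input_label_cols`. The column name `has_transaction` here is a hypothetical choice for illustration:

```sql
-- Equivalent approach: name the label column instead of aliasing it as `label`
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg',
        input_label_cols=['has_transaction']) AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS has_transaction,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
```

With this form, the prediction output column is named after the label column, for example `predicted_has_transaction`.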
The `FROM` clause causes the query to train the model by using the `bigquery-public-data.google_analytics_sample.ga_sessions` sample tables. These tables are sharded by date, so you aggregate them by using a wildcard in the table name: `google_analytics_sample.ga_sessions_*`.
The `WHERE` clause, `_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'`, limits the number of tables scanned by the query. The date range scanned is August 1, 2016 to June 30, 2017.
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
View the model's loss statistics
Machine learning is about creating a model that can use data to make a prediction. The model is essentially a function that takes inputs and applies calculations to them to produce an output: a prediction.
Machine learning algorithms work by taking several examples where the prediction is already known (such as the historical data of user purchases) and iteratively adjusting various weights in the model so that the model's predictions match the true values. It does this by minimizing how wrong the model is using a metric called loss.
The expectation is that for each iteration, the loss should be decreasing, ideally to zero. A loss of zero means the model is 100% accurate.
When training the model, BigQuery ML automatically splits the input data into training and evaluation sets, in order to avoid overfitting the model. This is necessary so that the training algorithm doesn't fit itself so closely to the training data that it can't generalize to new examples.
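The automatic split can be controlled through training options. A sketch, assuming you want an explicit random 20% evaluation holdout instead of the default automatic behavior:

```sql
-- Control the evaluation holdout explicitly instead of relying on the default split
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg',
        data_split_method='RANDOM',
        data_split_eval_fraction=0.2) AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
```

A fixed random split makes training runs easier to compare, at the cost of the convenience of the automatic default.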
Use the Google Cloud console to see how the model's loss changes over the model's training iterations:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand bqml_tutorial > Models and then click sample_model.
Click the Training tab and look at the Loss graph. The Loss graph shows the change in the loss metric over the iterations on the training dataset. If you hold your cursor over the graph, you can see that there are lines for Training loss and Evaluation loss. Since you performed a logistic regression, the training loss value is calculated as log loss, using the training data. The evaluation loss is the log loss calculated on the evaluation data. Both loss types represent average loss values, averaged over all examples in the respective datasets for each iteration.
You can also see the results of the model training by using the `ML.TRAINING_INFO` function.
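For example, the following query (a minimal sketch) returns one row per training iteration, including the training and evaluation loss shown in the Loss graph:

```sql
-- Inspect per-iteration loss values for the trained model
SELECT
  iteration,
  loss,
  eval_loss,
  duration_ms
FROM
  ML.TRAINING_INFO(MODEL `bqml_tutorial.sample_model`)
ORDER BY
  iteration
```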
Evaluate the model
Evaluate the performance of the model by using the `ML.EVALUATE` function. The `ML.EVALUATE` function evaluates the predicted values generated by the model against the actual data. To calculate logistic regression-specific metrics, you can use the `ML.ROC_CURVE` SQL function or the `bigframes.ml.metrics.roc_curve` BigQuery DataFrames function.

In this tutorial, you are using a binary classification model that detects transactions. The values in the `label` column are the two classes generated by the model: `0` (no transactions) and `1` (transaction made).
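For instance, a minimal sketch of the `ML.ROC_CURVE` function, which returns per-threshold metrics you can use to plot the ROC curve for the model's evaluation data:

```sql
-- Compute ROC curve points for the trained model
SELECT
  threshold,
  recall,
  false_positive_rate
FROM
  ML.ROC_CURVE(MODEL `bqml_tutorial.sample_model`)
```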
SQL
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
```sql
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IF(totals.transactions IS NULL, 0, 1) AS label,
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(geoNetwork.country, "") AS country,
      IFNULL(totals.pageviews, 0) AS pageviews
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
```
The results should look like the following:
```
+-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| precision         | recall              | accuracy            | f1_score            | log_loss            | roc_auc             |
+-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| 0.468503937007874 | 0.11080074487895716 | 0.98534315834767638 | 0.17921686746987953 | 0.04624221101176898 | 0.98174125874125873 |
+-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
```
Because you performed a logistic regression, the results include the following columns:

- `precision`: a metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
- `recall`: a metric for classification models that answers the following question: out of all the possible positive labels, how many did the model correctly identify?
- `accuracy`: the fraction of predictions that a classification model got right.
- `f1_score`: a measure of the accuracy of the model. The F1 score is the harmonic mean of precision and recall. An F1 score's best value is 1 and its worst value is 0.
- `log_loss`: the loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels.
- `roc_auc`: the area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
Query details
The initial `SELECT` statement retrieves the columns from your model.

The `FROM` clause uses the `ML.EVALUATE` function against your model.

The nested `SELECT` statement and `FROM` clause are the same as those in the `CREATE MODEL` query.

The `WHERE` clause, `_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'`, limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data you're using to evaluate the predictive performance of the model. It was collected in the month immediately following the time period spanned by the training data.
Use the model to predict outcomes
Use the model to predict the number of transactions made by website visitors from each country.
SQL
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
```sql
SELECT
  country,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY
  country
ORDER BY
  total_predicted_purchases DESC
LIMIT 10
```
The results should look like the following:
```
+----------------+---------------------------+
| country        | total_predicted_purchases |
+----------------+---------------------------+
| United States  | 220                       |
| Taiwan         | 8                         |
| Canada         | 7                         |
| India          | 2                         |
| Turkey         | 2                         |
| Japan          | 2                         |
| Italy          | 1                         |
| Brazil         | 1                         |
| Singapore      | 1                         |
| Australia      | 1                         |
+----------------+---------------------------+
```
Query details
The initial `SELECT` statement retrieves the `country` column and sums the `predicted_label` column. The `predicted_label` column is generated by the `ML.PREDICT` function. When you use the `ML.PREDICT` function, the output column name for the model is `predicted_<label_column_name>`. For linear regression models, `predicted_label` is the estimated value of `label`. For logistic regression models, `predicted_label` is the label that best describes the given input data value, either `0` or `1`.
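In addition to `predicted_label`, `ML.PREDICT` for a classification model also returns a `predicted_<label_column_name>_probs` column containing the per-class probabilities behind each prediction. A sketch that inspects them for a few rows from a single day of data:

```sql
-- Inspect the per-class probabilities behind each predicted label
SELECT
  predicted_label,
  predicted_label_probs
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX = '20170701'))
LIMIT 10
```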
The `ML.PREDICT` function is used to predict results using your model.

The nested `SELECT` statement and `FROM` clause are the same as those in the `CREATE MODEL` query.

The `WHERE` clause, `_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'`, limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data for which you're making predictions. It was collected in the month immediately following the time period spanned by the training data.

The `GROUP BY` and `ORDER BY` clauses group the results by country and order them by the sum of the predicted purchases in descending order.

The `LIMIT` clause is used here to display only the top 10 results.
Predict purchases per user
Predict the number of transactions each website visitor will make.
SQL
This query is identical to the query in the previous section except for the `GROUP BY` clause. Here the `GROUP BY` clause, `GROUP BY fullVisitorId`, is used to group the results by visitor ID.
In the Google Cloud console, go to the BigQuery page.
In the query editor, run the following statement:
```sql
SELECT
  fullVisitorId,
  SUM(predicted_label) AS total_predicted_purchases
FROM
  ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
    SELECT
      IFNULL(device.operatingSystem, "") AS os,
      device.isMobile AS is_mobile,
      IFNULL(totals.pageviews, 0) AS pageviews,
      IFNULL(geoNetwork.country, "") AS country,
      fullVisitorId
    FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE
      _TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY
  fullVisitorId
ORDER BY
  total_predicted_purchases DESC
LIMIT 10
```
The results should look like the following:
```
+---------------------+---------------------------+
| fullVisitorId       | total_predicted_purchases |
+---------------------+---------------------------+
| 9417857471295131045 | 4                         |
| 112288330928895942  | 2                         |
| 2158257269735455737 | 2                         |
| 489038402765684003  | 2                         |
| 057693500927581077  | 2                         |
| 2969418676126258798 | 2                         |
| 5073919761051630191 | 2                         |
| 7420300501523012460 | 2                         |
| 0456807427403774085 | 2                         |
| 2105122376016897629 | 2                         |
+---------------------+---------------------------+
```
Clean up
To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.
You can delete the project you created, or keep the project and delete the dataset.
Delete the dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, select the bqml_tutorial dataset you created.
Click Actions > Delete.

In the Delete dataset dialog, confirm the delete command by typing `delete`.

Click Delete.
Delete the project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.