This tutorial introduces data analysts to BigQuery ML.
BigQuery ML enables users to create and execute machine learning models
in BigQuery using SQL queries. This tutorial introduces feature
engineering by using the TRANSFORM
clause. Using the TRANSFORM
clause, you
can specify all preprocessing
during model creation. The preprocessing is automatically applied during the
prediction and evaluation phases of machine learning.
In this tutorial, you use the
natality
sample table
to create a model that predicts the birth weight of a child based on the baby's
gender, the length of the pregnancy, and bucketized demographic information
about the mother. The natality
sample table contains information about every
birth in the United States over a 40 year period.
Objectives
In this tutorial, you use:
- BigQuery ML to create a linear regression model using the
CREATE MODEL
statement with theTRANSFORM
clause - The
ML.FEATURE_CROSS
andML.QUANTILE_BUCKETIZE
preprocessing functions - The
ML.EVALUATE
function to evaluate the ML model - The
ML.PREDICT
function to make predictions using the ML model
Costs
This tutorial uses billable components of Google Cloud, including:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create dataset.On the Create dataset page, do the following:
For Dataset ID, enter
bqml_tutorial
.For Location type, select Multi-region, and then select US (multiple regions in United States).
The public datasets are stored in the
US
multi-region. For simplicity, store your dataset in the same location.Leave the remaining default settings as they are, and click Create dataset.
Step two: Create your model
Next, create a linear regression model using the natality sample table for BigQuery. The following GoogleSQL query is used to create the model you use to predict the birth weight of a child.
#standardSQL CREATE MODEL `bqml_tutorial.natality_model` TRANSFORM(weight_pounds, is_male, gestation_weeks, ML.QUANTILE_BUCKETIZE(mother_age, 5) OVER() AS bucketized_mother_age, CAST(mother_race AS string) AS mother_race, ML.FEATURE_CROSS(STRUCT(is_male, CAST(mother_race AS STRING) AS mother_race)) is_male_mother_race) OPTIONS (model_type='linear_reg', input_label_cols=['weight_pounds']) AS SELECT * FROM `bigquery-public-data.samples.natality` WHERE weight_pounds IS NOT NULL AND RAND() < 0.001
In addition to creating the model, running the CREATE MODEL
command trains the
model you create.
Query details
The
CREATE MODEL
clause is used to create and train the model named
bqml_tutorial.natality_model
.
The OPTIONS(model_type='linear_reg', input_label_cols=['weight_pounds'])
clause indicates that you are creating a
linear regression
model. A linear regression is a type of regression model that generates a
continuous value from a linear combination of input features. The
weight_pounds
column is the input label column. For linear regression models,
the label column must be real valued (that is, the column values must be
real numbers).
This query's TRANSFORM
clause uses the following columns from the SELECT
statement:
weight_pounds
: The weight, in pounds, of the child (FLOAT64).is_male
: The sex of the child. TRUE if the child is male, FALSE if female (BOOL).gestation_weeks
: The number of weeks of the pregnancy (INT64).mother_age
: The age of the mother when giving birth (INT64).mother_race
: The race of the mother (INT64). This integer value is the same as thechild_race
value in the table schema. To force BigQuery ML to treatmother_race
as a non-numeric feature, with each distinct value representing a different category, the query castsmother_race
to a STRING. This is important because race is more likely to have more meaning as a category than an integer, which has ordering and scale.
Through the TRANSFORM
clause, the original features are preprocessed to feed
in training. The generated columns are:
weight_pounds
: Passed as is, without any change.is_male
: Passed through to feed in training.gestation_weeks
: Passed through to feed in training.bucketized_mother_age
: Generated frommother_age
by bucketizingmother_age
based on quantiles using theML.QUANTILE_BUCKETIZE()
analytic function.mother_race
: String format of the originalmother_race
.is_male_mother_race
: Generated from crossingis_male
andmother_race
using theML.FEATURE_CROSS
function.
The query's SELECT
statement provides the columns that you can use in the
TRANSFORM
clause. However, you do not need to use all columns in the
TRANSFORM
clause. As a result, you can do both feature selection and
preprocessing inside the TRANSFORM
clause.
The FROM
clause—bigquery-public-data.samples.natality
—indicates
that you are querying the natality sample table in the samples dataset.
This dataset is in the bigquery-public-data
project.
The WHERE
clause—WHERE weight_pounds IS NOT NULL AND RAND() < 0.001
—excludes
rows where weight is NULL and uses the RAND
function to draw a random sample
of the data.
Run the CREATE MODEL
query
To run the CREATE MODEL
query to create and train your model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL CREATE MODEL `bqml_tutorial.natality_model` TRANSFORM(weight_pounds, is_male, gestation_weeks, ML.QUANTILE_BUCKETIZE(mother_age, 5) OVER() AS bucketized_mother_age, CAST(mother_race AS string) AS mother_race, ML.FEATURE_CROSS(STRUCT(is_male, CAST(mother_race AS STRING) AS mother_race)) is_male_mother_race) OPTIONS (model_type='linear_reg', input_label_cols=['weight_pounds']) AS SELECT * FROM `bigquery-public-data.samples.natality` WHERE weight_pounds IS NOT NULL AND RAND() < 0.001
Click Run.
The query takes about 30 seconds to complete, after which your model (
natality_model
) appears in the navigation panel. Because the query uses aCREATE MODEL
statement to create a table, you do not see query results.
Step three (optional): Get training statistics
To see the results of the model training, you can use the
ML.TRAINING_INFO
function, or you can view the statistics in the Google Cloud console. In this
tutorial, you use the Google Cloud console.
A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.
Loss is the penalty for a bad prediction—a number indicating how bad the model's prediction was on a single example. If the model's prediction is perfect, the loss is zero; otherwise, the loss is greater. The goal of training a model is to find a set of weights and biases that have low loss, on average, across all examples.
To see the model training statistics that were generated when you ran the
CREATE MODEL
query:
In the Google Cloud console navigation panel, in the Resources section, expand project-name > bqml_tutorial and then click natality_model.
Click the Training tab and for View as, select the Table option. The results should look like the following:
+-----------+--------------------+----------------------+--------------------+ | Iteration | Training data loss | Evaluation data loss | Duration (seconds) | +-----------+--------------------+----------------------+--------------------+ | 0 | 1.6640 | 1.7352 | 6.27 | +-----------+--------------------+----------------------+--------------------+
The Training Data Loss column represents the loss metric calculated after the model is trained on the training dataset. Because you performed a linear regression, this column is the mean squared error.
The Evaluation Data Loss column is the same loss metric calculated on the holdout dataset (data that is held back from training to validate the model). The default optimize strategy used for the training is "normal_equation", so only one iteration is required to converge to the final model.
For more information about the
optimize_strategy
option, see theCREATE MODEL
statement.For more information about the
ML.TRAINING_INFO
function and the "optimize_strategy" training option, see the BigQuery ML syntax reference.
Step four: Evaluate your model
After creating your model, you evaluate the performance of the classifier
by using the ML.EVALUATE
function. The ML.EVALUATE
function evaluates
the predicted values against the actual data.
The query used to evaluate the model is as follows:
#standardSQL SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.natality_model`, ( SELECT * FROM `bigquery-public-data.samples.natality` WHERE weight_pounds IS NOT NULL))
Query details
The upper SELECT
statement retrieves the columns from your model.
The FROM
clause uses the
ML.EVALUATE
function against your model: bqml_tutorial.natality_model
.
This query's nested SELECT
statement and FROM
clause are the same as those
in the CREATE MODEL
query. Because the TRANSFORM
clause is used in training,
you don't need to specify the specific columns and transformations. They are
automatically restored.
The WHERE
clause—WHERE weight_pounds IS NOT NULL
—excludes
rows where weight is NULL.
#standardSQL SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.natality_model`)
Run the ML.EVALUATE
query
To run the ML.EVALUATE
query that evaluates the model, complete the
following steps:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT * FROM ML.EVALUATE(MODEL `bqml_tutorial.natality_model`, ( SELECT * FROM `bigquery-public-data.samples.natality` WHERE weight_pounds IS NOT NULL))
(Optional) To set the processing location, on the settings_applicationsMore drop-down list, click Query settings. For Processing location, choose United States (US). This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
+---------------------+--------------------+------------------------+---------------------+---------------------+----------------------+ | mean_absolute_error | mean_squared_error | mean_squared_log_error | mean_absolute_error | r2_score | explained_variance | +---------------------+--------------------+------------------------+---------------------+---------------------+----------------------+ | 0.9566580179970666 | 1.6756289722442677 | 0.034241471462096516 | 0.7385590721661188 | 0.04650972930257946 | 0.046516832131241026 | +---------------------+--------------------+------------------------+---------------------+---------------------+----------------------+
Because you performed a linear regression, the results include the following columns:
mean_absolute_error
mean_squared_error
mean_squared_log_error
median_absolute_error
r2_score
explained_variance
An important metric in the evaluation results is the
R2 score.
The R2 score is a statistical measure that determines whether the
linear regression predictions approximate the actual data. A 0
value
indicates that the model explains none of the variability of the response data
around the mean. A 1
value indicates that the model explains all the
variability of the response data around the mean.
Step five: Use your model to predict outcomes
Now that you have evaluated your model, the next step is to use it to predict an outcome. You can use your model to predict the birth weight of all babies born in Wyoming.
The query used to predict the outcome is as follows:
#standardSQL SELECT predicted_weight_pounds FROM ML.PREDICT(MODEL `bqml_tutorial.natality_model`, ( SELECT * FROM `bigquery-public-data.samples.natality` WHERE state = "WY"))
Query details
The top-most SELECT
statement retrieves the predicted_weight_pounds
column.
This column is generated by the ML.PREDICT
function. When you use the
ML.PREDICT
function, the output column name for the model is
predicted_label_column_name
. For linear regression
models, predicted_label
is the estimated value of label
. For logistic
regression models, predicted_label
is one of the two input labels, depending
on which label has the higher predicted probability.
The ML.PREDICT
function is used to predict results using your model:
bqml_tutorial.natality_model
.
This query's nested SELECT
statement and FROM
clause are the same as those
in the CREATE MODEL
query. Note that you don't necessarily need to pass in all
columns as in training, and only the ones used in the TRANSFORM
clause are
required. Similar to ML.EVALUATE
, the transformations inside the TRANSFORM
are automatically restored.
The WHERE
clause—WHERE state = "WY"
—indicates that you are
limiting the prediction to the state of Wyoming.
Run the ML.PREDICT
query
To run the query that uses the model to predict an outcome:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL SELECT predicted_weight_pounds FROM ML.PREDICT(MODEL `bqml_tutorial.natality_model`, ( SELECT * FROM `bigquery-public-data.samples.natality` WHERE state = "WY"))
(Optional) To set the processing location, on the settings_applicationsMore drop-down list, click Query settings. For Processing location, choose United States (US). This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
+----------------------------+ | predicted_weight_pounds | +----------------------------+ | 7.735962399307027 | +----------------------------+ | 7.728855793480761 | +----------------------------+ | 7.383850250400428 | +----------------------------+ | 7.4132677633242565 | +----------------------------+ | 7.734971309702814 | +----------------------------+
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation panel, click the bqml_tutorial dataset you created.
On the right side of the window, click Delete dataset. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (
bqml_tutorial
) and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.