This tutorial introduces data analysts to the matrix factorization model in BigQuery ML. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.
In this tutorial, you will learn how to create a matrix factorization model from implicit feedback using the GA360_test.ga_sessions_sample sample table to make recommendations given a visitor ID and a content ID. The ga_sessions_sample table contains information about a slice of session data collected by Google Analytics 360 and sent to BigQuery.
Objectives
In this tutorial, you use the following:
- BigQuery ML: To create an implicit recommendations model using the CREATE MODEL statement.
- The ML.EVALUATE function: To evaluate the ML models.
- The ML.WEIGHTS function: To inspect the latent factor weights generated during training.
- The ML.RECOMMEND function: To produce recommendations for a user.
Costs
This tutorial uses billable components of Google Cloud, including:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
For more information about BigQuery ML costs, see BigQuery ML pricing.
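If you want to estimate how many bytes a query will process before running it, one option is a dry run with the bq command-line tool. The query below is only an illustrative sketch against the sample table used later in this tutorial:
# Report the bytes the query would process, without actually running it
bq query --use_legacy_sql=false --dry_run \
'SELECT COUNT(*) FROM `cloud-training-demos.GA360_test.ga_sessions_sample`'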
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.
Leave the remaining default settings as they are, and click Create dataset.
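Alternatively, you can create the same dataset from the command line with the bq tool. This is a minimal sketch and assumes the Google Cloud CLI is installed and configured for your project:
# Create the bqml_tutorial dataset in the US multi-region
bq --location=US mk --dataset bqml_tutorial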
Step two: Load Analytics 360 data into BigQuery
Most of the time, the ratings in your data will not reflect a value that a user sets explicitly. In these scenarios, you can devise a proxy for these values as an implicit rating and use a different algorithm to compute recommendations. In this sample, you use a sample Analytics 360 dataset. This sample is based on the following article.
The following query creates a table of implicit ratings derived from the session duration that a visitor spent on a page, using the cloud-training-demos.GA360_test.ga_sessions_sample table. The goal of this query is to produce three columns that you can map to a user column, an item column, and a rating column.
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
CREATE OR REPLACE TABLE bqml_tutorial.analytics_session_data AS
WITH visitor_page_content AS (
  SELECT
    fullVisitorID,
    (SELECT MAX(IF (index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS latestContentId,
    (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) - hits.time) AS session_duration
  FROM
    `cloud-training-demos.GA360_test.ga_sessions_sample`,
    UNNEST(hits) AS hits
  WHERE
    # only include hits on pages
    hits.type = "PAGE"
  GROUP BY
    fullVisitorId,
    latestContentId,
    hits.time )
# aggregate web stats
SELECT
  fullVisitorID AS visitorId,
  latestContentId AS contentId,
  SUM(session_duration) AS session_duration
FROM
  visitor_page_content
WHERE
  latestContentId IS NOT NULL
GROUP BY
  fullVisitorID,
  latestContentId
HAVING
  session_duration > 0
ORDER BY
  latestContentId
(Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query finishes running, the new table (bqml_tutorial.analytics_session_data) will appear in the navigation panel. Because the query uses a CREATE TABLE statement to create a table, you do not see the query results.
If you take a look at the table produced, it should look something like the following:
Note that this result is specific to how the data was exported to BigQuery. The query to extract your own data might differ.
Step three: Create your implicit recommendations model
Next, you create an implicit recommendations model using the Google Analytics table that you loaded in the previous step. The following GoogleSQL query creates the model that will be used to predict a confidence rating for every visitorId and contentId pair. The rating is created by centering and scaling each session duration by the median session duration, and records where the session duration is more than 3.33 times the median are filtered out as outliers.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.my_implicit_mf_model`
OPTIONS
  (model_type='matrix_factorization',
   feedback_type='implicit',
   user_col='visitorId',
   item_col='contentId',
   rating_col='rating',
   l2_reg=30,
   num_factors=15) AS
SELECT
  visitorId,
  contentId,
  0.3 * (1 + (session_duration - 57937) / 57937) AS rating
FROM
  `bqml_tutorial.analytics_session_data`
WHERE
  0.3 * (1 + (session_duration - 57937) / 57937) < 1
Query details
The CREATE MODEL clause is used to create and train the model named bqml_tutorial.my_implicit_mf_model.
The OPTIONS(model_type='matrix_factorization', feedback_type='implicit', user_col='visitorId', ...) clause indicates that you are creating a matrix factorization model. Because feedback_type='implicit' is specified, an implicit matrix factorization model is trained. An example of how to create an explicit matrix factorization model is explained in Creating an explicit matrix factorization model.
This query's SELECT statement uses the following columns to generate recommendations:
- visitorId—The visitor ID (INT64).
- contentId—The content ID (INT64).
- rating—The implicit rating from 0 to 1 calculated for visitorId and their respective contentId, centered and scaled (FLOAT64).
The FROM clause—bqml_tutorial.analytics_session_data—indicates that you are querying the analytics_session_data table in the bqml_tutorial dataset. This dataset is in your BigQuery project if you followed the instructions in steps one and two.
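The rating expression in the query above uses 57937 as the precomputed median session duration for this sample table. If you want to recompute that value for your own data, a query like the following sketch returns the approximate median; using APPROX_QUANTILES here is an assumption about how the constant was derived:
#standardSQL
SELECT
  # element 1 of the 3-element quantile array is the approximate median
  APPROX_QUANTILES(session_duration, 2)[OFFSET(1)] AS median_session_duration
FROM
  `bqml_tutorial.analytics_session_data`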
Run the CREATE MODEL query
To run the CREATE MODEL query that creates and trains your model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.my_implicit_mf_model`
OPTIONS
  (model_type='matrix_factorization',
   feedback_type='implicit',
   user_col='visitorId',
   item_col='contentId',
   rating_col='rating',
   l2_reg=30,
   num_factors=15) AS
SELECT
  visitorId,
  contentId,
  0.3 * (1 + (session_duration - 57937) / 57937) AS rating
FROM
  `bqml_tutorial.analytics_session_data`
WHERE
  0.3 * (1 + (session_duration - 57937) / 57937) < 1
Click Run.
The query takes about 12 minutes to complete, after which your model (my_implicit_mf_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.
Step four (optional): Get training statistics
To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the Google Cloud console. In this tutorial, you use the Google Cloud console.
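If you prefer SQL, a minimal sketch of the ML.TRAINING_INFO alternative looks like the following; it returns one row per training iteration of the model created above:
#standardSQL
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `bqml_tutorial.my_implicit_mf_model`)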
A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.
To see the model training statistics that were generated when you ran the CREATE MODEL query:
In the Google Cloud console navigation panel, in the Resources section, expand [PROJECT_ID] > bqml_tutorial and then click my_implicit_mf_model.
Click the Training tab, and then click Table. The results should look like the following:
The Training Data Loss column represents the loss metric calculated after the model is trained on the training dataset. Because you performed matrix factorization, this column shows the mean squared error. By default, matrix factorization models do not split the data, because splitting has a chance of dropping all of the ratings for a given user or item, in which case the model would have no latent factor information about the missing users or items. As a result, the Evaluation Data Loss column is not present unless a holdout dataset is specified.
For more details on the ML.TRAINING_INFO function, see the BigQuery ML syntax reference.
Step five: Evaluate your model
After creating your model, you evaluate the performance of the recommender using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted ratings against the actual ratings.
The query used to evaluate the model is as follows:
#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.my_implicit_mf_model`)
Query details
The top-most SELECT statement retrieves the columns from your model.
The FROM clause uses the ML.EVALUATE function against your model: bqml_tutorial.my_implicit_mf_model.
Run the ML.EVALUATE query
To run the ML.EVALUATE query that evaluates the model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.my_implicit_mf_model`)
(Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Because you performed an implicit matrix factorization, the results include the following columns:
- mean_average_precision
- mean_squared_error
- normalized_discounted_cumulative_gain
- average_rank
mean_average_precision, normalized_discounted_cumulative_gain, and average_rank are ranking metrics that are explained in Implicit matrix factorization metrics.
Step six: Predict ratings and make recommendations
Use your model to predict ratings and make recommendations
Find all the contentId rating confidences for a set of visitorIds
ML.RECOMMEND does not need to take any additional arguments besides the model, but it can take an optional table. If the input table has only one column, and its name matches the name of the user column or the item column, then the predicted ratings of every item for each listed user are output, and vice versa. Note that if all of the users or all of the items are in the input table, ML.RECOMMEND outputs the same results as it does when no optional argument is passed.
The following is an example of a query to fetch all of the predicted rating confidences for 5 visitors.
#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`,
    (
    SELECT
      visitorId
    FROM
      `bqml_tutorial.analytics_session_data`
    LIMIT 5))
Query details
The top-most SELECT statement retrieves the visitorId, contentId, and predicted_rating_confidence columns. The last column is generated by the ML.RECOMMEND function. When you use the ML.RECOMMEND function, the output column name for implicit matrix factorization models is predicted_<rating-column-name>_confidence. For implicit matrix factorization models, predicted_rating_confidence is the estimated confidence for the user/item pair. This confidence value lies approximately between 0 and 1, where a higher confidence indicates that the user prefers the item more than an item with a lower confidence value.
The ML.RECOMMEND function is used to predict ratings using your model: bqml_tutorial.my_implicit_mf_model.
This query's nested SELECT statement selects just the visitorId column from the original table used for training.
The LIMIT clause—LIMIT 5—passes an arbitrary set of 5 visitorId values to ML.RECOMMEND.
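As noted earlier, ML.RECOMMEND also accepts an input table that contains only the item column. The following sketch illustrates that reverse case by passing only contentId values; the model then returns the predicted confidence of every visitor for each of those content IDs:
#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`,
    (
    # only the item column is provided, so predictions are returned for all users
    SELECT
      contentId
    FROM
      `bqml_tutorial.analytics_session_data`
    LIMIT 5))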
Find the ratings for all visitorId contentId pairs
Now that you have evaluated your model, the next step is to use it to get a rating confidence. You use your model to predict the confidences of every user-item combination in the following query:
#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`)
Query details
The top-most SELECT statement retrieves the visitorId, contentId, and predicted_rating_confidence columns. The last column is generated by the ML.RECOMMEND function. When you use the ML.RECOMMEND function, the output column name for implicit matrix factorization models is predicted_<rating-column-name>_confidence. For implicit matrix factorization models, predicted_rating_confidence is the estimated confidence for the user/item pair. This confidence value lies approximately between 0 and 1, where a higher confidence indicates that the user prefers the item more than an item with a lower confidence value.
The ML.RECOMMEND function is used to predict ratings using your model: bqml_tutorial.my_implicit_mf_model.
One way of saving the result to a table is:
#standardSQL
CREATE OR REPLACE TABLE `bqml_tutorial.recommend_content` OPTIONS() AS
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`)
If a Query Exceeded Resource Limits error occurs for ML.RECOMMEND, retry with a higher billing tier. In the BigQuery command-line tool, this can be set using --maximum_billing_tier.
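For example, a retry from the command line might look like the following sketch; the tier value of 8 is an arbitrary choice and depends on your project's limits:
# Retry the ML.RECOMMEND query with a higher maximum billing tier
bq query --use_legacy_sql=false --maximum_billing_tier=8 \
'SELECT * FROM ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`)'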
Generate recommendations
The following query uses ML.RECOMMEND to output the top 5 recommended contentIds per visitorId.
#standardSQL
SELECT
  visitorId,
  ARRAY_AGG(STRUCT(contentId, predicted_rating_confidence)
    ORDER BY predicted_rating_confidence DESC LIMIT 5) AS rec
FROM
  `bqml_tutorial.recommend_content`
GROUP BY
  visitorId
Query details
The SELECT statement aggregates the results of the ML.RECOMMEND query by using GROUP BY visitorId, orders each visitor's contentId and predicted_rating_confidence pairs by confidence in descending order, and keeps only the top 5 content IDs.
Using the previous recommendations query, you can order by the predicted rating confidence and output the top predicted content IDs for each visitor.
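If you prefer flat rows rather than the ARRAY_AGG output, a sketch of an equivalent query uses a ROW_NUMBER window function to keep the top 5 rows per visitor:
#standardSQL
SELECT
  visitorId,
  contentId,
  predicted_rating_confidence
FROM (
  SELECT
    *,
    # rank each visitor's recommendations by confidence, highest first
    ROW_NUMBER() OVER (
      PARTITION BY visitorId
      ORDER BY predicted_rating_confidence DESC) AS rn
  FROM
    `bqml_tutorial.recommend_content`)
WHERE
  rn <= 5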
Run the ML.RECOMMEND query
To run the ML.RECOMMEND query that outputs the top 5 recommended content ids per visitor id:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
CREATE OR REPLACE TABLE `bqml_tutorial.recommend_content` OPTIONS() AS
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`)
Click Run.
When the query finishes running, the new table (bqml_tutorial.recommend_content) will appear in the navigation panel of the Google Cloud console. Because the query uses a CREATE TABLE statement to create a table, you do not see the query results.
Compose another new query. Once the previous query has finished running, enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT
  visitorId,
  ARRAY_AGG(STRUCT(contentId, predicted_rating_confidence)
    ORDER BY predicted_rating_confidence DESC LIMIT 5) AS rec
FROM
  `bqml_tutorial.recommend_content`
GROUP BY
  visitorId
(Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial), and then click Delete.
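If you prefer the command line, the dataset can also be deleted with the bq tool. This is a sketch; the -r flag removes all tables in the dataset, and -f skips the confirmation prompt:
# Delete the bqml_tutorial dataset and everything in it
bq rm -r -f -d bqml_tutorial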
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.