This tutorial introduces data analysts to the matrix factorization model in BigQuery ML. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.
In this tutorial, you will learn how to create a matrix factorization model from implicit feedback using the GA360_test.ga_sessions_sample sample table to make recommendations given a visitor ID and a content ID. The ga_sessions_sample table contains information about a slice of session data collected by Google Analytics 360 and sent to BigQuery.
Objectives
In this tutorial, you use the following:
- BigQuery ML: To create an implicit recommendations model using the CREATE MODEL statement.
- The ML.EVALUATE function: To evaluate the ML models.
- The ML.WEIGHTS function: To inspect the latent factor weights generated during training.
- The ML.RECOMMEND function: To produce recommendations for a user.
Costs
This tutorial uses billable components of Google Cloud, including:
- BigQuery
- BigQuery ML
For more information about BigQuery costs, see the BigQuery pricing page.
For more information about BigQuery ML costs, see BigQuery ML pricing.
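If you want to estimate how many bytes a query will process before running it, one option is a dry run with the bq command-line tool. The query below is only an illustrative sketch against the sample table used later in this tutorial:
# Report the bytes the query would process, without actually running it
bq query --use_legacy_sql=false --dry_run \
'SELECT COUNT(*) FROM `cloud-training-demos.GA360_test.ga_sessions_sample`'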
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
For Dataset ID, enter bqml_tutorial.
For Location type, select Multi-region, and then select US (multiple regions in United States). The public datasets are stored in the US multi-region. For simplicity, store your dataset in the same location.
Leave the remaining default settings as they are, and click Create dataset.
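Alternatively, you can create the same dataset from the command line with the bq tool. This is a minimal sketch and assumes the Google Cloud CLI is installed and configured for your project:
# Create the bqml_tutorial dataset in the US multi-region
bq --location=US mk --dataset bqml_tutorial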
Step two: Load Analytics 360 data into BigQuery
Most of the time, the ratings in your data will not reflect a value that a user sets explicitly. In these scenarios, you can devise a proxy for these values as an implicit rating and use a different algorithm to compute recommendations. In this sample, you use a sample Analytics 360 dataset. This sample is based on the following article.
The following query creates a table of implicit ratings derived from the session duration that a visitor spent on a page, using the cloud-training-demos.GA360_test.ga_sessions_sample table. The goal of this query is to produce three columns that you can map to a user column, an item column, and a rating column.
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
CREATE OR REPLACE TABLE bqml_tutorial.analytics_session_data AS
WITH visitor_page_content AS (
  SELECT
    fullVisitorID,
    (SELECT MAX(IF (index=10, value, NULL)) FROM UNNEST(hits.customDimensions)) AS latestContentId,
    (LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) - hits.time) AS session_duration
  FROM
    `cloud-training-demos.GA360_test.ga_sessions_sample`,
    UNNEST(hits) AS hits
  WHERE
    # only include hits on pages
    hits.type = "PAGE"
  GROUP BY
    fullVisitorId,
    latestContentId,
    hits.time )
# aggregate web stats
SELECT
  fullVisitorID AS visitorId,
  latestContentId AS contentId,
  SUM(session_duration) AS session_duration
FROM
  visitor_page_content
WHERE
  latestContentId IS NOT NULL
GROUP BY
  fullVisitorID,
  latestContentId
HAVING
  session_duration > 0
ORDER BY
  latestContentId
(Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query finishes running, the new table (bqml_tutorial.analytics_session_data) will appear in the navigation panel. Because the query uses a CREATE TABLE statement to create a table, you do not see the query results.
If you take a look at the table produced, it should look something like the following:
Note that this result is specific to how the data was exported to BigQuery. The query to extract your own data might differ.
Step three: Create your implicit recommendations model
Next, you create an implicit recommendations model using the Google Analytics table that you loaded in the previous step. The following GoogleSQL query creates the model that will be used to predict a confidence rating for every visitorId and contentId pair. The rating is created by centering and scaling each session duration by the median session duration, and records where the session duration is more than 3.33 times the median are filtered out as outliers.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.my_implicit_mf_model`
OPTIONS
  (model_type='matrix_factorization',
   feedback_type='implicit',
   user_col='visitorId',
   item_col='contentId',
   rating_col='rating',
   l2_reg=30,
   num_factors=15) AS
SELECT
  visitorId,
  contentId,
  0.3 * (1 + (session_duration - 57937) / 57937) AS rating
FROM
  `bqml_tutorial.analytics_session_data`
WHERE
  0.3 * (1 + (session_duration - 57937) / 57937) < 1
Query details
The CREATE MODEL clause is used to create and train the model named bqml_tutorial.my_implicit_mf_model.
The OPTIONS(model_type='matrix_factorization', feedback_type='implicit', user_col='visitorId', ...) clause indicates that you are creating a matrix factorization model. Because feedback_type='implicit' is specified, an implicit matrix factorization model is trained. An example of how to create an explicit matrix factorization model is explained in Creating an explicit matrix factorization model.
This query's SELECT statement uses the following columns to generate recommendations:
- visitorId—The visitor ID (INT64).
- contentId—The content ID (INT64).
- rating—The implicit rating from 0 to 1 calculated for visitorId and their respective contentId, centered and scaled (FLOAT64).
The FROM clause—bqml_tutorial.analytics_session_data—indicates that you are querying the analytics_session_data table in the bqml_tutorial dataset. This dataset is in your BigQuery project if you followed the instructions in steps one and two.
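The rating expression in the query above uses 57937 as the precomputed median session duration for this sample table. If you want to recompute that value for your own data, a query like the following sketch returns the approximate median; using APPROX_QUANTILES here is an assumption about how the constant was derived:
#standardSQL
SELECT
  # element 1 of the 3-element quantile array is the approximate median
  APPROX_QUANTILES(session_duration, 2)[OFFSET(1)] AS median_session_duration
FROM
  `bqml_tutorial.analytics_session_data`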
Run the CREATE MODEL query
To run the CREATE MODEL query that creates and trains your model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
CREATE OR REPLACE MODEL `bqml_tutorial.my_implicit_mf_model`
OPTIONS
  (model_type='matrix_factorization',
   feedback_type='implicit',
   user_col='visitorId',
   item_col='contentId',
   rating_col='rating',
   l2_reg=30,
   num_factors=15) AS
SELECT
  visitorId,
  contentId,
  0.3 * (1 + (session_duration - 57937) / 57937) AS rating
FROM
  `bqml_tutorial.analytics_session_data`
WHERE
  0.3 * (1 + (session_duration - 57937) / 57937) < 1
Click Run.
The query takes about 12 minutes to complete, after which your model (my_implicit_mf_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.
Step four (optional): Get training statistics
To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the Google Cloud console. In this tutorial, you use the Google Cloud console.
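If you prefer SQL, a minimal sketch of the ML.TRAINING_INFO alternative looks like the following; it returns one row per training iteration of the model created above:
#standardSQL
SELECT
  *
FROM
  ML.TRAINING_INFO(MODEL `bqml_tutorial.my_implicit_mf_model`)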
A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.
To see the model training statistics that were generated when you ran the CREATE MODEL query:
In the Google Cloud console navigation panel, in the Resources section, expand [PROJECT_ID] > bqml_tutorial and then click my_implicit_mf_model.
Click the Training tab, and then click Table. The results should look like the following:
The Training Data Loss column represents the loss metric calculated after the model is trained on the training dataset. Because you performed matrix factorization, this column shows the mean squared error. By default, matrix factorization models do not split the data, because splitting has a chance of dropping all of the ratings for a given user or item, in which case the model would have no latent factor information about the missing users or items. As a result, the Evaluation Data Loss column is not present unless a holdout dataset is specified.
For more details on the ML.TRAINING_INFO function, see the BigQuery ML syntax reference.
Step five: Evaluate your model
After creating your model, you evaluate the performance of the recommender using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted ratings against the actual ratings.
The query used to evaluate the model is as follows:
#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.my_implicit_mf_model`)
Query details
The top-most SELECT statement retrieves the columns from your model.
The FROM clause uses the ML.EVALUATE function against your model: bqml_tutorial.my_implicit_mf_model.
Run the ML.EVALUATE query
To run the ML.EVALUATE query that evaluates the model:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.my_implicit_mf_model`)
(Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Because you performed an implicit matrix factorization, the results include the following columns:
- mean_average_precision
- mean_squared_error
- normalized_discounted_cumulative_gain
- average_rank
mean_average_precision, normalized_discounted_cumulative_gain, and average_rank are ranking metrics that are explained in Implicit matrix factorization metrics.
Step six: Predict ratings and make recommendations
Use your model to predict ratings and make recommendations
Find all the contentId rating confidences for a set of visitorIds
ML.RECOMMEND does not need to take any additional arguments besides the model, but it can take an optional table. If the input table has only one column, and its name matches the name of the user column or the item column, then the predicted ratings of every item for each listed user are output, and vice versa. Note that if all of the users or all of the items are in the input table, ML.RECOMMEND outputs the same results as it does when no optional argument is passed.
The following is an example of a query to fetch all of the predicted rating confidences for 5 visitors.
#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`,
    (
    SELECT
      visitorId
    FROM
      `bqml_tutorial.analytics_session_data`
    LIMIT 5))
Query details
The top-most SELECT statement retrieves the visitorId, contentId, and predicted_rating_confidence columns. The last column is generated by the ML.RECOMMEND function. When you use the ML.RECOMMEND function, the output column name for implicit matrix factorization models is predicted_<rating-column-name>_confidence. For implicit matrix factorization models, predicted_rating_confidence is the estimated confidence for the user/item pair. This confidence value lies approximately between 0 and 1, where a higher confidence indicates that the user prefers the item more than an item with a lower confidence value.
The ML.RECOMMEND function is used to predict ratings using your model: bqml_tutorial.my_implicit_mf_model.
This query's nested SELECT statement selects just the visitorId column from the original table used for training.
The LIMIT clause—LIMIT 5—passes an arbitrary set of 5 visitorId values to ML.RECOMMEND.
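As noted earlier, ML.RECOMMEND also accepts an input table that contains only the item column. The following sketch illustrates that reverse case by passing only contentId values; the model then returns the predicted confidence of every visitor for each of those content IDs:
#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`,
    (
    # only the item column is provided, so predictions are returned for all users
    SELECT
      contentId
    FROM
      `bqml_tutorial.analytics_session_data`
    LIMIT 5))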
Find the ratings for all visitorId contentId pairs
Now that you have evaluated your model, the next step is to use it to get a rating confidence. You use your model to predict the confidences of every user-item combination in the following query:
#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`)
Query details
The top-most SELECT statement retrieves the visitorId, contentId, and predicted_rating_confidence columns. The last column is generated by the ML.RECOMMEND function. When you use the ML.RECOMMEND function, the output column name for implicit matrix factorization models is predicted_<rating-column-name>_confidence. For implicit matrix factorization models, predicted_rating_confidence is the estimated confidence for the user/item pair. This confidence value lies approximately between 0 and 1, where a higher confidence indicates that the user prefers the item more than an item with a lower confidence value.
The ML.RECOMMEND function is used to predict ratings using your model: bqml_tutorial.my_implicit_mf_model.
One way of saving the result to a table is:
#standardSQL
CREATE OR REPLACE TABLE `bqml_tutorial.recommend_content` OPTIONS() AS
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`)
If a Query Exceeded Resource Limits error occurs for ML.RECOMMEND, retry with a higher billing tier. In the BigQuery command-line tool, this can be set using --maximum_billing_tier.
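For example, a retry from the command line might look like the following sketch; the tier value of 8 is an arbitrary choice and depends on your project's limits:
# Retry the ML.RECOMMEND query with a higher maximum billing tier
bq query --use_legacy_sql=false --maximum_billing_tier=8 \
'SELECT * FROM ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`)'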
Generate recommendations
The following query uses ML.RECOMMEND to output the top 5 recommended contentIds per visitorId.
#standardSQL
SELECT
  visitorId,
  ARRAY_AGG(STRUCT(contentId, predicted_rating_confidence)
    ORDER BY predicted_rating_confidence DESC LIMIT 5) AS rec
FROM
  `bqml_tutorial.recommend_content`
GROUP BY
  visitorId
Query details
The SELECT statement aggregates the results of the ML.RECOMMEND query by using GROUP BY visitorId, orders each visitor's contentId and predicted_rating_confidence pairs by confidence in descending order, and keeps only the top 5 content IDs.
Using the previous recommendations query, you can order by the predicted rating confidence and output the top predicted content IDs for each visitor.
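If you prefer flat rows rather than the ARRAY_AGG output, a sketch of an equivalent query uses a ROW_NUMBER window function to keep the top 5 rows per visitor:
#standardSQL
SELECT
  visitorId,
  contentId,
  predicted_rating_confidence
FROM (
  SELECT
    *,
    # rank each visitor's recommendations by confidence, highest first
    ROW_NUMBER() OVER (
      PARTITION BY visitorId
      ORDER BY predicted_rating_confidence DESC) AS rn
  FROM
    `bqml_tutorial.recommend_content`)
WHERE
  rn <= 5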
Run the ML.RECOMMEND query
To run the ML.RECOMMEND query that outputs the top 5 recommended content ids per visitor id:
In the Google Cloud console, click the Compose new query button.
Enter the following GoogleSQL query in the Query editor text area.
#standardSQL
CREATE OR REPLACE TABLE `bqml_tutorial.recommend_content` OPTIONS() AS
SELECT
  *
FROM
  ML.RECOMMEND(MODEL `bqml_tutorial.my_implicit_mf_model`)
Click Run.
When the query finishes running, the new table (bqml_tutorial.recommend_content) will appear in the navigation panel of the Google Cloud console. Because the query uses a CREATE TABLE statement to create a table, you do not see the query results.
Compose another new query. Once the previous query has finished running, enter the following GoogleSQL query in the Query editor text area.
#standardSQL
SELECT
  visitorId,
  ARRAY_AGG(STRUCT(contentId, predicted_rating_confidence)
    ORDER BY predicted_rating_confidence DESC LIMIT 5) AS rec
FROM
  `bqml_tutorial.recommend_content`
GROUP BY
  visitorId
(Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.
Click Run.
When the query is complete, click the Results tab below the query text area. The results should look like the following:
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.
In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial), and then click Delete.
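If you prefer the command line, the dataset can also be deleted with the bq tool. This is a sketch; the -r flag removes all tables in the dataset, and -f skips the confirmation prompt:
# Delete the bqml_tutorial dataset and everything in it
bq rm -r -f -d bqml_tutorial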
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- To learn more about machine learning, see the Machine learning crash course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.