Using BigQuery ML to make recommendations from Google analytics data

This tutorial introduces data analysts to the matrix factorization model in BigQuery ML. BigQuery ML enables users to create and execute machine learning models in BigQuery using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.

In this tutorial, you will learn how to create a matrix factorization model from implicit feedback using the GA360_test.ga_sessions_sample sample table to make recommendations given a visitor ID and a content ID.

The ga_sessions_sample table contains information about a slice of session data collected by Google Analytics 360 and sent to BigQuery.

Objectives

In this tutorial, you use the following:

  • BigQuery ML: To create an implicit recommentations model using the CREATE MODEL statement.
  • The ML.EVALUATE function: To evaluate the ML models.
  • The ML.WEIGHTS function: To inspect the latent factor weights generated during training.
  • The ML.RECOMMEND function: To produce recommendations for a user.

Costs

This tutorial uses billable components of Google Cloud, including:

  • BigQuery
  • BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see the BigQuery ML pricing page.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.

    Enable the API

Step one: Create your dataset

The first step is to create a BigQuery dataset to store your ML model. To create your dataset:

  1. In the Cloud Console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Resources section, click your project name.

  3. In the details panel, click Create dataset.

    Create dataset.

  4. On the Create dataset page:

    • For Dataset ID, enter bqml_tutorial.
    • For Data location, choose United States (US). Currently, the public datasets are stored in the US multi-region location. For simplicity, you should place your dataset in the same location.

      Create dataset page.

  5. Leave all of the other default settings in place and click Create dataset.

Step two: Load GoogleAnalytics 360 data into BigQuery

Most of the time, the ratings in your data will not reflect a value that a user sets explicitly. In these scenarios, we can devise a proxy to these values as an implicit rating and use a different algorithm to compute recommendations. In this sample, we will a sample Google Analytics 360 dataset. This sample is based on the following article.

The following is a query to run to create a dataset with implicit ratings from the session duration that a visitor has had on a page from the cloud-training-demos.GA360_test.ga_sessions_sample. The goal of this query is to create a dataset with three columns that we can map to a user column, item column and rating column.

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE TABLE
     bqml_tutorial.analytics_session_data AS
    WITH
     visitor_page_content AS (
     SELECT
       fullVisitorID,
       (
       SELECT
         MAX(
         IF
           (index=10,
             value,
             NULL))
       FROM
         UNNEST(hits.customDimensions)) AS latestContentId,
       (LEAD(hits.time, 1)
         OVER (PARTITION BY fullVisitorId ORDER BY hits.time ASC) - hits.time)
                 AS session_duration
     FROM
       `cloud-training-demos.GA360_test.ga_sessions_sample`,
       UNNEST(hits) AS hits
     WHERE
       # only include hits on pages
       hits.type = "PAGE"
     GROUP BY
       fullVisitorId,
       latestContentId,
       hits.time )
     # aggregate web stats
    SELECT
     fullVisitorID AS visitorId,
     latestContentId AS contentId,
     SUM(session_duration) AS session_duration
    FROM
     visitor_page_content
    WHERE
     latestContentId IS NOT NULL
    GROUP BY
     fullVisitorID,
     latestContentId
    HAVING
     session_duration > 0
    ORDER BY
     latestContentId
    
  3. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.

    Query settings.

  4. Click Run.

    When the query finishes running, (bqml_tutorial.analytics_session_data) will appear in the navigation panel. Because the query uses a CREATE TABLE statement to create a table, you do not see the query results.

  5. If you take a look at the table produced it should look something like the following:

    Query Results.

    Note that this result is specific to how the data was exported to BigQuery. The query to extract your own data might differ.

Step three: Create your Implicit Recommendations Model

Next, you create an implicit recommendations model using the google analytics table that was loaded in the previous step. The following standard SQL query is used to create the model that will be used to predict a confidence rating for every visitorId contentId pair.

#standardSQL
CREATE OR REPLACE MODEL bqml_tutorial.my_implicit_mf_model
OPTIONS
  (model_type='matrix_factorization',
   feedback_type='implicit',
   user_col='visitorId',
   item_col='contentId',
   rating_col='rating',
   l2_reg=30,
   num_factors=15) AS
SELECT
  visitorId,
  contentId,
  0.3 * (1 + (session_duration - 57937) / 57937) AS rating
FROM bqml_tutorial.analytics_session_data

Query details

The CREATE MODEL clause is used to create and train the model named bqml_tutorial.my_implicit_mf_model.

The OPTIONS(model_type='matrix_factorization', feedback_type='IMPLICIT', user_col='visitorId', ...) clause indicates that you are creating a matrix factorization model. Because feedback_type='IMPLICIT' is specified an implicit matrix factorization model will be trained. An example of how to create an explicit matrix factorization model is explained in Creating An Explicit Matrix Factorization Model.

This query's SELECT statement uses the following columns to generate recommendations.

  • visitorId—The visitor ID (INT64).
  • contentId—The content ID (INT64).
  • rating—The implicit rating from 0 to 1 calculated for visitorId and their respective contentId centered and scaled (FLOAT64).

The FROM clause—bqml_tutorial.analytics_session_data — indicates that you are querying the analytics_session_data table in the bqml_tutorial dataset. This dataset is in your bigquery project if the instructions in step two and eight were followed.

Run the CREATE MODEL query

To run the CREATE MODEL query to create and train your model:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE MODEL bqml_tutorial.my_implicit_mf_model
    OPTIONS
     (model_type='matrix_factorization',
      feedback_type='implicit',
      user_col='visitorId',
      item_col='contentId',
      rating_col='rating',
      l2_reg=30,
      num_factors=15) AS
    SELECT
     visitorId,
     contentId,
     0.3 * (1 + (session_duration - 57937) / 57937) AS rating
    FROM bqml_tutorial.analytics_session_data
    
  3. Click Run.

    The query takes about 12 minutes to complete, after which your model (my_implicit_mf_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you do not see query results.

(Optional) Step four: Get training statistics

To see the results of the model training, you can use the ML.TRAINING_INFO function, or you can view the statistics in the Cloud Console. In this tutorial, you use the Cloud Console.

A machine learning algorithm builds a model by examining many examples and attempting to find a model that minimizes loss. This process is called empirical risk minimization.

To see the model training statistics that were generated when you ran the CREATE MODEL query:

  1. In the Cloud Console navigation panel, in the Resources section, expand [PROJECT_ID] > bqml_tutorial and then click my_implicit_mf_model.

  2. Click the Training tab, and then click Table. The results should look like the following:

    ML.TRAINING_INFO output.

    The Training Data Loss column represents the loss metric calculated after the model is trained on the training dataset. Since you performed matrix factorization, this column is the mean squared error. By default, matrix factorization models will not split the data, so the Evaluation Data Loss column will not be present unless a holdout dataset is specified because splitting the data has a chance of losing all the ratings for a user or an item. As a result, the model will not have latent factor information about missing users or items.

    For more details on the ML.TRAINING_INFO function, see the BigQuery ML syntax reference.

Step five: Evaluate your model

After creating your model, you evaluate the performance of the recommender using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted ratings against the actual ratings.

The query used to evaluate the model is as follows:

#standardSQL
SELECT
  *
FROM
  ML.EVALUATE(MODEL `bqml_tutorial.my_implicit_mf_model`)

Query details

The top-most SELECT statement retrieves the columns from your model.

The FROM clause uses the ML.EVALUATE function against your model: bqml_tutorial.my_implicit_mf_model.

Run the ML.EVALUATE query

To run the ML.EVALUATE query that evaluates the model:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    SELECT
     *
    FROM
     ML.EVALUATE(MODEL bqml_tutorial.my_implicit_mf_model)
    
  3. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.

    Query settings.

  4. Click Run.

  5. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.EVALUATE output.

    Because you performed an implicit matrix factorization, the results include the following columns:

    • mean_average_precision
    • mean_squared_error
    • normalized_discounted_cumulative_gain
    • average_rank

    mean_average_precision, normalized_discounted_cumulative_gain, and average_rank are ranking metrics that are explained here: Implicit Matrix Factorization Metrics

Step six: Use your model to predict ratings and make recommendations

Find all the contentId rating confidences for a set of visitorIds

ML.RECOMMEND does not need to take any additional arguments besides the model, but can take in an optional table. If the input table only has one column that matches the name of the input user or input item column, then all of the predicted item ratings for each user will be outputed and vice versa. Note that if all of the users or all of the items are in the input table, it will output the same results as passing no optional argument to ML.RECOMMEND.

The following is an example of a query to fetch all of the predicted rating confidences for 5 visitors.

#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL bqml_tutorial.my_implicit_mf_model,
    (
    SELECT
      visitorId
    FROM
      bqml_tutorial.analytics_session_data
    LIMIT 5))

Query details

The top-most SELECT statement retrieves the visitorId, contentId, and predicted_rating_confidence column. This last column is generated by the ML.RECOMMEND function. When you use the ML.RECOMMEND function, the output column name for implicit matrix factorization models is predicted_rating-column-name_confidence. For implicit matrix factorization models, predicted_rating_confidence is the estimated confidence for the user/item pair. This confidence value approximately lies between 0 and 1 where the higher confidence indicates that the user prefers item more than an item with a lower confidence value.

The ML.RECOMMEND function is used to predict ratings using your model: bqml_tutorial.my_implicit_mf_model.

This query's nested SELECT statement selects just the visitorId column from the original table used for training.

The LIMIT clause—LIMIT 5—will randomly filter out 5 visitorIds to send to ML.RECOMMEND.

Find the ratings for all visitorId contentId pairs

Now that you have evaluated your model, the next step is to use it to get a rating confidence. You use your model to predict the confidences of every user-item combination in the following query:

#standardSQL
SELECT
  *
FROM
  ML.RECOMMEND(MODEL bqml_tutorial.my_implicit_mf_model)

Query details

The top-most SELECT statement retrieves the visitorId, contentId, and predicted_rating_confidence column. This last column is generated by the ML.RECOMMEND function. When you use the ML.RECOMMEND function, the output column name for implicit matrix factorization models is predicted_rating-column-name_confidence. For implicit matrix factorization models, predicted_rating_confidence is the estimated confidence for the user/item pair. This confidence value approximately lies between 0 and 1 where the higher confidence indicates that the user prefers item more than an item with a lower confidence value.

The ML.RECOMMEND function is used to predict ratings using your model: bqml_tutorial.my_implicit_mf_model.

One way of saving the result to table is:

#standardSQL
CREATE OR REPLACE TABLE bqml_tutorial.recommend_content
OPTIONS() AS
SELECT
  *
FROM
  ML.RECOMMEND(MODEL bqml_tutorial.my_implicit_mf_model)

If a Query Exceeded Resource Limits error occurs for ML.RECOMMEND, retry with a higher billing tier. In the BigQuery command-line tool, this can be set using --maximum_billing_tier.

Generate recommendations

The following query uses ML.RECOMMEND to output the top 5 recommended contentIds per visitorId.

#standardSQL
SELECT
  visitorId,
  ARRAY_AGG(STRUCT(contentId, predicted_rating_confidence)
    ORDER BY predicted_rating_confidence DESC LIMIT 5) AS rec
FROM
  bqml_tutorial.recommend_content
GROUP BY
  visitorId

Query details

The SELECT statement aggregates the results from the ML.RECOMMEND query by using GROUP BY visitorId to aggregate the contentId and predicted_rating_confidence in descending order and only keeps the top 5 content ids.

Using the previous recommendations query, we can order by the predicted rating and output the top predicted items for each user. The following query joins the item_ids with the movie_ids found in the movielens.movie_titles table uploaded earlier and outputs the top 5 recommended movies per user.

Run the ML.RECOMMEND query

To run the ML.RECOMMEND query that outputs the top 5 recommended content ids per visitor id:

  1. In the Cloud Console, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    #standardSQL
    CREATE OR REPLACE TABLE bqml_tutorial.recommend_content
    OPTIONS() AS
    SELECT
     *
    FROM
     ML.RECOMMEND(MODEL bqml_tutorial.my_implicit_mf_model)
    
  3. Click Run.

    When the query finishes running, (bqml_tutorial.recommend_content) will appear in the navigation panel of the Cloud Console. Because the query uses a CREATE TABLE statement to create a table, you do not see the query results.

  4. Compose another new query. Enter the following standard SQL query in the Query editor text area once the previous query has finished running.

    #standardSQL
    SELECT
     visitorId,
     ARRAY_AGG(STRUCT(contentId, predicted_rating_confidence)
       ORDER BY predicted_rating_confidence DESC LIMIT 5) AS rec
    FROM
     `bqml_tutorial.recommend_content`
    GROUP BY
     visitorId
    
  5. (Optional) To set the processing location, click More > Query settings. For Processing location, choose US. This step is optional because the processing location is automatically detected based on the dataset's location.

    Query settings.

  6. Click Run.

  7. When the query is complete, click the Results tab below the query text area. The results should look like the following:

    ML.RECOMMEND output.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Deleting your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Cloud Console.

    Go to the BigQuery page

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset, the table, and all the data.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Deleting your project

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete .
  3. In the dialog, type the project ID and then click Shut down to delete the project.

What's next