Recommendations in TensorFlow: Apply to Data from Google Analytics

This article is the third part of a multi-part tutorial series that shows you how to implement a machine learning (ML) recommendation system with TensorFlow and AI Platform in Google Cloud Platform (GCP). This part shows you how to apply the TensorFlow model to data from Google Analytics 360 to produce content recommendations for a website.

The series consists of these parts:

  • An overview article that explains the recommendation approach.
  • Part 1, in which you install the recommendation model code.
  • Part 2, in which you train the model on AI Platform and tune its hyperparameters.
  • Part 3 (this part), in which you apply the model to data from Google Analytics.
  • Part 4, in which you deploy a production recommendation system on GCP.

This tutorial assumes that you have completed the preceding tutorials in the series.

This tutorial shows you how to use the TensorFlow WALS model to produce recommendations for a content website. The approach is based on the following:

  • The input data for the recommendation technique are events that track user behavior.
  • Each event represents the interaction of a user with an item on a website.
  • For content websites, the relevant events are those in which a user selects an article to read by clicking a link to the article.
  • These events can be captured using Google Analytics.

The following diagram shows the components you'll use in this tutorial.

Architecture diagram showing Google Analytics feeding into BigQuery, then Cloud Storage, and from there to AI Platform
Figure 1. Technical components used in this tutorial

Objectives

  • Prepare Google Analytics data from BigQuery for training a recommendation model.
  • Train the recommendation model.
  • Tune model hyperparameters for Google Analytics data.
  • Run the TensorFlow model code on Google Analytics data to generate recommendations.

Costs

This tutorial uses Cloud Storage, AI Platform, and BigQuery, which are billable services. You can use the pricing calculator to estimate the costs for your projected usage. The projected cost for this tutorial is $0.15. If you are a new GCP user, you might be eligible for a free trial.

Before you begin

This tutorial assumes that you have run the previous two tutorials in the series and that you have an existing GCP project and a Cloud Storage bucket. It also assumes you've installed the model according to the instructions in Part 1.

For this tutorial you must also enable the BigQuery API:

  1. Select the GCP project you used for previous parts of the tutorial:

    Go to the Projects Page

  2. Enable the BigQuery API.

    Enable the API

Using Google Analytics events for recommendations

This section provides an overview of using Google Analytics events for recommendations. You don't need an existing Google Analytics account to use this tutorial: sample Google Analytics data is provided, along with instructions for uploading that data to BigQuery.

Read this section if you have an existing Google Analytics 360 account and want to configure it to provide recommendation data.

Review Google Analytics event fields

Four fields of Google Analytics click events are relevant to recommendations:

  • Timestamp—the timestamp of the event. The click events you are interested in for recommendations are referred to as page tracking hits in Google Analytics. The timestamps for all hits are automatically stored by Google Analytics.

  • User ID—the unique ID of the user, or the unique client ID. By default, Google Analytics tracks users by using a cookie that stores a unique identifier for the browser client used in the web session. This tutorial assumes that user identification is based on this client ID. In the BigQuery schema for Google Analytics 360, the client ID is stored in the fullVisitorId column. You can achieve more accurate recommendations by enabling the user ID feature of Google Analytics, which tracks user interactions across all devices and clients. If you want to use the user ID, replace the references to fullVisitorId with userId in the SQL query used in the section on preparing the data later in this document.

  • Article ID—the unique ID of the article. The recommended way to capture the article ID is by configuring a custom dimension in Google Analytics, as discussed in the next section. You can also use the URL of the article as its unique identifier, as long as the URL doesn't change over time. If you use the URL as the unique identifier, replace articleId with hits.page.pagePath in the SQL query below.

  • Time On Page—the amount of time in seconds that the user spent viewing the article. Time On Page is not explicitly tracked in page tracking events for Google Analytics. This tutorial uses a common assumption to calculate the time on page: subtract the time of the page tracking event in which the user accessed the page from the time of the next page tracking event for that user. We provide more detail in the section on preparing the data later in this document.
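
To make the time-on-page assumption concrete, the following sketch shows the same calculation in Python on a small, hypothetical list of page tracking hits for one user. The tutorial itself performs this calculation in SQL, as shown in the section on preparing the data.

# A minimal sketch of the time-on-page assumption, using hypothetical hits.
# Each hit is (timestamp in seconds, article ID), sorted by time.
hits = [
    (0, "article-a"),     # the user opens article-a
    (40, "article-b"),    # 40 seconds later the user opens article-b
    (130, "article-c"),   # 90 seconds later the user opens article-c
]

# Time on page for each hit is the time of the next hit minus the time of
# this hit. The last hit has no successor, so its time on page is unknown.
time_on_page = {}
for (t, article), (next_t, _) in zip(hits, hits[1:]):
    time_on_page[article] = next_t - t

print(time_on_page)  # {'article-a': 40, 'article-b': 90}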

Configure Google Analytics to capture events

Follow the Google Analytics documentation on custom dimensions to add an article ID dimension with hit-level scope to page tracking events. Take note of the index of the custom dimension, because it's needed for the SQL query that you use later to export the event data from BigQuery. You should also create a view in Google Analytics that corresponds to the properties that you want to generate recommendations for.

Transfer Google Analytics data to BigQuery

If you use Google Analytics 360, you can import page tracking event data directly to a BigQuery database in a GCP project. For information on how to set this up, see the Google Analytics documentation for BigQuery export.

The event data is transferred daily from Google Analytics into a dataset in BigQuery that corresponds to the Google Analytics view that you choose in the export setup. A table is exported into the dataset for each day's worth of data.
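
If you want to confirm from a script that the daily export tables are arriving, you can list them with the BigQuery client library. The following is a minimal sketch that assumes the google-cloud-bigquery Python library and a hypothetical dataset name; it is not part of the tutorial code.

from google.cloud import bigquery

# Sketch: list the daily ga_sessions_YYYYMMDD tables written by the
# Google Analytics 360 BigQuery export. "my_ga360_export" is a hypothetical
# dataset name; replace it with the dataset created by your export setup.
client = bigquery.Client()
dataset_ref = bigquery.DatasetReference(client.project, "my_ga360_export")
daily_tables = sorted(
    t.table_id for t in client.list_tables(dataset_ref)
    if t.table_id.startswith("ga_sessions_"))
print("Found {} daily export tables".format(len(daily_tables)))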

Preparing the data from BigQuery for training

In this section, you upload sample data to BigQuery and you export the training data.

Upload the sample Google Analytics data

This tutorial comes with a sample Google Analytics dataset that contains page tracking events from the Austrian news site Kurier.at. The schema file ga_sessions_sample_schema.json is located in the data folder in the tutorial code, and the data file ga_sessions_sample.json.gz is located in a public Cloud Storage bucket associated with this tutorial.

To upload this dataset to BigQuery:

  1. In your shell, set the BUCKET environment variable to the URL of the Cloud Storage bucket that you created in Part 2:

    BUCKET=gs://[YOUR_BUCKET_NAME]
  2. Copy the data file ga_sessions_sample.json.gz to this bucket:

    gsutil cp gs://solutions-public-assets/recommendation-tensorflow/data/ga_sessions_sample.json.gz ${BUCKET}/data/ga_sessions_sample.json.gz
  3. Go to the BigQuery page.

    Go to the BigQuery page

  4. Choose an existing dataset in your project, or follow the BigQuery instructions to create a new dataset.

  5. In the navigation panel, hold the mouse pointer over the dataset name, click the down arrow icon, and then click Create new table.

  6. On the Create Table page, in the Source Data section, do the following:

    • For Location, select Google Cloud Storage, and then enter the following path:

      [YOUR_BUCKET_NAME]/data/ga_sessions_sample.json.gz

      Note that this path does not include the gs:// prefix.

    • For File format, select JSON.

  7. In the Destination Table section of the Create Table page, do the following:

    • For Table name, choose the dataset, and in the table name field, enter the name of the table (for example, ga_sessions_sample).
    • Verify that Table type is set to Native table.
  8. In a text editor, open the ga_sessions_sample_schema.json file, select all the text, and then copy the complete text of the file.

  9. In the Schema section, click Edit as text, and then paste the table schema into the text field:

    Screenshot of the JSON schema

  10. Click Create Table.
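
If you prefer to script this upload instead of using the console, the following sketch performs the same load with the google-cloud-bigquery Python client library. It assumes a local copy of the schema file and uses the dataset and table names referenced later in this tutorial; it is an illustration, not part of the tutorial code.

from google.cloud import bigquery

# Sketch: load the sample Google Analytics data from Cloud Storage into
# BigQuery, equivalent to the console steps above.
client = bigquery.Client()
schema = client.schema_from_json("data/ga_sessions_sample_schema.json")

job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

load_job = client.load_table_from_uri(
    "gs://[YOUR_BUCKET_NAME]/data/ga_sessions_sample.json.gz",
    "GA360_sample.ga_sessions_sample",  # dataset.table used later in this tutorial
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
print(client.get_table("GA360_sample.ga_sessions_sample").num_rows, "rows loaded")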

Export the training data

The model code takes a CSV file as input, with a header row that contains three columns:

  • clientId
  • contentId
  • timeOnPage

The following steps assume that you imported the sample dataset into a BigQuery dataset named GA360_sample, in a table named ga_sessions_sample. You can create the CSV training data file as follows.

To begin, run a SQL query to extract those columns from the page tracking event table in BigQuery, and save the output of the query to a table:

  1. Go to the BigQuery page.

    Go to the BigQuery page

  2. Click the Compose query button.

  3. In the New Query text area, enter the following query:

    #legacySql
    SELECT
     fullVisitorId as clientId,
     ArticleID as contentId,
     (nextTime - hits.time) as timeOnPage,
    FROM(
      SELECT
        fullVisitorId,
        hits.time,
        MAX(IF(hits.customDimensions.index=10,
               hits.customDimensions.value,NULL)) WITHIN hits AS ArticleID,
        LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitNumber
      ORDER BY hits.time ASC) as nextTime
      FROM [GA360_sample.ga_sessions_sample]
      WHERE hits.type = "PAGE"
    ) HAVING timeOnPage is not null and contentId is not null;

    The article ID in the sample dataset is stored in a custom dimension with an index of 10; the inner SELECT clause uses that index to extract the article ID from each page tracking hit. Also note that time on page is calculated by subtracting the time of each page hit from the time of the next page hit for the same fullVisitorId.

  4. Click the Show Options button.

  5. Click the Select Table button in the Destination Table section.

  6. Select the same dataset that you imported your data to.

  7. Enter a table ID (for example, recommendation_events) and click OK.

  8. Click the Run query button.
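
If you prefer to run this step from a script, the following sketch submits the same legacy SQL query through the google-cloud-bigquery Python client library, with the destination table set in the job configuration (mirroring the options you set in steps 4 through 7). The snippet is only an illustration, not part of the tutorial code.

from google.cloud import bigquery

# Sketch: run the legacy SQL query above and write the result to the
# recommendation_events table.
QUERY = """
SELECT
 fullVisitorId as clientId,
 ArticleID as contentId,
 (nextTime - hits.time) as timeOnPage,
FROM(
  SELECT
    fullVisitorId,
    hits.time,
    MAX(IF(hits.customDimensions.index=10,
           hits.customDimensions.value,NULL)) WITHIN hits AS ArticleID,
    LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitNumber
                             ORDER BY hits.time ASC) as nextTime
  FROM [GA360_sample.ga_sessions_sample]
  WHERE hits.type = "PAGE"
) HAVING timeOnPage is not null and contentId is not null
"""

client = bigquery.Client()
destination = bigquery.DatasetReference(client.project, "GA360_sample").table(
    "recommendation_events")
job_config = bigquery.QueryJobConfig(
    use_legacy_sql=True,
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrite on re-runs
)
client.query(QUERY, job_config=job_config).result()  # wait for the query to finish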

The next step is to export the destination table to a CSV file in the Cloud Storage bucket that you created in Part 2.

  1. Go to the BigQuery page.

    Go to the BigQuery page

  2. In the navigation pane, find the dataset that you uploaded the sample data to, and then expand to display its contents.

    Screenshot showing the recommendation events dataset in the console

  3. Click the down arrow icon next to the destination table from the previous step.

  4. Select Export table. The Export to Google Cloud Storage dialog is displayed.

  5. Leave the default settings—make sure that Export format is set to CSV and Compression is set to None.

  6. In the Google Cloud Storage URI box, enter a URI in the following format:

    gs://[YOUR_BUCKET_NAME]/ga_pageviews.csv

    For [YOUR_BUCKET_NAME], use the name of the Cloud Storage bucket you created in Part 2.

  7. Click OK to export the table.

    While the job is running, (extracting) appears next to the name of the table in the navigation.

To check the progress of the job, look in the Job History section near the top of the navigation pane for an Extract job.
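
You can also run this export from a script. The following is a minimal sketch using the google-cloud-bigquery Python client library, with the same defaults as the console export (CSV format, no compression); it is an illustration, not part of the tutorial code.

from google.cloud import bigquery

# Sketch: export the recommendation_events table to a CSV file in the
# Cloud Storage bucket created in Part 2, equivalent to the console export.
client = bigquery.Client()
source = bigquery.DatasetReference(client.project, "GA360_sample").table(
    "recommendation_events")
extract_job = client.extract_table(
    source,
    "gs://[YOUR_BUCKET_NAME]/ga_pageviews.csv",
)
extract_job.result()  # wait for the extract job to complete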

Training the recommendation model

  1. Navigate to the model code that you installed in Part 1 of this series.
  2. Make sure the BUCKET shell variable is set to the Cloud Storage bucket that you created in Part 2:

    BUCKET="gs://[YOUR_BUCKET_NAME]"
  3. Run the mltrain.sh script, passing the path to the CSV file that you exported from BigQuery and specifying the parameter --data-type web_views.

    For example, to train the model on AI Platform using a CSV file at the following location:

    gs://[YOUR_BUCKET_NAME]/ga_pageviews.csv

    Use the following command:

    ./mltrain.sh train $BUCKET ga_pageviews.csv --data-type web_views

When the training is finished, the model data is saved in a subdirectory named model under the job directory of the training task. This data consists of several arrays, all saved in numpy format:

  • The user factor matrix, which is in row.npy
  • The column factor matrix, which is in col.npy
  • The mapping between rating matrix row index and client IDs, which is in user.npy
  • The mapping between rating matrix column index and article IDs, which is in item.npy

For an explanation of the row and column factor matrices, see the overview article in this series.

The path for the job directory is constructed from the BUCKET argument passed to the mltrain.sh script, followed by /jobs/ and the identifier of the training job. The job identifier is also set in the mltrain.sh script; by default, it is wals_ml_train followed by the job start date and time. For example, if you specified a BUCKET of gs://my_bucket, the model files would be saved to paths like these:

gs://my_bucket/jobs/wals_ml_train_20171201_120001/model/row.npy
gs://my_bucket/jobs/wals_ml_train_20171201_120001/model/col.npy
gs://my_bucket/jobs/wals_ml_train_20171201_120001/model/user.npy
gs://my_bucket/jobs/wals_ml_train_20171201_120001/model/item.npy

Tuning model hyperparameters for Google Analytics data

Part 2 of this tutorial series showed you how to perform parameter tuning for the model using the AI Platform hyperparameter tuning feature. For the most accurate recommendations, this process should be repeated for the Google Analytics dataset, because page-view times have a very different scale and distribution than the MovieLens ratings used in Part 2.

To perform hyperparameter tuning on the Google Analytics data, run the mltrain.sh script with the tune option, passing the path to your exported data file, and the parameter --data-type web_views.

./mltrain.sh tune $BUCKET ga_pageviews.csv --data-type web_views

The mltrain.sh script uses the config/config_tune_web.json configuration file for tuning page view data. The feature weight factor is not part of the tuning parameters, because the exponential observed weight is used for page view data. For more information about tuning configuration, see Part 2 of this tutorial series.

The following table summarizes the results of hyperparameter tuning for the sample dataset.

Hyperparameter name   Description                   Value from tuning
latent_factors        Latent factors K              30
regularization        L2 regularization constant    5.05
unobs_weight          Unobserved weight             0.01
feature_wt_factor     Observed weight (linear)      N/A
feature_wt_exp        Feature weight exponent       5.05

Table 1. Values discovered by AI Platform hyperparameter tuning for the sample Google Analytics data

Running the model code to generate recommendations

The tutorial code file model.py contains a method named generate_recommendations, which can be used to generate a set of recommendations using a trained model. The theoretical basis for generating predictions from the row and column factors is explained in the overview article of this series.

The generate_recommendations method takes the following arguments:

  • The row index of the user in the rating matrix.
  • A list of indexes for items that the user has previously rated (that is, the item indexes of articles that the user has previously viewed).
  • The row and column factors generated by training the model.
  • The number of desired recommendations.

The row index of the user in the rating matrix is not the same as the client ID. To retrieve the row index of a user, look it up in the array that maps client IDs from Google Analytics to rating matrix row indexes (user.npy). Similarly, the column indexes of articles in the rating matrix are not the same as article IDs, and must be looked up in the array that maps article IDs to column indexes (item.npy).

The generate_recommendations method produces recommendations by generating predicted ratings—in this case, predicted page view times—for pages that the user has not viewed previously. The predictions are sorted and the top K are returned, where K is the number of desired recommendations.
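
The following sketch illustrates that prediction step. It is a simplified outline of the idea, not the exact implementation in the tutorial's model.py; row_factor and col_factor are the trained factor matrices described above.

import numpy as np

def sketch_generate_recommendations(user_idx, user_rated, row_factor, col_factor, k):
    # The predicted page view time for every article is the dot product of the
    # user's latent factor vector with that article's latent factor vector.
    predictions = col_factor.dot(row_factor[user_idx])
    # Exclude articles the user has already viewed.
    predictions[user_rated] = -np.inf
    # Return the column indexes of the k highest predictions.
    return np.argsort(-predictions)[:k].tolist()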

Example of generating recommendations

Suppose you have the model data stored in the /tmp/model directory. You want to generate 5 recommendations for client ID 1000163602560555666, a user who has previously viewed the articles 295436355, 295044773, and 295195092. The following Python code generates the recommendations:

import numpy as np
from model import generate_recommendations

# The client ID, the article IDs the user has already viewed, and the number
# of recommendations to generate.
client_id = 1000163602560555666
already_rated = [295436355, 295044773, 295195092]
k = 5

# Load the saved model data: the ID-to-index mappings and the factor matrices.
user_map = np.load("/tmp/model/user.npy")
item_map = np.load("/tmp/model/item.npy")
row_factor = np.load("/tmp/model/row.npy")
col_factor = np.load("/tmp/model/col.npy")

# Map the client ID and the viewed article IDs to rating matrix indexes.
user_idx = np.searchsorted(user_map, client_id)
user_rated = [np.searchsorted(item_map, i) for i in already_rated]

recommendations = generate_recommendations(user_idx, user_rated, row_factor, col_factor, k)

The user index and the column indexes of the previously viewed articles are looked up with the np.searchsorted method on the user and item map arrays. The returned value recommendations is a list of column indexes for articles, which can be mapped to article IDs using the item map:

article_recommendations = [item_map[i] for i in recommendations]

Generate recommendations in production

The remaining components needed for a production system that serves recommendations based on Google Analytics data are the following:

  • Training a recommendation model on a regular schedule—for example, nightly.
  • Serving the recommendations using an API.

In Part 4 of this series you will see how to deploy a production system on GCP to perform those tasks.
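
As a rough illustration of the second item only, the following sketch exposes generate_recommendations behind a minimal HTTP endpoint using Flask. This is a hypothetical outline with illustrative paths and parameter names, not the production design; Part 4 shows how these pieces are actually deployed on GCP.

import numpy as np
from flask import Flask, jsonify, request

from model import generate_recommendations

app = Flask(__name__)

# Load the trained model arrays once at startup (paths are illustrative).
user_map = np.load("model/user.npy")
item_map = np.load("model/item.npy")
row_factor = np.load("model/row.npy")
col_factor = np.load("model/col.npy")

@app.route("/recommendation")
def recommendation():
    client_id = int(request.args.get("userId"))
    k = int(request.args.get("numRecs", 5))
    user_idx = np.searchsorted(user_map, client_id)
    # A real system would also look up the articles this user has already
    # viewed; an empty list keeps the sketch short.
    recs = generate_recommendations(user_idx, [], row_factor, col_factor, k)
    return jsonify(articles=[str(item_map[i]) for i in recs])

if __name__ == "__main__":
    app.run(port=8080)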

Acknowledgements

Thanks to e-dialog for their assistance in preparing certain technical content for this article.

Cleaning up

If you want to continue in the series, no cleanup is necessary. Proceed to Part 4 of the tutorial.

If you don't want to continue this series, you should delete the resources you created in order to avoid incurring charges to your GCP account.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the project you want to delete and click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

  • Continue to Part 4 of this series to learn how to deploy a production system on GCP that trains the model on a regular schedule and serves the recommendations through an API.
