This article is the third part of a multi-part tutorial series that shows you how to implement a machine learning (ML) recommendation system with TensorFlow and AI Platform in Google Cloud Platform (GCP). This part shows you how to apply the TensorFlow model to data from Google Analytics 360 to produce content recommendations for a website.
The series consists of these parts:
- Overview
- Create the model (Part 1)
- Train and Tune on AI Platform (Part 2)
- Apply to Data from Google Analytics (Part 3) (this tutorial)
- Deploy the Recommendation System (Part 4)
This tutorial assumes that you have completed the preceding tutorials in the series.
This tutorial shows you how to use the TensorFlow WALS model to produce recommendations for a content website, based on the following:
- The input data for the recommendation technique are events that track user behavior.
- Each event represents the interaction of a user with an item on a website.
- For content websites, the relevant events are those in which a user selects an article to read by clicking a link to the article.
- These events can be captured using Google Analytics.
The following diagram shows the components you'll use in this tutorial.
Objectives
- Prepare Google Analytics data from BigQuery for training a recommendation model.
- Train the recommendation model.
- Tune model hyperparameters for Google Analytics data.
- Run the TensorFlow model code on Google Analytics data to generate recommendations.
Costs
This tutorial uses Cloud Storage, AI Platform, and BigQuery, which are billable services. You can use the pricing calculator to estimate the costs for your projected usage. The projected cost for this tutorial is $0.15. If you are a new GCP user, you might be eligible for a free trial.
Before you begin
This tutorial assumes that you have run the previous two tutorials in the series and that you have an existing GCP project and a Cloud Storage bucket. It also assumes you've installed the model according to the instructions in Part 1.
For this tutorial, you must also enable the BigQuery API:

- Select the GCP project that you used for previous parts of the tutorial.
- Enable the BigQuery API.
Using Google Analytics events for recommendations
This section provides an overview of using Google Analytics events for recommendations. You don't have to have an existing Google Analytics account to use this tutorial. Sample Google Analytics data is provided, along with instructions for uploading that data to BigQuery.
Read this section if you have an existing Google Analytics 360 account and want to configure it to provide recommendation data.
Review Google Analytics event fields
Four fields of Google Analytics click events are relevant to recommendations:

- Timestamp: the timestamp of the event. The click events you are interested in for recommendations are referred to as page tracking hits in Google Analytics. The timestamps for all hits are automatically stored by Google Analytics.
- User ID: the unique ID of the user, or the unique client ID. By default, Google Analytics tracks users by using a cookie, which stores a unique identifier for the browser client used in the web session. This tutorial assumes that user identification is based on this client ID. In the BigQuery schema for Google Analytics 360, the client ID is stored in the `fullVisitorId` column. More accurate recommendations can be achieved by enabling the user ID feature of Google Analytics, which tracks user interactions across all devices and clients. If you want to use the user ID, replace the references to `fullVisitorId` with `userId` in the SQL query used in the section on preparing the data later in this document.
- Article ID: the unique ID of the article. The recommended way to capture the article ID is by configuring a custom dimension in Google Analytics, as discussed in the next section. You can also use the URL of the article as its unique identifier, as long as the URL doesn't change over time. If you use the URL as the unique identifier, replace `articleId` with `hits.page.pagePath` in the SQL query below.
- Time on page: the amount of time in seconds that the user spent viewing the article. Time on page is not explicitly tracked in page tracking events for Google Analytics. This tutorial uses a common assumption to calculate it: subtract the time of the page tracking event in which the user accessed the page from the time of the next page tracking event for that user. We provide more detail in the section on preparing the data later in this document, and a short sketch of this calculation follows the list.
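To make the time-on-page assumption concrete, the following minimal Python sketch derives time on page from a few hypothetical page tracking hits. It is an illustration only, not part of the tutorial code; the hit tuples and millisecond timestamps are invented for the example.

```python
from collections import defaultdict

# Hypothetical page tracking hits: (client_id, hit_time_ms, article_id).
hits = [
    ("client-a", 0, "article-1"),
    ("client-a", 45000, "article-2"),
    ("client-a", 110000, "article-3"),
]

# Group hits by client, sort each client's hits by time, then subtract
# consecutive timestamps to estimate time on page.
by_client = defaultdict(list)
for client_id, hit_time, article_id in hits:
    by_client[client_id].append((hit_time, article_id))

time_on_page = []  # (client_id, article_id, seconds)
for client_id, client_hits in by_client.items():
    client_hits.sort()
    for (time_ms, article_id), (next_time_ms, _) in zip(client_hits, client_hits[1:]):
        time_on_page.append((client_id, article_id, (next_time_ms - time_ms) / 1000.0))

# The last hit for each client has no following hit, so it produces no
# time-on-page value, which matches the "timeOnPage is not null" filter
# in the SQL query later in this document.
print(time_on_page)
```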
Configure Google Analytics to capture events
Follow the Google Analytics documentation on custom dimensions to add an article ID dimension with hit-level scope to page tracking events. Take note of the index of the custom dimension, because it's needed for the SQL query that you use later to export the event data from BigQuery. You should also create a view in Google Analytics that corresponds to the properties that you want to generate recommendations for.
Transfer Google Analytics data to BigQuery
If you use Google Analytics 360, you can import page tracking event data directly to a BigQuery database in a GCP project. For information on how to set this up, see the Google Analytics documentation for BigQuery export.
The event data is transferred daily from Google Analytics into a dataset in BigQuery that corresponds to the Google Analytics view that you choose in the export setup. A table is exported into the dataset for each day's worth of data.
Preparing the data from BigQuery for training
In this section, you upload sample data to BigQuery and you export the training data.
Upload the sample Google Analytics data
This tutorial comes with a sample Google Analytics dataset that contains page tracking events from the Austrian news site Kurier.at. The schema file `ga_sessions_sample_schema.json` is located in the `data` folder in the tutorial code, and the data file `ga_sessions_sample.json.gz` is located in a public Cloud Storage bucket associated with this tutorial.
To upload this dataset to BigQuery:
- In your shell, set the `BUCKET` environment variable to the URL of the Cloud Storage bucket that you created in Part 2:

```
BUCKET=gs://[YOUR_BUCKET_NAME]
```

- Copy the data file `ga_sessions_sample.json.gz` to this bucket:

```
gsutil cp gs://solutions-public-assets/recommendation-tensorflow/data/ga_sessions_sample.json.gz ${BUCKET}/data/ga_sessions_sample.json.gz
```

- Go to the BigQuery page.
- Choose an existing dataset in your project, or follow the BigQuery instructions to create a new dataset.
- Click Create Table.
- On the Create Table page, in the Source section, do the following:
  - For Create table from, select Google Cloud Storage, and then enter the following path: `gs://[YOUR_BUCKET_NAME]/data/ga_sessions_sample.json.gz`
  - For File format, select JSONL.
- In the Destination section of the Create Table page, do the following:
  - For Table name, choose the dataset, and in the table name field, enter the name of the table (for example, `ga_sessions_sample`).
  - Verify that Table type is set to Native table.
- In a text editor, open the `ga_sessions_sample_schema.json` file, select all the text, and then copy the complete text of the file.
- In the Schema section, click Edit as text, and then paste the table schema into the text field.
- Click Create Table.
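If you prefer to script the upload instead of using the console, the following sketch performs the same load by using the google-cloud-bigquery Python client. The project, dataset, and file paths here are assumptions for illustration; adjust them to match your environment.

```python
from google.cloud import bigquery

client = bigquery.Client()  # Uses your default project and credentials.

# Assumed names for illustration; adjust to your project, dataset, and bucket.
table_id = "your-project.GA360_sample.ga_sessions_sample"
source_uri = "gs://[YOUR_BUCKET_NAME]/data/ga_sessions_sample.json.gz"

# Read the table schema from the schema file that ships with the tutorial code,
# assuming you run this from the tutorial code directory.
schema = client.schema_from_json("data/ga_sessions_sample_schema.json")

job_config = bigquery.LoadJobConfig(
    schema=schema,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to finish.
print("Loaded {} rows.".format(client.get_table(table_id).num_rows))
```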
Export the training data
The model code takes a CSV file as input, with a header row that contains three columns:

- `clientId`
- `contentId`
- `timeOnPage`
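For example, the first few lines of such a file might look like the following. The client ID and article IDs come from the sample dataset used later in this tutorial, and the `timeOnPage` values are purely illustrative.

```
clientId,contentId,timeOnPage
1000163602560555666,295436355,142
1000163602560555666,295044773,61
```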
To create training data, assume you imported the sample dataset into a dataset named `GA360_sample`, in a table named `ga_sessions_sample`. You can create the CSV training data file as follows.
To begin, run a SQL query to extract those columns from the page tracking event table in BigQuery, and save the output of the query to a table:
- Go to the BigQuery page.
- Click the Compose query button.
- In the New Query text area, enter the following query:

```sql
#legacySQL
SELECT
  fullVisitorId as clientId,
  ArticleID as contentId,
  (nextTime - hits.time) as timeOnPage,
FROM(
  SELECT
    fullVisitorId,
    hits.time,
    MAX(IF(hits.customDimensions.index=10,
           hits.customDimensions.value, NULL)) WITHIN hits AS ArticleID,
    LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitNumber
                             ORDER BY hits.time ASC) as nextTime
  FROM [GA360_sample.ga_sessions_sample]
  WHERE hits.type = "PAGE"
)
HAVING timeOnPage is not null and contentId is not null;
```

The article ID in the sample dataset is stored in a custom dimension with an index of 10, which is used in the inner SELECT clause to filter the events. Also note that time on page is calculated by subtracting the time of each page hit from the time of the next hit for the same `fullVisitorId`.

- Click the Show Options button.
- Click the Select Table button in the Destination Table section.
- Select the same dataset that you imported your data to.
- Enter a table ID (for example, `recommendation_events`), and then click OK.
- Click the Run query button.
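As an alternative to the preceding console steps, you can submit the same query with a destination table from Python by using the google-cloud-bigquery client. The following is a sketch only; the project and dataset names are assumptions that you adjust to match your environment.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumed destination table for the extracted events; adjust to your project and dataset.
destination = bigquery.TableReference.from_string(
    "your-project.GA360_sample.recommendation_events")

# The same legacy SQL query shown above, embedded as a string (without the
# #legacySQL prefix, because the dialect is set in the job configuration below).
query = """
SELECT fullVisitorId as clientId, ArticleID as contentId,
       (nextTime - hits.time) as timeOnPage,
FROM(
  SELECT fullVisitorId, hits.time,
         MAX(IF(hits.customDimensions.index=10,
                hits.customDimensions.value, NULL)) WITHIN hits AS ArticleID,
         LEAD(hits.time, 1) OVER (PARTITION BY fullVisitorId, visitNumber
                                  ORDER BY hits.time ASC) as nextTime
  FROM [GA360_sample.ga_sessions_sample]
  WHERE hits.type = "PAGE"
)
HAVING timeOnPage is not null and contentId is not null
"""

job_config = bigquery.QueryJobConfig(
    use_legacy_sql=True,      # The query above uses legacy SQL syntax.
    destination=destination,  # Save the results to a table instead of returning them.
)

client.query(query, job_config=job_config).result()  # Wait for the query to finish.
```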
The next step is to export the destination table to a CSV file in the Cloud Storage bucket that you created in Part 2.
- Go to the BigQuery page.
- In the navigation pane, find the dataset that you uploaded the sample data to, and then expand it to display its contents.
- Click the down arrow icon next to the destination table from the previous step.
- Select Export table. The Export to Google Cloud Storage dialog is displayed.
- Leave the default settings—make sure that Export format is set to CSV and Compression is set to None.
- In the Google Cloud Storage URI box, enter a URI in the following format:

```
gs://[YOUR_BUCKET_NAME]/ga_pageviews.csv
```

For `[YOUR_BUCKET_NAME]`, use the name of the Cloud Storage bucket that you created in Part 2.

- Click OK to export the table.
While the job is running, (extracting) appears next to the name of the table in the navigation.
To check on the progress of the job, look near the top of the navigation for Job History for an Extract job.
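If you prefer to script this export, the BigQuery Python client can run the same extract job. This is a sketch only; the project, dataset, and table names are assumptions that you adjust to match your environment.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumed names for illustration; adjust to your project, dataset, and bucket.
source_table = bigquery.TableReference.from_string(
    "your-project.GA360_sample.recommendation_events")
destination_uri = "gs://[YOUR_BUCKET_NAME]/ga_pageviews.csv"

# CSV with no compression is the default format for extract jobs.
extract_job = client.extract_table(source_table, destination_uri)
extract_job.result()  # Wait for the export to finish.
```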
Training the recommendation model
- Navigate to the model code that you installed in Part 1 of this series.
- Make sure that the `BUCKET` shell variable is set to the Cloud Storage bucket that you created in Part 2:

```
BUCKET="gs://[YOUR_BUCKET_NAME]"
```

- Run the `mltrain.sh` script, passing the path to the CSV file exported from BigQuery and the `--data-type web_views` parameter. For example, to train the model on AI Platform using the CSV file at `gs://[YOUR_BUCKET_NAME]/ga_pageviews.csv`, use the following command:

```
./mltrain.sh train $BUCKET ga_pageviews.csv --data-type web_views
```
When the training is finished, the model data is saved in a subdirectory named `model` under the job directory of the training task. This data consists of several arrays, all saved in numpy format:

- The user factor matrix, which is in `row.npy`
- The column factor matrix, which is in `col.npy`
- The mapping between rating matrix row index and client IDs, which is in `user.npy`
- The mapping between rating matrix column index and article IDs, which is in `item.npy`
For an explanation of the row and column factor matrices, see the overview article in this series.
The path for the job directory is created using the `BUCKET` argument passed to the `mltrain.sh` script, then `/jobs/`, and then the identifier of the training job. The job identifier is set in the `mltrain.sh` script as well. By default, that identifier is `wals_ml_train` appended with the job start date and time. For example, if you specified a `BUCKET` of `gs://my_bucket`, the model files would be saved to paths like these:

```
gs://my_bucket/jobs/wals_ml_train_20171201_120001/model/row.npy
gs://my_bucket/jobs/wals_ml_train_20171201_120001/model/col.npy
gs://my_bucket/jobs/wals_ml_train_20171201_120001/model/user.npy
gs://my_bucket/jobs/wals_ml_train_20171201_120001/model/item.npy
```
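To spot-check a finished training job before you generate recommendations, you can copy the saved arrays to a local directory and inspect their shapes. The following is a minimal sketch; the job directory name is a placeholder that you replace with the path of your actual training job.

```python
import os
import subprocess

import numpy as np

# Placeholder job directory; substitute the path of your own training job.
job_dir = "gs://my_bucket/jobs/wals_ml_train_20171201_120001"

# Copy the saved model arrays to a local directory.
os.makedirs("/tmp/model", exist_ok=True)
subprocess.run(
    ["gsutil", "-m", "cp", job_dir + "/model/*.npy", "/tmp/model/"], check=True)

user_map = np.load("/tmp/model/user.npy")    # rating matrix row index -> client ID
item_map = np.load("/tmp/model/item.npy")    # rating matrix column index -> article ID
row_factor = np.load("/tmp/model/row.npy")   # one latent factor vector per user
col_factor = np.load("/tmp/model/col.npy")   # one latent factor vector per article

print("users:", user_map.shape, "items:", item_map.shape)
print("row factors:", row_factor.shape, "col factors:", col_factor.shape)
```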
Tuning model hyperparameters for Google Analytics data
Part 2
of this tutorial series showed you how to perform parameter tuning for
the model using the AI Platform hyperparameter tuning feature. For the
most accurate recommendations, this process should be repeated for the Google
Analytics dataset, because page-view times have a very different scale and
distribution than the MovieLens
ratings used in Part 2.
To perform hyperparameter tuning on the Google Analytics data, run the `mltrain.sh` script with the `tune` option, passing the path to your exported data file and the `--data-type web_views` parameter:

```
./mltrain.sh tune $BUCKET ga_pageviews.csv --data-type web_views
```
The mltrain.sh
script uses the config/config_tune_web.json
configuration
file for tuning page view data. The feature weight factor is not part of the
tuning parameters, because the exponential observed weight is used for page
view data. For more information about tuning configuration, see
Part 2
of this tutorial series.
The following table summarizes the results of hyperparameter tuning for the sample dataset.
| Hyperparameter Name | Description | Value From Tuning |
|---|---|---|
| `latent_factors` | Latent factors K | 30 |
| `regularization` | L2 Regularization constant | 5.05 |
| `unobs_weight` | Unobserved weight | 0.01 |
| `feature_wt_factor` | Observed weight (linear) | N/A |
| `feature_wt_exp` | Feature weight exponent | 5.05 |
Table 1 Values discovered by AI Platform hyperparameter tuning for the sample Google Analytics data
Running the model code to generate recommendations
The tutorial code file model.py
contains a method named
generate_recommendations
, which can be used to generate a set of
recommendations using a trained model. The theoretical basis for generating
predictions from the row and column factors is explained in the
overview
article of this series.
The generate_recommendations
method takes the following arguments:
- The row index of the user in the rating matrix.
- A list of indexes for items that the user has previously rated (that is, the item indexes of articles that the user has previously viewed).
- The row and column factors generated by training the model.
- The number of desired recommendations.
The row index of the user in the rating matrix is not the same as the client ID. In order to retrieve the row index of the user, you can perform a lookup in the array mapping between the client IDs from Google Analytics and the rating matrix row indexes. Similarly, the column indices of articles in the ratings matrix are not the same as article IDs, and must be looked up from the array mapping article IDs to column indexes.
The generate_recommendations
method produces recommendations by generating
predicted ratings—in this case, predicted page view times—for pages that the
user has not viewed previously. The predictions are sorted and the top K are
returned, where K is the number of desired recommendations.
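Conceptually, the predicted rating for each article is the dot product of the user's row factor with that article's column factor. The following minimal sketch illustrates that top-K selection; it is an illustration of the idea only, not the implementation of `generate_recommendations` in `model.py`.

```python
import numpy as np

def recommend_top_k_sketch(user_idx, user_rated, row_factor, col_factor, k):
    """Illustrative top-K recommendation from WALS factors (not the tutorial's code)."""
    # Predicted ratings for every article: dot product of the user's latent
    # vector with each article's latent vector.
    predictions = col_factor.dot(row_factor[user_idx])

    # Exclude articles that the user has already viewed.
    predictions[user_rated] = -np.inf

    # Return the column indexes of the k highest predicted ratings, best first.
    return np.argsort(predictions)[-k:][::-1].tolist()
```

In the tutorial code, this logic is provided for you by `generate_recommendations`, which you call directly, as shown in the next example.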
Example of generating recommendations
Suppose you have the model data stored in the /tmp/model
directory. You want
to generate 5 recommendations for client ID 1000163602560555666
, which
represents a user who has previously viewed the list of articles 295436355
,
295044773
, and 295195092
. The following Python code generates the
recommendations:
```python
import numpy as np
from model import generate_recommendations

client_id = 1000163602560555666
already_rated = [295436355, 295044773, 295195092]
k = 5

user_map = np.load("/tmp/model/user.npy")
item_map = np.load("/tmp/model/item.npy")
row_factor = np.load("/tmp/model/row.npy")
col_factor = np.load("/tmp/model/col.npy")

user_idx = np.searchsorted(user_map, client_id)
user_rated = [np.searchsorted(item_map, i) for i in already_rated]

recommendations = generate_recommendations(user_idx, user_rated,
                                           row_factor, col_factor, k)
```
The user index and the column indexes of the previously viewed articles are looked up by using the `np.searchsorted` method on the user and item map arrays. The returned value `recommendations` is a list of column indexes for articles, which can be mapped to article IDs by using the item map:

```python
article_recommendations = [item_map[i] for i in recommendations]
```
Generate recommendations in production
A production system that serves recommendations based on Google Analytics data also requires the following:
- Training a recommendation model on a regular schedule—for example, nightly.
- Serving the recommendations using an API.
In Part 4 of this series you will see how to deploy a production system on GCP to perform those tasks.
Acknowledgements
Thanks to e-dialog for their assistance in preparing certain technical content for this article.
Cleaning up
If you want to continue in the series, no cleanup is necessary. Proceed to Part 4 of the tutorial.
If you don't want to continue this series, you should delete the resources you created in order to avoid incurring charges to your GCP account.
Delete the project
The easiest way to eliminate billing is to delete the project that you created for the tutorial.
To delete the project:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- The next tutorial, Deploy the Recommendation System (Part 4), shows you how to deploy a production system on GCP to serve recommendations.
- Learn more about Machine Learning on GCP.