Learn how to build a recommendation system by using BigQuery ML to generate product or service recommendations from customer data in BigQuery. Then, learn how to make that data available to other production systems by exporting it to Google Analytics 360 or Cloud Storage, or by reading it programmatically from the BigQuery table.
This is the recommended approach when you have data that already resides in BigQuery. BigQuery ML lets you create a machine learning model, train it on customer data in BigQuery, and deploy it, all by using standard SQL queries. You avoid having to export your data to another product or build a model training and deployment pipeline, and BigQuery auto-scales to handle any compute resources you need.
The machine learning model you create in this tutorial uses matrix factorization, a common and effective method of creating a recommendation system based on user preference data. For more information about this approach, see matrix factorization.
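BigQuery ML handles the factorization for you, but the underlying idea can be sketched in a few lines of NumPy. Everything below is hypothetical (toy feedback matrix, factor count, learning rate), and plain gradient descent stands in for BigQuery ML's actual solver:

```python
# Sketch of matrix factorization on implicit feedback: approximate a
# user-item matrix R with the product of two low-rank factor matrices.
import numpy as np

rng = np.random.default_rng(0)

# Toy implicit-feedback matrix: rows = users, cols = items,
# values = e.g. seconds spent on a product page (0 = not observed).
R = np.array([
    [120.0,  30.0,   0.0],
    [  0.0,  45.0,  90.0],
    [ 60.0,   0.0,  15.0],
])
observed = R > 0

k = 2                                     # number of latent factors
U = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))   # item factors

lr, reg = 0.001, 0.01
for _ in range(2000):
    # Residuals on observed cells only; unobserved cells don't drive updates.
    E = np.where(observed, R - U @ V.T, 0.0)
    U += lr * (E @ V - reg * U)
    V += lr * (E.T @ U - reg * V)

pred = U @ V.T   # scores for every (user, item) pair
```

The useful part is that `pred` also scores the cells that were never observed, which is what lets the model recommend items a user has not yet seen.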
This tutorial uses the Google Analytics Sample dataset, which is hosted publicly on BigQuery. This dataset provides 12 months (August 2016 to August 2017) of obfuscated Analytics 360 data from the Google Merchandise Store, a real e-commerce store that sells Google-branded merchandise.
Objectives
- Process sample data into a format suitable for training a matrix factorization model.
- Create, train, and deploy a matrix factorization model.
- Get predictions from the deployed model about what products your customers are most likely to be interested in.
- Export prediction data from BigQuery to one or more other products for use in making recommendations to customers.
Costs
This tutorial uses the following billable components of Google Cloud:
- BigQuery
- BigQuery ML
To generate a cost estimate based on your projected usage,
use the pricing calculator.
Before you begin
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.
Process the sample data
When using matrix factorization, you evaluate explicit or implicit user feedback to determine customer preferences. To use explicit feedback, the dataset must contain data about user product preferences, like star ratings between 1 and 5. In cases where there isn't explicit feedback available, you must use other behavioral metrics to infer customer preferences, like looking at the total time a user spends on a product detail page. This is the approach used in this tutorial.
To train the matrix factorization model, you need a table with columns that identify the customer, the item being rated, and the implicit rating. In this section, you create such a table with the columns userId, itemId, and session_duration, where the session_duration column contains the duration of the user's session on the given item's product page.
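To make the duration logic concrete before writing it in SQL, here is a small pandas sketch on hypothetical hit-level data. The `userId` column below stands in for the user-session key that the SQL builds with CONCAT, and the numbers are made up:

```python
# Sketch: derive an implicit rating (time on page) from raw hit timestamps.
# Each pageview's duration is the gap to the next hit in the same session
# (the SQL LEAD() window function), summed per (user, item) pair.
import pandas as pd

hits = pd.DataFrame({
    "userId": ["u1", "u1", "u1", "u2", "u2"],
    "itemId": ["A",  "B",  "A",  "B",  "A"],
    "time":   [0,    40,   100,  0,    25],   # ms offsets within the session
})

# LEAD(time) OVER (PARTITION BY userId ORDER BY time) - time
hits = hits.sort_values(["userId", "time"])
hits["pageview_duration"] = (
    hits.groupby("userId")["time"].shift(-1) - hits["time"]
)
# The last hit of a session has no successor; the tutorial's SQL
# falls back to 1 via IFNULL.
hits["pageview_duration"] = hits["pageview_duration"].fillna(1)

# SUM(pageview_duration) ... GROUP BY userId, itemId
aggregate_web_stats = (
    hits.groupby(["userId", "itemId"], as_index=False)["pageview_duration"]
        .sum()
        .rename(columns={"pageview_duration": "session_duration"})
)
print(aggregate_web_stats)
```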
Follow these steps to create such a table using data from the Google Analytics Sample dataset:
- Open the Google Analytics Sample dataset in Google Cloud Marketplace and click View Dataset. This opens the BigQuery console with the Google Analytics Sample dataset selected.
- In the Resources section, select the project in which you are completing this tutorial.
- Click Create dataset.
- On the Create dataset page:
  - For Dataset ID, type `bqml`.
  - For Data location, choose the location closest to you.
  - Click Create dataset.
Follow this step only if you plan to export the data to Analytics 360; otherwise, skip it.

In the Query editor pane, run the following SQL statement to copy a subset of the Google Analytics Sample data into a new table and populate the `clientId` field, which you will use as the key to aggregate implicit user feedback in the next step:

```sql
CREATE OR REPLACE TABLE bqml.ga_clientid_sample AS (
  SELECT *
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_2017*`
  LIMIT 100000);

UPDATE bqml.ga_clientid_sample
SET clientId = fullvisitorId
WHERE true;
```
You must use `clientId` as the key field when importing audience data into Analytics 360. `clientId` is normally a hashed version of `fullVisitorId`, but it isn't populated in the Google Analytics Sample dataset. To populate `clientId` in your own Analytics 360 data, you would create a custom dimension and populate it.

Run the following SQL statement to create a table containing the training data. Run the version that uses the `clientId` field as the key if you plan to export the data to Analytics 360. Run the version that uses the `fullVisitorId` field as the key if you plan to use the data with other marketing systems.

**clientId**

```sql
CREATE OR REPLACE TABLE bqml.aggregate_web_stats AS (
  WITH
    durations AS (
      -- calculate pageview durations
      SELECT
        CONCAT(clientId, '-',
            CAST(visitNumber AS STRING), '-',
            CAST(hitNumber AS STRING)) AS visitorId_session_hit,
        LEAD(time, 1) OVER (
          PARTITION BY CONCAT(clientId, '-', CAST(visitNumber AS STRING))
          ORDER BY time ASC) - time AS pageview_duration
      FROM
        `bqml.ga_clientid_sample`,
        UNNEST(hits) AS hit
    ),
    prodview_durations AS (
      -- filter for product detail pages only
      SELECT
        CONCAT(clientId, '-', CAST(visitNumber AS STRING)) AS userId,
        productSKU AS itemId,
        IFNULL(dur.pageview_duration, 1) AS pageview_duration
      FROM
        `bqml.ga_clientid_sample` t,
        UNNEST(hits) AS hits,
        UNNEST(hits.product) AS hits_product
      JOIN durations dur
        ON CONCAT(clientId, '-',
            CAST(visitNumber AS STRING), '-',
            CAST(hitNumber AS STRING)) = dur.visitorId_session_hit
      WHERE eCommerceAction.action_type = "2"
    ),
    aggregate_web_stats AS (
      -- sum pageview durations by userId, itemId
      SELECT
        userId,
        itemId,
        SUM(pageview_duration) AS session_duration
      FROM prodview_durations
      GROUP BY userId, itemId
    )
  SELECT * FROM aggregate_web_stats
);
```

**fullVisitorId**

```sql
CREATE OR REPLACE TABLE bqml.aggregate_web_stats AS (
  WITH
    durations AS (
      -- calculate pageview durations
      SELECT
        CONCAT(fullVisitorId, '-',
            CAST(visitNumber AS STRING), '-',
            CAST(hitNumber AS STRING)) AS visitorId_session_hit,
        LEAD(time, 1) OVER (
          PARTITION BY CONCAT(fullVisitorId, '-', CAST(visitNumber AS STRING))
          ORDER BY time ASC) - time AS pageview_duration
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_2017*`,
        UNNEST(hits) AS hit
    ),
    prodview_durations AS (
      -- filter for product detail pages only
      SELECT
        CONCAT(fullVisitorId, '-', CAST(visitNumber AS STRING)) AS userId,
        productSKU AS itemId,
        IFNULL(dur.pageview_duration, 1) AS pageview_duration
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_2017*` t,
        UNNEST(hits) AS hits,
        UNNEST(hits.product) AS hits_product
      JOIN durations dur
        ON CONCAT(fullVisitorId, '-',
            CAST(visitNumber AS STRING), '-',
            CAST(hitNumber AS STRING)) = dur.visitorId_session_hit
      WHERE eCommerceAction.action_type = "2"
    ),
    aggregate_web_stats AS (
      -- sum pageview durations by userId, itemId
      SELECT
        userId,
        itemId,
        SUM(pageview_duration) AS session_duration
      FROM prodview_durations
      GROUP BY userId, itemId
    )
  SELECT * FROM aggregate_web_stats
);
```
Run the following SQL statement to see a sample of the data in the resulting `bqml.aggregate_web_stats` table:

```sql
SELECT *
FROM bqml.aggregate_web_stats
LIMIT 10;
```
You should see results similar to the following:
Purchase flex slots
If you use on-demand pricing for BigQuery, you must purchase flex slots and then create reservations and assignments for them in order to train a matrix factorization model. You can skip this section if you use flat-rate pricing with BigQuery.
You must have the `bigquery.reservations.create` permission in order to purchase flex slots. This permission is granted to the project owner, and also to the `bigquery.admin` and `bigquery.resourceAdmin` predefined Identity and Access Management roles.
- In the BigQuery console, click Reservations.
- If you are redirected to the BigQuery Reservation API page to enable the API, click Enable. Otherwise, proceed to the next step.
- On the Reservations page, click Buy Slots.
- On the Buy Slots page, set the options as follows:
  - In Commitment duration, choose Flex.
  - In Location, choose the location you selected when creating the dataset in the Process the sample data procedure.
  - In Number of slots, choose 500.
  - Click Next.
  - In Purchase confirmation, type `CONFIRM`.
- Click Purchase.
- Click View Slot Commitments.
- Allow up to 20 minutes for the capacity to be provisioned. After the capacity is provisioned, the slot commitment status turns green and shows a checkmark.
- Click Create Reservation.
- On the Create Reservation page, set the options as follows:
  - In Reservation name, type `model`.
  - In Location, choose the location in which you purchased the flex slots.
  - In Number of slots, type `500`.
  - Click Save. This returns you to the Reservations page.
- Select the Assignments tab.
- In Select an organization, folder, or project, click Browse.
- Type the name of the project in which you are completing this tutorial.
- Click Select.
- In Reservation, choose the model reservation you created.
- Click Create.
- Click BigQuery to return to the BigQuery console.
Create, train, and deploy the model
Run the CREATE MODEL SQL statement to create, train, and deploy the matrix factorization model:
CREATE OR REPLACE MODEL bqml.retail_recommender
OPTIONS(model_type='matrix_factorization',
user_col='userId',
item_col='itemId',
rating_col='session_duration',
feedback_type='implicit'
)
AS
SELECT * FROM bqml.aggregate_web_stats;
After training completes, the trained model is deployed as `bqml.retail_recommender`.
Use the trained model to make predictions
Use the `ML.RECOMMEND` SQL function to get predictions from the deployed `bqml.retail_recommender` model.
To see an example of the recommendations data, run the following SQL statement to get predictions that represent the top 5 recommendations for a specified `userId`:

```sql
DECLARE MY_USERID STRING DEFAULT "0824461277962362623-1";

SELECT *
FROM ML.RECOMMEND(MODEL `bqml.retail_recommender`,
    (SELECT MY_USERID AS userId))
ORDER BY predicted_session_duration_confidence DESC
LIMIT 5;
```
You should see results similar to the following:
Run the following SQL statement to get the top 5 predictions for all users. This generates a large number of rows, so this output is written to a table and then the first ten records are retrieved so you can see an example of the data.
```sql
-- Create output table of top 5 predictions
CREATE OR REPLACE TABLE bqml.prod_recommendations AS (
  WITH predictions AS (
    SELECT
      userId,
      ARRAY_AGG(STRUCT(itemId, predicted_session_duration_confidence)
          ORDER BY predicted_session_duration_confidence DESC
          LIMIT 5) AS recommended
    FROM ML.RECOMMEND(MODEL bqml.retail_recommender)
    GROUP BY userId
  )
  SELECT
    userId,
    itemId,
    predicted_session_duration_confidence
  FROM predictions p, UNNEST(recommended)
);

-- Show table
SELECT *
FROM bqml.prod_recommendations
ORDER BY userId
LIMIT 10;
```
You should see results similar to the following:
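The top-N-per-user selection that the `ARRAY_AGG(... LIMIT 5)` clause performs can also be sketched in pandas. The scores below are hypothetical, and `TOP_N` is 2 rather than the tutorial's 5 only to keep the example readable:

```python
# Sketch: keep each user's top-N recommendations by confidence,
# the pandas equivalent of ARRAY_AGG(... ORDER BY ... DESC LIMIT N)
# followed by UNNEST.
import pandas as pd

preds = pd.DataFrame({
    "userId": ["u1"] * 4 + ["u2"] * 3,
    "itemId": ["A", "B", "C", "D", "A", "B", "C"],
    "predicted_session_duration_confidence":
        [0.9, 0.4, 0.7, 0.1, 0.2, 0.8, 0.5],
})

TOP_N = 2
top = (
    preds.sort_values("predicted_session_duration_confidence",
                      ascending=False)
         .groupby("userId")
         .head(TOP_N)                       # first N rows per user
         .sort_values(["userId", "predicted_session_duration_confidence"],
                      ascending=[True, False])
)
print(top)
```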
Use the predicted recommendations in production
After you have the recommendations, how you make them available to your production pipeline depends on your use case. The following sections describe how to export prediction data to Analytics 360 or Cloud Storage, or to programmatically read data from BigQuery into a pandas DataFrame.
Export recommendations to Analytics 360
If you want to export data to Analytics 360, we recommend that you provide a column for each product that scores the likelihood of the client purchasing that product, similar to the following:

| clientId | likelihoodProductA | likelihoodProductB |
|---|---|---|
| 123 | .6527238 | .3464891 |
| 456 | .8720673 | .2750274 |
| 789 | .5620734 | .9127595 |
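The reshaping this requires (one wide row per client, one score column per product) can be sketched in pandas on hypothetical data before you build it in SQL:

```python
# Sketch: pivot a long (user, item, score) table into one wide row per
# user with a score column per item, mirroring what the SQL pivot()
# procedure below produces.
import pandas as pd

long = pd.DataFrame({
    "userId": ["123", "123", "456", "456"],
    "itemId": ["ProductA", "ProductB", "ProductA", "ProductB"],
    "predicted_session_duration_confidence": [0.65, 0.34, 0.87, 0.27],
})

wide = (
    long.pivot_table(
        index="userId",
        columns="itemId",
        values="predicted_session_duration_confidence",
        aggfunc="mean",   # matches the 'AVG' aggregation passed to pivot()
    )
    .add_prefix("likelihood")
    .reset_index()
)
print(wide)
```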
To create a "likelihood to buy" column per product, create a `pivot()` procedure as described in Easy pivot() in BigQuery, in one step.

Run the following SQL statement to create the `pivot` procedure:

```sql
CREATE OR REPLACE FUNCTION `bqml.normalize_col_name`(col_name STRING) AS (
  REGEXP_REPLACE(col_name, r'[/+#|]', '_')
);

CREATE OR REPLACE PROCEDURE `bqml.pivot`(
  table_name STRING
  , destination_table STRING
  , row_ids ARRAY<STRING>
  , pivot_col_name STRING
  , pivot_col_value STRING
  , max_columns INT64
  , aggregation STRING
  , optional_limit STRING
)
BEGIN
  DECLARE pivotter STRING;

  EXECUTE IMMEDIATE (
    "SELECT STRING_AGG(' "||aggregation
    ||"""(IF('||@pivot_col_name||'="'||x.value||'", '||@pivot_col_value||', null)) e_'||bqml.normalize_col_name(x.value))
    FROM UNNEST((
      SELECT APPROX_TOP_COUNT("""||pivot_col_name||", @max_columns)
      FROM `"||table_name||"`)) x"
  ) INTO pivotter
  USING pivot_col_name AS pivot_col_name,
        pivot_col_value AS pivot_col_value,
        max_columns AS max_columns;

  EXECUTE IMMEDIATE (
    'CREATE OR REPLACE TABLE `'||destination_table
    ||'` AS SELECT '
    ||(SELECT STRING_AGG(x) FROM UNNEST(row_ids) x)
    ||', '||pivotter
    ||' FROM `'||table_name||'` GROUP BY '
    ||(SELECT STRING_AGG(''||(i+1)) FROM UNNEST(row_ids) WITH OFFSET i)
    ||' ORDER BY '
    ||(SELECT STRING_AGG(''||(i+1)) FROM UNNEST(row_ids) WITH OFFSET i)
    ||' '||optional_limit
  );
END;
```
Run the following SQL statement to create a table containing the `clientId` and a "likelihood to buy" column for each product:

```sql
CALL bqml.pivot(
  'bqml.prod_recommendations'                # source table
  , 'bqml.prod_recommendations_export'       # destination table
  , ['userId']                               # row IDs
  , 'itemId'                                 # pivot column name
  , 'predicted_session_duration_confidence'  # pivot column value
  , 30                                       # max number of columns
  , 'AVG'                                    # aggregation
  , ''                                       # optional_limit
);
```
Run the following SQL statement to see a sample of the data in the resulting `bqml.prod_recommendations_export` table:

```sql
SELECT *
FROM bqml.prod_recommendations_export
ORDER BY userId
LIMIT 10;
```
You should see results similar to the following:
After you have the data in the right format, save it as a CSV file and then use Data Import to import the data into Analytics 360. Note that the column names in your exported recommendations data must map to the Analytics 360 data import schema. For example, if the data import schema is `ga:clientId, ga:dimension1, ga:dimension2`, then the column names in your data should be `ga:clientId, ga:dimension1, ga:dimension2`. BigQuery doesn't allow the use of colons in column names, so you must update the column names in the exported CSV file prior to importing it.
If you want, you can use the MoDeM (Model Deployment for Marketing) reference implementation for BigQuery ML models to make loading data into Analytics 360 easier. Use the interactive instructions in the BQML Deployment Template notebook to get started.
Export recommendations to Cloud Storage
Export the recommendations data from the BigQuery table into Cloud Storage by following the instructions at Exporting table data.
Read recommendations programmatically
Read the recommendations data from the BigQuery table into a pandas DataFrame by using the BigQuery Storage API, following the instructions at Query notebook data using the %%bigquery magic command. Alternatively, you can use one of the BigQuery client libraries to build your own solution.
Summary
You have completed the tutorial, and now know how to train your recommender system using BigQuery ML, deploy your model, and use the results in production.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project containing the resources, or keep the project but delete just those resources.
Either way, you should remove those resources so you won't be billed for them in the future. The following sections describe how to delete these resources.
Delete the project
The easiest way to eliminate billing is to delete the project you created for the tutorial.
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the components
If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.
Delete the BigQuery dataset
- Open the BigQuery console.
- In the Resources section, expand the project in which you are completing this tutorial and select the bqml dataset.
- Click Delete dataset in the header of the dataset pane.
- In the overlay window that appears, type bqml and then click Delete.
Delete flex slots
If you created flex slots, follow these steps to delete them:
- In the BigQuery console, click Reservations.
- Select the Assignments tab.
- Locate the row of the assignment you created for the model reservation, click More in the Actions column, then click Delete.
- Select the Reservations tab.
- Locate the row of the model reservation, click More in the Actions column, then click Delete.
- Select the Slot Commitments tab.
- Locate the row containing the 500 flex slots you purchased, click More in the Actions column, then click Delete.
- In Confirm slot commitment removal, type `REMOVE`.
- Click Proceed.
What's next
- Learn about other predictive forecasting solutions.
- Learn more about BigQuery ML.
- Learn more about Analytics 360.