Building an e-commerce recommendation system by using BigQuery ML

Learn how to build a recommendation system by using BigQuery ML to generate product or service recommendations from customer data in BigQuery. Then, learn how to make that data available to other production systems by exporting it to Google Analytics 360 or Cloud Storage, or programmatically reading it from the BigQuery table.

This is the recommended approach when your data already resides in BigQuery. BigQuery ML lets you create a machine learning model, train it on customer data in BigQuery, and deploy it, all by using standard SQL queries. You avoid having to export your data to another product or build a model training and deployment pipeline, and BigQuery automatically scales to provide the compute resources you need.

The machine learning model you create in this tutorial uses matrix factorization, a common and effective method of creating a recommendation system based on user preference data. For more information about this approach, see matrix factorization.

This tutorial uses the Google Analytics Sample dataset, which is hosted publicly on BigQuery. This dataset provides 12 months (August 2016 to August 2017) of obfuscated Analytics 360 data from the Google Merchandise Store, a real e-commerce store that sells Google-branded merchandise.

Objectives

  • Process sample data into a format suitable for training a matrix factorization model.
  • Create, train, and deploy a matrix factorization model.
  • Get predictions from the deployed model about what products your customers are most likely to be interested in.
  • Export prediction data from BigQuery to one or more other products for use in making recommendations to customers.

Costs

This tutorial uses the following billable components of Google Cloud:

  • BigQuery
  • BigQuery ML

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  2. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Process the sample data

When using matrix factorization, you evaluate explicit or implicit user feedback to determine customer preferences. To use explicit feedback, the dataset must contain data about user product preferences, such as star ratings between 1 and 5. Where explicit feedback isn't available, you must infer customer preferences from other behavioral metrics, such as the total time a user spends on a product detail page. That is the approach used in this tutorial.

To train the matrix factorization model, you need a table with columns that identify the customer, the item being rated, and the implicit rating. In this section, you create such a table with the columns userId, itemId, and session_duration, where the session_duration column contains the duration of the user's session on the given item's product page.

Follow these steps to create such a table using data from the Google Analytics Sample dataset:

  1. Open the Google Analytics Sample dataset in Google Cloud Marketplace and click View Dataset. This opens the BigQuery console with the Google Analytics Sample dataset selected.
  2. In the Resources section, select the project in which you are completing this tutorial.
  3. Click Create dataset.

    Show location of Create Dataset button.

  4. On the Create dataset page:

    • For Dataset ID, type bqml.
    • For Data location, choose whatever location is closest to you.
    • Click Create dataset.
  5. Follow this step if you plan to export the data to Analytics 360; otherwise, skip it.

    In the Query editor pane, run the following SQL statement to copy a subset of the Google Analytics Sample data into a new table and populate the clientId field, which you will use as the key to aggregate implicit user feedback in the next step:

    CREATE OR REPLACE TABLE bqml.ga_clientid_sample AS (
      SELECT *
      FROM `bigquery-public-data.google_analytics_sample.ga_sessions_2017*`
      LIMIT 100000);
    
     UPDATE bqml.ga_clientid_sample
       SET clientId = fullvisitorId
       WHERE true;
    

    You must use clientId as the key field when importing audience data into Analytics 360. clientId is normally a hashed version of fullVisitorId, but it isn't populated in the Google Analytics Sample dataset. To populate clientId in your own Analytics 360 data, you would create a custom dimension and populate it.

  6. Run the following SQL statement to create a table containing the training data. Run the version that uses the clientId field as the key if you plan to export the data to Analytics 360. Run the version that uses the fullVisitorId field as the key if you plan to use the data with other marketing systems.

    clientId

    CREATE OR REPLACE TABLE bqml.aggregate_web_stats AS (
      WITH
        durations AS (
          --calculate pageview durations
          SELECT
            CONCAT(clientId,'-',
                 CAST(visitNumber AS STRING),'-',
                 CAST(hitNumber AS STRING) ) AS visitorId_session_hit,
            LEAD(time, 1) OVER (
              PARTITION BY CONCAT(clientId,'-',CAST(visitNumber AS STRING))
              ORDER BY
              time ASC ) - time AS pageview_duration
          FROM
            `bqml.ga_clientid_sample`,
            UNNEST(hits) AS hit
        ),
        prodview_durations AS (
          --filter for product detail pages only
         SELECT
            CONCAT(clientId,'-',CAST(visitNumber AS STRING)) AS userId,
            productSKU AS itemId,
            IFNULL(dur.pageview_duration,
             1) AS pageview_duration,
          FROM
            `bqml.ga_clientid_sample` t,
            UNNEST(hits) AS hits,
            UNNEST(hits.product) AS hits_product
          JOIN
            durations dur
          ON
            CONCAT(clientId,'-',
                   CAST(visitNumber AS STRING),'-',
                   CAST(hitNumber AS STRING)) = dur.visitorId_session_hit
          WHERE
          eCommerceAction.action_type = "2"
        ),
        aggregate_web_stats AS(
          --sum pageview durations by userId, itemId
          SELECT
            userId,
            itemId,
            SUM(pageview_duration) AS session_duration
          FROM
            prodview_durations
          GROUP BY
            userId,
            itemId )
        SELECT
         *
       FROM
          aggregate_web_stats
    );
    

    fullVisitorId

    CREATE OR REPLACE TABLE bqml.aggregate_web_stats AS (
      WITH
        durations AS (
          --calculate pageview durations
          SELECT
            CONCAT(fullVisitorId,'-',
                 CAST(visitNumber AS STRING),'-',
                 CAST(hitNumber AS STRING) ) AS visitorId_session_hit,
            LEAD(time, 1) OVER (
              PARTITION BY CONCAT(fullVisitorId,'-',CAST(visitNumber AS STRING))
              ORDER BY
              time ASC ) - time AS pageview_duration
          FROM
            `bigquery-public-data.google_analytics_sample.ga_sessions_2017*`,
            UNNEST(hits) AS hit
        ),
        prodview_durations AS (
          --filter for product detail pages only
         SELECT
            CONCAT(fullVisitorId,'-',CAST(visitNumber AS STRING)) AS userId,
            productSKU AS itemId,
            IFNULL(dur.pageview_duration,
             1) AS pageview_duration,
          FROM
            `bigquery-public-data.google_analytics_sample.ga_sessions_2017*` t,
            UNNEST(hits) AS hits,
            UNNEST(hits.product) AS hits_product
          JOIN
            durations dur
          ON
            CONCAT(fullVisitorId,'-',
                   CAST(visitNumber AS STRING),'-',
                   CAST(hitNumber AS STRING)) = dur.visitorId_session_hit
          WHERE
          eCommerceAction.action_type = "2"
        ),
        aggregate_web_stats AS(
          --sum pageview durations by userId, itemId
          SELECT
            userId,
            itemId,
            SUM(pageview_duration) AS session_duration
          FROM
            prodview_durations
          GROUP BY
            userId,
            itemId )
        SELECT
         *
       FROM
          aggregate_web_stats
    );
    
  7. Run the following SQL statement to see a sample of the data in the resulting bqml.aggregate_web_stats table:

    SELECT
     *
    FROM
      bqml.aggregate_web_stats
    LIMIT
      10;
    

    You should see results similar to the following:

    First 10 rows of processed training data.
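The transformation the SQL above performs can be sketched in pandas for a handful of invented hit-level rows: compute each pageview's duration with a LEAD-style shift within the session, keep only product-detail hits, and sum durations per user and item. The column names mirror the query; the sample values are hypothetical.

```python
import pandas as pd

# Invented hit-level rows; "time" is milliseconds since session start and
# action_type "2" marks a product detail view, as in the Analytics schema.
hits = pd.DataFrame({
    "clientId":    ["u1", "u1", "u1", "u2"],
    "visitNumber": [1, 1, 1, 1],
    "hitNumber":   [1, 2, 3, 1],
    "time":        [0, 4000, 9000, 0],
    "productSKU":  ["SKU_A", "SKU_B", "SKU_A", "SKU_B"],
    "action_type": ["2", "2", "1", "2"],
})

hits = hits.sort_values(["clientId", "visitNumber", "time"])

# LEAD(time) OVER (PARTITION BY session ORDER BY time) - time
hits["pageview_duration"] = (
    hits.groupby(["clientId", "visitNumber"])["time"].shift(-1) - hits["time"]
)
hits["pageview_duration"] = hits["pageview_duration"].fillna(1)  # IFNULL(..., 1)

# Filter for product detail pages, then sum durations per (userId, itemId).
prodviews = hits[hits["action_type"] == "2"].copy()
prodviews["userId"] = prodviews["clientId"] + "-" + prodviews["visitNumber"].astype(str)
prodviews = prodviews.rename(columns={"productSKU": "itemId"})

aggregate_web_stats = (
    prodviews.groupby(["userId", "itemId"], as_index=False)["pageview_duration"]
             .sum()
             .rename(columns={"pageview_duration": "session_duration"})
)
print(aggregate_web_stats)
```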

Purchase flex slots

If you use on-demand pricing for BigQuery, you must purchase flex slots and then create reservations and assignments for them in order to train a matrix factorization model. You can skip this section if you use flat-rate pricing with BigQuery.

You must have the bigquery.reservations.create permission to purchase flex slots. This permission is granted to the project owner, and also to the bigquery.admin and bigquery.resourceAdmin predefined Identity and Access Management (IAM) roles.

  1. In the BigQuery console, click Reservations.
  2. If you are redirected to the BigQuery Reservation API page to enable the API, click Enable. Otherwise, proceed to the next step.
  3. On the Reservations page, click Buy Slots.
  4. On the Buy Slots page, set the options as follows:

    1. In Commitment duration, choose Flex.
    2. In Location, choose whatever location you selected when creating the dataset during the Process the sample data procedure.
    3. In Number of slots, choose 500.
    4. Click Next.
    5. In Purchase confirmation, type CONFIRM.

  5. Click Purchase.

  6. Click View Slot Commitments.

  7. Allow up to 20 minutes for the capacity to be provisioned. After the capacity is provisioned, the slot commitment status turns green and shows a checkmark.

  8. Click Create Reservation.

  9. On the Create Reservation page, set the options as follows:

    1. In Reservation name, type model.
    2. In Location, choose whatever location you purchased the flex slots in.
    3. In Number of slots, type 500.
    4. Click Save. This returns you to the Reservations page.
  10. Select the Assignments tab.

  11. In Select an organization, folder, or project, click Browse.

  12. Type the name of the project in which you are completing this tutorial.

  13. Click Select.

  14. In Reservation, choose the model reservation you created.

  15. Click Create.

  16. Click BigQuery to return to the BigQuery console.

Create, train, and deploy the model

Run the CREATE MODEL SQL statement to create, train, and deploy the matrix factorization model:

      CREATE OR REPLACE MODEL bqml.retail_recommender
      OPTIONS(model_type='matrix_factorization',
            user_col='userId',
            item_col='itemId',
            rating_col='session_duration',
            feedback_type='implicit'
            )
      AS
      SELECT * FROM bqml.aggregate_web_stats;

After training completes, the trained model is deployed as bqml.retail_recommender.
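To build intuition for what training does, here is a toy NumPy sketch of matrix factorization with implicit feedback: alternating least-squares updates of user and item factors, with each observed interaction weighted by a confidence derived from the implicit rating. This is only an illustration of the technique, not BigQuery ML's actual implementation, and the duration matrix is invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented implicit-feedback matrix: session durations per user (rows) and
# item (columns); 0 means the user never viewed the item's page.
durations = np.array([[120.,   0.,  30.],
                      [  0.,  60.,   0.],
                      [ 90.,   0., 200.]])

k, lam, alpha = 2, 0.1, 0.01
P = (durations > 0).astype(float)   # binary preference: viewed or not
C = 1.0 + alpha * durations         # confidence grows with session duration
U = rng.normal(scale=0.1, size=(durations.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(durations.shape[1], k))  # item factors

def solve_factors(X, P, C, lam):
    """Weighted ridge-regression update of one factor matrix, the other fixed."""
    out = np.empty((P.shape[0], X.shape[1]))
    for i in range(P.shape[0]):
        Cw = np.diag(C[i])
        A = X.T @ Cw @ X + lam * np.eye(X.shape[1])
        out[i] = np.linalg.solve(A, X.T @ Cw @ P[i])
    return out

for _ in range(10):                  # alternating least squares
    U = solve_factors(V, P, C, lam)
    V = solve_factors(U, P.T, C.T, lam)

scores = U @ V.T                     # higher score = stronger predicted preference
```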

Use the trained model to make predictions

Use the ML.RECOMMEND SQL function to get predictions from the deployed bqml.retail_recommender model.

  1. To see an example of the recommendations data, run the following SQL statement to get predictions that represent the top 5 recommendations for a specified userId:

    DECLARE MY_USERID STRING DEFAULT "0824461277962362623-1";
    
    SELECT
      *
    FROM
      ML.RECOMMEND(MODEL `bqml.retail_recommender`,
      (SELECT MY_USERID AS userId)
                  )
    ORDER BY predicted_session_duration_confidence DESC
    LIMIT 5;
    

    You should see results similar to the following:

    Top 5 recommendations for a given user ID.

  2. Run the following SQL statement to get the top 5 predictions for all users. This generates a large number of rows, so this output is written to a table and then the first ten records are retrieved so you can see an example of the data.

    -- Create output table of top 5 predictions
    CREATE OR REPLACE TABLE bqml.prod_recommendations AS (
    WITH predictions AS (
        SELECT
          userId,
          ARRAY_AGG(STRUCT(itemId,
                           predicted_session_duration_confidence)
                    ORDER BY
                      predicted_session_duration_confidence DESC
                    LIMIT 5) as recommended
        FROM ML.RECOMMEND(MODEL bqml.retail_recommender)
        GROUP BY userId
    )
    
    SELECT
      userId,
      itemId,
      predicted_session_duration_confidence
    FROM
      predictions p,
      UNNEST(recommended)
    );
    
    -- Show table
    SELECT
     *
    FROM
      bqml.prod_recommendations
    ORDER BY
      userId
    LIMIT
      10;
    

    You should see results similar to the following:

    First 10 recommendations for all users.
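The ARRAY_AGG ... LIMIT 5 pattern in the query above has a direct pandas analogue if you prefer to post-process predictions client-side: sort by confidence, then keep each user's top five rows. The sample frame below is invented.

```python
import pandas as pd

# Invented predictions for two users.
preds = pd.DataFrame({
    "userId": ["u1"] * 6 + ["u2"] * 3,
    "itemId": [f"SKU_{i}" for i in range(6)] + ["SKU_0", "SKU_1", "SKU_2"],
    "predicted_session_duration_confidence":
        [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
})

# Sort by confidence, then keep the first five rows of each user's group --
# the pandas counterpart of ARRAY_AGG(... ORDER BY ... DESC LIMIT 5).
top5 = (
    preds.sort_values("predicted_session_duration_confidence", ascending=False)
         .groupby("userId")
         .head(5)
         .sort_values(["userId", "predicted_session_duration_confidence"],
                      ascending=[True, False])
         .reset_index(drop=True)
)
```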

Use the predicted recommendations in production

After you have the recommendations, how you make them available to your production pipeline depends on your use case. The following sections describe how to export prediction data to Analytics 360 or Cloud Storage, or to programmatically read data from BigQuery into a pandas DataFrame.

Export recommendations to Analytics 360

If you want to export data to Analytics 360, we recommend that you provide a column for each product that scores the likelihood of the client purchasing that product, similar to the following:

clientId    likelihoodProductA    likelihoodProductB
123         .6527238              .3464891
456         .8720673              .2750274
789         .5620734              .9127595

To create a "likelihood to buy" column per product, create a pivot() procedure as described in Easy pivot() in BigQuery, in one step.

  1. Run the following SQL statement to create the pivot procedure:

    CREATE OR REPLACE FUNCTION
    `bqml.normalize_col_name`(col_name STRING) AS (
      REGEXP_REPLACE(col_name,r'[/+#|]', '_'
    ));
    
    CREATE OR REPLACE PROCEDURE `bqml.pivot`(
      table_name STRING
      , destination_table STRING
      , row_ids ARRAY<STRING>
      , pivot_col_name STRING
     , pivot_col_value STRING
      , max_columns INT64
      , aggregation STRING
      , optional_limit STRING
      )
    
    BEGIN
    
      DECLARE pivotter STRING;
    
      EXECUTE IMMEDIATE (
       "SELECT STRING_AGG(' "||aggregation
        ||"""(IF('||@pivot_col_name||'="'||x.value||'", '||@pivot_col_value||', null)) e_'||bqml.normalize_col_name(x.value))
       FROM UNNEST((
           SELECT APPROX_TOP_COUNT("""||pivot_col_name||", @max_columns) FROM `"||table_name||"`)) x"
      ) INTO pivotter
      USING pivot_col_name AS pivot_col_name, pivot_col_value AS pivot_col_value, max_columns AS max_columns;
    
      EXECUTE IMMEDIATE (
       'CREATE OR REPLACE TABLE `'||destination_table
       ||'` AS SELECT '
       ||(SELECT STRING_AGG(x) FROM UNNEST(row_ids) x)
       ||', '||pivotter
       ||' FROM `'||table_name||'` GROUP BY '
       || (SELECT STRING_AGG(''||(i+1)) FROM UNNEST(row_ids) WITH OFFSET i)||' ORDER BY '
       || (SELECT STRING_AGG(''||(i+1)) FROM UNNEST(row_ids) WITH OFFSET i)
       ||' '||optional_limit
      );
    
    END;
    
  2. Run the following SQL statement to create a table containing the clientId and a "likelihood to buy" column for each product:

    CALL bqml.pivot(
      'bqml.prod_recommendations' # source table
      , 'bqml.prod_recommendations_export' # destination table
      , ['userId'] # row IDs
      , 'itemId' # pivot column name
      , 'predicted_session_duration_confidence' # pivot column value
      , 30 # max number of columns
      , 'AVG' # aggregation
      , '' # optional_limit
    );
    
  3. Run the following SQL statement to see a sample of the data in the resulting bqml.prod_recommendations_export table:

    SELECT
      *
    FROM
      bqml.prod_recommendations_export
    ORDER BY
      userId
    LIMIT
      10;
    

    You should see results similar to the following:

    First 10 recommendations for all users.
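Conceptually, the pivot procedure reshapes the long (userId, itemId, confidence) table into one wide row per user. The same transform can be sketched in pandas on invented sample data; the mean aggregation here matches the AVG argument passed to bqml.pivot above.

```python
import pandas as pd

# Invented long-format recommendations.
recs = pd.DataFrame({
    "userId": ["123", "123", "456"],
    "itemId": ["ProductA", "ProductB", "ProductA"],
    "predicted_session_duration_confidence": [0.65, 0.35, 0.87],
})

# One row per user, one "likelihood" column per product.
wide = recs.pivot_table(
    index="userId",
    columns="itemId",
    values="predicted_session_duration_confidence",
    aggfunc="mean",                   # same as the AVG passed to bqml.pivot
).add_prefix("likelihood").reset_index()
```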

After you have the data in the right format, save it as a CSV file and then use Data Import to import the data into Analytics 360. Note that the column names in your exported recommendations data must map to the Analytics 360 data import schema. For example, if the data import schema is ga:clientId, ga:dimension1, ga:dimension2 then the column names in your data should be ga:clientId, ga:dimension1, ga:dimension2. BigQuery doesn't allow the use of colons in column names, so you must update the column names in the exported CSV file prior to importing it.
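Because BigQuery can't write the colon-prefixed names directly, one approach is to rewrite the header row of the exported CSV before uploading it. The sketch below uses Python's csv module with a hypothetical column mapping; substitute the names from your own data import schema.

```python
import csv
import io

# Hypothetical mapping from exported column names to the Analytics 360
# data import schema; substitute your own dimension names.
column_map = {
    "userId": "ga:clientId",
    "likelihoodProductA": "ga:dimension1",
    "likelihoodProductB": "ga:dimension2",
}

# Stand-in for the exported CSV file.
exported = io.StringIO("userId,likelihoodProductA,likelihoodProductB\n123,.65,.35\n")

rows = list(csv.reader(exported))
rows[0] = [column_map.get(name, name) for name in rows[0]]  # rewrite header only

fixed = io.StringIO()
csv.writer(fixed).writerows(rows)
header = fixed.getvalue().splitlines()[0]
```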

If you want, you can use the MoDeM (Model Deployment for Marketing) reference implementation for BigQuery ML models to make loading data into Analytics 360 easier. Use the interactive instructions in the BQML Deployment Template notebook to get started.

Export recommendations to Cloud Storage

Export the recommendations data from the BigQuery table into Cloud Storage by following the instructions at Exporting table data.

Read recommendations programmatically

Read the recommendations data from the BigQuery table into a pandas DataFrame by using the BigQuery Storage API, following the instructions at Download table data using the BigQuery client library. Alternatively, you can use one of the BigQuery client libraries to build your own solution.

Summary

You have completed the tutorial, and now know how to train your recommender system using BigQuery ML, deploy your model, and use the results in production.

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project containing the resources, or keep the project but delete just those resources.

Either way, removing these resources ensures that you won't be billed for them in the future. The following sections describe how to delete them.

Delete the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

Delete the components

If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.

Delete the BigQuery dataset

  1. Open the BigQuery console
  2. In the Resources section, expand the project in which you are completing this tutorial and select the bqml dataset.
  3. Click Delete dataset in the header of the dataset pane.
  4. In the overlay window that appears, type bqml and then click Delete.

Delete flex slots

If you created flex slots, follow these steps to delete them:

  1. In the BigQuery console, click Reservations.
  2. Select the Assignments tab.
  3. Locate the row of the assignment you created for the model reservation, click More in the Actions column, then click Delete.
  4. Select the Reservations tab.
  5. Locate the row of the model reservation, click More in the Actions column, then click Delete.
  6. Select the Slot Commitments tab.
  7. Locate the row containing the 500 flex slots you purchased, click More in the Actions column, then click Delete.
  8. In Confirm slot commitment removal, type REMOVE.
  9. Click Proceed.

What's next