Predicting customer propensity to buy by using BigQuery ML and AI Platform

Learn how to build a system to predict customer propensity to purchase by using BigQuery ML and AI Platform.

You can use a propensity to purchase system to predict customers who are most likely to make a purchase, so that you can personalize communications with them. Use online predictions to take real-time action based on user behavior on your website, or batch predictions to inform less time-sensitive communications like email.

This tutorial shows you how to create a logistic regression model to determine whether a customer will make a purchase. Logistic regression is a good fit for this task because it estimates the probability of a binary outcome. The model evaluates metrics that reflect customer behavior on a website, and assigns each customer a propensity-to-purchase probability between 0 and 1 based on this data. The model then sets a label indicating "likely to purchase" for any customer with a probability greater than 0.5.
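
To make this concrete, here is a minimal sketch of how a logistic regression model turns feature values into a probability and then a label. The weights are made-up illustrative values, not coefficients that BigQuery ML would actually learn from this data:

```python
import math

def predict_propensity(bounces, time_on_site, weights=(-1.2, -0.8, 0.0004), threshold=0.5):
    """Sketch of how a logistic regression scores a customer.

    weights = (intercept, w_bounces, w_time_on_site) -- hypothetical values
    for illustration only.
    """
    intercept, w_bounces, w_time = weights
    z = intercept + w_bounces * bounces + w_time * time_on_site
    probability = 1 / (1 + math.exp(-z))        # sigmoid maps z into (0, 1)
    label = 1 if probability > threshold else 0  # "likely to purchase" above 0.5
    return probability, label

# A long, engaged session scores high; a bounced visit scores low.
prob_engaged, label_engaged = predict_propensity(bounces=0, time_on_site=7363)
prob_bounced, label_bounced = predict_propensity(bounces=1, time_on_site=0)
```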

This tutorial uses the Google Analytics Sample and ecommerce datasets, which are hosted publicly on BigQuery, to train the model. These datasets provide 12 months (August 2016 to August 2017) of obfuscated Analytics 360 data from the Google Merchandise Store, a real e-commerce store that sells Google-branded merchandise.

To apply the lessons from this tutorial to a production use case, you could use your own Analytics 360 data, or data from a similar system that gives you access to metrics about customer behavior on your website.

Objectives

The solution involves the following steps:

  • Process sample data into a format suitable for training the model.
  • Create, train, and deploy the model in BigQuery.
  • Evaluate the model to understand its performance.
  • Use the trained model deployed in BigQuery to get batch predictions.
  • Deploy the trained model to AI Platform.
  • Use the trained model deployed in AI Platform to get online predictions.

Costs

This tutorial uses the following billable components of Google Cloud:

  • AI Platform
  • BigQuery
  • BigQuery ML
  • Cloud Storage

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to the project selector page

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the AI Platform Training and Prediction API.

    Enable the API

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Prepare the sample data

To train the model, you will need a table with the following columns:

  • fullVisitorId — Contains the customer ID.
  • bounces — Identifies the number of times that a visitor clicked a search or social ad and started a session on the website, but left without interacting with any other pages. This column is used as one of the model's features.
  • time_on_site — Identifies the total time of the customer's session on the website, expressed in seconds. This column is used as one of the model's features.
  • will_buy_on_return_visit — A label indicating the customer's propensity to purchase. A value of 1 indicates that the customer is likely to purchase; a value of 0 indicates that the customer is not likely to purchase.

Follow these steps to create this table using data from the Google Analytics Sample dataset:

  1. Open the Google Analytics Sample dataset in Google Cloud Marketplace and click View Dataset. This opens the BigQuery console with the Google Analytics Sample dataset selected.
  2. In the Resources section, select the project in which you are completing this tutorial.
  3. Click Create dataset.

    Show location of Create Dataset button.

  4. On the Create dataset page:

    1. For Dataset ID, type bqml.
    2. For Data location, choose whatever location is closest to you.
    3. Click Create dataset.
  5. In the Query editor, run the following SQL statement to create a table containing the training data.

    # select initial features and label to feed into your model
    CREATE OR REPLACE TABLE bqml.propensity_data AS
      SELECT
        fullVisitorId,
        bounces,
        time_on_site,
        will_buy_on_return_visit
      FROM (
            # select features
            SELECT
              fullVisitorId,
              IFNULL(totals.bounces, 0) AS bounces,
              IFNULL(totals.timeOnSite, 0) AS time_on_site
            FROM
              `data-to-insights.ecommerce.web_analytics`
            WHERE
              totals.newVisits = 1
            AND date BETWEEN '20160801' # train on first 9 months of data
            AND '20170430'
           )
      JOIN (
            SELECT
              fullvisitorid,
              IF (
                  COUNTIF (
                           totals.transactions > 0
                           AND totals.newVisits IS NULL
                          ) > 0,
                  1,
                  0
                 ) AS will_buy_on_return_visit
            FROM
              `bigquery-public-data.google_analytics_sample.*`
            GROUP BY
              fullvisitorid
           )
      USING (fullVisitorId)
      ORDER BY time_on_site DESC;
    
  6. Run the following SQL statement to see a sample of the data in the resulting bqml.propensity_data table:

    SELECT
     *
    FROM
      bqml.propensity_data
    LIMIT
      10;
    

    You should see results similar to the following:

    First 10 rows of processed training data.
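
The transformation in step 5 splits each visitor's history in two: the features come from the first visit, and the label records whether a purchase happened on any return visit. Here is a sketch of that same labeling logic in plain Python, using a few made-up session rows (newVisits is 1 on a first visit and None, i.e. SQL NULL, on return visits):

```python
from collections import defaultdict

# Toy session rows standing in for the Google Analytics sample data.
sessions = [
    {"fullVisitorId": "A", "newVisits": 1,    "transactions": None, "bounces": None, "timeOnSite": 120},
    {"fullVisitorId": "A", "newVisits": None, "transactions": 1,    "bounces": None, "timeOnSite": 300},
    {"fullVisitorId": "B", "newVisits": 1,    "transactions": None, "bounces": 1,    "timeOnSite": None},
]

# Label per visitor: 1 if any return visit had a transaction, mirroring
# COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0.
labels = defaultdict(int)
for s in sessions:
    if s["newVisits"] is None and (s["transactions"] or 0) > 0:
        labels[s["fullVisitorId"]] = 1

# Features come from the first visit only, with NULLs coalesced to 0 (IFNULL).
propensity_data = [
    {
        "fullVisitorId": s["fullVisitorId"],
        "bounces": s["bounces"] or 0,
        "time_on_site": s["timeOnSite"] or 0,
        "will_buy_on_return_visit": labels[s["fullVisitorId"]],
    }
    for s in sessions
    if s["newVisits"] == 1
]
```

Visitor A bought on a return visit, so their first-visit row is labeled 1; visitor B bounced and never returned to buy, so their row is labeled 0.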

Create and train the model

Create a model that makes predictions about customer propensity to purchase, based on the customer's online behavior as identified by the bounces and time_on_site data.

Run the CREATE MODEL SQL statement to create, train, and deploy the logistic regression model:

CREATE OR REPLACE MODEL `bqml.rpm_bqml_model`
OPTIONS(MODEL_TYPE = 'logistic_reg',
        labels = [ 'will_buy_on_return_visit' ]
        )
AS
SELECT * EXCEPT (fullVisitorId)
FROM `bqml.propensity_data`;

Training might take several minutes. After training completes, the trained model is deployed in BigQuery as bqml.rpm_bqml_model.

Evaluate the model

Get a sense for the model's performance by looking at its Area Under the ROC Curve (AUC) metric. AUC values range from 0 to 1: a value of 0.5 indicates that the model ranks customers no better than random guessing, and a value of 1 indicates that all predictions are correct.

In a tutorial like this with a non-optimized model, you are likely to get good but not great results. For a production use case, you would want to optimize the model by using hyperparameter tuning, feature engineering, and other techniques to achieve a score of 0.9 or better.
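
AUC has a useful interpretation: it is the probability that a randomly chosen positive example (a buyer) is scored higher than a randomly chosen negative one. A pure-Python sketch of that computation, with made-up labels and scores:

```python
from itertools import product

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. Pure-Python sketch; fine for small sets.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(positives, negatives))
    return wins / (len(positives) * len(negatives))

# A model that ranks every buyer above every non-buyer scores a perfect 1.0;
# 0.5 would mean the ranking is no better than chance.
perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])  # 1.0
fair    = roc_auc([0, 1, 0, 1], [0.4, 0.3, 0.2, 0.9])  # 0.75
```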

Run the ML.EVALUATE SQL statement to evaluate the model:

SELECT
  roc_auc,
  # evaluating the auc value based on the scale at http://gim.unmc.edu/dxtests/roc3.htm
  CASE WHEN roc_auc >.9 THEN 'excellent' WHEN roc_auc >.8 THEN 'good'
  WHEN roc_auc >.7 THEN 'fair' WHEN roc_auc >.6 THEN 'poor' ELSE 'fail' END
  AS modelquality
FROM ML.EVALUATE(MODEL `bqml.rpm_bqml_model`);

You should see results similar to the following:

Results of the model evaluation query.

Use the trained model to get batch predictions

Use batch predictions to get data for an email campaign or other non-real-time scenario.

Run the ML.PREDICT SQL statement to get batch predictions for the sample data in the bqml.propensity_data table:

# predict the inputs (rows) from the input table
SELECT
  fullVisitorId,
  predicted_will_buy_on_return_visit
FROM ML.PREDICT(MODEL bqml.rpm_bqml_model,
(
   SELECT
   fullVisitorId,
   bounces,
   time_on_site
   from bqml.propensity_data
))

You should see results similar to the following:

First 10 rows of batch prediction results.

The ML.PREDICT statement above uses the default threshold value of 0.5. The threshold value determines the cutoff between the two possible label values, with predictions above the threshold value getting labeled as positive. For a production model, you would want to consider adjusting the threshold value in conjunction with other techniques like hyperparameter tuning to optimize the model.
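
To see how the threshold reshapes predictions, here is a minimal sketch (with made-up probabilities) of raising the cutoff from the default 0.5 to a stricter 0.8:

```python
def apply_threshold(probabilities, threshold=0.5):
    """Convert predicted probabilities into 0/1 labels at a given cutoff."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities for four customers.
probs = [0.92, 0.66, 0.51, 0.34]

default_labels = apply_threshold(probs)                 # 0.5 cutoff: [1, 1, 1, 0]
strict_labels  = apply_threshold(probs, threshold=0.8)  # fewer, higher-confidence positives
```

Raising the threshold trades recall for precision: fewer customers are labeled as likely to purchase, but those who are carry higher predicted probabilities.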

Use the trained model to get online predictions

Use online predictions to take real-time action based on user behavior on your website. To get online predictions, you must export the model from BigQuery to Cloud Storage, and then deploy it to AI Platform. You can then send prediction requests to the deployed model.

Follow these steps to export and deploy the model:

  1. Activate Cloud Shell
  2. In Cloud Shell, run the command below to create a Cloud Storage bucket to store the exported model. Replace myProject with the ID of the project you are using to complete this tutorial.

    gsutil mb 'gs://myProject-bucket'
    
  3. Export the model to a Cloud Storage bucket, replacing myBucket with the name of the bucket you created in Step 2:

     bq extract -m bqml.rpm_bqml_model gs://myBucket/V_1
    
  4. Create the model in AI Platform:

    gcloud ai-platform models create rpm_bqml_model
    
  5. Deploy the exported model to AI Platform, replacing myBucket with the name of the bucket you created in Step 2:

    gcloud ai-platform versions create --model=rpm_bqml_model V_1 --framework=tensorflow --python-version=3.7 --runtime-version=1.15 --origin=gs://myBucket/V_1/ --staging-bucket=gs://myBucket
    
  6. Create an input.json file containing example customer data:

    echo "{\"bounces\": 0, \"time_on_site\": 7363}" > input.json
    
  7. Request the predictions:

    gcloud ai-platform predict --model rpm_bqml_model --version V_1 --json-instances input.json
    

    This command produces output similar to the following:

    PREDICTED_WILL_BUY_ON_RETURN_VISIT  WILL_BUY_ON_RETURN_VISIT_PROBS            WILL_BUY_ON_RETURN_VISIT_VALUES
    ['1']                               [0.661069205638202, 0.33893079436179796]  ['1', '0']
    

    The model returns 1 for PREDICTED_WILL_BUY_ON_RETURN_VISIT, predicting that this example customer is likely to purchase, with a probability of about 66%.

    You can use the probability value of a customer with a positive prediction to decide whether to provide a coupon or other incentive. For example, you could assume that customers with a probability greater than 80% are reasonably likely to purchase without an incentive, so you might want to focus incentives on customers with probabilities between 50% and 80%.
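
The targeting idea above can be sketched as a simple banding rule. The 50% and 80% boundaries are the assumed example values from the discussion, not recommendations:

```python
def incentive_tier(probability):
    """Illustrative targeting rule using the assumed 50%/80% bands."""
    if probability > 0.8:
        return "no incentive needed"   # likely to purchase without a coupon
    if probability > 0.5:
        return "send coupon"           # positive prediction worth nudging
    return "not predicted to purchase"

# Hypothetical probabilities returned for three customers.
tiers = {p: incentive_tier(p) for p in (0.95, 0.66, 0.3)}
```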

Optional next steps

Integrate the solution with a CRM system

You can optionally expand on this solution by integrating propensity to purchase predictions with Customer Relationship Management (CRM) system data like email addresses to make customer outreach easier. For an example of this type of integration, see Salesforce Marketing Cloud audience integration. This article describes how to integrate Google Analytics 360 with Salesforce Marketing Cloud so you can use Analytics 360 audiences in Salesforce email and SMS direct-marketing campaigns.

Automate the workflow with a pipeline

You can optionally automate the workflow shown in this tutorial—preparing the data, creating and training the model, and requesting predictions—by creating an AI Platform pipeline. Use the bqml_kfp_retail_propensity_to_purchase notebook for step-by-step instructions on automating a propensity to purchase machine learning workflow.

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project containing the resources, or keep the project but delete just those resources.

Either way, you should remove those resources so you won't be billed for them in the future. The following sections describe how to delete these resources.

Delete the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

  1. In the Cloud Console, go to the Manage resources page.

    Go to the Manage resources page

  2. In the project list, select the project that you want to delete and then click Delete.
  3. In the dialog, type the project ID and then click Shut down to delete the project.

Delete the components

If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.

Delete the BigQuery dataset

  1. Open the BigQuery console
  2. In the Resources section, expand the project in which you are completing this tutorial and select the bqml dataset.
  3. Click Delete dataset in the header of the dataset pane.
  4. In the overlay window that appears, type bqml and then click Delete.

Delete the AI Platform model

  1. Open the AI Platform Models page
  2. In the models list, click rpm_bqml_model.
  3. On the Model Details page, select the checkbox for the V_1 (default) version.
  4. Click More, then click Delete.
  5. When the version has finished deleting, click Back to return to the models list.
  6. Select the checkbox for the rpm_bqml_model model.
  7. Click More, then click Delete.

Delete the Cloud Storage bucket

  1. Open the Cloud Storage browser
  2. Select the checkbox of the <myProject>-bucket bucket.
  3. Click Delete.
  4. In the overlay window that appears, type DELETE and then click Confirm.

What's next