Learn how to predict customer lifetime value (CLV), and then use this data to identify potential new customers by building similar audiences in Google Ads.
Overview
This tutorial shows you how to predict the likely monetary value per customer for a specified timeframe, based on the customer's transaction history. You accomplish this through the following steps:
- Importing transaction and customer data into BigQuery.
- Processing the data for use with a machine learning (ML) model.
- Building the ML model by using BigQuery ML.
- Getting predictions from the model to identify top customers by CLV.
- Retrieving email addresses for these customers from a customer data table.
- Using this combined data to create a remarketing list for use in Google Ads.
- Using this remarketing list as the basis for generating a similar audience for use in ad targeting.
This tutorial walks you through creating a Vertex AI Workbench user-managed notebooks instance to accomplish these objectives, with prescriptive guidance and step-by-step procedures. If you just want a completed notebook to get started with instead, you can use bqml_automl_ltv_activate_lookalike.ipynb on the notebook server of your choice.
This tutorial is intended for data engineers, data scientists, and data analysts who build ML datasets and models to support business decisions. It assumes that you are familiar with the following:
- Basic machine learning concepts and Python
- Standard SQL
This tutorial refactors the example used in Predicting Customer Lifetime Value with AutoML Tables, which is the fourth and last part of another tutorial series on predicting CLV. Consider reading the first tutorial in that series, Predicting Customer Lifetime Value with AI Platform: introduction, if you want to better understand key decision points specific to doing feature engineering and model development for CLV prediction.
The example in this tutorial differs from the one in Predicting Customer Lifetime Value with AutoML Tables in the following ways:
- It creates multiple records per customer to use in training the model. It does this by segmenting customer transactions into overlapping windows of time.
- It enables you to use the model to predict CLV for different periods of time, for example one month or one quarter, by simply changing model input parameters.
- It minimizes development time by using AutoML directly from BigQuery ML.
- It uses different datasets for the sales and customer data.
- It shows how to use the CLV predictions to create similar audiences in Google Ads.
Datasets
To create a list of high-CLV customers that you can build a similar audience from, you need two sets of data:
- Sales transactions for the customers you want to evaluate. You use computations based on this data to train the CLV model. In a production use case, you would pull this information from your order processing system.
- Customer email addresses. In a production use case, you would pull this information from your customer relationship management (CRM) system.
This tutorial provides sample data in two CSV files that you load into BigQuery tables. Both of these datasets contain a customer ID, which you use to join tables to associate the customer email address with their CLV information.
Model
This tutorial shows you how to create a regression model to use for predicting the future monetary value of your customers. This type of model is a good fit because CLV is a continuous value (also called a continuous variable).
Specifically, this tutorial uses BigQuery ML to create an AutoML regression model. Using AutoML helps minimize development overhead in the following ways:
- Feature engineering — AutoML automatically handles several common feature engineering tasks.
- BigQuery ML integration — You can create AutoML models by using BigQuery ML SQL statements, which limits the need to switch back and forth between different development contexts.
- Splitting the training dataset — When training a model, it is best practice to split your data into three non-overlapping datasets: training, validation, and test. AutoML handles this split by default, although you can also define a custom split if you want to.
Calculating features and labels
In this tutorial, you use threshold dates to segment customer transactions over time into multiple overlapping windows. A threshold date separates input transactions from target transactions. Input transactions occur before the threshold date, and are used to calculate the features that you input to the model. Target transactions occur after the threshold date and are used to compute the CLV for that time period, which is used as the label you are training the model to predict.
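To make the windowing concrete, here is a minimal pandas sketch (not part of the tutorial notebook; the sample values and variable names are illustrative) of how a single threshold date splits transactions into inputs and targets:

import pandas as pd

# Illustrative transactions; in the tutorial this data lives in clv.sales.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "transaction_date": pd.to_datetime(
        ["2015-01-05", "2015-03-20", "2015-04-15", "2015-02-10", "2015-05-02"]),
    "value": [20.0, 35.5, 12.0, 80.0, 44.0],
})

threshold_date = pd.Timestamp("2015-04-01")

# Input transactions: everything up to the threshold, used to compute features.
inputs = transactions[transactions["transaction_date"] <= threshold_date]

# Target transactions: everything after the threshold, summed per customer to
# produce the label (the CLV for the target window).
labels = (transactions[transactions["transaction_date"] > threshold_date]
          .groupby("customer_id")["value"].sum()
          .rename("target_monetary"))
print(inputs)
print(labels)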
RFM features
Three important features that are used in CLV models are the recency, frequency, and monetary (RFM) values, which you calculate from customer transaction history data:
- Recency: When was the customer's last order?
- Frequency: How often do they buy?
- Monetary: What amount do they spend?
The following diagram shows a succession of past sales for a set of four customers.
The diagram illustrates the RFM values for the customers, showing for each customer:
- Recency: The time between the last purchase and today, represented by the distance between the rightmost circle and the vertical dotted line that's labeled Now.
- Frequency: The time between purchases, represented by the distance between the circles on a single line.
- Monetary: The amount of money spent on each purchase, represented by the size of the circle.
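As a quick illustration of how these values can be derived from transaction history, here is a minimal pandas sketch (the column names and the reference date are assumptions for this example, not the tutorial's SQL):

import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "transaction_date": pd.to_datetime(
        ["2015-01-05", "2015-03-20", "2015-02-10", "2015-04-01", "2015-05-02"]),
    "value": [20.0, 35.5, 80.0, 15.0, 44.0],
})
now = tx["transaction_date"].max()  # the "Now" line in the diagram

rfm = tx.groupby("customer_id").agg(
    recency=("transaction_date", lambda d: (now - d.max()).days),  # days since last order
    frequency=("transaction_date", "count"),                       # number of orders
    monetary=("value", "sum"),                                      # total spend
)
print(rfm)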
Requirements for Google Ads integration
The example code in the sections of this tutorial on Google Ads integration requires access to a working Google Ads environment to run. To make the example code runnable with your own Google Ads environment, you must have the following items:
- A Google Ads Manager account and access to the Google Ads API. If you don't already have these, follow the instructions in the Sign Up procedure to get them.
- A Google Ads developer token, an OAuth client ID and client secret, and an OAuth refresh token for the Google Ads Python client library. If you don't already have these items, follow the instructions in Set up AdWords API access to get them. For step 5 in that procedure, you can use googleads-python-lib/examples/adwords/authentication/generate_refresh_token.py to see how to get a refresh token.
- Transaction data that you can use to calculate CLV, and customer data that includes the customer email address. Use this in place of the sample data provided in the Import the data section. You will need to make sure your data uses the same table names and schemas as the sample data. The following tables describe the expected BigQuery tables:
Table clv.sales
Column | Data type |
---|---|
customer_id | INTEGER |
order_id | STRING |
transaction_date | DATE |
product_sku | STRING |
qty | INTEGER |
unit_price | FLOAT |
Table clv.customers
Column | Data type |
---|---|
customer_id | INTEGER |
full_name | STRING |
age | INTEGER |
job_title | STRING |
email | STRING |
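If you would rather not rely on schema autodetection when you load your own data, you can create the two tables with explicit schemas by using the BigQuery Python client library. The following sketch assumes that the clv dataset already exists and that you replace myProject with your project ID; it is an optional alternative, not a tutorial step:

from google.cloud import bigquery

client = bigquery.Client(project="myProject")  # replace with your project ID

sales_schema = [
    bigquery.SchemaField("customer_id", "INTEGER"),
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("transaction_date", "DATE"),
    bigquery.SchemaField("product_sku", "STRING"),
    bigquery.SchemaField("qty", "INTEGER"),
    bigquery.SchemaField("unit_price", "FLOAT"),
]
customers_schema = [
    bigquery.SchemaField("customer_id", "INTEGER"),
    bigquery.SchemaField("full_name", "STRING"),
    bigquery.SchemaField("age", "INTEGER"),
    bigquery.SchemaField("job_title", "STRING"),
    bigquery.SchemaField("email", "STRING"),
]

# exists_ok=True makes the calls idempotent if the tables already exist.
client.create_table(bigquery.Table("myProject.clv.sales", schema=sales_schema), exists_ok=True)
client.create_table(bigquery.Table("myProject.clv.customers", schema=customers_schema), exists_ok=True)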
Objectives
- Prepare and explore training data by using BigQuery.
- Build, train, and deploy an ML model by using BigQuery ML.
- Get CLV predictions from the deployed model.
- Create a list of the customers that represent the top 20% of CLV.
- Create a Google Ads remarketing list based on the list of top CLV customers, which you can then use to generate a similar audience to those customers.
Costs
This tutorial uses the following billable components of Google Cloud:
- AI Platform
- BigQuery
- BigQuery ML
- Compute Engine
To generate a cost estimate based on your projected usage,
use the pricing calculator.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
- Enable the Compute Engine, BigQuery, and Notebooks APIs.
Create a notebook
- In the Google Cloud console, go to the Notebooks page.
- Click New notebook, and then choose Python 3.
- For Instance name, type clv.
- Click Create. It takes a few minutes for the notebook instance to be created.
- When the instance is available, click Open JupyterLab.
- In the Notebook section of the JupyterLab Launcher, click Python 3.
Install libraries
Install googleads and other libraries needed for this tutorial:
Copy the following code into the first cell of the notebook:
# Install libraries.
%pip install -q googleads
%pip install -q -U kfp matplotlib Faker --user

# Restart kernel after installs
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Click Run in the menu bar.
Import packages
Import the packages needed for this tutorial by copying the following code into the next empty cell of the notebook and then running it:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os, json, random
import hashlib, uuid
import time, calendar, math
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns
from datetime import datetime
from google.cloud import bigquery
from googleads import adwords
Create a BigQuery client
Create a BigQuery client for the Python client library for BigQuery by copying the following code into the next empty cell of the notebook and then running it. Replace myProject with the ID of the project you are using to complete this tutorial. You can find the project ID in the Project info card on your Google Cloud project dashboard.
bq_client = bigquery.Client(project="myProject")
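Optionally, you can run a trivial query to confirm that the client is authorized and points at the expected project (a quick sanity check, not a required tutorial step):

# Should print the project ID and a single row containing 1.
print(bq_client.project)
print(list(bq_client.query("SELECT 1 AS ok").result()))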
Import the data
Use the instructions in the following sections to import the sample data into BigQuery and then process it for use in training the ML model.
If you have a working Google Ads environment, you can use your
own data instead of importing the sample. Populate the clv.sales
table with
transaction data that you can use to calculate CLV, and populate the
clv.customers
table with customer data that includes the customer email
address.
Create a dataset
Before you can create BigQuery tables for the example data, you must create a dataset to contain them.
Create the clv
dataset by copying the following code into the next empty cell
of the notebook and then running it. Replace myProject with the ID of
the project you are using to complete this tutorial.
PROJECT_ID = "myProject"
! bq mk $PROJECT_ID:clv
Import the sales data
Create and populate the clv.sales
table by copying the following code into
the next empty cell of the notebook and then running it:
! bq load \
--project_id $PROJECT_ID \
--skip_leading_rows 1 \
--max_bad_records 100000 \
--replace \
--field_delimiter "," \
--autodetect \
clv.sales \
gs://solutions-public-assets/analytics-componentized-patterns/ltv/sales_*
Import the customer data
Create and populate the clv.customers
table by copying the following code
into the next empty cell of the notebook and then running it:
! bq load \
--project_id $PROJECT_ID \
--skip_leading_rows 1 \
--max_bad_records 100000 \
--replace \
--field_delimiter "," \
--autodetect \
clv.customers \
gs://solutions-public-assets/analytics-componentized-patterns/ltv/crm.csv
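Optionally, confirm that both loads succeeded by counting the rows in each table (the exact counts depend on the sample files):

for table in ["clv.sales", "clv.customers"]:
    rows = list(bq_client.query(f"SELECT COUNT(*) AS n FROM `{table}`").result())
    print(table, rows[0].n)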
Prepare the data
To prepare the customer transaction data so you can use it to train the ML model, you must complete the following tasks:
- Set parameters that determine the standard deviation allowed for order monetary value and item quantity. These are used to identify and remove outlier records for each customer when performing the initial aggregation of the customer transaction data.
- Aggregate the customer transaction data.
- Check the distribution of the aggregated data across criteria like transactions per customer and number of items per order to identify areas where you might want to further refine the data.
- Define the features to compute from the aggregated data, such as the RFM values.
- Set parameters that define the time windows to use for feature computation. For example, how many days there should be between threshold dates.
- Compute the features needed to train the model.
Set aggregation parameters
Use the following parameters in the aggregation procedure to define the maximum standard deviation values:
- MAX_STDV_MONETARY — The standard deviation of the monetary value per customer.
- MAX_STDV_QTY — The standard deviation of the quantity of products per customer.
Set the parameters that determine the standard deviation values to use by copying the following code into the next empty cell of the notebook and then running it:
AGG_PARAMS = {
'MAX_STDV_MONETARY': 500,
'MAX_STDV_QTY': 100
}
In a production use case, you could modify these parameters to determine what ranges of values are acceptable for order value and item quantity.
Aggregate customer transaction data
The following query aggregates all orders per day per customer. Because the ML task is to predict a monetary value for a period of time like weeks or months, days works well as a time unit to group the transactions per customer.
This query also removes orders that are outliers for each customer. Orders
where the order value or quantity of products fall outside the acceptable
standard deviation values specified by MAX_STDV_MONETARY
and MAX_STDV_QTY
are filtered out.
Create a table with aggregated customer transaction data by copying the following code into the next empty cell of the notebook and then running it:
%%bigquery --params $AGG_PARAMS --project $PROJECT_ID

DECLARE MAX_STDV_MONETARY INT64 DEFAULT @MAX_STDV_MONETARY;
DECLARE MAX_STDV_QTY INT64 DEFAULT @MAX_STDV_QTY;

CREATE OR REPLACE TABLE `clv.aggregation` AS
SELECT
  customer_id,
  order_day,
  ROUND(day_value_after_returns, 2) AS value,
  day_qty_after_returns as qty_articles,
  day_num_returns AS num_returns,
  CEIL(avg_time_to_return) AS time_to_return
FROM (
  SELECT
    customer_id,
    order_day,
    SUM(order_value_after_returns) AS day_value_after_returns,
    STDDEV(SUM(order_value_after_returns)) OVER(PARTITION BY customer_id ORDER BY SUM(order_value_after_returns)) AS stdv_value,
    SUM(order_qty_after_returns) AS day_qty_after_returns,
    STDDEV(SUM(order_qty_after_returns)) OVER(PARTITION BY customer_id ORDER BY SUM(order_qty_after_returns)) AS stdv_qty,
    CASE
      WHEN MIN(order_min_qty) < 0 THEN count(1)
      ELSE 0
    END AS day_num_returns,
    CASE
      WHEN MIN(order_min_qty) < 0 THEN AVG(time_to_return)
      ELSE NULL
    END AS avg_time_to_return
  FROM (
    SELECT
      customer_id,
      order_id,
      -- Gives the order date vs return(s) dates.
      MIN(transaction_date) AS order_day,
      MAX(transaction_date) AS return_final_day,
      DATE_DIFF(MAX(transaction_date), MIN(transaction_date), DAY) AS time_to_return,
      -- Aggregates all products in the order
      -- and all products returned later.
      SUM(qty * unit_price) AS order_value_after_returns,
      SUM(qty) AS order_qty_after_returns,
      -- If negative, order has qty return(s).
      MIN(qty) order_min_qty
    FROM
      `clv.sales`
    GROUP BY
      customer_id,
      order_id)
  GROUP BY
    customer_id,
    order_day)
WHERE
  -- [Optional] Remove dates with outliers per a customer.
  (stdv_value < MAX_STDV_MONETARY OR stdv_value IS NULL) AND
  (stdv_qty < MAX_STDV_QTY OR stdv_qty IS NULL);
See a sample of the data in the resulting clv.aggregation table by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery
SELECT * FROM clv.aggregation LIMIT 5;
You should see results similar to the following:
In a production use case, you would probably want to further refine the data by applying additional selection criteria, such as the following options:
- Remove records that don't have a positive item quantity and purchase amount.
- Remove records that don't have a customer ID.
- Keep only active customers, however you define that. For example, you might want to focus on customers who have purchased something in the past 90 days.
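As an illustration of how such filters could look, the following sketch applies them with the BigQuery client created earlier. The aggregation_refined table name and the 90-day activity cutoff are assumptions for this example, not part of the tutorial:

refine_sql = """
DECLARE LAST_DAY DATE DEFAULT (SELECT MAX(order_day) FROM `clv.aggregation`);

CREATE OR REPLACE TABLE `clv.aggregation_refined` AS
SELECT a.*
FROM `clv.aggregation` a
WHERE
  -- Keep only rows with positive quantities and purchase amounts.
  a.qty_articles > 0
  AND a.value > 0
  -- Remove rows without a customer ID.
  AND a.customer_id IS NOT NULL
  -- Keep only customers with at least one purchase in the last 90 days.
  AND a.customer_id IN (
    SELECT customer_id
    FROM `clv.aggregation`
    WHERE order_day >= DATE_SUB(LAST_DAY, INTERVAL 90 DAY)
  );
"""
bq_client.query(refine_sql).result()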
Check data distribution
This tutorial does minimal data cleansing, instead focusing on doing basic transformation of the transaction data so that it will work with the model. In this section, you check data distribution to determine where there are outliers and identify areas for further data refinement.
Check distribution by date
Create and populate the df_dist_dates dataframe with date data from the aggregated transactions by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery df_dist_dates --project $PROJECT_ID
SELECT
  count(1) c,
  SUBSTR(CAST(order_day AS STRING), 0, 7) as yyyy_mm
FROM `clv.aggregation`
WHERE qty_articles > 0
GROUP BY yyyy_mm
ORDER BY yyyy_mm
Use seaborn to visualize the date data by copying the following code into the next empty cell of the notebook and then running it:
plt.figure(figsize=(12,5))
sns.barplot(x='yyyy_mm', y='c', data=df_dist_dates)
You should see results similar to the following:
Orders are well distributed across the year, although a little lower in the first few months captured in this dataset. In a production use case, you could use data distribution by date as one factor to consider when setting the time window parameters for the feature computation procedure.
Check distribution by transactions per customer
Create and populate the df_dist_customers dataframe with customer transaction counts from the aggregated transactions by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery df_dist_customers --project $PROJECT_ID
SELECT
  customer_id,
  count(1) c
FROM `clv.aggregation`
GROUP BY customer_id
Use seaborn to visualize the customer transaction data by copying the following code into the next empty cell of the notebook and then running it:
plt.figure(figsize=(12,4))
sns.distplot(df_dist_customers['c'], hist_kws=dict(ec="k"), kde=False)
You should see results similar to the following:
This data has good distribution, with the number of transactions per customer distributed across a narrow range, and no clear outliers.
Check distribution by item quantity
Create and populate the df_dist_qty dataframe with item quantity data from the aggregated transactions by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery df_dist_qty --project $PROJECT_ID
SELECT
  qty_articles,
  count(1) c
FROM `clv.aggregation`
GROUP BY qty_articles
Use seaborn to visualize the item quantity data by copying the following code into the next empty cell of the notebook and then running it:
plt.figure(figsize=(12,4))
sns.distplot(df_dist_qty['qty_articles'], hist_kws=dict(ec="k"), kde=False)
You should see results similar to the following:
A few customers have very large quantities in their orders, but the distribution is generally good. In a production use case, you could do more data engineering in this area. You want to remove transactions that are outliers for individual customers, for example removing a transaction for 20 items for a customer who usually purchases 1 or 2 items. You want to keep customers who are outliers in that they usually purchase many more items than others.
Check distribution by monetary value
Create and populate the df_dist_values dataframe with monetary value data from the aggregated transactions by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery df_dist_values --project $PROJECT_ID
SELECT
  value
FROM `clv.aggregation`
Use seaborn to visualize the monetary value data by copying the following code into the next empty cell of the notebook and then running it:
axv = sns.violinplot(x=df_dist_values["value"])
axv.set_xlim(-200, 3500)
You should see results similar to the following:
The distribution shows a few outliers that spend significantly more than average. In a production use case, you could do more data engineering in this area. You want to remove transactions that are outliers for individual customers, for example removing a $1000 transaction for a customer who usually spends about $50 per transaction. You want to keep customers who are outliers in that they usually spend a lot more than others.
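If you want to explore per-customer outliers before deciding on stricter filters, a simple z-score check in pandas can help. This is an exploratory sketch; the 3-standard-deviation cutoff is an assumption, not the tutorial's method:

# Pull the aggregated daily values and flag per-customer outliers with a z-score.
df_agg = bq_client.query(
    "SELECT customer_id, order_day, value FROM `clv.aggregation`").to_dataframe()

grp = df_agg.groupby("customer_id")["value"]
zscore = (df_agg["value"] - grp.transform("mean")) / grp.transform("std")

outliers = df_agg[zscore.abs() > 3]
print(f"{len(outliers)} of {len(df_agg)} customer-day rows look like per-customer outliers")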
Define features
The following table describes the features that are computed to train the model:
Feature name | Type | Description |
---|---|---|
monetary | FLOAT | The sum of monetary values for all transactions for a given customer for a given time window. |
frequency | INTEGER | The number of transactions placed by a given customer for a given time window. |
recency | INTEGER | The time between the first and last transactions that were placed by a given customer for a given time window. |
T | INTEGER | The time between the first transaction in the dataset and the end of the input transactions window. |
time_between | FLOAT | The average time between transactions for a given customer for a given time window. |
avg_basket_value | FLOAT | The average monetary value of the customer's basket for a given time window. |
avg_basket_size | FLOAT | The average number of items that a given customer has in their basket during a given time window. |
has_returns | STRING | Whether a given customer has returned items for at least one transaction during a given time window. |
avg_time_to_return | FLOAT | The time that it takes for a given customer to return their first item for a transaction. |
num_returns | INTEGER | The number of items that the customer has returned that are associated with transactions that are in a given time window. |
Set feature computation parameters
The following parameters define the time windows to use for computing features:
- WINDOW_STEP — Specifies the number of days between thresholds. This determines how often you compute features and labels.
- WINDOW_STEP_INITIAL — Specifies the number of days between the first order date in your training data and the first threshold.
- WINDOW_LENGTH — Specifies the number of days back to use for input transactions. The default is to use a value of 0, which takes all transactions before the given threshold.
- LENGTH_FUTURE — Specifies the number of days in the future to predict the monetary value that serves as the ML label. At each threshold, BigQuery calculates the label value for all orders that happen within LENGTH_FUTURE days after the threshold date.
For example, assuming a dataset with an earliest transaction date of 1/1/2015 and the following parameter values:
- WINDOW_STEP: 30
- WINDOW_STEP_INITIAL: 90
- WINDOW_LENGTH: 0
- LENGTH_FUTURE: 30
The time windows for the first few loops of the feature creation procedure would work as follows:
First loop
Parameter | How it is calculated | Value |
---|---|---|
Threshold date | The earliest transaction date + the WINDOW_STEP_INITIAL value, so 1/1/2015 + 90 days. | 4/1/2015 |
Window start | WINDOW_LENGTH is 0, so all transactions prior to the threshold are used. The earliest transaction is on 1/1/2015. | 1/1/2015 |
Input transaction window | From the window start date until the threshold date, inclusive. | 1/1/2015 - 4/1/2015 |
Target transaction window | From the threshold date (exclusive) until the threshold date + the number of days specified by LENGTH_FUTURE (4/1/2015 + 30 days). | 4/2/2015 - 5/1/2015 |
Second loop
Parameter | How it is calculated | Value |
---|---|---|
Threshold date | The threshold date from the previous loop + the WINDOW_STEP value, so 4/1/2015 + 30 days. | 5/1/2015 |
Window start | WINDOW_LENGTH is 0, so all transactions prior to the threshold are used. The earliest transaction is on 1/1/2015. | 1/1/2015 |
Input transaction window | From the window start date until the threshold date, inclusive. | 1/1/2015 - 5/1/2015 |
Target transaction window | From the threshold date (exclusive) until the threshold date + the number of days specified by LENGTH_FUTURE (5/1/2015 + 30 days). | 5/2/2015 - 5/31/2015 |
If you were to modify WINDOW_LENGTH
to have a value of 15, the time windows
would instead work as follows:
First loop
Parameter | How it is calculated | Value |
---|---|---|
Threshold date | The earliest transaction date + the WINDOW_STEP_INITIAL value, so 1/1/2015 + 90 days. | 4/1/2015 |
Window start | Threshold date - WINDOW_LENGTH value, so 4/1/2015 - 15 days. | 3/17/2015 |
Input transaction window | From the window start date until the threshold date, inclusive. | 3/17/2015 - 4/1/2015 |
Target transaction window | From the threshold date (exclusive) until the threshold date + the number of days specified by LENGTH_FUTURE (4/1/2015 + 30 days). | 4/2/2015 - 5/1/2015 |
Second loop
Parameter | How it is calculated | Value |
---|---|---|
Threshold date | Threshold date from the previous loop + WINDOW_STEP value, so 4/1/2015 + 30 days. | 5/1/2015 |
Window start | Threshold date - WINDOW_LENGTH value, so 5/1/2015 - 15 days. | 4/16/2015 |
Input transaction window | From the window start date until the threshold date, inclusive. | 4/16/2015 - 5/1/2015 |
Target transaction window | From the threshold date (exclusive) until the threshold date + the number of days specified by LENGTH_FUTURE (5/1/2015 + 30 days). | 5/2/2015 - 5/31/2015 |
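If it helps to see the windowing logic outside of SQL, the following Python sketch mirrors the worked examples above. The windows function is purely illustrative; the tutorial implements this logic in the BigQuery scripting query later in this tutorial:

from datetime import date, timedelta

def windows(min_date, max_date, window_step, window_step_initial,
            window_length, length_future):
    """Yields input and target windows, mirroring the tables above."""
    threshold = min_date + timedelta(days=window_step_initial)
    while threshold < max_date - timedelta(days=window_step):
        window_start = (threshold - timedelta(days=window_length)
                        if window_length != 0 else min_date)
        yield {
            "input_window": (window_start, threshold),
            "target_window": (threshold + timedelta(days=1),
                              threshold + timedelta(days=length_future)),
        }
        threshold += timedelta(days=window_step)

# Print the first two loops with the example parameter values above.
for i, w in enumerate(windows(date(2015, 1, 1), date(2015, 12, 31), 30, 90, 0, 30)):
    print(w)
    if i == 1:
        break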
Set the parameters to determine feature computation windows by copying the following code into the next empty cell of the notebook and then running it:
CLV_PARAMS = {
'WINDOW_LENGTH': 0,
'WINDOW_STEP': 30,
'WINDOW_STEP_INITIAL': 90,
'LENGTH_FUTURE': 30
}
In a production use case, you would adjust these parameters to get results that best suit the data you are working with. For example, if your customers buy multiple times a week, you could compute features using weekly windows. Or if you have a lot of data, you could create more windows by decreasing their size and reducing the number of days between threshold dates. Experimenting with these parameters, in addition to hyperparameter tuning, can potentially improve the performance of your model.
Create features
Create a table with computed features for training the model by copying the following code into the next empty cell of the notebook and then running it:
%%bigquery --params $CLV_PARAMS --project $PROJECT_ID

-- Length
-- Date of the first order in the dataset.
DECLARE MIN_DATE DATE;
-- Date of the final order in the dataset.
DECLARE MAX_DATE DATE;
-- Date that separates inputs orders from target transactions.
DECLARE THRESHOLD_DATE DATE;
-- How many days back for inputs transactions. 0 means from the start.
DECLARE WINDOW_LENGTH INT64 DEFAULT @WINDOW_LENGTH;
-- Date at which an input transactions window starts.
DECLARE WINDOW_START DATE;
-- How many days between thresholds.
DECLARE WINDOW_STEP INT64 DEFAULT @WINDOW_STEP;
-- How many days for the first window.
DECLARE WINDOW_STEP_INITIAL INT64 DEFAULT @WINDOW_STEP_INITIAL;
-- Index of the window being run.
DECLARE STEP INT64 DEFAULT 1;
-- How many days to predict for.
DECLARE LENGTH_FUTURE INT64 DEFAULT @LENGTH_FUTURE;

SET (MIN_DATE, MAX_DATE) = (
  SELECT AS STRUCT
    MIN(order_day) AS min_days,
    MAX(order_day) AS max_days
  FROM
    `clv.aggregation`
);

SET THRESHOLD_DATE = MIN_DATE;

-- For more information about the features of this table,
-- see https://github.com/CamDavidsonPilon/lifetimes/blob/master/lifetimes/utils.py#L246
-- and https://cloud.google.com/solutions/machine-learning/clv-prediction-with-offline-training-train#aggregating_data
CREATE OR REPLACE TABLE clv.features
(
  customer_id STRING,
  monetary FLOAT64,
  frequency INT64,
  recency INT64,
  T INT64,
  time_between FLOAT64,
  avg_basket_value FLOAT64,
  avg_basket_size FLOAT64,
  has_returns STRING,
  avg_time_to_return FLOAT64,
  num_returns INT64,
  -- threshold DATE,
  -- step INT64,
  target_monetary FLOAT64,
);

-- Using a BigQuery LOOP procedural language statement
-- (https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language#loop)
-- lets you enhance the logic of your query without the need of another
-- programming language or client code.
LOOP
  -- Can choose a longer original window in case
  -- there were not many orders in the early days.
  IF STEP = 1 THEN
    SET THRESHOLD_DATE = DATE_ADD(THRESHOLD_DATE, INTERVAL WINDOW_STEP_INITIAL DAY);
  ELSE
    SET THRESHOLD_DATE = DATE_ADD(THRESHOLD_DATE, INTERVAL WINDOW_STEP DAY);
  END IF;
  SET STEP = STEP + 1;

  IF THRESHOLD_DATE >= DATE_SUB(MAX_DATE, INTERVAL (WINDOW_STEP) DAY) THEN
    LEAVE;
  END IF;

  -- Takes all transactions before the threshold date unless you decide
  -- to use a different window length to test model performance.
  IF WINDOW_LENGTH != 0 THEN
    SET WINDOW_START = DATE_SUB(THRESHOLD_DATE, INTERVAL WINDOW_LENGTH DAY);
  ELSE
    SET WINDOW_START = MIN_DATE;
  END IF;

  INSERT clv.features
  SELECT
    CAST(tf.customer_id AS STRING),
    ROUND(tf.monetary_orders, 2) AS monetary,
    tf.cnt_orders AS frequency,
    tf.recency,
    tf.T,
    ROUND(tf.recency/cnt_orders, 2) AS time_between,
    ROUND(tf.avg_basket_value, 2) AS avg_basket_value,
    ROUND(tf.avg_basket_size, 2) AS avg_basket_size,
    has_returns,
    CEIL(avg_time_to_return) AS avg_time_to_return,
    num_returns,
    ROUND(tt.target_monetary, 2) AS target_monetary,
  FROM (
    -- This SELECT uses only data before THRESHOLD_DATE to make features.
    SELECT
      customer_id,
      SUM(value) AS monetary_orders,
      DATE_DIFF(MAX(order_day), MIN(order_day), DAY) AS recency,
      DATE_DIFF(THRESHOLD_DATE, MIN(order_day), DAY) AS T,
      COUNT(DISTINCT order_day) AS cnt_orders,
      AVG(qty_articles) avg_basket_size,
      AVG(value) avg_basket_value,
      CASE
        WHEN SUM(num_returns) > 0 THEN 'y'
        ELSE 'n'
      END AS has_returns,
      AVG(time_to_return) avg_time_to_return,
      THRESHOLD_DATE AS threshold,
      SUM(num_returns) num_returns,
    FROM
      `clv.aggregation`
    WHERE
      order_day <= THRESHOLD_DATE AND
      order_day >= WINDOW_START
    GROUP BY
      customer_id
  ) tf
  INNER JOIN (
    -- This SELECT uses all data after threshold as target.
    SELECT
      customer_id,
      SUM(value) target_monetary
    FROM
      `clv.aggregation`
    WHERE
      order_day <= DATE_ADD(THRESHOLD_DATE, INTERVAL LENGTH_FUTURE DAY)
      -- For the sample data, the overall value is similar to the value
      -- after threshold, and prediction performs better using the overall
      -- value. When using your own data, try uncommenting the following
      -- AND clause and see which version of the procedure gives you better
      -- results.
      -- AND order_day > THRESHOLD_DATE
    GROUP BY
      customer_id) tt
  ON tf.customer_id = tt.customer_id;
END LOOP;
See a sample of the data for one customer in the resulting clv.features table by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery
SELECT
  *
FROM
  `clv.features`
WHERE
  customer_id = "10"
UNION ALL
  (SELECT * FROM `clv.features` LIMIT 5)
ORDER BY customer_id, frequency, T
LIMIT 5
You should see results similar to the following:
Train the model
This tutorial uses the AUTOML_REGRESSOR model type in BigQuery ML to create, train, and deploy an AutoML regression model.
The tutorial uses the default settings to train the model; in a production use case, you could try additional feature engineering techniques and different training data splits in order to improve the model.
Create and populate the clv_model
model by copying the following code
into the next empty cell of the notebook and then running it:
%%bigquery
CREATE OR REPLACE MODEL `clv.clv_model`
OPTIONS(MODEL_TYPE="AUTOML_REGRESSOR",
INPUT_LABEL_COLS=["target_monetary"],
OPTIMIZATION_OBJECTIVE="MINIMIZE_MAE")
AS SELECT
* EXCEPT(customer_id)
FROM
`clv.features`
This should give you a good base model to work from. To adapt the model to your own data and use case, you can optimize it using hyperparameter tuning, feature engineering, and other techniques.
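Optionally, you can inspect the model's evaluation metrics with ML.EVALUATE before using it for predictions (the metrics you see depend on your data and the automatic data split):

eval_metrics = bq_client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `clv.clv_model`)").to_dataframe()
print(eval_metrics)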
Predict CLV
Your next step is to get CLV predictions from the model for each customer and then write this data to a table. Prediction records contain the fields described in the following table:
Field name | Type | Description |
---|---|---|
customer_id | STRING | Customer ID |
monetary_so_far | FLOAT | Total amount of money spent by the customer before the prediction date. |
monetary_predicted | FLOAT | Predicted total amount of money spent by the customer, which is the sum of the monetary_so_far and monetary_future field values. |
monetary_future | FLOAT | Predicted future amount of money that the customer will spend between the prediction date and the end of the prediction period. |
Create and populate the clv.predictions table by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery --params $CLV_PARAMS --project $PROJECT_ID

-- How many days back for inputs transactions. 0 means from the start.
DECLARE WINDOW_LENGTH INT64 DEFAULT @WINDOW_LENGTH;
-- Date at which an input transactions window starts.
DECLARE WINDOW_START DATE;
-- Date of the first transaction in the dataset.
DECLARE MIN_DATE DATE;
-- Date of the final transaction in the dataset.
DECLARE MAX_DATE DATE;
-- Date from which you want to predict.
DECLARE PREDICT_FROM_DATE DATE;

SET (MIN_DATE, MAX_DATE) = (
  SELECT AS STRUCT
    MIN(order_day) AS min_days,
    MAX(order_day) AS max_days
  FROM
    `clv.aggregation`
);

-- You can set any date here. In production, it is generally today.
SET PREDICT_FROM_DATE = MAX_DATE;

IF WINDOW_LENGTH != 0 THEN
  SET WINDOW_START = DATE_SUB(PREDICT_FROM_DATE, INTERVAL WINDOW_LENGTH DAY);
ELSE
  SET WINDOW_START = MIN_DATE;
END IF;

CREATE OR REPLACE TABLE `clv.predictions` AS (
  SELECT
    customer_id,
    monetary AS monetary_so_far,
    ROUND(predicted_target_monetary, 2) AS monetary_predicted,
    ROUND(predicted_target_monetary - monetary, 2) AS monetary_future
  FROM
    ML.PREDICT(
      -- To use your own model, set the model name here.
      MODEL clv.clv_model,
      (
        SELECT
          customer_id,
          ROUND(monetary_orders, 2) AS monetary,
          cnt_orders AS frequency,
          recency,
          T,
          ROUND(recency/cnt_orders, 2) AS time_between,
          ROUND(avg_basket_value, 2) AS avg_basket_value,
          ROUND(avg_basket_size, 2) AS avg_basket_size,
          has_returns,
          CEIL(avg_time_to_return) AS avg_time_to_return,
          num_returns
        FROM (
          SELECT
            customer_id,
            SUM(value) AS monetary_orders,
            DATE_DIFF(MAX(order_day), MIN(order_day), DAY) AS recency,
            DATE_DIFF(PREDICT_FROM_DATE, MIN(order_day), DAY) AS T,
            COUNT(DISTINCT order_day) AS cnt_orders,
            AVG(qty_articles) avg_basket_size,
            AVG(value) avg_basket_value,
            CASE
              WHEN SUM(num_returns) > 0 THEN 'y'
              ELSE 'n'
            END AS has_returns,
            AVG(time_to_return) avg_time_to_return,
            SUM(num_returns) num_returns,
          FROM
            `clv.aggregation`
          WHERE
            order_day <= PREDICT_FROM_DATE AND
            order_day >= WINDOW_START
          GROUP BY
            customer_id
        )
      )
    )
)
See a sample of the data in the resulting clv.predictions table by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery
SELECT * FROM clv.predictions
ORDER BY customer_id
LIMIT 5;
You should see results similar to the following:
Evaluate prediction data
Visualize and generate statistics on the prediction data to better understand data distribution and see if there are any clear trends.
Create a dataframe based on the clv.predictions table by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery df_predictions --project $PROJECT_ID
SELECT * FROM clv.predictions
Use pandas.DataFrame.describe to generate descriptive statistics on the prediction data by copying the following code into the next empty cell of the notebook and then running it:
df_predictions.describe()
You should see results similar to the following:
Use matplotlib.gridspec to visualize the prediction data by copying the following code into the next empty cell of the notebook and then running it:
from matplotlib.gridspec import GridSpec

fig = plt.figure(constrained_layout=True, figsize=(15, 5))
gs = GridSpec(2, 2, figure=fig)
sns.set(font_scale = 1)
plt.tick_params(axis='x', labelsize=14)
ax0 = plt.subplot(gs.new_subplotspec((0, 0), colspan=1))
ax1 = plt.subplot(gs.new_subplotspec((0, 1), colspan=1))
ax2 = plt.subplot(gs.new_subplotspec((1, 0), colspan=2))
sns.violinplot(df_predictions['monetary_so_far'], ax=ax0, label='monetary_so_far')
sns.violinplot(df_predictions['monetary_predicted'], ax=ax1, label='monetary_predicted')
sns.violinplot(df_predictions['monetary_future'], ax=ax2, label='monetary_future')
You should see results similar to the following:
The monetary distribution analysis shows small monetary amounts for the next month compared to the overall historical value. One reason is that the model is trained to predict the value for the next month, as specified by the LENGTH_FUTURE parameter. You can experiment with changing that value to train and predict for the next quarter (LENGTH_FUTURE = 90) and see how the distribution changes.
Create a similar audience
Use the procedures in this section to create and export a list of customers with high CLV and then use that list to create a similar audience in Google Ads.
The code examples in the sections following Identify top customers by CLV are only runnable if you have access to a working Google Ads environment. Address the Requirements for Google Ads integration to make this example code runnable with your own Google Ads environment.
Identify top customers by CLV
As the first step to create a similar audience, you need to identify your top customers based on their predicted CLV, then associate email addresses with them. You do this by using the customer ID to join the predictions table with a customer data table that contains this information.
This tutorial uses the top 20% of customers, based on CLV. You can change the
percentage of customers to select by modifying the TOP_CLV_RATIO
parameter.
The SQL statement to select the customers uses the
PERCENT_RANK
function to identify customers whose predicted future spending puts
them at or above the percentage identified by the TOP_CLV_RATIO
value.
Set the TOP_CLV_RATIO parameter by copying the following code into the next empty cell of the notebook and then running it:

CLV_PARAMS = {
    'TOP_CLV_RATIO': 0.2
}
Create and populate the df_top_ltv dataframe by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery df_top_ltv --params $CLV_PARAMS --project $PROJECT_ID

DECLARE TOP_CLV_RATIO FLOAT64 DEFAULT @TOP_CLV_RATIO;

SELECT
  p.customer_id,
  monetary_future,
  c.email AS email
FROM (
  SELECT
    customer_id,
    monetary_future,
    PERCENT_RANK() OVER (ORDER BY monetary_future DESC) AS percent_rank_monetary
  FROM
    `clv.predictions`
) p
INNER JOIN (
  SELECT
    customer_id,
    email
  FROM
    `clv.customers`
) c
ON p.customer_id = c.customer_id
WHERE
  -- Decides the size of your list of emails. For similar-audience use cases
  -- where you need to find a minimum of matching emails, 20% should provide
  -- enough potential emails.
  percent_rank_monetary <= TOP_CLV_RATIO
ORDER BY monetary_future DESC
See a sample of the df_top_ltv data by copying the following code into the next empty cell of the notebook and then running it:

df_top_ltv.head(5)
You should see results similar to the following:
Create the Google Ads configuration file
Create the configuration YAML file for the Google Ads client.
Set the YAML file variables that control access to the Google Ads API by replacing the placeholder variables below with appropriate values for your environment. See Requirements for Google Ads integration for information on how to get the required tokens and OAuth credentials if you don't already have them.
Copy the following code into the next empty cell of the notebook and then run it:
DEVELOPER_TOKEN = "myDeveloperToken"
OAUTH_2_CLIENT_ID = "myClientId"
CLIENT_SECRET = "myClientSecret"
REFRESH_TOKEN = "myRefreshToken"
Create the YAML file content by copying the following code into the next empty cell of the notebook and then running it.
ADWORDS_FILE = "/tmp/adwords.yaml"

adwords_content = f"""
# AdWordsClient configurations
adwords:
  # Required Fields
  developer_token: {DEVELOPER_TOKEN}

  # Optional Fields
  # client_customer_id: INSERT_CLIENT_CUSTOMER_ID_HERE
  # user_agent: INSERT_USER_AGENT_HERE
  # partial_failure: True
  # validate_only: True

  # OAuth2 Configuration
  # Below you may provide credentials for either the installed application or
  # service account flows. Remove or comment the lines for the flow you're
  # not using.

  # The following values configure the client for the installed application
  # flow.
  client_id: {OAUTH_2_CLIENT_ID}
  client_secret: {CLIENT_SECRET}
  refresh_token: {REFRESH_TOKEN}

  # The following values configure the client for the service account flow.
  # path_to_private_key_file: INSERT_PATH_TO_JSON_KEY_FILE_HERE
  # delegated_account: INSERT_DOMAIN_WIDE_DELEGATION_ACCOUNT

  # ReportDownloader Headers
  # Below you may specify boolean values for optional headers that will be
  # applied to all requests made by the ReportDownloader utility by default.
  # report_downloader_headers:
  #   skip_report_header: False
  #   skip_column_header: False
  #   skip_report_summary: False
  #   use_raw_enum_values: False

# AdManagerClient configurations
ad_manager:
  # Required Fields
  application_name: INSERT_APPLICATION_NAME_HERE

  # Optional Fields
  # The network_code is required for all services except NetworkService:
  # network_code: INSERT_NETWORK_CODE_HERE
  # delegated_account: INSERT_DOMAIN_WIDE_DELEGATION_ACCOUNT

  # OAuth2 Configuration
  # Below you may provide credentials for either the installed application or
  # service account (recommended) flows. Remove or comment the lines for the
  # flow you're not using.

  # The following values configure the client for the service account flow.
  path_to_private_key_file: INSERT_PATH_TO_JSON_KEY_FILE_HERE
  # delegated_account: INSERT_DOMAIN_WIDE_DELEGATION_ACCOUNT

  # The following values configure the client for the installed application
  # flow.
  # client_id: INSERT_OAUTH_2_CLIENT_ID_HERE
  # client_secret: INSERT_CLIENT_SECRET_HERE
  # refresh_token: INSERT_REFRESH_TOKEN_HERE

# Common configurations:

# Compression (optional)
# Below you may specify whether to accept and automatically decompress gzip
# encoded SOAP requests. By default, gzip compression is not enabled.
# enable_compression: False

# Logging configuration (optional)
# Below you may specify the logging configuration. This will be provided as
# an input to logging.config.dictConfig.
# logging:
#   version: 1
#   disable_existing_loggers: False
#   formatters:
#     default_fmt:
#       format: ext://googleads.util.LOGGER_FORMAT
#   handlers:
#     default_handler:
#       class: logging.StreamHandler
#       formatter: default_fmt
#       level: INFO
#   loggers:
#     # Configure root logger
#     "":
#       handlers: [default_handler]
#       level: INFO

# Proxy configurations (optional)
# Below you may specify an HTTP or HTTPS Proxy to be used when making API
# requests. Note: You must specify the scheme used for the proxy endpoint.
# For additional information on configuring these values, see:
# http://docs.python-requests.org/en/master/user/advanced/#proxies
# proxy_config:
#   http: INSERT_HTTP_PROXY_URI_HERE
#   https: INSERT_HTTPS_PROXY_URI_HERE
#   # If specified, the given cafile will only be used if certificate
#   # validation is not disabled.
#   cafile: INSERT_PATH_HERE
#   disable_certificate_validation: False

# Utilities Included (optional)
# Below you may specify whether the library will include utilities used in the
# user agent. By default, the library will include utilities used in the user
# agent.
# include_utilities_in_user_agent: True

# Custom HTTP headers (optional)
# Specify one or more custom headers to pass along with all requests to
# the API.
# custom_http_headers:
#   X-My-Header: 'content'
"""
Populate the YAML file by copying the following code into the next empty cell of the notebook and then running it.
with open(ADWORDS_FILE, "w") as adwords_file:
    print(adwords_content, file=adwords_file)
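Optionally, confirm that the googleads library can parse the file you just wrote. This only loads the configuration locally; it does not call the Google Ads API:

test_client = adwords.AdWordsClient.LoadFromStorage(ADWORDS_FILE)
print("Loaded AdWords client configuration from", ADWORDS_FILE)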
Create a Google Ads remarketing list
Use the emails of the top CLV customers to create a Google Ads remarketing list.
Create a list containing the emails of the top CLV customers by copying the following code into the next empty cell of the notebook and then running it:
ltv_emails = list(set(df_top_ltv['email']))
Create the remarketing list by copying the following code into the next empty cell of the notebook and then running it:
"""Adds a user list and populates it with hashed email addresses. Note: It may take several hours for the list to be populated with members. Email addresses must be associated with a Google account. For privacy purposes, the user list size will show as zero until the list has at least 1000 members. After that, the size will be rounded to the two most significant digits. """ import hashlib import uuid # Import appropriate modules from the client library. from googleads import adwords def main(client): # Initialize appropriate services. user_list_service = client.GetService('AdwordsUserListService', 'v201809') user_list = { 'xsi_type': 'CrmBasedUserList', 'name': 'Customer relationship management list #%d' % uuid.uuid4(), 'description': 'A list of customers that originated from email addresses', # CRM-based user lists can use a membershipLifeSpan of 10000 to indicate # unlimited; otherwise normal values apply. 'membershipLifeSpan': 30, 'uploadKeyType': 'CONTACT_INFO' } # Create an operation to add the user list. operations = [{ 'operator': 'ADD', 'operand': user_list }] result = user_list_service.mutate(operations) user_list_id = result['value'][0]['id'] emails = ['customer1@example.com', 'customer2@example.com', ' Customer3@example.com '] members = [{'hashedEmail': NormalizeAndSHA256(email)} for email in emails] # Add address info. members.append({ 'addressInfo': { # First and last name must be normalized and hashed. 'hashedFirstName': NormalizeAndSHA256('John'), 'hashedLastName': NormalizeAndSHA256('Doe'), # Country code and zip code are sent in plaintext. 'countryCode': 'US', 'zipCode': '10001' } }) mutate_members_operation = { 'operand': { 'userListId': user_list_id, 'membersList': members }, 'operator': 'ADD' } response = user_list_service.mutateMembers([mutate_members_operation]) if 'userLists' in response: for user_list in response['userLists']: print('User list with name "%s" and ID "%d" was added.' % (user_list['name'], user_list['id'])) def NormalizeAndSHA256(s): """Normalizes (lowercase, remove whitespace) and hashes a string with SHA-256. Args: s: The string to perform this operation on. Returns: A normalized and SHA-256 hashed string. """ return hashlib.sha256(s.strip().lower()).hexdigest() if __name__ == '__main__': # Initialize the client object with the config # file you created in the previous section. adwords_client = adwords.AdWordsClient.LoadFromStorage(ADWORDS_FILE) main(adwords_client)
Add a similar audience
Once you have the remarketing list, follow the instructions in the Google Ads documentation to Add similar audiences to your targeting.
Optional next steps
Use the information in the following sections to optionally improve model performance and automate the ML workflow.
Improve the model
This tutorial has shown you one possible approach to creating a basic model for predicting the future monetary value of your customers. If you decide to investigate this approach further, here are some suggestions for areas to improve or experiment with:
- If you have a lot of outliers in your data, do additional data engineering or find additional training data.
- Try different approaches to creating features by adjusting the time window boundaries, or changing the ratio of input to target transactions.
- When comparing models, make sure that you use the same set of test data in order to fairly compare the models' performance. For example, this tutorial prepares the data to best predict the orders for the next 30 days. If you want a model to predict for the next quarter, you should update the LENGTH_FUTURE parameter and recompute the features.
- The example dataset uses a limited number of fields. If you have other dimensions such as product categories, geographic regions, or demographic information in your own data, try to create additional features for your model.
Automate the ML workflow
In the previous steps, you learned how to use BigQuery and BigQuery ML to prepare data for machine learning, create and train a model, get predictions, and prepare data for Google Ads integration.
In a production use case, these would all be recurring tasks. You could run
these steps manually every time, but we recommend automating the process. This
tutorial has companion code in GitHub that provides a set of
BigQuery stored procedures
for automating these steps, and also a shell script,
run.sh
,
that you can use to run them.
You can use the following parameters when you call the run.sh
script to configure it
to work in your environment:
Variable | Description | Default |
---|---|---|
PROJECT_ID | Project ID | Required |
DATASET_ID | Dataset ID | Required |
TRAIN_MODEL_ID | Name of the trained model. Set to *null* if you do not want to train a model. | String based on time. |
USE_MODEL_ID | Name of the model to use for predictions. Must include dataset: [DATASET].[MODEL] | String based on time. |
WINDOW_LENGTH | Time range in days for input transactions. | 0 |
WINDOW_STEP | Time in days between two windows. Equivalent to the time between two threshold dates. | 30 |
WINDOW_STEP_INITIAL | Initial time in days before setting the first threshold date. | 90 |
LENGTH_FUTURE | Time in days for which to make a prediction. | 30 |
MAX_STDV_MONETARY | Standard deviation of the monetary value per customer above which the script removes transactions. | 500 |
MAX_STDV_QTY | Standard deviation of the quantity value per customer above which the script removes transactions. | 100 |
TOP_LTV_RATIO | Percentage of the top customers to extract. | 0.2 |
Follow these steps to use the run.sh
script for automation:
Clone the GitHub repo:
git clone https://github.com/GoogleCloudPlatform/analytics-componentized-patterns.git
Update the run.sh script to set the parameter values to match your environment.

If you use your own transaction data, update 10_procedure_match.sql to specify the appropriate columns from your transactions table.
Adjust permissions on the run.sh script so you can run it:

chmod +x run.sh
Review the command-line help for information on how to set the parameters:
./run.sh --help
Run the script:
./run.sh --project-id [YOUR_PROJECT_ID] --dataset-id [YOUR_DATASET_ID]
In a production use case, you could check with your data engineers and your DevOps team to see if they can integrate the stored procedures into their tooling for ease of use.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project containing the resources, or keep the project but delete just those resources.
Either way, you should remove those resources so you won't be billed for them in the future. The following sections describe how to delete these resources.
Delete the project
The easiest way to eliminate billing is to delete the project you created for the tutorial.
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the components
If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.
Delete the BigQuery dataset and tables
- Open the BigQuery console
- In the Resources section, expand the project in which you are completing this tutorial and select the clv dataset.
- Click Delete dataset in the header of the dataset pane.
- In the overlay window that appears, type clv and then click Delete.
Delete the user-managed notebooks instance
- Open the User-managed notebooks tab
- Select the checkbox for the clv notebook instance.
- Click Delete.
- In the overlay window that appears, click Delete.
What's next
- Review the sample code in the Activate on LTV predictions repo on GitHub.
- Learn about other predictive forecasting solutions.
- Learn more about BigQuery ML.
- For an in-depth discussion of MLOps, see MLOps: Continuous delivery and automation pipelines in machine learning.