Building new audiences based on existing customer lifetime value

Learn how to predict customer lifetime value (CLV), and then use this data to identify potential new customers by building similar audiences in Google Ads.

Overview

This tutorial shows you how to predict the likely monetary value per customer for a specified timeframe, based on the customer's transaction history. You accomplish this through the following steps:

  • Importing transaction and customer data into BigQuery.
  • Processing the data for use with a machine learning (ML) model.
  • Building the ML model by using BigQuery ML.
  • Getting predictions from the model to identify top customers by CLV.
  • Retrieving email addresses for these customers from a customer data table.
  • Using this combined data to create a remarketing list for use in Google Ads.
  • Using this remarketing list as the basis for generating a similar audience for use in ad targeting.

This tutorial walks you through creating a Vertex AI Workbench user-managed notebooks instance to accomplish these objectives, with prescriptive guidance and step-by-step procedures. If you just want a completed notebook to get started with instead, you can use bqml_automl_ltv_activate_lookalike.ipynb on the notebook server of your choice.

This tutorial is intended for data engineers, data scientists, and data analysts who build ML datasets and models to support business decisions. It assumes that you have basic knowledge of the following:

  • Machine learning concepts and Python
  • Standard SQL

This tutorial refactors the example used in Predicting Customer Lifetime Value with AutoML Tables, which is the fourth and last part of another tutorial series on predicting CLV. Consider reading the first tutorial in that series, Predicting Customer Lifetime Value with AI Platform: introduction, if you want to better understand key decision points specific to doing feature engineering and model development for CLV discovery.

The example in this tutorial differs from the one in Predicting Customer Lifetime Value with AutoML Tables in the following ways:

  • It creates multiple records per customer to use in training the model. It does this by segmenting customer transactions into overlapping windows of time.
  • It enables you to use the model to predict CLV for different periods of time, for example one month or one quarter, by simply changing model input parameters.
  • It minimizes development time by using AutoML directly from BigQuery ML.
  • It uses different datasets for the sales and customer data.
  • It shows how to use the CLV predictions to create similar audiences in Google Ads.

Datasets

To create a list of high-CLV customers that you can build a similar audience from, you need two sets of data:

  • Sales transactions for the customers you want to evaluate. You use computations based on this data to train the CLV model. In a production use case, you would pull this information from your order processing system.
  • Customer email addresses. In a production use case, you would pull this information from your customer relationship management (CRM) system.

This tutorial provides sample data in two CSV files that you load into BigQuery tables. Both of these datasets contain a customer ID, which you use to join tables to associate the customer email address with their CLV information.

Model

This tutorial shows you how to create a regression model that you use to predict the future monetary value of your customers. This type of model is a good fit because it predicts a continuous variable (also called a continuous value), such as a CLV amount.

Specifically, this tutorial uses BigQuery ML to create an AutoML regression model. Using AutoML helps minimize development overhead in the following ways:

  • Feature engineering — AutoML automatically handles several common feature engineering tasks.
  • BigQuery ML integration — You can create AutoML models by using BigQuery ML SQL statements, which limits the need to switch back and forth between different development contexts.
  • Splitting the training dataset — When training a model, it is best practice to split your data into three non-overlapping datasets: training, validation, and test. AutoML handles this by default, although you can also specify a custom split if you want to.

Calculating features and labels

In this tutorial, you use threshold dates to segment customer transactions over time into multiple overlapping windows. A threshold date separates input transactions from target transactions. Input transactions occur before the threshold date, and are used to calculate the features that you input to the model. Target transactions occur after the threshold date and are used to compute the CLV for that time period, which is used as the label you are training the model to predict.
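
To make the idea concrete, the following is a minimal Python sketch (using a hypothetical dataframe, not the tutorial dataset) of how a single threshold date splits one customer's transactions into input and target sets:

import pandas as pd

# Hypothetical transaction history for one customer; the column names here are
# illustrative only.
transactions = pd.DataFrame({
    "customer_id": [10, 10, 10, 10],
    "transaction_date": pd.to_datetime(
        ["2015-01-15", "2015-03-02", "2015-04-20", "2015-05-05"]),
    "value": [35.0, 20.0, 50.0, 15.0],
})

threshold_date = pd.Timestamp("2015-04-01")

# Input transactions: occur on or before the threshold date and are used to
# compute the model features.
input_transactions = transactions[transactions["transaction_date"] <= threshold_date]

# Target transactions: occur after the threshold date and are used to compute
# the CLV label for this window.
target_transactions = transactions[transactions["transaction_date"] > threshold_date]

print(input_transactions)
print("Label (target monetary value):", target_transactions["value"].sum())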

RFM features

Three important features that are used in CLV models are the recency, frequency, and monetary (RFM) values, which you calculate from customer transaction history data:

  • Recency: When was the customer's last order?
  • Frequency: How often do they buy?
  • Monetary: What amount do they spend?

The following diagram shows a succession of past sales for a set of four customers.

Sales history for 4 customers

The diagram illustrates the RFM values for the customers, showing for each customer:

  • Recency: The time between the last purchase and today, represented by the distance between the rightmost circle and the vertical dotted line that's labeled Now.
  • Frequency: The time between purchases, represented by the distance between the circles on a single line.
  • Monetary: The amount of money spent on each purchase, represented by the size of the circle.
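
As a minimal illustration, the following Python sketch computes RFM values per customer from a hypothetical transaction dataframe. The column names are assumptions for illustration only; also note that the SQL later in this tutorial computes recency as the span between a customer's first and last input transactions rather than the time since the last purchase:

import pandas as pd

# Hypothetical transaction history; not the tutorial's schema.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "transaction_date": pd.to_datetime(
        ["2015-02-01", "2015-04-15", "2015-01-10", "2015-03-05", "2015-05-20"]),
    "value": [40.0, 25.0, 10.0, 60.0, 30.0],
})

now = pd.Timestamp("2015-06-01")  # the "Now" line in the diagram

rfm = transactions.groupby("customer_id").agg(
    # Recency: days since the customer's last purchase.
    recency=("transaction_date", lambda d: (now - d.max()).days),
    # Frequency: number of purchases.
    frequency=("transaction_date", "count"),
    # Monetary: total amount spent.
    monetary=("value", "sum"),
)
print(rfm)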

Requirements for Google Ads integration

The example code in the sections of this tutorial on Google Ads integration requires access to a working Google Ads environment to run. To make the example code runnable with your own Google Ads environment, you must have the following items:

  • A Google Ads Manager account and access to the Google Ads API. If you don't already have these, follow the instructions in the Sign Up procedure to get them.
  • A Google Ads developer token, an OAuth client ID and client secret, and an OAuth refresh token for the Google Ads Python client library. If you don't already have these items, follow the instructions in Set up AdWords API access to get them. For step 5 in that procedure, you can use googleads-python-lib/examples/adwords/authentication/generate_refresh_token.py to see how to get a refresh token.
  • Transaction data that you can use to calculate CLV, and customer data that includes the customer email address. Use this data in place of the sample data provided in the Import the data section. Make sure that your data uses the same table names and schemas as the sample data. The following tables describe the expected BigQuery tables (a sketch of creating matching tables follows them):

Table clv.sales

Column Data type
customer_id INTEGER
order_id STRING
transaction_date DATE
product_sku STRING
qty INTEGER
unit_price FLOAT

Table clv.customers

Column Data type
customer_id INTEGER
full_name STRING
age INTEGER
job_title STRING
email STRING
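
If you need to create these tables for your own data, the following is a minimal sketch using the bq command-line tool. It assumes that the clv dataset already exists in your project and that myProject is your project ID; you can then load your own CSV files into the tables:

bq mk --table myProject:clv.sales \
  customer_id:INTEGER,order_id:STRING,transaction_date:DATE,product_sku:STRING,qty:INTEGER,unit_price:FLOAT

bq mk --table myProject:clv.customers \
  customer_id:INTEGER,full_name:STRING,age:INTEGER,job_title:STRING,email:STRING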

Objectives

  1. Prepare and explore training data by using BigQuery.
  2. Build, train, and deploy an ML model by using BigQuery ML.
  3. Get CLV predictions from the deployed model.
  4. Create a list of the customers that represent the top 20% of CLV.
  5. Create a Google Ads remarketing list based on the list of top CLV customers, which you can then use to generate a similar audience to those customers.

Costs

This tutorial uses the following billable components of Google Cloud:

  • AI Platform
  • BigQuery
  • BigQuery ML
  • Compute Engine

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Compute Engine, BigQuery, and Notebooks APIs.

    Enable the APIs

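If you prefer the command line, the following is a minimal sketch of enabling the same APIs with the gcloud CLI, assuming the gcloud CLI is installed and authenticated and that myProject is the ID of your project:

gcloud services enable compute.googleapis.com \
  bigquery.googleapis.com \
  notebooks.googleapis.com \
  --project myProject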

Create a notebook

  1. In the Google Cloud Console, go to the Notebooks page.

    Go to Notebooks

  2. Click New notebook.

  3. Choose Python 3.

  4. For Instance name, type clv.

  5. Click Create. It takes a few minutes for the notebook instance to be created.

  6. When the instance is available, click Open JupyterLab.

  7. In the Notebook section of the JupyterLab Launcher, click Python 3.

Install libraries

Install googleads and other libraries needed for this tutorial:

  1. Copy the following code into the first cell of the notebook:

    # Install libraries.
    %pip install -q googleads
    %pip install -q -U kfp matplotlib Faker --user
    
    # Restart kernel after installs
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    
  2. Click Run in the menu bar.

Import packages

Import the packages needed for this tutorial by copying the following code into the next empty cell of the notebook and then running it:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os, json, random
import hashlib, uuid
import time, calendar, math
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns
from datetime import datetime
from google.cloud import bigquery
from googleads import adwords

Create a BigQuery client

Create a BigQuery client for the Python client library for BigQuery by copying the following code into the next empty cell of the notebook and then running it. Replace myProject with the ID of the project you are using to complete this tutorial. You can find the project ID in the Project info card on your Google Cloud project dashboard.

bq_client = bigquery.Client(project="myProject")

Import the data

Use the instructions in the following sections to import the sample data into BigQuery and then process it for use in training the ML model.

If you have a working Google Ads environment, you can use your own data instead of importing the sample. Populate the clv.sales table with transaction data that you can use to calculate CLV, and populate the clv.customers table with customer data that includes the customer email address.

Create a dataset

Before you can create BigQuery tables for the example data, you must create a dataset to contain them.

Create the clv dataset by copying the following code into the next empty cell of the notebook and then running it. Replace myProject with the ID of the project you are using to complete this tutorial.

PROJECT_ID = "myProject"
! bq mk $PROJECT_ID:clv

Import the sales data

Create and populate the clv.sales table by copying the following code into the next empty cell of the notebook and then running it:

! bq load \
--project_id $PROJECT_ID \
--skip_leading_rows 1 \
--max_bad_records 100000 \
--replace \
--field_delimiter "," \
--autodetect \
clv.sales \
gs://solutions-public-assets/analytics-componentized-patterns/ltv/sales_*

Import the customer data

Create and populate the clv.customers table by copying the following code into the next empty cell of the notebook and then running it:

! bq load \
--project_id $PROJECT_ID \
--skip_leading_rows 1 \
--max_bad_records 100000 \
--replace \
--field_delimiter "," \
--autodetect \
clv.customers \
gs://solutions-public-assets/analytics-componentized-patterns/ltv/crm.csv

Prepare the data

To prepare the customer transaction data so you can use it to train the ML model, you must complete the following tasks:

  • Set parameters that determine the standard deviation allowed for order monetary value and item quantity. These are used to identify and remove outlier records for each customer when performing the initial aggregation of the customer transaction data.
  • Aggregate the customer transaction data.
  • Check the distribution of the aggregated data across criteria like transactions per customer and number of items per order to identify areas where you might want to further refine the data.
  • Define the features to compute from the aggregated data, such as the RFM values.
  • Set parameters that define the time windows to use for feature computation. For example, how many days there should be between threshold dates.
  • Compute the features needed to train the model.

Set aggregation parameters

Use the following parameters in the aggregation procedure to define the maximum standard deviation values:

  • MAX_STDV_MONETARY — The maximum standard deviation allowed for the monetary value per customer.
  • MAX_STDV_QTY — The maximum standard deviation allowed for the quantity of products per customer.

Set the parameters that determine the standard deviation values to use by copying the following code into the next empty cell of the notebook and then running it:

AGG_PARAMS = {
    'MAX_STDV_MONETARY': 500,
    'MAX_STDV_QTY': 100
}

In a production use case, you could modify these parameters to determine what ranges of values are acceptable for order value and item quantity.
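
For example, the following is a minimal sketch of a query you could run in a %%bigquery cell to inspect how per-customer standard deviations of order value are distributed before choosing a MAX_STDV_MONETARY threshold; it uses only columns from the clv.sales table:

%%bigquery --project $PROJECT_ID

SELECT
  APPROX_QUANTILES(stdv_value, 4) AS stdv_value_quartiles
FROM (
  SELECT
    customer_id,
    STDDEV(order_value) AS stdv_value
  FROM (
    SELECT
      customer_id,
      order_id,
      SUM(qty * unit_price) AS order_value
    FROM
      `clv.sales`
    GROUP BY
      customer_id,
      order_id)
  GROUP BY
    customer_id)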

Aggregate customer transaction data

The following query aggregates all orders per day per customer. Because the ML task is to predict a monetary value for a period of time like weeks or months, days works well as a time unit to group the transactions per customer.

This query also removes orders that are outliers for each customer. Orders where the order value or quantity of products falls outside the acceptable standard deviation values specified by MAX_STDV_MONETARY and MAX_STDV_QTY are filtered out.

  1. Create a table with aggregated customer transaction data by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery --params $AGG_PARAMS --project $PROJECT_ID
    
    DECLARE MAX_STDV_MONETARY INT64 DEFAULT @MAX_STDV_MONETARY;
    DECLARE MAX_STDV_QTY INT64 DEFAULT @MAX_STDV_QTY;
    
    CREATE OR REPLACE TABLE `clv.aggregation` AS
    SELECT
      customer_id,
      order_day,
      ROUND(day_value_after_returns, 2) AS value,
      day_qty_after_returns as qty_articles,
      day_num_returns AS num_returns,
      CEIL(avg_time_to_return) AS time_to_return
    FROM (
      SELECT
        customer_id,
        order_day,
        SUM(order_value_after_returns) AS day_value_after_returns,
        STDDEV(SUM(order_value_after_returns)) OVER(PARTITION BY customer_id ORDER BY SUM(order_value_after_returns)) AS stdv_value,
        SUM(order_qty_after_returns) AS day_qty_after_returns,
        STDDEV(SUM(order_qty_after_returns)) OVER(PARTITION BY customer_id ORDER BY SUM(order_qty_after_returns)) AS stdv_qty,
        CASE
          WHEN MIN(order_min_qty) < 0 THEN count(1)
          ELSE 0
        END AS day_num_returns,
        CASE
          WHEN MIN(order_min_qty) < 0 THEN AVG(time_to_return)
          ELSE NULL
        END AS avg_time_to_return
      FROM (
        SELECT
          customer_id,
          order_id,
          -- Gives the order date vs return(s) dates.
          MIN(transaction_date) AS order_day,
          MAX(transaction_date) AS return_final_day,
          DATE_DIFF(MAX(transaction_date), MIN(transaction_date), DAY) AS time_to_return,
          -- Aggregates all products in the order
          -- and all products returned later.
          SUM(qty * unit_price) AS order_value_after_returns,
          SUM(qty) AS order_qty_after_returns,
          -- If negative, order has qty return(s).
          MIN(qty) order_min_qty
        FROM
          `clv.sales`
        GROUP BY
          customer_id,
          order_id)
      GROUP BY
        customer_id,
        order_day)
    WHERE
      -- [Optional] Remove dates with outliers per customer.
      (stdv_value < MAX_STDV_MONETARY
        OR stdv_value IS NULL) AND
      (stdv_qty < MAX_STDV_QTY
        OR stdv_qty IS NULL);
    
  2. See a sample of the data in the resulting clv.aggregation table by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery
    
    SELECT * FROM clv.aggregation LIMIT 5;
    

    You should see results similar to the following:

    First 5 rows of aggregated data.

In a production use case, you would probably want to further refine the data by applying additional selection criteria, such as the following options (a minimal sketch of these filters follows the list):

  • Remove records that don't have positive item quantities and purchase amounts.
  • Remove records that don't have a customer ID.
  • Keep only active customers, however you define that. For example, you might want to focus on customers who have purchased something in the past 90 days.
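
The following is a minimal sketch of what such filters could look like in a %%bigquery cell, using the clv.sales schema from this tutorial and a hypothetical cutoff date for the activity window:

%%bigquery --project $PROJECT_ID

SELECT
  *
FROM
  `clv.sales`
WHERE
  -- Keep only positive quantities and purchase amounts.
  qty > 0
  AND unit_price > 0
  -- Remove records without a customer ID.
  AND customer_id IS NOT NULL
  -- Keep only customers with a purchase in the 90 days before the cutoff date.
  AND customer_id IN (
    SELECT customer_id
    FROM `clv.sales`
    WHERE transaction_date >= DATE_SUB(DATE '2015-12-31', INTERVAL 90 DAY))
LIMIT 10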

Check data distribution

This tutorial does minimal data cleansing, instead focusing on doing basic transformation of the transaction data so that it will work with the model. In this section, you check data distribution to determine where there are outliers and identify areas for further data refinement.

Check distribution by date

  1. Create and populate the df_dist_dates dataframe with date data from the aggregated transactions by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery df_dist_dates --project $PROJECT_ID
    
    SELECT count(1) c, SUBSTR(CAST(order_day AS STRING), 0, 7) as yyyy_mm
    FROM `clv.aggregation`
    WHERE qty_articles > 0
    GROUP BY yyyy_mm
    ORDER BY yyyy_mm
    
  2. Use seaborn to visualize the date data by copying the following code into the next empty cell of the notebook and then running it:

    plt.figure(figsize=(12,5))
    sns.barplot( x='yyyy_mm', y='c', data=df_dist_dates)
    

    You should see results similar to the following:

    Visualization of date data distribution.

    Orders are well distributed across the year, although a little lower in the first few months captured in this dataset. In a production use case, you could use data distribution by date as one factor to consider when setting the time window parameters for the feature computation procedure.

Check distribution by transactions per customer

  1. Create and populate the df_dist_customers dataframe with customer transaction counts from the aggregated transactions by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery df_dist_customers --project $PROJECT_ID
    
    SELECT customer_id, count(1) c
    FROM `clv.aggregation`
    GROUP BY customer_id
    
  2. Use seaborn to visualize the customer transaction data by copying the following code into the next empty cell of the notebook and then running it:

    plt.figure(figsize=(12,4))
    sns.distplot(df_dist_customers['c'], hist_kws=dict(ec="k"), kde=False)
    

    You should see results similar to the following:

    Visualization of customer data distribution.

    This data has good distribution, with the number of transactions per customer distributed across a narrow range, and no clear outliers.

Check distribution by item quantity

  1. Create and populate the df_dist_qty dataframe with item quantity data from the aggregated transactions by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery df_dist_qty --project $PROJECT_ID
    
    SELECT qty_articles, count(1) c
    FROM `clv.aggregation`
    GROUP BY qty_articles
    
  2. Use seaborn to visualize the item quantity data by copying the following code into the next empty cell of the notebook and then running it:

    plt.figure(figsize=(12,4))
    sns.distplot(df_dist_qty['qty_articles'], hist_kws=dict(ec="k"), kde=False)
    

    You should see results similar to the following:

    Visualization of item quantity data distribution.

    A few customers have very large quantities in their orders, but the distribution is generally good. In a production use case, you could do more data engineering in this area. You want to remove transactions that are outliers for individual customers, for example removing a transaction for 20 items for a customer who usually purchases 1 or 2 items. You want to keep customers who are outliers in that they usually purchase many more items than others.

Check distribution by monetary value

  1. Create and populate the df_dist_values dataframe with monetary value data from the aggregated transactions by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery df_dist_values --project $PROJECT_ID
    
    SELECT value
    FROM `clv.aggregation`
    
  2. Use seaborn to visualize the monetary value data by copying the following code into the next empty cell of the notebook and then running it:

    axv = sns.violinplot(x=df_dist_values["value"])
    axv.set_xlim(-200, 3500)
    

    You should see results similar to the following:

    Visualization of monetary data distribution.

    The distribution shows a few outliers that spend significantly more than average. In a production use case, you could do more data engineering in this area. You want to remove transactions that are outliers for individual customers, for example removing a $1000 transaction for a customer who usually spends about $50 per transaction. You want to keep customers who are outliers in that they usually spend a lot more than others.

Define features

The following table describes the features that are computed to train the model:

Feature name Type Description
monetary FLOAT The sum of monetary values for all transactions for a given customer for a given time window.
frequency INTEGER The number of transactions placed by a given customer for a given time window.
recency INTEGER The time between the first and last transactions that were placed by a given customer for a given time window.
T INTEGER The time between the first transaction in the dataset and the end of the input transactions window.
time_between FLOAT The average time between transactions for a given customer for a given time window.
avg_basket_value FLOAT The average monetary value of the customer's basket for a given time window.
avg_basket_size FLOAT The average number of items that a given customer has in their basket during a given time window.
has_returns STRING Whether a given customer has returned items for at least one transaction during a given time window.
avg_time_to_return FLOAT The time that it takes for a given customer to return their first item for a transaction.
num_returns INTEGER The number of items that the customer has returned that are associated with transactions that are in a given time window.

Set feature computation parameters

The following parameters define the time windows to use for computing features:

  • WINDOW_STEP — Specifies the number of days between thresholds. This determines how often you compute features and labels.
  • WINDOW_STEP_INITIAL — Specifies the number of days between the first order date in your training data and the first threshold.
  • WINDOW_LENGTH — Specifies the number of days back to use for input transactions. The default is to use a value of 0, which takes all transactions before the given threshold.
  • LENGTH_FUTURE — Specifies the number of days in the future to predict the monetary value that serves as the ML label. At each threshold date, BigQuery calculates the label value from the orders that happen within LENGTH_FUTURE days after that threshold date.

For example, assuming a dataset with an earliest transaction date of 1/1/2015 and the following parameter values:

  • WINDOW_STEP: 30
  • WINDOW_STEP_INITIAL: 90
  • WINDOW_LENGTH: 0
  • LENGTH_FUTURE: 30

The time windows for the first few loops of the feature creation procedure would work as follows:

First loop

  • Threshold date: 4/1/2015 (the earliest transaction date + the WINDOW_STEP_INITIAL value, so 1/1/2015 + 90 days).
  • Window start: 1/1/2015 (WINDOW_LENGTH is 0, so all transactions prior to the threshold are used; the earliest transaction is on 1/1/2015).
  • Input transaction window: 1/1/2015 - 4/1/2015 (from the window start date until the threshold date, inclusive).
  • Target transaction window: 4/2/2015 - 5/1/2015 (from the day after the threshold date until the threshold date + the number of days specified by LENGTH_FUTURE, so 4/1/2015 + 30 days).

Second loop

  • Threshold date: 5/1/2015 (the threshold date from the previous loop + the WINDOW_STEP value, so 4/1/2015 + 30 days).
  • Window start: 1/1/2015 (WINDOW_LENGTH is 0, so all transactions prior to the threshold are used; the earliest transaction is on 1/1/2015).
  • Input transaction window: 1/1/2015 - 5/1/2015 (from the window start date until the threshold date, inclusive).
  • Target transaction window: 5/2/2015 - 5/31/2015 (from the day after the threshold date until the threshold date + the number of days specified by LENGTH_FUTURE, so 5/1/2015 + 30 days).

If you were to modify WINDOW_LENGTH to have a value of 15, the time windows would instead work as follows:

First loop

  • Threshold date: 4/1/2015 (the earliest transaction date + the WINDOW_STEP_INITIAL value, so 1/1/2015 + 90 days).
  • Window start: 3/17/2015 (the threshold date - the WINDOW_LENGTH value, so 4/1/2015 - 15 days).
  • Input transaction window: 3/17/2015 - 4/1/2015 (from the window start date until the threshold date, inclusive).
  • Target transaction window: 4/2/2015 - 5/1/2015 (from the day after the threshold date until the threshold date + the number of days specified by LENGTH_FUTURE, so 4/1/2015 + 30 days).

Second loop

  • Threshold date: 5/1/2015 (the threshold date from the previous loop + the WINDOW_STEP value, so 4/1/2015 + 30 days).
  • Window start: 4/16/2015 (the threshold date - the WINDOW_LENGTH value, so 5/1/2015 - 15 days).
  • Input transaction window: 4/16/2015 - 5/1/2015 (from the window start date until the threshold date, inclusive).
  • Target transaction window: 5/2/2015 - 5/31/2015 (from the day after the threshold date until the threshold date + the number of days specified by LENGTH_FUTURE, so 5/1/2015 + 30 days).
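
If it helps to see this arithmetic in code, the following is a minimal Python sketch that prints the windows for each loop, using the parameter values from the walk-through above and an assumed transaction date range of 1/1/2015 through 12/31/2015:

from datetime import date, timedelta

# Parameter values from the walk-through above.
WINDOW_STEP = 30
WINDOW_STEP_INITIAL = 90
WINDOW_LENGTH = 0   # 0 means use all transactions before the threshold.
LENGTH_FUTURE = 30

min_date = date(2015, 1, 1)    # assumed earliest transaction date
max_date = date(2015, 12, 31)  # assumed latest transaction date

threshold = min_date + timedelta(days=WINDOW_STEP_INITIAL)
step = 1
while threshold < max_date - timedelta(days=WINDOW_STEP):
    window_start = (threshold - timedelta(days=WINDOW_LENGTH)
                    if WINDOW_LENGTH else min_date)
    target_end = threshold + timedelta(days=LENGTH_FUTURE)
    print(f"Loop {step}: input {window_start} - {threshold}, "
          f"target {threshold + timedelta(days=1)} - {target_end}")
    threshold += timedelta(days=WINDOW_STEP)
    step += 1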

Set the parameters to determine feature computation windows by copying the following code into the next empty cell of the notebook and then running it:

CLV_PARAMS = {
    'WINDOW_LENGTH': 0,
    'WINDOW_STEP': 30,
    'WINDOW_STEP_INITIAL': 90,
    'LENGTH_FUTURE': 30
}

In a production use case, you would alter these parameters to get results that best suit the data you are working with. For example, if your customers buy multiple times a week, you could compute features using weekly windows. Or if you have a lot of data, you could create more windows by decreasing their sizes and possibly reducing the number of days between threshold dates. Experimenting with these parameters, in addition to hyperparameter tuning, can potentially improve the performance of your model.

Create features

  1. Create a table with computed features for training the model by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery --params $CLV_PARAMS --project $PROJECT_ID
    
    -- Date of the first order in the dataset.
    DECLARE MIN_DATE DATE;
    -- Date of the final order in the dataset.
    DECLARE MAX_DATE DATE;
    -- Date that separates input orders from target transactions.
    DECLARE THRESHOLD_DATE DATE;
    -- How many days back for input transactions. 0 means from the start.
    DECLARE WINDOW_LENGTH INT64 DEFAULT @WINDOW_LENGTH;
    -- Date at which an input transactions window starts.
    DECLARE WINDOW_START DATE;
    -- How many days between thresholds.
    DECLARE WINDOW_STEP INT64 DEFAULT @WINDOW_STEP;
    -- How many days for the first window.
    DECLARE WINDOW_STEP_INITIAL INT64 DEFAULT @WINDOW_STEP_INITIAL;
    -- Index of the window being run.
    DECLARE STEP INT64 DEFAULT 1;
    -- How many days to predict for.
    DECLARE LENGTH_FUTURE INT64 DEFAULT @LENGTH_FUTURE;
    
    SET (MIN_DATE, MAX_DATE) = (
      SELECT AS STRUCT
        MIN(order_day) AS min_days,
        MAX(order_day) AS max_days
      FROM
        `clv.aggregation`
    );
    
    SET THRESHOLD_DATE = MIN_DATE;
    
    -- For more information about the features of this table,
    -- see https://github.com/CamDavidsonPilon/lifetimes/blob/master/lifetimes/utils.py#L246
    -- and https://cloud.google.com/solutions/machine-learning/clv-prediction-with-offline-training-train#aggregating_data
    CREATE OR REPLACE TABLE clv.features
    (
      customer_id STRING,
      monetary FLOAT64,
      frequency INT64,
      recency INT64,
      T INT64,
      time_between FLOAT64,
      avg_basket_value FLOAT64,
      avg_basket_size FLOAT64,
      has_returns STRING,
      avg_time_to_return FLOAT64,
      num_returns INT64,
      -- threshold DATE,
      -- step INT64,
      target_monetary FLOAT64,
    );
    
    -- Using a BigQuery scripting loop
    -- (https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#loop)
    -- lets you enhance the logic of your query without the need of another
    -- programming language or client code.
    
    LOOP
      -- Can choose a longer original window in case
      -- there were not many orders in the early days.
      IF STEP = 1 THEN
        SET THRESHOLD_DATE = DATE_ADD(THRESHOLD_DATE, INTERVAL WINDOW_STEP_INITIAL DAY);
      ELSE
        SET THRESHOLD_DATE = DATE_ADD(THRESHOLD_DATE, INTERVAL WINDOW_STEP DAY);
      END IF;
      SET STEP = STEP + 1;
    
      IF THRESHOLD_DATE >= DATE_SUB(MAX_DATE, INTERVAL (WINDOW_STEP) DAY) THEN
        LEAVE;
      END IF;
    
      -- Takes all transactions before the threshold date unless you decide
      -- to use a different window length to test model performance.
      IF WINDOW_LENGTH != 0 THEN
        SET WINDOW_START = DATE_SUB(THRESHOLD_DATE, INTERVAL WINDOW_LENGTH DAY);
      ELSE
        SET WINDOW_START = MIN_DATE;
      END IF;

      INSERT clv.features
      SELECT
        CAST(tf.customer_id AS STRING),
        ROUND(tf.monetary_orders, 2) AS monetary,
        tf.cnt_orders AS frequency,
        tf.recency,
        tf.T,
        ROUND(tf.recency/cnt_orders, 2) AS time_between,
        ROUND(tf.avg_basket_value, 2) AS avg_basket_value,
        ROUND(tf.avg_basket_size, 2) AS avg_basket_size,
        has_returns,
        CEIL(avg_time_to_return) AS avg_time_to_return,
        num_returns,
        ROUND(tt.target_monetary, 2) AS target_monetary,
      FROM (
          -- This SELECT uses only data before THRESHOLD_DATE to make features.
          SELECT
            customer_id,
            SUM(value) AS monetary_orders,
            DATE_DIFF(MAX(order_day), MIN(order_day), DAY) AS recency,
            DATE_DIFF(THRESHOLD_DATE, MIN(order_day), DAY) AS T,
            COUNT(DISTINCT order_day) AS cnt_orders,
            AVG(qty_articles) avg_basket_size,
            AVG(value) avg_basket_value,
            CASE
              WHEN SUM(num_returns) > 0 THEN 'y'
              ELSE 'n'
            END AS has_returns,
            AVG(time_to_return) avg_time_to_return,
            THRESHOLD_DATE AS threshold,
            SUM(num_returns) num_returns,
          FROM
            `clv.aggregation`
          WHERE
            order_day <= THRESHOLD_DATE AND
            order_day >= WINDOW_START
          GROUP BY
            customer_id
        ) tf
          INNER JOIN (
        -- This SELECT uses all data after threshold as target.
        SELECT
          customer_id,
          SUM(value) target_monetary
        FROM
          `clv.aggregation`
        WHERE
          order_day <= DATE_ADD(THRESHOLD_DATE, INTERVAL LENGTH_FUTURE DAY)
          -- For the sample data, the overall value is similar to the value
          -- after threshold, and prediction performs better using the overall
          -- value. When using your own data, try uncommenting the following
          -- AND clause and see which version of the procedure gives you better
          -- results.
          -- AND order_day > THRESHOLD_DATE
        GROUP BY
          customer_id) tt
      ON
          tf.customer_id = tt.customer_id;
      END LOOP;
    
  2. See a sample of the data for one customer in the resulting clv.features table by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery
    
    SELECT * FROM `clv.features` WHERE customer_id = "10"
    UNION ALL
    (SELECT * FROM `clv.features` LIMIT 5)
    ORDER BY customer_id, frequency, T LIMIT 5
    
    

    You should see results similar to the following:

    First 5 rows of feature data.

Train the model

This tutorial uses the AUTOML_REGRESSOR in BigQuery ML to create, train, and deploy an AutoML regression model.

The tutorial uses the default settings to train the model; in a production use case, you could try additional feature engineering techniques and different training data splits in order to improve the model.

Create and populate the clv_model model by copying the following code into the next empty cell of the notebook and then running it:

%%bigquery

CREATE OR REPLACE MODEL `clv.clv_model`
       OPTIONS(MODEL_TYPE="AUTOML_REGRESSOR",
               INPUT_LABEL_COLS=["target_monetary"],
               OPTIMIZATION_OBJECTIVE="MINIMIZE_MAE")
AS SELECT
  * EXCEPT(customer_id)
FROM
  `clv.features`

This should give you a good base model to work from. To adapt the model to your own data and use case, you can optimize it using hyperparameter tuning, feature engineering, and other techniques.
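
For example, before deciding what to optimize, you might inspect the model's evaluation metrics. The following is a minimal sketch using the BigQuery ML ML.EVALUATE function in a %%bigquery cell:

%%bigquery

SELECT
  *
FROM
  ML.EVALUATE(MODEL `clv.clv_model`)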

Predict CLV

Your next step is to get CLV predictions from the model for each customer and then write this data to a table. Prediction records contain the fields described in the following table:

Field name Type Description
customer_id STRING Customer ID
monetary_so_far FLOAT Total amount of money spent by the customer before the prediction date.
monetary_predicted FLOAT Predicted total amount of money spent by the customer, which is the sum of the monetary_so_far and the monetary_future field values.
monetary_future FLOAT Predicted future amount of money that the customer will spend between the prediction date and the end of the prediction period.

  1. Create and populate the clv.predictions table by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery --params $CLV_PARAMS --project $PROJECT_ID
    
    -- How many days back for input transactions. 0 means from the start.
    DECLARE WINDOW_LENGTH INT64 DEFAULT @WINDOW_LENGTH;
    -- Date at which an input transactions window starts.
    DECLARE WINDOW_START DATE;
    
    -- Date of the first transaction in the dataset.
    DECLARE MIN_DATE DATE;
    -- Date of the final transaction in the dataset.
    DECLARE MAX_DATE DATE;
    -- Date from which you want to predict.
    DECLARE PREDICT_FROM_DATE DATE;
    
    SET (MIN_DATE, MAX_DATE) = (
      SELECT AS STRUCT
        MIN(order_day) AS min_days,
        MAX(order_day) AS max_days
      FROM
        `clv.aggregation`
    );
    
    -- You can set any date here. In production, it is generally today.
    SET PREDICT_FROM_DATE = MAX_DATE;
    IF WINDOW_LENGTH != 0 THEN
      SET WINDOW_START = DATE_SUB(PREDICT_FROM_DATE, INTERVAL WINDOW_LENGTH DAY);
    ELSE
      SET WINDOW_START = MIN_DATE;
    END IF;
    
    CREATE OR REPLACE TABLE `clv.predictions`
    AS (
    SELECT
      customer_id,
      monetary AS monetary_so_far,
      ROUND(predicted_target_monetary, 2) AS monetary_predicted,
      ROUND(predicted_target_monetary - monetary, 2) AS monetary_future
    FROM
      ML.PREDICT(
        -- To use your own model, set the model name here.
        MODEL clv.clv_model,
        (
          SELECT
            customer_id,
            ROUND(monetary_orders, 2) AS monetary,
            cnt_orders AS frequency,
            recency,
            T,
            ROUND(recency/cnt_orders, 2) AS time_between,
            ROUND(avg_basket_value, 2) AS avg_basket_value,
            ROUND(avg_basket_size, 2) AS avg_basket_size,
            has_returns,
            CEIL(avg_time_to_return) AS avg_time_to_return,
            num_returns
          FROM (
            SELECT
              customer_id,
              SUM(value) AS monetary_orders,
              DATE_DIFF(MAX(order_day), MIN(order_day), DAY) AS recency,
              DATE_DIFF(PREDICT_FROM_DATE, MIN(order_day), DAY) AS T,
              COUNT(DISTINCT order_day) AS cnt_orders,
              AVG(qty_articles) avg_basket_size,
              AVG(value) avg_basket_value,
              CASE
                WHEN SUM(num_returns) > 0 THEN 'y'
                ELSE 'n'
              END AS has_returns,
              AVG(time_to_return) avg_time_to_return,
              SUM(num_returns) num_returns,
            FROM
              `clv.aggregation`
            WHERE
              order_day <= PREDICT_FROM_DATE AND
              order_day >= WINDOW_START
            GROUP BY
              customer_id
          )
        )
      )
    )
    
  2. See a sample of the data in the resulting clv.predictions table by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery
    
    SELECT * FROM clv.predictions
    ORDER BY customer_id
    LIMIT 5;
    

    You should see results similar to the following:

    First 5 rows of prediction data.

Evaluate prediction data

Visualize and generate statistics on the prediction data to better understand data distribution and see if there are any clear trends.

  1. Create a dataframe based on the clv.predictions table by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery df_predictions --project $PROJECT_ID
    
    SELECT * FROM clv.predictions
    
  2. Use pandas.DataFrame.describe to generate descriptive statistics on the prediction data by copying the following code into the next empty cell of the notebook and then running it:

    df_predictions.describe()
    

    You should see results similar to the following:

    Statistics for prediction data.

  3. Use matplotlib.gridspec to visualize the prediction data by copying the following code into the next empty cell of the notebook and then running it:

    from matplotlib.gridspec import GridSpec
    
    fig = plt.figure(constrained_layout=True, figsize=(15, 5))
    gs = GridSpec(2, 2, figure=fig)
    
    sns.set(font_scale = 1)
    plt.tick_params(axis='x', labelsize=14)
    
    ax0 = plt.subplot(gs.new_subplotspec((0, 0), colspan=1))
    ax1 = plt.subplot(gs.new_subplotspec((0, 1), colspan=1))
    ax2 = plt.subplot(gs.new_subplotspec((1, 0), colspan=2))
    
    sns.violinplot(df_predictions['monetary_so_far'], ax=ax0, label='monetary_so_far')
    sns.violinplot(df_predictions['monetary_predicted'], ax=ax1, label='monetary_predicted')
    sns.violinplot(df_predictions['monetary_future'], ax=ax2, label='monetary_future')
    

    You should see results similar to the following:

    Visualization of prediction data distribution.

    The monetary distribution analysis shows small monetary amounts for the next month compared to the overall historical value. One reason is that the model is trained to predict the value for the next month, as specified in the LENGTH_FUTURE parameter. You can experiment with changing that value to train and predict for the next quarter (this would be LENGTH_FUTURE = 90) and see how the distribution changes.
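
    For example, here is a minimal sketch of that parameter change; after changing it, rerun the feature creation, model training, and prediction cells:

    CLV_PARAMS = {
        'WINDOW_LENGTH': 0,
        'WINDOW_STEP': 30,
        'WINDOW_STEP_INITIAL': 90,
        'LENGTH_FUTURE': 90  # Predict the next quarter instead of the next month.
    }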

Create a similar audience

Use the procedures in this section to create and export a list of customers with high CLV and then use that list to create a similar audience in Google Ads.

The code examples in the sections that follow Identify top customers by CLV are only runnable if you have access to a working Google Ads environment. Complete the items in Requirements for Google Ads integration to make this example code runnable with your own Google Ads environment.

Identify top customers by CLV

As the first step to create a similar audience, you need to identify your top customers based on their predicted CLV, then associate email addresses with them. You do this by using the customer ID to join the predictions table with a customer data table that contains this information.

This tutorial uses the top 20% of customers, based on CLV. You can change the percentage of customers to select by modifying the TOP_CLV_RATIO parameter. The SQL statement that selects the customers uses the PERCENT_RANK function to identify customers whose predicted future spending places them within the top percentage of customers specified by the TOP_CLV_RATIO value.

  1. Set the TOP_CLV_RATIO parameter by copying the following code into the next empty cell of the notebook and then running it:

    CLV_PARAMS = {
        'TOP_CLV_RATIO': 0.2
    }
    
  2. Create and populate the df_top_ltv dataframe by copying the following code into the next empty cell of the notebook and then running it:

    %%bigquery df_top_ltv --params $CLV_PARAMS --project $PROJECT_ID
    
    DECLARE TOP_CLV_RATIO FLOAT64 DEFAULT @TOP_CLV_RATIO;
    
    SELECT
      p.customer_id,
      monetary_future,
      c.email AS email
    FROM (
      SELECT
        customer_id,
        monetary_future,
        PERCENT_RANK() OVER (ORDER BY monetary_future DESC) AS percent_rank_monetary
      FROM
        `clv.predictions` ) p
    INNER JOIN (
      SELECT
        customer_id,
        email
      FROM
        `clv.customers` ) c
    ON
      p.customer_id = c.customer_id
    WHERE
      -- Decides the size of your list of emails. For similar-audience use cases
      -- where you need to find a minimum of matching emails, 20% should provide
      -- enough potential emails.
      percent_rank_monetary <= TOP_CLV_RATIO
    ORDER BY monetary_future DESC
    
  3. See a sample of the df_top_ltv data by copying the following code into the next empty cell of the notebook and then running it:

    df_top_ltv.head(5)
    

    You should see results similar to the following:

    First 5 rows of top customers by CLV data.

Create the Google Ads configuration file

Create the configuration YAML file for the Google Ads client.

  1. Set the YAML file variables that control access to the Google Ads API by replacing the placeholder variables below with appropriate values for your environment. See Requirements for Google Ads integration for information on how to get the required tokens and OAuth credentials if you don't already have them.

    Copy the following code into the next empty cell of the notebook and then run it:

    DEVELOPER_TOKEN = "myDeveloperToken"
    OAUTH_2_CLIENT_ID = "myClientId"
    CLIENT_SECRET = "myClientSecret"
    REFRESH_TOKEN = "myRefreshToken"
    
  2. Create the YAML file content by copying the following code into the next empty cell of the notebook and then running it.

    ADWORDS_FILE = "/tmp/adwords.yaml"
    
    adwords_content = f"""
    # AdWordsClient configurations
    adwords:
      #############################################################################
      # Required Fields                                                           #
      #############################################################################
      developer_token: {DEVELOPER_TOKEN}
      #############################################################################
      # Optional Fields                                                           #
      #############################################################################
      # client_customer_id: INSERT_CLIENT_CUSTOMER_ID_HERE
      # user_agent: INSERT_USER_AGENT_HERE
      # partial_failure: True
      # validate_only: True
      #############################################################################
      # OAuth2 Configuration                                                      #
      # Below you may provide credentials for either the installed application or #
      # service account flows. Remove or comment the lines for the flow you're    #
      # not using.                                                                #
      #############################################################################
      # The following values configure the client for the installed application
      # flow.
      client_id: {OAUTH_2_CLIENT_ID}
      client_secret: {CLIENT_SECRET}
      refresh_token: {REFRESH_TOKEN}
      # The following values configure the client for the service account flow.
      # path_to_private_key_file: INSERT_PATH_TO_JSON_KEY_FILE_HERE
      # delegated_account: INSERT_DOMAIN_WIDE_DELEGATION_ACCOUNT
      #############################################################################
      # ReportDownloader Headers                                                  #
      # Below you may specify boolean values for optional headers that will be    #
      # applied to all requests made by the ReportDownloader utility by default.  #
      #############################################################################
        # report_downloader_headers:
        # skip_report_header: False
        # skip_column_header: False
        # skip_report_summary: False
        # use_raw_enum_values: False
    
    # AdManagerClient configurations
    ad_manager:
      #############################################################################
      # Required Fields                                                           #
      #############################################################################
      application_name: INSERT_APPLICATION_NAME_HERE
      #############################################################################
      # Optional Fields                                                           #
      #############################################################################
      # The network_code is required for all services except NetworkService:
      # network_code: INSERT_NETWORK_CODE_HERE
      # delegated_account: INSERT_DOMAIN_WIDE_DELEGATION_ACCOUNT
      #############################################################################
      # OAuth2 Configuration                                                      #
      # Below you may provide credentials for either the installed application or #
      # service account (recommended) flows. Remove or comment the lines for the  #
      # flow you're not using.                                                    #
      #############################################################################
      # The following values configure the client for the service account flow.
      path_to_private_key_file: INSERT_PATH_TO_JSON_KEY_FILE_HERE
      # delegated_account: INSERT_DOMAIN_WIDE_DELEGATION_ACCOUNT
      # The following values configure the client for the installed application
      # flow.
      # client_id: INSERT_OAUTH_2_CLIENT_ID_HERE
      # client_secret: INSERT_CLIENT_SECRET_HERE
      # refresh_token: INSERT_REFRESH_TOKEN_HERE
    
    # Common configurations:
    ###############################################################################
    # Compression (optional)                                                      #
    # Below you may specify whether to accept and automatically decompress gzip   #
    # encoded SOAP requests. By default, gzip compression is not enabled.         #
    ###############################################################################
    # enable_compression: False
    ###############################################################################
    # Logging configuration (optional)                                            #
    # Below you may specify the logging configuration. This will be provided as   #
    # an input to logging.config.dictConfig.                                      #
    ###############################################################################
    # logging:
      # version: 1
      # disable_existing_loggers: False
      # formatters:
        # default_fmt:
          # format: ext://googleads.util.LOGGER_FORMAT
      # handlers:
        # default_handler:
          # class: logging.StreamHandler
          # formatter: default_fmt
          # level: INFO
      # loggers:
        # Configure root logger
        # "":
          # handlers: [default_handler]
          # level: INFO
    ###############################################################################
    # Proxy configurations (optional)                                             #
    # Below you may specify an HTTP or HTTPS Proxy to be used when making API     #
    # requests. Note: You must specify the scheme used for the proxy endpoint.    #
    #                                                                             #
    # For additional information on configuring these values, see:                #
    # http://docs.python-requests.org/en/master/user/advanced/#proxies            #
    ###############################################################################
    # proxy_config:
      # http: INSERT_HTTP_PROXY_URI_HERE
      # https: INSERT_HTTPS_PROXY_URI_HERE
      # If specified, the given cafile will only be used if certificate validation
      # is not disabled.
      # cafile: INSERT_PATH_HERE
      # disable_certificate_validation: False
    ################################################################################
    # Utilities Included (optional)                                                #
    # Below you may specify whether the library will include utilities used in the #
    # user agent. By default, the library will include utilities used in the user  #
    # agent.                                                                       #
    ################################################################################
    # include_utilities_in_user_agent: True
    ################################################################################
    # Custom HTTP headers (optional)                                               #
    # Specify one or more custom headers to pass along with all requests to        #
    # the API.                                                                     #
    ################################################################################
    # custom_http_headers:
    #   X-My-Header: 'content'
    """
    
  3. Populate the YAML file by copying the following code into the next empty cell of the notebook and then running it.

    with open(ADWORDS_FILE, "w") as adwords_file:
        print(adwords_content, file=adwords_file)
    

Create a Google Ads remarketing list

Use the emails of the top CLV customers to create a Google Ads remarketing list.

  1. Create a list containing the emails of the top CLV customers by copying the following code into the next empty cell of the notebook and then running it:

    ltv_emails = list(set(df_top_ltv['email']))
    
  2. Create the remarketing list by copying the following code into the next empty cell of the notebook and then running it:

    """Adds a user list and populates it with hashed email addresses.
    
    Note: It may take several hours for the list to be populated with members. Email
    addresses must be associated with a Google account. For privacy purposes, the
    user list size will show as zero until the list has at least 1000 members. After
    that, the size will be rounded to the two most significant digits.
    """
    
    import hashlib
    import uuid
    
    # Import appropriate modules from the client library.
    from googleads import adwords
    
    def main(client, emails):
      # Initialize appropriate services.
      user_list_service = client.GetService('AdwordsUserListService', 'v201809')
    
      user_list = {
          'xsi_type': 'CrmBasedUserList',
          'name': 'Customer relationship management list #%d' % uuid.uuid4(),
          'description': 'A list of customers that originated from email addresses',
          # CRM-based user lists can use a membershipLifeSpan of 10000 to indicate
          # unlimited; otherwise normal values apply.
          'membershipLifeSpan': 30,
          'uploadKeyType': 'CONTACT_INFO'
      }
    
      # Create an operation to add the user list.
      operations = [{
          'operator': 'ADD',
          'operand': user_list
      }]
    
      result = user_list_service.mutate(operations)
      user_list_id = result['value'][0]['id']
    
      # Hash the email addresses of the top CLV customers (passed in as emails).
      members = [{'hashedEmail': NormalizeAndSHA256(email)} for email in emails]
    
      # Optionally, you could also append members that include hashed address
      # info (hashed first and last name, plus plaintext country and zip codes)
      # to improve match rates.
    
      mutate_members_operation = {
          'operand': {
              'userListId': user_list_id,
              'membersList': members
          },
          'operator': 'ADD'
      }
    
      response = user_list_service.mutateMembers([mutate_members_operation])
    
      if 'userLists' in response:
        for user_list in response['userLists']:
          print('User list with name "%s" and ID "%d" was added.'
                % (user_list['name'], user_list['id']))
    
    def NormalizeAndSHA256(s):
      """Normalizes (lowercase, remove whitespace) and hashes a string with SHA-256.
    
      Args:
        s: The string to perform this operation on.
    
      Returns:
        A normalized and SHA-256 hashed string.
      """
      return hashlib.sha256(s.strip().lower().encode('utf-8')).hexdigest()
    
    if __name__ == '__main__':
      # Initialize the client object with the config
      # file you created in the previous section.
      adwords_client = adwords.AdWordsClient.LoadFromStorage(ADWORDS_FILE)
      main(adwords_client, ltv_emails)
    

Add a similar audience

Once you have the remarketing list, follow these instructions in the Google Ads documentation to Add similar audiences to your targeting.

Optional next steps

Use the information in the following sections to optionally improve model performance and automate the ML workflow.

Improve the model

This tutorial has shown you one possible approach to creating a basic model for predicting the future monetary value of your customers. If you decided to investigate this approach further, the following are suggestions for places to improve or experiment:

  • If you have a lot of outliers in your data, do additional data engineering or find additional training data.
  • Try different approaches to creating features by adjusting the time window boundaries, or changing the ratio of input to target transactions.
  • When comparing models, make sure that you use the same set of test data in order to fairly compare the models' performance. For example, this tutorial prepares the data to best predict the orders for the next 30 days. If you want a model to predict for the next quarter, you should update the LENGTH_FUTURE parameter and recompute the features.
  • The example dataset uses a limited number of fields. If you have other dimensions such as product categories, geographic regions, or demographic information in your own data, try to create additional features for your model.
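
As a minimal sketch of the fair-comparison point above, you could hold out a fixed test table and evaluate each candidate model against it. The table clv.features_test and the second model clv.clv_model_v2 below are hypothetical names that you would create yourself:

%%bigquery

SELECT 'clv_model' AS model, *
FROM ML.EVALUATE(MODEL `clv.clv_model`,
  (SELECT * EXCEPT(customer_id) FROM `clv.features_test`))
UNION ALL
SELECT 'clv_model_v2' AS model, *
FROM ML.EVALUATE(MODEL `clv.clv_model_v2`,
  (SELECT * EXCEPT(customer_id) FROM `clv.features_test`))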

Automate the ML workflow

In the previous steps, you learned how to use BigQuery and BigQuery ML to prepare data for machine learning, create and train a model, get predictions, and prepare data for Google Ads integration.

In a production use case, these would all be recurring tasks. You could run these steps manually every time, but we recommend automating the process. This tutorial has companion code in GitHub that provides a set of BigQuery stored procedures for automating these steps, and also a shell script, run.sh, that you can use to run them.

You can use the following parameters when you call the run.sh script to configure it to work in your environment:

Variable Description Default
PROJECT_ID Project ID Required
DATASET_ID Dataset ID Required
TRAIN_MODEL_ID Name of the trained model. Set to null if you do not want to train a model. String based on time.
USE_MODEL_ID Name of the model to use for predictions. Must include dataset: [DATASET].[MODEL] String based on time.
WINDOW_LENGTH Time range in days for input transactions. 0
WINDOW_STEP Time in days between two windows. Equivalent to the time between two threshold dates. 30
WINDOW_STEP_INITIAL Initial time in days before setting the first threshold date. 90
LENGTH_FUTURE Time in days for which to make a prediction. 30
MAX_STDV_MONETARY Standard deviation of the monetary value per customer above which the script removes transactions. 500
MAX_STDV_QTY Standard deviation of the quantity value per customer above which the script removes transactions. 100
TOP_LTV_RATIO Percentage of the top customers to extract. 0.2

Follow these steps to use the run.sh script for automation:

  1. Clone the GitHub repo:

    git clone https://github.com/GoogleCloudPlatform/analytics-componentized-patterns.git
    
  2. Update the run.sh script to set the parameter values to match your environment.

  3. If you use your own transaction data, update 10_procedure_match.sql to specify the appropriate columns from your transactions table.

  4. Adjust permissions on the run.sh script so you can run it:

    chmod +x run.sh
    
  5. Review the command-line help for information on how to set the parameters:

    ./run.sh --help
    
  6. Run the script:

    ./run.sh --project-id [YOUR_PROJECT_ID] --dataset-id [YOUR_DATASET_ID]
    

In a production use case, you could check with your data engineers and your DevOps team to see if they can integrate the stored procedures into their tooling for ease of use.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project containing the resources, or keep the project but delete just those resources.

Either way, you should remove those resources so you won't be billed for them in the future. The following sections describe how to delete these resources.

Delete the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the components

If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.

Delete the BigQuery dataset and tables

  1. Open the BigQuery console
  2. In the Resources section, expand the project in which you are completing this tutorial and select the clv dataset.
  3. Click Delete dataset in the header of the dataset pane.
  4. In the overlay window that appears, type clv and then click Delete.

Delete the user-managed notebooks instance

  1. Open the User-managed notebooks tab
  2. Select the checkbox for the clv notebook instance.
  3. Click Delete.
  4. In the overlay window that appears, click Delete.

What's next