Building a k-means clustering model for market segmentation by using BigQuery ML

Learn how to segment Google Analytics 360 audience data for marketing purposes by creating k-means clusters with BigQuery ML.

Overview

A common marketing analytics challenge you might have is determining themes in your consumer behavior, that you can then use to develop customer profiles or personas. These profiles can help you incorporate additional customer knowledge into your marketing campaigns.

To quickly build customer segments from your data, you can use a technique called clustering. Clustering lets you group customers with similar behavior together to build different audiences to use in marketing. You can personalize your outreach to customers by tailoring ads and other communications to them, based on the preferences and behaviors indicated by the cluster each customer belongs to.

In this tutorial, you learn how to create customer clusters by using a k-means model in BigQuery ML. You train the model on sample customer data that contains transaction and demographic information, and the model returns information about the customer clusters it identifies in that data. You then get predictions from the model that let you assign each customer to their appropriate cluster based on their demographic data and transaction history. After you have assigned each customer to a cluster, you can send the updated customer data back to Google Analytics 360 for use in marketing activation.

This tutorial walks you through creating an AI Platform notebook to complete these steps, providing prescriptive guidance and step-by-step procedures. If you just want a completed notebook to get started with instead, you can use the How to build k-means clustering models for market segmentation using BigQuery ML notebook on the notebook server of your choice.

This tutorial is intended for data engineers, data scientists, and data analysts who build ML datasets and models to support business decisions. It assumes that you have basic knowledge of the following:

  • Machine learning concepts
  • Python
  • Standard SQL

How do clustering algorithms work?

Clustering is a type of unsupervised machine learning. You run an algorithm, k-means clustering in this case, to identify how data is logically grouped together. The algorithm evaluates the training data for similarities and groups it based on that, rather than depending on a human-provided label to group the data. For example, suppose you want to group your customers by age and estimated income. You can use clustering to help you determine how many clusters exist within your data based on these attributes. While clustering might seem simple when you are considering just one or two customer attributes, it becomes increasingly difficult to do manually as you take more customer attributes into consideration.

One challenge with using clustering is determining whether you have the "right" number of clusters. This tutorial shows you how to improve the model after building an initial version, so that you end up with a model that creates an appropriate number of clusters for your use case.

Dataset

This tutorial uses the Google Analytics Sample dataset, which is hosted publicly on BigQuery. This dataset provides 12 months (August 2016 to August 2017) of obfuscated Analytics 360 data from the Google Merchandise Store, a real e-commerce store that sells Google-branded merchandise.

If you have your own Analytics 360 data, you can use that instead if you want to. There is information throughout the tutorial on what to do differently in that case.

Objectives

  • Process customer data into the correct format for creating a k-means clustering model.
  • Create, train, and deploy the k-means model by using BigQuery ML.
  • Iterate on the model to develop a version that creates the optimal number of clusters.
  • Analyze the clusters that the model has produced to understand what they tell you about your customer segments.
  • Get predictions from the deployed model to assign customers to their appropriate clusters.
  • Export updated customer data from BigQuery so you can use it in marketing activation in Analytics 360.

Costs

This tutorial uses the following billable components of Google Cloud:

  • AI Platform
  • BigQuery
  • BigQuery ML

To generate a cost estimate based on your projected usage, use the pricing calculator. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Notebooks and Compute Engine APIs.

    Enable the APIs

When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Cleaning up.

Create a notebook

  1. Open the Notebooks console
  2. Click New Instance.
  3. Choose Python 2 and 3.
  4. For Instance name, type kmeans.
  5. Click Create. It takes a few minutes for the notebook instance to be created.
  6. When the instance is available, click Open JupyterLab.
  7. In the Notebook section of the JupyterLab Launcher, click Python 3.

Install libraries

Install the libraries needed for this tutorial.

  1. Copy the following code into the first cell of the notebook:

    # Install libraries.
    !pip install pandas-gbq
    !pip install google-cloud-bigquery
    !pip install google-cloud-bigquery-storage
    # Needed to setup flex slots for flat-rate pricing
    !pip install google-cloud-bigquery-reservation
    
    # Automatically restart kernel after installs
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
    
  2. Click Run in the menu bar.

Import packages

Install the packages needed for this tutorial by copying the following code into the next empty cell of the notebook and then running it:

from google.cloud import bigquery
import numpy as np
import pandas as pd
import pandas_gbq
import matplotlib.pyplot as plt

# used to display float format
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Create a BigQuery client

Create a BigQuery client for the Python client library for BigQuery by copying the following code into the next empty cell of the notebook and then running it. Replace myProject with the ID of the project you are using to complete this tutorial.

client = bigquery.Client(project="myProject")

Process the data

In this section, you prepare the data you need to train the k-means model. Preparing the data includes aggregating customer transaction data from Analytics 360, generating sample customer demographic data, and then joining these two datasets to create the training dataset.

In this tutorial, you use the Google Analytics Sample data, but in a production use case, you would use your own Analytics 360 data.

Create a BigQuery dataset

Create a BigQuery dataset to contain the customer data and the k-means model by copying the following code into the next empty cell of the notebook and then running it:

! bq mk myProject:kmeans

Aggregate transaction data

In this section, you aggregate customer transaction data from the sample Analytics 360 dataset. The aggregated data gives you a clearer picture of how often customers purchase items, what types of items they purchase, and how much they spend.

In this tutorial, we assume that a purchase occurs when the pagetype = "EVENT". In a production use case, you might want to identify purchases in some other way, depending on your Analytics 360 configuration.

This tutorial also uses the fullVisitorID as the key for the customer data. In a production use case, you should replace instances of fullVisitorID with clientId in this and subsequent queries. This is because you must use clientId as the key field when importing audience data into Analytics 360. clientId is normally a hashed version of fullVisitorId, but this column isn't populated in the Google Analytics Sample dataset. To use clientId in your own Analytics 360 data, you would create a custom dimension and populate it.

  1. Create a view with aggregated customer transaction data by copying the following code into the next empty cell of the notebook and then running it:

    ga360_only_view = 'GA360_View'
    shared_dataset_ref = client.dataset('kmeans')
    ga360_view_ref = shared_dataset_ref.table(ga360_only_view)
    ga360_view = bigquery.Table(ga360_view_ref)
    
    ga360_query = '''
    SELECT
      fullVisitorID,
      # This will be used to generate random data.
      ABS(farm_fingerprint(fullVisitorID)) AS Hashed_fullVisitorID,
      # You can aggregate this because an
      # operating system is tied to a fullVisitorID.
      MAX(device.operatingSystem) AS OS,
      SUM (CASE
           WHEN REGEXP_EXTRACT (v2ProductCategory,
                               r'^(?:(?:.*?)Home/)(.*?)/')
                               = 'Apparel' THEN 1 ELSE 0 END) AS Apparel,
      SUM (CASE
           WHEN REGEXP_EXTRACT (v2ProductCategory,
                               r'^(?:(?:.*?)Home/)(.*?)/')
                               = 'Office' THEN 1 ELSE 0 END) AS Office,
      SUM (CASE
           WHEN REGEXP_EXTRACT (v2ProductCategory,
                               r'^(?:(?:.*?)Home/)(.*?)/')
                               = 'Electronics' THEN 1 ELSE 0 END) AS Electronics,
      SUM (CASE
           WHEN REGEXP_EXTRACT (v2ProductCategory,
                               r'^(?:(?:.*?)Home/)(.*?)/')
                               = 'Limited Supply' THEN 1 ELSE 0 END) AS LimitedSupply,
      SUM (CASE
           WHEN REGEXP_EXTRACT (v2ProductCategory,
                               r'^(?:(?:.*?)Home/)(.*?)/')
                               = 'Accessories' THEN 1 ELSE 0 END) AS Accessories,
      SUM (CASE
           WHEN REGEXP_EXTRACT (v2ProductCategory,
                               r'^(?:(?:.*?)Home/)(.*?)/')
                               = 'Shop by Brand' THEN 1 ELSE 0 END) AS ShopByBrand,
      SUM (CASE
           WHEN REGEXP_EXTRACT (v2ProductCategory,
                               r'^(?:(?:.*?)Home/)(.*?)/')
                               = 'Bags' THEN 1 ELSE 0 END) AS Bags,
      ROUND (SUM (productPrice/1000000),2) AS totalSpent_USD
      FROM
      `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
      UNNEST(hits) AS hits,
      UNNEST(hits.product) AS hits_product
    WHERE
      _TABLE_SUFFIX BETWEEN '20160801'
      AND '20160831'
      AND geoNetwork.country = 'United States'
      AND type = 'EVENT'
    GROUP BY
      1,
      2
    '''
    
    ga360_view.view_query = ga360_query.format('myProject')
    ga360_view = client.create_table(ga360_view)
    
    
  2. See a sample of the aggregated customer transaction data by copying the following code into the next empty cell of the notebook and then running it:

    # Show a sample of GA360 data
    
    ga360_query_df = f'''
    SELECT * FROM {ga360_view.full_table_id.replace(":", ".")}
    ORDER BY fullVisitorID
    LIMIT 5
    '''
    
    job_config = bigquery.QueryJobConfig()
    
    # Start the query
    query_job = client.query(ga360_query_df, job_config=job_config) #API Request
    df_ga360 = query_job.result()
    df_ga360 = df_ga360.to_dataframe()
    
    df_ga360
    

    You should see results similar to the following:

    First 5 rows of aggregated transaction data.

Generate demographic data

Use the following procedure to generate synthetic customer demographic data.

In a production use case, you would want to use your own data from a customer relationship management (CRM) system. You would skip this step and proceed to the next, which is joining the CRM data with the aggregated customer transaction data.

  1. Create a view with generated customer data by copying the following code into the next empty cell of the notebook and then running it:

    CRM_only_view = 'CRM_View'
    shared_dataset_ref = client.dataset('kmeans')
    CRM_view_ref = shared_dataset_ref.table(CRM_only_view)
    CRM_view = bigquery.Table(CRM_view_ref)
    
    # The query below works by hashing the fullVisitorID, which creates a
    # random distribution. It uses modulo to artificially split gender and
    # household income distribution.
    CRM_query = '''
    SELECT
      fullVisitorID,
    IF
      (MOD(Hashed_fullVisitorID,2) = 0,
        "M",
        "F") AS gender,
      # Generate household income
      CASE
        WHEN MOD(Hashed_fullVisitorID,10) = 0 THEN 55000
        WHEN MOD(Hashed_fullVisitorID,10) < 3 THEN 65000
        WHEN MOD(Hashed_fullVisitorID,10) < 7 THEN 75000
        WHEN MOD(Hashed_fullVisitorID,10) < 9 THEN 85000
        WHEN MOD(Hashed_fullVisitorID,10) = 9 THEN 95000
      ELSE
      Hashed_fullVisitorID
    END
      AS hhi
    FROM (
      SELECT
        fullVisitorID,
        ABS(farm_fingerprint(fullVisitorID)) AS Hashed_fullVisitorID,
      FROM
        `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
        UNNEST(hits) AS hits,
        UNNEST(hits.product) AS hits_product
          WHERE
        _TABLE_SUFFIX BETWEEN '20160801'
        AND '20160831'
        AND geoNetwork.country = 'United States'
        AND type = 'EVENT'
      GROUP BY
        1,
        2)
    '''
    
    CRM_view.view_query = CRM_query.format('myProject')
    CRM_view = client.create_table(CRM_view)
    
  2. See a sample of the generated customer data by copying the following code into the next empty cell of the notebook and then running it:

    # See an output of the synthetic CRM data
    
    CRM_query_df = f'''
    SELECT * FROM {CRM_view.full_table_id.replace(":", ".")}
    ORDER BY fullVisitorID
    LIMIT 5
    '''
    
    job_config = bigquery.QueryJobConfig()
    
    # Start the query
    query_job = client.query(CRM_query_df, job_config=job_config)
    df_CRM = query_job.result()
    df_CRM = df_CRM.to_dataframe()
    
    df_CRM
    

    You should see results similar to the following:

    First 5 rows of generated customer data.

Create the training dataset

Use the following procedure to create the training dataset by joining the aggregated customer transaction data and the generated customer demographic data.

  1. Create a view with joined customer and transaction data by copying the following code into the next empty cell of the notebook and then running it:

    # Build a final view, which joins GA360 data with CRM data
    
    final_data_view = 'Final_View'
    shared_dataset_ref = client.dataset('kmeans')
    final_view_ref = shared_dataset_ref.table(final_data_view)
    final_view = bigquery.Table(final_view_ref)
    
    final_data_query = f'''
    SELECT
        g.*,
        c.* EXCEPT(fullVisitorId)
    FROM {ga360_view.full_table_id.replace(":", ".")} g
    JOIN {CRM_view.full_table_id.replace(":", ".")} c
    ON g.fullVisitorId = c.fullVisitorId
    '''
    
    final_view.view_query = final_data_query.format('myProject')
    final_view = client.create_table(final_view)
    
    print(f"Successfully created view at {final_view.full_table_id}")
    
  2. See a sample of the joined data by copying the following code into the next empty cell of the notebook and then running it:

    # Show final data used before modeling
    
    sql_demo = f'''
    SELECT * FROM {final_view.full_table_id.replace(":", ".")}
    ORDER BY fullVisitorID
    LIMIT 5
    '''
    
    job_config = bigquery.QueryJobConfig()
    
    # Start the query
    query_job = client.query(sql_demo, job_config=job_config)
    df_demo = query_job.result()
    df_demo = df_demo.to_dataframe()
    
    df_demo
    

    You should see results similar to the following:

    First 5 rows of joined customer and transaction data.

Create, train, and deploy an initial model

In this section, you create a Python function that builds the k-means model. Creating a function rather than just using SQL allows you to define multiple models and let BigQuery build them in parallel. You will use this function again in the Improve the model section to create several versions of the model for comparison.

Note that you don't use fullVisitorId as an input to the model, because you don't need that data to create a feature for clustering. In general, unique or uncommon data isn't useful for that purpose.

Typically, you would also need to normalize numerical features, but you can skip that step here because BigQuery ML takes care of that automatically.

  1. Define the makeModel function by copying the following code into the next empty cell of the notebook and then running it:

    PROJECT_ID = "myProject"
    
    def makeModel (n_Clusters, Model_Name):
        sql =f'''
        CREATE OR REPLACE MODEL `{PROJECT_ID}.kmeans.{Model_Name}`
        OPTIONS(model_type='kmeans',
        kmeans_init_method = 'KMEANS++',
        num_clusters={n_Clusters}) AS
        SELECT * except(fullVisitorID, Hashed_fullVisitorID) FROM `{final_view.full_table_id.replace(":", ".")}`
        '''
        job_config = bigquery.QueryJobConfig()
        client.query(sql, job_config=job_config)
    
  2. Test the function by creating the test model that segments the data into three clusters. Note that the model takes approximately 5 minutes to build.

    Copy the following code into the next empty cell of the notebook and then run it:

    model_test_name = "test"
    makeModel(3, model_test_name)
    
  3. Confirm that the model has been created by copying the following code into the next empty cell of the notebook and then running it:

    for model in client.list_models('kmeans'):
      print(model)
    

    You should see results similar to the following:

    Model(reference=ModelReference(project_id='myProject', dataset_id='kmeans', model_id='test'))
    
  4. Delete the test model by copying the following code into the next empty cell of the notebook and then running it:

    model_id = "kmeans."+model_test_name
    client.delete_model(model_id)
    print(f"Deleted model '{model_id}'")
    

Improve the model

This section shows you how to improve the model by tuning it to provide the optimal number of clusters for your use case.

Determining the right number of clusters — this is the "k" in "k-means" — depends on your use case. It is sometimes easy to determine how many clusters you need. For example, if you are processing images of handwritten single digits, you need 10 clusters, one each for digits 0-9. For use cases that are less obvious, you can experiment with multiple versions of the model that use different numbers of clusters. Comparing different versions of the model lets you see which performs best in terms of grouping your data while minimizing error within each cluster.

You can determine which model version provides the best groupings by using the elbow method, which is a way of charting loss against the number of clusters in the model, in combination with the model's Davies–Bouldin score. This score indicates how different a cluster's data points are from one another. It is defined as the ratio between the scatter of the data points within a given cluster and the scatter of the clusters themselves. A lower value inidcates better clustering.

Create multiple models for comparison

Generate multiple versions of the k-means model, from one that creates three clusters up to one that creates fifteen clusters.

  1. Generate multiple versions of the k-means model. Note that the models take approximately 7 minutes to build.

    Copy the following code into the next empty cell of the notebook and then run it:

    low_k = 3
    high_k = 15
    model_prefix_name = 'kmeans_clusters_'
    
    lst = list(range (low_k, high_k+1)) #build list to iterate through k values
    
    for k in lst:
        model_name = model_prefix_name + str(k)
        makeModel(k, model_name)
    
  2. Get a list of the generated models by copying the following code into the next empty cell of the notebook and then running it:

    models = client.list_models('kmeans')
    print("Listing current models:")
    for model in models:
        full_model_id = f"{model.dataset_id}.{model.model_id}"
        print(full_model_id)
    

    You should see results similar to the following:

    Listing of the different model versions.

Compare model performance

Use the folowing procedure to compare the performance of the different models by reviewing the mean squared distance, which identifies the loss for the model, and the Davies–Bouldin score for each model.

  1. Use the ML.EVALUATE SQL statement to generate the mean squared distance and Davies–Bouldin score for each model. This might take a minute or so to complete.

    Copy the following code into the next empty cell of the notebook and then run it:

    # This creates a dataframe with each model name, the Davies Bouldin Index, and Loss.
    
    df = pd.DataFrame(columns=['davies_bouldin_index', 'mean_squared_distance'])
    models = client.list_models('kmeans')
    for model in models:
        full_model_id = f"{model.dataset_id}.{model.model_id}"
        sql =f'''
            SELECT
                davies_bouldin_index,
                mean_squared_distance
            FROM ML.EVALUATE(MODEL `{full_model_id}`)
        '''
    
        job_config = bigquery.QueryJobConfig()
    
        # Start the query, passing in the extra configuration.
        query_job = client.query(sql, job_config=job_config)
        df_temp = query_job.to_dataframe()  # Wait for the job to complete.
        df_temp['model_name'] = model.model_id
        df =  pd.concat([df, df_temp], axis=0)
    
  2. Update the dataframe by adding a column that identifies the number of clusters used by the model, and also sorting the data for graphing. To do so, copy the following code into the next empty cell of the notebook and then run it:

    df['n_clusters'] = df['model_name'].str.split('_').map(lambda x: x[2])
    df['n_clusters'] = df['n_clusters'].apply(pd.to_numeric)
    df = df.sort_values(by='n_clusters', ascending=True)
    df
    

    You should see results similar to the following table:

    Mean squared distance and Davies–Bouldin score information for each model.

  3. Create a graph of the data by copying the following code into the next empty cell of the notebook and then running it:

    df.plot.line(x='n_clusters', y=['davies_bouldin_index', 'mean_squared_distance'])
    

    You should see results similar to the following table:

    Graph plotting mean squared distance and Davies–Bouldin score against number of clusters for each model.

    There is no perfect approach to determining the optimal number of clusters for a use case. We recommend taking the following approach when making this decision:

    1. Apply the elbow method, and see if it shows you a clear place where loss evens out.
    2. Consider the Davies-Bouldin scores for the models.
    3. Consider business rules or requirements that might affect how many clusters you want to work with.
    4. Choose the model that meets your business requirements and has the best combination of low loss and low Davies-Bouldin score.

    In the preceding graph, the orange line represents each model's loss, as represented by mean square distance, charted against the number of clusters in the model. The trend of this line illustrates that increasing the number of clusters decreases the loss, which is expected. The line also decreases steadily rather than hitting a point where loss evens out and creates an elbow shape. This indicates that for this data set, using the elbow method alone isn't sufficient to determine the optimal number of clusters.

    The blue line represents each model's Davies-Bouldin score, charted against the number of clusters in the model. Since the elbow method didn't clearly identify the top performing model, you can alternatively identify that model by choosing one with a low Davies-Bouldin score.

    The model with 11 clusters has fairly low loss and the best Davies-Bouldin score, so if your business rules don't give you a reason to work with fewer clusters, it is your best choice. If your business rules do give you a reason to work with fewer clusters, the model with 4 clusters has the second best Davies-Bouldin score. For the sake of simplicity, this tutorial uses the model with 4 clusters.

Analyze the data clusters in the selected model

Develop a understanding of how the data has been clustered in this model by looking at the model's centroids, and then analyze the data in each cluster to understand how many users are in it and what the clustering indicates about customer behavior.

  1. Use the ML.CENTROIDS SQL statement to get information about the centroids in each cluster for the gender feature by copying the following code into the next empty cell of the notebook and then running it:

    model_to_use = 'kmeans_clusters_4' # Edit this to use a different model
    final_model = 'kmeans.'+model_to_use
    
    pd.set_option('max_colwidth', 400)
    
    sql_get_attributes = f'''
    SELECT
      centroid_id,
      feature,
      categorical_value
    FROM
      ML.CENTROIDS(MODEL {final_model})
    WHERE
      feature IN ('gender')
    '''
    
    job_config = bigquery.QueryJobConfig()
    
    # Start the query
    query_job = client.query(sql_get_attributes, job_config=job_config)
    df_attributes = query_job.result()
    df_attributes = df_attributes.to_dataframe()
    df_attributes
    

    You should see results similar to the following:

    Centroids for the gender feature of the kmeans model.

  2. Calculate summary statistics on the cluster data by using information from ML.PREDICT and ML.CENTROIDS by copying the following code into the next empty cell of the notebook and then running it:

    sql_get_numerical_attributes = f'''
    WITH T AS (
    SELECT
      centroid_id,
      ARRAY_AGG(STRUCT(feature AS name,
                       ROUND(numerical_value,1) AS value)
                       ORDER BY centroid_id)
                       AS cluster
    FROM ML.CENTROIDS(MODEL {final_model})
    GROUP BY centroid_id
    ),
    
    Users AS(
    SELECT
      centroid_id,
      COUNT(*) AS Total_Users
    FROM(
    SELECT
      * EXCEPT(nearest_centroids_distance)
    FROM
      ML.PREDICT(MODEL {final_model},
        (
        SELECT
          *
        FROM
          {final_view.full_table_id.replace(":", ".")}
          )))
    GROUP BY centroid_id
    )
    
    SELECT
      centroid_id,
      Total_Users,
      (SELECT value from unnest(cluster) WHERE name = 'Apparel') AS Apparel,
      (SELECT value from unnest(cluster) WHERE name = 'Office') AS Office,
      (SELECT value from unnest(cluster) WHERE name = 'Electronics') AS Electronics,
      (SELECT value from unnest(cluster) WHERE name = 'LimitedSupply') AS LimitedSupply,
      (SELECT value from unnest(cluster) WHERE name = 'Accessories') AS Accessories,
      (SELECT value from unnest(cluster) WHERE name = 'Bags') AS Bags,
      (SELECT value from unnest(cluster) WHERE name = 'ShopByBrand') AS ShopByBrand,
      (SELECT value from unnest(cluster) WHERE name = 'totalSpent_USD') AS totalSpent_USD,
      (SELECT value from unnest(cluster) WHERE name = 'hhi') AS HouseholdIncome
    
    FROM T LEFT JOIN Users USING(centroid_id)
    ORDER BY centroid_id ASC
    '''
    
    job_config = bigquery.QueryJobConfig()
    
    # Start the query
    query_job = client.query(sql_get_numerical_attributes, job_config=job_config)
    df_numerical_attributes = query_job.result()
    df_numerical_attributes = df_numerical_attributes.to_dataframe()
    df_numerical_attributes.head()
    

    You should see results similar to the following:

    Summary statistics for cluster data.

Considering both the centroid and summarized statistics data, you can determine the following characteristics for the model's clusters:

  • Cluster 1 — Customers in this cluster buy a lot in apparel. They purchase the most items, and have the second highest spending per item. They are not strongly brand motivated. The cluster skews slightly female.
  • Cluster 2 — Customers in this cluster buy across all categories but buy most in apparel. They purchase fewer items than customers in Cluster 1, but spend significantly more per item. Thay are very likely to shop by brand. This cluster skews more significantly female.
  • Cluster 3 — This cluster is an outlier with only one person in it.
  • Cluster 4 — This is the most populated cluster. Customers in this cluster make small numbers of purchases and spend less on average. They are typically one time buyers rather than brand loyalists. Gender skew is minimal.

Get predictions

Use your selected model to assign users into clusters that you can then use for marketing segmentation.

  1. Create a table with aggregated customer transaction data by copying the following code into the next empty cell of the notebook and then running it:

    sql_score = f'''
    SELECT * EXCEPT(nearest_centroids_distance)
    FROM
      ML.PREDICT(MODEL {final_model},
        (
        SELECT
          *
        FROM
          {final_view.full_table_id.replace(":", ".")}
          LIMIT 10))
    '''
    
    job_config = bigquery.QueryJobConfig()
    
    # Start the query
    query_job = client.query(sql_score, job_config=job_config)
    df_score = query_job.result()
    df_score = df_score.to_dataframe()
    
    df_score
    

    You should see results similar to the following:

    First 10 rows of processed training data.

Export data to Analytics 360

After you have cluster information for your customers from your k-means model, you can export it to Analytics 360 for use in marketing campaigns.

To do this, use the BigQuery console to run a query that returns the data you want to export, as shown in the following SQL pseudocode:

CREATE OR REPLACE TABLE myDataset.myCustomerDataTable AS (
    SELECT * EXCEPT(nearest_centroids_distance)
    FROM
      ML.PREDICT(MODEL myDataset.myModel,
        (
        SELECT
          *
        FROM
          myDataset.myTable)))

Save the query results as a CSV file, and then use Data Import to import the data into Analytics 360. Note that the column names in your exported recommendations data must map to the Analytics 360 data import schema. For example, if the data import schema is ga:clientId, ga:dimension1, ga:dimension2 then the column names in your data should be ga:clientId, ga:dimension1, ga:dimension2. BigQuery doesn't allow the use of colons in column names, so you must update the column names in the exported CSV file before you import it.

If you want, you can use the MoDeM (Model Deployment for Marketing) reference implementation for BigQuery ML models to make loading data into Analytics 360 easier. Use the interactive instructions in the BQML Deployment Template notebook to get started.

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project containing the resources, or keep the project but delete just those resources.

Either way, you should remove those resources so you won't be billed for them in the future. The following sections describe how to delete these resources.

Delete the project

The easiest way to eliminate billing is to delete the project you created for the tutorial.

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Delete the components

If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.

Delete the BigQuery dataset

  1. Open the BigQuery console
  2. In the Resources section, expand the project in which you are completing this tutorial and select the kmeans dataset.
  3. Click Delete dataset in the header of the dataset pane.
  4. In the overlay window that appears, type kmeans and then click Delete.

Delete the AI Platform notebook

  1. Open the AI Platform Notebooks page
  2. Select the checkbox for the kmeans notebook instance.
  3. Click Delete.
  4. In the overlay window that appears, click Delete.

What's next