Learn how to segment Google Analytics 360 audience data for marketing purposes by creating k-means clusters with BigQuery ML.
Overview
A common marketing analytics challenge is identifying themes in your customers' behavior that you can then use to develop customer profiles or personas. These profiles help you incorporate additional customer knowledge into your marketing campaigns.
To quickly build customer segments from your data, you can use a technique called clustering. Clustering lets you group customers with similar behavior together to build different audiences to use in marketing. You can personalize your outreach to customers by tailoring ads and other communications to them, based on the preferences and behaviors indicated by the cluster each customer belongs to.
In this tutorial, you learn how to create customer clusters by using a k-means model in BigQuery ML. You train the model on sample customer data that contains transaction and demographic information, and the model returns information about the customer clusters it identifies in that data. You then get predictions from the model that let you assign each customer to their appropriate cluster based on their demographic data and transaction history. After you have assigned each customer to a cluster, you can send the updated customer data back to Google Analytics 360 for use in marketing activation.
This tutorial walks you through creating a Vertex AI Workbench user-managed notebooks instance to complete these steps, providing prescriptive guidance and step-by-step procedures. If you just want a completed notebook to get started with instead, you can use the How to build k-means clustering models for market segmentation using BigQuery ML notebook on the notebook server of your choice.
This tutorial is intended for data engineers, data scientists, and data analysts who build ML datasets and models to support business decisions. It assumes that you have basic knowledge of the following:
- Machine learning concepts
- Python
- Standard SQL
How do clustering algorithms work?
Clustering is a type of unsupervised machine learning. You run an algorithm, k-means clustering in this case, to identify how data is logically grouped together. The algorithm evaluates the training data for similarities and groups it based on that, rather than depending on a human-provided label to group the data. For example, suppose you want to group your customers by age and estimated income. You can use clustering to help you determine how many clusters exist within your data based on these attributes. While clustering might seem simple when you are considering just one or two customer attributes, it becomes increasingly difficult to do manually as you take more customer attributes into consideration.
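To make the idea concrete, the following minimal sketch clusters a handful of hypothetical customers on just two attributes, age and estimated income. It uses scikit-learn and NumPy on made-up data purely for illustration; it isn't part of this tutorial's BigQuery ML workflow.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [age, estimated_income]
customers = np.array([
    [23, 41000], [25, 45000], [27, 48000],   # younger, lower income
    [41, 90000], [44, 95000], [46, 88000],   # middle-aged, higher income
    [63, 52000], [66, 50000], [68, 55000],   # older, moderate income
])

# Ask for three clusters and let k-means find the groupings.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(customers)

print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centroid (average age, income) of each cluster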
One challenge with using clustering is determining whether you have the "right" number of clusters. This tutorial shows you how to improve the model after building an initial version, so that you end up with a model that creates an appropriate number of clusters for your use case.
Dataset
This tutorial uses the Google Analytics Sample dataset, which is hosted publicly on BigQuery. This dataset provides 12 months (August 2016 to August 2017) of obfuscated Analytics 360 data from the Google Merchandise Store, a real e-commerce store that sells Google-branded merchandise.
If you have your own Analytics 360 data, you can use that instead. Information is provided throughout the tutorial on what to do differently in that case.
Objectives
- Process customer data into the correct format for creating a k-means clustering model.
- Create, train, and deploy the k-means model by using BigQuery ML.
- Iterate on the model to develop a version that creates the optimal number of clusters.
- Analyze the clusters that the model has produced to understand what they tell you about your customer segments.
- Get predictions from the deployed model to assign customers to their appropriate clusters.
- Export updated customer data from BigQuery so you can use it in marketing activation in Analytics 360.
Costs
This tutorial uses the following billable components of Google Cloud:
- AI Platform
- BigQuery
- BigQuery ML
To generate a cost estimate based on your projected usage, use the pricing calculator.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Cloud project. Learn how to check if billing is enabled on a project.
- Enable the Notebooks and Compute Engine APIs.
When you finish this tutorial, you can avoid continued billing by deleting the resources you created. For more information, see Clean up.
Create a notebook
- In the Google Cloud console, go to the Notebooks page.
- On the User-managed notebooks tab, click New notebook, and then choose Python 3.
- For Instance name, type kmeans.
- Click Create. It takes a few minutes for the notebook instance to be created.
- When the instance is available, click Open JupyterLab.
- In the Notebook section of the JupyterLab Launcher, click Python 3.
Install libraries
Install the libraries needed for this tutorial.
Copy the following code into the first cell of the notebook:
# Install libraries.
!pip install pandas-gbq
!pip install google-cloud-bigquery
!pip install google-cloud-bigquery-storage
# Needed to setup flex slots for flat-rate pricing
!pip install google-cloud-bigquery-reservation

# Automatically restart kernel after installs
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Click Run in the menu bar.
Import packages
Import the packages needed for this tutorial by copying the following code into the next empty cell of the notebook and then running it:
from google.cloud import bigquery
import numpy as np
import pandas as pd
import pandas_gbq
import matplotlib.pyplot as plt
# used to display float format
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Create a BigQuery client
Create a BigQuery client for the Python client library for BigQuery by copying the following code into the next empty cell of the notebook and then running it. Replace myProject with the ID of the project you are using to complete this tutorial.
client = bigquery.Client(project="myProject")
Process the data
In this section, you prepare the data you need to train the k-means model. Preparing the data includes aggregating customer transaction data from Analytics 360, generating sample customer demographic data, and then joining these two datasets to create the training dataset.
In this tutorial, you use the Google Analytics Sample data, but in a production use case, you would use your own Analytics 360 data.
Create a BigQuery dataset
Create a BigQuery dataset to contain the customer data and the k-means model by copying the following code into the next empty cell of the notebook and then running it. Replace myProject with the ID of the project you are using to complete this tutorial.
! bq mk myProject:kmeans
Aggregate transaction data
In this section, you aggregate customer transaction data from the sample Analytics 360 dataset. The aggregated data gives you a clearer picture of how often customers purchase items, what types of items they purchase, and how much they spend.
In this tutorial, we assume that a purchase occurs when the hit has type = 'EVENT'. In a production use case, you might want to identify purchases in some other way, depending on your Analytics 360 configuration.
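For example, if your Analytics 360 property uses Enhanced Ecommerce, one common alternative is to filter on the ecommerce action type rather than the hit type. The following fragment is a hedged sketch only; the action_type value of '6' (completed purchase) comes from the GA360 BigQuery export schema, and you should verify it against your own implementation before using it in the queries below.

# Hypothetical alternative WHERE clause for identifying purchases.
# Adjust the filter to match your own Analytics 360 configuration.
alternative_purchase_filter = '''
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20160831'
  AND geoNetwork.country = 'United States'
  AND hits.eCommerceAction.action_type = '6'
'''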
This tutorial also uses fullVisitorID as the key for the customer data. In a production use case, you should replace instances of fullVisitorID with clientId in this and subsequent queries, because you must use clientId as the key field when importing audience data into Analytics 360. clientId is normally a hashed version of fullVisitorId, but this column isn't populated in the Google Analytics Sample dataset. To use clientId in your own Analytics 360 data, you would create a custom dimension and populate it.
Create a view with aggregated customer transaction data by copying the following code into the next empty cell of the notebook and then running it:
ga360_only_view = 'GA360_View'
shared_dataset_ref = client.dataset('kmeans')
ga360_view_ref = shared_dataset_ref.table(ga360_only_view)
ga360_view = bigquery.Table(ga360_view_ref)

ga360_query = '''
SELECT
  fullVisitorID,
  # This will be used to generate random data.
  ABS(farm_fingerprint(fullVisitorID)) AS Hashed_fullVisitorID,
  # You can aggregate this because an
  # operating system is tied to a fullVisitorID.
  MAX(device.operatingSystem) AS OS,
  SUM (CASE
       WHEN REGEXP_EXTRACT (v2ProductCategory, r'^(?:(?:.*?)Home/)(.*?)/') = 'Apparel' THEN 1
       ELSE 0
       END) AS Apparel,
  SUM (CASE
       WHEN REGEXP_EXTRACT (v2ProductCategory, r'^(?:(?:.*?)Home/)(.*?)/') = 'Office' THEN 1
       ELSE 0
       END) AS Office,
  SUM (CASE
       WHEN REGEXP_EXTRACT (v2ProductCategory, r'^(?:(?:.*?)Home/)(.*?)/') = 'Electronics' THEN 1
       ELSE 0
       END) AS Electronics,
  SUM (CASE
       WHEN REGEXP_EXTRACT (v2ProductCategory, r'^(?:(?:.*?)Home/)(.*?)/') = 'Limited Supply' THEN 1
       ELSE 0
       END) AS LimitedSupply,
  SUM (CASE
       WHEN REGEXP_EXTRACT (v2ProductCategory, r'^(?:(?:.*?)Home/)(.*?)/') = 'Accessories' THEN 1
       ELSE 0
       END) AS Accessories,
  SUM (CASE
       WHEN REGEXP_EXTRACT (v2ProductCategory, r'^(?:(?:.*?)Home/)(.*?)/') = 'Shop by Brand' THEN 1
       ELSE 0
       END) AS ShopByBrand,
  SUM (CASE
       WHEN REGEXP_EXTRACT (v2ProductCategory, r'^(?:(?:.*?)Home/)(.*?)/') = 'Bags' THEN 1
       ELSE 0
       END) AS Bags,
  ROUND (SUM (productPrice/1000000),2) AS totalSpent_USD
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
  UNNEST(hits) AS hits,
  UNNEST(hits.product) AS hits_product
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20160831'
  AND geoNetwork.country = 'United States'
  AND type = 'EVENT'
GROUP BY
  1,
  2
'''

ga360_view.view_query = ga360_query.format('myProject')
ga360_view = client.create_table(ga360_view)
See a sample of the aggregated customer transaction data by copying the following code into the next empty cell of the notebook and then running it:
# Show a sample of GA360 data
ga360_query_df = f'''
SELECT * FROM {ga360_view.full_table_id.replace(":", ".")}
ORDER BY fullVisitorID
LIMIT 5
'''

job_config = bigquery.QueryJobConfig()

# Start the query
query_job = client.query(ga360_query_df, job_config=job_config)  # API request
df_ga360 = query_job.result()
df_ga360 = df_ga360.to_dataframe()
df_ga360
You should see results similar to the following:
Generate demographic data
Use the following procedure to generate synthetic customer demographic data.
In a production use case, you would want to use your own data from a customer relationship management (CRM) system. You would skip this step and proceed to the next, which is joining the CRM data with the aggregated customer transaction data.
Create a view with generated customer data by copying the following code into the next empty cell of the notebook and then running it:
CRM_only_view = 'CRM_View'
shared_dataset_ref = client.dataset('kmeans')
CRM_view_ref = shared_dataset_ref.table(CRM_only_view)
CRM_view = bigquery.Table(CRM_view_ref)

# The query below works by hashing the fullVisitorID, which creates a
# random distribution. It uses modulo to artificially split gender and
# household income distribution.
CRM_query = '''
SELECT
  fullVisitorID,
  IF (MOD(Hashed_fullVisitorID,2) = 0, "M", "F") AS gender,
  # Generate household income
  CASE
    WHEN MOD(Hashed_fullVisitorID,10) = 0 THEN 55000
    WHEN MOD(Hashed_fullVisitorID,10) < 3 THEN 65000
    WHEN MOD(Hashed_fullVisitorID,10) < 7 THEN 75000
    WHEN MOD(Hashed_fullVisitorID,10) < 9 THEN 85000
    WHEN MOD(Hashed_fullVisitorID,10) = 9 THEN 95000
    ELSE Hashed_fullVisitorID
  END AS hhi
FROM (
  SELECT
    fullVisitorID,
    ABS(farm_fingerprint(fullVisitorID)) AS Hashed_fullVisitorID,
  FROM
    `bigquery-public-data.google_analytics_sample.ga_sessions_*`,
    UNNEST(hits) AS hits,
    UNNEST(hits.product) AS hits_product
  WHERE
    _TABLE_SUFFIX BETWEEN '20160801' AND '20160831'
    AND geoNetwork.country = 'United States'
    AND type = 'EVENT'
  GROUP BY
    1,
    2)
'''

CRM_view.view_query = CRM_query.format('myProject')
CRM_view = client.create_table(CRM_view)
See a sample of the generated customer data by copying the following code into the next empty cell of the notebook and then running it:
# See an output of the synthetic CRM data
CRM_query_df = f'''
SELECT * FROM {CRM_view.full_table_id.replace(":", ".")}
ORDER BY fullVisitorID
LIMIT 5
'''

job_config = bigquery.QueryJobConfig()

# Start the query
query_job = client.query(CRM_query_df, job_config=job_config)
df_CRM = query_job.result()
df_CRM = df_CRM.to_dataframe()
df_CRM
You should see results similar to the following:
Create the training dataset
Use the following procedure to create the training dataset by joining the aggregated customer transaction data and the generated customer demographic data.
Create a view with joined customer and transaction data by copying the following code into the next empty cell of the notebook and then running it:
# Build a final view, which joins GA360 data with CRM data
final_data_view = 'Final_View'
shared_dataset_ref = client.dataset('kmeans')
final_view_ref = shared_dataset_ref.table(final_data_view)
final_view = bigquery.Table(final_view_ref)

final_data_query = f'''
SELECT
  g.*,
  c.* EXCEPT(fullVisitorId)
FROM {ga360_view.full_table_id.replace(":", ".")} g
JOIN {CRM_view.full_table_id.replace(":", ".")} c
ON g.fullVisitorId = c.fullVisitorId
'''

final_view.view_query = final_data_query.format('myProject')
final_view = client.create_table(final_view)
print(f"Successfully created view at {final_view.full_table_id}")
See a sample of the joined data by copying the following code into the next empty cell of the notebook and then running it:
# Show final data used before modeling
sql_demo = f'''
SELECT * FROM {final_view.full_table_id.replace(":", ".")}
ORDER BY fullVisitorID
LIMIT 5
'''

job_config = bigquery.QueryJobConfig()

# Start the query
query_job = client.query(sql_demo, job_config=job_config)
df_demo = query_job.result()
df_demo = df_demo.to_dataframe()
df_demo
You should see results similar to the following:
Create, train, and deploy an initial model
In this section, you create a Python function that builds the k-means model. Creating a function rather than just using SQL allows you to define multiple models and let BigQuery build them in parallel. You will use this function again in the Improve the model section to create several versions of the model for comparison.
Note that you don't use fullVisitorId as an input to the model, because you don't need that data to create a feature for clustering. In general, unique or uncommon data isn't useful for that purpose.
Typically, you would also need to normalize numerical features, but you can skip that step here because BigQuery ML takes care of that automatically.
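If you were preparing features outside of BigQuery ML, a manual standardization step might look like the following sketch, which rescales two of the numerical columns to z-scores with analytic functions. This query is illustrative only and isn't used elsewhere in this tutorial.

# Example only: manual standardization of numerical features.
# BigQuery ML does this automatically, so this step isn't needed here.
normalize_example = f'''
SELECT
  fullVisitorID,
  (totalSpent_USD - AVG(totalSpent_USD) OVER ()) / STDDEV(totalSpent_USD) OVER () AS totalSpent_zscore,
  (hhi - AVG(hhi) OVER ()) / STDDEV(hhi) OVER () AS hhi_zscore
FROM {final_view.full_table_id.replace(":", ".")}
'''
# client.query(normalize_example).to_dataframe()  # uncomment to inspect the scaled values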
Define the makeModel function by copying the following code into the next empty cell of the notebook and then running it. Replace myProject with the ID of the project you are using to complete this tutorial.

PROJECT_ID = "myProject"

def makeModel (n_Clusters, Model_Name):
    sql =f'''
    CREATE OR REPLACE MODEL `{PROJECT_ID}.kmeans.{Model_Name}`
    OPTIONS(model_type='kmeans',
            kmeans_init_method = 'KMEANS++',
            num_clusters={n_Clusters}) AS
    SELECT * except(fullVisitorID, Hashed_fullVisitorID)
    FROM `{final_view.full_table_id.replace(":", ".")}`
    '''
    job_config = bigquery.QueryJobConfig()
    client.query(sql, job_config=job_config)
Test the function by creating the test model that segments the data into three clusters. Note that the model takes approximately 5 minutes to build. Copy the following code into the next empty cell of the notebook and then run it:

model_test_name = "test"
makeModel(3, model_test_name)
Confirm that the model has been created by copying the following code into the next empty cell of the notebook and then running it:
for model in client.list_models('kmeans'):
  print(model)
You should see results similar to the following:
Model(reference=ModelReference(project_id='myProject', dataset_id='kmeans', model_id='test'))
Delete the test model by copying the following code into the next empty cell of the notebook and then running it:

model_id = "kmeans."+model_test_name
client.delete_model(model_id)
print(f"Deleted model '{model_id}'")
Improve the model
This section shows you how to improve the model by tuning it to provide the optimal number of clusters for your use case.
Determining the right number of clusters — this is the "k" in "k-means" — depends on your use case. It is sometimes easy to determine how many clusters you need. For example, if you are processing images of handwritten single digits, you need 10 clusters, one each for digits 0-9. For use cases that are less obvious, you can experiment with multiple versions of the model that use different numbers of clusters. Comparing different versions of the model lets you see which performs best in terms of grouping your data while minimizing error within each cluster.
You can determine which model version provides the best groupings by using the elbow method, which is a way of charting loss against the number of clusters in the model, in combination with the model's Davies-Bouldin score. This score indicates how different a cluster's data points are from one another. It is defined as the ratio between the scatter of the data points within a given cluster and the scatter of the clusters themselves. A lower value indicates better clustering.
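The following sketch illustrates how the Davies-Bouldin score rewards tight, well-separated clusters. It uses scikit-learn and NumPy on synthetic points, purely as an aside; the tutorial itself relies on the scores that ML.EVALUATE returns.

import numpy as np
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(42)
# Two tight, well-separated blobs of points.
blob_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

good_labels = np.array([0] * 50 + [1] * 50)    # labels that match the true blobs
random_labels = rng.integers(0, 2, size=100)   # labels that ignore the structure

print(davies_bouldin_score(points, good_labels))    # low value, good clustering
print(davies_bouldin_score(points, random_labels))  # higher value, poor clustering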
Create multiple models for comparison
Generate multiple versions of the k-means model, from one that creates three clusters up to one that creates fifteen clusters. Note that the models take approximately 7 minutes to build.
Copy the following code into the next empty cell of the notebook and then run it:
low_k = 3
high_k = 15
model_prefix_name = 'kmeans_clusters_'

lst = list(range(low_k, high_k+1))  # build list to iterate through k values

for k in lst:
    model_name = model_prefix_name + str(k)
    makeModel(k, model_name)
Get a list of the generated models by copying the following code into the next empty cell of the notebook and then running it:
models = client.list_models('kmeans')
print("Listing current models:")
for model in models:
    full_model_id = f"{model.dataset_id}.{model.model_id}"
    print(full_model_id)
You should see results similar to the following:
Compare model performance
Use the following procedure to compare the performance of the different models by reviewing the mean squared distance, which identifies the loss for the model, and the Davies-Bouldin score for each model.
Use the ML.EVALUATE SQL function to generate the mean squared distance and Davies-Bouldin score for each model. This might take a minute or so to complete.
Copy the following code into the next empty cell of the notebook and then run it:
# This creates a dataframe with each model name, the Davies Bouldin Index, and Loss.
df = pd.DataFrame(columns=['davies_bouldin_index', 'mean_squared_distance'])

models = client.list_models('kmeans')
for model in models:
    full_model_id = f"{model.dataset_id}.{model.model_id}"
    sql =f'''
    SELECT
        davies_bouldin_index,
        mean_squared_distance
    FROM ML.EVALUATE(MODEL `{full_model_id}`)
    '''

    job_config = bigquery.QueryJobConfig()

    # Start the query, passing in the extra configuration.
    query_job = client.query(sql, job_config=job_config)

    df_temp = query_job.to_dataframe()  # Wait for the job to complete.
    df_temp['model_name'] = model.model_id
    df = pd.concat([df, df_temp], axis=0)
Update the dataframe by adding a column that identifies the number of clusters used by the model, and also sorting the data for graphing. To do so, copy the following code into the next empty cell of the notebook and then run it:
df['n_clusters'] = df['model_name'].str.split('_').map(lambda x: x[2])
df['n_clusters'] = df['n_clusters'].apply(pd.to_numeric)
df = df.sort_values(by='n_clusters', ascending=True)
df
You should see results similar to the following table:
Create a graph of the data by copying the following code into the next empty cell of the notebook and then running it:
df.plot.line(x='n_clusters', y=['davies_bouldin_index', 'mean_squared_distance'])
You should see results similar to the following chart:
There is no perfect approach to determining the optimal number of clusters for a use case. We recommend taking the following approach when making this decision:
- Apply the elbow method, and see if it shows you a clear place where loss evens out.
- Consider the Davies-Bouldin scores for the models.
- Consider business rules or requirements that might affect how many clusters you want to work with.
- Choose the model that meets your business requirements and has the best combination of low loss and low Davies-Bouldin score.
In the preceding graph, the orange line charts each model's loss, represented by mean squared distance, against the number of clusters in the model. The trend of this line illustrates that increasing the number of clusters decreases the loss, which is expected. The line also decreases steadily rather than hitting a point where the loss evens out and creates an elbow shape. This indicates that for this dataset, using the elbow method alone isn't sufficient to determine the optimal number of clusters.
The blue line represents each model's Davies-Bouldin score, charted against the number of clusters in the model. Since the elbow method didn't clearly identify the top performing model, you can alternatively identify that model by choosing one with a low Davies-Bouldin score.
The model with 11 clusters has fairly low loss and the best Davies-Bouldin score, so if your business rules don't give you a reason to work with fewer clusters, it is your best choice. If your business rules do give you a reason to work with fewer clusters, the model with 4 clusters has the second best Davies-Bouldin score. For the sake of simplicity, this tutorial uses the model with 4 clusters.
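If you want to confirm this choice programmatically, you can rank the models in the comparison dataframe (df) that you built earlier by their Davies-Bouldin score. This is an optional check, not a required step:

# Optional sanity check: list the models with the lowest (best)
# Davies-Bouldin scores alongside their loss.
df.sort_values(by='davies_bouldin_index', ascending=True).head()[
    ['model_name', 'n_clusters', 'davies_bouldin_index', 'mean_squared_distance']
]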
Analyze the data clusters in the selected model
Develop an understanding of how the data has been clustered in this model by looking at the model's centroids, and then analyze the data in each cluster to understand how many users are in it and what the clustering indicates about customer behavior.
Use the ML.CENTROIDS SQL statement to get information about the centroids in each cluster for the gender feature by copying the following code into the next empty cell of the notebook and then running it:

model_to_use = 'kmeans_clusters_4'  # Edit this to use a different model
final_model = 'kmeans.'+model_to_use

pd.set_option('max_colwidth', 400)

sql_get_attributes = f'''
SELECT
  centroid_id,
  feature,
  categorical_value
FROM ML.CENTROIDS(MODEL {final_model})
WHERE feature IN ('gender')
'''

job_config = bigquery.QueryJobConfig()

# Start the query
query_job = client.query(sql_get_attributes, job_config=job_config)
df_attributes = query_job.result()
df_attributes = df_attributes.to_dataframe()
df_attributes
You should see results similar to the following:
Calculate summary statistics for each cluster by using information from ML.PREDICT and ML.CENTROIDS. To do so, copy the following code into the next empty cell of the notebook and then run it:
sql_get_numerical_attributes = f'''
WITH T AS (
SELECT
  centroid_id,
  ARRAY_AGG(STRUCT(feature AS name,
                   ROUND(numerical_value,1) AS value)
                   ORDER BY centroid_id)
                   AS cluster
FROM ML.CENTROIDS(MODEL {final_model})
GROUP BY centroid_id
),

Users AS(
SELECT
  centroid_id,
  COUNT(*) AS Total_Users
FROM(
SELECT
  * EXCEPT(nearest_centroids_distance)
FROM ML.PREDICT(MODEL {final_model},
(
SELECT
  *
FROM {final_view.full_table_id.replace(":", ".")}
)))
GROUP BY centroid_id
)

SELECT
  centroid_id,
  Total_Users,
  (SELECT value from unnest(cluster) WHERE name = 'Apparel') AS Apparel,
  (SELECT value from unnest(cluster) WHERE name = 'Office') AS Office,
  (SELECT value from unnest(cluster) WHERE name = 'Electronics') AS Electronics,
  (SELECT value from unnest(cluster) WHERE name = 'LimitedSupply') AS LimitedSupply,
  (SELECT value from unnest(cluster) WHERE name = 'Accessories') AS Accessories,
  (SELECT value from unnest(cluster) WHERE name = 'Bags') AS Bags,
  (SELECT value from unnest(cluster) WHERE name = 'ShopByBrand') AS ShopByBrand,
  (SELECT value from unnest(cluster) WHERE name = 'totalSpent_USD') AS totalSpent_USD,
  (SELECT value from unnest(cluster) WHERE name = 'hhi') AS HouseholdIncome
FROM T LEFT JOIN Users USING(centroid_id)
ORDER BY centroid_id ASC
'''

job_config = bigquery.QueryJobConfig()

# Start the query
query_job = client.query(sql_get_numerical_attributes, job_config=job_config)
df_numerical_attributes = query_job.result()
df_numerical_attributes = df_numerical_attributes.to_dataframe()
df_numerical_attributes.head()
You should see results similar to the following:
Considering both the centroid and summarized statistics data, you can determine the following characteristics for the model's clusters:
- Cluster 1 — Customers in this cluster buy a lot of apparel. They purchase the most items and have the second highest spending per item. They are not strongly brand motivated. The cluster skews slightly female.
- Cluster 2 — Customers in this cluster buy across all categories but buy the most apparel. They purchase fewer items than customers in Cluster 1, but spend significantly more per item. They are very likely to shop by brand. This cluster skews more significantly female.
- Cluster 3 — This cluster is an outlier with only one person in it.
- Cluster 4 — This is the most populated cluster. Customers in this cluster make small numbers of purchases and spend less on average. They are typically one-time buyers rather than brand loyalists. Gender skew is minimal.
Get predictions
Use your selected model to assign users into clusters that you can then use for marketing segmentation.
Get cluster predictions for a sample of customers by copying the following code into the next empty cell of the notebook and then running it:
sql_score = f'''
SELECT * EXCEPT(nearest_centroids_distance)
FROM ML.PREDICT(MODEL {final_model},
(
SELECT
  *
FROM {final_view.full_table_id.replace(":", ".")}
LIMIT 10))
'''

job_config = bigquery.QueryJobConfig()

# Start the query
query_job = client.query(sql_score, job_config=job_config)
df_score = query_job.result()
df_score = df_score.to_dataframe()
df_score
You should see results similar to the following:
Export data to Analytics 360
After you have cluster information for your customers from your k-means model, you can export it to Analytics 360 for use in marketing campaigns.
To do this, use the BigQuery console to run a query that returns the data you want to export, as shown in the following SQL pseudocode:
CREATE OR REPLACE TABLE myDataset.myCustomerDataTable AS (
SELECT * EXCEPT(nearest_centroids_distance)
FROM
ML.PREDICT(MODEL myDataset.myModel,
(
SELECT
*
FROM
myDataset.myTable)))
Save the query results as a CSV file, and then use Data Import to import the data into Analytics 360. Note that the column names in your exported data must map to the Analytics 360 data import schema. For example, if the data import schema is ga:clientId, ga:dimension1, ga:dimension2, then the column names in your data should be ga:clientId, ga:dimension1, ga:dimension2. BigQuery doesn't allow the use of colons in column names, so you must update the column names in the exported CSV file before you import it.
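For example, a small post-processing step like the following sketch can rename the exported columns to match the Data Import schema. The file names and the column-to-dimension mapping below are placeholders; replace them with the values from your own export and your own Data Import configuration.

import pandas as pd

exported = pd.read_csv('exported_customer_clusters.csv')  # hypothetical export file
exported = exported.rename(columns={
    'clientId': 'ga:clientId',        # key field required by Data Import
    'centroid_id': 'ga:dimension1',   # custom dimension holding the cluster ID
})
exported.to_csv('analytics_360_import.csv', index=False)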
If you want, you can use the MoDeM (Model Deployment for Marketing) reference implementation for BigQuery ML models to make loading data into Analytics 360 easier. Use the interactive instructions in the BQML Deployment Template notebook to get started.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project containing the resources, or keep the project but delete just those resources.
Either way, you should remove those resources so you won't be billed for them in the future. The following sections describe how to delete these resources.
Delete the project
The easiest way to eliminate billing is to delete the project you created for the tutorial.
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
Delete the components
If you don't want to delete the project, use the following sections to delete the billable components of this tutorial.
Delete the BigQuery dataset
- Open the BigQuery console.
- In the Resources section, expand the project in which you are completing this tutorial and select the kmeans dataset.
- Click Delete dataset in the header of the dataset pane.
- In the overlay window that appears, type kmeans and then click Delete.
Delete the user-managed notebooks instance
- Open the User-managed notebooks tab.
- Select the checkbox for the kmeans notebook instance.
- Click Delete.
- In the overlay window that appears, click Delete.
What's next
- Learn about other predictive forecasting solutions.
- Learn more about BigQuery ML.
- Learn more about Analytics 360.