BigQuery ML supports unsupervised learning . You can apply the k-means algorithm to group your data into clusters. Unlike supervised machine learning, which is about predictive analytics, unsupervised learning is about descriptive analytics. It's about understanding your data so that you can make data-driven decisions.
In this tutorial, you use a k-means model in BigQuery ML to build clusters of data in the London Bicycle Hires public dataset. The London Bicycle Hires data contains the number of hires of London's Santander Cycle Hire Scheme from 2011 to the present. The data includes start and stop timestamps, station names and ride duration.
The queries in this tutorial use Geography functions available in geospatial analytics. For more information on geospatial analytics, see Introduction to geospatial analytics.
Objectives
In this tutorial you:- Create a k-means clustering model.
- Make data-driven decisions based on BigQuery ML's visualization of the clusters.
Costs
This tutorial uses billable components of Google Cloud, including the following:
- BigQuery
- BigQuery ML
For information on BigQuery costs, see the BigQuery pricing page.
For information on BigQuery ML costs, see BigQuery ML pricing.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
-
In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
-
Make sure that billing is enabled for your Google Cloud project.
- BigQuery is automatically enabled in new projects.
To activate BigQuery in a pre-existing project, go to
Enable the BigQuery API.
Introduction
Your data might contain natural groupings or clusters of data, and you might want to identify these groupings descriptively in order to make data-driven decisions. For example, if you are a retailer you might want to identify natural groupings of customers who have similar purchasing habits or locations. This process is known as customer segmentation.
Data you use to perform customer segmentation might include the store they visited, what items they bought, and how much they paid. You would create a model to try to understand what these groups of customer personas are like so that you can design items that appeal to group members.
You could also find product groups among the items purchased. In this case, you would cluster items based on who purchased them, when they were purchased, where they were purchased, and other similar characteristics. You would create a model to determine the characteristics of a product group so that you can make informed decisions such as how to improve cross-selling.
In this tutorial, you use BigQuery ML to create a k-means model that clusters London Bicycle Hires data based on bike station attributes.
Creating your k-means model consists of the following steps.
- Step one: Create a dataset to store your model.
- The first step is to create a dataset that stores your model.
- Step two: Examine your training data.
- The next step is to examine the data you use to train your clustering model
by running a query against the
london_bicycles
table. Because k-means is an unsupervised learning technique, model training does not require labels nor does it require you to split the data into training data and evaluation data.
- Step three: Create a k-means model.
- Step three is to create your k-means model. When you create the model, the
clustering field is
station_name
, and you cluster the data based on station attribute, for example the distance of the station from the city center.
- Step four: Use the
ML.PREDICT
function to predict a station's cluster. - Next, you use the
ML.PREDICT
function to predict the cluster for a given set of stations. You predict clusters for all station names that contain the stringKennington
.
- Step four: Use the
- Step five: Use your model to make data-driven decisions.
- The final step is to use the model to make data-driven decisions. For example, based on the model results, you can determine which stations would benefit from extra capacity.
Step one: Create your dataset
Create a BigQuery dataset to store your ML model:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, click your project name.
Click
View actions > Create dataset.On the Create dataset page, do the following:
For Dataset ID, enter
bqml_tutorial
.For Location type, select Multi-region, and then select EU (multiple regions in European Union).
The London Bicycle Hires public dataset is stored in the
EU
multi-region. Your dataset must be in the same location.Leave the remaining default settings as they are, and click Create dataset.
Step two: Examine your training data
Next, examine the data used to train your k-means model. In this tutorial, you cluster bike stations based on the following attributes:
- Duration of rentals
- Number of trips per day
- Distance from city center
SQL
The following GoogleSQL query is used to examine the data used to train your k-means model.
#standardSQL WITH hs AS ( SELECT h.start_station_name AS station_name, IF (EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7, "weekend", "weekday") AS isweekday, h.duration, ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5))/1000 AS distance_from_city_center FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s ON h.start_station_id = s.id WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP) AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ), stationstats AS ( SELECT station_name, isweekday, AVG(duration) AS duration, COUNT(duration) AS num_trips, MAX(distance_from_city_center) AS distance_from_city_center FROM hs GROUP BY station_name, isweekday ) SELECT * FROM stationstats ORDER BY distance_from_city_center ASC
Query details
This query extracts data on cycle hires, including start_station_name
, and
duration
and joins it against station information, including distance-from-city-center
. Then, it computes attributes of
the station in stationstats
, including the average duration of rides and the
number of trips, and passes through the station attribute distance_from_city_center
.
This query uses the
WITH
clause
to define subqueries. The query also uses the ST_DISTANCE
and ST_GEOGPOINT
geospatial analytics functions. For more information about these functions, see
Geography
functions. For more
information about geospatial analytics, see Introduction to
geospatial analytics.
Run the query
The following query compiles your training data, and is also used in the
CREATE MODEL
statement later in this tutorial.
To run the query:
- Go to the BigQuery page.
In the editor pane, run the following SQL statement:
WITH hs AS ( SELECT h.start_station_name AS station_name, IF (EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7, "weekend", "weekday") AS isweekday, h.duration, ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5))/1000 AS distance_from_city_center FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s ON h.start_station_id = s.id WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP) AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ), stationstats AS ( SELECT station_name, isweekday, AVG(duration) AS duration, COUNT(duration) AS num_trips, MAX(distance_from_city_center) AS distance_from_city_center FROM hs GROUP BY station_name, isweekday ) SELECT * FROM stationstats ORDER BY distance_from_city_center ASC
When the query is complete, click the Results tab below the query text area. The results tab shows the columns you queried that are used to train your model:
station_name
,duration
,num_trips
,distance_from_city_center
. The results should look like the following.
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Step three: Create a k-means model
Now that you have examined your training data, the next step is to create a k-means model using the data.
SQL
You can create and train a k-means model using the
CREATE MODEL
statement with the option model_type=kmeans
.
Query details
The CREATE MODEL
statement specifies the number of clusters to use —
four. In the SELECT
statement, the EXCEPT
clause excludes the station_name
column because station_name
is not a feature. The query creates a unique row
per station_name, and only the features are mentioned in the SELECT
statement.
If you omit the num_clusters
option, BigQuery ML chooses a
reasonable default based on the total number of rows in the training data. You
could also perform hyperparameter tuning to find a good number. To determine an
optimal number of clusters, you would run the CREATE MODEL
query for different
values of num_clusters
, find the error measure, and pick the point at which
the error measure is at its minimum value. You can obtain the error measure by
selecting your model and clicking the Evaluation tab. This tab shows
the Davies–Bouldin index.
Run the query
The following query adds a CREATE MODEL
statement to the query you used to
examine the training data, and also removes the id
fields in the data.
To run the query and create a k-means model:
- Go to the BigQuery page.
In the editor pane, run the following SQL statement:
CREATE OR REPLACE MODEL `bqml_tutorial.london_station_clusters` OPTIONS(model_type='kmeans', num_clusters=4) AS WITH hs AS ( SELECT h.start_station_name AS station_name, IF (EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7, "weekend", "weekday") AS isweekday, h.duration, ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5))/1000 AS distance_from_city_center FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s ON h.start_station_id = s.id WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP) AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ), stationstats AS ( SELECT station_name, isweekday, AVG(duration) AS duration, COUNT(duration) AS num_trips, MAX(distance_from_city_center) AS distance_from_city_center FROM hs GROUP BY station_name, isweekday) SELECT * EXCEPT(station_name, isweekday) FROM stationstats
In the navigation panel, in the Resources section, expand your project name, click bqml_tutorial, and then click london_station_clusters.
Click the Schema tab. The model schema lists the four station attributes that BigQuery ML used to perform clustering. The schema should look like the following.
- Click the Evaluation tab. This tab displays visualizations of the clusters identified by the k-means model. Under Numeric features, bar graphs display up to 10 of the most important numeric feature values for each centroid. You can select which features to visualize from the drop-down menu.
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Step four: Use the ML.PREDICT
function to predict a station's cluster
To identify the cluster to which a particular station belongs, use the
ML.PREDICT
SQL function
function or or the predict
BigQuery DataFrames function
SQL
Query details
This query uses the REGEXP_CONTAINS
function to find all entries in the station_name
column that contain the
string "Kennington". The ML.PREDICT
function uses those values to predict
which clusters would contain those stations.
Run the query
The following query predicts the cluster of every station that has the string "Kennington" in its name.
To run the ML.PREDICT
query:
- Go to the BigQuery page.
In the editor pane, run the following SQL statement:
WITH hs AS ( SELECT h.start_station_name AS station_name, IF (EXTRACT(DAYOFWEEK FROM h.start_date) = 1 OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7, "weekend", "weekday") AS isweekday, h.duration, ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5))/1000 AS distance_from_city_center FROM `bigquery-public-data.london_bicycles.cycle_hire` AS h JOIN `bigquery-public-data.london_bicycles.cycle_stations` AS s ON h.start_station_id = s.id WHERE h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP) AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ), stationstats AS ( SELECT station_name, isweekday, AVG(duration) AS duration, COUNT(duration) AS num_trips, MAX(distance_from_city_center) AS distance_from_city_center FROM hs GROUP BY station_name, isweekday ) SELECT * EXCEPT(nearest_centroids_distance) FROM ML.PREDICT( MODEL `bqml_tutorial.london_station_clusters`, ( SELECT * FROM stationstats WHERE REGEXP_CONTAINS(station_name, 'Kennington')))
When the query is complete, click the Results tab below the query text area. The results should look like the following.
BigQuery DataFrames
Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.
To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
Step five: Use your model to make data-driven decisions
Evaluation results can help you to interpret the different clusters. In the following example, centroid 3 shows a busy city station that is close to the city center. Centroid 2 shows the second city station which is less busy and used for longer duration rentals. Centroid 1 shows a less busy city station, with shorter duration rentals. Centroid 4 shows a suburban station with trips that are longer.
Based on these results, you can use the data to inform your decisions. For example:
Assume that you need to experiment with a new type of lock. Which cluster of stations should you choose as a subject for this experiment? The stations in centroid 1, centroid 2 or centroid 4 seem like logical choices because they are not the busiest stations.
Assume that you want to stock some stations with racing bikes. Which stations should you choose? Centroid 4 is the group of stations that are far from the city center, and they have the longest trips. These are likely candidates for racing bikes.
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.
- You can delete the project you created.
- Or you can keep the project and delete the dataset.
Delete your dataset
Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:
If necessary, open the BigQuery page in the Google Cloud console.
In the navigation, click the bqml_tutorial dataset you created.
Click Delete dataset on the right side of the window. This action deletes the dataset and the model.
In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (
bqml_tutorial
) and then click Delete.
Delete your project
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- For information about creating models, see the
CREATE MODEL
syntax page.