Creating a k-means clustering model

BigQuery ML supports unsupervised learning . You can apply the k-means algorithm to group your data into clusters. Unlike supervised machine learning, which is about predictive analytics, unsupervised learning is about descriptive analytics. It's about understanding your data so that you can make data-driven decisions.

In this tutorial, you will use a k-means model in BigQuery ML to build clusters of data in the London Bicycle Hires public dataset. The London Bicycle Hires data contains the number of hires of London's Santander Cycle Hire Scheme from 2011 to present. The data includes start and stop timestamps, station names and ride duration.

The queries in this tutorial use Geography Functions. available in BigQuery GIS. For more information on BigQuery GIS, see Introduction to BigQuery GIS.

Objectives

In this tutorial you:

  • Create a k-means clustering model.
  • Use Google Data Studio to visualize your model.
  • Make data-driven decisions based on your model.

Costs

This tutorial uses billable components of Cloud Platform, including:

  • BigQuery
  • BigQuery ML

For more information on BigQuery costs, see the BigQuery pricing page.

For more information on BigQuery ML costs, see the BigQuery ML pricing page.

Before you begin

  1. Sign in to your Google Account.

    If you don't already have one, sign up for a new account.

  2. Select or create a GCP project.

    Go to the project selector page

  3. Projeniz için faturalandırmanın etkinleştirildiğinden emin olun.

    Faturalandırmayı etkinleştirmeyi öğren

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, go to Enable the BigQuery API.

    Enable the API

Introduction

Your data may contain natural groupings or clusters of data. You may want to identify these groupings descriptively in order to make data-driven decisions. For example, a retailer may want to identify natural groupings of customers who have similar purchasing habits or locations. This process is known as customer segmentation.

Data you use to perform customer segmentation might include the store they visited, what items they bought, how much they paid, and so on. You would create a model to try to understand what these groups of customer personas are like so that you can design items that appeal to group members.

You could also find product groups among the items purchased. In this case, you would cluster items based on who purchased them, when they were purchased, where they were purchased, and so on. You would create a model to determine the characteristics of a product group so that you can make informed decisions such as how to improve cross-selling.

In this tutorial, you use BigQuery ML to create a k-means model that clusters London Bicycle Hires data based on bike station attributes.

Creating your k-means model consists of the following steps.

  • Step one: Create a dataset to store your model.
    The first step is to create a dataset that stores your model. Because the London Bicycle Hires dataset is stored in the EU multi-region location, your dataset must also reside in the EU.
  • Step two: Examine your training data.
    The next step is to examine the data you use to train your clustering model by running a query against the london_bicycles table. Because k-means is an unsupervised learning technique, model training does not require labels nor does it require you to split the data into training data and evaluation data.
  • Step three: Create a k-means model.
    Step three is to create your k-means model. When you create the model, the clustering field is station_name, and you cluster the data based on station attributes such as the number of bike racks and the distance of the station from the city center.
  • Step four: Use the ML.PREDICT function to predict a station's cluster.
    Next, you use the ML.PREDICT function to predict the cluster for a given set of stations. You predict clusters for all station names that contain the string Kennington.
  • Step five: Visualize your model.
    Step five is to examine the characteristics of each cluster using a visualization tool. In this tutorial, you visualize the clusters by using Google Data Studio.
  • Step six: Use your model to make data-driven decisions.
    The final step is to use the model to make data-driven decisions. For example, based on the model results, you can determine which stations would benefit from extra capacity.

Step one: Create your dataset

The first step is to create a BigQuery dataset to store your model. To create your dataset:

  1. Go to the BigQuery web UI in the GCP Console.

    Go to the BigQuery web UI

  2. In the navigation panel, in the Resources section, click your project name.

  3. On the right side, in the details panel, click Create dataset.

    Create dataset

  4. On the Create dataset page:

    • For Dataset ID, enter bqml_tutorial.
    • For Data location, choose European Union (EU). The London Bicycle Hires public dataset is stored in the EU multi-region location. Your dataset should be in the same location.

      Create dataset page

  5. Leave all of the other default settings in place and click Create dataset.

Step two: Examine your training data

Next, you examine the data used to train your k-means model. In this tutorial, you cluster bike stations based on the following attributes:

  • Duration of rentals
  • Number of trips per day
  • Number of bike racks
  • Distance from city center

The following query compiles your training data. This query is included in your CREATE MODEL statement later in this tutorial.

WITH
  hs AS (
  SELECT
    h.start_station_name AS station_name,
    h.duration,
    s.bikes_count,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude,
        s.latitude),
      ST_GEOGPOINT(-0.1,
        51.5))/1000 AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON
    h.start_station_id = s.id
  WHERE
    h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
  stationstats AS (
  SELECT
    station_name,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(bikes_count) AS bikes_count,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name )
SELECT
  *
FROM
  stationstats
ORDER BY
  distance_from_city_center ASC

Query details

This query extracts data on cycle hires, including start_station_name, and duration and joins it against station information, including bike_count and distance-from-city-center. Then, it computes attributes of the station in stationstats, including the average duration of rides and the number of trips, and passes through the station attributes bike_count and distance_from_city_center.

This query uses the WITH clause to define subqueries. The query also uses the ST_DISTANCE and ST_GEOGPOINT BigQuery GIS functions. For more information on these functions, see Geography Functions. For more information on BigQuery GIS, see Introduction to BigQuery GIS.

Run the query

To run the query that compiles the training data for your model:

  1. In the BigQuery web UI, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    WITH
      hs AS (
      SELECT
        h.start_station_name AS station_name,
        h.duration,
        s.bikes_count,
        ST_DISTANCE(ST_GEOGPOINT(s.longitude,
            s.latitude),
          ST_GEOGPOINT(-0.1,
            51.5))/1000 AS distance_from_city_center
      FROM
        `bigquery-public-data.london_bicycles.cycle_hire` AS h
      JOIN
        `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON
        h.start_station_id = s.id
      WHERE
        h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
      stationstats AS (
      SELECT
        station_name,
        AVG(duration) AS duration,
        COUNT(duration) AS num_trips,
        MAX(bikes_count) AS bikes_count,
        MAX(distance_from_city_center) AS distance_from_city_center
      FROM
        hs
      GROUP BY
        station_name )
    SELECT
      *
    FROM
      stationstats
    ORDER BY
      distance_from_city_center ASC
    
  3. Click Run.

  4. When the query is complete, click the Results tab below the query text area. The results tab shows the columns you queried that are used to train your model: station_name, duration, num_trips, bikes_count, distance_from_city_center. The results should look like the following.

    Query results

Step three: Create a k-means model

Now that you have examined your training data, the next step is to create a k-means model using the data.

You can create and train a k-means model using the CREATE MODEL statement with the option model_type=kmeans. The following query adds a CREATE MODEL statement to the previous query and removes the id fields in the data.

CREATE OR REPLACE MODEL
  bqml_tutorial.london_station_clusters
OPTIONS
  (model_type='kmeans',
    num_clusters=4,
    standardize_features = TRUE) AS
WITH
  hs AS (
  SELECT
    h.start_station_name AS station_name,
    h.duration,
    s.bikes_count,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude,
        s.latitude),
      ST_GEOGPOINT(-0.1,
        51.5))/1000 AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON
    h.start_station_id = s.id
  WHERE
    h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
  stationstats AS (
  SELECT
    station_name,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(bikes_count) AS bikes_count,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name )
SELECT
  * EXCEPT(station_name,
    isweekday)
FROM
  stationstats

Query details

The CREATE MODEL statement specifies the desired number of clusters — four. In the SELECT statement, the EXCEPT clause excludes the station_name column because station_name is not a feature. The query creates a unique row per station_name, and only the features are mentioned in the SELECT statement.

Standardizing features is a good idea if your features have different dynamic ranges. Here, bikes_count is in the 10–50 range while num_trips is in the thousands so the standardize_features option is set to true. Setting standardize_features to true is the default setting so setting this is optional.

If you omit the num_clusters option, BigQuery ML will choose a reasonable default based on the total number of rows in the training data. You could also perform hyperparameter tuning to find a good number. To determine an optimal number of clusters, you would run the CREATE MODEL query for different values of num_clusters, find the error measure, and pick the point at which the error measure is at its minimum value. You can obtain the error measure by selecting your model and clicking the Training tab. The training tab shows the Davies–Bouldin index.

Training tab

Run the CREATE MODEL query

To run the query that creates your k-means model:

  1. In the BigQuery web UI, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    CREATE OR REPLACE MODEL
      bqml_tutorial.london_station_clusters
    OPTIONS
      (model_type='kmeans',
        num_clusters=4,
        standardize_features = TRUE) AS
    WITH
      hs AS (
      SELECT
        h.start_station_name AS station_name,
        h.duration,
        s.bikes_count,
        ST_DISTANCE(ST_GEOGPOINT(s.longitude,
            s.latitude),
          ST_GEOGPOINT(-0.1,
            51.5))/1000 AS distance_from_city_center
      FROM
        `bigquery-public-data.london_bicycles.cycle_hire` AS h
      JOIN
        `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON
        h.start_station_id = s.id
      WHERE
        h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
      stationstats AS (
      SELECT
        station_name,
        AVG(duration) AS duration,
        COUNT(duration) AS num_trips,
        MAX(bikes_count) AS bikes_count,
        MAX(distance_from_city_center) AS distance_from_city_center
      FROM
        hs
      GROUP BY
        station_name )
    SELECT
      * EXCEPT(station_name)
    FROM
      stationstats
    
  3. Click Run.

  4. In the BigQuery web UI, in the Resources section, expand [PROJECT_ID] > bqml_tutorial and then click london_station_clusters.

  5. Click the Schema tab. The model schema lists the four station attributes that BigQuery ML used to perform clustering. The schema should look like the following.

    Cluster schema info

Step four: Use the ML.PREDICT function to predict a station's cluster

To identify the cluster to which a particular station belongs, use the ML.PREDICT function. The following query predicts the cluster of every station that has the string "Kennington" in its name.

WITH
  hs AS (
  SELECT
    h.start_station_name AS station_name,
    h.duration,
    s.bikes_count,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude,
        s.latitude),
      ST_GEOGPOINT(-0.1,
        51.5))/1000 AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
  ON
    h.start_station_id = s.id
  WHERE
    h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
  stationstats AS (
  SELECT
    station_name,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(bikes_count) AS bikes_count,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name )
SELECT
  * EXCEPT(nearest_centroids_distance)
FROM
  ML.PREDICT( MODEL bqml_tutorial.london_station_clusters,
    (
    SELECT
      *
    FROM
      stationstats
    WHERE
      REGEXP_CONTAINS(station_name, 'Kennington')))

Query details

This query uses the REGEXP_CONTAINS function to find all entries in the station_name column that contain the string "Kennington". The ML.PREDICT function uses those values to predict which clusters would contain those stations.

Run the ML.PREDICT query

To run the ML.PREDICT query:

  1. In the BigQuery web UI, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    WITH
      hs AS (
      SELECT
        h.start_station_name AS station_name,
        h.duration,
        s.bikes_count,
        ST_DISTANCE(ST_GEOGPOINT(s.longitude,
            s.latitude),
          ST_GEOGPOINT(-0.1,
            51.5))/1000 AS distance_from_city_center
      FROM
        `bigquery-public-data.london_bicycles.cycle_hire` AS h
      JOIN
        `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON
        h.start_station_id = s.id
      WHERE
        h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
      stationstats AS (
      SELECT
        station_name,
        AVG(duration) AS duration,
        COUNT(duration) AS num_trips,
        MAX(bikes_count) AS bikes_count,
        MAX(distance_from_city_center) AS distance_from_city_center
      FROM
        hs
      GROUP BY
        station_name )
    SELECT
      * EXCEPT(nearest_centroids_distance)
    FROM
      ML.PREDICT( MODEL bqml_tutorial.london_station_clusters,
        (
        SELECT
          *
        FROM
          stationstats
        WHERE
          REGEXP_CONTAINS(station_name, 'Kennington')))
    
  3. Click Run.

  4. When the query is complete, click the Results tab below the query text area. The results should look like the following.

    ML.PREDICT results

Step five: Visualize your model

Each cluster has a value for each of the columns used as inputs. To retrieve these values, use the ML.CENTROIDS function, as in the following standard SQL query.

SELECT
  *
FROM
  ML.CENTROIDS(MODEL bqml_tutorial.london_station_clusters)
ORDER BY
  centroid_id

To run the ML.CENTROIDS query:

  1. In the BigQuery web UI, click the Compose new query button.

  2. Enter the following standard SQL query in the Query editor text area.

    SELECT
      *
    FROM
      ML.CENTROIDS(MODEL bqml_tutorial.london_station_clusters)
    ORDER BY
      centroid_id
    
  3. When the query is complete, click the Results tab below the query text area. The results should look like the following.

    ML.CENTROIDS results

  4. To more easily compare the respective values for each cluster, pivot the results of the above query using the UNNEST operator. Enter the following standard SQL query in the Query editor text area.

    WITH
      T AS (
      SELECT
        centroid_id,
        ARRAY_AGG(STRUCT(feature AS name,
            ROUND(numerical_value,1) AS value)
        ORDER BY
          centroid_id) AS cluster
      FROM
        ML.CENTROIDS(MODEL bqml_tutorial.london_station_clusters)
      GROUP BY
        centroid_id )
    SELECT
      CONCAT('Cluster#', CAST(centroid_id AS STRING)) AS centroid,
      (
      SELECT
        value
      FROM
        UNNEST(cluster)
      WHERE
        name = 'duration') AS duration,
      (
      SELECT
        value
      FROM
        UNNEST(cluster)
      WHERE
        name = 'num_trips') AS num_trips,
      (
      SELECT
        value
      FROM
        UNNEST(cluster)
      WHERE
        name = 'bikes_count') AS bikes_count,
      (
      SELECT
        value
      FROM
        UNNEST(cluster)
      WHERE
        name = 'distance_from_city_center') AS distance_from_city_center
    FROM
      T
    ORDER BY
      duration DESC
    
  5. Click Run.

  6. When the query is complete, click the Results tab below the query text area. The results should look like the following.

    Pivot results

Visualize results in Google Data Studio

To visualize the results in Google Data Studio:

  1. Above the query results, click Explore in Data Studio.

  2. Click Get Started to agree to the terms of service. You may not see this dialog if you have previously used Google Data Studio.

  3. In the chart picker, click Table with bars.

    Bar chart icon in the chart picker

  4. On the Data tab, verify that the Dimension is set to the centroid column.

    Dimension section in Google Data Studio

  5. In the Metric section, hover over the default metric, likely Record Count, and then click the X icon to delete it.

  6. Drag the following from the Available Fields section onto the Add Metric box: bikes_count, distance_from_city_center, duration, and num_trips. Your chart should look like the following.

    Completed chart

For help configuring chart components in Google Data Studio, see Add and configure components in the Google Data Studio Help Center.

Step six: Use your model to make data-driven decisions

Visualizing the results can help you to interpret the different clusters. In this example, Cluster#4 shows a relatively small number of trips, a moderate distance from the city center, and long rides. This indicates that the stations in Cluster#4 are less busy suburban stations.

Cluster#1 shows a greater distance from the city center and long rides but fewer trips. This indicates that these are other, less busy suburban stations.

Cluster#2 shows a larger number of trips and a closer distance to the city center. This indicates that these stations are relatively busy urban stations.

Cluster#3 shows a close distance to the city center and a very large number of trips. This indicates that these stations are very busy urban stations.

Based on these results, you can use the data to inform your decisions. For example:

  • Assume that you need to experiment with a new type of lock. Which cluster of stations should you choose as a subject for this experiment? The stations in Cluster#4 seem like a logical choice because those are the stations which have high numbers of bikes capable of supporting an A/B test.

  • Assume that you want to stock some stations with racing bikes. Which stations should you choose? Cluster#1 is the group of stations that are the greatest distance from the city center, and they have the longest trips. These are likely candidates for racing bikes.

  • You have received additional funding and can add capacity to your stations. Which stations should you target for additional bikes? Cluster#2 has the fewest number of bikes and the second highest number of trips. That makes these stations a likely target for additional capacity.

Cleaning up

To avoid incurring charges to your Google Cloud Platform account for the resources used in this tutorial:

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Deleting your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery web UI.

    Go to the BigQuery web UI

  2. In the navigation, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset and the model.

  4. In the Delete dataset dialog box, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Deleting your project

To delete the project:

  1. In the GCP Console, go to the Projects page.

    Go to the Projects page

  2. In the project list, select the project you want to delete and click Delete .
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Bu sayfayı yararlı buldunuz mu? Lütfen görüşünüzü bildirin:

Şunun hakkında geri bildirim gönderin...