Create a k-means model to cluster London bicycle hires dataset


BigQuery ML supports unsupervised learning. You can apply the k-means algorithm to group your data into clusters. Unlike supervised machine learning, which is about predictive analytics, unsupervised learning is about descriptive analytics: understanding your data so that you can make data-driven decisions.

In this tutorial, you use a k-means model in BigQuery ML to build clusters of data in the London Bicycle Hires public dataset. The London Bicycle Hires data contains records of hires of London's Santander Cycle Hire Scheme from 2011 to the present, including start and stop timestamps, station names, and ride duration.

The queries in this tutorial use Geography functions available in geospatial analytics. For more information on geospatial analytics, see Introduction to geospatial analytics.

Objectives

In this tutorial you:

  • Create a k-means clustering model.
  • Make data-driven decisions based on BigQuery ML's visualization of the clusters.

Costs

This tutorial uses billable components of Google Cloud, including the following:

  • BigQuery
  • BigQuery ML

For information on BigQuery costs, see the BigQuery pricing page.

For information on BigQuery ML costs, see BigQuery ML pricing.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. BigQuery is automatically enabled in new projects. To activate BigQuery in a pre-existing project, enable the BigQuery API.

    Enable the API

Introduction

Your data might contain natural groupings or clusters of data, and you might want to identify these groupings descriptively in order to make data-driven decisions. For example, if you are a retailer you might want to identify natural groupings of customers who have similar purchasing habits or locations. This process is known as customer segmentation.

Data you use to perform customer segmentation might include the store they visited, what items they bought, and how much they paid. You would create a model to try to understand what these groups of customer personas are like so that you can design items that appeal to group members.

You could also find product groups among the items purchased. In this case, you would cluster items based on who purchased them, when they were purchased, where they were purchased, and other similar characteristics. You would create a model to determine the characteristics of a product group so that you can make informed decisions such as how to improve cross-selling.
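The clustering technique itself can be sketched outside BigQuery. The following minimal, pure-Python version of the k-means procedure (Lloyd's algorithm) runs on made-up two-dimensional points; it isn't BigQuery ML code, just an illustration of the assign-then-update loop that BigQuery ML runs at scale over your feature columns:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Minimal Lloyd's algorithm: assign each point to the nearest
    centroid, then recompute each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids

# Two obvious groups: points near (0, 0) and points near (10, 10).
points = [(0.1, 0.2), (0.0, -0.1), (0.2, 0.0),
          (10.1, 9.9), (9.8, 10.2), (10.0, 10.0)]
centroids = sorted(kmeans(points, k=2))
```

Each iteration alternates the two steps until the centroids stop moving; with these toy points, the two centroids settle near (0, 0) and (10, 10).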

In this tutorial, you use BigQuery ML to create a k-means model that clusters London Bicycle Hires data based on bike station attributes.

Creating your k-means model consists of the following steps.

  • Step one: Create a dataset to store your model.
    The first step is to create a dataset that stores your model.
  • Step two: Examine your training data.
    The next step is to examine the data you use to train your clustering model by running a query against the london_bicycles table. Because k-means is an unsupervised learning technique, model training doesn't require labels, nor does it require you to split the data into training and evaluation sets.
  • Step three: Create a k-means model.
    Step three is to create your k-means model. When you create the model, the clustering field is station_name, and you cluster the data based on station attributes, for example, the distance of the station from the city center.
  • Step four: Use the ML.PREDICT function to predict a station's cluster.
    Next, you use the ML.PREDICT function to predict the cluster for a given set of stations. You predict clusters for all station names that contain the string Kennington.
  • Step five: Use your model to make data-driven decisions.
    The final step is to use the model to make data-driven decisions. For example, based on the model results, you can determine which stations would benefit from extra capacity.

Step one: Create your dataset

Create a BigQuery dataset to store your ML model:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to the BigQuery page

  2. In the Explorer pane, click your project name.

  3. Click View actions > Create dataset.

    Create dataset.

  4. On the Create dataset page, do the following:

    • For Dataset ID, enter bqml_tutorial.

    • For Location type, select Multi-region, and then select EU (multiple regions in European Union).

      The London Bicycle Hires public dataset is stored in the EU multi-region. Your dataset must be in the same location.

    • Leave the remaining default settings as they are, and click Create dataset.

      Create dataset page.

Step two: Examine your training data

Next, examine the data used to train your k-means model. In this tutorial, you cluster bike stations based on the following attributes:

  • Duration of rentals
  • Number of trips per day
  • Distance from city center

SQL

The following GoogleSQL query examines the data that's used to train your k-means model.

#standardSQL
WITH
hs AS (
SELECT
  h.start_station_name AS station_name,
  IF
  (EXTRACT(DAYOFWEEK
    FROM
      h.start_date) = 1
    OR EXTRACT(DAYOFWEEK
    FROM
      h.start_date) = 7,
    "weekend",
    "weekday") AS isweekday,
  h.duration,
  ST_DISTANCE(ST_GEOGPOINT(s.longitude,
      s.latitude),
    ST_GEOGPOINT(-0.1,
      51.5))/1000 AS distance_from_city_center
FROM
  `bigquery-public-data.london_bicycles.cycle_hire` AS h
JOIN
  `bigquery-public-data.london_bicycles.cycle_stations` AS s
ON
  h.start_station_id = s.id
WHERE
  h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
  AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
stationstats AS (
SELECT
  station_name,
  isweekday,
  AVG(duration) AS duration,
  COUNT(duration) AS num_trips,
  MAX(distance_from_city_center) AS distance_from_city_center
FROM
  hs
GROUP BY
  station_name, isweekday )
SELECT
*
FROM
stationstats
ORDER BY
distance_from_city_center ASC

Query details

This query extracts data on cycle hires, including start_station_name and duration, and joins it against station information to compute the station attribute distance_from_city_center. Then, it computes attributes of the station in stationstats, including the average duration of rides and the number of trips, and passes through distance_from_city_center.

This query uses the WITH clause to define subqueries. The query also uses the ST_DISTANCE and ST_GEOGPOINT geospatial analytics functions. For more information about these functions, see Geography functions. For more information about geospatial analytics, see Introduction to geospatial analytics.
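ST_DISTANCE returns the distance in meters between two geography points, which the query divides by 1000 to get kilometers. To get a feel for the computation, here is a rough local stand-in using the haversine formula. This is an approximation on a spherical Earth (BigQuery computes the geodesic on the WGS84 ellipsoid, so results differ slightly), and the station coordinate below is made up for illustration:

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in kilometers on a spherical Earth."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Distance of a hypothetical station at (-0.12, 51.51) from the
# (-0.1, 51.5) point the query uses as the city center.
distance = haversine_km(-0.12, 51.51, -0.1, 51.5)
```

A station roughly 0.01 degrees north and 0.02 degrees west of the reference point comes out at a bit under 2 km, matching the scale of the distance_from_city_center values the query produces.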

Run the query

The following query compiles your training data, and is also used in the CREATE MODEL statement later in this tutorial.

To run the query:

  1. Go to the BigQuery page.

Go to BigQuery

  2. In the editor pane, run the following SQL statement:

    WITH
      hs AS (
      SELECT
        h.start_station_name AS station_name,
        IF
        (EXTRACT(DAYOFWEEK
          FROM
            h.start_date) = 1
          OR EXTRACT(DAYOFWEEK
          FROM
            h.start_date) = 7,
          "weekend",
          "weekday") AS isweekday,
        h.duration,
        ST_DISTANCE(ST_GEOGPOINT(s.longitude,
            s.latitude),
          ST_GEOGPOINT(-0.1,
            51.5))/1000 AS distance_from_city_center
      FROM
        `bigquery-public-data.london_bicycles.cycle_hire` AS h
      JOIN
        `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON
        h.start_station_id = s.id
      WHERE
        h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
      stationstats AS (
      SELECT
        station_name,
        isweekday,
        AVG(duration) AS duration,
        COUNT(duration) AS num_trips,
        MAX(distance_from_city_center) AS distance_from_city_center
      FROM
        hs
      GROUP BY
        station_name, isweekday )
    SELECT
      *
    FROM
      stationstats
    ORDER BY
      distance_from_city_center ASC
  3. When the query is complete, click the Results tab below the query text area. The Results tab shows the columns you queried that are used to train your model: station_name, isweekday, duration, num_trips, and distance_from_city_center. The results should look like the following.

Query results

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.

import datetime

import pandas as pd

import bigframes
import bigframes.pandas as bpd

bigframes.options.bigquery.project = your_gcp_project_id
# Compute in the EU multi-region to query the London bicycles dataset.
bigframes.options.bigquery.location = "EU"

# Extract the information you'll need to train the k-means model in this
# tutorial. Use the read_gbq function to represent cycle hires
# data as a DataFrame.
h = bpd.read_gbq(
    "bigquery-public-data.london_bicycles.cycle_hire",
    col_order=["start_station_name", "start_station_id", "start_date", "duration"],
).rename(
    columns={
        "start_station_name": "station_name",
        "start_station_id": "station_id",
    }
)

s = bpd.read_gbq(
    # Use ST_GEOPOINT and ST_DISTANCE to analyze geographical
    # data. These functions determine spatial relationships between
    # geographical features.
    """
    SELECT
    id,
    ST_DISTANCE(
        ST_GEOGPOINT(s.longitude, s.latitude),
        ST_GEOGPOINT(-0.1, 51.5)
    ) / 1000 AS distance_from_city_center
    FROM
    `bigquery-public-data.london_bicycles.cycle_stations` s
    """
)

# Define Python datetime objects in the UTC timezone for range comparison,
# because BigQuery stores timestamp data in the UTC timezone.
sample_time = datetime.datetime(2015, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)
sample_time2 = datetime.datetime(2016, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)

h = h.loc[(h["start_date"] >= sample_time) & (h["start_date"] <= sample_time2)]

# Replace each day-of-the-week number with the corresponding "weekday" or
# "weekend" label by using the Series.map method.
h = h.assign(
    isweekday=h.start_date.dt.dayofweek.map(
        {
            0: "weekday",
            1: "weekday",
            2: "weekday",
            3: "weekday",
            4: "weekday",
            5: "weekend",
            6: "weekend",
        }
    )
)

# Supplement each trip in "h" with the station distance information from
# "s" by merging the two DataFrames by station ID.
merged_df = h.merge(
    right=s,
    how="inner",
    left_on="station_id",
    right_on="id",
)

# Engineer features to cluster the stations. For each station, find the
# average trip duration, number of trips, and distance from city center.
stationstats = merged_df.groupby(["station_name", "isweekday"]).agg(
    {"duration": ["mean", "count"], "distance_from_city_center": "max"}
)
stationstats.columns = pd.Index(
    ["duration", "num_trips", "distance_from_city_center"]
)
stationstats = stationstats.sort_values(
    by="distance_from_city_center", ascending=True
).reset_index()

# Expected output results: >>> stationstats.head(3)
# station_name	isweekday duration  num_trips	distance_from_city_center
# Borough Road...	weekday	    1110	    5749	    0.12624
# Borough Road...	weekend	    2125	    1774	    0.12624
# Webber Street...	weekday	    795	        6517	    0.164021
#   3 rows × 5 columns

Step three: Create a k-means model

Now that you have examined your training data, the next step is to create a k-means model using the data.

SQL

You can create and train a k-means model using the CREATE MODEL statement with the option model_type=kmeans.

Query details

The CREATE MODEL statement specifies the number of clusters to use: four. In the SELECT statement, the EXCEPT clause excludes the station_name column because station_name isn't a feature. The query creates a unique row per station_name, and only the features are mentioned in the SELECT statement.

If you omit the num_clusters option, BigQuery ML chooses a reasonable default based on the total number of rows in the training data. You could also perform hyperparameter tuning to find a good number. To determine an optimal number of clusters, you would run the CREATE MODEL query for different values of num_clusters, find the error measure, and pick the point at which the error measure is at its minimum value. You can obtain the error measure by selecting your model and clicking the Evaluation tab. This tab shows the Davies–Bouldin index.
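The sweep over num_clusters described above can be sketched locally. This example assumes scikit-learn is available, and it uses scikit-learn's davies_bouldin_score (the same error measure the Evaluation tab reports; lower is better) on synthetic stand-in data rather than the bicycle dataset:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Toy stand-in for the station features: three well-separated groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# Fit one model per candidate cluster count and record the error measure.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)  # lower is better

# Pick the cluster count with the minimum error measure.
best_k = min(scores, key=scores.get)
```

Because the synthetic data has three real groups, the Davies–Bouldin index bottoms out at k=3; with your own data, you would run the same sweep over CREATE MODEL queries and compare the values shown on the Evaluation tab.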

Evaluation tab

Run the query

The following query adds a CREATE MODEL statement to the query you used to examine the training data, and removes the identifier field (station_name) from the data.

To run the query and create a k-means model:

  1. Go to the BigQuery page.

Go to BigQuery

  2. In the editor pane, run the following SQL statement:

    CREATE OR REPLACE MODEL `bqml_tutorial.london_station_clusters`
      OPTIONS(model_type='kmeans', num_clusters=4) AS
    WITH
      hs AS (
      SELECT
        h.start_station_name AS station_name,
      IF
        (EXTRACT(DAYOFWEEK
          FROM
            h.start_date) = 1
          OR EXTRACT(DAYOFWEEK
          FROM
            h.start_date) = 7,
          "weekend",
          "weekday") AS isweekday,
        h.duration,
        ST_DISTANCE(ST_GEOGPOINT(s.longitude,
            s.latitude),
          ST_GEOGPOINT(-0.1,
            51.5))/1000 AS distance_from_city_center
      FROM
        `bigquery-public-data.london_bicycles.cycle_hire` AS h
      JOIN
        `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON
        h.start_station_id = s.id
      WHERE
        h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
      stationstats AS (
      SELECT
        station_name,
        isweekday,
        AVG(duration) AS duration,
        COUNT(duration) AS num_trips,
        MAX(distance_from_city_center) AS distance_from_city_center
      FROM
        hs
      GROUP BY
        station_name, isweekday)
    SELECT
      * EXCEPT(station_name)
    FROM
      stationstats
     
  3. In the navigation panel, in the Resources section, expand your project name, click bqml_tutorial, and then click london_station_clusters.

  4. Click the Schema tab. The model schema lists the four station attributes that BigQuery ML used to perform clustering. The schema should look like the following.

Cluster schema info

  5. Click the Evaluation tab. This tab displays visualizations of the clusters identified by the k-means model. Under Numeric features, bar graphs display up to 10 of the most important numeric feature values for each centroid. You can select which features to visualize from the drop-down menu.

Numeric feature graphs

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


from bigframes.ml.cluster import KMeans

# To determine an optimal number of clusters, construct and fit several
# K-Means objects with different values of num_clusters, find the error
# measure, and pick the point at which the error measure is at its minimum
# value.
cluster_model = KMeans(n_clusters=4)
cluster_model.fit(stationstats)
cluster_model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.london_station_clusters"
    replace=True,
)

Step four: Use the ML.PREDICT function to predict a station's cluster

To identify the cluster to which a particular station belongs, use the ML.PREDICT SQL function or the BigQuery DataFrames predict function.

SQL

Query details

This query uses the REGEXP_CONTAINS function to find all entries in the station_name column that contain the string "Kennington". The ML.PREDICT function uses those values to predict which clusters would contain those stations.
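REGEXP_CONTAINS returns true when the regular expression matches anywhere in the value, much like Python's re.search. A quick local sketch over hypothetical station names:

```python
import re

# Hypothetical station names for illustration.
station_names = [
    "Kennington Road Post Office",
    "Borough Road, Elephant & Castle",
    "Kennington Lane Rail Bridge",
]

# Keep only names where the pattern matches anywhere in the string,
# mirroring REGEXP_CONTAINS(station_name, 'Kennington') in the query.
kennington = [name for name in station_names if re.search(r"Kennington", name)]
```

Only the two names containing "Kennington" survive the filter; ML.PREDICT then scores just those rows.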

Run the query

The following query predicts the cluster of every station that has the string "Kennington" in its name.

To run the ML.PREDICT query:

  1. Go to the BigQuery page.

Go to BigQuery

  2. In the editor pane, run the following SQL statement:

    WITH
      hs AS (
      SELECT
        h.start_station_name AS station_name,
        IF
        (EXTRACT(DAYOFWEEK
          FROM
            h.start_date) = 1
          OR EXTRACT(DAYOFWEEK
          FROM
            h.start_date) = 7,
          "weekend",
          "weekday") AS isweekday,
        h.duration,
        ST_DISTANCE(ST_GEOGPOINT(s.longitude,
            s.latitude),
          ST_GEOGPOINT(-0.1,
            51.5))/1000 AS distance_from_city_center
      FROM
        `bigquery-public-data.london_bicycles.cycle_hire` AS h
      JOIN
        `bigquery-public-data.london_bicycles.cycle_stations` AS s
      ON
        h.start_station_id = s.id
      WHERE
        h.start_date BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
        AND CAST('2016-01-01 00:00:00' AS TIMESTAMP) ),
      stationstats AS (
      SELECT
        station_name,
        isweekday,
        AVG(duration) AS duration,
        COUNT(duration) AS num_trips,
        MAX(distance_from_city_center) AS distance_from_city_center
      FROM
        hs
      GROUP BY
        station_name, isweekday )
    SELECT
      * EXCEPT(nearest_centroids_distance)
    FROM
      ML.PREDICT( MODEL `bqml_tutorial.london_station_clusters`,
        (
        SELECT
          *
        FROM
          stationstats
        WHERE
          REGEXP_CONTAINS(station_name, 'Kennington')))
  3. When the query is complete, click the Results tab below the query text area. The results should look like the following.

    ML.PREDICT results

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.


# Select model you'll use for predictions. `read_gbq_model` loads model
# data from BigQuery, but you could also use the `cluster_model` object
# from previous steps.
cluster_model = bpd.read_gbq_model(
    your_model_id,
    # For example: "bqml_tutorial.london_station_clusters",
)

# Use 'contains' function to filter by stations containing the string
# "Kennington".
stationstats = stationstats.loc[
    stationstats["station_name"].str.contains("Kennington")
]

result = cluster_model.predict(stationstats)

# Expected output results: >>> result.peek(3)
# CENTROID...	NEAREST...	station_name  isweekday	 duration num_trips dist...
# 	1	[{'CENTROID_ID'...	Borough...	  weekday	  1110	    5749	0.13
# 	2	[{'CENTROID_ID'...	Borough...	  weekend	  2125      1774	0.13
# 	1	[{'CENTROID_ID'...	Webber...	  weekday	  795	    6517	0.16
#   3 rows × 7 columns

Step five: Use your model to make data-driven decisions

Evaluation results can help you to interpret the different clusters. In the following example, centroid 3 shows a busy city station that is close to the city center. Centroid 2 shows a less busy city station that is used for longer-duration rentals. Centroid 1 shows a less busy city station with shorter-duration rentals, and centroid 4 shows a suburban station with longer trips.

Numeric feature graphs

Based on these results, you can use the data to inform your decisions. For example:

  • Assume that you need to experiment with a new type of lock. Which cluster of stations should you choose as a subject for this experiment? The stations in centroid 1, centroid 2, or centroid 4 seem like logical choices because they are not the busiest stations.

  • Assume that you want to stock some stations with racing bikes. Which stations should you choose? Centroid 4 is the group of stations that are far from the city center, and they have the longest trips. These are likely candidates for racing bikes.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

  • You can delete the project you created.
  • Or you can keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

  1. If necessary, open the BigQuery page in the Google Cloud console.

    Go to the BigQuery page

  2. In the Explorer pane, click the bqml_tutorial dataset you created.

  3. Click Delete dataset on the right side of the window. This action deletes the dataset and the model.

  4. In the Delete dataset dialog, confirm the delete command by typing the name of your dataset (bqml_tutorial) and then click Delete.

Delete your project

To delete the project:

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next