
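Create a k-means model to cluster the London bicycle hires dataset

This tutorial shows how to use a k-means model in BigQuery ML to identify clusters in a set of data.

The k-means algorithm, which groups data into clusters, is a form of unsupervised machine learning. Supervised machine learning is about predictive analytics, whereas unsupervised machine learning is about descriptive analytics. Unsupervised machine learning can help you understand your data so that you can make data-driven decisions.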
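To illustrate what grouping data into clusters means, the following is a minimal sketch of a basic k-means loop in plain Python. It is not how BigQuery ML implements k-means. The toy points, the `kmeans` function name, and the choice to use the first `k` points as the starting centroids are all assumptions made for illustration.

```python
import math


def kmeans(points, k, iterations=10):
    """Cluster 2-D points with a basic assign-then-update k-means loop."""
    # Assumption: use the first k points as the initial centroids so the
    # result is deterministic. Real implementations choose them more carefully.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # Keep the old centroid if a cluster is empty.
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical toy data: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

No labels are supplied to the algorithm. The groups emerge only from the distances between points, which is what makes the approach unsupervised.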

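The queries in this tutorial use geography functions that are available in geospatial analytics. For more information, see Introduction to geospatial analytics.

This tutorial uses the London Bicycle Hires public dataset. The data includes start and stop timestamps, station names, and ride duration.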

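Objectives

This tutorial guides you through the following tasks:

Examine the data used to train the model.
Create a k-means clustering model.
Interpret the generated data clusters by using BigQuery ML's visualization of the clusters.
Run the ML.PREDICT function against the k-means model to predict the likely cluster for a set of bike hire stations.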

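Costs

This tutorial uses billable components of Google Cloud, including the following:

BigQuery
BigQuery ML

For information about BigQuery costs, see the BigQuery pricing page.

For information about BigQuery ML costs, see BigQuery ML pricing.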

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.


Make sure that billing is enabled for your Google Cloud project.

BigQuery is automatically enabled in new projects. To activate BigQuery in an existing project, enable the BigQuery API.

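Required permissions

To create the dataset, you need the bigquery.datasets.create IAM permission.
To create the model, you need the following permissions:
- bigquery.jobs.create
- bigquery.models.create
- bigquery.models.getData
- bigquery.models.updateData
To run inference, you need the following permissions:
- bigquery.models.getData
- bigquery.jobs.create

For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.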

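Create a dataset

Create a BigQuery dataset to store your k-means model:

In the Google Cloud console, go to the BigQuery page.

In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter bqml_tutorial.
- For Location type, select Multi-region, and then select EU (multiple regions in European Union).
  
  The London Bicycle Hires public dataset is stored in the EU multi-region. Your dataset must be in the same location.
- Leave the remaining default settings as they are, and then click Create dataset.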

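Examine the training data

Examine the data that you'll use to train your k-means model. In this tutorial, you cluster bike stations based on the following attributes:

Duration of rentals
Number of trips per day
Distance from the city center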

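You can examine the training data by using either SQL or BigQuery DataFrames. The SQL approach is described first.

The query extracts data on cycle hires, including the start_station_name and duration columns, and joins this data with station information. As part of the join, it creates a calculated column that contains each station's distance from the city center. The query then computes the attributes of each station in the stationstats common table expression (CTE): the average ride duration, the number of trips, and the calculated distance_from_city_center column.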
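The distance calculation can be sketched outside BigQuery to show what the calculated column represents. The following example approximates the distance in kilometers with the haversine formula on a sphere. The `distance_km` helper, the Earth radius constant, and the sample station coordinates are assumptions for illustration, and the result can differ slightly from what ST_DISTANCE returns.

```python
import math

# Assumption: mean Earth radius in kilometers for a spherical approximation.
EARTH_RADIUS_KM = 6371.0


def distance_km(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance between two lon/lat points in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The reference point that the tutorial's query treats as the city center.
CITY_CENTER = (-0.1, 51.5)  # (longitude, latitude)

# Hypothetical station coordinates, for illustration only.
station = (-0.12, 51.51)
print(round(distance_km(*station, *CITY_CENTER), 2))
```

As with ST_GEOGPOINT, the helper takes longitude first and latitude second. Swapping the two is a common source of wrong distances.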

Follow these steps to examine the training data:

In the Google Cloud console, go to the BigQuery page.

In the query editor, paste the following query, and then click Run:

WITH
hs AS (
  SELECT
    h.start_station_name AS station_name,
    IF(
      EXTRACT(DAYOFWEEK FROM h.start_date) = 1
        OR EXTRACT(DAYOFWEEK FROM h.start_date) = 7,
      'weekend',
      'weekday') AS isweekday,
    h.duration,
    ST_DISTANCE(ST_GEOGPOINT(s.longitude, s.latitude), ST_GEOGPOINT(-0.1, 51.5)) / 1000
      AS distance_from_city_center
  FROM
    `bigquery-public-data.london_bicycles.cycle_hire` AS h
  JOIN
    `bigquery-public-data.london_bicycles.cycle_stations` AS s
    ON
      h.start_station_id = s.id
  WHERE
    h.start_date
    BETWEEN CAST('2015-01-01 00:00:00' AS TIMESTAMP)
    AND CAST('2016-01-01 00:00:00' AS TIMESTAMP)
),
stationstats AS (
  SELECT
    station_name,
    isweekday,
    AVG(duration) AS duration,
    COUNT(duration) AS num_trips,
    MAX(distance_from_city_center) AS distance_from_city_center
  FROM
    hs
  GROUP BY
    station_name, isweekday
)
SELECT *
FROM
stationstats
ORDER BY
distance_from_city_center ASC;

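The query returns one row for each combination of station and day type (weekday or weekend), ordered by distance from the city center.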
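To make the shape of the stationstats output concrete, the following sketch repeats the same grouping and aggregation in plain Python on a handful of rows. The station names, durations, and distances are hypothetical values, not data from the public dataset.

```python
from collections import defaultdict

# Hypothetical trips: (station_name, isweekday, duration, distance_from_city_center).
trips = [
    ("Station A", "weekday", 600, 0.5),
    ("Station A", "weekday", 900, 0.5),
    ("Station A", "weekend", 1800, 0.5),
    ("Station B", "weekday", 300, 2.0),
]

# Group trips by (station_name, isweekday), like the GROUP BY clause.
groups = defaultdict(list)
for name, day_type, duration, distance in trips:
    groups[(name, day_type)].append((duration, distance))

stationstats = []
for (name, day_type), rows in groups.items():
    durations = [d for d, _ in rows]
    stationstats.append(
        {
            "station_name": name,
            "isweekday": day_type,
            "duration": sum(durations) / len(durations),  # AVG(duration)
            "num_trips": len(durations),  # COUNT(duration)
            # MAX(distance_from_city_center)
            "distance_from_city_center": max(x for _, x in rows),
        }
    )

# Order by distance from the city center, like the ORDER BY clause.
stationstats.sort(key=lambda r: r["distance_from_city_center"])
for row in stationstats:
    print(row)
```

Each station can appear twice in the output, once for weekdays and once for weekends, so the model later sees weekday and weekend behavior as separate rows.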

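To examine the same data with BigQuery DataFrames instead of SQL, use the following sample. Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.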

import datetime

import pandas as pd

import bigframes
import bigframes.pandas as bpd

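# Placeholder value: replace it with your own Google Cloud project ID.
your_gcp_project_id = "your-project-id"
bigframes.options.bigquery.project = your_gcp_project_id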
# Compute in the EU multi-region to query the London bicycles dataset.
bigframes.options.bigquery.location = "EU"

# Extract the information you'll need to train the k-means model in this
# tutorial. Use the read_gbq function to represent cycle hires
# data as a DataFrame.
h = bpd.read_gbq(
    "bigquery-public-data.london_bicycles.cycle_hire",
    col_order=["start_station_name", "start_station_id", "start_date", "duration"],
).rename(
    columns={
        "start_station_name": "station_name",
        "start_station_id": "station_id",
    }
)

s = bpd.read_gbq(
    # Use ST_GEOGPOINT and ST_DISTANCE to analyze geographical
    # data. These functions determine spatial relationships between
    # geographical features.
    """
    SELECT
    id,
    ST_DISTANCE(
        ST_GEOGPOINT(s.longitude, s.latitude),
        ST_GEOGPOINT(-0.1, 51.5)
    ) / 1000 AS distance_from_city_center
    FROM
    `bigquery-public-data.london_bicycles.cycle_stations` s
    """
)

# Define Python datetime objects in the UTC timezone for range comparison,
# because BigQuery stores timestamp data in the UTC timezone.
sample_time = datetime.datetime(2015, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)
sample_time2 = datetime.datetime(2016, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)

h = h.loc[(h["start_date"] >= sample_time) & (h["start_date"] <= sample_time2)]

# Replace each day-of-the-week number with the corresponding "weekday" or
# "weekend" label by using the Series.map method.
h = h.assign(
    isweekday=h.start_date.dt.dayofweek.map(
        {
            0: "weekday",
            1: "weekday",
            2: "weekday",
            3: "weekday",
            4: "weekday",
            5: "weekend",
            6: "weekend",
        }
    )
)

# Supplement each trip in "h" with the station distance information from
# "s" by merging the two DataFrames by station ID.
merged_df = h.merge(
    right=s,
    how="inner",
    left_on="station_id",
    right_on="id",
)

# Engineer features to cluster the stations. For each station, find the
# average trip duration, number of trips, and distance from city center.
stationstats = merged_df.groupby(["station_name", "isweekday"]).agg(
    {"duration": ["mean", "count"], "distance_from_city_center": "max"}
)
stationstats.columns = pd.Index(
    ["duration", "num_trips", "distance_from_city_center"]
)
stationstats = stationstats.sort_values(
    by="distance_from_city_center", ascending=True
).reset_index()

# Expected output results: >>> stationstats.head(3)
# station_name	isweekday duration  num_trips	distance_from_city_center
# Borough Road...	weekday	    1110	    5749	    0.12624
# Borough Road...	weekend	    2125	    1774	    0.12624
# Webber Street...	weekday	    795	        6517	    0.164021
#   3 rows × 5 columns
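The SQL query and the DataFrames sample label weekends with different day numbers: the SQL version treats days 1 and 7 as the weekend, and the DataFrames version treats days 5 and 6 as the weekend. This is because DAYOFWEEK in BigQuery counts from 1 (Sunday) to 7 (Saturday), while the pandas-style dayofweek counts from 0 (Monday) to 6 (Sunday). The following sketch uses only the Python standard library to check that the two labelings agree. The helper function names are hypothetical.

```python
import datetime


def bigquery_dayofweek(d):
    """Return the day number in the 1 (Sunday) to 7 (Saturday) convention."""
    # date.isoweekday() is 1 (Monday) to 7 (Sunday); shift so Sunday becomes 1.
    return d.isoweekday() % 7 + 1


def label_sql_style(d):
    """Label a date the way the SQL query does: days 1 and 7 are weekend."""
    return "weekend" if bigquery_dayofweek(d) in (1, 7) else "weekday"


def label_pandas_style(d):
    """Label a date the way the DataFrames sample does: days 5 and 6 are weekend."""
    # date.weekday() is 0 (Monday) to 6 (Sunday), like the pandas dayofweek.
    return "weekend" if d.weekday() in (5, 6) else "weekday"


# Check one full week, starting on Monday, January 5, 2015.
start = datetime.date(2015, 1, 5)
for offset in range(7):
    day = start + datetime.timedelta(days=offset)
    assert label_sql_style(day) == label_pandas_style(day)
    print(day.isoformat(), label_sql_style(day))
```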