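Create machine learning models in BigQuery ML by using SQL

This tutorial shows you how to create a logistic regression model by using BigQuery ML SQL queries.

BigQuery ML lets you create and train machine learning models in BigQuery by using SQL queries. This makes machine learning more approachable by letting you use familiar tools such as the BigQuery SQL editor, and it speeds up development because you don't need to move data into a separate machine learning environment.

In this tutorial, you use the sample Google Analytics dataset for BigQuery to create a model that predicts whether a website visitor will make a transaction. For information about the schema of the Analytics dataset, see BigQuery Export schema in the Analytics Help Center.

To learn how to create models by using the Google Cloud console user interface, see Work with models using a UI. (Preview)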

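Objectives

This tutorial shows you how to perform the following tasks:

Use the CREATE MODEL statement to create a binary logistic regression model.
Use the ML.EVALUATE function to evaluate the model.
Use the ML.PREDICT function to make predictions by using the model.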

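Costs

This tutorial uses billable components of Google Cloud, including the following:

BigQuery
BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see BigQuery ML pricing.

Required roles

To create a model and run inference, you must be granted the following roles: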
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery User (roles/bigquery.user)

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Make sure that you have the following role or roles on the project: BigQuery Data Editor, BigQuery Job User, Service Usage Admin

Check for the roles

In the Google Cloud console, go to the IAM page.
Go to IAM
Select the project.
In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

In the Google Cloud console, go to the IAM page.
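Go to IAM
Select the project.
Click Grant access.
In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
In the Select a role list, select a role.
To grant additional roles, click Add another role and add each additional role.
Click Save.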

BigQuery is automatically enabled in new projects. To activate BigQuery in a preexisting project, enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Enable the API

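Create a dataset

Create a BigQuery dataset to store your ML model.

Console

In the Google Cloud console, go to the BigQuery page.

Go to the BigQuery page
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter bqml_tutorial.
- For Location type, select Multi-region, and then select US (multiple regions in United States).
- Leave the remaining default settings as they are, and click Create dataset.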

bq

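To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset: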
```
bq --location=US mk -d \
 --description "BigQuery ML tutorial dataset." \
 bqml_tutorial
```
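Instead of using the --dataset flag, the command uses the -d shortcut. If you omit -d and --dataset, the command defaults to creating a dataset.
Confirm that the dataset was created: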
```
bq ls
```

API

Call the datasets.insert method with a defined dataset resource.

{
  "datasetReference": {
     "datasetId": "bqml_tutorial"
  }
}
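The request can be sketched in Python as follows. This is a minimal illustration that only builds the URL and JSON body for a datasets.insert call and sends nothing. The project ID `my-project` and the helper name `build_insert_request` are hypothetical, and the optional `location` field is added here to match the US location used in the other tabs.

```python
import json

# Hypothetical project ID, used only for illustration.
PROJECT_ID = "my-project"


def build_insert_request(project_id, dataset_id, location="US"):
    """Return the URL and JSON body for a datasets.insert call (nothing is sent)."""
    url = (
        "https://bigquery.googleapis.com/bigquery/v2/"
        f"projects/{project_id}/datasets"
    )
    body = {
        "datasetReference": {"projectId": project_id, "datasetId": dataset_id},
        # Optional field; set here to match the location used elsewhere in this tutorial.
        "location": location,
    }
    return url, json.dumps(body, indent=2)


url, body = build_insert_request(PROJECT_ID, "bqml_tutorial")
print("POST", url)
print(body)
```

An authenticated HTTP client would send this body as a POST request to the printed URL.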

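BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.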

import google.cloud.bigquery

bqclient = google.cloud.bigquery.Client()
bqclient.create_dataset("bqml_tutorial", exists_ok=True)

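Create a logistic regression model

Create a logistic regression model by using the Analytics sample dataset for BigQuery.

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement: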

CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

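The query takes several minutes to complete. After the first iteration is complete, your model (sample_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you don't see query results.

Query details

The CREATE MODEL statement creates the model and then trains it by using the data retrieved by your query's SELECT statement.

The OPTIONS(model_type='logistic_reg') clause creates a logistic regression model. A logistic regression model splits input data into two classes, and then estimates the probability that the data is in one of the classes. What you are trying to detect, such as whether an email is spam, is represented by 1, and other values are represented by 0. The likelihood of a given value belonging to the class you are trying to detect is indicated by a value between 0 and 1. For example, if an email receives a probability estimate of 0.9, then there is a 90% probability that the email is spam.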
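As a rough illustration of how a score becomes a probability and then a class, the following sketch applies the logistic (sigmoid) function to a single numeric feature. The weight, the bias, and the 0.5 decision threshold are assumptions chosen for the example. They are not values taken from the trained model.

```python
import math


def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Return 1 (the class being detected) at or above the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical weight and bias for one feature (pageviews); not learned values.
weight, bias = 0.3, -4.0

for pageviews in (1, 10, 25):
    probability = sigmoid(weight * pageviews + bias)
    print(pageviews, round(probability, 3), classify(probability))
```

A real model learns one weight per feature, including one per category for string features such as os and country, but the final step from score to probability follows the same shape.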

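This query's SELECT statement retrieves the following columns, which the model uses to predict the probability that a customer will complete a transaction:

totals.transactions: the total number of ecommerce transactions within the session. If the number of transactions is NULL, the value in the label column is set to 0. Otherwise, it is set to 1. These values represent the possible outcomes. Creating an alias named label is an alternative to setting the input_label_cols= option in the CREATE MODEL statement.
device.operatingSystem: the operating system of the visitor's device.
device.isMobile: indicates whether the visitor's device is a mobile device.
geoNetwork.country: the country from which the session originated, based on the IP address.
totals.pageviews: the total number of page views within the session.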
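The column logic above can be mirrored in plain Python to make the NULL handling explicit. This is a sketch only: the two session dictionaries are made-up records shaped like the nested Analytics fields, and `if_null` and `to_training_row` are hypothetical helper names.

```python
def if_null(value, default):
    """Mimic SQL IFNULL: substitute the default only when the value is None."""
    return default if value is None else value


def to_training_row(session):
    """Turn one nested session record into the flat row used for training."""
    totals = session["totals"]
    device = session["device"]
    geo = session["geoNetwork"]
    return {
        # IF(totals.transactions IS NULL, 0, 1) AS label
        "label": 0 if totals["transactions"] is None else 1,
        "os": if_null(device["operatingSystem"], ""),
        "is_mobile": device["isMobile"],
        "country": if_null(geo["country"], ""),
        "pageviews": if_null(totals["pageviews"], 0),
    }


# Made-up sessions for illustration.
sessions = [
    {
        "totals": {"transactions": 1, "pageviews": 12},
        "device": {"operatingSystem": "Macintosh", "isMobile": False},
        "geoNetwork": {"country": "United States"},
    },
    {
        "totals": {"transactions": None, "pageviews": None},
        "device": {"operatingSystem": None, "isMobile": True},
        "geoNetwork": {"country": None},
    },
]

rows = [to_training_row(session) for session in sessions]
for row in rows:
    print(row)
```

The helper tests for None explicitly instead of using `value or default`, so a legitimate value of 0 page views is not replaced.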

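The FROM clause causes the query to train the model by using the bigquery-public-data.google_analytics_sample.ga_sessions sample tables. These tables are sharded by date, so you aggregate them by using a wildcard in the table name: google_analytics_sample.ga_sessions_*.

The WHERE clause, _TABLE_SUFFIX BETWEEN '20160801' AND '20170630', limits the number of tables scanned by the query. The date range scanned is August 1, 2016 to June 30, 2017.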
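To see why a string comparison on the suffix selects a date range, consider the following sketch. The table names are illustrative shard names that follow the ga_sessions_YYYYMMDD pattern, and `tables_in_range` is a hypothetical helper.

```python
def tables_in_range(table_names, start, end, prefix="ga_sessions_"):
    """Return the tables whose suffix falls in the inclusive range [start, end]."""
    selected = []
    for name in table_names:
        if not name.startswith(prefix):
            continue
        suffix = name[len(prefix):]
        # BETWEEN is inclusive on both ends, and fixed-width YYYYMMDD strings
        # sort in the same order as the dates they represent.
        if start <= suffix <= end:
            selected.append(name)
    return selected


tables = [
    "ga_sessions_20160731",
    "ga_sessions_20160801",
    "ga_sessions_20170630",
    "ga_sessions_20170701",
]

training_tables = tables_in_range(tables, "20160801", "20170630")
print(training_tables)
```

Because the suffixes have a fixed width and are ordered year, month, day, lexicographic order matches chronological order, so no date parsing is needed.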

BigQuery DataFrames

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

from bigframes.ml.linear_model import LogisticRegression
import bigframes.pandas as bpd

# Start by selecting the data you'll use for training. The source is a set
# of date-sharded tables matched by a wildcard, so `read_gbq_table` is called
# with `filters` on `_table_suffix` to limit the tables that are scanned.
# Wildcard table support in BigQuery DataFrames is tracked in
# https://github.com/googleapis/python-bigquery-dataframes/issues/169

df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20160801"),
        ("_table_suffix", "<=", "20170630"),
    ],
)

# Extract the total number of transactions within
# the Google Analytics session.
#
# Because the totals column is a STRUCT data type, call
# Series.struct.field("transactions") to extract the transactions field.
# See the reference documentation below:
# https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor#bigframes_operations_structs_StructAccessor_field
transactions = df["totals"].struct.field("transactions")

# The "label" values represent the outcome of the model's
# prediction. In this case, the model predicts if there are any
# ecommerce transactions within the Google Analytics session.
# If the number of transactions is NULL, the value in the label
# column is set to 0. Otherwise, it is set to 1.
label = transactions.notnull().map({True: 1, False: 0}).rename("label")

# Extract the operating system of the visitor's device.
operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")

# Extract whether the visitor's device is a mobile device.
is_mobile = df["device"].struct.field("isMobile")

# Extract the country from which the sessions originated, based on the IP address.
country = df["geoNetwork"].struct.field("country").fillna("")

# Extract the total number of page views within the session.
pageviews = df["totals"].struct.field("pageviews").fillna(0)

# Combine all the feature columns into a single DataFrame
# to use as training data.
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# A logistic regression model splits data into two classes and gives a
# confidence score that the data is in one of the classes.
model = LogisticRegression()
model.fit(features, label)

# The model.fit() call above created a temporary model.
# Use the to_gbq() method to write to a permanent location.
model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
    replace=True,
)

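View the model's loss statistics

Machine learning is about creating a model that can use data to make a prediction. The model is essentially a function that takes inputs and applies calculations to them to produce an output, that is, a prediction.

Machine learning algorithms work by taking several examples where the prediction is already known (such as historical data of user purchases) and iteratively adjusting various weights in the model so that the model's predictions match the true values. The algorithm does this by minimizing how wrong the model is, using a metric called loss.

The expectation is that the loss decreases with each iteration, ideally to zero. A loss of zero means the model is 100% accurate.

When training the model, BigQuery ML automatically splits the input data into training and evaluation sets to avoid overfitting the model. This is necessary so that the training algorithm doesn't fit itself so closely to the training data that it can't generalize to new examples.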
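The idea of iteratively lowering the loss while tracking a held-out evaluation set can be sketched with a tiny logistic regression trained by plain gradient descent. This is an illustration under stated assumptions, not BigQuery ML's training procedure: the data points, the learning rate, the iteration count, and the manual split are all made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(rows, w, b):
    """Average log loss of the model (w, b) over (feature, label) rows."""
    total = 0.0
    for x, y in rows:
        # Clamp the probability so that log() never receives 0.
        p = min(max(sigmoid(w * x + b), 1e-15), 1 - 1e-15)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(rows)


def train(train_rows, eval_rows, iterations=20, learning_rate=0.5):
    """Run gradient descent and record (training loss, evaluation loss) per iteration."""
    w, b = 0.0, 0.0
    history = []
    n = len(train_rows)
    for _ in range(iterations):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in train_rows) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in train_rows) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        history.append((mean_log_loss(train_rows, w, b), mean_log_loss(eval_rows, w, b)))
    return w, b, history


# Made-up (pageviews / 10, label) pairs, split by hand into two sets.
train_rows = [(0.1, 0), (0.2, 0), (0.3, 0), (0.5, 0), (0.8, 0), (1.2, 1), (1.5, 1), (2.0, 1)]
eval_rows = [(0.4, 0), (1.8, 1)]

w, b, history = train(train_rows, eval_rows)
for iteration, (train_loss, eval_loss) in enumerate(history, start=1):
    print(iteration, round(train_loss, 4), round(eval_loss, 4))
```

Only the training rows influence the weights. The evaluation loss is computed on rows the model never trained on, which is what makes it useful for spotting overfitting.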

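Use the Google Cloud console to see how the model's loss changes over the model's training iterations:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the left pane, click Explorer.

If you don't see the left pane, click Expand left pane to open the pane.
In the Explorer pane, expand your project, click Datasets, and then click the bqml_tutorial dataset.
Click the Models tab, and then click the sample_model model.
Click the Training tab and look at the Loss graph. The Loss graph shows the change in the loss metric over the iterations on the training dataset. If you hold your cursor over the graph, you can see that there are lines for Training loss and Evaluation loss. Because you performed a logistic regression, the training loss value is the log loss calculated on the training data. The evaluation loss is the log loss calculated on the evaluation data. Both loss types are average loss values, averaged over all examples in the respective dataset for each iteration.

You can also see the results of the model training by using the ML.TRAINING_INFO function.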

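Evaluate the model

Evaluate the performance of the model by using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values generated by the model against the actual data. To calculate metrics that are specific to logistic regression, you can use the ML.ROC_CURVE SQL function or the bigframes.ml.metrics.roc_curve BigQuery DataFrames function.

In this tutorial, you are using a binary classification model that detects transactions. The values in the label column are the two classes generated by the model: 0 (no transactions) and 1 (transaction made).

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement: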

SELECT
*
FROM
ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))

The results should look similar to the following:

  +-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
  | precision         | recall              | accuracy            | f1_score            | log_loss            | roc_auc             |
  +-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
  | 0.468503937007874 | 0.11080074487895716 | 0.98534315834767638 | 0.17921686746987953 | 0.04624221101176898 | 0.98174125874125873 |
  +-------------------+---------------------+---------------------+---------------------+---------------------+---------------------+

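Because you performed a logistic regression, the results include the following columns:

precision: a metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
recall: a metric for classification models that answers the following question: out of all the possible positive labels, how many did the model correctly identify?
accuracy: the fraction of predictions that a classification model got right.
f1_score: a measure of the accuracy of the model. The f1 score is the harmonic average of the precision and recall. An f1 score's best value is 1. The worst value is 0.
log_loss: the loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels.
roc_auc: the area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.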
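The definitions above can be checked on a small example. The following sketch computes the threshold-based metrics from labels and predicted classes, and computes roc_auc directly from its pairwise definition. The labels, scores, and the 0.5 threshold are made up for illustration, and the function names are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(labels, predictions):
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / len(pairs),
        "f1_score": f1_from(precision, recall),
    }


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.1, 0.05, 0.05]
predictions = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(labels, predictions)
auc = roc_auc(labels, scores)
print(metrics)
print(round(auc, 4))
```

Passing the precision and recall from the results table to `f1_from` reproduces the reported f1_score, which is a quick way to confirm the harmonic-mean relationship.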

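Query details

The initial SELECT statement retrieves the columns from your model.

The FROM clause uses the ML.EVALUATE function against your model.

The nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query.

The WHERE clause, _TABLE_SUFFIX BETWEEN '20170701' AND '20170801', limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data you're using to evaluate the predictive performance of the model. It was collected in the month immediately following the time period spanned by the training data.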

BigQuery DataFrames

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import bigframes.pandas as bpd

# Select model you'll use for evaluating. `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model` object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to evaluate the predictive performance of the model.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

transactions = df["totals"].struct.field("transactions")
label = transactions.notnull().map({True: 1, False: 0}).rename("label")
operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# Some models include a convenient .score(X, y) method for evaluation with a preset accuracy metric:

# Because you performed a logistic regression, the results include the following columns:

# - precision — A metric for classification models. Precision identifies the frequency with
# which a model was correct when predicting the positive class.

# - recall — A metric for classification models that answers the following question:
# Out of all the possible positive labels, how many did the model correctly identify?

# - accuracy — Accuracy is the fraction of predictions that a classification model got right.

# - f1_score — A measure of the accuracy of the model. The f1 score is the harmonic average of
# the precision and recall. An f1 score's best value is 1. The worst value is 0.

# - log_loss — The loss function used in a logistic regression. This is the measure of how far the
# model's predictions are from the correct labels.

# - roc_auc — The area under the ROC curve. This is the probability that a classifier is more confident that
# a randomly chosen positive example
# is actually positive than that a randomly chosen negative example is positive. For more information,
# see ['Classification']('https://developers.google.com/machine-learning/crash-course/classification/video-lecture')
# in the Machine Learning Crash Course.

model.score(features, label)
#    precision    recall  accuracy  f1_score  log_loss   roc_auc
# 0   0.412621  0.079143  0.985074  0.132812  0.049764  0.974285
# [1 rows x 6 columns]

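Use the model to predict outcomes

Use the model to predict the number of transactions made by website visitors from each country.

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement: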

SELECT
country,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10

The results should look similar to the following:

+----------------+---------------------------+
|    country     | total_predicted_purchases |
+----------------+---------------------------+
| United States  |                       220 |
| Taiwan         |                         8 |
| Canada         |                         7 |
| India          |                         2 |
| Turkey         |                         2 |
| Japan          |                         2 |
| Italy          |                         1 |
| Brazil         |                         1 |
| Singapore      |                         1 |
| Australia      |                         1 |
+----------------+---------------------------+

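Query details

The initial SELECT statement retrieves the country column and sums the predicted_label column. The predicted_label column is generated by the ML.PREDICT function. When you use the ML.PREDICT function, the output column name for the model is predicted_<label_column_name>. For linear regression models, predicted_label is the estimated value of label. For logistic regression models, predicted_label is the label that best describes the given input data value, either 0 or 1.

The ML.PREDICT function is used to predict results by using your model.

The nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query, except that the label column is omitted.

The WHERE clause, _TABLE_SUFFIX BETWEEN '20170701' AND '20170801', limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data for which you're making predictions. It was collected in the month immediately following the time period spanned by the training data.

The GROUP BY and ORDER BY clauses group the results by country and order them by the sum of the predicted purchases in descending order.

The LIMIT clause is used here to display only the top 10 results.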
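The aggregation that the outer query performs can be pictured with a few rows of prediction output. This is a sketch over made-up rows, not output from the model, and `top_countries` is a hypothetical helper name.

```python
from collections import Counter


def top_countries(prediction_rows, limit=10):
    """Sum predicted_label per country and return the highest totals first."""
    totals = Counter()
    for row in prediction_rows:
        # predicted_label is 0 or 1, so the sum counts the predicted purchases.
        totals[row["country"]] += row["predicted_label"]
    return totals.most_common(limit)


# Made-up prediction rows for illustration.
prediction_rows = [
    {"country": "United States", "predicted_label": 1},
    {"country": "United States", "predicted_label": 1},
    {"country": "Canada", "predicted_label": 1},
    {"country": "India", "predicted_label": 0},
    {"country": "United States", "predicted_label": 0},
    {"country": "Canada", "predicted_label": 0},
]

top = top_countries(prediction_rows, limit=2)
print(top)
```

Because each predicted label is 0 or 1, summing the column is the same as counting the sessions for which the model predicts a transaction.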

BigQuery DataFrames

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import bigframes.pandas as bpd

# Select model you'll use for predicting.
# `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model`
# object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to make the prediction.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)
# Use Logistic Regression predict method to predict results
# using your model.
# Find more information here in
# [BigFrames](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.linear_model.LogisticRegression#bigframes_ml_linear_model_LogisticRegression_predict)

predictions = model.predict(features)

# Call groupby method to group predicted_label by country.
# Call sum method to get the total_predicted_label by country.
total_predicted_purchases = predictions.groupby(["country"])[
    "predicted_label"
].sum()

# Call the sort_values method with the parameter
# ascending = False to get the highest values.
# Call head method to limit to the 10 highest values.
total_predicted_purchases.sort_values(ascending=False).head(10)

# country
# United States    220
# Taiwan             8
# Canada             7
# India              2
# Japan              2
# Turkey             2
# Australia          1
# Brazil             1
# Germany            1
# Guyana             1
# Name: predicted_label, dtype: Int64

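Predict purchases per user

Predict the number of transactions each website visitor will make.

SQL

This query is identical to the query in the previous section except that it also selects the fullVisitorId column and groups by it. Here the GROUP BY clause, GROUP BY fullVisitorId, is used to group the results by visitor ID.

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement: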

SELECT
fullVisitorId,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country,
fullVisitorId
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10

The results should look similar to the following:

  +---------------------+---------------------------+
  |    fullVisitorId    | total_predicted_purchases |
  +---------------------+---------------------------+
  | 9417857471295131045 |                         4 |
  | 112288330928895942  |                         2 |
  | 2158257269735455737 |                         2 |
  | 489038402765684003  |                         2 |
  | 057693500927581077  |                         2 |
  | 2969418676126258798 |                         2 |
  | 5073919761051630191 |                         2 |
  | 7420300501523012460 |                         2 |
  | 0456807427403774085 |                         2 |
  | 2105122376016897629 |                         2 |
  +---------------------+---------------------------+
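Two details of this result are easy to miss: the visitor IDs are strings, some with leading zeros, and many visitors tie on the same total. The following sketch groups a few made-up rows to show both points. The helper name `purchases_per_visitor` and the tie-breaking rule are assumptions for illustration. The SQL query itself does not specify an order among ties, so the rows that appear at the LIMIT boundary can differ between runs and tools.

```python
from collections import defaultdict


def purchases_per_visitor(rows, limit=10):
    """Sum predicted labels per visitor ID, keeping IDs as strings."""
    totals = defaultdict(int)
    for visitor_id, predicted_label in rows:
        totals[visitor_id] += predicted_label
    # Highest total first; ties are broken by ID so that the order is repeatable.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


# Made-up (fullVisitorId, predicted_label) rows; the IDs reuse values from the table above.
rows = [
    ("9417857471295131045", 1),
    ("057693500927581077", 1),
    ("9417857471295131045", 1),
    ("057693500927581077", 1),
    ("9417857471295131045", 1),
    ("9417857471295131045", 1),
    ("112288330928895942", 0),
]

ranked = purchases_per_visitor(rows)
print(ranked)

# Converting an ID to an integer would drop its leading zero.
print(str(int("057693500927581077")))
```

Keeping the IDs as strings preserves values such as 057693500927581077 exactly as they appear in the source data.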

BigQuery DataFrames

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import bigframes.pandas as bpd

# Select model you'll use for predicting.
# `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model`
# object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to make the prediction.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
full_visitor_id = df["fullVisitorId"]

features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
        "fullVisitorId": full_visitor_id,
    }
)

predictions = model.predict(features)

# Call groupby method to group predicted_label by visitor.
# Call sum method to get the total_predicted_label by visitor.
total_predicted_purchases = predictions.groupby(["fullVisitorId"])[
    "predicted_label"
].sum()

# Call the sort_values method with the parameter
# ascending = False to get the highest values.
# Call head method to limit to the 10 highest values.
total_predicted_purchases.sort_values(ascending=False).head(10)

# fullVisitorId
# 9417857471295131045    4
# 0376394056092189113    2
# 0456807427403774085    2
# 057693500927581077     2
# 112288330928895942     2
# 1280993661204347450    2
# 2105122376016897629    2
# 2158257269735455737    2
# 2969418676126258798    2
# 489038402765684003     2
# Name: predicted_label, dtype: Int64

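Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

You can delete the project you created, or keep the project and delete the dataset.

Delete the dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the left pane, click Explorer.
In the Explorer pane, expand your project, click Datasets, and then click the bqml_tutorial dataset that you created.
Click Delete.
In the Delete dataset dialog, confirm the delete command by typing delete.
Click Delete.

Delete the project

To delete the project, do the following: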

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

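What's next

To learn more about machine learning, see the Machine Learning Crash Course.
For an overview of BigQuery ML, see Introduction to BigQuery ML.
To learn more about the Google Cloud console, see Using the Google Cloud console.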