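Create ML models in BigQuery ML by using SQL

This tutorial shows you how to create a logistic regression model by using BigQuery ML SQL queries.

BigQuery ML lets you create and train machine learning (ML) models in BigQuery by using SQL queries. This makes ML more approachable, because you can use familiar tools such as the BigQuery SQL editor. It also speeds up development, because you don't have to move data into a separate ML environment.

In this tutorial, you use the Google Analytics sample dataset for BigQuery to create a model that predicts whether a website visitor will make a transaction. For information about the schema of the Analytics dataset, see BigQuery Export schema in the Google Analytics Help Center.

To learn how to create models by using the Google Cloud console user interface, see Work with models by using a UI (Preview).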

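Objectives

This tutorial shows you how to perform the following tasks:

- Use the CREATE MODEL statement to create a binary logistic regression model.
- Use the ML.EVALUATE function to evaluate the model.
- Use the ML.PREDICT function to make predictions with the model.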

Costs

This tutorial uses the following billable components of Google Cloud:

- BigQuery
- BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see BigQuery ML pricing.

Required roles

To create a model and run inference, you must be granted the following roles:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery User (roles/bigquery.user)

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Make sure that you have the following role or roles on the project: BigQuery Data Editor, BigQuery Job User, Service Usage Admin

Check for the roles

In the Google Cloud console, go to the IAM page.
Go to IAM
Select the project.
In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

In the Google Cloud console, go to the IAM page.
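Go to IAM
Select the project.
Click Grant access.
In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
In the Select a role list, select a role.
To grant additional roles, click Add another role and add each additional role.
Click Save.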

BigQuery is automatically enabled in new projects. To activate BigQuery in an existing project, enable the BigQuery API.
Enable the API

Create a dataset

Create a BigQuery dataset to store your ML model.

Console

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter bqml_tutorial.
- For Location type, select Multi-region, and then select US (multiple regions in United States).
- Leave the remaining default settings as they are, and click Create dataset.

bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

Create a dataset named bqml_tutorial with the data location set to US and a description of BigQuery ML tutorial dataset:
```
bq --location=US mk -d \
 --description "BigQuery ML tutorial dataset." \
 bqml_tutorial
```
Instead of the --dataset flag, the command uses the -d shortcut. If you omit both -d and --dataset, the command defaults to creating a dataset.
Confirm that the dataset was created:
```
bq ls
```

API

Call the datasets.insert method with a defined dataset resource.

{
  "datasetReference": {
     "datasetId": "bqml_tutorial"
  }
}
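
The request body above sets only the dataset ID. As a minimal sketch, the following Python builds a fuller dataset resource that mirrors the console and bq examples by also setting the location and description. The project ID is a hypothetical placeholder, and the code only prints the request instead of sending it, so no credentials are needed.

```python
import json

# Hypothetical placeholder: replace with your own project ID.
PROJECT_ID = "my-project"

# Dataset resource for datasets.insert. The location and description
# mirror the console and bq examples in this tutorial.
dataset_resource = {
    "datasetReference": {
        "projectId": PROJECT_ID,
        "datasetId": "bqml_tutorial",
    },
    "location": "US",
    "description": "BigQuery ML tutorial dataset.",
}

# REST endpoint for datasets.insert in the BigQuery v2 API.
url = f"https://bigquery.googleapis.com/bigquery/v2/projects/{PROJECT_ID}/datasets"

# Serialize the resource; this is the body of the POST request.
request_body = json.dumps(dataset_resource, indent=2)
print("POST", url)
print(request_body)
```

Sending the request additionally requires an authenticated HTTP client, which is outside the scope of this sketch.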

BigQuery DataFrames

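Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.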

import google.cloud.bigquery

bqclient = google.cloud.bigquery.Client()
bqclient.create_dataset("bqml_tutorial", exists_ok=True)

Create a logistic regression model

Create a logistic regression model by using the Analytics sample dataset for BigQuery.

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement:

CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

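The query takes several minutes to complete. After the first iteration is complete, your model (sample_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you don't see query results.

Query details

The CREATE MODEL statement creates the model and then trains it by using the data retrieved by the query's SELECT statement.

The OPTIONS(model_type='logistic_reg') clause creates a logistic regression model. A logistic regression model splits input data into two classes, and then estimates the probability that the data is in one of the classes. What you are trying to detect, such as whether an email is spam, is represented by 1, and everything else is represented by 0. The likelihood that a given value belongs to the class you are trying to detect is indicated by a value between 0 and 1. For example, if an email receives a probability estimate of 0.9, there is a 90% probability that the email is spam.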
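
To make the probability estimate concrete, here is a minimal sketch of how a logistic regression model turns feature values into a probability between 0 and 1. The weights, the bias, and the 0.5 decision threshold are hypothetical values chosen for illustration; they are not taken from the trained sample_model.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and bias, for illustration only.
weights = {"pageviews": 0.15, "is_mobile": -0.4}
bias = -4.0


def predict_probability(features: dict) -> float:
    """Weighted sum of the features, squashed by the sigmoid."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)


# A desktop session with 30 page views.
session = {"pageviews": 30, "is_mobile": 0}
probability = predict_probability(session)

# Assumed threshold: probabilities of 0.5 or more map to class 1.
predicted_label = 1 if probability >= 0.5 else 0
print(f"probability={probability:.3f}, predicted_label={predicted_label}")
```

Training adjusts the weights and bias so that these probabilities agree with the observed labels as closely as possible.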

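This query's SELECT statement retrieves the following columns, which the model uses to predict the probability that a visitor will complete a transaction:

- totals.transactions: the total number of ecommerce transactions within the session. If the number of transactions is NULL, the value in the label column is set to 0. Otherwise, it is set to 1. These values represent the two possible outcomes. Creating an alias named label is an alternative to setting the input_label_cols= option in the CREATE MODEL statement.
- device.operatingSystem: the operating system of the visitor's device.
- device.isMobile: whether the visitor's device is a mobile device.
- geoNetwork.country: the country from which the session originated, based on the IP address.
- totals.pageviews: the total number of page views within the session.

FROM clause: the query trains the model by using the bigquery-public-data.google_analytics_sample.ga_sessions sample tables. These tables are sharded by date, so you aggregate them by using a wildcard in the table name: google_analytics_sample.ga_sessions_*.

WHERE clause: _TABLE_SUFFIX BETWEEN '20160801' AND '20170630' limits the number of tables scanned by the query. The date range scanned is August 1, 2016 to June 30, 2017.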
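
The suffix filter works because YYYYMMDD strings sort in the same order as the dates they encode, so a plain string comparison selects a date range. The sketch below lists the suffixes covered by the training range, assuming one shard per calendar day (the sample dataset may not contain a table for every day). The helper names are illustrative only.

```python
from datetime import date, timedelta


def table_suffixes(start: date, end: date) -> list[str]:
    """Return one YYYYMMDD suffix per calendar day, inclusive of both ends."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]


def in_training_range(suffix: str) -> bool:
    """String comparison equivalent to BETWEEN '20160801' AND '20170630'."""
    return "20160801" <= suffix <= "20170630"


training_suffixes = table_suffixes(date(2016, 8, 1), date(2017, 6, 30))
print(len(training_suffixes), training_suffixes[0], training_suffixes[-1])
```

Under that assumption, the training range spans 334 daily shards, and the evaluation data used later in this tutorial starts with the first suffix outside that range, 20170701.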

BigQuery DataFrames

from bigframes.ml.linear_model import LogisticRegression
import bigframes.pandas as bpd

# Start by selecting the data you'll use for training. `read_gbq` accepts
# either a SQL query or a table ID. This example selects from multiple
# tables via a wildcard, so it passes the wildcard table ID to
# `read_gbq_table` and filters on `_table_suffix`. Watch issue
# https://github.com/googleapis/python-bigquery-dataframes/issues/169
# for updates to `read_gbq` to support wildcard tables.

df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20160801"),
        ("_table_suffix", "<=", "20170630"),
    ],
)

# Extract the total number of transactions within
# the Google Analytics session.
#
# Because the totals column is a STRUCT data type, call
# Series.struct.field("transactions") to extract the transactions field.
# See the reference documentation below:
# https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor#bigframes_operations_structs_StructAccessor_field
transactions = df["totals"].struct.field("transactions")

# The "label" values represent the outcome of the model's
# prediction. In this case, the model predicts if there are any
# ecommerce transactions within the Google Analytics session.
# If the number of transactions is NULL, the value in the label
# column is set to 0. Otherwise, it is set to 1.
label = transactions.notnull().map({True: 1, False: 0}).rename("label")

# Extract the operating system of the visitor's device.
operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")

# Extract whether the visitor's device is a mobile device.
is_mobile = df["device"].struct.field("isMobile")

# Extract the country from which the sessions originated, based on the IP address.
country = df["geoNetwork"].struct.field("country").fillna("")

# Extract the total number of page views within the session.
pageviews = df["totals"].struct.field("pageviews").fillna(0)

# Combine all the feature columns into a single DataFrame
# to use as training data.
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# A logistic regression model splits data into two classes and gives
# a confidence score that the data is in one of the classes.
model = LogisticRegression()
model.fit(features, label)

# The model.fit() call above created a temporary model.
# Use the to_gbq() method to write to a permanent location.
model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
    replace=True,
)

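View the model's loss statistics

Machine learning is about creating a model that can use data to make a prediction. The model is essentially a function that takes inputs and applies calculations to them to produce an output: a prediction.

Machine learning algorithms work by taking several examples where the prediction is already known (such as historical data of user purchases) and iteratively adjusting various weights in the model so that the model's predictions match the true values. They do this by minimizing how wrong the model is, as measured by a metric called loss.

The expectation is that the loss decreases with each iteration, ideally to zero. A loss of zero means that the model is 100% accurate.

When training the model, BigQuery ML automatically splits the input data into training and evaluation sets to avoid overfitting the model. This is necessary so that the training algorithm doesn't fit itself so closely to the training data that it can't generalize to new examples.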
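
As an illustration of the idea behind a training and evaluation split, the following sketch holds out a random fraction of rows for evaluation. It is a generic hold-out split, not BigQuery ML's actual splitting logic; the 20% fraction and the fixed seed are assumptions made for the example.

```python
import random


def split_train_eval(rows, eval_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for evaluation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    eval_size = int(len(shuffled) * eval_fraction)
    # The first eval_size rows are held out; the rest are used for training.
    return shuffled[eval_size:], shuffled[:eval_size]


rows = list(range(1000))  # Stand-in for 1,000 training examples.
train_rows, eval_rows = split_train_eval(rows)
print(len(train_rows), len(eval_rows))
```

Because the evaluation rows never influence the weights, a loss that keeps falling on the training rows while rising on the evaluation rows is a sign of overfitting.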

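Use the Google Cloud console to see how the model's loss changes over the model's training iterations:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the Explorer pane, expand bqml_tutorial > Models, and then click sample_model.
Click the Training tab and look at the Loss graph. The Loss graph shows the change in the loss metric over the iterations on the training dataset. If you hold your cursor over the graph, you can see the lines for Training loss and Evaluation loss. Because you performed a logistic regression, the training loss is calculated as log loss on the training data. The evaluation loss is the log loss calculated on the evaluation data. Both loss types are average loss values, averaged over all examples in the respective dataset for each iteration.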
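
For reference, log loss averaged over all examples can be sketched as follows. This is the standard binary log loss formula written in plain Python; the labels and probabilities are made-up values, and the clipping constant eps is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Average binary log loss over all examples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip the probability so that log(0) is never evaluated.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


labels = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.2, 0.8]  # Predictions close to the true labels.
uncertain = [0.5, 0.5, 0.5, 0.5]  # Predictions that carry no information.

print(log_loss(labels, confident))
print(log_loss(labels, uncertain))
```

Predictions that are confidently correct produce a loss near zero, while predicting 0.5 for every example produces a loss of ln 2, about 0.693.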

You can also see the results of the model training by using the ML.TRAINING_INFO function.

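Evaluate the model

Evaluate the performance of the model by using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values generated by the model against the actual data. To calculate metrics that are specific to logistic regression, you can use the ML.ROC_CURVE SQL function or the bigframes.ml.metrics.roc_curve BigQuery DataFrames function.

In this tutorial, you use a binary classification model that detects transactions. The values in the label column are the two classes generated by the model: 0 (no transactions) and 1 (transaction made).

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the query editor, run the following statement: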
```
SELECT
*
FROM
ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
```
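The results should look similar to the following:
```
+--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
|     precision      |       recall        |      accuracy       |      f1_score       |      log_loss       |       roc_auc       |
+--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
| 0.468503937007874  | 0.11080074487895716 | 0.98534315834767638 | 0.17921686746987953 | 0.04624221101176898 | 0.98174125874125873 |
+--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
```
Because you performed a logistic regression, the results include the following columns:
- precision: a metric for classification models. Precision identifies how often the model was correct when it predicted the positive class.
- recall: a metric for classification models that answers the following question: out of all the possible positive labels, how many did the model correctly identify?
- accuracy: the fraction of predictions that a classification model got right.
- f1_score: a measure of the accuracy of the model. The f1 score is the harmonic mean of precision and recall. The best f1 score is 1 and the worst is 0.
- log_loss: the loss function used in a logistic regression. It measures how far the model's predictions are from the correct labels.
- roc_auc: the area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.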
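
The first four metrics can be derived from the counts of true and false positives and negatives. The sketch below shows the standard formulas with hypothetical counts, and then recomputes the f1 score from the precision and recall values reported in the results table above as a consistency check.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1_score = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1_score,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=12, tn=78)
print(metrics)

# Consistency check against the values in the results table.
reported_precision = 0.468503937007874
reported_recall = 0.11080074487895716
f1_from_table = (
    2 * reported_precision * reported_recall
    / (reported_precision + reported_recall)
)
print(f1_from_table)
```

The high accuracy next to the low recall in the table is typical for imbalanced data: most sessions have no transaction, so a model can be right most of the time while still missing most of the positive cases.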

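Query details

The initial SELECT statement retrieves the columns from your model.

The FROM clause uses the ML.EVALUATE function against your model.

The nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query.

WHERE clause: _TABLE_SUFFIX BETWEEN '20170701' AND '20170801' limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data you use to evaluate the predictive performance of the model. It was collected in the month immediately following the time period spanned by the training data.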

BigQuery DataFrames

import bigframes.pandas as bpd

# Select the model you'll use for evaluating. `read_gbq_model` loads model data
# from BigQuery, but you could also use the `model` object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to evaluate the predictive performance of the model.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

transactions = df["totals"].struct.field("transactions")
label = transactions.notnull().map({True: 1, False: 0}).rename("label")
operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# Some models include a convenient .score(X, y) method for evaluation with a preset accuracy metric.
#
# Because you performed a logistic regression, the results include the following columns:
#
# - precision: A metric for classification models. Precision identifies the frequency with
#   which a model was correct when predicting the positive class.
#
# - recall: A metric for classification models that answers the following question:
#   Out of all the possible positive labels, how many did the model correctly identify?
#
# - accuracy: The fraction of predictions that a classification model got right.
#
# - f1_score: A measure of the accuracy of the model. The f1 score is the harmonic mean of
#   the precision and recall. An f1 score's best value is 1. The worst value is 0.
#
# - log_loss: The loss function used in a logistic regression. This is the measure of how far the
#   model's predictions are from the correct labels.
#
# - roc_auc: The area under the ROC curve. This is the probability that a classifier is more
#   confident that a randomly chosen positive example is actually positive than that a randomly
#   chosen negative example is positive. For more information, see "Classification" in the
#   Machine Learning Crash Course:
#   https://developers.google.com/machine-learning/crash-course/classification/video-lecture

model.score(features, label)
#    precision    recall  accuracy  f1_score  log_loss   roc_auc
# 0   0.412621  0.079143  0.985074  0.132812  0.049764  0.974285
# [1 rows x 6 columns]

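Use the model to predict outcomes

Use the model to predict the number of transactions made by website visitors from each country.

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement: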

SELECT
country,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10

The results should look similar to the following:

+----------------+---------------------------+
|    country     | total_predicted_purchases |
+----------------+---------------------------+
| United States  |                       220 |
| Taiwan         |                         8 |
| Canada         |                         7 |
| India          |                         2 |
| Turkey         |                         2 |
| Japan          |                         2 |
| Italy          |                         1 |
| Brazil         |                         1 |
| Singapore      |                         1 |
| Australia      |                         1 |
+----------------+---------------------------+

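Query details

The outer SELECT statement retrieves the country column and sums the predicted_label column. The predicted_label column is generated by the ML.PREDICT function. When you use the ML.PREDICT function, the output column name for the model is predicted_<label_column_name>. For linear regression models, predicted_label is the estimated value of label. For logistic regression models, predicted_label is the label that best describes the given input data, either 0 or 1.

The ML.PREDICT function is used to predict results by using your model.

The nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query, except that the label column is omitted.

WHERE clause: _TABLE_SUFFIX BETWEEN '20170701' AND '20170801' limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data for which you make predictions. It was collected in the month immediately following the time period spanned by the training data.

The GROUP BY and ORDER BY clauses group the results by country and order them by the sum of the predicted purchases in descending order.

The LIMIT clause is used here to display only the top 10 results.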
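
Because predicted_label is 0 or 1 for each session, summing it per country counts the sessions that the model expects to end in a purchase. The sketch below reproduces the GROUP BY, ORDER BY, and LIMIT steps in plain Python on a handful of hypothetical prediction rows; the rows are invented for illustration and are not output from the model.

```python
from collections import Counter

# Hypothetical ML.PREDICT output rows, for illustration only.
predictions = [
    {"country": "United States", "predicted_label": 1},
    {"country": "Canada", "predicted_label": 0},
    {"country": "United States", "predicted_label": 1},
    {"country": "Taiwan", "predicted_label": 1},
    {"country": "Canada", "predicted_label": 0},
]

# GROUP BY country with SUM(predicted_label).
total_predicted_purchases = Counter()
for row in predictions:
    total_predicted_purchases[row["country"]] += row["predicted_label"]

# ORDER BY total_predicted_purchases DESC LIMIT 10.
top_countries = total_predicted_purchases.most_common(10)
for country, total in top_countries:
    print(country, total)
```

A country whose sessions are all predicted as 0 still appears in the grouped result with a total of 0, and only the LIMIT step can remove it.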

BigQuery DataFrames

import bigframes.pandas as bpd

# Select model you'll use for predicting.
# `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model`
# object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to make the prediction.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)
# Use Logistic Regression predict method to predict results
# using your model.
# Find more information here in
# [BigFrames](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.linear_model.LogisticRegression#bigframes_ml_linear_model_LogisticRegression_predict)

predictions = model.predict(features)

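# Call groupby method to group predicted_label by country.
# Call sum method to get the total_predicted_label by country.
# Select the single "predicted_label" column so that the grouped
# sum is a Series, which sort_values can order without a `by` argument.
total_predicted_purchases = predictions.groupby(["country"])[
    "predicted_label"
].sum()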

# Call the sort_values method with the parameter
# ascending = False to get the highest values.
# Call head method to limit to the 10 highest values.
total_predicted_purchases.sort_values(ascending=False).head(10)

# country
# United States    220
# Taiwan             8
# Canada             7
# India              2
# Japan              2
# Turkey             2
# Australia          1
# Brazil             1
# Germany            1
# Guyana             1
# Name: predicted_label, dtype: Int64

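Predict purchases per user

Predict the number of transactions that each website visitor will make.

SQL

This query is the same as the query in the previous section, except for the GROUP BY clause. Here, the GROUP BY clause (GROUP BY fullVisitorId) groups the results by visitor ID.

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement: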

SELECT
fullVisitorId,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country,
fullVisitorId
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10

The results should look similar to the following:

  +---------------------+---------------------------+
  |    fullVisitorId    | total_predicted_purchases |
  +---------------------+---------------------------+
  | 9417857471295131045 |                         4 |
  | 112288330928895942  |                         2 |
  | 2158257269735455737 |                         2 |
  | 489038402765684003  |                         2 |
  | 057693500927581077  |                         2 |
  | 2969418676126258798 |                         2 |
  | 5073919761051630191 |                         2 |
  | 7420300501523012460 |                         2 |
  | 0456807427403774085 |                         2 |
  | 2105122376016897629 |                         2 |
  +---------------------+---------------------------+

BigQuery DataFrames


import bigframes.pandas as bpd

# Select model you'll use for predicting.
# `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model`
# object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to make the prediction.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
full_visitor_id = df["fullVisitorId"]

features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
        "fullVisitorId": full_visitor_id,
    }
)

predictions = model.predict(features)

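# Call groupby method to group predicted_label by visitor.
# Call sum method to get the total_predicted_label by visitor.
# Select the single "predicted_label" column so that the grouped
# sum is a Series, which sort_values can order without a `by` argument.
total_predicted_purchases = predictions.groupby(["fullVisitorId"])[
    "predicted_label"
].sum()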

# Call the sort_values method with the parameter
# ascending = False to get the highest values.
# Call head method to limit to the 10 highest values.
total_predicted_purchases.sort_values(ascending=False).head(10)

# fullVisitorId
# 9417857471295131045    4
# 0376394056092189113    2
# 0456807427403774085    2
# 057693500927581077     2
# 112288330928895942     2
# 1280993661204347450    2
# 2105122376016897629    2
# 2158257269735455737    2
# 2969418676126258798    2
# 489038402765684003     2
# Name: predicted_label, dtype: Int64

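Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

You can delete the project you created, or keep the project and delete the dataset.

Delete your dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset that you created in this tutorial:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the Explorer pane, select the bqml_tutorial dataset that you created.
Click Actions > Delete.
In the Delete dataset dialog, confirm the delete command by typing delete.
Click Delete.

Delete your project

To delete the project, do the following:

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, deleting it also deletes any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.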

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

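What's next

- To learn about machine learning, see the Machine Learning Crash Course.
- For an overview of BigQuery ML, see Introduction to BigQuery ML.
- To learn more about the Google Cloud console, see Using the Google Cloud console.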