このページは Cloud Translation API によって翻訳されました。

国勢調査データに基づいた分類モデルを構築して使用する

このチュートリアルでは、BigQuery ML のバイナリロジスティック回帰モデルを使用して、ユーザー属性データに基づいて個人の収入の範囲を予測します。バイナリロジスティック回帰モデルは、値が 2 つのカテゴリのいずれかに該当するかどうか（この場合は、個人の年収が $50,000 を上回っているか下回っているか）を予測します。

このチュートリアルでは、bigquery-public-data.ml_datasets.census_adult_income データセットを使用します。このデータセットには、2000 年と 2010 年の米国居住者のユーザー属性と収入情報が含まれています。

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Go to project selector
Verify that billing is enabled for your Google Cloud project.
Enable the BigQuery API.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Enable the API

必要な権限

BigQuery ML を使用してモデルを作成するには、次の IAM 権限が必要です。

bigquery.jobs.create
bigquery.models.create
bigquery.models.getData
bigquery.models.updateData
bigquery.models.updateMetadata

推論を実行するには、次の権限が必要です。

モデルに対する bigquery.models.getData
bigquery.jobs.create

はじめに

ML の一般的な問題は、ラベルと呼ばれる 2 つのタイプのいずれかにデータを分類することです。たとえば、販売店は特定の顧客が新しい製品を購入するかどうかを、その顧客に関するその他の情報に基づいて予測したい場合があります。この場合、2 つのラベルは will buy と won't buy になります。販売店は、1 つの列が両方のラベルを表し、顧客のロケーション、以前の購入、報告された好みなどの顧客情報を含むデータセットを構築できます。販売店は、この顧客情報を使用するバイナリロジスティック回帰モデルを使用して、各顧客を最もよく表すラベルを予測できます。

このチュートリアルでは、米国国勢調査の回答者のユーザー属性に基づいて、その回答者の収入が 2 つの範囲のどちらに分類されるかを予測するバイナリロジスティック回帰モデルを作成します。

データセットを作成する

モデルを格納する BigQuery データセットを作成します。

Google Cloud コンソールで、[BigQuery] ページに移動します。

BigQuery に移動
左側のペインで、 [エクスプローラ] をクリックします。

左側のペインが表示されていない場合は、 左側のペインを開くをクリックしてペインを開きます。
[エクスプローラ] ペインで、プロジェクト名をクリックします。
[アクションを表示] > [データセットを作成] をクリックします。
[データセットを作成する] ページで、次の操作を行います。
- [データセット ID] に「census」と入力します。
- [ロケーションタイプ] で [マルチリージョン] を選択してから、[US（米国の複数のリージョン）] を選択します。
  
  一般公開データセットは US マルチリージョンに保存されています。わかりやすくするため、データセットを同じロケーションに保存します。
- 残りのデフォルトの設定は変更せず、[データセットを作成] をクリックします。

データを検討する

データセットを確認して、ロジスティック回帰モデルのトレーニングデータとして使用する列を特定します。census_adult_income テーブルから 100 行を選択します。

SQL

Google Cloud コンソールで、[BigQuery] ページに移動します。

[BigQuery] に移動

クエリエディタで、次の GoogleSQL クエリを実行します。

SELECT
age,
workclass,
marital_status,
education_num,
occupation,
hours_per_week,
income_bracket,
functional_weight
FROM
`bigquery-public-data.ml_datasets.census_adult_income`
LIMIT
100;

結果は次のようになります。

BigQuery DataFrames

このサンプルを試す前に、BigQuery DataFrames を使用した BigQuery クイックスタートの手順に沿って BigQuery DataFrames を設定してください。詳細については、BigQuery DataFrames のリファレンスドキュメントをご覧ください。

BigQuery に対する認証を行うには、アプリケーションのデフォルト認証情報を設定します。詳細については、ローカル開発環境の ADC の設定をご覧ください。

import bigframes.pandas as bpd

df = bpd.read_gbq(
    "bigquery-public-data.ml_datasets.census_adult_income",
    columns=(
        "age",
        "workclass",
        "marital_status",
        "education_num",
        "occupation",
        "hours_per_week",
        "income_bracket",
        "functional_weight",
    ),
    max_results=100,
)
df.peek()
# Output:
# age      workclass       marital_status  education_num          occupation  hours_per_week income_bracket  functional_weight
#  47      Local-gov   Married-civ-spouse             13      Prof-specialty              40           >50K             198660
#  56        Private        Never-married              9        Adm-clerical              40          <=50K              85018
#  40        Private   Married-civ-spouse             12        Tech-support              40           >50K             285787
#  34   Self-emp-inc   Married-civ-spouse              9        Craft-repair              54           >50K             207668
#  23        Private   Married-civ-spouse             10   Handlers-cleaners              40          <=50K              40060

クエリ結果から、census_adult_income テーブルの income_bracket 列には <=50K または >50K の 2 つの値のどちらかしかないことがわかります。

サンプルデータを準備する

このチュートリアルでは、census_adult_income テーブルの次の列の値に基づいて、国勢調査回答者の収入を予測します。

age: 回答者の年齢。
workclass: 経験した仕事の種類（地方自治体、民間企業、自営業など）。
marital_status
education_num: 回答者の最終学歴。
occupation
hours_per_week: 週あたりの労働時間

データが重複している列を除外します。たとえば、education 列。education 列と education_num 列の値は同じデータを異なる形式で表しているためです。

functional_weight 列は、国勢調査機関が特定の行に対応すると考えている人数です。この列の値は、特定の行の income_bracket の値とは無関係であるため、この列の値を使用して、functional_weight 列から派生した新しい dataframe 列を作成し、データをトレーニングセット、評価セット、予測セットに分割します。データの 80% にはモデルのトレーニング用のラベルを付け、10% には評価用のラベルを付け、10% には予測用のラベルを付けます。

SQL

サンプルデータを使用してビューを作成します。このビューは、このチュートリアルの後半の CREATE MODEL ステートメントで使用されます。

サンプルデータを準備するクエリを実行します。

Google Cloud コンソールで、[BigQuery] ページに移動します。

[BigQuery] に移動

[クエリエディタ] ペインで、次のクエリを実行します。

CREATE OR REPLACE VIEW
`census.input_data` AS
SELECT
age,
workclass,
marital_status,
education_num,
occupation,
hours_per_week,
income_bracket,
CASE
  WHEN MOD(functional_weight, 10) < 8 THEN 'training'
  WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
  WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
END AS dataframe
FROM
`bigquery-public-data.ml_datasets.census_adult_income`;

サンプルデータを表示します。
```
SELECT * FROM `census.input_data`;
```

BigQuery DataFrames

input_data という DataFrame を作成します。このチュートリアルの後半では、input_data を使用して、モデルをトレーニングおよび評価し、予測を行います。

import bigframes.pandas as bpd

input_data = bpd.read_gbq(
    "bigquery-public-data.ml_datasets.census_adult_income",
    columns=(
        "age",
        "workclass",
        "marital_status",
        "education_num",
        "occupation",
        "hours_per_week",
        "income_bracket",
        "functional_weight",
    ),
)
input_data["dataframe"] = bpd.Series("training", index=input_data.index,).case_when(
    [
        (((input_data["functional_weight"] % 10) == 8), "evaluation"),
        (((input_data["functional_weight"] % 10) == 9), "prediction"),
    ]
)
del input_data["functional_weight"]

ロジスティック回帰モデルを作成する

前のセクションでラベルを付けたトレーニングデータを使用して、ロジスティック回帰モデルを作成します。

SQL

CREATE MODEL ステートメントを使用し、モデルタイプに LOGISTIC_REG を指定します。

CREATE MODEL ステートメントについては、次の点に注意してください。

input_label_cols オプションは、SELECT ステートメントでラベル列として使用する列を指定します。ここで、ラベル列は income_bracket であるため、モデルはその行の他の値に基づいて、特定の行に対して income_bracket の 2 つの値のどちらに分類される可能性が高いかを学習します。
ロジスティック回帰モデルがバイナリかマルチクラスかを指定する必要はありません。BigQuery ML は、ラベル列の一意の値の数に基づいて、トレーニングするモデルのタイプを決定します。
トレーニングデータ内のクラスラベルのバランスを取るために、auto_class_weights オプションは TRUE に設定されています。デフォルトでは、トレーニングデータは重み付けされません。トレーニングデータ内のラベルが不均衡である場合、モデルは最も出現回数の多いラベルクラスをより重視して予測するように学習することがあります。この場合、データセット内の回答者の大多数は低い方の収入階層に属します。このため、低い方の収入階層を過度に重視して予測するモデルになる可能性があります。クラスの重みは、各クラスの頻度に反比例した重みを計算して、クラスラベルのバランスを取ります。
enable_global_explain オプションは、チュートリアルの後半でモデルで ML.GLOBAL_EXPLAIN 関数を使用できるように TRUE に設定されています。
SELECT ステートメントは、サンプルデータを含む input_data ビューをクエリします。WHERE 句はその行をフィルタリングし、トレーニングデータとしてラベル付けされた行のみがモデルのトレーニングに使用されるようにします。

ロジスティック回帰モデルを作成するクエリを実行します。

Google Cloud コンソールで、[BigQuery] ページに移動します。

[BigQuery] に移動

[クエリエディタ] ペインで、次のクエリを実行します。

CREATE OR REPLACE MODEL
`census.census_model`
OPTIONS
( model_type='LOGISTIC_REG',
  auto_class_weights=TRUE,
  enable_global_explain=TRUE,
  data_split_method='NO_SPLIT',
  input_label_cols=['income_bracket'],
  max_iterations=15) AS
SELECT * EXCEPT(dataframe)
FROM
`census.input_data`
WHERE
dataframe = 'training'

左側のペインで、 [エクスプローラ] をクリックします。
[エクスプローラ] ペインで [データセット] をクリックします。
[データセット] ペインで census をクリックします。
[モデル] タブをクリックします。
[census_model] をクリックします。
[詳細] タブには、BigQuery ML がロジスティック回帰の実行に使用した属性が一覧表示されます。

BigQuery DataFrames

fit メソッドを使用してモデルをトレーニングし、to_gbq メソッドを使用してモデルをデータセットに保存します。

import bigframes.ml.linear_model

# input_data is defined in an earlier step.
training_data = input_data[input_data["dataframe"] == "training"]
X = training_data.drop(columns=["income_bracket", "dataframe"])
y = training_data["income_bracket"]

census_model = bigframes.ml.linear_model.LogisticRegression(
    # Balance the class labels in the training data by setting
    # class_weight="balanced".
    #
    # By default, the training data is unweighted. If the labels
    # in the training data are imbalanced, the model may learn to
    # predict the most popular class of labels more heavily. In
    # this case, most of the respondents in the dataset are in the
    # lower income bracket. This may lead to a model that predicts
    # the lower income bracket too heavily. Class weights balance
    # the class labels by calculating the weights for each class in
    # inverse proportion to the frequency of that class.
    class_weight="balanced",
    max_iterations=15,
)
census_model.fit(X, y)

census_model.to_gbq(
    your_model_id,  # For example: "your-project.census.census_model"
    replace=True,
)

モデルのパフォーマンスを評価する

モデルを作成したら、評価データに対するモデルのパフォーマンスを評価します。

SQL

ML.EVALUATE 関数は、モデルによって生成された予測値を評価データに照らして評価します。

入力では、ML.EVALUATE 関数はトレーニング済みモデルと、dataframe 列の値が evaluation である input_data ビューの行を取得します。この関数は、モデルに関する統計を単一行で返します。

ML.EVALUATE クエリを実行します。

Google Cloud コンソールで、[BigQuery] ページに移動します。

[BigQuery] に移動

[クエリエディタ] ペインで、次のクエリを実行します。

SELECT
*
FROM
ML.EVALUATE (MODEL `census.census_model`,
  (
  SELECT
    *
  FROM
    `census.input_data`
  WHERE
    dataframe = 'evaluation'
  )
);

結果は次のようになります。

BigQuery DataFrames

score メソッドを使用して、実際のデータに照らしてモデルを評価します。

# Select model you'll use for predictions. `read_gbq_model` loads model
# data from BigQuery, but you could also use the `census_model` object
# from previous steps.
census_model = bpd.read_gbq_model(
    your_model_id,  # For example: "your-project.census.census_model"
)

# input_data is defined in an earlier step.
evaluation_data = input_data[input_data["dataframe"] == "evaluation"]
X = evaluation_data.drop(columns=["income_bracket", "dataframe"])
y = evaluation_data["income_bracket"]

# The score() method evaluates how the model performs compared to the
# actual data. Output DataFrame matches that of ML.EVALUATE().
score = census_model.score(X, y)
score.peek()
# Output:
#    precision    recall  accuracy  f1_score  log_loss   roc_auc
# 0   0.685764  0.536685   0.83819  0.602134  0.350417  0.882953

Google Cloud コンソールのモデルの [評価] ペインで、トレーニング中に計算された評価指標を確認することもできます。

ML.EVALUATE の出力

収入階層を予測する

モデルを使用して、各回答者の最も可能性の高い収入階層を予測します。

SQL

ML.PREDICT 関数を使用して、可能性のある収入階層を予測します。入力では、ML.PREDICT 関数はトレーニング済みモデルと、dataframe 列の値が prediction である input_data ビューの行を取得します。

ML.PREDICT クエリを実行します。

Google Cloud コンソールで、[BigQuery] ページに移動します。

[BigQuery] に移動

[クエリエディタ] ペインで、次のクエリを実行します。

SELECT
*
FROM
ML.PREDICT (MODEL `census.census_model`,
  (
  SELECT
    *
  FROM
    `census.input_data`
  WHERE
    dataframe = 'prediction'
  )
);

結果は次のようになります。

predicted_income_bracket 列には、回答者の予測収入階層が含まれます。

BigQuery DataFrames

predict メソッドを使用して、可能性のある収入階層を予測します。

# Select model you'll use for predictions. `read_gbq_model` loads model
# data from BigQuery, but you could also use the `census_model` object
# from previous steps.
census_model = bpd.read_gbq_model(
    your_model_id,  # For example: "your-project.census.census_model"
)

# input_data is defined in an earlier step.
prediction_data = input_data[input_data["dataframe"] == "prediction"]

predictions = census_model.predict(prediction_data)
predictions.peek()
# Output:
#           predicted_income_bracket                     predicted_income_bracket_probs  age workclass  ... occupation  hours_per_week income_bracket   dataframe
# 18004                    <=50K  [{'label': ' >50K', 'prob': 0.0763305999358786...   75         ?  ...          ?               6          <=50K  prediction
# 18886                    <=50K  [{'label': ' >50K', 'prob': 0.0448866871906495...   73         ?  ...          ?              22           >50K  prediction
# 31024                    <=50K  [{'label': ' >50K', 'prob': 0.0362982319421936...   69         ?  ...          ?               1          <=50K  prediction
# 31022                    <=50K  [{'label': ' >50K', 'prob': 0.0787836112058324...   75         ?  ...          ?               5          <=50K  prediction
# 23295                    <=50K  [{'label': ' >50K', 'prob': 0.3385373037905673...   78         ?  ...          ?              32          <=50K  prediction

予測結果について説明する

モデルがこれらの予測結果を生成する理由を理解するには、ML.EXPLAIN_PREDICT 関数を使用します。

ML.EXPLAIN_PREDICT は、ML.PREDICT 関数の拡張バージョンです。ML.EXPLAIN_PREDICT は、予測結果だけでなく、予測結果の説明に使用する追加の列も出力します。説明可能性の詳細については、BigQuery ML Explainable AI の概要をご覧ください。

ML.EXPLAIN_PREDICT クエリを実行します。

Google Cloud コンソールで、[BigQuery] ページに移動します。

[BigQuery] に移動

[クエリエディタ] ペインで、次のクエリを実行します。

SELECT
*
FROM
ML.EXPLAIN_PREDICT(MODEL `census.census_model`,
  (
  SELECT
    *
  FROM
    `census.input_data`
  WHERE
    dataframe = 'evaluation'),
  STRUCT(3 as top_k_features));

結果は次のようになります。

ロジスティック回帰モデルでは、Shapley 値を使用して、モデル内の各特徴の相対的な特徴アトリビューションを決定します。クエリで top_k_features オプションが 3 に設定されているため、ML.EXPLAIN_PREDICT は input_data ビューの各行の上位 3 つの特徴アトリビューションを出力します。これらのアトリビューションは、アトリビューションの絶対値の降順で表示されます。

モデルをグローバルに説明する

収入階層を判断するうえで最も重要な特徴を特定するには、ML.GLOBAL_EXPLAIN 関数を使用します。

モデルのグローバルな説明を取得します。

Google Cloud コンソールで、[BigQuery] ページに移動します。

[BigQuery] に移動
クエリエディタで次のクエリを実行して、グローバルな説明を取得します。
```
SELECT
  *
FROM
  ML.GLOBAL_EXPLAIN(MODEL `census.census_model`)
```
結果は次のようになります。