Create machine learning models in BigQuery ML using SQL

This tutorial shows you how to create a logistic regression model by using BigQuery ML SQL queries.

BigQuery ML lets you create and train machine learning models in BigQuery by using SQL queries. This makes machine learning more approachable by letting you use familiar tools such as the BigQuery SQL editor, and it speeds up development by removing the need to move data into a separate machine learning environment.

In this tutorial, you use the sample Google Analytics dataset for BigQuery to create a model that predicts whether a website visitor will make a transaction. For information about the schema of the Analytics dataset, see BigQuery Export schema in the Analytics Help Center.

To learn how to create models by using the Google Cloud console user interface, see Work with models by using a UI (Preview).

Objectives

This tutorial shows you how to perform the following tasks:

Use the CREATE MODEL statement to create a binary logistic regression model.
Use the ML.EVALUATE function to evaluate the model.
Use the ML.PREDICT function to make predictions by using the model.

Costs

This tutorial uses billable components of Google Cloud, including the following:

BigQuery
BigQuery ML

For more information about BigQuery costs, see the BigQuery pricing page.

For more information about BigQuery ML costs, see BigQuery ML pricing.

Required roles

To create a model and run inference, you must be granted the following roles:
- BigQuery Data Editor (roles/bigquery.dataEditor)
- BigQuery User (roles/bigquery.user)

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Roles required to select or create a project

Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
Create a project: To create a project, you need the Project Creator role (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.

Go to project selector

Verify that billing is enabled for your Google Cloud project.

Make sure that you have the following role or roles on the project: BigQuery Data Editor, BigQuery Job User, Service Usage Admin

Check for the roles

In the Google Cloud console, go to the IAM page.
Go to IAM
Select the project.
In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.

Grant the roles

In the Google Cloud console, go to the IAM page.
Go to IAM
Select the project.
Click Grant access.
In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
In the Select a role list, select a role.
To grant additional roles, click Add another role and add each additional role.
Click Save.

BigQuery is automatically enabled in new projects. To activate BigQuery in an existing project, enable the BigQuery API.

Roles required to enable APIs

To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.

Enable the API

Create a dataset

Create a BigQuery dataset to store your ML model.

Console

In the Google Cloud console, go to the BigQuery page.

Go to the BigQuery page
In the Explorer pane, click your project name.
Click View actions > Create dataset.
On the Create dataset page, do the following:
- For Dataset ID, enter bqml_tutorial.
- For Location type, select Multi-region, and then select US (multiple regions in United States).
- Leave the remaining default settings as they are, and click Create dataset.

bq

To create a new dataset, use the bq mk command with the --location flag. For a full list of possible parameters, see the bq mk --dataset command reference.

Create a dataset named bqml_tutorial with the data location set to US and the description BigQuery ML tutorial dataset:
```
bq --location=US mk -d \
 --description "BigQuery ML tutorial dataset." \
 bqml_tutorial
```
This command uses the -d shortcut instead of the --dataset flag. If you omit both -d and --dataset, the command defaults to creating a dataset.
Confirm that the dataset was created:
```
bq ls
```

API

Call the datasets.insert method with a defined dataset resource.

{
  "datasetReference": {
     "datasetId": "bqml_tutorial"
  }
}

BigQuery DataFrames

Before trying this sample, follow the BigQuery DataFrames setup instructions in the BigQuery quickstart using BigQuery DataFrames. For more information, see the BigQuery DataFrames reference documentation.

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import google.cloud.bigquery

bqclient = google.cloud.bigquery.Client()
bqclient.create_dataset("bqml_tutorial", exists_ok=True)

Create a logistic regression model

Create a logistic regression model by using the Analytics sample dataset for BigQuery.

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement:

CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20160801' AND '20170630'

The query takes several minutes to complete. After the first iteration is complete, your model (sample_model) appears in the navigation panel. Because the query uses a CREATE MODEL statement to create a model, you don't see query results.

Query details

The CREATE MODEL statement creates the model and then trains it by using the data retrieved by your query's SELECT statement.

The OPTIONS(model_type='logistic_reg') clause creates a logistic regression model. A logistic regression model splits input data into two classes and then estimates the probability that the data is in one of the classes. What you are trying to detect, such as whether an email is spam, is represented by 1, and other values are represented by 0. The likelihood of a given value belonging to the class you are trying to detect is indicated by a value between 0 and 1. For example, if an email receives a probability estimate of 0.9, then there is a 90% probability that the email is spam.
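
The following is a minimal sketch of the probability idea described above, not BigQuery ML's implementation. It assumes the standard logistic (sigmoid) function and a hypothetical decision threshold of 0.5; the raw scores are made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability: float, threshold: float = 0.5) -> int:
    """Return 1 (the class being detected) when the probability reaches the threshold."""
    return 1 if probability >= threshold else 0


# Hypothetical raw model scores, chosen only for illustration.
for score in (-3.0, 0.0, 2.2):
    p = sigmoid(score)
    print(f"score={score:+.1f}  probability={p:.3f}  class={classify(p)}")
```

A score of 2.2 maps to a probability of roughly 0.9, which corresponds to the 90% spam example in the text.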

This query's SELECT statement retrieves the following columns, which the model uses to predict the probability that a customer will complete a transaction:

totals.transactions: the total number of ecommerce transactions within the session. If the number of transactions is NULL, the value in the label column is set to 0. Otherwise, it is set to 1. These values represent the possible outcomes. Creating an alias named label is an alternative to setting the input_label_cols= option in the CREATE MODEL statement.
device.operatingSystem: the operating system of the visitor's device.
device.isMobile: indicates whether the visitor's device is a mobile device.
geoNetwork.country: the country from which the session originated, based on the IP address.
totals.pageviews: the total number of page views within the session.
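
As an illustration of the column logic above, the following sketch applies the same IF and IFNULL rules to one session represented as a plain Python dictionary. The helper names (if_null, to_training_row) and the sample session are hypothetical and are not part of BigQuery ML.

```python
from typing import Any, Dict


def if_null(value: Any, default: Any) -> Any:
    """Mimic SQL IFNULL: use the default only when the value is missing."""
    return default if value is None else value


def to_training_row(session: Dict[str, Any]) -> Dict[str, Any]:
    """Build the label and feature columns that the SELECT statement produces."""
    totals = session.get("totals", {})
    device = session.get("device", {})
    geo_network = session.get("geoNetwork", {})
    return {
        # IF(totals.transactions IS NULL, 0, 1) AS label
        "label": 0 if totals.get("transactions") is None else 1,
        "os": if_null(device.get("operatingSystem"), ""),
        "is_mobile": device.get("isMobile"),
        "country": if_null(geo_network.get("country"), ""),
        "pageviews": if_null(totals.get("pageviews"), 0),
    }


# Hypothetical session with no transaction and a missing operating system.
example_session = {
    "totals": {"transactions": None, "pageviews": 3},
    "device": {"operatingSystem": None, "isMobile": True},
    "geoNetwork": {"country": "Canada"},
}
print(to_training_row(example_session))
```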

The FROM clause causes the query to train the model by using the bigquery-public-data.google_analytics_sample.ga_sessions sample tables. These tables are sharded by date, so you aggregate them by using a wildcard in the table name: google_analytics_sample.ga_sessions_*.

The WHERE clause (_TABLE_SUFFIX BETWEEN '20160801' AND '20170630') limits the number of tables scanned by the query. The date range scanned is August 1, 2016 to June 30, 2017.
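
To see why a string comparison on the suffix selects the intended shards, consider the following sketch. It assumes one shard per day with a YYYYMMDD suffix; the overall shard range used here is hypothetical and serves only to illustrate the filter.

```python
from datetime import date, timedelta
from typing import List


def daily_suffixes(first: date, last: date) -> List[str]:
    """Return one YYYYMMDD suffix for each day from first to last, inclusive."""
    days = (last - first).days + 1
    return [(first + timedelta(days=i)).strftime("%Y%m%d") for i in range(days)]


# Hypothetical set of daily shards.
all_suffixes = daily_suffixes(date(2016, 8, 1), date(2017, 8, 1))

# Zero-padded YYYYMMDD strings sort in the same order as the dates they encode,
# so an inclusive string range behaves like BETWEEN on the suffix.
selected = [s for s in all_suffixes if "20160801" <= s <= "20170630"]

print(len(selected), selected[0], selected[-1])
```

Because the comparison is inclusive on both ends, the shards for August 1, 2016 and June 30, 2017 are both part of the training data.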

BigQuery DataFrames

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

from bigframes.ml.linear_model import LogisticRegression
import bigframes.pandas as bpd

# Start by selecting the data you'll use for training. `read_gbq` accepts
# either a SQL query or a table ID. Since this example selects from multiple
# tables via a wildcard, use SQL to define this data. Watch issue
# https://github.com/googleapis/python-bigquery-dataframes/issues/169
# for updates to `read_gbq` to support wildcard tables.

df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20160801"),
        ("_table_suffix", "<=", "20170630"),
    ],
)

# Extract the total number of transactions within
# the Google Analytics session.
#
# Because the totals column is a STRUCT data type, call
# Series.struct.field("transactions") to extract the transactions field.
# See the reference documentation below:
# https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor#bigframes_operations_structs_StructAccessor_field
transactions = df["totals"].struct.field("transactions")

# The "label" values represent the outcome of the model's
# prediction. In this case, the model predicts if there are any
# ecommerce transactions within the Google Analytics session.
# If the number of transactions is NULL, the value in the label
# column is set to 0. Otherwise, it is set to 1.
label = transactions.notnull().map({True: 1, False: 0}).rename("label")

# Extract the operating system of the visitor's device.
operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")

# Extract whether the visitor's device is a mobile device.
is_mobile = df["device"].struct.field("isMobile")

# Extract the country from which the sessions originated, based on the IP address.
country = df["geoNetwork"].struct.field("country").fillna("")

# Extract the total number of page views within the session.
pageviews = df["totals"].struct.field("pageviews").fillna(0)

# Combine all the feature columns into a single DataFrame
# to use as training data.
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# A logistic regression model splits data into two classes, giving
# a confidence score that the data is in one of the classes.
model = LogisticRegression()
model.fit(features, label)

# The model.fit() call above created a temporary model.
# Use the to_gbq() method to write to a permanent location.
model.to_gbq(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
    replace=True,
)

View the model's loss statistics

Machine learning is about creating a model that can use data to make a prediction. The model is essentially a function that takes inputs and applies calculations to the inputs to produce an output, which is the prediction.

Machine learning algorithms work by taking several examples where the prediction is already known (such as historical data about user purchases) and iteratively adjusting various weights in the model so that the model's predictions match the true values. The algorithm does this by minimizing how wrong the model is, using a metric called loss.

With each iteration, the loss should decrease, ideally to zero. A loss of zero means the model is 100% accurate.

When training the model, BigQuery ML automatically splits the input data into training and evaluation sets to avoid overfitting the model. This is necessary so that the training algorithm doesn't fit itself so closely to the training data that it can't generalize to new examples.
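
The following sketch illustrates the iterative process described above on a tiny, made-up dataset. It uses plain batch gradient descent on a one-feature logistic regression and reports log loss on separate training and evaluation rows. This is an assumption-laden illustration, not the optimizer or data split that BigQuery ML uses; the data, learning rate, and function names are hypothetical.

```python
import math
from typing import List, Tuple

Row = Tuple[float, int]  # (feature value, label)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(weight: float, bias: float, rows: List[Row]) -> float:
    """Average log loss over all rows."""
    total = 0.0
    for x, y in rows:
        p = sigmoid(weight * x + bias)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(rows)


def train(train_rows: List[Row], eval_rows: List[Row],
          iterations: int = 20, learning_rate: float = 0.5):
    """Adjust the weight and bias step by step, recording the loss after each step."""
    weight, bias = 0.0, 0.0
    history = []
    for _ in range(iterations):
        errors = [(sigmoid(weight * x + bias) - y, x) for x, y in train_rows]
        grad_w = sum(err * x for err, x in errors) / len(train_rows)
        grad_b = sum(err for err, x in errors) / len(train_rows)
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
        history.append((mean_log_loss(weight, bias, train_rows),
                        mean_log_loss(weight, bias, eval_rows)))
    return weight, bias, history


# Hypothetical data: a scaled feature and a 0/1 label, split by hand.
train_rows = [(0.05, 0), (0.10, 0), (0.20, 0), (0.30, 0),
              (0.35, 1), (0.60, 1), (0.70, 1), (0.90, 1)]
eval_rows = [(0.15, 0), (0.80, 1)]

weight, bias, history = train(train_rows, eval_rows)
for step, (train_loss, eval_loss) in enumerate(history, start=1):
    print(f"iteration {step:2d}  training loss={train_loss:.4f}  evaluation loss={eval_loss:.4f}")
```

The training loss shrinks on every iteration, while the evaluation loss shows how well the same weights do on rows that were held out of training.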

Use the Google Cloud console to see how the model's loss changes over the model's training iterations:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the left pane, click Explorer:

If you don't see the left pane, click Expand left pane to open the pane.
In the Explorer pane, expand your project, click Datasets, and then click the bqml_tutorial dataset.
Click the Models tab, and then click the sample_model model.
Click the Training tab and look at the Loss graph. The Loss graph shows the change in the loss metric over the iterations on the training dataset. If you hold your pointer over the graph, you can see that there are lines for Training loss and Evaluation loss. Because you performed a logistic regression, the training loss value is calculated as log loss, using the training data. The evaluation loss is the log loss calculated on the evaluation data. Both loss types represent average loss values, averaged over all examples in the respective datasets for each iteration.

You can also see the results of the model training by using the ML.TRAINING_INFO function.

Evaluate the model

Evaluate the performance of the model by using the ML.EVALUATE function. The ML.EVALUATE function evaluates the predicted values generated by the model against the actual data. To calculate metrics specific to logistic regression, you can use the ML.ROC_CURVE SQL function or the BigQuery DataFrames bigframes.ml.metrics.roc_curve function.

In this tutorial, you use a binary classification model that detects transactions. The values in the label column are the two classes generated by the model: 0 (no transactions) and 1 (transaction made).

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the query editor, run the following statement:
```
SELECT
*
FROM
ML.EVALUATE(MODEL `bqml_tutorial.sample_model`, (
SELECT
IF(totals.transactions IS NULL, 0, 1) AS label,
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(geoNetwork.country, "") AS country,
IFNULL(totals.pageviews, 0) AS pageviews
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
```
The results should look similar to the following:
```
  +--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
  |     precision      |       recall        |      accuracy       |      f1_score       |      log_loss       |       roc_auc       |
  +--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
  | 0.468503937007874  | 0.11080074487895716 | 0.98534315834767638 | 0.17921686746987953 | 0.04624221101176898 | 0.98174125874125873 |
  +--------------------+---------------------+---------------------+---------------------+---------------------+---------------------+
```
Because you performed a logistic regression, the results include the following columns:
- precision: a metric for classification models. Precision identifies the frequency with which a model was correct when predicting the positive class.
- recall: a metric for classification models that answers the following question: Out of all the possible positive labels, how many did the model correctly identify?
- accuracy: the fraction of predictions that a classification model got right.
- f1_score: a measure of the accuracy of the model. The f1 score is the harmonic average of the precision and recall. An f1 score's best value is 1. The worst value is 0.
- log_loss: the loss function used in a logistic regression. This is the measure of how far the model's predictions are from the correct labels.
- roc_auc: the area under the ROC curve. This is the probability that a classifier is more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive. For more information, see Classification in the Machine Learning Crash Course.
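
The first four metrics can be sketched from confusion-matrix counts as follows. The counts passed to classification_metrics are hypothetical; the final check recomputes the f1 score from the precision and recall values shown in the sample output above to illustrate the harmonic-average relationship.

```python
from typing import Dict


def f1_from(precision: float, recall: float) -> float:
    """Harmonic average of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute precision, recall, accuracy, and f1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1_score": f1_from(precision, recall),
    }


# Hypothetical counts: true positives, false positives, false negatives, true negatives.
print(classification_metrics(tp=30, fp=10, fn=20, tn=940))

# Precision and recall copied from the sample ML.EVALUATE output.
print(f1_from(0.468503937007874, 0.11080074487895716))
```

The recomputed value agrees with the f1_score column in the sample output. It also shows why a high accuracy can coexist with a low f1 score when the positive class is rare.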

Query details

The initial SELECT statement retrieves the columns from your model.

The FROM clause uses the ML.EVALUATE function against your model.

The nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query.

The WHERE clause (_TABLE_SUFFIX BETWEEN '20170701' AND '20170801') limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data you're using to evaluate the predictive performance of the model. It was collected in the month immediately following the time period spanned by the training data.

BigQuery DataFrames

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import bigframes.pandas as bpd

# Select the model you'll use for evaluating. `read_gbq_model` loads model data
# from BigQuery, but you could also use the `model` object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to evaluate the predictive performance of the model.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

transactions = df["totals"].struct.field("transactions")
label = transactions.notnull().map({True: 1, False: 0}).rename("label")
operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)

# Some models include a convenient .score(X, y) method for evaluation with a preset accuracy metric:

# Because you performed a logistic regression, the results include the following columns:

# - precision — A metric for classification models. Precision identifies the frequency with
# which a model was correct when predicting the positive class.

# - recall — A metric for classification models that answers the following question:
# Out of all the possible positive labels, how many did the model correctly identify?

# - accuracy — Accuracy is the fraction of predictions that a classification model got right.

# - f1_score — A measure of the accuracy of the model. The f1 score is the harmonic average of
# the precision and recall. An f1 score's best value is 1. The worst value is 0.

# - log_loss — The loss function used in a logistic regression. This is the measure of how far the
# model's predictions are from the correct labels.

# - roc_auc — The area under the ROC curve. This is the probability that a classifier is more confident that
# a randomly chosen positive example
# is actually positive than that a randomly chosen negative example is positive. For more information,
# see ['Classification']('https://developers.google.com/machine-learning/crash-course/classification/video-lecture')
# in the Machine Learning Crash Course.

model.score(features, label)
#    precision    recall  accuracy  f1_score  log_loss   roc_auc
# 0   0.412621  0.079143  0.985074  0.132812  0.049764  0.974285
# [1 rows x 6 columns]
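
The probability reading of roc_auc can be made concrete with a small sketch that compares every positive example with every negative example. This pairwise approach is an illustration on hypothetical labels and scores, not how BigQuery ML or BigQuery DataFrames computes the metric.

```python
from itertools import product
from typing import List


def pairwise_auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos, neg in product(positives, negatives):
        if pos > neg:
            wins += 1.0
        elif pos == neg:
            wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]
print(pairwise_auc(labels, scores))
```

A value of 1.0 means every positive example outranks every negative one, and 0.5 is what random scores would produce on average.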

Use the model to predict outcomes

Use the model to predict the number of transactions made by website visitors from each country.

SQL

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement:

SELECT
country,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY country
ORDER BY total_predicted_purchases DESC
LIMIT 10

The results should look similar to the following:

+----------------+---------------------------+
|    country     | total_predicted_purchases |
+----------------+---------------------------+
| United States  |                       220 |
| Taiwan         |                         8 |
| Canada         |                         7 |
| India          |                         2 |
| Turkey         |                         2 |
| Japan          |                         2 |
| Italy          |                         1 |
| Brazil         |                         1 |
| Singapore      |                         1 |
| Australia      |                         1 |
+----------------+---------------------------+

Query details

The initial SELECT statement retrieves the country column and sums the predicted_label column. The predicted_label column is generated by the ML.PREDICT function. When you use the ML.PREDICT function, the output column name for the model is predicted_<label_column_name>. For linear regression models, predicted_label is the estimated value of label. For logistic regression models, predicted_label is the label that best describes the given input data value, either 0 or 1.

The ML.PREDICT function is used to predict results by using your model.

The nested SELECT statement and FROM clause are the same as those in the CREATE MODEL query.

The WHERE clause (_TABLE_SUFFIX BETWEEN '20170701' AND '20170801') limits the number of tables scanned by the query. The date range scanned is July 1, 2017 to August 1, 2017. This is the data for which you're making predictions. It was collected in the month immediately following the time period spanned by the training data.

The GROUP BY and ORDER BY clauses group the results by country and order them by the sum of the predicted purchases in descending order.

The LIMIT clause is used here to display only the top 10 results.
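
The grouping, summing, ordering, and limiting steps described above can be sketched in plain Python as follows. The rows are hypothetical (country, predicted_label) pairs rather than real ML.PREDICT output, and the function name is made up for this illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_predicted_purchases(rows: Iterable[Tuple[str, int]],
                            limit: int = 10) -> List[Tuple[str, int]]:
    """Sum predicted labels per country, sort descending, and keep the top rows."""
    totals: Counter = Counter()
    for country, predicted_label in rows:
        totals[country] += predicted_label  # GROUP BY country, SUM(predicted_label)
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)  # ORDER BY ... DESC
    return ordered[:limit]  # LIMIT


# Hypothetical predictions: each row is one session.
rows = [
    ("United States", 1),
    ("Canada", 0),
    ("United States", 1),
    ("Taiwan", 1),
    ("Canada", 0),
]
print(top_predicted_purchases(rows, limit=2))
```

Because each predicted_label is 0 or 1, the sum per group is the number of sessions in that group that the model predicts will include a transaction.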

BigQuery DataFrames

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.

import bigframes.pandas as bpd

# Select model you'll use for predicting.
# `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model`
# object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to make the prediction.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
    }
)
# Use Logistic Regression predict method to predict results
# using your model.
# Find more information here in
# [BigFrames](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.linear_model.LogisticRegression#bigframes_ml_linear_model_LogisticRegression_predict)

predictions = model.predict(features)

# Call groupby method to group predicted_label by country.
# Call sum method to get the total_predicted_label by country.
total_predicted_purchases = predictions.groupby(["country"])[
    "predicted_label"
].sum()

# Call the sort_values method with the parameter
# ascending = False to get the highest values.
# Call head method to limit to the 10 highest values.
total_predicted_purchases.sort_values(ascending=False).head(10)

# country
# United States    220
# Taiwan             8
# Canada             7
# India              2
# Japan              2
# Turkey             2
# Australia          1
# Brazil             1
# Germany            1
# Guyana             1
# Name: predicted_label, dtype: Int64

Predict purchases per user

Predict the number of transactions each website visitor will make.

SQL

This query is identical to the query in the previous section except for the GROUP BY clause. Here, the GROUP BY clause (GROUP BY fullVisitorId) is used to group the results by visitor ID.

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery

In the query editor, run the following statement:

SELECT
fullVisitorId,
SUM(predicted_label) as total_predicted_purchases
FROM
ML.PREDICT(MODEL `bqml_tutorial.sample_model`, (
SELECT
IFNULL(device.operatingSystem, "") AS os,
device.isMobile AS is_mobile,
IFNULL(totals.pageviews, 0) AS pageviews,
IFNULL(geoNetwork.country, "") AS country,
fullVisitorId
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
_TABLE_SUFFIX BETWEEN '20170701' AND '20170801'))
GROUP BY fullVisitorId
ORDER BY total_predicted_purchases DESC
LIMIT 10

The results should look similar to the following:

  +---------------------+---------------------------+
  |    fullVisitorId    | total_predicted_purchases |
  +---------------------+---------------------------+
  | 9417857471295131045 |                         4 |
  | 112288330928895942  |                         2 |
  | 2158257269735455737 |                         2 |
  | 489038402765684003  |                         2 |
  | 057693500927581077  |                         2 |
  | 2969418676126258798 |                         2 |
  | 5073919761051630191 |                         2 |
  | 7420300501523012460 |                         2 |
  | 0456807427403774085 |                         2 |
  | 2105122376016897629 |                         2 |
  +---------------------+---------------------------+

BigQuery DataFrames

To authenticate to BigQuery, set up Application Default Credentials. For more information, see Set up ADC for a local development environment.


import bigframes.pandas as bpd

# Select model you'll use for predicting.
# `read_gbq_model` loads model data from
# BigQuery, but you could also use the `model`
# object from the previous steps.
model = bpd.read_gbq_model(
    your_model_id,  # For example: "bqml_tutorial.sample_model",
)

# The filters parameter limits the number of tables scanned by the query.
# The date range scanned is July 1, 2017 to August 1, 2017. This is the
# data you're using to make the prediction.
# It was collected in the month immediately following the time period
# spanned by the training data.
df = bpd.read_gbq_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_*",
    filters=[
        ("_table_suffix", ">=", "20170701"),
        ("_table_suffix", "<=", "20170801"),
    ],
)

operating_system = df["device"].struct.field("operatingSystem")
operating_system = operating_system.fillna("")
is_mobile = df["device"].struct.field("isMobile")
country = df["geoNetwork"].struct.field("country").fillna("")
pageviews = df["totals"].struct.field("pageviews").fillna(0)
full_visitor_id = df["fullVisitorId"]

features = bpd.DataFrame(
    {
        "os": operating_system,
        "is_mobile": is_mobile,
        "country": country,
        "pageviews": pageviews,
        "fullVisitorId": full_visitor_id,
    }
)

predictions = model.predict(features)

# Call groupby method to group predicted_label by visitor.
# Call sum method to get the total_predicted_label by visitor.
total_predicted_purchases = predictions.groupby(["fullVisitorId"])[
    "predicted_label"
].sum()

# Call the sort_values method with the parameter
# ascending = False to get the highest values.
# Call head method to limit to the 10 highest values.
total_predicted_purchases.sort_values(ascending=False).head(10)

# fullVisitorId
# 9417857471295131045    4
# 0376394056092189113    2
# 0456807427403774085    2
# 057693500927581077     2
# 112288330928895942     2
# 1280993661204347450    2
# 2105122376016897629    2
# 2158257269735455737    2
# 2969418676126258798    2
# 489038402765684003     2
# Name: predicted_label, dtype: Int64

Clean up

To avoid incurring charges to your Google Cloud account for the resources used on this page, follow these steps.

You can delete the project you created, or keep the project and delete the dataset.

Delete the dataset

Deleting your project removes all datasets and all tables in the project. If you prefer to reuse the project, you can delete the dataset you created in this tutorial:

In the Google Cloud console, go to the BigQuery page.

Go to BigQuery
In the left pane, click Explorer:
In the Explorer pane, expand your project, click Datasets, and then click the bqml_tutorial dataset that you created.
Click Delete.
In the Delete dataset dialog, confirm the delete command by typing delete.
Click Delete.

Delete the project

To delete the project:

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for the tasks in this document, when you delete it, you also delete any other work you've done in the project.
Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

If you plan to explore multiple architectures, tutorials, or quickstarts, reusing projects can help you avoid exceeding project quota limits.

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

To learn more about machine learning, see the Machine Learning Crash Course.
For an overview of BigQuery ML, see Introduction to BigQuery ML.
To learn more about the Google Cloud console, see Using the Google Cloud console.