Use BigQuery DataFrames

For support during the preview, email bigframes-feedback@google.com.

This document describes how to use BigQuery DataFrames to analyze and manipulate data in a BigQuery notebook.

BigQuery DataFrames is a Python client library that you can use to analyze data and perform ML tasks in BigQuery notebooks.

BigQuery DataFrames consists of the following parts:

bigframes.pandas implements a pandas-like API on top of BigQuery.
bigframes.ml implements a scikit-learn-like API on top of BigQuery ML.

Before you begin

Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

Go to project selector

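Make sure that billing is enabled for your Google Cloud project.

Make sure that the BigQuery API is enabled.

Enable the API

If you created a new project, the BigQuery API is enabled automatically.

Required permissions

To use BigQuery DataFrames in a BigQuery notebook, you need the following Identity and Access Management (IAM) roles:

Create a notebook

To create a new notebook, follow the instructions in Create a notebook from the BigQuery editor.

Set BigQuery DataFrames options

After installation, you need to specify the location and project in which you want to use BigQuery DataFrames.

You can define the location and project in your notebook in the following way: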

samples/snippets/set_options_test.py

import bigframes.pandas as bpd

PROJECT_ID = "bigframes-dec"  # @param {type:"string"}
REGION = "US"  # @param {type:"string"}

# Set BigQuery DataFrames options
bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = REGION
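
If you prefer not to hard-code the project and location in the notebook, one option is to read them from environment variables. The following is a minimal sketch, not part of the official samples: the variable names BIGFRAMES_PROJECT and BIGFRAMES_LOCATION and both helper functions are hypothetical. Only bpd.options.bigquery.project and bpd.options.bigquery.location come from the snippet above.

```python
import os


def resolve_bigquery_options(env=None, default_location="US"):
    """Return the project and location to use, read from a mapping.

    BIGFRAMES_PROJECT and BIGFRAMES_LOCATION are hypothetical names chosen
    for this sketch; the library itself does not read them.
    """
    env = os.environ if env is None else env
    project = env.get("BIGFRAMES_PROJECT")
    if not project:
        raise ValueError("Set BIGFRAMES_PROJECT to your Google Cloud project ID.")
    location = env.get("BIGFRAMES_LOCATION", default_location)
    return {"project": project, "location": location}


def apply_bigquery_options(options):
    """Copy the resolved values onto the BigQuery DataFrames options."""
    # Imported here so resolve_bigquery_options() works without the library.
    import bigframes.pandas as bpd

    bpd.options.bigquery.project = options["project"]
    bpd.options.bigquery.location = options["location"]
```

In a notebook cell, you would call apply_bigquery_options(resolve_bigquery_options()) before loading any data.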

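Use `bigframes.pandas`

The bigframes.pandas API provides a pandas-like API that you can use to analyze and manipulate data in BigQuery. It scales to terabytes of BigQuery data and uses the BigQuery query engine to perform calculations.

The bigframes.pandas API provides the following capabilities:

Inputs and outputs: You can access data from a variety of sources, including local CSV files, Cloud Storage files, pandas DataFrames, BigQuery models, and BigQuery functions, and load it into a BigQuery DataFrame. You can also create BigQuery tables from BigQuery DataFrames.
Data manipulation: You can use Python instead of SQL for development. You can develop all BigQuery data manipulations in Python, which removes the need to switch between languages and to capture SQL statements as text strings. The bigframes.pandas API offers more than 250 pandas functions.
Python ecosystem and visualizations: The bigframes.pandas API is a gateway to the full Python ecosystem of tools. The API supports advanced statistical operations, and you can visualize the aggregations generated from BigQuery DataFrames. You can also switch from a BigQuery DataFrame to a pandas DataFrame by using the built-in sampling operations.
Custom Python functions: You can use custom Python functions and packages. With bigframes.pandas, you can deploy remote functions that run scalar Python functions at BigQuery scale. You can persist these functions in BigQuery as SQL routines and then use them like SQL functions.

Load data from a BigQuery table or query

You can create a DataFrame from a BigQuery table or query in the following way: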

samples/snippets/load_data_from_bigquery_test.py

# Create a DataFrame from a BigQuery table:
import bigframes.pandas as bpd

query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)
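
The sample above passes a table ID. To illustrate the query form, here is a hedged sketch that builds a SQL string for the same public table and passes it to read_gbq. The helper names build_species_query and load_species are hypothetical, and the column selection is only an example.

```python
PENGUINS_TABLE = "bigquery-public-data.ml_datasets.penguins"


def build_species_query(species, limit=100):
    """Build a SQL query that selects a few columns for one species."""
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    # Keep the sketch simple: reject characters that would need SQL escaping.
    if "'" in species or "\\" in species:
        raise ValueError("species must not contain quotes or backslashes")
    return (
        "SELECT species, island, body_mass_g "
        f"FROM `{PENGUINS_TABLE}` "
        f"WHERE species = '{species}' "
        f"LIMIT {limit}"
    )


def load_species(species, limit=100):
    """Run the query in BigQuery and return a BigQuery DataFrame."""
    # Imported here so build_species_query() works without the library.
    import bigframes.pandas as bpd

    return bpd.read_gbq(build_species_query(species, limit))
```

For example, load_species("Adelie Penguin (Pygoscelis adeliae)", limit=10) would return at most ten rows for that species.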

Load data from a CSV file

You can create a DataFrame from a local or Cloud Storage CSV file in the following way:

samples/snippets/load_data_from_csv_test.py

import bigframes.pandas as bpd

filepath_or_buffer = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
df_from_gcs = bpd.read_csv(filepath_or_buffer)
# Display the first few rows of the DataFrame:
df_from_gcs.head()

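Inspect and manipulate data

You can use bigframes.pandas to perform data inspection and calculation operations.

The following code sample shows how to use bigframes.pandas to inspect the body_mass_g column, calculate the mean body_mass_g, and calculate the mean body_mass_g by species: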

samples/snippets/pandas_methods_test.py

import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Inspect one of the columns (or series) of the DataFrame:
bq_df["body_mass_g"]

# Compute the mean of this series:
average_body_mass = bq_df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

# Find the heaviest species using the groupby operation to calculate the
# mean body_mass_g:
(
    bq_df["body_mass_g"]
    .groupby(by=bq_df["species"])
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
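
To make the last expression concrete, the following plain-Python sketch reproduces the same chain (group by a key, average, sort in descending order) on a handful of in-memory rows. The rows are made-up illustrative values, not the real penguins data, and mean_by_group is a hypothetical helper. Like pandas by default, it skips missing values when averaging.

```python
from collections import defaultdict

# Made-up rows for illustration only; not taken from the penguins table.
rows = [
    {"species": "A", "body_mass_g": 3000.0},
    {"species": "A", "body_mass_g": 4000.0},
    {"species": "B", "body_mass_g": 5000.0},
    {"species": "B", "body_mass_g": None},  # missing value, skipped
    {"species": "C", "body_mass_g": 4200.0},
]


def mean_by_group(rows, key, value):
    """Average `value` per `key`, sorted from largest to smallest mean."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row[value] is None:
            continue
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    means = {group: totals[group] / counts[group] for group in totals}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


ranking = mean_by_group(rows, "species", "body_mass_g")
print(ranking)
```

With bigframes.pandas, the equivalent work runs in the BigQuery query engine rather than in the notebook's memory.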

Use `bigframes.ml`

The bigframes.ml scikit-learn-like API lets you create several types of ML models.

Regression

The following code sample shows how to use bigframes.ml to do the following:

Load data from BigQuery
Clean and prepare training data
Create and apply a bigframes.ml.linear_model.LinearRegression regression model

samples/snippets/regression_model_test.py

from bigframes.ml.linear_model import LinearRegression
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Filter the data down to the Adelie Penguin species
adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Specify your feature (or input) columns and the label (or output) column:
feature_columns = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
label_columns = training_data[["body_mass_g"]]

test_data = adelie_data[adelie_data.body_mass_g.isnull()]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

# Score the model
score = model.score(feature_columns, label_columns)

# Predict using the model
result = model.predict(test_data)
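
The score step evaluates the fitted model. As background on what a regression score can express, the following plain-Python sketch computes the coefficient of determination (R²), a standard regression metric. It assumes that R² is among the metrics that model.score reports, which this document does not state. The function r2_score and the numbers are illustrative only.

```python
def r2_score(actual, predicted):
    """Return R² = 1 - SS_res / SS_tot for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Made-up body masses in grams, for illustration only.
actual = [3000.0, 3500.0, 4000.0]
predicted = [3100.0, 3400.0, 4100.0]
print(r2_score(actual, predicted))
```

A value close to 1 means the predictions explain most of the variation in the label. Note that the sample above scores the model on its own training data, so the score says little about how the model generalizes.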

Clustering

You can use the bigframes.ml.cluster module to create estimators for clustering models.

The following code sample shows how to use the bigframes.ml.cluster KMeans class to create a k-means clustering model for data segmentation:

samples/snippets/clustering_model_test.py

from bigframes.ml.cluster import KMeans
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create the KMeans model
cluster_model = KMeans(n_clusters=10)
cluster_model.fit(bq_df["culmen_length_mm"], bq_df["sex"])

# Predict using the model
result = cluster_model.predict(bq_df)
# Score the model
score = cluster_model.score(bq_df)

LLM remote models

You can use the bigframes.ml.llm module to create estimators for remote large language models (LLMs).

The following code sample shows how to use the bigframes.ml.llm PaLM2TextGenerator class to create a PaLM2 text generation model:

samples/snippets/gen_ai_model_test.py

from bigframes.ml.llm import PaLM2TextGenerator
import bigframes.pandas as bpd

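# Create the LLM model
# PROJECT_ID and REGION are the values set in the options snippet; CONN_NAME
# must hold the ID of an existing BigQuery connection.
session = bpd.get_global_session()
connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
model = PaLM2TextGenerator(session=session, connection_name=connection)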

df_api = bpd.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")

# Prepare the prompts and send them to the LLM model for prediction
df_prompt_prefix = "Generate Pandas sample code for DataFrame."
df_prompt = df_prompt_prefix + df_api["API"]

# Predict using the model
df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)
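
Two string-building steps in the sample above are easy to misread: the connection name is three dot-separated parts, and the prompt is the prefix concatenated directly with each value in the API column, with no separator. The following plain-Python sketch shows both on ordinary strings. The helper names are hypothetical, and the API names "head" and "merge" are assumed example values, since the contents of the CSV file are not shown here.

```python
def build_connection_name(project_id, region, connection_id):
    """Join the three parts the sample uses: project, region, connection."""
    parts = (project_id, region, connection_id)
    if not all(parts):
        raise ValueError("project_id, region, and connection_id are all required")
    return ".".join(parts)


def build_prompts(prefix, api_names):
    """Mirror `prefix + column`: plain concatenation, no separator added."""
    return [prefix + name for name in api_names]


prefix = "Generate Pandas sample code for DataFrame."
# "head" and "merge" are assumed example values for the API column.
prompts = build_prompts(prefix, ["head", "merge"])
print(build_connection_name("my-project", "us", "my-connection"))
print(prompts[0])
```

Because the prefix ends with "DataFrame.", direct concatenation produces prompts such as "Generate Pandas sample code for DataFrame.head".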

What's next

To learn how to perform analysis and ML tasks with BigQuery DataFrames in a BigQuery notebook, see the BigQuery DataFrames quickstart.
To explore BigQuery DataFrames, see the BigQuery DataFrames library reference documentation.
To review the source code, see the BigQuery DataFrames source code on GitHub.