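Try BigQuery DataFrames
In this quickstart, you use the BigQuery DataFrames API in a BigQuery notebook to perform the following analysis and machine learning (ML) tasks:
- Create a DataFrame over the bigquery-public-data.ml_datasets.penguins public dataset.
- Calculate the average body mass of a penguin.
- Create a linear regression model.
- Create a DataFrame over a subset of the penguin data to use as training data.
- Clean up the training data.
- Set the model parameters.
- Fit the model.
- Score the model.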
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Make sure that the BigQuery API is enabled. If you created a new project, the BigQuery API is enabled automatically.
Required permissions
To create and run notebooks, you need the following Identity and Access Management (IAM) roles:
- BigQuery User (roles/bigquery.user)
- Notebook Runtime User (roles/aiplatform.notebookRuntimeUser)
- Code Creator (roles/dataform.codeCreator)
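As a quick sanity check before you create the notebook, the following minimal sketch compares a list of granted roles against the three roles listed above. The `granted` list is hypothetical; in practice you would copy the roles shown for your principal on the IAM page. The helper name `missing_roles` is illustrative and is not part of any Google Cloud library.

```python
# Roles named in the "Required permissions" list above.
REQUIRED_ROLES = {
    "roles/bigquery.user",
    "roles/aiplatform.notebookRuntimeUser",
    "roles/dataform.codeCreator",
}


def missing_roles(granted_roles):
    """Return the required roles that are absent from granted_roles, sorted."""
    return sorted(REQUIRED_ROLES - set(granted_roles))


# Hypothetical example input: replace with the roles granted to your account.
granted = ["roles/bigquery.user", "roles/viewer"]

for role in missing_roles(granted):
    print(f"Missing role: {role}")
```

An empty result means that every role needed for this quickstart appears in the list. The sketch only compares role names; it does not account for permissions inherited through other roles.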
Create a notebook
Follow the instructions in Create a notebook from the BigQuery editor to create a new notebook.
Try BigQuery DataFrames
Try BigQuery DataFrames by following these steps:
- Create a new code cell in the notebook.
- Copy the following code and paste it into the code cell:
    import bigframes.pandas as bpd

    # Set BigQuery DataFrames options
    bpd.options.bigquery.project = your_gcp_project_id
    bpd.options.bigquery.location = "us"

    # Create a DataFrame from a BigQuery table
    query_or_table = "bigquery-public-data.ml_datasets.penguins"
    df = bpd.read_gbq(query_or_table)

    # Use the DataFrame just as you would a pandas DataFrame, but calculations
    # happen in the BigQuery query engine instead of the local system.
    average_body_mass = df["body_mass_g"].mean()
    print(f"average_body_mass: {average_body_mass}")

    # Create the Linear Regression model
    from bigframes.ml.linear_model import LinearRegression

    # Filter down to the data we want to analyze
    adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

    # Drop the columns we don't care about
    adelie_data = adelie_data.drop(columns=["species"])

    # Drop rows with nulls to get our training data
    training_data = adelie_data.dropna()

    # Pick feature columns and label column
    X = training_data[
        [
            "island",
            "culmen_length_mm",
            "culmen_depth_mm",
            "flipper_length_mm",
            "sex",
        ]
    ]
    y = training_data[["body_mass_g"]]

    model = LinearRegression(fit_intercept=False)
    model.fit(X, y)
    model.score(X, y)
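- Modify the bpd.options.bigquery.project = your_gcp_project_id line to specify your project, for example bpd.options.bigquery.project = "myproject".
- Run the code cell.
The code cell returns the average body mass of the penguins in the dataset, and then returns the evaluation metrics for the model.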
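To make the evaluation metrics easier to interpret, the following sketch fits a one-feature linear model without an intercept (the same idea as fit_intercept=False) and computes a few common regression metrics by hand. It runs locally in plain Python and does not call BigQuery. The data values are invented for illustration and are not taken from the penguins table. The metric names used as dictionary keys are an assumption, because this quickstart does not list the exact columns that model.score returns.

```python
# Toy data invented for illustration; NOT values from the penguins table.
flipper_length_mm = [180.0, 185.0, 190.0, 195.0, 200.0]
body_mass_g = [3400.0, 3550.0, 3700.0, 3800.0, 4000.0]


def fit_through_origin(xs, ys):
    """Least-squares slope w for the model y = w * x (no intercept term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def evaluate(ys, predictions):
    """Compute mean absolute error, mean squared error, and R^2."""
    n = len(ys)
    mean_y = sum(ys) / n
    errors = [y - p for y, p in zip(ys, predictions)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in ys)    # total sum of squares
    return {
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": ss_res / n,
        "r2_score": 1.0 - ss_res / ss_tot,
    }


slope = fit_through_origin(flipper_length_mm, body_mass_g)
predictions = [slope * x for x in flipper_length_mm]
metrics = evaluate(body_mass_g, predictions)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the quickstart scores the model on the same rows that it was trained on, the reported metrics describe the fit to the training data rather than performance on unseen data.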
Clean up
The easiest way to avoid incurring charges is to delete the project that you created for this tutorial.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
Try the BigQuery DataFrames getting started notebook.