
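Explore query results in notebooks

You can use Colab Enterprise notebooks in BigQuery to explore BigQuery query results.

In this tutorial, you query data from a BigQuery public dataset and then explore the query results in a notebook.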

Required permissions

To create and run notebooks, you need the following Identity and Access Management (IAM) roles:

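Open query results in a notebook

You can run a SQL query and then use a notebook to explore the data. This approach is useful if you want to modify the data in BigQuery before working with it, or if you need only a subset of the fields in the table.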

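In the Google Cloud console, go to the BigQuery page.

In the Type to search field, enter bigquery-public-data.

If the project is not shown, enter bigquery in the search field, and then click Search all projects to match the search string against the existing projects.
Select bigquery-public-data > ml_datasets > penguins.
For the penguins table, click View actions, and then click Query.
Add an asterisk (*) to the generated query so that it selects all fields, as follows: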
```
SELECT * FROM `bigquery-public-data.ml_datasets.penguins` LIMIT 1000;
```
Click Run.
In the Query results section, click Open in, and then click Notebook.
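
If you need only some of the fields, the same query can also be assembled in code. The following is a minimal sketch, not part of the console workflow: the helper names build_select_query and load_as_dataframe are hypothetical, the column names are taken from the penguins table used later on this page, and the trailing semicolon is left out because the query is passed as a string rather than typed into the editor. Running load_as_dataframe assumes that the bigframes package is installed and that credentials are configured.

```python
def build_select_query(table, columns=None, limit=1000):
    """Return a SELECT statement for `table`, optionally limited to some columns."""
    # With no column list, fall back to the asterisk used in the console example.
    selected = ", ".join(columns) if columns else "*"
    return f"SELECT {selected} FROM `{table}` LIMIT {limit}"


def load_as_dataframe(query):
    """Run the query with BigQuery DataFrames (needs bigframes and credentials)."""
    # Imported lazily so that build_select_query can be used without bigframes.
    import bigframes.pandas as bpd

    return bpd.read_gbq(query)


# Hypothetical usage: keep only three fields of the penguins table.
query = build_select_query(
    "bigquery-public-data.ml_datasets.penguins",
    columns=["species", "island", "body_mass_g"],
)
print(query)
```

Keeping the query builder separate from the loader means the SQL text can be inspected or pasted into the query editor before anything is run.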

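Prepare the notebook for use

Prepare the notebook for use by connecting to a runtime and setting application default values.

In the notebook header, click Connect to connect to the default runtime.
In the Setup code block, click Run cell.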

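Explore the data

To load the penguins data into a BigQuery DataFrame and display the results, click Run cell in the code block in the "Result set loaded from BigQuery job as a DataFrame" section.
To get descriptive metrics for the data, click Run cell in the code block in the "Show descriptive statistics using describe()" section.
Optional: Use other Python functions or packages to explore and analyze the data.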
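
To make the describe() step more concrete, the following sketch computes the usual pandas-style summary (count, mean, standard deviation, minimum, quartiles, maximum) for a single numeric column, using only the Python standard library. It is an illustration under stated assumptions: the helper describe_numeric is hypothetical, the sample values are made up for the example and are not a query result, and BigQuery DataFrames may compute quantiles differently (for example, approximately) on real tables.

```python
import statistics


def describe_numeric(values):
    """Summarize a numeric column in the style of a pandas describe() call."""
    # Null values are excluded from every statistic.
    data = sorted(v for v in values if v is not None)
    # "inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "count": len(data),
        "mean": statistics.mean(data),
        "std": statistics.stdev(data),  # sample standard deviation
        "min": data[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": data[-1],
    }


# Made-up body masses in grams, including one missing value.
sample_body_mass_g = [3750.0, 3800.0, 3250.0, None, 3450.0, 3650.0]
summary = describe_numeric(sample_body_mass_g)
for name, value in summary.items():
    print(f"{name}: {value}")
```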

The following code sample uses bigframes.pandas to analyze the data, and bigframes.ml to create a linear regression model from the penguins data in a BigQuery DataFrame:

```python
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Inspect one of the columns (or series) of the DataFrame:
bq_df["body_mass_g"]

# Compute the mean of this series:
average_body_mass = bq_df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

# Find the heaviest species using the groupby operation to calculate the
# mean body_mass_g:
(
    bq_df["body_mass_g"]
    .groupby(by=bq_df["species"])
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

# Create the Linear Regression model
from bigframes.ml.linear_model import LinearRegression

# Filter down to the data we want to analyze
adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the columns we don't care about
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get our training data
training_data = adelie_data.dropna()

# Pick feature columns and label column
X = training_data[
    [
        "island",
        "culmen_length_mm",
        "culmen_depth_mm",
        "flipper_length_mm",
        "sex",
    ]
]
y = training_data[["body_mass_g"]]

model = LinearRegression(fit_intercept=False)
model.fit(X, y)
model.score(X, y)
```
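
The sample above passes fit_intercept=False, which asks for a model without a constant term, so the fitted relationship passes through the origin. The sketch below shows that idea for a single feature in plain Python, together with the R² value that regression scoring commonly reports. It is only an illustration: the helpers fit_through_origin and r2_score are hypothetical, the numbers are invented rather than taken from the penguins data, and this is not how bigframes.ml or BigQuery trains or scores the model internally.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for y ≈ slope * x, with no intercept term."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def r2_score(ys, predictions):
    """Coefficient of determination: 1 - residual variation / total variation."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented data that lies exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slope = fit_through_origin(xs, ys)
predictions = [slope * x for x in xs]
score = r2_score(ys, predictions)
print(f"slope: {slope}, r2: {score}")
```

Scoring on the same rows used for training, as both this sketch and the sample above do, measures how well the model fits the data it has seen, not how well it generalizes to new data.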