Try BigQuery DataFrames

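In this quickstart, you use the BigQuery DataFrames API in a BigQuery notebook to perform the following analysis and machine learning (ML) tasks:

  • Create a DataFrame over the bigquery-public-data.ml_datasets.penguins public dataset.
  • Calculate the average body mass of a penguin.
  • Create a linear regression model.
  • Create a DataFrame over a subset of the penguin data to use as training data.
  • Clean up the training data.
  • Set the model parameters.
  • Fit the model.
  • Score the model.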

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Make sure that the BigQuery API is enabled.

    Enable the API

    If you created a new project, the BigQuery API is automatically enabled.

Required permissions

To create and run notebooks, you need the following Identity and Access Management (IAM) roles:

Create a notebook

Follow the instructions in Create a notebook from the BigQuery editor to create a new notebook.

Try BigQuery DataFrames

To try BigQuery DataFrames, follow these steps:

  1. Create a new code cell in the notebook.
  2. Copy the following code and paste it into the code cell:

    import bigframes.pandas as bpd
    
    # Set BigQuery DataFrames options
    bpd.options.bigquery.project = your_gcp_project_id
    bpd.options.bigquery.location = "us"
    
    # Create a DataFrame from a BigQuery table
    query_or_table = "bigquery-public-data.ml_datasets.penguins"
    df = bpd.read_gbq(query_or_table)
    
    # Use the DataFrame just as you would a pandas DataFrame, but calculations
    # happen in the BigQuery query engine instead of the local system.
    average_body_mass = df["body_mass_g"].mean()
    print(f"average_body_mass: {average_body_mass}")
    
    # Create the Linear Regression model
    from bigframes.ml.linear_model import LinearRegression
    
    # Filter down to the data we want to analyze
    adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]
    
    # Drop the columns we don't care about
    adelie_data = adelie_data.drop(columns=["species"])
    
    # Drop rows with nulls to get our training data
    training_data = adelie_data.dropna()
    
    # Pick feature columns and label column
    X = training_data[
        [
            "island",
            "culmen_length_mm",
            "culmen_depth_mm",
            "flipper_length_mm",
            "sex",
        ]
    ]
    y = training_data[["body_mass_g"]]
    
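    # Set the model parameters; fit_intercept=False fits without an intercept term
    model = LinearRegression(fit_intercept=False)
    # Fit the model to the training features and label
    model.fit(X, y)
    # Score the model on the training data to get its evaluation metrics
    model.score(X, y)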
    
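  3. Modify the bpd.options.bigquery.project = your_gcp_project_id line to specify your project, for example bpd.options.bigquery.project = "myproject".

  4. Run the code cell.

    The code cell returns the average body mass of the penguins in the dataset, and then returns the evaluation metrics for the model.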
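
For a rough sense of what regression evaluation metrics measure, the following is a minimal local sketch in plain Python that computes two common ones, mean absolute error and R², from actual and predicted values. It does not call BigQuery DataFrames. The function name regression_metrics and the sample body-mass values are hypothetical and used only for illustration, and the set of metrics that model.score returns may differ from the two shown here.

```python
from statistics import mean


def regression_metrics(actual, predicted):
    """Return mean absolute error and R^2 for paired actual/predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")

    # Mean absolute error: the average size of the prediction errors.
    mae = mean(abs(a - p) for a, p in zip(actual, predicted))

    # R^2: 1 minus the ratio of residual variation to total variation.
    actual_mean = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - actual_mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    r2 = 1 - ss_res / ss_tot

    return {"mean_absolute_error": mae, "r2_score": r2}


# Hypothetical body masses in grams (illustrative only, not from the penguins dataset).
actual_mass = [3700.0, 3800.0, 3250.0, 3450.0]
predicted_mass = [3650.0, 3900.0, 3300.0, 3400.0]

metrics = regression_metrics(actual_mass, predicted_mass)
print(metrics)
```

An R² value close to 1 means the predictions account for most of the variation in the label. Because the quickstart code scores the model on the same rows it was trained on, the metrics it returns describe the fit to the training data rather than performance on unseen data.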

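Clean up

The easiest way to avoid incurring charges is to delete the project that you created for this tutorial.

To delete the project, do the following: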

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Try the BigQuery DataFrames getting started notebook.