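Try BigQuery DataFrames

In this quickstart, you use the BigQuery DataFrames API in a BigQuery notebook to perform the following analysis and machine learning (ML) tasks:

  • Create a DataFrame over the bigquery-public-data.ml_datasets.penguins public dataset.
  • Calculate the average body mass of a penguin.
  • Create a linear regression model.
  • Create a DataFrame over a subset of the penguin data to use as training data.
  • Clean up the training data.
  • Set the model parameters.
  • Fit the model.
  • Score the model.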

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

  3. Verify that billing is enabled for your Google Cloud project.

  4. Verify that the BigQuery API is enabled.

    If you created a new project, the BigQuery API is automatically enabled.

  5. Required permissions

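    To create and run notebooks, you need the following Identity and Access Management (IAM) roles:

    Create a notebook

    Follow the instructions in Create a notebook from the BigQuery editor to create a new notebook.

    Try BigQuery DataFrames

    Follow these steps to try BigQuery DataFrames:

    1. Create a new code cell in the notebook.
    2. Add the following code to the code cell: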

      import bigframes.pandas as bpd
      
      # Set BigQuery DataFrames options
      # Note: The project option is not required in all environments.
      # On BigQuery Studio, the project ID is automatically detected.
      bpd.options.bigquery.project = your_gcp_project_id
      
      # Use "partial" ordering mode to generate more efficient queries, but the
      # order of the rows in DataFrames may not be deterministic if you have not
      # explicitly sorted it. Some operations that depend on the order, such as
      # head(), will not function until you explicitly order the DataFrame. Set the
      # ordering mode to "strict" (default) for more pandas compatibility.
      bpd.options.bigquery.ordering_mode = "partial"
      
      # Create a DataFrame from a BigQuery table
      query_or_table = "bigquery-public-data.ml_datasets.penguins"
      df = bpd.read_gbq(query_or_table)
      
      # Efficiently preview the results using the .peek() method.
      df.peek()
      
    3. Modify the bpd.options.bigquery.project = your_gcp_project_id line to specify your Google Cloud project ID. For example, bpd.options.bigquery.project = "myProjectID".

    4. Run the code cell.

      The code returns a DataFrame object that contains the penguin data.

    5. Create a new code cell in the notebook and add the following code:

      # Use the DataFrame just as you would a pandas DataFrame, but calculations
      # happen in the BigQuery query engine instead of the local system.
      average_body_mass = df["body_mass_g"].mean()
      print(f"average_body_mass: {average_body_mass}")
      
    6. Run the code cell.

      The code calculates the average body mass of the penguins and prints it to the Google Cloud console.

    7. Create a new code cell in the notebook and add the following code:

      # Create the Linear Regression model
      from bigframes.ml.linear_model import LinearRegression
      
      # Filter down to the data we want to analyze
      adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]
      
      # Drop the columns we don't care about
      adelie_data = adelie_data.drop(columns=["species"])
      
      # Drop rows with nulls to get our training data
      training_data = adelie_data.dropna()
      
      # Pick feature columns and label column
      X = training_data[
          [
              "island",
              "culmen_length_mm",
              "culmen_depth_mm",
              "flipper_length_mm",
              "sex",
          ]
      ]
      y = training_data[["body_mass_g"]]
      
      model = LinearRegression(fit_intercept=False)
      model.fit(X, y)
      model.score(X, y)
      
    8. Run the code cell.

      The code returns the model's evaluation metrics.
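
The final step reports evaluation metrics for the fitted regression model. As a rough illustration of what such metrics measure, the following plain-Python sketch computes a few common ones on made-up values. The function name regression_metrics, the sample numbers, and the choice of metrics are assumptions for illustration only; they are not the output of BigQuery DataFrames and may not match the exact set of columns that model.score returns.

```python
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics for paired actual/predicted values.

    Hypothetical helper for illustration; not part of BigQuery DataFrames.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    y_mean = mean(y_true)

    # Residual sum of squares: the variation the model fails to explain.
    ss_res = sum(e * e for e in errors)
    # Total sum of squares: the variation of the labels around their mean.
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("r2_score is undefined when all y_true values are equal")

    return {
        "mean_absolute_error": mean(abs_errors),
        "mean_squared_error": ss_res / len(errors),
        "median_absolute_error": median(abs_errors),
        "r2_score": 1.0 - ss_res / ss_tot,
    }


# Made-up body mass values in grams (assumed data, not from the penguins table).
actual = [3500.0, 3700.0, 3900.0, 4100.0]
predicted = [3600.0, 3650.0, 3900.0, 4000.0]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

An r2_score close to 1 means the predictions account for most of the variation in the labels, while the error metrics are expressed in the units of the label (grams here) or their square.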

    Clean up

    The easiest way to avoid incurring charges is to delete the project that you created for this tutorial.

    To delete the project:

    1. In the Google Cloud console, go to the Manage resources page.

    2. In the project list, select the project that you want to delete, and then click Delete.
    3. In the dialog, type the project ID, and then click Shut down to delete the project.

    What's next