Try BigQuery DataFrames

In this quickstart, you use the BigQuery DataFrames API in a BigQuery notebook to perform the following analysis and machine learning (ML) tasks:

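  • Create a DataFrame over the bigquery-public-data.ml_datasets.penguins public dataset.
  • Calculate the average body mass of a penguin.
  • Create a linear regression model.
  • Create a DataFrame over a subset of the penguin data to use as training data.
  • Clean up the training data.
  • Set the model parameters.
  • Fit the model.
  • Score the model.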

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Verify that the BigQuery API is enabled.

    Enable the API

    If you created a new project, the BigQuery API is automatically enabled.

  5. Required permissions

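    To create and run notebooks, you need the following Identity and Access Management (IAM) roles:

    Create a notebook

    Follow the instructions in "Create a notebook from the BigQuery editor" to create a new notebook.

    Try BigQuery DataFrames

    To try BigQuery DataFrames, follow these steps:

    1. Create a new code cell in the notebook.
    2. Copy the following code and paste it into the code cell: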

      import bigframes.pandas as bpd
      
      # Set BigQuery DataFrames options
      # Note: The project option is not required in all environments.
      # On BigQuery Studio, the project ID is automatically detected.
      bpd.options.bigquery.project = your_gcp_project_id
      
      # Use "partial" ordering mode to generate more efficient queries, but the
      # order of the rows in DataFrames may not be deterministic if you have not
      # explicitly sorted it. Some operations that depend on the order, such as
      # head(), will not function until you explicitly order the DataFrame. Set
      # the ordering mode to "strict" (default) for more pandas compatibility.
      bpd.options.bigquery.ordering_mode = "partial"
      
      # Create a DataFrame from a BigQuery table
      query_or_table = "bigquery-public-data.ml_datasets.penguins"
      df = bpd.read_gbq(query_or_table)
      
      # Efficiently preview the results using the .peek() method.
      df.peek()
      
      # Use the DataFrame just as you would a pandas DataFrame, but calculations
      # happen in the BigQuery query engine instead of the local system.
      average_body_mass = df["body_mass_g"].mean()
      print(f"average_body_mass: {average_body_mass}")
      
      # Create the Linear Regression model
      from bigframes.ml.linear_model import LinearRegression
      
      # Filter down to the data we want to analyze
      adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]
      
      # Drop the columns we don't care about
      adelie_data = adelie_data.drop(columns=["species"])
      
      # Drop rows with nulls to get our training data
      training_data = adelie_data.dropna()
      
      # Pick feature columns and label column
      X = training_data[
          [
              "island",
              "culmen_length_mm",
              "culmen_depth_mm",
              "flipper_length_mm",
              "sex",
          ]
      ]
      y = training_data[["body_mass_g"]]
      
      model = LinearRegression(fit_intercept=False)
      model.fit(X, y)
      model.score(X, y)
      
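    3. Modify the bpd.options.bigquery.project = your_gcp_project_id line to specify your project, for example bpd.options.bigquery.project = "myproject".

    4. Run the code cell.

      The code cell returns the average body mass of the penguins in the dataset, and then returns the evaluation metrics for the model.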
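
To make the sequence in the cell easier to follow, here is a minimal pure-Python sketch of the same steps on a handful of invented rows. It does not call BigQuery or bigframes, and it is not how the library is implemented. The helper names (mean_skipping_nulls, fit_through_origin, r2_score), the shortened species labels, and the sample values are all assumptions made for illustration. It also uses a single numeric feature, whereas the quickstart model takes several columns, including categorical ones. The fit has no intercept term, mirroring fit_intercept=False, and R² is shown as one common regression metric.

```python
# Invented rows standing in for the penguins table (not real measurements).
rows = [
    {"species": "Adelie", "flipper_length_mm": 180.0, "body_mass_g": 3600.0},
    {"species": "Adelie", "flipper_length_mm": 190.0, "body_mass_g": 3800.0},
    {"species": "Adelie", "flipper_length_mm": 200.0, "body_mass_g": 3900.0},
    {"species": "Adelie", "flipper_length_mm": None, "body_mass_g": 3700.0},
    {"species": "Adelie", "flipper_length_mm": 185.0, "body_mass_g": None},
    {"species": "Gentoo", "flipper_length_mm": 220.0, "body_mass_g": 5000.0},
]


def mean_skipping_nulls(values):
    """Average the non-null values, as a pandas-style mean() does by default."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def fit_through_origin(xs, ys):
    """Least-squares slope for y = w * x, that is, a fit with no intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def r2_score(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Average body mass over every row that has a value.
average_body_mass = mean_skipping_nulls([r["body_mass_g"] for r in rows])

# Keep one species, drop the species column, then drop rows containing nulls.
adelie = [
    {k: v for k, v in r.items() if k != "species"}
    for r in rows
    if r["species"] == "Adelie"
]
training_data = [r for r in adelie if all(v is not None for v in r.values())]

# Single feature column and label column.
xs = [r["flipper_length_mm"] for r in training_data]
ys = [r["body_mass_g"] for r in training_data]

# Fit on the training data, then score against the same data, as the cell does.
slope = fit_through_origin(xs, ys)
r2 = r2_score(ys, [slope * x for x in xs])

print(f"average_body_mass: {average_body_mass}")
print(f"slope: {slope:.3f}, r2: {r2:.3f}")
```

As in the quickstart cell, the model is scored on the rows it was trained on, so the metric describes fit quality and not performance on unseen data.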

    Clean up

    The easiest way to avoid incurring charges is to delete the project that you created for this tutorial.

    To delete the project:

    1. In the Google Cloud console, go to the Manage resources page.

      Go to Manage resources

    2. In the project list, select the project that you want to delete, and then click Delete.
    3. In the dialog, type the project ID, and then click Shut down to delete the project.

    What's next

    Try the BigQuery DataFrames getting started notebook.