Try BigQuery DataFrames
Use this quickstart to perform the following analysis and machine learning (ML) tasks by using the BigQuery DataFrames API in a BigQuery notebook:
- Create a DataFrame over the bigquery-public-data.ml_datasets.penguins public dataset.
- Calculate the average body mass of a penguin.
- Create a linear regression model.
- Create a DataFrame over a subset of the penguin data to use as training data.
- Clean up the training data.
- Set the model parameters.
- Fit the model.
- Score the model.
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Ensure that the BigQuery API is enabled.
  If you created a new project, the BigQuery API is automatically enabled.
Required permissions
To create and run notebooks, you need the following Identity and Access Management (IAM) roles:
- BigQuery User (roles/bigquery.user)
- Notebook Runtime User (roles/aiplatform.notebookRuntimeUser)
- Code Creator (roles/dataform.codeCreator)
Create a notebook
Follow the instructions in Create a notebook from the BigQuery editor to create a new notebook.
Try BigQuery DataFrames
Try BigQuery DataFrames by following these steps:
- Create a new code cell in the notebook.
- Copy the following code and paste it into the code cell:

    import bigframes.pandas as bpd

    # Set BigQuery DataFrames options
    bpd.options.bigquery.project = your_gcp_project_id
    bpd.options.bigquery.location = "us"

    # Create a DataFrame from a BigQuery table
    query_or_table = "bigquery-public-data.ml_datasets.penguins"
    df = bpd.read_gbq(query_or_table)

    # Use the DataFrame just as you would a pandas DataFrame, but calculations
    # happen in the BigQuery query engine instead of the local system.
    average_body_mass = df["body_mass_g"].mean()
    print(f"average_body_mass: {average_body_mass}")

    # Create the Linear Regression model
    from bigframes.ml.linear_model import LinearRegression

    # Filter down to the data we want to analyze
    adelie_data = df[df.species == "Adelie Penguin (Pygoscelis adeliae)"]

    # Drop the columns we don't care about
    adelie_data = adelie_data.drop(columns=["species"])

    # Drop rows with nulls to get our training data
    training_data = adelie_data.dropna()

    # Pick feature columns and label column
    X = training_data[
        [
            "island",
            "culmen_length_mm",
            "culmen_depth_mm",
            "flipper_length_mm",
            "sex",
        ]
    ]
    y = training_data[["body_mass_g"]]

    model = LinearRegression(fit_intercept=False)
    model.fit(X, y)
    model.score(X, y)

- Modify the bpd.options.bigquery.project = your_gcp_project_id line to specify your project, for example bpd.options.bigquery.project = "myproject".
- Run the code cell.
The code cell returns the average body mass for the penguins in the dataset, and then returns the evaluation metrics for the model.
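
To make the fit and score steps more concrete, the following is a minimal local sketch in plain Python, not the BigQuery DataFrames implementation. It fits a single-feature linear model without an intercept (the same idea as fit_intercept=False) and then computes a few common regression metrics. The data values, the helper names fit_through_origin and regression_metrics, and the metric names in the returned dictionary are assumptions made for illustration; the metrics that model.score reports in your notebook may be named or selected differently.

```python
# Illustrative only: made-up flipper lengths (mm) and body masses (g),
# not values taken from the penguins dataset.
flipper_length_mm = [180.0, 185.0, 190.0, 195.0, 200.0]
body_mass_g = [3500.0, 3700.0, 3800.0, 3900.0, 4100.0]


def fit_through_origin(x, y):
    """Return the least-squares slope for y ~ slope * x (no intercept term)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics (hypothetical helper, assumed names).

    Assumes y_true is not constant, so the total sum of squares is nonzero.
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mean_absolute_error": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "mean_squared_error": ss_res / n,
        "r2_score": 1.0 - ss_res / ss_tot,
    }


# Fit on the toy data, predict on the same rows, and score the predictions,
# mirroring the fit-then-score order of the quickstart code.
slope = fit_through_origin(flipper_length_mm, body_mass_g)
predicted = [slope * x for x in flipper_length_mm]
metrics = regression_metrics(body_mass_g, predicted)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Like the quickstart code, this sketch scores the model on the same rows it was fitted on, so the metrics describe how well the model fits the training data rather than how well it generalizes to unseen data.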
Clean up
The easiest way to eliminate billing is to delete the project that you created for this quickstart.
To delete the project:
- In the Google Cloud console, go to the Manage resources page.
- In the project list, select the project that you want to delete, and then click Delete.
- In the dialog, type the project ID, and then click Shut down to delete the project.
What's next
Try the Getting Started with BigQuery DataFrames notebook.