Use BigQuery DataFrames

For support during the preview, email bigframes-feedback@google.com.

This document describes how you can use BigQuery DataFrames to analyze and manipulate data in a BigQuery notebook.

BigQuery DataFrames is a Python client library that you can use to analyze data and perform machine learning tasks in BigQuery notebooks.

BigQuery DataFrames consists of the following parts:

bigframes.pandas
A pandas-like API that you can use to analyze and manipulate data in BigQuery.
bigframes.ml
A scikit-learn-like API that you can use to create and apply machine learning models in BigQuery.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Google Cloud project.

  4. Ensure that the BigQuery API is enabled.

    Enable the API

    If you created a new project, the BigQuery API is automatically enabled.

Required permissions

To use BigQuery DataFrames in a BigQuery notebook, you need the following Identity and Access Management (IAM) roles:

Create a notebook

Follow the instructions in Create a notebook from the BigQuery editor to create a new notebook.

Set up BigQuery DataFrames options

After installation, you need to specify the location and project in which you want to use BigQuery DataFrames.

You can define the location and project in your notebook in the following way:

import bigframes.pandas as bpd

PROJECT_ID = "bigframes-dec"  # @param {type:"string"}
REGION = "US"  # @param {type:"string"}

# Set BigQuery DataFrames options
bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = REGION

Use bigframes.pandas

The bigframes.pandas API is a pandas-like API that you can use to analyze and manipulate data in BigQuery. It scales to support processing terabytes of BigQuery data, and it uses the BigQuery query engine to perform calculations.

The bigframes.pandas API provides the following capabilities:

Input and output
You can access data from a variety of sources including local CSV files, Cloud Storage files, pandas DataFrames, BigQuery models, and BigQuery functions, and load it into a BigQuery DataFrame. You can also create BigQuery tables from BigQuery DataFrames.
Data manipulation
You can use Python instead of SQL for your development. You can develop all BigQuery data manipulations in Python, eliminating the need to switch between languages or to capture SQL statements as text strings. The bigframes.pandas API offers more than 250 pandas functions.
Python ecosystem and visualizations
The bigframes.pandas API is a gateway to the full Python ecosystem of tools. The API supports advanced statistical operations, and you can visualize the aggregations generated from BigQuery DataFrames. You can also switch from a BigQuery DataFrame to a pandas DataFrame with built-in sampling operations.
Custom Python functions
You can use custom Python functions and packages. With bigframes.pandas, you can deploy remote functions that run scalar Python functions at BigQuery scale. You can persist these functions back to BigQuery as SQL routines and use them like SQL functions, as shown in the sketch after this list.
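
For example, the following is a minimal sketch of a remote function. It assumes that a BigQuery Cloud resource connection named bigframes-rf-conn already exists in your project and that the supporting services (such as Cloud Functions) are enabled; the function and bucket names are illustrative only.

import bigframes.pandas as bpd

# Deploy a scalar Python function as a BigQuery remote function.
# This assumes an existing BigQuery connection named "bigframes-rf-conn".
@bpd.remote_function([float], str, bigquery_connection="bigframes-rf-conn")
def get_bucket(num):
    if not num:
        return "NA"
    boundary = 4000
    return "at_or_above_4000" if num >= boundary else "below_4000"

# Apply the remote function to a column, like any other function
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
bq_df = bq_df.assign(body_mass_bucket=bq_df["body_mass_g"].apply(get_bucket))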

Load data from a BigQuery table or query

You can create a DataFrame from a BigQuery table or query in the following way:

# Create a DataFrame from a BigQuery table:
import bigframes.pandas as bpd

query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)
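
Because read_gbq also accepts a SQL query string, you can load the result of a query in the same way. A minimal sketch against the same public dataset:

import bigframes.pandas as bpd

# Create a DataFrame from a SQL query instead of a table reference
sql = """
SELECT species, island, body_mass_g
FROM `bigquery-public-data.ml_datasets.penguins`
WHERE body_mass_g < 3800
"""
bq_df_from_query = bpd.read_gbq(sql)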

Load data from a CSV file

You can create a DataFrame from a local or Cloud Storage CSV file in the following way:

import bigframes.pandas as bpd

filepath_or_buffer = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
df_from_gcs = bpd.read_csv(filepath_or_buffer)
# Display the first few rows of the DataFrame:
df_from_gcs.head()
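
You can also create a BigQuery DataFrame from an in-memory pandas DataFrame, and write a DataFrame back to a BigQuery table. A minimal sketch, where the sample data and the your_dataset.us_states destination are placeholders:

import pandas as pd
import bigframes.pandas as bpd

# Create a BigQuery DataFrame from an in-memory pandas DataFrame
pandas_df = pd.DataFrame({"state": ["WA", "OR"], "population": [7_700_000, 4_200_000]})
df_from_pandas = bpd.read_pandas(pandas_df)

# Write the DataFrame to a BigQuery table; "your_dataset.us_states" is a
# placeholder destination in your project
df_from_pandas.to_gbq("your_dataset.us_states", if_exists="replace")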

Inspect and manipulate data

You can use bigframes.pandas to perform data inspection and calculation operations.

The following code sample shows how to use bigframes.pandas to inspect the body_mass_g column, calculate the mean body_mass_g, and calculate the mean body_mass_g by species:

import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Inspect one of the columns (or series) of the DataFrame:
bq_df["body_mass_g"]

# Compute the mean of this series:
average_body_mass = bq_df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

# Find the heaviest species using the groupby operation to calculate the
# mean body_mass_g:
(
    bq_df["body_mass_g"]
    .groupby(by=bq_df["species"])
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
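
To hand results off to local Python tools such as matplotlib, you can materialize a BigQuery DataFrames object as a pandas object with to_pandas. A minimal sketch that continues the example above:

# Materialize the aggregation as an in-memory pandas Series so that it can
# be used with local visualization libraries
mean_mass_by_species = (
    bq_df["body_mass_g"].groupby(by=bq_df["species"]).mean().to_pandas()
)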

Use bigframes.ml

The bigframes.ml module provides a scikit-learn-like API that lets you create several types of machine learning models.

Regression

The following code sample shows how to use bigframes.ml to load data from BigQuery, clean and prepare the training data, and create and apply a linear regression model:

from bigframes.ml.linear_model import LinearRegression
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Filter the data down to the Adelie Penguin species
adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Specify your feature (or input) columns and the label (or output) column:
feature_columns = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
label_columns = training_data[["body_mass_g"]]

# Use the rows with a missing body_mass_g value as the prediction (test) set
test_data = adelie_data[adelie_data.body_mass_g.isnull()]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

# Score the model
score = model.score(feature_columns, label_columns)

# Predict using the model
result = model.predict(test_data)
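
To reuse the trained model later, you can save it to BigQuery with to_gbq. A minimal sketch, where your_dataset.penguins_model is a placeholder for a dataset and model name in your project:

# Save the trained model to BigQuery so that it can be reloaded or used
# from BigQuery ML SQL; the destination name is a placeholder
model.to_gbq("your_dataset.penguins_model", replace=True)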

Clustering

You can use the bigframes.ml.cluster module to create estimators for clustering models.

The following code sample shows how to use the bigframes.ml.cluster KMeans class to create a K-means clustering model for data segmentation:

from bigframes.ml.cluster import KMeans
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create the KMeans model
cluster_model = KMeans(n_clusters=10)
cluster_model.fit(bq_df["culmen_length_mm"], bq_df["sex"])

# Predict using the model
result = cluster_model.predict(bq_df)
# Score the model
score = cluster_model.score(bq_df)
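
The prediction output assigns each input row to a cluster. In BigQuery ML's K-means prediction output, the assignment is returned in a CENTROID_ID column, so a brief inspection sketch (treat the column name as an assumption here) looks like this:

# Each row's assigned cluster is returned in the CENTROID_ID column
# (per BigQuery ML's K-means prediction output)
result[["CENTROID_ID"]].head()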

LLM remote models

You can use the bigframes.ml.llm module to create estimators for remote large language models (LLMs).

The following code sample shows how to use the bigframes.ml.llm PaLM2TextGenerator class to create a PaLM2 text generator model for text generation:

from bigframes.ml.llm import PaLM2TextGenerator
import bigframes.pandas as bpd

# Create the LLM model
# PROJECT_ID and REGION are defined earlier in the notebook; CONN_NAME is the
# name of an existing BigQuery Cloud resource connection in your project
session = bpd.get_global_session()
connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
model = PaLM2TextGenerator(session=session, connection_name=connection)

df_api = bpd.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")

# Prepare the prompts and send them to the LLM model for prediction
df_prompt_prefix = "Generate Pandas sample code for DataFrame."
df_prompt = df_prompt_prefix + df_api["API"]

# Predict using the model
df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)
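
To inspect the generated text, you can preview the prediction output in the same way as any other DataFrame:

# Display the first few predictions; the prediction output includes a column
# with the generated text for each prompt
df_pred.head()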

What's next