Use BigQuery DataFrames

BigQuery DataFrames provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine. BigQuery DataFrames is an open-source package. You can run pip install --upgrade bigframes to install the latest version.

BigQuery DataFrames provides two libraries:

  • bigframes.pandas, which provides a pandas-compatible API for analytics.

  • bigframes.ml, which provides a scikit-learn-like API for machine learning (ML).

Required permissions

Options

After installation, you need to specify the location and project in which you want to use BigQuery DataFrames. You can define the location and project in your notebook in the following way:
import bigframes.pandas as bpd

PROJECT_ID = "bigframes-dec"  # @param {type:"string"}
REGION = "US"  # @param {type:"string"}

# Set BigQuery DataFrames options
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION

If bpd.options.bigquery.project is not set, the $GOOGLE_CLOUD_PROJECT environment variable is used. This variable is set in the notebook runtime serving BigQuery Studio and Vertex AI notebooks.

Data processing location

BigQuery DataFrames is designed for scale, which it achieves by keeping data and processing on the BigQuery service. However, you can bring data into the memory of your client machine by calling .to_pandas() on a DataFrame or Series object. If you choose to do this, the memory limitation of your client machine applies.
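
For example, the following sketch (using the public penguins table) keeps the computation in BigQuery and only materializes a small result on the client:

import bigframes.pandas as bpd

# Computation stays in BigQuery until results are materialized on the client.
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# to_pandas() downloads the result into client memory as a regular pandas DataFrame.
local_df = bq_df.head(100).to_pandas()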

Session location

BigQuery DataFrames uses a local session object internally to manage metadata. This session is tied to a location. BigQuery DataFrames uses the US multi-region as the default location, but you can use session_options.location to set a different location. Every query in a session is executed in the location where the session was created. If you start with read_gbq(), read_gbq_table(), or read_gbq_query() and specify a table, either directly or in a SQL statement, BigQuery DataFrames auto-populates bpd.options.bigquery.location with the location of that table.

If you want to reset the location of the created DataFrame or Series objects, you can close the session by executing bigframes.pandas.close_session(). After that, you can reuse bigframes.pandas.options.bigquery.location to specify another location.
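
For example, a minimal sketch of closing the session and choosing a new location (the location value shown is only an example):

import bigframes.pandas as bpd

# Close the current session; subsequent operations start a new session.
bpd.close_session()

# Choose a different location for the new session.
bpd.options.bigquery.location = "europe-west1"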

read_gbq() requires you to specify a location if the dataset you are querying is not in the US multi-region. If you try to read a table from another location, you get a NotFound exception.

Data types

BigQuery DataFrames supports the following numpy and pandas dtypes:

BigQuery      BigQuery DataFrames and pandas
BOOL          pandas.BooleanDtype()
DATE          pandas.ArrowDtype(pa.date32())
DATETIME      pandas.ArrowDtype(pa.timestamp("us"))
FLOAT64       pandas.Float64Dtype()
GEOGRAPHY     geopandas.array.GeometryDtype() (supported by to_pandas() only)
INT64         pandas.Int64Dtype()
STRING        pandas.StringDtype(storage="pyarrow")
STRUCT        pandas.ArrowDtype(pa.struct())
TIME          pandas.ArrowDtype(pa.time64("us"))
TIMESTAMP     pandas.ArrowDtype(pa.timestamp("us", tz="UTC"))

BigQuery DataFrames doesn't support the following BigQuery data types:

  • ARRAY

  • NUMERIC

  • BIGNUMERIC

  • INTERVAL

  • RANGE

  • JSON

All other BigQuery data types display as the object type.
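
For example, you can inspect how the columns of a table map to these dtypes:

import bigframes.pandas as bpd

# Load a public table and list the resulting BigQuery DataFrames dtypes.
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
print(bq_df.dtypes)  # For example, STRING columns appear as string[pyarrow], FLOAT64 as Float64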

Using the bigframes.pandas library

The bigframes.pandas library provides a pandas-like API that you can use to analyze and manipulate data in BigQuery. The bigframes.pandas API is scalable to support processing terabytes of BigQuery data, and uses the BigQuery query engine to perform calculations.
The bigframes.pandas API provides the following capabilities:

Input and output

You can access data from a variety of sources including local CSV files, Cloud Storage files, pandas DataFrames, BigQuery models, and BigQuery functions, and load it into a BigQuery DataFrames DataFrame. You can also create BigQuery tables from BigQuery DataFrames.

Data manipulation

You can use Python instead of SQL for your development. You can develop all BigQuery data manipulations in Python, eliminating the need to switch between languages and try to capture SQL statements as text strings. The bigframes.pandas API offers over 750 pandas functions.

Python ecosystem and visualizations

The bigframes.pandas API is a gateway to the full Python ecosystem of tools. The API supports advanced statistical operations, and you can visualize the aggregations generated from BigQuery DataFrames. You can also switch from a BigQuery DataFrames DataFrame to a pandas DataFrame with built-in sampling operations.

Custom Python functions

BigQuery DataFrames gives you the ability to turn your custom scalar functions into BigQuery remote functions. Creating a remote function in BigQuery DataFrames creates:

  1. A Cloud Functions (2nd gen) function.

  2. A BigQuery connection. By default, a connection named bigframes-default-connection is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped.

    The service account for the default connection is granted the Cloud Run Invoker (roles/run.invoker) IAM role.

  3. A BigQuery remote function that uses the Cloud Functions function (1) by using the BigQuery connection (2).

For an example, see Create a remote function.
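
The following is a minimal sketch of the pattern; the exact remote_function decorator arguments vary by bigframes version, so check the reference for the version you have installed:

import bigframes.pandas as bpd

# Turn a custom scalar Python function into a BigQuery remote function.
# The decorator deploys a Cloud Functions (2nd gen) function and creates the
# BigQuery connection and remote function described above.
@bpd.remote_function([float], str, reuse=False)
def bucket_body_mass(mass):
    # Hypothetical bucketing logic, for illustration only.
    if mass is None:
        return "unknown"
    return "heavy" if mass >= 4000 else "light"

bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")

# Apply the remote function to a Series; the work runs in BigQuery.
bq_df["mass_bucket"] = bq_df["body_mass_g"].apply(bucket_body_mass)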

BigQuery connections are created in the same location as the BigQuery DataFrames session, using the name you provide in the custom function definition. To view and manage connections, do the following:

  1. Go to BigQuery in the Google Cloud console.

  2. Select the project in which you created the remote function.

  3. In the Explorer pane, expand that project and then expand External connections.

BigQuery remote functions are created in the dataset you specify, or in a special type of hidden dataset referred to in code as an anonymous dataset. To view and manage remote functions created in a user-provided dataset, do the following:

  1. Go to BigQuery in the Google Cloud console.

  2. Select the project in which you created the remote function.

  3. In the Explorer pane, expand that project, expand the dataset in which you created the remote function, and then expand Routines.

To view and manage Cloud Functions functions, go to the Functions page and use the project picker to select the project in which you created the function. For easy identification, the names of the functions created by BigQuery DataFrames are prefixed by bigframes.

Requirements

To use BigQuery DataFrames remote functions, you must enable the following APIs:

To use BigQuery DataFrames remote functions, you must be granted the following IAM roles in the project:

Limitations

  • Remote functions take about 90 seconds to become available when you first create them.

  • Trivial changes in the notebook, such as inserting a new cell or renaming a variable, might cause the remote function to be re-created, even if these changes are unrelated to the remote function code.

  • BigQuery DataFrames does not differentiate any personal data you include in the remote function code. The remote function code is serialized as an opaque box to deploy it as a Cloud Functions function.

  • The Cloud Functions (2nd gen) functions, BigQuery connections, and BigQuery remote functions created by BigQuery DataFrames persist in Google Cloud. If you don't want to keep these resources, you must delete them separately using an appropriate Cloud Functions or BigQuery interface.

  • A project can have up to 1000 Cloud Functions (2nd gen) functions at a time. See Cloud Functions quotas for all the limits.

bigframes.pandas examples

The following examples show common ways of using bigframes.pandas.

Load data from a BigQuery table or query

You can create a DataFrame from a BigQuery table or query in the following way:

# Create a DataFrame from a BigQuery table:
import bigframes.pandas as bpd

query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

Load data from a CSV file

You can create a DataFrame from a local or Cloud Storage CSV file in the following way:

import bigframes.pandas as bpd

filepath_or_buffer = "gs://cloud-samples-data/bigquery/us-states/us-states.csv"
df_from_gcs = bpd.read_csv(filepath_or_buffer)
# Display the first few rows of the DataFrame:
df_from_gcs.head()

Inspect and manipulate data

You can use bigframes.pandas to perform data inspection and calculation operations.
The following code sample shows using bigframes.pandas to inspect the body_mass_g column, calculate the mean body_mass_g, and calculate the mean body_mass_g by species:

import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Inspect one of the columns (or series) of the DataFrame:
bq_df["body_mass_g"]

# Compute the mean of this series:
average_body_mass = bq_df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

# Find the heaviest species using the groupby operation to calculate the
# mean body_mass_g:
(
    bq_df["body_mass_g"]
    .groupby(by=bq_df["species"])
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

Using the bigframes.ml library

The ML capabilities in BigQuery DataFrames let you preprocess data, and then train models on that data. You can also chain these actions together to create data pipelines.

ML locations

bigframes.ml supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.

Preprocess data

Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations (one of them is shown in a sketch after this list):

  • Use the KBinsDiscretizer class in the bigframes.ml.preprocessing module to bin continuous data into intervals.

  • Use the LabelEncoder class in the bigframes.ml.preprocessing module to normalize the target labels as integer values.

  • Use the MaxAbsScaler class in the bigframes.ml.preprocessing module to scale each feature to the range [-1, 1] by its maximum absolute value.

  • Use the MinMaxScaler class in the bigframes.ml.preprocessing module to standardize features by scaling each feature to the range [0, 1].

  • Use the StandardScaler class in the bigframes.ml.preprocessing module to standardize features by removing the mean and scaling to unit variance.

  • Use the OneHotEncoder class in the bigframes.ml.preprocessing module to transform categorical values into numeric format.

  • Use the ColumnTransformer class in the bigframes.ml.compose module to apply transformers to DataFrames columns.
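
The following sketch shows one of these transformers, StandardScaler, applied to the numeric penguin measurements:

from bigframes.ml.preprocessing import StandardScaler
import bigframes.pandas as bpd

# Load data from BigQuery and keep only numeric feature columns.
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
features = bq_df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]].dropna()

# Standardize features by removing the mean and scaling to unit variance.
scaler = StandardScaler()
scaler.fit(features)
scaled_features = scaler.transform(features)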

Train models

Create estimators to train models in BigQuery DataFrames.

Clustering models

Create estimators for clustering models by using the bigframes.ml.cluster module.

  • Use the KMeans class to create K-means clustering models. Use these models for data segmentation. For example, identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or split data for training or evaluation.


The following code sample shows using the bigframes.ml.cluster KMeans class to create a k-means clustering model for data segmentation:

from bigframes.ml.cluster import KMeans
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Create the KMeans model
cluster_model = KMeans(n_clusters=10)
cluster_model.fit(bq_df["culmen_length_mm"], bq_df["sex"])

# Predict using the model
result = cluster_model.predict(bq_df)
# Score the model
score = cluster_model.score(bq_df)

Decomposition models

Create estimators for decomposition models by using the bigframes.ml.decomposition module.

  • Use the PCA class to create principal component analysis (PCA) models. Use these models for computing principal components and using them to perform a change of basis on the data. This provides dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.
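
For example, a minimal sketch that reduces the numeric penguin measurements to two principal components:

from bigframes.ml.decomposition import PCA
import bigframes.pandas as bpd

# Load data from BigQuery and keep only numeric feature columns.
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins")
features = bq_df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]].dropna()

# Fit a PCA model and project the data onto the first two principal components.
pca_model = PCA(n_components=2)
pca_model.fit(features)
components = pca_model.predict(features)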

Ensemble models

Create estimators for ensemble models by using the bigframes.ml.ensemble module; a sketch follows the list.

  • Use the RandomForestClassifier class to create random forest classifier models. Use these models to construct an ensemble of decision trees for classification.

  • Use the RandomForestRegressor class to create random forest regression models. Use these models to construct an ensemble of decision trees for regression.

  • Use the XGBClassifier class to create gradient boosted tree classifier models. Use these models to additively construct an ensemble of decision trees for classification.

  • Use the XGBRegressor class to create gradient boosted tree regression models. Use these models to additively construct an ensemble of decision trees for regression.
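
The following sketch trains a gradient boosted tree regressor (XGBRegressor) to predict penguin body mass:

from bigframes.ml.ensemble import XGBRegressor
import bigframes.pandas as bpd

# Load data from BigQuery and drop rows with missing values.
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()

# Features and label for the regression task.
X = bq_df[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm"]]
y = bq_df[["body_mass_g"]]

# Train the model and generate predictions.
model = XGBRegressor()
model.fit(X, y)
predictions = model.predict(X)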

Forecasting models

Create estimators for forecasting models by using the bigframes.ml.forecasting module.
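
For example, a minimal sketch using the ARIMAPlus class, assuming a hypothetical table with a date column and a numeric column to forecast:

from bigframes.ml.forecasting import ARIMAPlus
import bigframes.pandas as bpd

# Hypothetical table with one timestamp column ("date") and one metric ("sales").
df = bpd.read_gbq("your-project.your_dataset.daily_sales")

# Fit the time series model: X is the time column, y is the value to forecast.
model = ARIMAPlus()
model.fit(df[["date"]], df[["sales"]])

# Forecast the next 30 periods.
forecast = model.predict(horizon=30)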

Imported models

Create estimators for imported models by using the bigframes.ml.imported module.
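
For example, a minimal sketch using the TensorFlowModel class, assuming a TensorFlow SavedModel exported to a hypothetical Cloud Storage path:

from bigframes.ml.imported import TensorFlowModel
import bigframes.pandas as bpd

# Hypothetical input table whose columns match the model's expected inputs.
input_df = bpd.read_gbq("your-project.your_dataset.model_inputs")

# Import the saved TensorFlow model from Cloud Storage and run prediction.
model = TensorFlowModel(model_path="gs://your-bucket/your-tf-model")
predictions = model.predict(input_df)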

Linear models

Create estimators for linear models by using the bigframes.ml.linear_model module.

  • Use the LinearRegression class to create linear regression models. Use these models for forecasting. For example, forecasting the sales of an item on a given day.

  • Use the LogisticRegression class to create logistic regression models. Use these models for the classification of two or more possible values such as whether an input is low-value, medium-value, or high-value.

The following code sample shows using bigframes.ml to load data from BigQuery, prepare the training data, and create, score, and apply a linear regression model:

from bigframes.ml.linear_model import LinearRegression
import bigframes.pandas as bpd

# Load data from BigQuery
query_or_table = "bigquery-public-data.ml_datasets.penguins"
bq_df = bpd.read_gbq(query_or_table)

# Filter the data down to the Adelie Penguin species
adelie_data = bq_df[bq_df.species == "Adelie Penguin (Pygoscelis adeliae)"]

# Drop the species column
adelie_data = adelie_data.drop(columns=["species"])

# Drop rows with nulls to get training data
training_data = adelie_data.dropna()

# Specify your feature (or input) columns and the label (or output) column:
feature_columns = training_data[
    ["island", "culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "sex"]
]
label_columns = training_data[["body_mass_g"]]

test_data = adelie_data[adelie_data.body_mass_g.isnull()]

# Create the linear model
model = LinearRegression()
model.fit(feature_columns, label_columns)

# Score the model
score = model.score(feature_columns, label_columns)

# Predict using the model
result = model.predict(test_data)

Large language models

Create estimators for LLMs by using the bigframes.ml.llm module.

The following code sample shows using the bigframes.ml.llm GeminiTextGenerator class to create a Gemini model for code generation:

from bigframes.ml.llm import GeminiTextGenerator
import bigframes.pandas as bpd

# Create the Gemini LLM model
session = bpd.get_global_session()
connection = f"{PROJECT_ID}.{REGION}.{CONN_NAME}"
model = GeminiTextGenerator(session=session, connection_name=connection)

df_api = bpd.read_csv("gs://cloud-samples-data/vertex-ai/bigframe/df.csv")

# Prepare the prompts and send them to the LLM model for prediction
df_prompt_prefix = "Generate Pandas sample code for DataFrame."
df_prompt = df_prompt_prefix + df_api["API"]

# Predict using the model
df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)

Remote models

To use BigQuery DataFrames ML remote models (bigframes.ml.remote or bigframes.ml.llm), you must enable the following APIs:

You must also be granted the following IAM roles in the project:

  • BigQuery Connection Admin (roles/bigquery.connectionAdmin)
  • Project IAM Admin (roles/resourcemanager.projectIamAdmin) if you use the default BigQuery connection, or Browser (roles/browser) if you use a pre-configured connection. You can avoid this requirement by setting the bigframes.pandas.options.bigquery.skip_bq_connection_check option to True, in which case the connection (default or pre-configured) is used as-is without any existence or permission check. If you use a pre-configured connection and skip the connection check, make sure that the connection is created in the right location and that its service account has the Vertex AI User (roles/aiplatform.user) role on the project.

Creating a remote model in BigQuery DataFrames creates a BigQuery connection. By default, a connection named bigframes-default-connection is used. You can use a pre-configured BigQuery connection if you prefer, in which case the connection creation is skipped. The service account for the default connection is granted the Vertex AI User (roles/aiplatform.user) IAM role.

Create pipelines

Create ML pipelines by using the bigframes.ml.pipeline module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code, and allows you to deploy data preprocessing steps and an estimator together.

Use the Pipeline class to create a pipeline of transforms with a final estimator.
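
For example, a minimal sketch that combines preprocessing with a linear regression estimator in a single pipeline:

from bigframes.ml.compose import ColumnTransformer
from bigframes.ml.linear_model import LinearRegression
from bigframes.ml.pipeline import Pipeline
from bigframes.ml.preprocessing import OneHotEncoder, StandardScaler
import bigframes.pandas as bpd

# Load data from BigQuery and drop rows with missing values.
bq_df = bpd.read_gbq("bigquery-public-data.ml_datasets.penguins").dropna()
X = bq_df[["island", "culmen_length_mm", "flipper_length_mm"]]
y = bq_df[["body_mass_g"]]

# Encode the categorical column and scale the numeric columns.
preprocessor = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(), "island"),
        ("scale", StandardScaler(), ["culmen_length_mm", "flipper_length_mm"]),
    ]
)

# Chain preprocessing and the estimator, then fit and predict.
pipeline = Pipeline([("preprocess", preprocessor), ("model", LinearRegression())])
pipeline.fit(X, y)
predictions = pipeline.predict(X)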

What's next