Synchronize online and offline datasets with BigQuery DataFrames

Using Bigtable with BigQuery, you can build a real-time analytics database and use it in machine learning (ML) pipelines. This lets you keep your data in sync, supporting data manipulation and model development (offline access) and low-latency application serving (online access).

To build your real-time analytics database, you can use BigQuery DataFrames, a set of open-source Python libraries for BigQuery data processing. BigQuery DataFrames lets you develop and train models in BigQuery and automatically replicate a copy of the latest data values used for your ML models in Bigtable for online serving.

This document provides an overview of using the bigframes.streaming API to create BigQuery jobs that automatically replicate and synchronize datasets across BigQuery and Bigtable. Before you read this document, make sure that you are familiar with BigQuery and Bigtable.

BigQuery DataFrames

BigQuery DataFrames helps you develop and train models in BigQuery and serve the latest data values from Bigtable. It lets you do the following:

  • Develop data transformations in a pandas-compatible interface (bigframes.pandas) directly against BigQuery data
  • Train models by using a scikit-learn-like API (bigframes.ml)
  • Synchronize the data needed for low-latency inference with Bigtable (bigframes.streaming) to support user-facing applications
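As a sketch of what the pandas-compatible interface looks like, the expressions below use plain pandas so the example is self-contained; bigframes.pandas DataFrames accept the same select, filter, and aggregate expressions, executed against BigQuery data instead of local memory. The penguin measurements are hypothetical illustration data.

```python
import pandas as pd

# Hypothetical penguin measurements; a bigframes.pandas DataFrame
# backed by a BigQuery table accepts the same expressions.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Dream"],
    "body_mass_g": [3750, 5500, 3900, 3700],
})

# Select columns and filter rows, pandas-style.
light = df[["species", "body_mass_g"]][df["body_mass_g"] < 4000]

# Aggregate the filtered rows per species.
mean_mass = light.groupby("species")["body_mass_g"].mean()
print(mean_mass)
```

With bigframes.pandas, these transformations are compiled to SQL and pushed down to BigQuery rather than executed locally.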

BigFrames StreamingDataFrame

bigframes.streaming.StreamingDataFrame is a DataFrame type in the BigQuery DataFrames package. You can use a StreamingDataFrame object to generate a continuously running job that streams data from a designated BigQuery table into Bigtable for online serving. This is done by generating BigQuery continuous queries.

A BigFrames StreamingDataFrame can do the following:

  • Create a StreamingDataFrame from a designated BigQuery table
  • Optionally, perform additional pandas-style operations, such as select and filter, and preview the content
  • Create and manage streaming jobs to Bigtable

Required roles

To get the permissions that you need to use BigQuery DataFrames in a BigQuery notebook, ask your administrator to grant you the following IAM roles:

To get the permissions that you need to write data to a Bigtable table, ask your administrator to grant you the following IAM roles:

Get started

BigQuery DataFrames is an open-source package. To install the latest version, run pip install --upgrade bigframes.

To create your first BigFrames StreamingDataFrame and synchronize data between BigQuery and Bigtable, run the following code snippet. For the complete code sample, see the GitHub notebook BigFrames StreamingDataFrame.

  import bigframes
  import bigframes.streaming as bst

  bigframes.options.bigquery.project = "PROJECT"

  sdf = bst.read_gbq_table("birds.penguins_bigtable_streaming")

  job = sdf.to_bigtable(
      instance="BIGTABLE_INSTANCE",
      table="TABLE",
      app_profile=None,
      truncate=True,
      overwrite=True,
      auto_create_column_families=True,
      bigtable_options={},
      job_id=None,
      job_id_prefix="test_streaming_",
  )

  print(job.running())
  print(job.error_result)

Replace the following:

  • PROJECT: the ID of your Google Cloud project
  • BIGTABLE_INSTANCE: the ID of the Bigtable instance that contains the table you are writing to
  • TABLE: the ID of the Bigtable table that you are writing to

Once the job is initialized, it runs as a continuous query in BigQuery and streams any data changes to Bigtable.

Costs

There are no additional charges for using the BigQuery DataFrames API, but you are charged for the underlying resources used by continuous queries, Bigtable, and BigQuery.

Continuous queries use BigQuery capacity compute pricing, which is measured in slots. To run continuous queries, you must have a reservation that uses the Enterprise or Enterprise Plus edition and a reservation assignment that uses the CONTINUOUS job type.

Usage of other BigQuery resources, such as data ingestion and storage, is charged at the rates shown in BigQuery pricing.

Usage of Bigtable services that receive continuous query results is charged at Bigtable pricing rates.

Limitations

All feature and location limitations that apply to continuous queries also apply to streaming DataFrames.

What's next