Synchronize online and offline datasets with BigQuery DataFrames
Using Bigtable with BigQuery, you can build a real-time analytics database and use it in machine learning (ML) pipelines. This lets you keep your data in sync, supporting data manipulation and model development (offline access) and low-latency application serving (online access).
To build your real-time analytics database, you can use BigQuery DataFrames, a set of open-source Python libraries for BigQuery data processing. BigQuery DataFrames lets you develop and train models in BigQuery and automatically replicate a copy of the latest data values used for your ML models in Bigtable for online serving.
This document provides an overview of using the bigframes.streaming API to create BigQuery jobs that automatically replicate and synchronize datasets across BigQuery and Bigtable. Before you read this document, make sure that you are familiar with BigQuery DataFrames and BigQuery continuous queries.
BigQuery DataFrames
BigQuery DataFrames helps you develop and train models in BigQuery and automatically replicate a copy of the latest data values used for your ML models in Bigtable for online serving. It lets you do the following:
- Develop data transformations in a pandas-compatible interface (bigframes.pandas) directly against BigQuery data
- Train models using a scikit-learn-like API (bigframes.ml)
- Synchronize the data needed for low-latency inference with Bigtable (bigframes.streaming) to support user-facing applications
BigFrames StreamingDataFrame
bigframes.streaming.StreamingDataFrame is a DataFrame type in the BigQuery DataFrames package. You can use a StreamingDataFrame object to generate a continuously running job that streams data from a designated BigQuery table into Bigtable for online serving. This is done by generating BigQuery continuous queries.
A BigFrames StreamingDataFrame can do the following:
- Create a StreamingDataFrame from a designated BigQuery table
- Optionally, perform additional pandas operations such as select and filter, and preview the content
- Create and manage streaming jobs to Bigtable
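As an illustrative sketch of the pandas-style operations listed above, the filter and select steps can be collected into a helper that runs before the streaming job is created. The narrow_for_serving helper, the column names, and the threshold are hypothetical, not part of the bigframes API, and the commented bigframes lines require a configured Google Cloud project:

```python
# Sketch: narrow a StreamingDataFrame with pandas-style operations before
# streaming it to Bigtable. The helper and column names are hypothetical.

SERVING_COLUMNS = ["species", "island", "body_mass_g"]  # hypothetical columns

def narrow_for_serving(sdf, columns=SERVING_COLUMNS, min_mass=3000):
    """Apply a pandas-style filter and column selection.

    Works on any object that supports pandas-style indexing, such as a
    bigframes StreamingDataFrame.
    """
    sdf = sdf[sdf["body_mass_g"] > min_mass]  # filter rows
    return sdf[columns]                       # select columns

# With bigframes (not run here; requires a Google Cloud project):
# import bigframes.streaming as bst
# sdf = bst.read_gbq_table("birds.penguins_bigtable_streaming")
# sdf = narrow_for_serving(sdf)
```

Keeping the transformation in one place makes it easy to preview the narrowed frame locally before calling to_bigtable.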
Required roles
To get the permissions that you need to use BigQuery DataFrames in a BigQuery notebook, ask your administrator to grant you the following IAM roles:
To get the permissions that you need to write data to a Bigtable table, ask your administrator to grant you the following IAM roles:
Get started
BigQuery DataFrames is an open-source package. To install the latest version, run pip install --upgrade bigframes.
To create your first BigFrames StreamingDataFrame and synchronize data between BigQuery and Bigtable, run the following code snippet. For the complete code sample, see the GitHub notebook BigFrames StreamingDataFrame.
import bigframes
import bigframes.streaming as bst

bigframes.options.bigquery.project = "PROJECT"

sdf = bst.read_gbq_table("birds.penguins_bigtable_streaming")

job = sdf.to_bigtable(
    instance="BIGTABLE_INSTANCE",
    table="TABLE",
    app_profile=None,
    truncate=True,
    overwrite=True,
    auto_create_column_families=True,
    bigtable_options={},
    job_id=None,
    job_id_prefix="test_streaming_",
)

print(job.running())
print(job.error_result)
Replace the following:
- PROJECT: the ID of your Google Cloud project
- BIGTABLE_INSTANCE: the ID of the Bigtable instance that contains the table you are writing to
- TABLE: the ID of the Bigtable table that you are writing to
Once the job is initialized, it runs as a continuous query in BigQuery and streams any data changes to Bigtable.
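For example, a minimal status check can summarize the values returned by job.running() and job.error_result. The job_status helper below is a sketch and an assumption, not part of the bigframes API:

```python
# Sketch: summarize the state of the continuous streaming job. The
# job_status helper is hypothetical; running() and error_result are the
# job attributes used in the snippet above.

def job_status(running: bool, error_result) -> str:
    """Return a short status string for logging."""
    if error_result is not None:
        return f"failed: {error_result}"
    return "running" if running else "stopped"

# In a live session (not run here):
# print(job_status(job.running(), job.error_result))
```

Because the job runs as a continuous query, it keeps streaming until you stop it; if the returned job follows the BigQuery client library's job interface, job.cancel() stops it.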
Costs
There are no additional charges for using the BigQuery DataFrames API, but you are charged for the underlying resources used for continuous queries, Bigtable, and BigQuery.
Continuous queries use BigQuery capacity compute pricing, which is measured in slots. To run continuous queries, you must have a reservation that uses the Enterprise or Enterprise Plus edition, and a reservation assignment that uses the CONTINUOUS job type.
Usage of other BigQuery resources, such as data ingestion and storage, is charged at the rates shown in BigQuery pricing.
Usage of Bigtable services that receive continuous query results is charged at Bigtable pricing rates.
Limitations
All feature and location limitations associated with continuous queries also apply to streaming DataFrames.
What's next
- Getting started with Feast on Google Cloud
- Streamlining ML Development with Feast
- Query Bigtable data stored in an external table.
- Export data from BigQuery to Bigtable.