Stream table updates with change data capture
BigQuery change data capture (CDC) updates your BigQuery tables by processing and applying streamed changes to existing data. This synchronization is accomplished through upsert and delete row operations that are streamed in real time by the BigQuery Storage Write API, which you should be familiar with before proceeding.
Before you begin
Grant Identity and Access Management (IAM) roles that give users the necessary permissions to perform each task in this document, and ensure that your workflow meets each prerequisite.
Required permissions
To get the permission that you need to use the Storage Write API,
ask your administrator to grant you the
BigQuery Data Editor (roles/bigquery.dataEditor
) IAM role.
For more information about granting roles, see Manage access.
This predefined role contains the
bigquery.tables.updateData
permission, which is
required to use the Storage Write API.
You might also be able to get this permission with custom roles or other predefined roles.
For more information about IAM roles and permissions in BigQuery, see Introduction to IAM.
Prerequisites
To use BigQuery CDC, your workflow must meet the following conditions:
- You must use the Storage Write API in the default stream.
- You must declare primary keys for the destination table in BigQuery. Composite primary keys containing up to 16 columns are supported.
- Sufficient BigQuery compute resources must be available to perform the CDC row operations. Be aware that if CDC row modification operations fail, you might unintentionally retain data that you intended to delete. For more information, see Deleted data considerations.
Specify changes to existing records
In BigQuery CDC, the pseudo-column _CHANGE_TYPE
indicates the
type of change to be processed for each row. To use CDC, set _CHANGE_TYPE
when
you stream row modifications using the Storage Write API. The
pseudo-column _CHANGE_TYPE
only accepts the values UPSERT
and DELETE
.
A table is considered CDC-enabled while the Storage Write API is
streaming row modifications to the table in this manner.
Example with UPSERT
and DELETE
values
Consider the following table in BigQuery:
ID | Name | Salary |
---|---|---|
100 | Bill | 2000 |
101 | Lucy | 3000 |
102 | Ethan | 5000 |
The following row modifications are streamed by the Storage Write API:
ID | Name | Salary | _CHANGE_TYPE |
---|---|---|---|
100 | DELETE | ||
101 | Lucy | 8000 | UPSERT |
105 | Max | 6000 | UPSERT |
The updated table is now the following:
ID | Name | Salary |
---|---|---|
101 | Lucy | 8000 |
102 | Ethan | 5000 |
105 | Max | 6000 |
Manage table staleness
By default, every time you run a query, BigQuery returns the most
up-to-date results. To provide the freshest results when querying a CDC-enabled
table, BigQuery must apply each streamed row modification up to
the query start time, so that the most up-to-date version of the table is being
queried. Applying these row modifications at query run time increases query
latency and cost. However, if you do not require fully up-to-date query results,
you can reduce cost and latency on your queries by setting the max_staleness
option on your table. When this option is set, BigQuery applies
row modifications at least once within the interval defined by the
max_staleness
value, letting you run queries without waiting for updates to be
applied, at the cost of some data staleness.
This behavior is especially useful for dashboards and reports for which data freshness isn't essential. It is also helpful for cost management by giving you more control over how frequently BigQuery applies row modifications.
Query tables with the max_staleness
option set
When you query a table with the max_staleness
option set,
BigQuery returns the result based on the value of max_staleness
and the time at which the last apply job occurred, which is represented by
the table's upsert_stream_apply_watermark
timestamp.
Consider the following example, in which a table has the max_staleness
option
set to 10 minutes, and the most recent apply job occurred at T20:
If you query the table at T25, then the current version of the table is 5
minutes stale, which is less than the max_staleness
interval of 10 minutes. In
this case, BigQuery returns the version of the table at T20,
meaning the data returned is also 5 minutes stale.
When you set the max_staleness
option on your table, BigQuery
applies pending row modifications at least once within the max_staleness
interval. In some cases, however, BigQuery might not complete the
process of applying these pending row modifications within the interval.
For example, if you query the table at T35, and the process of applying pending
row modifications has not completed, then the current version of the table is 15
minutes stale, which is greater than the max_staleness
interval of 10 minutes.
In this case, at query run time, BigQuery applies all row
modifications between T20 and T35, meaning the data is completely up to date, at
the cost of some additional query latency. This is considered a runtime merge
job.
Recommended table max_staleness
value
A table's max_staleness
value should generally be the higher of the following
two values:
- The maximum tolerable data staleness for your workflow.
- Twice the maximum time it takes to apply upserted changes into your table, plus some additional buffer.
To calculate the time it takes to apply upserted changes to an existing table, use the following SQL query to determine the 95th percentile duration of background apply jobs, plus a seven-minute buffer to allow for the BigQuery write-optimized storage (streaming buffer) conversion.
SELECT project_id, destination_table.dataset_id, destination_table.table_id, APPROX_QUANTILES((TIMESTAMP_DIFF(end_time, creation_time,MILLISECOND)/1000), 100)[OFFSET(95)] AS p95_background_apply_duration_in_seconds, CEILING(APPROX_QUANTILES((TIMESTAMP_DIFF(end_time, creation_time,MILLISECOND)/1000), 100)[OFFSET(95)]*2/60)+7 AS recommended_max_staleness_with_buffer_in_minutes FROM `region-us`.INFORMATION_SCHEMA.JOBS AS job WHERE project_id = 'PROJECT_ID' AND DATE(creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE() AND job_id LIKE "%cdc_background%" GROUP BY 1,2,3;
Replace PROJECT_ID
with the ID of the project containing
the BigQuery tables that are being modified by
BigQuery CDC.
The duration of background apply jobs is affected by several factors including the number and complexity of CDC operations issued within the staleness interval, the table size, and BigQuery resource availability. For more information about resource availability, see Size and monitor BACKGROUND reservations.
Create a table with the max_staleness
option
To create a table with the max_staleness
option, use the
CREATE TABLE
statement.
The following example creates the table employees
with a max_staleness
limit
of 10 minutes:
CREATE TABLE employees ( id INT64 PRIMARY KEY NOT ENFORCED, name STRING) CLUSTER BY id OPTIONS ( max_staleness = INTERVAL 10 MINUTE);
Modify the max_staleness
option for an existing table
To add or modify a max_staleness
limit in an existing table, use the
ALTER TABLE
statement.
The following example changes the max_staleness
limit of the employees
table
to 15 minutes:
ALTER TABLE employees SET OPTIONS ( max_staleness = INTERVAL 15 MINUTE);
Determine the current max_staleness
value of a table
To determine the current max_staleness
value of a table, query the
INFORMATION_SCHEMA.TABLE_OPTIONS
view.
The following example checks the current max_staleness
value of the table
mytable
:
SELECT option_name, option_value FROM DATASET_NAME.INFORMATION_SCHEMA.TABLE_OPTIONS WHERE option_name = 'max_staleness' AND table_name = 'TABLE_NAME';
Replace the following:
DATASET_NAME
: the name of the dataset in which the CDC-enabled table resides.TABLE_NAME
: the name of the CDC-enabled table.
The results show that the max_staleness
value is 10 minutes:
+---------------------+--------------+ | Row | option_name | option_value | +---------------------+--------------+ | 1 | max_staleness | 0-0 0 0:10:0 | +---------------------+--------------+
Monitor table upsert operation progress
To monitor the state of a table and to check when row modifications were
last applied, query the
INFORMATION_SCHEMA.TABLES
view
to get the upsert_stream_apply_watermark
timestamp.
The following example checks the upsert_stream_apply_watermark
value of
the table mytable
:
SELECT upsert_stream_apply_watermark FROM DATASET_NAME.INFORMATION_SCHEMA.TABLES WHERE table_name = 'TABLE_NAME';
Replace the following:
DATASET_NAME
: the name of the dataset in which the CDC-enabled table resides.TABLE_NAME
: the name of the CDC-enabled table.
The result is similar to the following:
[{ "upsert_stream_apply_watermark": "2022-09-15T04:17:19.909Z" }]
Upsert operations are performed by the bigquery-adminbot@system.gserviceaccount.com
service account and appear within the job history of the project containing the
CDC-enabled table.
Configure a BigQuery reservation for use with CDC
You can use BigQuery reservations to allocate dedicated BigQuery compute resources for CDC row modification operations. Reservations let you set a cap on the cost of performing these operations. This approach is particularly useful for workflows with frequent CDC operations against large tables, which otherwise would have high on-demand costs due to the large number of bytes processed when performing each operation.
BigQuery CDC jobs that apply pending row modifications within
the max_staleness
interval are considered background jobs and leverage the
BACKGROUND
assignment type,
rather than the QUERY
assignment type.
In contrast, queries outside of the max_staleness
interval that require row
modifications to be applied at query run time leverage the
QUERY
assignment type.
BigQuery CDC background jobs performed without a BACKGROUND
assignment leverage on-demand pricing.
This consideration is important when designing your workload management
strategy for BigQuery CDC.
To configure a BigQuery reservation for use with CDC, start by
purchasing a capacity commitment and
configuring a reservation in the region where your BigQuery
tables are located. For guidance on the size of your reservation, see
Size and monitor BACKGROUND
reservations.
Once you have created a reservation,
assign the BigQuery
project to the reservation, and set the job_type
option to BACKGROUND
by
running the following
CREATE ASSIGNMENT
statement:
CREATE ASSIGNMENT `ADMIN_PROJECT_ID.region-LOCATION.RESERVATION_NAME.ASSIGNMENT_ID` OPTIONS ( assignee = 'projects/PROJECT_ID', job_type = 'BACKGROUND');
Replace the following:
ADMIN_PROJECT_ID
: the ID of the administration project that owns the reservation.LOCATION
: the location of the reservation.RESERVATION_NAME
: the name of the reservation.ASSIGNMENT_ID
: the ID of the assignment. The ID must be unique to the project and location, start and end with a lowercase letter or a number, and contain only lowercase letters, numbers, and dashes.PROJECT_ID
: the ID of the project containing the BigQuery tables that are being modified by BigQuery CDC. This project is assigned to the reservation.
Size and monitor BACKGROUND
reservations
Reservations determine the amount of compute resources available to perform
BigQuery compute operations. Undersizing a reservation can
increase the processing time of CDC row modification operations. To size a
reservation accurately, monitor historical slot consumption for the project that
performs the CDC operations by querying the
INFORMATION_SCHEMA.JOBS_TIMELINE
view:
SELECT period_start, SUM(period_slot_ms) / (1000 * 60) AS slots_used FROM REGION.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT WHERE DATE(job_creation_time) BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY) AND CURRENT_DATE() AND job_id LIKE '%cdc_background%' GROUP BY period_start ORDER BY period_start DESC;
Replace REGION
with the
region name where your project is located. For
example, region-us
.
Deleted data considerations
- BigQuery CDC operations leverage BigQuery
compute resources. If the CDC operations are configured to use
on-demand billing, CDC operations are
performed regularly using internal BigQuery resources. If the
CDC operations are configured with a
BACKGROUND
reservation, CDC operations are instead subject to the configured reservation's resource availability. If there are not enough resources available within the configured reservation, processing CDC operations, including deletion, might take longer than anticipated. - A CDC
DELETE
operation is considered to be applied only when theupsert_stream_apply_watermark
timestamp has passed the timestamp at which the Storage Write API streamed the operation. Once the operation is applied, the standard Google Cloud data deletion process begins. For more information on theupsert_stream_apply_watermark
timestamp, see Monitor table upsert operation progress.
Limitations
- BigQuery CDC does not perform key enforcement, so it's essential that your primary keys are unique.
- Primary keys cannot exceed 16 columns.
- CDC-enabled tables do not support the following:
- Mutating
data manipulation language (DML)
statements such as
DELETE
,UPDATE
, andMERGE
- Querying wildcard tables
- Search indexes
- Mutating
data manipulation language (DML)
statements such as
- CDC-enabled tables that perform runtime merge jobs because the table's
max_staleness
value is too low cannot support the following: - BigQuery export
operations on CDC-enabled tables don't export recently streamed row
modifications that have yet to be applied by a background job. To export the
full table, use an
EXPORT DATA
statement. - If your query triggers a runtime merge on a partitioned table, then the entire table is scanned whether or not the query is restricted to a subset of the partitions.
- If you are using the Standard edition,
BACKGROUND
reservations are not available, so applying pending row modifications uses the on-demand pricing model. However, you can query CDC-enabled tables regardless of your edition.
BigQuery CDC pricing
BigQuery CDC uses the Storage Write API for data ingestion, BigQuery storage for data storage, and BigQuery compute for row modification operations, all of which incur costs. For pricing information, see BigQuery pricing.
Estimate BigQuery CDC costs
In addition to
general BigQuery cost estimation best practices,
estimating the costs of BigQuery CDC might be important for
workflows that have large amounts of data, a low
max_staleness
configuration, or frequently changing
data.
BigQuery data ingestion pricing and BigQuery storage pricing are directly calculated by the amount of data that you ingest and store. However, BigQuery compute pricing can be harder to estimate, as it relates to the consumption of compute resources that are used to run BigQuery CDC jobs.
BigQuery CDC jobs are split into three categories:
- Background apply jobs: jobs that run in the background at regular
intervals that are defined by the table's
max_staleness
value. These jobs apply recently streamed row modifications into the CDC-enabled table. - Query jobs: GoogleSQL queries that run within the
max_staleness
window and only read from the CDC baseline table. - Runtime merge jobs: jobs that are triggered by ad hoc GoogleSQL
queries that run outside the
max_staleness
window. These jobs must perform an on-the-fly merge of the CDC baseline table and the recently streamed row modifications at query runtime.
All three types of BigQuery CDC jobs take advantage of BigQuery clustering, but only query jobs take advantage of BigQuery partitioning. Background apply jobs and runtime merge jobs can't use partitioning because, when applying recently streamed row modifications, there is no guarantee to which table partition the recently streamed upserts are applied to. In other words, the full baseline table is read during background apply jobs and runtime merge jobs. Understanding the amount of data that is being read to perform CDC operations is helpful in estimating the total cost.
If the amount of data being read from the table baseline is high, consider using the BigQuery capacity pricing model, which is not based on the amount of processed data.
BigQuery CDC cost best practices
In addition to general BigQuery cost best practices, use the following techniques to optimize the costs of BigQuery CDC operations:
- Unless necessary, avoid configuring a table's
max_staleness
option with a very low value. Themax_staleness
value can increase the occurrence of background apply jobs and runtime merge jobs, which are more expensive and slower than query jobs. For detailed guidance, see Recommended tablemax_staleness
value. - Consider configuring a
BigQuery reservation for use with CDC tables.
Otherwise background apply jobs and runtime merge jobs use on-demand pricing,
which can be more costly due to more data processing. For more details, learn
about
BigQuery reservations and
follow the guidance on
how to size and monitor a
BACKGROUND
reservation for use with BigQuery CDC.
What's next
- Learn how to implement the Storage Write API default stream.
- Learn about best practices for the Storage Write API.
- Learn how to use Datastream to replicate transactional databases to BigQuery with BigQuery CDC.