The Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub template is a streaming pipeline that reads Pub/Sub messages with change data from a MySQL database and writes the records to BigQuery. A Debezium connector captures changes to the MySQL database and publishes the changed data to Pub/Sub. The template then reads the Pub/Sub messages and writes them to BigQuery.
You can use this template to keep MySQL databases and BigQuery tables in sync. The pipeline writes the changed data to a BigQuery staging table and periodically updates a replica table that mirrors the MySQL database.
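For example, the Debezium connector needs a Pub/Sub topic to publish change events to, and the pipeline needs a subscription on that topic to read from. The following is a minimal sketch using standard gcloud commands; the topic and subscription names are hypothetical and must match your Debezium connector's configuration:

# Hypothetical names; the topic must match what your Debezium connector publishes to.
gcloud pubsub topics create mysql-cdc-inventory-customers --project=my-project

# The pipeline reads from subscriptions (the inputSubscriptions parameter).
gcloud pubsub subscriptions create mysql-cdc-inventory-customers-sub \
    --topic=mysql-cdc-inventory-customers --project=my-project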
Pipeline requirements

- The Debezium connector must be deployed.
- The Pub/Sub messages must be serialized in a Beam Row (https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/values/Row.html).
Template parameters
Required parameters
- inputSubscriptions : The comma-separated list of Pub/Sub input subscriptions to read from, in the format <SUBSCRIPTION_NAME>,<SUBSCRIPTION_NAME>, ....
- changeLogDataset : The BigQuery dataset to store the staging tables in, in the format <DATASET_NAME>.
- replicaDataset : The location of the BigQuery dataset to store the replica tables in, in the format <DATASET_NAME>. (See the example after this list.)
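For instance, the two datasets can be created ahead of time with the bq command-line tool; the project and dataset names here are hypothetical:

# Hypothetical project and dataset names.
bq mk --dataset my-project:changelog_dataset
bq mk --dataset my-project:replica_dataset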
Optional parameters
- inputTopics : Comma-separated list of Pub/Sub topics to which CDC data is being pushed.
- updateFrequencySecs : The interval at which the pipeline updates the BigQuery table replicating the MySQL database.
- useSingleTopic : Set this to true if you have configured your Debezium connector to publish all table updates to a single topic. Defaults to: false.
- useStorageWriteApi : If true, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is false. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
- useStorageWriteApiAtLeastOnce : When using the Storage Write API, specifies the write semantics. To use at-least-once semantics (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics), set this parameter to true. To use exactly-once semantics, set the parameter to false. This parameter applies only when useStorageWriteApi is true. The default value is false.
- numStorageWriteApiStreams : When using the Storage Write API, specifies the number of write streams. If useStorageWriteApi is true and useStorageWriteApiAtLeastOnce is false, then you must set this parameter. Defaults to: 0.
- storageWriteApiTriggeringFrequencySec : When using the Storage Write API, specifies the triggering frequency, in seconds. If useStorageWriteApi is true and useStorageWriteApiAtLeastOnce is false, then you must set this parameter. (See the sketch after this list.)
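As an illustration, enabling the Storage Write API with exactly-once semantics also requires setting the stream count and triggering frequency. A minimal sketch of the extra flags, assuming they are appended to the -Dexec.args string shown under Run the template; the values are hypothetical:

# Hypothetical values; append to the -Dexec.args string of the mvn command below.
--useStorageWriteApi=true \
--useStorageWriteApiAtLeastOnce=false \
--numStorageWriteApiStreams=1 \
--storageWriteApiTriggeringFrequencySec=5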
Run the template
To run this template, perform the following steps:
- On your local machine, clone the DataflowTemplates repository.
- Change to the v2/cdc-parent directory.
- Ensure that the Debezium connector is deployed.
- Using Maven, run the Dataflow template:
mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
    --inputSubscriptions=SUBSCRIPTIONS \
    --updateFrequencySecs=300 \
    --changeLogDataset=CHANGELOG_DATASET \
    --replicaDataset=REPLICA_DATASET \
    --project=PROJECT_ID \
    --region=REGION_NAME"
Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- SUBSCRIPTIONS: your comma-separated list of Pub/Sub subscription names
- CHANGELOG_DATASET: your BigQuery dataset for changelog data
- REPLICA_DATASET: your BigQuery dataset for replica tables
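For example, a complete run with hypothetical values filled in (project my-project, region us-central1, and the subscription and datasets created earlier) might look like this:

# Clone the repository and switch to the template's module.
git clone https://github.com/GoogleCloudPlatform/DataflowTemplates.git
cd DataflowTemplates/v2/cdc-parent

# Launch the pipeline; all values below are hypothetical.
mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
    --inputSubscriptions=mysql-cdc-inventory-customers-sub \
    --updateFrequencySecs=300 \
    --changeLogDataset=changelog_dataset \
    --replicaDataset=replica_dataset \
    --project=my-project \
    --region=us-central1"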
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.