The Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub template is a streaming pipeline that reads Pub/Sub messages with change data from a MySQL database and writes the records to BigQuery. A Debezium connector captures changes to the MySQL database and publishes the changed data to Pub/Sub. The template then reads the Pub/Sub messages and writes them to BigQuery.
You can use this template to keep BigQuery tables in sync with a MySQL database. The pipeline writes the changed data to a BigQuery staging table and periodically merges it into a BigQuery replica table that mirrors the MySQL database.
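To make the architecture concrete, the core data flow can be sketched as a small Apache Beam pipeline. The following is a minimal, hypothetical sketch, not the template's actual code: the subscription, table, and field names are illustrative, and the real template additionally decodes the connector's serialized records, reads from multiple subscriptions, and runs the periodic replica merge.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptor;

public class CdcPipelineSketch {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    pipeline
        // Each message carries one change record published by the Debezium connector.
        .apply("ReadChangeRecords",
            PubsubIO.readStrings()
                .fromSubscription("projects/PROJECT_ID/subscriptions/SUBSCRIPTION"))
        // Illustrative decoding step: wrap the raw payload in a TableRow. The
        // real template parses the record into typed columns per source table.
        .apply("ToTableRow",
            MapElements.into(TypeDescriptor.of(TableRow.class))
                .via((String payload) -> new TableRow().set("raw_change", payload)))
        .setCoder(TableRowJsonCoder.of())
        // Append every change to the staging (changelog) table; a separate
        // periodic step merges staging rows into the replica table.
        .apply("WriteToStaging",
            BigQueryIO.writeTableRows()
                .to("PROJECT_ID:CHANGELOG_DATASET.my_table_changelog")
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    pipeline.run();
  }
}
```

Keeping the staging table append-only preserves the full change history, while the periodic merge keeps the replica table's cost and write contention low.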
Pipeline requirements
- The Debezium connector must be deployed and publishing MySQL change data to the topics that your Pub/Sub input subscriptions are attached to.
Template parameters
Parameter | Description
---|---
`inputSubscriptions` | The comma-separated list of Pub/Sub input subscriptions to read from, in the format `<subscription>,<subscription>, ...`
`changeLogDataset` | The BigQuery dataset that stores the staging tables, in the format `<my-dataset>`
`replicaDataset` | The BigQuery dataset that stores the replica tables, in the format `<my-dataset>`
`updateFrequencySecs` | Optional. The interval, in seconds, at which the pipeline updates the BigQuery table replicating the MySQL database.
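In Apache Beam, parameters like these are typically exposed through a `PipelineOptions` interface that the runner populates from the `--name=value` flags shown in the next section. The sketch below simply mirrors the table above; the interface name, the `@Default.Integer(300)` value (taken from the example command, not a documented default), and the exact declarations are assumptions and may differ from the template's actual options.

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;

// Hypothetical options interface mirroring the parameter table above.
public interface CdcToBigQueryOptions extends PipelineOptions {
  @Description("Comma-separated list of Pub/Sub input subscriptions to read from.")
  String getInputSubscriptions();
  void setInputSubscriptions(String value);

  @Description("BigQuery dataset that stores the staging (changelog) tables.")
  String getChangeLogDataset();
  void setChangeLogDataset(String value);

  @Description("BigQuery dataset that stores the replica tables.")
  String getReplicaDataset();
  void setReplicaDataset(String value);

  @Description("Seconds between updates of the replica tables.")
  @Default.Integer(300) // Assumed default; the example command below uses 300.
  Integer getUpdateFrequencySecs();
  void setUpdateFrequencySecs(Integer value);
}
```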
Run the template
To run this template, perform the following steps:
- On your local machine, clone the DataflowTemplates repository.
- Change to the `v2/cdc-parent` directory.
- Ensure that the Debezium connector is deployed.
- Using Maven, run the Dataflow template:
```
mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
    --inputSubscriptions=SUBSCRIPTIONS \
    --updateFrequencySecs=300 \
    --changeLogDataset=CHANGELOG_DATASET \
    --replicaDataset=REPLICA_DATASET \
    --project=PROJECT_ID \
    --region=REGION_NAME"
```
Replace the following:
- `PROJECT_ID`: the Google Cloud project ID where you want to run the Dataflow job
- `SUBSCRIPTIONS`: your comma-separated list of Pub/Sub subscription names
- `CHANGELOG_DATASET`: your BigQuery dataset for changelog data
- `REPLICA_DATASET`: your BigQuery dataset for replica tables
- `REGION_NAME`: the region where you want to deploy the Dataflow job