Replicating data from MySQL to BigQuery

This tutorial shows you how to create and deploy a pipeline that continuously replicates changed data from a MySQL database to a BigQuery table.


In this tutorial, you:

  1. Deploy your MySQL database on Compute Engine.
  2. Set up your MySQL database to enable replication.
  3. Create and run a Cloud Data Fusion Replication job.
  4. View the results in BigQuery.


This tutorial uses billable components of Google Cloud, including:

When Replication runs, you are charged for the Dataproc cluster and you incur processing costs for BigQuery. To optimize these costs, we strongly recommend that you use BigQuery flat rate pricing.

Use the pricing calculator to generate a cost estimate based on your projected usage. New Google Cloud users might be eligible for a free trial.

Before you begin

  1. Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
  2. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  3. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  4. Enable the Cloud Data Fusion, BigQuery, and Cloud Storage APIs.

    Enable the APIs

  5. Create a private Cloud Data Fusion instance in version 6.3.0 or above. When you set up the instance:

If you use an existing instance, or if you don't see Replication in the Cloud Data Fusion menu , see Upgrade to enable Replication.

Install MySQL in Compute Engine

  1. Download a MySQL Server Docker image.

  2. Upload your Docker image to the Container Registry.

  3. Deploy your Docker image on a new VM instance.

  4. On the Compute Engine Disks page, change the disk size to 500 GB, and then restart the VM.

    Go to the Disks page

  5. Create a firewall for the VM instance.

  6. Install the Sakila sample database.

Enable replication in your MySQL database

To enable replication, set up Change Data Capture (CDC) in MySQL.

Create and run a Cloud Data Fusion Replication job

Upload the JDBC driver

  1. Download the MySQL JDBC driver (version 8 or higher) to your local machine.

  2. In the Cloud Data Fusion UI, upload the JDBC driver.

    Use these values to configure the JDBC driver:

    • In the Name field, enter mysql.
    • In the Version field, keep the default.
    • In the Class Name field, enter com.mysql.jdbc.Driver.

Create the pipeline

  1. In the Cloud Data Fusion UI, click the menu and navigate to the Replication page.

  2. On the Create new replication job page, specify a Replication job Name.

  3. Click Next.

  4. Configure the source:

    1. Select MySQL as the source.
    2. For Host, enter the hostname of the MySQL server to read from.
    3. For Port, enter the port to use to connect to the MySQL server: 3306.
    4. For JDBC Plugin Name , select mysql (or the name you specified when you configured the JDBC driver).
    5. For Database Name, enter sakila.
    6. In the Credentials section, enter your username and password to access the MySQL server.
  5. Click Next.

  6. If the connection is successful, a list of Sakila sample database tables is displayed. For this tutorial, select a few tables and events to replicate (including Insert, Update, and Delete events).

  7. Click Next.

  8. Configure the target:

    1. Select the BigQuery target.
    2. The Project ID and Service Account Key are automatically detected. Keep the default values.
    3. (Optional) In the Advanced section, you can configure the staging bucket's name and location, the load interval, the staging table prefix, and the behavior when tables or databases are dropped.
  9. Click Next.

  10. (Optional) Configure the advanced properties. For this tutorial, you can use the default settings.

  11. Click Next.

  12. On the Review assessment page, click View mappings by any of the tables to get an assessment of schema issues, missing features, or connectivity issues that might occur during replication. If any issues occur, they must be resolved before you can proceed. For this tutorial, if any of the tables have issues, go back to the step where you selected tables and instead select a table or event (Inserts, Updates, or Deletes) without issues.

  13. Click Back.

  14. Click Next.

  15. Review the summary Replication job details, and then click Deploy Replication job.

Start the pipeline

On the Replication job details page:

Click Start.

The Replication job transitions from Provisioning to Starting to Running state. In the running state, the Replication job loads an initial snapshot of the table data that you selected into BigQuery. In this state, the state of the table is listed as Snapshotting. After the initial snapshot is loaded into BigQuery, any changes made to the table are replicated to BigQuery, and the state of the table is listed as Replicating.

Monitor the pipeline

You can start and stop the Replication job, review its configuration and logs, and monitor your Replication job.

You can monitor Replication job activities from the Replication job details page.

  1. From the Replication page, click the desired Replication job Name.

  2. Click Monitoring.

View the results in BigQuery

The Replication job creates a replicated dataset and table in BigQuery, with names inherited from the corresponding MySQL database and table names.

  1. Open BigQuery in the Cloud Console.

  2. In the left panel, click on your project name to expand a list of datasets.

  3. Select the sakila dataset, and then select a table.

For more information, see the BigQuery documentation.

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

After you've finished the tutorial, clean up the resources you created on Google Cloud so they won't take up quota and you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.

Delete the Cloud Data Fusion instance

Follow the instructions to delete your Cloud Data Fusion instance.

Delete the project

The easiest way to eliminate billing is to delete the project that you created for the tutorial.

To delete the project:

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next