Replicating data from SQL Server to BigQuery

This tutorial shows you how to create and deploy a job that continuously replicates changed data from a Microsoft SQL Server database to a BigQuery table.

Objectives

In this tutorial, you:

Enable Change Data Capture (CDC) in your SQL Server database.
Create and run a Cloud Data Fusion replication job.
View the results in BigQuery.

Costs

In this document, you use the following billable components of Google Cloud:

To generate a cost estimate based on your projected usage, use the pricing calculator.

New Google Cloud users might be eligible for a free trial.

When Replication runs, you're charged for the Dataproc cluster and you incur processing costs for BigQuery. To optimize these costs, we strongly recommend using BigQuery flat rate pricing.

Before you begin

In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
Roles required to select or create a project
- Select a project: Selecting a project doesn't require a specific IAM role—you can select any project that you've been granted a role on.
- Create a project: To create a project, you need the Project Creator (roles/resourcemanager.projectCreator), which contains the resourcemanager.projects.create permission. Learn how to grant roles.
Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

Go to project selector
Verify that billing is enabled for your Google Cloud project.
Enable the Cloud Data Fusion, BigQuery, and Cloud Storage APIs.
Roles required to enable APIs
To enable APIs, you need the Service Usage Admin IAM role (roles/serviceusage.serviceUsageAdmin), which contains the serviceusage.services.enable permission. Learn how to grant roles.
Enable the APIs

Create a public Cloud Data Fusion instance in version 6.3.0 or later. If you create a private instance, set up VPC network peering.
- When you create the instance, enable Replication by clicking Add Accelerators and selecting the Replication checkbox.
- To enable it in an existing instance, see Enable Replication.

For more information about networking, see Using HAProxy with Cloud Data Fusion to Navigate Complex Network Topologies.

Required roles

To get the permissions that you need for this tutorial, see Access control with IAM and Grant service account user permission.

Optional: Set up a SQL Server VM instance

Create a SQL Server instance.
Download the AdventureWorks2017 (OLTP) database and load the data into your SQL Server instance.

Enable CDC in your SQL Server database

For replication, enable Change Data Capture (CDC) on the database and the table you want to replicate.

If you defined custom data types in the table columns being replicated, the owner of the table must grant EXECUTE permissions for the custom data types to the database user who is specified in the replication job.

Create and run a Cloud Data Fusion replication job

Upload the JDBC driver

Download the SQL Server JDBC driver to your local machine.
In the Cloud Data Fusion web interface, upload the JDBC driver. Use these values to configure the JDBC driver:
- In the Name field, enter sqlserver.
- In the Class Name field, enter com.microsoft.sqlserver.jdbc.SQLServerDriver.
- In the Version field, keep the default.

Create the job

In the Cloud Data Fusion web interface, click Replication.
Click Create a replication job.
On the Create new replication job page, specify a replication job Name and click Next.
Configure the source:
1. Select Microsoft SQL Server as the source.
2. For Host, enter the hostname of the SQL Server to read from.
3. For Port, enter the port to use to connect to the SQL Server: 1433.
4. For JDBC Plugin Name, select sqlserveror the name you specified when you configured the JDBC driver.
5. For Database name, enter AdventureWorks2017.
6. In the Credentials section, enter your username and password to access the SQL Server.
Click Next.
Configure the target:
1. Select the BigQuery target.
2. The Project ID and Service Account Key are automatically detected. Keep the default values.
3. Optional: In the Advanced section, you can configure the name and location of the Cloud Storage bucket, the load interval, the staging table prefix, and the behavior when tables or databases are dropped.
Click Next.
If the connection is successful, a list of AdventureWorks2017 tables is displayed. For this tutorial, select a few tables and events, such as Insert, Update, and Delete events.

Note: To proceed, you must enable CDC on the selected tables.
Optional: Configure the advanced properties. For this tutorial, you can accept the default settings.
Click Next.
On the Review assessment page, click View mappings by any of the tables for an assessment of schema issues, missing features, or connectivity issues that might occur during replication. Issues must be resolved before you can proceed. For this tutorial, if any of the tables have issues, go back to the step where you selected tables and select a table or event without issues.

For more information about data type conversions from the source database to the BigQuery destination, see Replication data types.
Click Back.
Click Next.
Review the summary replication job details, and then click Deploy replication job.

Start the job

From the replication job details page, click Start.

The replication job transitions from Provisioning to Starting to Running state. In the running state, the replication job loads an initial snapshot of the table data that you selected (for example, the People table) into BigQuery. In this state, the state of the People table is listed as Snapshotting. After the initial snapshot is loaded into BigQuery, any changes made to the People table are replicated to BigQuery. The state of the table is listed as Replicating.

Monitor the job

You can start and stop the replication job, review its configuration and logs, and monitor your replication job.

You can monitor replication job activities from the Replication job details page.

From the Replication page, click the replication job Name.
Click Monitoring.

View the results in BigQuery

The replication job creates a replicated dataset and table in BigQuery, with names inherited from the corresponding SQL Server database and table names.

Open BigQuery in the Google Cloud console.
In the left panel, click the project name to expand a list of datasets.
Select the adventureworks2017 dataset, and then select a table to view.

For more information, see the BigQuery documentation.

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

After you've finished the tutorial, delete the resources you created on Google Cloud.

Delete the VM instance

In the Google Cloud console, go to the VM instances page.
Go to VM instances
Select the checkbox for the instance that you want to delete.
To delete the instance, click More actions, click Delete, and then follow the instructions.

Delete the Cloud Data Fusion instance

Follow the instructions to delete your Cloud Data Fusion instance.

Delete the project

In the Google Cloud console, go to the Manage resources page.
Go to Manage resources
In the project list, select the project that you want to delete, and then click Delete.
In the dialog, type the project ID, and then click Shut down to delete the project.

What's next

Learn more about Replication in Cloud Data Fusion.
Refer to the Replication API reference.
Work through the tutorial for Replicating data from MySQL to BigQuery.
Work through the tutorial Replicating data from Oracle to BigQuery.