Developers & Practitioners

Migrating to Cloud Spanner using HarbourBridge and Data Validation Tool

There has recently been a lot of interest in migrating from existing open source relational databases such as MySQL or PostgreSQL to Google Cloud Spanner to enable horizontal scaling. This type of heterogeneous migration can be complicated; however, there are two open source tools that can help: HarbourBridge and the Data Validation Tool. HarbourBridge is a stand-alone tool that ingests schema and data from MySQL or PostgreSQL and migrates it into Spanner. It maps data types and features that are supported in Spanner and ignores those that are not. The Data Validation Tool does exactly what its name suggests: it is a Python CLI tool that connects to a source and a target database and verifies that the data in the two databases match through multi-leveled validations or custom queries. Here is a brief overview of the process; look forward to a full end-to-end tutorial soon.

HarbourBridge

Installing HarbourBridge requires the Google Cloud SDK, Go and GCC. 

Download the tool and install it. 

  GO111MODULE=on go get github.com/cloudspannerecosystem/harbourbridge

HarbourBridge can either be run against a dump file or connect directly to a database. If using dump files, they must be in the standard plaintext pg_dump or mysqldump formats. When connected directly to a source, HarbourBridge retrieves table schema and data by querying the database.

To use the tool on a PostgreSQL database dump called mydb, run

  pg_dump mydb | $GOPATH/bin/harbourbridge -driver=pg_dump

To use the tool on a MySQL database dump called mydb, replace the driver flag with -driver=mysqldump.
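For example, the MySQL equivalent of the PostgreSQL command above would look like this (assuming mysqldump is configured with access to mydb):

```shell
# Pipe a MySQL dump straight into HarbourBridge instead of writing it to disk first
mysqldump mydb | $GOPATH/bin/harbourbridge -driver=mysqldump
```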

To use the tool directly on a PostgreSQL database, set -driver=postgres and provide the environment variables PGHOST, PGPORT, PGUSER, and PGDATABASE. 

  $GOPATH/bin/harbourbridge -driver=postgres

To use the tool directly on a MySQL database, set -driver=mysql and provide the environment variables MYSQLHOST, MYSQLPORT, MYSQLUSER, and MYSQLDATABASE. 
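Putting it together, a direct MySQL migration might look like the following (the host, port, user, and database values are placeholders; the password can typically be supplied via the optional MYSQLPWD environment variable, otherwise the tool will prompt for it):

```shell
# Connection details for the source MySQL database (placeholder values)
export MYSQLHOST=localhost
export MYSQLPORT=3306
export MYSQLUSER=root
export MYSQLDATABASE=mydb

# HarbourBridge connects directly and migrates schema and data to Spanner
$GOPATH/bin/harbourbridge -driver=mysql
```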

The tool uses the Cloud project set in the GCLOUD_PROJECT environment variable, creates a new database in the Spanner instance with the converted schema, and then loads the data. It also produces several output files: a schema file, a session file, a report file, and a dropped-data file. These contain more information about what the tool did and any errors it encountered with the data or the schema. 

HarbourBridge is best for migrating moderate-sized databases, up to approximately 100 GB, that do not require continuous data migration or CDC. For continuous data migration to Spanner and/or larger databases, a third-party partner product, Striim, can be a good fit. Striim can do a bulk initial load as well as streaming CDC.

Data Validation Tool

Installing the Data Validation Tool (DVT) requires the Google Cloud SDK, git, python3, python3-dev, pip, gcc, and, optionally, Terraform. DVT outputs results to stdout by default, but can also write to BigQuery. Using BigQuery as a report handler to store and explore the output is recommended. 

  git clone https://github.com/GoogleCloudPlatform/professional-services-data-validator.git

Clone the repository onto your local machine and open the terraform folder. Inside are the Terraform files you can edit and use to create the BigQuery dataset and tables for the output. Alternatively, you can use the bq command-line tool from the Google Cloud SDK to create the table using the results_schema.json file. 

  bq mk --table \
  --time_partitioning_field start_time \
  --clustering_fields validation_name,run_id \
  pso_data_validator.results \
  terraform/results_schema.json

DVT can then be installed with pip.

  pip install google-pso-data-validator

To run DVT you first need to create a connection for the source MySQL database (other databases are also supported). 

  data-validation connections add mysource MySQL --host=HOST --port=PORT --user=USER --password=PASSWORD --database=DATABASE

Next create a connection for the target Cloud Spanner.

  data-validation connections add myspanner Spanner --project-id=MYPROJECTID --instance-id=MYSPANNER --database-id=DBID

Then, to run a validation, specify the validation type, source connection, target connection, list of tables to compare and the BigQuery output table.

  data-validation run \
    --type Column \
    --source-conn mysource \
    --target-conn myspanner \
    --tables-list database.source_table1=database.target_table1,database.source_table2=database.target_table2 \
    -bqrh $PROJECT.pso_data_validator.results

The output in BigQuery will look like this:

[Image: validation results in the BigQuery console]

By default, a Column type validation runs a simple COUNT(*) against the tables specified. In some cases it might be more relevant to run an aggregation on each column instead. By adding --count columnName, --sum myFirstColumnName, --avg anotherColumnName, or a combination of these, you can easily build different checks. There are also other aggregations, filter clauses, and even custom SQL that can be used for more flexibility with DVT. More information about the different types can be found on the GitHub page for DVT. 
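As a sketch, a validation that checks the row count and compares the sum of an amount column might look like the following (the table and column names here are hypothetical):

```shell
# Column validation with explicit aggregations instead of the default COUNT(*):
# compare per-column counts on payment_id and sums on amount (hypothetical columns)
data-validation run \
  --type Column \
  --source-conn mysource \
  --target-conn myspanner \
  --tables-list database.payment=database.payment \
  --count payment_id \
  --sum amount \
  -bqrh $PROJECT.pso_data_validator.results
```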

If you add an additional -c validate.yaml argument, DVT instead outputs a YAML file that stores the configuration specified. This can be easily edited and saved for future reference and reproducibility. 

  result_handler:
    project_id: mygcpproject
    table_id: pso_data_validator.results
    type: BigQuery
  source: mysource
  target: myspanner
  validations:
  - aggregates:
    - field_alias: count
      source_column: null
      target_column: null
      type: count
    filters: []
    labels: []
    schema_name: myschema
    table_name: actor
    target_schema_name: mysql_2021-07-18_6mwe-6373
    target_table_name: actor
    threshold: 0.0
    type: Column
  - aggregates:
    - field_alias: count
      source_column: null
      target_column: null
      type: count
    filters: []
    labels: []
    schema_name: myschema
    table_name: payment
    target_schema_name: mysql_2021-07-18_6mwe-6373
    target_table_name: payment
    threshold: 0.0
    type: Column

Running a saved config is as simple as:

  data-validation run-config -c validate.yaml

Further Information

Introducing the Data Validation Tool for EDW migrations

MySQL to Cloud Spanner via HarbourBridge

Faster Cloud Spanner migrations with HarbourBridge's Schema Assistant

Migrating from MySQL to Cloud Spanner

Continuous data replication to Cloud Spanner using Striim

HarbourBridge and Data Validation Tool are under active development. Please give them a try and help out by giving feedback, filing issues and sending PRs for fixes and enhancements. These open source tools are part of the user community and are not officially supported by Google Cloud.