Developers & Practitioners
Migrating to Cloud Spanner using HarbourBridge and Data Validation Tool
There has recently been a lot of interest in migrating from existing open source relational databases such as MySQL or PostgreSQL to Google Cloud Spanner to enable horizontal scaling. This type of heterogeneous migration can be complicated, however there are two open source tools that can help: HarbourBridge and the Data Validation Tool. HarbourBridge is a stand-alone tool that ingests schema and data from MySQL or PostgreSQL and migrates it into Spanner. It maps data types and features that are supported in Spanner and ignores those that are not. The Data Validation Tool does exactly what its name sounds like. It is a Python CLI tool that connects to a source and target database and ensures the data in the two tables match through multi-leveled validations or custom queries. Here is a brief overview of the process. Look forward to a full end to end tutorial of the whole process soon.
Installing HarbourBridge requires the Google Cloud SDK, Go and GCC.
Download the tool and install it.
GO111MODULE=on go get github.com/cloudspannerecosystem/harbourbridge
HarborBridge can be run against either a dump file or connect directly to a database. If using dump files, the files must use the standard plaintext pg_dump or mysqldump formats. When connected directly to a source HarbourBridge retrieves table schema and data by querying the database.
To use the tool on a PostgreSQL database dump called mydb, run
pg_dump mydb | $GOPATH/bin/harbourbridge -driver=pg_dump
To use the tool on a MySQL database dump called mydb, replace with -driver=mysqldump
To use the tool directly on a PostgreSQL database, set the --driver='postgres' and provide the environment variables PGHOST, PGPORT, PGUSER,PGDATABASE.
To use the tool directly on a MySQL database, set --driver='mysql' and provide the environment variables MYSQLHOST, MYSQLPORT, MYSQLUSER, MYSQLDATABASE.
The tool uses the cloud project in the GCLOUD_PROJECT environment variable and creates a new database in the spanner instance with the converted schema and then loads the data. It also produces several output report files: schema, session, report and dropped files. These files contain more information about what the tool did and any errors it encountered with the data or the schema.
HarbourBridge is best for migrating moderate sized databases up to approximately 100GB that do not require continuous data migration or CDC. For continuous data migration to Spanner and/or larger databases, a third-party partner product, Striim can be a good fit. Striim can do a bulk initial load as well as streaming CDC.
Data Validation Tool
Installing the Data Validation Tool (DVT) requires the Google Cloud SDK, git, python3, python3-dev, pip, gcc and optionally terraform. DVT outputs results to stdout by default, but can also write to BigQuery. It is recommended to utilize BigQuery as a report handler to store and explore the output.
git clone https://github.com/GoogleCloudPlatform/professional-services-data-validator.git
Clone the repository onto your local machine and access the terraform folder. Inside are the terraform files you can edit and use to create the BigQuery dataset and tables for the output. Alternatively, you can use the bq command line tool from the Google Cloud SDK to create the table using the result_schema.json file.
bq mk --table \
--time_partitioning_field start_time \
--clustering_fields validation_name,run_id \
DVT can then be installed with pip.
pip install google-pso-data-validator
To run DVT you need to first create a connection for the source MySQL (other databases are also supported).
data-validation connections add mysource MySQL --host=HOST --port=PORT --user=USER --password=PASSWORD --database=DATABASE
Next create a connection for the target Cloud Spanner.
data-validation connections add myspanner Spanner --project-id=MYPROJECTID --instance-id=MYSPANNER --database-id=DBID
Then, to run a validation, specify the validation type, source connection, target connection, list of tables to compare and the BigQuery output table.
data-validation run \
--type Column \
--source-conn mysource \
--target-conn myspanner \
--tables-list database.source_table1=database.target_table1, \
The output in BigQuery will look like this:
By default, a Column type validation runs a simple COUNT(*) against the tables specified. In some cases it might be more relevant to run an aggregation for each column instead. By adding --count columnName, --sum myFirstColumnName, --avg anotherColumnName or a combination you can easily build different checks. There are other aggregations, filter clauses, and even custom SQL that can be used for more flexibility with DVT. More information about the different types can be found on the Github page for DVT.
If you add an additional -c validate.yaml argument, DVT instead outputs a yaml file that stores the configuration specified. This can be easily edited and saved for future reference and reproducibility.
- field_alias: count
- field_alias: count
Running a saved config is as simple as
data-validation run-config -c validate.yaml
HarbourBridge and Data Validation Tool are under active development. Please give them a try and help out by giving feedback, filing issues and sending PRs for fixes and enhancements. These open source tools are part of the user community and are not officially supported by Google Cloud.