Databases

Introducing the Data Validation Tool for EDW migrations

#databases

Data validation is a crucial step in data warehouse, database, or data lake migration projects. It involves comparing structured or semi-structured data from the source and target tables and verifying that they match after each migration step (e.g data and schema migration, SQL script translation, ETL migration, etc.) 

Today, we are excited to announce the Data Validation Tool (DVT), an open sourced Python CLI tool that provides an automated and repeatable solution for validation across different environments. The tool uses the Ibis framework to connect to a large number of data sources including BigQuery, Cloud Spanner, Cloud SQL, Teradata, and more.

data validation toool.jpg

Why DVT?

Cross platform data validation is a non-trivial and time-consuming effort, and many customers have to build and maintain a custom solution to perform such tasks. The DVT provides a standardized solution to validate customer's newly migrated data in Google Cloud against the existing data from their on-prem systems. DVT can be integrated with existing enterprise infrastructure and ETL pipelines to provide a seamless and automated validation. 

Solution

The DVT provides connectivity to BigQuery, Cloud SQL, and Spanner as well as third-party database products and file systems. In addition, it can be easily integrated into other Google Cloud services such as Cloud Composer, Cloud Functions and Cloud Run. DVT supports the following connection types:

  • BigQuery

  • Cloud SQL

  • FileSystem (GCS, S3, or local files)

  • Hive

  • Impala

  • MySQL

  • Oracle

  • Postgres

  • Redshift

  • Snowflake

  • Spanner

  • SQL Server

  • Teradata

The DVT performs multi-leveled data validation functions, from the table level all the way to the row level. Below is a list of the validation features:

Table level

  • Table row count

  • Group by row count

  • Column aggregation

  • Filters and limits

Column level

  • Schema/Column data type 

Row level 

  • Hash comparison (BigQuery only)

Raw SQL exploration

  • Run custom queries on different data sources

How to Use the DVT

The first step to validating your data is creating a connection. You can create a connection to any of the data sources listed previously. Here’s an example of connecting to BigQuery:


  data-validation connections add --connection-name $MY_BQ_CONN BigQuery --project-id $MY_GCP_PROJECT

Now you can run a validation. This is how a validation between a BigQuery and a MySQL table would look:

  data-validation run \
--type Column \
--source-conn $MY_BQ_CONN --target-conn $MY_SQL_CONN \
--tables-list project.dataset.source_table=database.target_table

The default validation if no aggregation is provided is a COUNT *. The tool will count the number of columns in the source table and verify it matches the count on the target table.

The DVT supports a lot of customization while validating your data. For example, you can validate multiple tables, run validations on specific columns, and add labels to your validations.

  data-validation run \
--type Column \
--source-conn $MY_BQ_CONN --target-conn $MY_BQ_CONN \
--tables-list project.dataset.source_a=project.dataset.target_a,\ project.dataset.source_b=project.dataset.target_b \
--sum total_cost,revenue \
--labels owner=Mandy,description=‘finance tables’

You can also save your validation to a YAML configuration file. This way you can store previous validations and modify your validation configuration. By providing the `config-file` flag, you can generate the YAML file. Note that the validation will not execute when this flag is provided - only the file is created.

  data-validation run \
--type Column \
--source-conn $MY_BQ_CONN --target-conn $MY_BQ_CONN \
-tables-list bigquery-public-data.new_york_citibike.citibike_trips \
-c citibike.yaml

Here is an example of a YAML configuration file for a GroupedColumn validation:

  ​​result_handler:
 project_id: my-project-id
 table_id: pso_data_validator.results
 type: BigQuery
source: my_bq_conn
target: my_bq_conn
validations:
- aggregates:
 - field_alias: count
   source_column: null
   target_column: null
   type: count
 - field_alias: sum__num_bikes_available
   source_column: num_bikes_available
   target_column: num_bikes_available
   type: sum
 - field_alias: sum__num_docks_available
   source_column: num_docks_available
   target_column: num_docks_available
   type: sum
 filters:
 - source: region_id=71
   target: region_id=71
   type: custom
 grouped_columns:
 - cast: null
   field_alias: region_id
   source_column: region_id
   target_column: region_id
 labels:
 - !!python/tuple
   - description
   - test
 schema_name: bigquery-public-data.new_york_citibike
 table_name: citibike_stations
 target_schema_name: bigquery-public-data.new_york_citibike
 target_table_name: citibike_stations
 threshold: 0.0
 type: GroupedColumn

Once you have a YAML configuration file, it is very easy to run the validation.

  data-validation run-config -c citibike.yaml

Validation reports can be output to stdout (default) or to a result handler. The tool currently supports BigQuery as the result handler. 

In order to output to BigQuery, simply add the `--bq-result-handler` or `-bqrh` flag.

  data-validation run \
--type Column \
--source-conn my_bq_conn --target-conn my_bq_conn \
--tables-list bigquery-public-data.new_york_citibike.citibike_trips \
--count tripduration,start_station_name \
--bq-result-handler $YOUR_PROJECT_ID.pso_data_validator.results

Below is an example of the validation results in BigQuery. View the complete schema for validation reports in BigQuery here.

query results.jpg

Getting Started

Ready to start integrating the DVT into your data movement processes? Check out the tool on PyPi here and contribute to the tool via GitHub. We’re actively working on new features to make the tool as useful to our customers. Happy validating!