
Data Validation for Oracle to Cloud Spanner Migration

August 9, 2021
Logan Ramalingam

Strategic Cloud Engineer, Google Cloud

Introduction

The Oracle database is a popular RDBMS, but for highly scalable environments we see customers choosing Cloud Spanner for its multi-region deployments, effectively unlimited scalability, and up to 99.999% availability. An Oracle-to-Spanner migration can be approached in many ways, but one of the most important stages of any migration is data validation. One tool that helps simplify the data validation process is the Data Validation Tool (DVT).


The Data Validation Tool is an open-source, Python-based CLI tool that compares tables across heterogeneous data sources and provides multi-level validation functions. DVT can connect to different types of source and target databases and create validation reports based on the criteria defined. In this blog, we will see how to set up the Data Validation Tool to compare data between Oracle and Spanner tables and write validation reports to BigQuery.


Installing Data Validation Tool

To install and run the Data Validation Tool (DVT), we need a machine with Python 3.6+, pip, and gcc installed. DVT natively supports BigQuery and Spanner connections when the Google Cloud SDK is installed. DVT can write the output of data validation directly into BigQuery tables (console output is the default).


The BigQuery output tables need to be created beforehand, and we can use Terraform to create them easily. This requires the Terraform CLI to be installed on your machine. In addition, make sure git is installed so you can clone the repository.


Clone the repository onto your machine.


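A minimal sketch of the clone step, assuming the tool lives in the GoogleCloudPlatform professional-services-data-validator repository on GitHub:

# Clone the DVT repository and enter it
git clone https://github.com/GoogleCloudPlatform/professional-services-data-validator.git
cd professional-services-data-validator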

Inside the repository, navigate to the terraform folder. Perform the steps below to create the BigQuery tables necessary for data validation reports.

  1. Delete the testenv.tf file inside the terraform folder

  2. Edit variables.tf inside the terraform folder and replace default = "pso-kokoro-resources" with default = "YOUR_PROJECT_ID"

  3. Run the following commands from the terraform directory

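This is the standard Terraform workflow; terraform apply will prompt for confirmation before creating the BigQuery resources:

# Initialize the working directory and download the required providers
terraform init
# Review the BigQuery dataset and table resources to be created
terraform plan
# Create the resources
terraform apply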

Alternatively, we can use the bq command-line tool, part of the Google Cloud SDK, to create the tables.

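A hedged sketch of the equivalent bq commands; the dataset name pso_data_validator and the results_schema.json path follow the repository's Terraform defaults and may differ in your checkout:

# Create the dataset that holds validation results (names are assumptions)
bq mk pso_data_validator
# Create the results table from the JSON schema shipped with the repository
bq mk --table \
  --time_partitioning_field start_time \
  pso_data_validator.results \
  terraform/results_schema.json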

Before installing the DVT, create a Python virtual environment to sandbox the tool.

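For example:

# Create and activate an isolated Python environment for DVT
python3 -m venv env
source env/bin/activate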

Now we can install the tool in the selected environment using pip.

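At the time of writing, the package is published on PyPI as google-pso-data-validator:

pip install google-pso-data-validator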

To configure the Oracle connection, install the cx_Oracle package using pip.

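For example:

# cx_Oracle is the Python driver DVT uses to reach Oracle
pip install cx_Oracle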

This also requires the Oracle Instant Client library to be installed on the local machine; follow the installation steps in the Oracle Instant Client documentation.

Creating Connections

Before running the validation, a connection to the Oracle system needs to be created.

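A sketch of the connections command; the connection name my_oracle_conn is illustrative, and the flag spellings follow the DVT CLI at the time of writing:

data-validation connections add \
  --connection-name my_oracle_conn Oracle \
  --host <oracle-host-name> \
  --port <oracle-port-number> \
  --user <user-name> \
  --password <password> \
  --database <database-name>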

Fill in the following details before executing the command:

  • <oracle-host-name>: Oracle hostname

  • <oracle-port-number>: Oracle port number

  • <user-name>: Oracle login username

  • <password>: Oracle login password

  • <database-name>: Oracle database name

Next, create a connection to the Spanner instance.

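A sketch along the same lines, with my_spanner_conn as an illustrative connection name:

data-validation connections add \
  --connection-name my_spanner_conn Spanner \
  --project-id <gcp-project-id> \
  --instance-id <spanner-instance-id> \
  --database-id <spanner-database-id>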

Fill in the following details before executing the command:

  • <gcp-project-id>: GCP project ID where the Spanner instance is located

  • <spanner-instance-id>: Spanner instance ID

  • <spanner-database-id>: Spanner database ID

Use `gcloud auth application-default login` to connect to GCP services as the default user. If you want to use a service account instead, provide a service account JSON key file in the custom parameter `google_service_account_key_path`.

Validation Types

DVT supports the following validation types:


Table level

  • Table row count

  • Group by row count

  • Column aggregation - SUM(), MIN(), MAX(), AVG()

  • Filters and limits

Column level

  • Full column data type


Based on the scenario, different validation types can be used. For this tutorial, we will perform column aggregation-based validation using the COUNT, MIN, and MAX functions. To visualize and customize the validation, provide the config-file option to write the validation configuration to a YAML file.


As an example, consider validating an employee table with 11 columns. We will add a COUNT aggregation on all columns, and MIN and MAX aggregations on the salary column.


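A sketch of the column validation command; the HR.EMPLOYEES=employees table mapping is illustrative, and the flag spellings follow the DVT CLI at the time of writing (older releases exposed this as data-validation run --type Column):

data-validation validate column \
  -sc my_oracle_conn \
  -tc my_spanner_conn \
  -tbls HR.EMPLOYEES=employees \
  --count '*' \
  --min salary --max salary \
  -bqrh <gcp-project-id>.pso_data_validator.results \
  -c config.yaml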

Fill in the following detail before executing the command:

  • <gcp-project-id>: GCP project ID where the Spanner instance is located

A config.yaml file is created with the validation criteria.

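A trimmed sketch of what the generated file can look like; exact keys vary by DVT version, and only the COUNT and salary MIN/MAX aggregates are shown:

result_handler:
  type: BigQuery
  project_id: <gcp-project-id>
  table_id: pso_data_validator.results
source: my_oracle_conn
target: my_spanner_conn
validations:
- type: Column
  schema_name: HR
  table_name: EMPLOYEES
  target_table_name: employees
  aggregates:
  - field_alias: count        # COUNT across all columns
    type: count
  - field_alias: min__salary  # MIN on the salary column
    source_column: salary
    target_column: salary
    type: min
  - field_alias: max__salary  # MAX on the salary column
    source_column: salary
    target_column: salary
    type: max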

Running Validation

After the configuration file is created, we can run the validation based on config.yaml.

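A sketch of the run step; releases from around this time used the run-config subcommand, which newer releases rename to configs run:

# Execute the validations defined in the YAML file
data-validation run-config --config-file config.yaml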

This command will run the validation tool and store the results in the BigQuery table.

Analyzing Validation Report

[Screenshot: DVT validation report results in BigQuery]

The validation report shows aggregation values and differences for each column defined in the config file. The status column is populated automatically based on the provided threshold (the default threshold is 0.0).


Further Information

Introducing the Data Validation Tool for EDW migrations

Migrating from Oracle to Cloud Spanner


The Data Validation Tool is under active development. Please give it a try and help out by giving feedback, filing issues, and sending PRs for fixes and enhancements. These open-source tools are part of the user community and are not officially supported by Google Cloud.

