Data Validation for Oracle to Cloud Spanner Migration

By mullaned2002

August 9, 2021

865

Introduction

The Oracle database is a popular RDBMS in the database industry, but for a highly scalable environments we see customers choosing to leverage Cloud Spanner with its multi-region, unlimited scalability and up to 99.999% availability. Migration of Oracle to Spanner can be addressed using many approaches but one of the most important stages of the migration is Data Validation. One tool that helps simplify the data validation process is the Data Validation Tool (DVT).

Data Validation Tool is an open sourced python CLI based tool that compares heterogeneous data source tables and provides multi-level validation functions. DVT can connect to different types of source and target databases and create validation reports based on the criteria defined. In this blog, we will see how to set up the Data Validation Tool to compare data between Oracle and Spanner tables and create validation reports in BigQuery.

Installing Data Validation Tool

To install and run a data validation tool (DVT), we need a machine with python 3.6+, pip, gcc installed. DVT natively supports BigQuery and Spanner connections with Google Cloud SDK installed. DVT has the capability to write the output of data validation into BigQuery tables directly (console output is default).

BigQuery output tables need to be created beforehand for this and we can use terraform to create them easily. This will need a terraform cli to be installed on your machine. In addition to that, make sure git packages are installed to clone the repository.

Clone the repository onto your machine.

Inside the repository, navigate to the terraform folder. Perform the below steps to create the BigQuery tables necessary for data validation reports.

Delete the testenv.tf file inside the terraform folder

View variables.tf inside the terraform folder and replace default “pso-kokoro-resources” with default = “YOUR_PROJECT_ID”

Run the following commands from terraform directory

Alternatively, we can use the bq command line tool as a part of Google Cloud SDK to create the tables.

Before installing the dvt tool, create a virtual environment in python to sandbox the tool.

Now we can install the tool in the selected environment using pip.

To configure the Oracle connection, install cw_Oracle package using pip.

This will also require an Oracle instant client library installed on the local machine. Follow the steps mentioned here.

Creating Connections:

Before running the validation, connection to the Oracle system needs to be created.

Fill in the below details before executing the command

<oracle-host-name>

Oracle hostname

<oracle-port-number>

Oracle portname

<user-name>

Oracle login username

Oracle login password

<database-name>

Oracle database name

Next step, create a connection to the Spanner instance.

Fill in the below details before executing the command:

<gcp-project-id>

GCP Project ID where Spanner instance is located

<spanner-instance-id>

Spanner Instance ID

<spanner-database-id>

Spanner Database ID

Use `gcloud auth application-default login` to use the default user to connect to GCP services. In case if you want to use a service account to create a connection, you can provide a service account json key file in the custom parameter `google_service_account_key_path`.

Validation Types:

DVT supports the following validation types:

Table level

Table row count

Group by row count

Column aggregation – SUM(), MIN(), MAX(), AVG()

Filters and limits

Column level

Full column data type

Based on the scenario, different validation types can be used. For this tutorial, we will be doing each column aggregation based validation using count, min and max functions. To visualize and customize the validation, provide the config-file option to write the validation configuration as a yaml file.

For example, we are considering validation of employee table which has 11 columns. We will be adding count based aggregation on all columns and min and max aggregation on the salary column.

Fill in the below details before executing the command

<gcp-project-id>

GCP Project ID where Spanner instance is located

Config.yaml is created with the validation criteria.

Running Validation:

After the configuration file is created, we can run the validation based on the confg.yaml.

This command will run the validation tool and store the results in the BigQuery table.

Analyzing Validation Report

The validation report shows aggregation values and differences for each column defined in the config file. Status column automatically fills the value based on the threshold provided (default threshold is 0.0).

Further Information

Introducing the Data Validation Tool for EDW migrations

Migrating from Oracle to Cloud Spanner

Data Validation Tool is under active development. Please give them a try and help out by giving feedback, filing issues and sending PRs for fixes and enhancements. These open source tools are part of the user community and are not officially supported by Google Cloud.

Cloud BlogRead More

Previous articleAnnouncing model improvements and lower annotation limits for Amazon Comprehend custom entity recognition

Next articleAccurately predicting future sales at Clearly using Amazon Forecast

Data Validation for Oracle to Cloud Spanner Migration

Introduction

Installing Data Validation Tool

Creating Connections:

Validation Types:

Running Validation:

Analyzing Validation Report

Further Information

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Built with BigQuery: Quantum Metric unlocks data for frictionless customer experiences

The real value of 5G and cloud computing

Google Cloud now the most favored platform for research globally, according to Ipsos

POPULAR CATEGORY