Data Migration is the process of transferring data from one system to another while changing the storage, database or application. In reference to the ETL (Extract-Transform-Load) process, data migration always requires at least Extract and Load steps.
Typically data migration occurs during an upgrade of existing hardware or transfer to a completely new system. Examples include: migration to or from hardware platform; upgrading a database or migrating to new software; or company-mergers when the parallel systems in the two companies need to be merged into one. There are three main options to accomplish data migration:
- Merge the systems from the two companies into a brand new one
- Migrate one of the systems to the other one.
- Leave the systems as they are but create a common view on top of them – a data warehouse.
Data Migration Challenges
Let us describe the data migration challenges in little more detail. Data migration can be a simple process, however there are challenges one may face in implementation.
Storage Migration
Storage migration can be handled in a manner transparent to the application so long as the application uses only general interfaces to access the data. In most systems this is not an issue. However, careful attention is necessary for old applications running on proprietary systems. In many cases, the source code of the application is not available and the application vendor may not be in market anymore. In such cases storage migration is rather tricky and should be properly tested before releasing the solution into production.
Database Migration
Database migration is rather straight forward, assuming the database is used just as storage. It “only” requires moving the data from one database to another. However, even this may be a difficult task. The main issues one may encounter include:
- Unmatched data types (number, date, sub-records)
- Different character sets (encoding)
Different data types can be handled easily by approximating the closest type from the target database to maintain data integrity. If a source database supports complex data formats (e.g. sub-record), but the target database does not, amending the applications using the database is necessary. Similarly, if the source database supports different encoding in each column for a particular table but the target database does not, the applications using the database need to be thoroughly reviewed.
When a database is used not just as data storage, but also to represent business logic in the form of stored procedures and triggers, close attention must be paid when performing a feasibility study of the migration to target database. Again, if the target database does not support some of the features, changes may need to be implemented by applications or by middleware software.
ETL tools are very well suited for the task of migrating data from one database to another i. Using the ETL tools is highly advisable particularly when moving the data between the data stores which do not have any direct connection or interface implemented.
Application Migration
If we take a step back to previous two cases, you may notice that the process is rather straight forward. This however, is extremely uncommon in the case of application migration. The reason is that the applications, even when designed by the same vendor, store data in significantly different formats and structures which make simple data transfer impossible. The full ETL process is a must as the Transformation step is not always straight forward. Of course, application migration can and usually does include storage and database migration as well. The advantage of an ETL tool in this instance is its ready-to-use connectivity to disparate data sources/targets.
Difficulty may occur when migrating data from mainframe systems or applications using proprietary data storage. Mainframe systems use record based formats to store data. Record based formats are easy to handle; however, there are often optimizations included in the mainframe data storage format which complicate data migration. Typical optimizations include binary coded decimal number storage, non-standard storing of positive/negative number values, or storing the mutually exclusive sub-records within a record. Let us consider a library data warehouse as an example. There are two types of publications – books and articles. The publication can be either a book or an article but not both. There are different kinds of information stored for books and articles. The information stored for a book and an article are mutually exclusive. Hence, when storing a book, the data used has a different sub-record format for a book and an article while occupying the same space. Still the data is stored using rather standard encoding. On the Contrary, proprietary data storage makes the Extract step even more complicated. In both cases, the most efficient way to extract data from the source system is performing the extraction in the source system itself; then converting the data into a printable format which can be parsed later using standard tools.
Character Encoding
Most of the systems developed on PC-based platform use ASCII encoding or national extension based on ASCII. The latest one is UTF-8 which keeps ASCII mapping for alpha and numerical characters but enables storage of characters for most of the national alphabets including Chinese, Japanese and Russian. Mainframe systems are mostly based on EBCDIC encoding which is incompatible with ASCII and conversion is required to display the data. ETL tools should support the conversions between character sets, including EBCDIC.