Data Migration from Navigational to Relational Databases

Ali A. Keskin, Dogan Yilmaz , Timothy K. Payne

Abstract

There are many commercial enterprises that currently face the challenge of migrating their legacy business data, stored in navigational databases (hierarchical, network etc.) in which they have invested heavily in the past, to a relational database. The aim is to provide a flexible, efficient database system, allowing them to keep pace with changes in today’s fast-moving market, employing technological changes and adopting their business processes in a timely way. Choosing the right data migration approach among a set of alternatives, and bringing such a potentially complex project to a success within time and budget are the main considerations of these enterprises. Having been faced with the same challenge, Gouden Gids BV, a company that is a market leader in providing telephone directory services in the Netherlands , started a large-scale project in June 1998 to migrate their legacy applications and data to new client/server based applications and relational databases, respectively. A major data migration project (named Odyssey) was set up to move all legacy data stored in the company’s numerous hierarchical databases to a single Oracle relational database. The project completed successfully within less than 2 years, migrating all company legacy data.  A general-purpose data migration tool, Map Application, was developed, and used to implement data migration tasks. This document describes this large-scale data migration project as a case study, focusing on the Map Application itself and the technical solutions implemented, which can be useful for any organizations facing similar data migration requirements.

Introduction

The advantages offered to businesses by relational database systems are many, and it is almost inevitable for many enterprises to migrate their legacy business data stored in navigational databases to relational databases.  In order to achieve this, enterprises in this situation are first faced with the question of choosing from a number of options the best data migration alternative for their particular organization and circumstances. Depending on various factors such as technical feasibility, data/application architecture, criticality of applications and, possibly most importantly, budget and time considerations, several approaches are generally considered for data migration implementation, including language interfaces, data propagation systems and source code conversions. Two of the most common approaches taken by enterprises are to write a set of data conversion applications in a high-level language such as C or Cobol, or buying/employing an off-the-shelf Data Migration tool. In the first case, writing many, isolated data conversion applications is a very unstructured approach and can be very difficult and costly to maintain. On the other hand, in the second case, off-the-shelf Data Migration tools must be very carefully evaluated in terms of customization capabilities and performance to justify their heavy costs and possibly one-time usage. They often prove insufficiently flexible to be adapted to the specific requirements of the source or target system.

Facing exactly this challenge of migrating its legacy data, stored in a number of hierarchical databases to a new relational database, Gouden Gids BV, a company which is a market-leader in providing telephone directory services, started a large-scale data migration project (named Odyssey), in June 1998. After considering many of the above-mentioned approaches and their pros and cons, it was decided to develop a data conversion application in-house, but with the strict proviso that a very central, structured architecture be adopted. A data migration tool, Map Application, was developed for this purpose, and used to implement various data migration tasks in a highly configurable way. The project completed successfully within less than 2 years, migrating all company legacy data. This document describes this large-scale data migration project, the technical solutions implemented and the Map Application itself.

1         Data Migration Environment

1.1        Source Environment

Gouden Gids business legacy data was stored in an hierarchical database system called DRS (Data Retrieval System). DRS database system actually contained 9 separate databases. Five of these contained very large volumes of data. The other four contained small amounts of data, mainly business driver information, but their data models were more complex. Since the complete application system using these databases was called ADTEC, these databases too were referred to as ADTEC databases. The total database size was 6 G-bytes excluding indexes.  The five big databases that stored operational data had in total 38 tables. Small tables where driver/business rule information was stored had 66 tables in total.

1.2        Target Environment

The target database for the Data Migration project was an Oracle relational database, called ITHACA . A relational data model, called Corporate Data Model (CDM), was developed to support the target applications that will be running on this new database system. Parts of this data model were logically close to their related ADTEC databases, but in general it had many important differences. The target data model, CDM, evolved as the data migration and application migration projects progressed and could not be finalized until the very late stages of this project.

Because of all the differences between ADTEC and CDM databases, physical data volume has increased by about 70% in total in the target database. CDM data model had eventually about 240 tables in it.  180 tables were populated by Odyssey project. The remaining tables were going to be incrementally populated by target applications, or not used.

2         Data Migration Tool Requirements

After considering our data migration requirements and several data migration alternatives mentioned in earlier sections, it has been chosen to develop a data migration tool.  First, a set of requirements was identified for a Data Migration tool, so that they could be taken into account while designing and developing such a tool. The data migration tool to be used for this large-scale data migration project needed to meet a set of basic requirements, including:

 

-          Ability to scan and process multiple data sources and targets.

-          Ability to perform simple and complex data transformations.

-          Ability to support user exits (flexibility for customization).

-          Time and space efficiency.

-          Meta-data management.

-          Data validation and cleansing.

-          Logging and reporting.

3         Data Migration Architecture

Figure 1. shows a high-level layout of the data migration architecture. As shown in the figure, the contents of the ADTEC hierarchical databases are first carried to middle layer Oracle relational database, named RADTEC (Relational-ADTEC). Creating this middle layer relational database served three main purposes:

 

i)                    It allowed us to do very comprehensive and efficient data analyses of the source data using ORACLE/SQL capabilities, without having to issue the same queries in the source hierarchical databases.

ii)                  It allowed us not to use ADTEC hierarchical databases as source of data conversion, thus avoiding inefficient data accesses in these databases.

iii)                 It allowed us to retrieve our source legacy data from a single relational database, as opposed to multiple hiearchical databases. Using this approach, data mappings didn’t need to know or deal with the internals of the source hierarchical database systems.

Figure 1. The Data Migration Architecture

 

3.1        Converting Hierarchical Data into Relational Data

All tables of all ADTEC databases were downloaded into flat files, and were sorted according to their hierarchical keys. It was not necessary to do any structural changes on the ADTEC table layouts, since the ADTEC data model was reasonably normalized. Keys of the parent tables were repeated in the descendant tables, as shown in Figure 2. Data in different ADTEC databases was easily matched by the use of a unique reference number maintained consistently across the databases.

Figure 2. Hierarchical Data Model Representation

 

The flat files downloaded from ADTEC are uploaded to RADTEC Oracle database, using the Oracle utility SqlLoader. Thus, just after this download-upload operation, the data content of ADTEC databases, source flat files and RADTEC database were the same.

 

3.2        The Map Application

As shown in Figure 1, at the core of the Data Migration Architecture is the Map Application. This uses the Map Descriptors stored in a Map Catalogue to convert a set of source components, namely source flat files and source database tables, into a set of target components, namely target flat files or target Oracle database tables. Map Descriptors, which are structures to define a mapping between source and target components will be described in detail later. The Map Catalogue is a set of Oracle tables that store Map Descriptors.

The Map Application performs data conversion on a map-by-map basis, that is, per Map Descriptor. After fetching a Map Descriptor from the Map Catalogue, and depending on the access method specified in a Map Descriptor, the Map Application System may use the source flat files or the contents of a RADTEC table as the source of data for the mapping. If a Map Descriptor contains any exit points (that is, user-defined functions supplied by the author of the Map Descriptor), the Map Application first loads the corresponding Map DLL into memory, then calls the necessary functions. After scanning the source data and performing the data conversion as indicated by the Map Descriptor, a set of target flat files are produced.

The target flat files generated are directly uploaded to the target ITHACA database system, using Oracle SqlLoader.

3.3        Map Descriptors

Map descriptors are structures that are used to define and store the definition of a mapping between a set of source and target components, that is flat files or tables. A Map descriptor can be considered as an object with associated definition, storage, fetch and activation services. A GUI is used to define, store, browse, update and activate Map Descriptors. Each Map Descriptor has a unique identification number, called Map-Id, which is passed to the Map Application as the only parameter when a map is to be executed. Figure 3. shows a high level view of a Map Descriptor.

Figure 3. The Map Descriptors

It is the access mode indicated in the Map Descriptor that directs the Map Application to use a specific scan algorithm on source files or tables. There are 4 different scan algorithms employed:

· Hierarchical flat file scan: This method makes use of the characteristics (parent-child relationships) of the hierarchical data model while scanning the set of source flat files, which must be sorted with respect to the fields that define the parent-child relationship between them. The Relation List component of the Map Descriptor is used to indicate such relationships.

Figure 4. shows two source tables which are in a parent-child relationship and how they are mapped to a target table through a hierarchical flat file scan:

PK1

PF1

PF2

PF3

 

PK1

CK1

CF1

CF2

 

TF1

TF2

TF3

TF4

Pk1

Pf1a

Pf2a

Pf3a

 

Pk1

Ck1

Cf1a

Cf2a

 

Pk1

Ck1

Pf2a

Cf1a

Pk2

Pf1b

Pf2b

Pf3b

  +

Pk1

Ck2

Cf1b

Cf2b

->

Pk1

Ck2

Pf2a

Cf1b

Pk3

 

 

 

 

Pk3

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Parent Source Table

 

Child Source Table

 

Target Table

 

Figure 4. Example case for Hierarchical Flat File Scan


· Merge flat file scan: This method scans the source flat files based on a common key among them, specified using a ‘Relation List’, and forms groups of records from each source file that contains the common key value. The source files need to be sorted on the common key specified. The common key can be composed of multiple fields. This method has been by far the most commonly selected method for implementing data migration maps.

Figure 5. shows two source tables which are in a common key relationship and how they are mapped to a target table through merge flat file scan:

K1

K2

F1

F2

 

K1

F1

K2

F2

 

TK1

TK2

TF1

TF2

Pk1

Pk2

F1a

F2a

 

Pk1

F1x

Pk2

F2x

 

Pk1

Pk2

TF1a

TF2a

Pk1

Pk2

F1b

F2b

  +

Pk3

F1y

Pk4

F2y

->

Pk1

Pk2

TF1b

TF2b

Pk3

Pk4

 

 

 

 

 

 

 

 

Pk1

Pk2

TF1c

TF2c

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Source Table 1

 

Source Table 2

 

Target Table

 

Figure 5. Example case for Merge Flat File Scan

 

The target fields TF1, TF2, in the above example, are formed based on the content of the record groups of Source Table 1 and Source Table 2 which share a common key and don’t have a direct, one-to-one relationship with any of the source table fields.

 

· Query table scan: In this method, first a SQL-defined ‘view query’ is defined on top of source tables, then the view query is executed within the database and the result set is used to feed the scan mechanism, one record per scan iteration. The ‘view query’ may either be statically formed as complex as possible and stored within the Map Descriptor, or it can be generated on–the-fly by the Map Application using the relation list associated with the source tables as “join conditions”. Thus, this method fetches its source data from Oracle tables, rather than the source flat files. Even though it is, in general, slower than flat file scans, it is quite a powerful way of allowing very complex source table scans. The Map application uses Dynamic-SQL and cursor mechanisms to bring results from the database per scan iteration.

· Customized file/table scan: The user who defines the Map Descriptor may choose not to employ any of the pre-defined access methods, but provide his/her own scan functions. The Map Application System then uses these user-provided functions (coded in a high-level language such as C) to scan source components.

Each Map Descriptor contains at least one Map list. Each Map list, in turn, consists of a list of Map items, which represents a field-by-field forming of target components. In other words, each Map item corresponds to one field in one of the target components and tells the Map Application System how to generate a value for that field during each iteration of the above mentioned scans. A Map item, for example, may indicate that the corresponding target component (flat file or database table) field will have a NULL value, or a Constant value or will receive its value directly from one of the source component fields, which is called ‘simple mapping’. It may be sometimes necessary to attach to a Map item a Map function, whose purpose is to evaluate the target field value using the current value content of one or more source components. This is called ‘complex mapping’. The associated Map function, which is written in a high-level language such as C, can be as complex as necessary but there is no need for the author of the function to know anything related to the internals of the Map Application. The user writes his/her Map function using only the logical definitions of the source and target components, and the pre-defined Map application function interfaces.

 

Map functions are stored in shared Dynamic Link Libraries (DLLs on Windows and Sharable Images on Vax/VMS) and called by the Map Application during run-time.

 

3.4        Map Application Structure and Environment

The Map Application consists of two major, independent execution units:

 

-          Map GUI

-          Map Driver

 

The Map GUI is the front-end application, running on a PC under Windows, used by developers to create, update, browse and activate Map Descriptors. Figure 6 shows a screen capture of the Map GUI. The Map GUI connects to the Map Catalogue stored in Oracle through an ODBC connection.

 

The Map Driver is the main application, which executes an individual data conversion task as defined by a Map Descriptor.  A unique Map Identifier is passed as the only parameter to the Map Driver. The Driver fetches the corresponding Map Descriptor from the Map Catalogue and takes all the necessary actions to perform the data conversion as specified. Any erroneous cases encountered are reported in map-specific error/log files.

 

Figure 6. Screen capture of Map GUI (field-level Map definition)

4         Conclusion

Many commercial enterprises currently face the challenge of migrating their legacy business data, stored in navigational databases (hierarchical, network etc.) to a relational database and undertake big data migration projects. The duration and success of these projects depend on various factors. Gouden Gids too undertook such a large-scale data migration project and successfully migrated all its hierarchical legacy data to a relational database system.  Probably the most important lesson drawn is that, having a very structured approach to analysis, design, development and quality aspects, supported by an automated data migration tool such as the Map Application, is crucial for the success of any data migration project.

Ali A. Keskin is a Technical Project Leader at Gouden Gids BV. He led the Odyssey project and developed the Map application. He received a B.Sc. degree in Computer Engineering from the Middle East Technical University and a M.Sc. degree in Computer Science from University of Georgia . He had publications in Journal of Object Oriented Programming, International Journal in Computer Simulation, Proceedings of SE ACM Conference and Proceedings of International Symposium on Large Spatial Databases. His interests include Database and Middleware technologies.

Dogan Yilmaz was Technical Project Manager at Gouden Gids BV. He led both phases of Odyssey project. He is now working as an independent consultant. He received a B.Sc. degree in Computer Engineering from the Middle East Technical University and an MBA degree from Brunel University . His interests are IT Service and Project Management, Service Quality Measurement, and Database technologies.

Timothy K. Payne: is a Software Analyst/Developer at Gouden Gids BV. He was a key member of the Odyssey team, participating in all phases of the project. He received a B.Sc. degree in Physics from the University of Southampton and a M.Sc. degree in Applied Computer Technology from Middlesex University . His interests include database design and software engineering techniques.