Data Migration from Navigational to
Relational Databases
Ali A. Keskin,
Dogan Yilmaz
, Timothy K. Payne
Abstract
There are many commercial enterprises that currently face the challenge of
migrating
their legacy business data, stored in navigational databases (hierarchical,
network etc.) in which they have invested heavily in the past, to a relational
database. The aim is to provide a flexible, efficient database system, allowing
them to keep pace with changes in today’s fast-moving market, employing
technological changes and adopting their business processes in a timely way.
Choosing the right data migration approach among a set of alternatives, and
bringing such a potentially complex project to a success within time and budget
are the main considerations of these enterprises. Having been faced with the
same challenge, Gouden Gids BV, a company that is a market leader in providing
telephone directory services in the
Netherlands
, started a large-scale project in June 1998 to migrate
their legacy applications and data to new client/server based applications and
relational databases, respectively. A major data migration project (named
Odyssey) was set up to move all legacy data stored in the company’s numerous
hierarchical databases to a single Oracle relational database. The project
completed successfully within less than 2 years, migrating all company legacy
data.
A general-purpose data migration tool, Map
Application, was developed, and used to implement data migration tasks.
This document describes this large-scale data migration project as a case
study, focusing on the Map Application itself and the technical solutions
implemented, which can be useful for any organizations facing similar data
migration requirements.
Introduction
The advantages offered to businesses by relational database systems are many,
and it is almost inevitable for many enterprises to migrate their legacy
business data stored in navigational databases to relational databases.
In order to achieve this, enterprises in this situation are first faced with the question of
choosing from a number of options the best data migration alternative for their
particular organization and circumstances. Depending on various factors such as
technical feasibility, data/application architecture, criticality of
applications and, possibly most importantly, budget and time considerations,
several approaches are generally considered for data migration implementation,
including language interfaces, data propagation systems and source code
conversions. Two of the most common approaches taken by enterprises are to
write a set of data conversion applications in a high-level language such as C
or
Cobol, or buying/employing an off-the-shelf Data
Migration tool. In the first case, writing many,
isolated data conversion applications is
a very unstructured approach and can be very difficult and costly to maintain.
On the other hand, in the second case, off-the-shelf Data Migration tools must
be very carefully evaluated in terms of customization capabilities and
performance to justify their heavy costs and possibly one-time usage. They
often prove insufficiently flexible to be adapted to the specific requirements
of the source or target system.
Facing exactly this challenge of migrating its legacy data, stored in a number
of hierarchical databases to a new relational database, Gouden Gids BV, a
company which is a market-leader in providing telephone directory services,
started a large-scale data migration project (named Odyssey), in June 1998.
After considering many of the above-mentioned approaches and their pros and
cons, it was decided to develop a data conversion application in-house, but
with the strict proviso that a very central, structured architecture be
adopted. A data migration tool, Map Application, was developed for this purpose, and used to implement various data migration tasks in
a highly configurable way. The project completed successfully within less than
2 years, migrating all company legacy data. This document describes this
large-scale data migration project, the technical solutions implemented and the
Map Application itself.
1
Data
Migration Environment
1.1
Source Environment
Gouden Gids business legacy data was stored in
an hierarchical database system called DRS (Data Retrieval System).
DRS database system actually contained 9 separate databases. Five of these
contained very large volumes of data. The other four contained small amounts of
data, mainly business driver information, but their data models were more
complex. Since the complete application system using these databases was called
ADTEC, these databases too were referred to as ADTEC databases. The total
database size was 6 G-bytes excluding indexes.
The five big databases that stored operational data had in total 38 tables.
Small tables where driver/business rule information was stored had 66 tables in
total.
1.2
Target Environment
The target database for the Data Migration project was an Oracle relational
database, called
ITHACA
. A relational data model, called Corporate Data Model (CDM), was
developed to support the target applications that will be running on this new
database system. Parts of this data model were logically close to their related
ADTEC databases, but in general it had many important differences. The target
data model, CDM, evolved as the data migration and application migration
projects progressed and could not be finalized until the very late stages of
this project.
Because of all the differences between ADTEC and CDM databases, physical data
volume has increased by about 70% in total in the target database. CDM data
model had eventually about 240 tables in it.
180 tables were populated by Odyssey project. The remaining tables were going
to be incrementally populated by target applications, or not used.
2
Data
Migration Tool Requirements
After considering our data migration requirements and several data migration
alternatives mentioned in earlier sections, it has been chosen to develop a
data migration tool. First, a set
of requirements was identified for a Data Migration tool, so that they could be
taken into account while designing and developing such a tool. The data
migration tool to be used for this large-scale data migration project needed to
meet a set of basic requirements, including:
-
Ability to scan and process multiple data sources and targets.
-
Ability to perform simple and complex data transformations.
-
Ability to support user exits (flexibility for customization).
-
Time and space efficiency.
-
Meta-data management.
-
Data validation and cleansing.
-
Logging and reporting.
3
Data
Migration Architecture
Figure 1.
shows a high-level layout of the data migration
architecture. As shown in the figure, the contents of the ADTEC hierarchical
databases are first carried to middle layer Oracle relational database, named
RADTEC (Relational-ADTEC). Creating this middle layer relational database
served three main purposes:
i)
It
allowed us to do very comprehensive and efficient data analyses of the source
data using ORACLE/SQL capabilities, without having to issue the same queries in
the source hierarchical databases.
ii)
It
allowed us not to use ADTEC hierarchical databases as source of data
conversion, thus avoiding inefficient data accesses in these databases.
iii)
It
allowed us to retrieve our source legacy data from a single relational
database, as opposed to multiple hiearchical databases. Using this approach,
data mappings didn’t need to know or deal with the internals of the source
hierarchical database systems.
Figure 1.
The Data Migration Architecture
3.1
Converting Hierarchical Data into Relational Data
All tables of all ADTEC databases were downloaded into
flat files, and were sorted according to their hierarchical keys. It was not
necessary to do any structural changes on the ADTEC table layouts, since the
ADTEC data model was reasonably normalized. Keys of the parent tables were
repeated in the descendant tables, as shown in Figure 2. Data in different
ADTEC databases was easily matched by the use of a unique reference number
maintained consistently across the databases.
Figure 2.
Hierarchical Data Model Representation
The flat files downloaded from ADTEC are uploaded to RADTEC Oracle database,
using the Oracle utility SqlLoader. Thus, just after this download-upload
operation, the data content of ADTEC databases, source flat files and RADTEC
database were the same.
3.2
The Map Application
As shown in Figure 1, at the core of the Data
Migration Architecture is the Map Application. This uses the
Map Descriptors stored in a Map
Catalogue
to convert a set of source components, namely source flat files and source
database tables, into a set of target components, namely target flat files or
target Oracle database tables. Map Descriptors, which are structures to define
a mapping between source and target components will
be described in detail later. The Map Catalogue is a set of Oracle tables that
store Map Descriptors.
The Map Application performs data conversion on a map-by-map basis, that is,
per Map Descriptor. After fetching a Map Descriptor from the Map
Catalogue,
and depending on the access method specified in a Map Descriptor, the Map
Application System may use the source flat files or the contents of a RADTEC
table as the source of data for the mapping. If a Map Descriptor contains any exit points (that is, user-defined functions supplied by the author of the Map Descriptor), the Map
Application first loads the corresponding Map DLL into memory, then calls the
necessary functions. After scanning the source data and performing the data
conversion as indicated by the Map Descriptor, a set of target flat files are
produced.
The target flat files generated are directly uploaded to the target
ITHACA
database system, using Oracle SqlLoader.
3.3
Map Descriptors
Map descriptors are structures that are used to define and store the definition
of a mapping between a set of source and target
components,
that
is flat files or tables. A Map descriptor can be considered as an object with
associated definition, storage, fetch and activation services. A GUI is used to
define, store, browse, update and activate Map Descriptors. Each Map Descriptor
has a unique identification number, called Map-Id, which is passed to the Map
Application as the only parameter when a map is to be executed.
Figure 3.
shows
a high level view of a Map Descriptor.

Figure 3.
The Map Descriptors
It is the access mode indicated in the Map Descriptor
that directs the Map Application to use a specific scan algorithm on source
files or tables. There are 4 different scan algorithms employed:
·
Hierarchical flat file scan: This
method makes use of the characteristics (parent-child relationships) of the
hierarchical data model while scanning the set of source flat files, which must
be sorted with respect to the fields that define the parent-child relationship
between them. The Relation List component of the Map Descriptor is used to
indicate such relationships.
Figure 4. shows two source tables which are in a parent-child relationship and
how they are mapped to a target table through a hierarchical flat file scan:
|
PK1
|
PF1
|
PF2
|
PF3
|
|
PK1
|
CK1
|
CF1
|
CF2
|
|
TF1
|
TF2
|
TF3
|
TF4
|
|
Pk1
|
Pf1a
|
Pf2a
|
Pf3a
|
|
Pk1
|
Ck1
|
Cf1a
|
Cf2a
|
|
Pk1
|
Ck1
|
Pf2a
|
Cf1a
|
|
Pk2
|
Pf1b
|
Pf2b
|
Pf3b
|
+
|
Pk1
|
Ck2
|
Cf1b
|
Cf2b
|
->
|
Pk1
|
Ck2
|
Pf2a
|
Cf1b
|
|
Pk3
|
|
|
|
|
Pk3
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Parent Source Table
|
|
Child Source Table
|
|
Target Table
|
Figure 4.
Example case for Hierarchical Flat File Scan
·
Merge flat file scan: This method
scans the source flat files based on a common key among them, specified using a
‘Relation List’, and forms groups of records from each source file that
contains the common key value. The source files need to be sorted on the common
key specified. The common key can be composed of multiple fields. This method
has been by far the most commonly selected method for implementing data
migration maps.
Figure 5. shows two source tables which are in a common key relationship and
how they are mapped to a target table through merge flat file scan:
|
K1
|
K2
|
F1
|
F2
|
|
K1
|
F1
|
K2
|
F2
|
|
TK1
|
TK2
|
TF1
|
TF2
|
|
Pk1
|
Pk2
|
F1a
|
F2a
|
|
Pk1
|
F1x
|
Pk2
|
F2x
|
|
Pk1
|
Pk2
|
TF1a
|
TF2a
|
|
Pk1
|
Pk2
|
F1b
|
F2b
|
+
|
Pk3
|
F1y
|
Pk4
|
F2y
|
->
|
Pk1
|
Pk2
|
TF1b
|
TF2b
|
|
Pk3
|
Pk4
|
|
|
|
|
|
|
|
|
Pk1
|
Pk2
|
TF1c
|
TF2c
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Source Table 1
|
|
Source Table 2
|
|
Target Table
|
Figure 5.
Example case for Merge Flat File Scan
The target fields TF1, TF2, in the above example, are
formed based on the content of the record groups of Source Table 1 and Source
Table 2 which share a common key and don’t have a direct, one-to-one
relationship with any of the source table fields.
·
Query table scan: In this method, first a SQL-defined ‘view query’ is defined
on top of source tables, then the view query is executed within the database
and the result set is used to feed the scan mechanism, one record per scan
iteration. The ‘view query’ may either be statically formed as complex as
possible and stored within the Map Descriptor, or it can be generated
on–the-fly by the Map Application using the relation list associated with the
source tables as “join conditions”. Thus, this method fetches its source data
from Oracle tables, rather than the source flat files. Even though it is, in
general, slower than flat file scans, it is quite a powerful way of allowing
very complex source table scans. The Map application uses Dynamic-SQL and
cursor mechanisms to bring results from the database per scan iteration.
·
Customized file/table scan: The
user who defines the Map Descriptor may choose not to employ any of the
pre-defined access methods, but provide his/her own scan functions. The Map
Application System then uses these user-provided functions (coded in a
high-level language such as C) to scan source components.
Each Map Descriptor contains at least one Map list. Each Map list, in turn,
consists of a list of Map items, which represents a field-by-field forming of
target components. In other words, each Map item corresponds to one field in
one of the target components and tells the Map Application System how to
generate a value for that field
during each iteration
of the above mentioned scans. A Map item, for example, may indicate that the
corresponding target component (flat file or database table) field will have a
NULL value, or a Constant value or will receive its value directly from one of
the source component fields, which is called ‘simple mapping’. It may be
sometimes necessary to attach to a Map item a Map function, whose purpose is to
evaluate the target field value using the current value content of one or more
source components. This is called ‘complex mapping’. The associated Map
function, which is written in a high-level language such as C, can be as
complex as necessary but there is no need for the author of the function to
know anything related to the internals of the Map Application. The user writes
his/her Map function using only the logical definitions of the source and
target components, and the pre-defined Map application
function
interfaces.
Map functions are stored in shared Dynamic Link
Libraries (DLLs on Windows and Sharable Images on Vax/VMS) and called by the
Map Application during run-time.
3.4
Map Application Structure and Environment
The Map Application consists of two major,
independent execution units:
-
Map GUI
-
Map Driver
The Map GUI is the front-end
application, running on a PC under Windows, used by developers to create,
update, browse and activate Map Descriptors. Figure 6 shows a screen capture of
the Map GUI. The Map GUI connects to the Map Catalogue stored in Oracle through
an ODBC connection.
The Map Driver is the main
application, which executes an individual data conversion task as defined by a
Map Descriptor. A unique Map
Identifier is passed as the only parameter to the Map Driver. The Driver
fetches the corresponding Map Descriptor from the Map Catalogue and takes all
the necessary actions to perform the data conversion as specified. Any
erroneous cases encountered are reported in map-specific error/log files.
Figure 6.
Screen capture of Map GUI (field-level Map definition)
4
Conclusion
Many commercial enterprises
currently face the challenge of migrating their
legacy business data, stored in navigational databases (hierarchical, network
etc.) to a relational database and undertake big data migration projects. The
duration and success of these projects depend on various factors. Gouden Gids
too undertook such a large-scale data migration project and successfully
migrated all its hierarchical legacy data to a relational database
system. Probably the most
important lesson drawn is that, having a very structured approach to analysis,
design, development and quality aspects, supported by an automated data
migration tool such as the Map Application, is crucial for the success of any
data migration project.
Ali A. Keskin
is a Technical Project Leader at Gouden Gids BV.
He led the Odyssey project and developed the Map application. He received a
B.Sc. degree in Computer Engineering from the
Middle East
Technical
University
and a M.Sc. degree in Computer Science from
University
of
Georgia
.
He had publications in Journal of Object Oriented Programming, International
Journal in Computer Simulation, Proceedings of SE ACM Conference and
Proceedings of International Symposium on Large Spatial Databases. His
interests include Database and Middleware technologies.
Dogan Yilmaz
was Technical Project Manager at
Gouden Gids BV. He led both phases of Odyssey project. He is now working as an
independent consultant. He received a B.Sc. degree in Computer Engineering from
the
Middle East
Technical
University
and an MBA degree from
Brunel
University
.
His interests are IT Service and Project Management, Service Quality
Measurement, and Database technologies.
Timothy K. Payne:
is a Software Analyst/Developer at Gouden Gids
BV. He was a key member of the Odyssey team, participating in all phases of the
project. He received a B.Sc. degree in Physics from the
University
of
Southampton
and a M.Sc. degree in Applied Computer Technology from
Middlesex
University
.
His interests include database design and software engineering techniques.