Don Stanley Sysware Consulting Group Box 634 Wellington NEW ZEALAND
Written MAY 1998. All references to SAS are to Version 6.12 running on Solaris 5.2
Find Out All About Your Data
The new environment
How to migrate data
The Role of SAS
A migration is the movement of bulk data from one source to another. In the context of this paper, the need arose because the client (an insurance company) is replacing its mainframe-based legacy systems with open UNIX/NT-based software and Sybase databases.
A migration may be ‘big bang’, where all possible data is migrated at one time and possibly the legacy system is switched off, or phased, where the migration occurs in well defined chunks over a period of time.
It is important background knowledge when reading this paper to keep in mind that we did not replace the whole legacy system in phase one, that the legacy system is still the master system, and that legacy fields moved to the new system are not yet updateable.
Since the legacy replacement program is phased, and legacy remains the master system, we need to have a way to keep the new system in sync with legacy. This is a common occurrence in new system development, and a part of the new system (designed to stop when legacy replacement completes) is an interface system that updates the new system with legacy changes every night.
The following diagram sums up the migration process that we employed at this client site.
In this paper, the terminology used includes
|Legacy||an existing computer system that is being replaced|
|Insure2000||the name of the legacy replacement environment|
|Sybase||the database used to store data in the new environment|
All references to the SAS product line in this paper refer to version 6.12.
The fundamental problems we face in a data migration are how, what and when we migrate.
|How:||we need to consider questions such as|
|What:||we need to consider questions such as|
|When:||we need to consider questions such as|
Often Legacy systems accumulate unexpected values in fields. This can occur due to holes in the Legacy code or fields not properly initialised. For whatever reason these occur, it is a vital component of data migration that we detect and correct such data to map to current business standards.
We also found a degree of duplication, some people having multiple ‘unique’ personal records. This is no criticism of business processes, issues such as this occur in every site for reasons like
As we detect such issues, they become part of an ‘issues’ register that the business uses to determine what, if anything, to do with such data.
A tool such as SAS, and especially the elementary features such as PROC FREQ and PROC UNIVARIATE, can perform this up front analysis with ease.
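As an illustration, a first pass over a legacy extract might look like the following sketch; the dataset and field names are invented for the example.

```sas
/* First-pass profiling of a legacy extract. LEGACY.PERSON and the
   field names are hypothetical. */
proc freq data=legacy.person;
   tables gender marital_status branch / missing;  /* expose unexpected codes */
run;

proc univariate data=legacy.person;
   var birth_date annual_income;   /* ranges, extremes, missing-value counts */
run;
```

The MISSING option on PROC FREQ is worth the habit: uninitialised fields show up as a category rather than silently dropping out of the counts.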
Although we had to do a lot of work to determine what data is in legacy, that should be kept in perspective. The reasons for this lack of knowledge were largely historical: the client’s systems had been maintained by different software companies over time, and those companies handed over all programs and databases only after some had been put through a re-engineering process. The core data in this client site is well maintained and well understood, but the role of non-core data was a prime concern.
To indicate how much importance was placed on knowing what was in all the legacy database fields, we had a team of 4 SAS developers spending 14 months analysing the data contents of the legacy systems. Without that analysis, neither the business nor the migration developers would have been able to make decisions with any great degree of knowledge of all the legacy data. We would have finished migration much later, or alternatively with less data accuracy. We knew that a vital component of our work was to make data as accurate as possible, and invested that upfront time ensuring it would be.
It is not unusual to find that no one really knows just how much data there is in Legacy systems. Often, you have
Ultimately, this is a business decision. However, in order for the business to be able to decide what to migrate, they first need an intimate knowledge of the data.
Once you have tied the business down to agreeing what is to be migrated, you can then work out just how data will be migrated. This is a crucial piece of knowledge, as it may determine what the best technical migration tool and methodology is.
You may also need to consider whether data that is clean from the legacy perspective is also clean from the new database perspective. A good example: a field looked clean in legacy, but a few records (fewer than 10) somehow contained a high hex value. Sybase stored this value, but upon retrieval the application was unable to work with the data that Sybase gave it. So our cleaning task expanded to looking at individual characters of apparently valid character strings. We were fortunate, given the magnitude of such a task, that a new version of Sybase we were about to install wasn’t impacted by these values, so we were able to ignore them.
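A sketch of the character-level check this forced on us follows. The dataset and field names are invented, and the printable range shown assumes the data has already been converted to ASCII.

```sas
/* Flag records whose apparently valid character field hides a
   non-printable byte. Names are illustrative only. */
data bad_chars;
   set legacy.person;
   badpos = 0;
   do i = 1 to length(surname);
      byte = rank(substr(surname, i, 1));   /* numeric value of the byte */
      if byte < 32 or byte > 126 then do;   /* outside printable ASCII */
         badpos = i;
         leave;
      end;
   end;
   if badpos > 0 then output;
   drop i byte;
run;
```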
Often, Legacy systems have no formal data dictionary or data model. However, there is almost always a technical schema which the legacy database system requires for its own internal use. That schema may range from being an integrated part of the database, to being programmed field layouts in a Cobol program or some other component of an application.
It is a fundamental rule of data migration that you need to know about the data before you can start. The meta-data found in data dictionaries, Cobol copybooks and database schemas is basic to the task. It may partially drive the creation of the new system; for example, the knowledge that a field is 30 bytes long in the legacy system, and a business decision that it should be ‘the same’ in the new system, determines the field length in the new system. Without that meta-data knowledge of the legacy field length, the task is less well defined.
At this site, we had the legacy system in Fujitsu AIM databases. No formal data dictionary existed, but AIM has an integral component called the ADL, which contains the meta-data that AIM requires on all fields. We built a data dictionary on UNIX in Sybase, reading the Fujitsu ADL. Using SAS, we parsed the ADL, and extracted all tables, fields, field descriptions, field lengths, and everything else of interest.
In the SAS job that reads the ADL, we also derived for every legacy field, the SAS informat, and a unique name for the field. This allowed us to use the data dictionary as a source system for generating SAS code to read legacy extracts.
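A simplified sketch of the informat derivation and code generation follows. The dictionary variable names (TYPE, LEN, FIELDNAME) are assumptions for the example, and the real generator handled many more data types.

```sas
/* Derive a SAS informat for each legacy field from its dictionary entry. */
data dict2;
   set dict;                       /* one row per legacy field */
   length informat $16;
   select (type);
      when ('CHAR')   informat = '$ebcdic' || compress(put(len,3.)) || '.';
      when ('PACKED') informat = 's370fpd' || compress(put(len,3.)) || '.';
      otherwise       informat = compress(put(len,3.)) || '.';
   end;
run;

/* Generate an INPUT statement into a member that migration jobs %INCLUDE. */
data _null_;
   set dict2 end=eof;
   file 'read_extract.sas';
   if _n_ = 1 then put 'input';
   put @4 fieldname @36 informat;
   if eof then put ';';
run;
```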
This short discussion doesn’t really do the custom-built dictionary justice; it is the pivotal part of the migration process, being used by many business groups and development teams. Its usefulness to us in developing migration tools was immense: we used it as a source system to build code generators that extracted data and loaded it into SAS, all with no coding by the person doing the migration. This allowed developers to concentrate on the real issues of data translation and cleaning, rather than data extract input.
Given that the new environment starts from scratch, there is no excuse for not building comprehensive data models of that database system. Whilst the legacy system may need to be modelled retrospectively (or reverse engineered, if appropriate tools exist -- they did not for our Fujitsu systems) from existing technical structures, the new system demands a data model, probably even before the new database system is available.
Data models are rarely dependent upon software. It is well worth (as this client did) starting the modelling as early as possible. This helps everybody, as there is more time to consider legacy to new system mappings and to consider how best to approach the migration task.
You may buy yourself problems if you buy a database first (often a lengthy affair) and only consider data models afterwards. There is often pressure to show that an expensive database has a quick payback; starting the modelling early gives a better guarantee of that payback.
This question takes on immense importance. If the requirement is to simply take from legacy and give to the new database, with a one to one field mapping, the task is really quite simple.
However, in practical terms it is far more likely that there will be modifications to field values, or new fields in the new system. These fields may be
When you need to add or modify fields, the migration task changes. You need a migration product that gives a high degree of manipulative ability. An advanced SQL product may do the trick. SAS will usually do the job far more easily with its advanced data step and macro facilities, rich function set, and integrated procedures such as SORT.
Again this may drive the choice of migration software. The situation here is closely related to that in 3.2, except that here we need to modify fields. An example from this client is an ethnicity field that used home-grown data values in legacy but was required to use an international standard in the new system. Rather than clean the legacy system, which would have required custom-built Cobol programs and a change to legacy, we built the translation into the migration. This also gave extra flexibility: if the business wanted a further change to the field, we could handle it more easily in the migration than the legacy programmers could.
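Such a translation is a natural fit for PROC FORMAT. The code values below are invented to illustrate the shape of the solution, not the actual standard.

```sas
/* Translate a home-grown legacy code to a standard code during migration.
   All code values here are illustrative. */
proc format;
   value $ethnic
      'A'   = '11'
      'B'   = '12'
      other = '99';   /* unknown / not yet mapped */
run;

data stage.person;
   set work.person_extract;
   ethnicity_std = put(ethnicity, $ethnic.);
run;
```

A further business change to the mapping then touches only the format, not the migration logic.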
Incidentally, change to legacy may be frowned upon during a period of writing a new application as the development focus and resource tends to be away from legacy. We found it much easier to make changes in the migration tools we were writing. When doing a major re-development, a stable and static legacy system reduces business risk.
The initial intended approach was a phased approach. We intended to base the phases (chunks of data) on records types, which neatly divided into PERSON data, CLAIM data, EMPLOYER data, PAYMENT data and REFERENCE tables.
It was intended to migrate over a five week period. This was possible since the client was not relinquishing the legacy system, and indeed this would be the master system for the short term. We wrote an ongoing interface (using SAS and C++) that runs daily to transfer changes from legacy to new.
However, as we started to do our chunk testing it was recognised that this 5 week approach was not going to be tenable. We needed all the data available for the business to successfully cut over to the new system. We still wrote all our migration software based on the chunks above, but used a big bang over a 5 day period. So we
For the business, this meant they saw no down time on the legacy system, which is to remain the master system for the short term. The reason it remains the master is that the work done thus far is only the first phase; there is still a lot of data to be migrated and application functionality to be written to cope with the remainder of legacy.
|Advantages||It gets the job done in one go. We knew exactly when the job would start and when it was finished. Certain risks, such as machine down time were minimised because we did the job in a shorter time span. Perhaps of most benefit to the project team, it meant the entire job was finished by a certain date, rather than migration continuing for longer.|
|Disadvantages||If something goes wrong you may need to re-do the whole job or a large component of it; the business impact of this is critical and clearly not acceptable in any site. The business may also have down time on computer systems, because the load may take so long that they cannot start to use the new system, and you normally don’t want operational data changes to legacy after migrating. However, this can be overcome to some extent by doing a second load later to pick up any changes, or by an ongoing daily interface to migrate changed records. We adopted the interface approach.|
|Advantages||Can move smaller chunks of data, but more often. This allows test data to begin filtering through, and permits Legacy to continue with minimal down time. It also allows the new system to be phased in, for example some branches may come on-line sooner than others.
From a software point of view this may be easier to manage. It is not necessary to start and end all coding at the same time.|
|Disadvantages||It may be necessary to be able to update data on legacy that already moved. An example is a branch by branch approach where a record is migrated, but then transferred to an un-migrated branch. This gets very complex, and likely involves an unwanted reverse migration back to legacy.|
In the big bang approach, we simply shut down the legacy system on a certain date and cut across to the new system. Don’t mess around, just get rid of it. While we certainly considered this approach it was just not feasible. For a start the software developers on the new application were not 100% complete, which meant testing was not complete. To shut down any component of legacy in that situation would have been less than prudent.
Also, at no stage was this first phase development intended to be replacing legacy. It replaced part of legacy, and gave some new functionality, but it only attempted to replace about 30% of the legacy system. Thus, shutting down legacy was rejected early on, and a conscious decision was made that the legacy system would remain the master system until completion of all legacy replacement.
As we did not shut down legacy but ran in tandem with Insure2000, we needed a mechanism to migrate changes to legacy across to the new system. We modified the legacy system so that every time a change was recorded on the legacy system (to the tables we had migrated) the change was also written to an incremental change table. At the end of each day this table was sent from Fujitsu to UNIX, where a SAS job picked it up and applied various translations, cleansing, and checks. The SAS job wrote to a holding table. Then a C++ process kicked in and applied application business objects to the data to mimic data coming in through the application.
An interesting fact was that application developers needed databases populated for their testing and they needed that data weeks before our migration tools and code were ready. The paradox of course is that until we finished our code we could not give them the data.
In the end we were able to give them data, but prototyped data for many fields. If a field was in a one to one mapping from legacy to new they got real data. If not we gave them prototype data. The reasons for this prototyping are that we did not at this early stage have complete definitions of all data, but we knew for example that “field x” would be on the database and would be numeric. Thus we invented values for “field x”, which permitted the developers to be able to write code that addressed “field x”. Such techniques, also known as dummy data, are a standard accepted mechanism for most new system development, and invariably will be needed in this sort of situation.
Fortunately, our code generators from the data dictionary were near complete, and this was an excellent test of them. It was largely due to the data dictionary and code generators that we were able to read legacy extracts and pump them through to Sybase very quickly. In fact, it took only minutes for us to generate a legacy extract and the SAS code to read it.
5.1. Testing can be logistically hard work
Testing our own migration processes was difficult. We needed mainframe disk space and 5-6 days of mainframe and UNIX processing for each full test. Thus, our testing had to interfere with normal operations and other software development. We lived with this, but every time we tested, even at a chunk level which (theoretically) required less resource, we had to beg, steal, and borrow space. I say theoretically because, in practice, we all tested chunks at similar times and so required a lot of resource anyway.
A good piece of advice: start early on justifying and ensuring you have lots of usable disk space. It makes everyone’s life easier not to have to ‘negotiate’ with other users to do testing, especially since that negotiation almost always conflicts with other important work. We were fortunate that we had done this and that plenty of DASD was available for us.
Our migration process testing consisted of three major parts
The SAS only part occurs when a chunk developer wants to test code they have written. We almost always do this against existing extracts. We structured our programs so that a single chunk load was physically several programs split at logical points. Thus we could test by starting the process part way through using data saved from earlier runs.
The extract and SAS part comes when we have to test the process from when ops submit jobs on the legacy system to extract right through to SAS updating a holding area Sybase table. This is a vital test, because it is out of the developers hands and runs as if these two components were in a production mode. We ran these tests overnight and over weekends. They also gave us an indication of how long the final load would take. These all worked well, but resulted in process refinements and code refinements as we identified bottlenecks. We had expected this, and would have been worried had the tests revealed perfection!
Finally, testing of the extract, SAS and update components. The update component is a purely Sybase task. For speed we wrote to temporary Sybase tables with no indexes or constraints. We found loading to such tables, then having Sybase do the real load into the application database, much quicker than SAS loading straight to the application database. At any rate, a business decision was that only Sybase or C++ processes could update the new database.
As the project progressed we found it necessary to do tests against full data every time. We had initially envisaged cut-down databases for testing, but we couldn’t guarantee all necessary data items would be available. In addition, a vital test was to try to reduce the elapsed time of the migration, which cut-down databases do not help with. Field translations turned out to be constantly changing: we needed 8 full runs of 125 million records, plus testing of individual chunks. The target database structure was also constantly changing, which meant we had to write code that could easily cope with the removal and addition of fields.
Why did we use SAS? For one thing it was licensed and available. For another, its analytical and reporting ability, plus the ability to seamlessly read EBCDIC data on an ASCII machine, made it a natural choice. Readily available SAS knowledge and a willingness to train staff also contributed. As much as anything, its ability to read and write Sybase tables was a telling factor -- as well as a source of frustration where it did not fully support Sybase structures.
SAS does not run on Fujitsu and could not directly access the Fujitsu databases. We had to use a mainframe query tool supplied by Fujitsu to extract data to a flat file, from where we either FTP’ed to UNIX or read it direct using the SAS FTP engine from UNIX.
Having the ability to read EBCDIC data directly on the ASCII UNIX machine simplified the task greatly. This was especially true for packed numeric fields from the mainframe extract tool.
We also were able to use FTP from SAS datasteps to treat remote flat files on the Fujitsu as if they were a standard input source. This reduced overheads with FTP.
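In outline, the technique looks like this. The host name, dataset name, and record layout are placeholders, and the exact FTP options depend on the site’s transfer setup.

```sas
/* Read a legacy extract in place on the remote host via the FTP access
   method; a binary transfer preserves EBCDIC and packed-decimal fields. */
filename legacy ftp 'PROD.EXTRACT.PERSON'
   host='fujitsu.example.com' user='migr' prompt
   binary lrecl=350;

data work.person_extract;
   infile legacy;
   input surname  $ebcdic30.
         branch   $ebcdic4.
         balance  s370fpd6.;   /* IBM-style packed decimal, read on ASCII UNIX */
run;
```

The EBCDIC and S370Fxx informats do the character-set and packed-decimal conversion in the data step, so no separate translation pass over the file is needed.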
For more detail of the FTP processes we developed, and also some of the data analysis techniques we worked with, see Don Stanley, “Using FTP, Views, and PROC SUMMARY to Analyse Large Databases” (invited paper), Paper 73, SUGI 22 Conference Proceedings, 1997.
We used SAS/ACCESS to Sybase to do this. We found many issues with this software (some of which would not be issues without 7 programmers using it). The biggest issues we struck were undoubtedly
But many other issues were struck, and one of our utility macros, called Sybload, was written to provide a simple one line interface to dbload. This macro did many tasks, including interrogating Sybase to work out optimal sorting before loading. It streamlined DBLOAD and certainly masked its complexities and quirks from our developers.
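A stripped-down sketch of such a wrapper is shown below. The real Sybload also interrogated Sybase for index columns and pre-sorted the data before loading; the connection details here are placeholders.

```sas
%macro sybload(data=, table=, db=);
   /* One-line interface to PROC DBLOAD for our developers. */
   proc dbload dbms=sybase data=&data;
      server   = 'SYBPROD';     /* placeholder connection details */
      database = "&db";
      user     = 'migr';
      password = 'xxxxxxxx';
      table    = &table;
      limit    = 0;             /* load all observations */
      commit   = 1000;          /* commit in batches */
      load;
   run;
%mend sybload;

%sybload(data=work.person_clean, table=person_hold, db=insure2000)
```

Centralising the DBLOAD call like this also meant that when we hit a DBLOAD quirk, the fix went into one macro rather than into every developer’s program.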
Data step, macro, proc steps, SAS/ACCESS, the rich function set, and FRAME to front-end tools all helped make potentially complex analysis and coding simpler.
We were able to easily write macros to convert text strings to mixed case, and also to carry out table referential integrity checks before loading to Sybase.
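Both are short in SAS. The mixed-case macro below is a simplified single-word version (Version 6 has no built-in proper-case function), and the referential-integrity check is sketched with invented table and key names.

```sas
/* Mixed case: lowercase the string, then uppercase its first letter. */
%macro mixcase(var);
   &var = lowcase(&var);
   substr(&var, 1, 1) = upcase(substr(&var, 1, 1));
%mend mixcase;

/* Referential integrity before loading: every claim must reference
   an existing person. Table and key names are illustrative. */
proc sql;
   create table ri_fail as
      select *
      from stage.claim
      where person_id not in (select person_id from stage.person);
quit;
```

Catching RI failures in SAS, before the Sybase load, was far cheaper than having the load itself reject rows.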
If your migration jobs are structured this way it can save time on re-starts, as you don’t need to re-run already completed modules. It also aids testing by allowing a modular approach. This heading is really a reminder of good programming practice: modularity aids development, testing, and production troubleshooting.
You need to have plenty of checks in code for unusual conditions, and write information to the log liberally. This of course assumes that the business wants cleaner data in the new system than the old.
It is really useful to be able to determine ahead of time records that may violate rules in the new system. To do this you need to be able to identify records and write messages to the log – or to an external message file. If you identify such records in testing, the production run becomes much smoother. The last thing we needed was for the Sybase DBA to do the final copy to Sybase and find data that violated rules and referential integrity constraints.
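The pattern is simple: route suspect records to a holding dataset and write a message to an external file as you go. The dataset, file, and field names below are illustrative.

```sas
/* Split good records from rule violators, logging each violation. */
data stage.payment suspect;
   set work.payment_extract;
   file 'migrate_messages.txt' mod;   /* append to the run's message file */
   if person_id = . then do;
      put 'Payment held back, no person key: ' payment_id= ;
      output suspect;
   end;
   else output stage.payment;
run;
```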
The SASLOG should be reviewed by someone other than the developer. Apparently trivial issues such as numeric-to-character conversions should be carefully scrutinised. In our situation these did cause a problem in one chunk: they occurred whilst setting up a CNTLIN dataset for PROC FORMAT, which caused the ‘other’ category to not exist.
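The pitfall is worth showing. A CNTLIN dataset only yields an ‘other’ category if it contains an explicit HLO='O' row, and building the START variable from a numeric field with PUT invites exactly the kind of silent conversion issue that bit us. Variable names below are illustrative.

```sas
/* Build a numeric format from a code table, including an 'other' row. */
data fmt;
   set codes end=eof;            /* CODE (numeric), STDVAL (character) */
   retain fmtname 'std' type 'n';
   start = code;                 /* keep START numeric; no conversion */
   label = stdval;
   output;
   if eof then do;
      hlo   = 'O';               /* without this row, 'other' does not exist */
      label = 'UNK';
      output;
   end;
run;

proc format cntlin=fmt;
run;
```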
We did have one major problem that was a direct consequence of having to make changes to business rules near production time. Less than a week from production we had to make a change to one chunk. The test SASLOG was not thoroughly checked immediately -- not until after the production run had started. Unfortunately there was an error that prevented a table being migrated, and by the time it was noted in the test SASLOG it had already impacted the production run. We were modifying code right up to two days before migration, and this led to a rush near migration time.
People always want migration done faster. That usually shows a lack of awareness of what a data migration must do. The quicker you do the migration, and the more shortcuts you take (especially in testing), the greater the chance of errors or omissions in the migrated data. You also have to make more assumptions, and in a dynamic business environment these may be tested and rejected further down the Insure2000 programme.
You can only push data around up to a particular speed if you’re hardware constrained. Data transfer speed between machines can become a major and emotive issue which is usually independent of migration software, but which can cloud the real job of that migration software. It can become important to get fast lines between machines if possible, and again as with disk space, this needs to be considered early in the process.
We did a successful migration from 50 Fujitsu AIM tables to 118 Sybase tables, encompassing 125 million records and many coded data translations.
The final job took 5 days from Fujitsu extract through to final target Sybase load.
We did 14 months of data analysis and 6 months of development.
10 people worked on the migration, including 6 SAS programmers, QA testers, a data modeller, and data analysts.
We had one error occur in the loads, directly related to an unavoidable late change in business requirements that resulted in the test logs not being analysed in time. To minimise the likelihood of this happening again, a SASLOG analyser tool has been written.
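In essence such a tool scans a saved SASLOG for lines a reviewer must see. The filename below is a placeholder, and the real analyser checks more patterns than this sketch shows.

```sas
/* Pull the lines of interest out of a saved log. */
data logscan;
   infile 'chunk1.log' length=len;
   input line $varying200. len;
   if index(line, 'ERROR') or
      index(line, 'WARNING') or
      index(line, 'uninitialized') or
      index(line, 'converted to') then output;  /* numeric/character notes */
run;

proc print data=logscan;
run;
```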
Other SAS tools we wrote include