Many large organisations are faced with challenging strategic objectives in order to stay ahead of the game in their respective industries – they need to constantly improve service offerings, reduce cost, gain competitive advantage and attract new customers.
Companies are often held hostage by disparate legacy applications and business processes that may have contributed to their success in the past, but have since reached capacity and turned into a liability, hindering them from achieving these objectives.
To overcome these hurdles, Business and IT Departments are often tasked with identifying new applications that leverage new technologies and business processes that can be an enabler to the organisation and help them achieve their strategic objectives.
In the quest to address these issues and drive business forward, organisations are proactively seeking strategies to ensure that success. To that end, many are adopting migration as a strategy to resolve their business and IT challenges. When moving from a legacy application (source) to a new application (target), a Data Migration approach is preferred. To stay ahead of the game, organisations should maximise the value of their data.
There are many architectural options when approaching a Data Migration project, such as the direct migration of information from source to target, or the use of a staging area to enrich organisational data from source to target.
Why Stage Data?
Staging gives a business the opportunity to maximise the value of its data before it is migrated to the target system, as there may be limited opportunity to improve the data once it is loaded. The idea is to make the data as useful as possible by the time it reaches the target system.
It makes sense to stage data before migrating to the target system, because you can:
- Rectify, standardise, enrich and/or harmonise source system data
- Align source system data format and quality to target system requirements
- Consolidate multiple legacy source systems to a single target system
- Audit and reconcile migrated data
Data Migration of legacy systems with a poor maintenance history and a siloed existence warrants the use of a staging area.
A multinational utility company needed a Data Migration approach to consolidate hard and soft copy documents (along with metadata) from its Legacy systems to re-constitute its design base within a specialised Engineering Tool. These Legacy systems were commissioned over the life of the company, with multiple data owners operating in departmental silos, coupled with poor maintenance history. This had resulted in data duplication and poor data quality. In contrast, the target system had strict business rules, schemas and mandatory fields that the source system couldn’t cater for.
A direct integration between the source and target was not possible without data cleaning and standardisation. A staging environment was chosen to remedy these shortfalls before migrating the data to the target environment. This strategy had to be repeatable across its 16 regional installations.
What is staging?
The staging area is an interim data source into which data, documents and supporting metadata from single or multiple source systems are extracted, profiled, cleaned and loaded. The staged data can be analysed by subject matter experts, to formulate business rules that can be applied to enrich and transform data to satisfy target system requirements. The finalised data is verified by subject matter experts for integrity and accuracy. These experts can approve or reject the migration to the target system. Migration auditing, configuration management and reconciliation are value added services provided by the staging area.
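The staging flow described above can be sketched in code. The following is a minimal, illustrative sketch only; the function names, rules and record shapes are assumptions for demonstration, not a specific product API:

```python
# Minimal sketch of a staging-area flow: extract, enrich with business rules,
# then have approvals gate what reaches the target. All names are illustrative.

def extract_source(records):
    """Extract raw legacy records into the staging structure."""
    return [dict(r) for r in records]  # copy so the source stays untouched

def enrich(staged, rules):
    """Apply business rules formulated by subject matter experts."""
    for record in staged:
        for field, rule in rules.items():
            record[field] = rule(record.get(field))
    return staged

def verify(staged, mandatory):
    """Only records satisfying target mandatory fields are approved."""
    approved = [r for r in staged if all(r.get(f) for f in mandatory)]
    rejected = [r for r in staged if r not in approved]
    return approved, rejected

# Usage: one record is enriched and approved; one is rejected for a
# missing mandatory field, so it never reaches the target system.
source = [{"id": "1", "name": " alice "}, {"id": "2", "name": ""}]
staged = enrich(extract_source(source),
                {"name": lambda v: (v or "").strip().title()})
approved, rejected = verify(staged, mandatory=["id", "name"])
print(len(approved), len(rejected))  # 1 approved, 1 rejected
```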
Data Migration Phases
Data Migration is performed in four primary phases:
Planning for the staging environment starts with the Data Migration Planning Phase. The Staging Phase starts with Data Quality Validation and Analysis & Design, and ends with the Data Migration Implementation Phase.
1) Data Migration Planning
A good plan is like a road map: it shows the final destination and usually the best way to get there.
H. Stanley Judd
Thorough planning is the foundation for consistent success in any process, and Data Migration is no exception. The following guiding principles, along with risk and/or critical success factors, are applied when planning a Data Migration implementation:
- Minimise project risk: this is achieved through stringent planning and execution, as well as multiple test runs, to ensure that risks are identified and mitigated prior to going live
- Ensure Conversion Data Accuracy: this is achieved through data reconciliations and testing
- Minimise business interruptions: this is achieved through stringent planning, risk and issues identification and mitigation
- Tie-back to Objectives: ensures that all activities and/or specifications detailed in a document are linked directly to the objective of the document
- Stable and repeatable process: ensures that the migration approach and process is stable and repeatable for all future data migrations
- Approved approach: ensures that the approach is approved by the relevant stakeholders involved
- Achievable: ensures that the approach and adopted process are achievable within the given time frames
The following Data Migration Planning activities will directly influence the planning of a Staging Area:
a) Determining Data Migration Technology:
The decision to choose a particular Data Migration Technology is often influenced by the following factors:
- Ensuring scalability and throughput of the tool to support varying volumes of data
- Realising investments by re-using licensed tools, rather than investing in new tools
- Providing reporting, reconciliation, configuration and auditing capabilities
- Providing an interface and data repository to configure and store the data dictionary, as well as business and transformation rules
- Providing analysis capabilities, such as data profiling and data mining
The following two common data integration tools and practices can be applied to Data Migration:
a) ETL (Extract Transform and Load): In a more traditional data migration scenario, where data is to be migrated from source to target system with extraction, transformation and loading requirements, a vendor-specific ETL tool is preferred. Users prefer ETL for its unique ability to handle the extreme requirements of data migration, including terabyte-scale datasets, multi-pass data transformations, deep data profiling, interoperability with data quality tools, and many-to-many data integration capabilities.
b) Hybrid Staging Solution: With today’s complex Data Migration requirements, best practices recommend using a hybrid solution rather than a single tool to migrate data. The hybrid solution will consist of components that are specialised to support various stages of the migration process.
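As an illustration of the ETL pattern in (a), a minimal extract-transform-load pass might look like the sketch below. The table name, record fields and the currency-code fix are illustrative assumptions, not a vendor tool's API:

```python
# Illustrative ETL pass: rows are extracted from the source, transformed
# in flight (here, standardising currency codes), and loaded to the target.
import sqlite3

def etl(source_rows, conn):
    # Transform: normalise non-standard currency abbreviations during the pass
    fixes = {"ZA": "ZAR"}
    transformed = [(r["id"], fixes.get(r["ccy"], r["ccy"])) for r in source_rows]
    # Load: write the transformed rows to the target store
    conn.execute("CREATE TABLE IF NOT EXISTS target (id TEXT, ccy TEXT)")
    conn.executemany("INSERT INTO target VALUES (?, ?)", transformed)
    return len(transformed)

# Usage: two rows loaded; the non-standard "ZA" is standardised to "ZAR".
conn = sqlite3.connect(":memory:")
n = etl([{"id": "1", "ccy": "ZA"}, {"id": "2", "ccy": "USD"}], conn)
print(n, conn.execute("SELECT ccy FROM target WHERE id='1'").fetchone()[0])
```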
Best practices recommend the use of Hybrid Solutions for Data Migration Projects with complex requirements.
A multinational utility company required a data migration solution that could collect and track physical and electronic documents stored in various locations: a paper-based library, file servers, PCs and Legacy Document Management Systems. Document content was to be searched and analysed to extract supporting metadata required by the target system.
Duplicate and non-technical documents or files were to be identified and excluded from loading to the target system. The finalised data set (documents and supporting metadata) was to be loaded to the target system, which could then be reconciled and audited, post-migration.
The chosen vendor adopted a Hybrid Solution to meet this complex requirement, comprising technologies such as scanning and bar-coding, OCR, hashing algorithms, full-text indexing, a document-based data model and business process automation tools.
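One of the hybrid components mentioned above, hashing, is commonly used to flag duplicate documents so they are excluded from loading. A minimal sketch, with illustrative document names and content:

```python
# Duplicate detection via content hashing: documents with identical content
# produce the same SHA-256 digest and are flagged as duplicates.
import hashlib

def dedupe(documents):
    """Return (unique, duplicates) given (name, content) pairs."""
    seen, unique, duplicates = set(), [], []
    for name, content in documents:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append(name)
        else:
            seen.add(digest)
            unique.append(name)
    return unique, duplicates

# Usage: the renamed copy carries identical content, so it is flagged
# and excluded from loading to the target system.
docs = [("spec_v1.pdf", "pump layout"),
        ("copy_of_spec.pdf", "pump layout"),
        ("drawing.dwg", "valve detail")]
unique, dupes = dedupe(docs)
print(unique, dupes)  # two unique documents, one duplicate flagged
```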
b) Determining the Conversion Plan:
Migration requirements may require a change to the legacy data during the migration process. The changes may be to form, value, or volume. The Project Manager should plan and schedule workshops between the staging area’s business analyst and data stewards, to define and form data conversion rules between the source and target systems. These rules can be stored in configurable data dictionaries within the staging area. The following sample requirements can trigger data conversions:
- Conformance to target data requirements
- Translation of source data values to target data values
- Enhanced data types and/or formats in newer technology
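Conversion rules of this kind can be held as configurable entries, one per field, and applied uniformly during migration. The field names, codes and date format below are hypothetical examples keyed to the three triggers above:

```python
# Hypothetical conversion rules: value translation (gender codes),
# format enhancement (date layout) and conformance (name casing).
from datetime import datetime

RULES = {
    "gender": lambda v: {"1": "M", "2": "F"}.get(v, v),
    "join_date": lambda v: datetime.strptime(v, "%d/%m/%Y").date().isoformat(),
    "surname": lambda v: v.strip().upper(),
}

def convert(record):
    """Apply each field's conversion rule; pass unknown fields through."""
    return {field: RULES.get(field, lambda v: v)(value)
            for field, value in record.items()}

# Usage: a legacy record converted to target form.
print(convert({"gender": "1", "join_date": "03/04/2009", "surname": " smith "}))
# {'gender': 'M', 'join_date': '2009-04-03', 'surname': 'SMITH'}
```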
c) Determining the Integration Plan
Data migration may require drawing data from more than one legacy data source. The Integration Plan describes how conflicts and duplication among multiple source data and data structures will be resolved. The plan also determines how to move the data from the source system(s) to the target system.
There are two options:
- Load the data sources into the Staging Area sequentially, until all source data has been loaded. Then perform the integration of all source data in the Staging Area. Finally, move the integrated data to the target data store.
- In some cases the volume of data or time restrictions may not support the above option, and this may result in sequential individual data migrations (one for each source system). The staging area could serve as an integration environment, to simulate loading the new data set into an environment already populated with operational data.
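The first option, sequential loading followed by integration in the staging area, can be sketched as below. The conflict-resolution policy (prefer the most recently updated record) is an illustrative assumption; the actual policy belongs in the Integration Plan:

```python
# Sketch: two legacy sources are loaded sequentially into the staging area;
# records sharing a key are resolved by keeping the most recently updated one.
def integrate(*sources):
    staged = {}
    for source in sources:
        for rec in source:
            key = rec["id"]
            if key not in staged or rec["updated"] > staged[key]["updated"]:
                staged[key] = rec
    return staged

# Usage: C1 exists in both sources; the newer version from legacy_b wins.
legacy_a = [{"id": "C1", "name": "Acme", "updated": "2010-01-01"}]
legacy_b = [{"id": "C1", "name": "Acme Ltd", "updated": "2012-06-01"},
            {"id": "C2", "name": "Bolt", "updated": "2011-03-01"}]
staged = integrate(legacy_a, legacy_b)
print(len(staged), staged["C1"]["name"])  # 2 records; C1 resolved to "Acme Ltd"
```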
d) Developing the Migration Data Quality Plan
Planning for data quality is essential for managing migration timelines within a staging environment. It is recommended to use a data profiling tool to profile the quality of, and relationships between, data at source. A defect management process and supporting tools should be used to log data quality issues, which can be channelled to the relevant team or person for resolution.
Weekly status meetings can be held to track the resolution progress and status of defects. In addition, weekly reports can be sent out by the defects coordinator, providing the priorities, statuses and age analysis of the defects.
Data loss tolerance factors should be defined, as obsolete data structures and formats often don’t translate 100% into the target system. The Project Management team should consult with business stakeholders to determine the tolerance level for data loss. This will contribute to the remediation (what to do if a level of data loss is unavoidable) and the validation procedures (measuring whether the migration is successful).
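The tolerance check described above reduces to a simple calculation against the threshold agreed with business stakeholders. The 2% figure below is purely illustrative:

```python
# Sketch of a data loss tolerance check: measure the loss ratio against the
# agreed threshold and decide between proceeding and remediation.
def check_loss(source_count, migrated_count, tolerance=0.02):
    """Return (loss_ratio, within_tolerance). tolerance=0.02 is illustrative."""
    loss = (source_count - migrated_count) / source_count
    return loss, loss <= tolerance

# Usage: 900 of 100,000 records could not be translated to the target.
loss, within_tolerance = check_loss(source_count=100_000, migrated_count=99_100)
print(f"{loss:.1%} loss, proceed={within_tolerance}")  # 0.9% loss, proceed=True
```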
e) Planning for Simulations
Depending on the scope, complexity and timelines of the migration, mock runs, simulations and dress rehearsals need to be included in the migration plan. These will mitigate data quality, transformation and conversion risks before the Migration Go-Live.
f) Planning for Switchover
The main consideration in the scheduling of the data migration is the conversion window allowed for on the date of the migration. This should be defined up front and targeted throughout the data migration process to ensure that it is achievable. This will be tested for in the various simulations.
The date and duration of Migration will be chosen to minimise downtime (next to zero-downtime) and disruptions of service delivery to customers.
g) Planning for Data Reconciliation and Auditing
The success or failure of the migration can be judged by reconciling the integrity, quality, accuracy and usability of the migrated data against the staged data. Requirements for reconciliation reports can be drafted during the planning phase, which can be developed and produced by the Technical Migration Team (Staging and Target) at crucial migration checkpoints. It is advisable to define checkpoints after crucial migration activities (e.g. extraction, enrichment and transformation) to verify that data integrity and quality isn’t compromised prior to final migration.
Subject Matter Experts and Business Stakeholders should review and audit the reconciliation reports so that an informed decision can be made to sign-off or roll-back the migration.
2) Staging Analysis & Design Activities
Analysis activities are initially performed during the Pre-Migration phase. The objective of these analysis activities is to understand the source system data, identify and cleanse data-quality issues (at source), and formulate mapping and transformation rules that will be used during the Migration. Analysis efforts will have to be re-visited on mock runs as new data is received. The documented results of the analysis become a resource for the design of the staging and target architectures.
The Analysis and Design activities should aim to produce the following Data Set qualities for implementation:
- Usable data sets: The finalised data is in a format that is usable for the implementation
- Complete data sets: All the data that is necessary for the implementation is available
- Accurate data sets: The data represents the correct real world situation and can be reconciled and signed off
- Correct data sets: The correct versions of the data sets are available and mapped to the correct destination
To achieve the above objectives, a sequence of tasks and sub-tasks (repeatable and achievable across migration runs) should be modelled in a Detailed Data Migration Process. While not every Analysis and Design activity will necessarily occur for every migration, a thorough Data Migration Plan should include either all requirements for the Analysis and Design activities or justifications for not including a particular activity.
a) Analyse the Current Environment
The Functional and Technical Migration Team (in collaboration with the source system team) reviews all compiled artefacts related to the legacy environment to fully understand the IT environment in place, and any constraints or technical limitations. It is important to determine remedies for these limitations in the new target system. This analysis can be performed during the Pre-Migration phase; the results can feed into the Project/Migration planning phase.
b) Extract Legacy Data
Extract the legacy source data while maintaining its integrity and relationships. The data should be extracted in a format that can be loaded into the staging environment, with zero to minimal disruption to business operations.
c) Data Profiling
Data Quality refers to the reliability and effectiveness of data. The following types of potential data quality issues each require a different set of remediation activities:
- Non-compliance with the metadata structure, such as an ID number data field containing alpha instead of numeric characters, or currency abbreviations that are non-compliant with standard abbreviations (such as “ZA” instead of “ZAR” for South Africa)
- Bad or corrupted data in the database (multiple records for the “same” person, mandatory data fields left blank, special characters etc.)
Data Profiling is an initial assessment of source data to understand and determine any quality challenges. Data profiling benefits include:
- Uncovering relationships that exist between different data elements
- Reducing data quality issues that pop up in testing and deployment
- Discovering sensitive data that may need to be masked
- Monitoring data quality on an on-going basis to support data governance
Data Quality issues experienced during data profiling should be assigned to the respective data steward using the defect management process outlined in the Data Quality Plan.
When possible, users should profile with a vendor tool to get greater accuracy, repeatability, and productivity.
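A vendor profiling tool is preferred, but the core of a profiling pass is straightforward: per-field null counts and pattern compliance, as in the ID-number and currency-code examples above. The field names and patterns below are illustrative:

```python
# Minimal profiling pass over staged records: count nulls per field and
# flag values that fail an expected pattern (metadata non-compliance).
import re

def profile(records, patterns):
    report = {}
    for field in {f for r in records for f in r}:
        values = [r.get(field) for r in records]
        nulls = sum(1 for v in values if v in (None, ""))
        pattern = patterns.get(field)
        invalid = sum(1 for v in values
                      if v and pattern and not re.fullmatch(pattern, v))
        report[field] = {"nulls": nulls, "invalid": invalid}
    return report

# Usage: a 13-digit ID-number rule and a 3-letter currency-code rule.
records = [{"id_number": "8001015009087", "ccy": "ZAR"},
           {"id_number": "A12345", "ccy": "ZA"},
           {"id_number": "", "ccy": "USD"}]
report = profile(records, {"id_number": r"\d{13}", "ccy": r"[A-Z]{3}"})
print(report)  # id_number: 1 null, 1 invalid; ccy: 0 nulls, 1 invalid
```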
d) Design Staging Area
The Staging Data Structures should house the interim data during the (trial) migration, and the procedures (staging, cleansing, conversion, migration, reconciliation and auditing) must populate the structures throughout the migration(s). To ensure the success and validity of the data migration procedures, it is essential that the staging environment mirrors the production environment. Otherwise, the results of the trial migration might not be representative.
e) Design Data Dictionary
A Data Dictionary or metadata repository is a centralised repository of information about data, such as its relationship to other data, its format and default values.
Due to the difference between the source and target data models, a data dictionary should be formulated to align the differences for most data migration projects. The data dictionary repository is housed within the staging area; it is presented to the staging team through a configurable GUI. The following configurations can be found in a data dictionary:
- Data Mapping: Mapping between source fields and its corresponding target field
- Data Translations: Translations between the source master data fields to target master data fields, e.g. Gender Male might be represented as 1 in source system and M in target system
- Defaulting: When a field is mandatory in the target system, but it can’t be mapped to a source system value, such fields should be defaulted to a value as agreed by the business
Formulating the Data Dictionary is a complex exercise that requires attention to detail and a comprehensive understanding of source and target data models by the staging team. The definitions in the data dictionary will be used by procedures in the Implementation Phase to enrich, transform and clean legacy data.
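The three configurations above (mapping, translation and defaulting) can be represented as dictionary entries consumed by the transformation procedures. The field names and values below are hypothetical:

```python
# Hypothetical data-dictionary entries: each source field maps to a target
# field, with optional value translations and a default for mandatory fields.
DATA_DICTIONARY = {
    "GNDR":    {"target": "gender", "translate": {"1": "M", "2": "F"}},
    "CUST_NM": {"target": "customer_name"},
    "REGION":  {"target": "region", "default": "UNKNOWN"},
}

def transform(source_record):
    """Apply mapping, translation and defaulting from the data dictionary."""
    target = {}
    for src_field, entry in DATA_DICTIONARY.items():
        value = source_record.get(src_field)
        value = entry.get("translate", {}).get(value, value)
        if not value:
            value = entry.get("default")  # business-agreed default
        target[entry["target"]] = value
    return target

# Usage: gender code 1 is translated, and the missing mandatory
# region falls back to its agreed default.
print(transform({"GNDR": "1", "CUST_NM": "J. Smith"}))
# {'gender': 'M', 'customer_name': 'J. Smith', 'region': 'UNKNOWN'}
```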
3) Data Migration Implementation & Validation
Action is the foundational key to all success.
Pablo Picasso
The Implementation Phase commences with receiving profiled and cleaned data into the staging area, and ends with the validation of the migration to the target system as depicted in Figure 3.
The activities performed during the Implementation Phase should be tracked and monitored in the form of a checklist and timing sheet artefact, maintained by the Migration Manager. These artefacts can be used as a reference to improve efficiencies prior to subsequent simulations, and give the migration team and business stakeholders a close to realistic timeline for Migration Go-Live.
Although not every implementation and validation activity may occur for every migration, the ones that do occur generally follow the sequence below:
a) Analysis and Data Enrichment
When data is being migrated from multiple sources to a single target system, the staging area acts as an integration platform. As part of the integration effort, the metadata from the different sources must be analysed to identify the most accurate and relevant metadata to transform to the target system.
Analysis effort is also spent deriving metadata that is missing at source but mandatory in the target system, by identifying patterns and relationships between data. Due to the volumes of data involved in multi-source migrations, the analysis effort can be made more efficient by providing the functional team or data stewards with a rules-based engine. This engine can be configured to apply pattern- and keyword-based searches that filter the data relevant to the analysis.
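Such a rules-based engine can be as simple as a set of configurable regular expressions. The rule names, patterns and keywords below are illustrative assumptions:

```python
# Sketch of a rules-based filter: configurable pattern and keyword rules
# select the documents relevant to the analysis effort.
import re

RULES = [
    ("drawing_number", re.compile(r"\bDWG-\d{4}\b")),                 # illustrative
    ("keyword", re.compile(r"\b(pump|valve|turbine)\b", re.IGNORECASE)),
]

def relevant(documents):
    """Return (name, matched_rule_names) for each document hitting a rule."""
    hits = []
    for name, text in documents:
        matched = [rule for rule, pattern in RULES if pattern.search(text)]
        if matched:
            hits.append((name, matched))
    return hits

# Usage: only the document matching a configured rule is surfaced for analysis.
docs = [("a.txt", "Replace valve per DWG-1204"), ("b.txt", "canteen menu")]
print(relevant(docs))  # a.txt matches both rules; b.txt is filtered out
```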
b) Clean and Transform
The Data Cleaning and Conversion Procedures are applied in the staging area to resolve data quality issues, and transform legacy data to be compliant with target system requirements. The mappings and translations stored in the Data Dictionary are referenced for transforming data.
Check point reports can be generated from the staging environment for Functional Teams, to verify the accuracy, integrity and data quality tolerance factors of data after the data enrichment, cleaning and transformation efforts. Based on the verification results, the Project Management Team (along with Business Stakeholders) can make a decision to proceed with the migration of the data to the target system or – if further analysis, cleaning and transformation are needed – to improve the quality of staged data.
c) Load to Target
Data Migration procedures can be applied to migrate the finalised data from the staging environment to the target system. Two approaches can be taken to load data to the target system:
- Preferential Phased Loading: In cases of data migration projects with large data sets, loading of data to the target system is preferentially phased to save time, prepare the target system and identify issues early in the loading process. System-Critical/Master Data is preferentially loaded to the target system, while Operational Data might still be cleansed and transformed in the staging area.
- Complete Loading: In Data Migration projects with smaller data sets the loading of data to the target system can be achieved in one go.
In case peripheral systems (that weren’t migrated) have to interface with the target system, the target master data and other referential data should be synchronised to the peripheral system from the staging area.
d) Test and Reconcile
Post-migration functional testing of the target system should be done by the Quality Assurance Team. Test reports should be produced for review by the Functional Team and Business Stakeholders.
The integrity and quality of the migrated data can be verified by the Functional Team and Business Stakeholders using reconciliation reports generated from the staging area, as well as the target system. The reconciliation reports should highlight un-reconciled records and reason for mismatch, which can be used as a reference to rectify mismatches.
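At its core, a reconciliation report compares the staged and target record sets by key and lists each mismatch with a reason. A minimal sketch, with illustrative document keys and revision values:

```python
# Sketch of a reconciliation report: staged records are checked against the
# target by key; un-reconciled records are listed with a reason for mismatch.
def reconcile(staged, target):
    report = []
    for key, value in staged.items():
        if key not in target:
            report.append((key, "missing in target"))
        elif target[key] != value:
            report.append((key, "value mismatch"))
    return report

# Usage: D2 was loaded with the wrong revision, D3 was not loaded at all.
staged = {"D1": "rev A", "D2": "rev B", "D3": "rev C"}
target = {"D1": "rev A", "D2": "rev X"}
for key, reason in reconcile(staged, target):
    print(key, reason)
```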
e) Switchover
It is nearly impossible to achieve a 100% migration of data from source to target. If the errors found in functional testing and the un-reconciled records are within the tolerance boundaries of the migration acceptance criteria (defined in the Data Quality Plan), business can decide to switch operations over to the new system, with the assurance that any defects will be fixed and migrated within an acceptable time frame. Alternatively, if the defects are beyond the tolerance boundaries (for example, they would affect service delivery), business can decide to continue operations on the source system until confidence in the migration is regained.
Leveraging a staging area for Data Migration offers many business advantages over more traditional integration approaches, including:
- Keeping volatility within a single domain: the staging area temporarily houses data from different sources, in different formats and with different time sensitivities. Changes made to the source and/or target systems are therefore handled and abstracted within the staging area, which can adjust through configuration rather than forcing expensive redevelopment.
- Reduce Total Cost of Ownership: Identifying redundant data and flagging it as such, so that it is not migrated to the new system, reduces Total Cost of Ownership of data.
The following sources contributed to the content and/or formatting included herein:
- Data Migration Project Life Cycle (Figure 2) drawn from: Department of Education, Office of Federal Student Aid, Data Migration Roadmap: A Best Practice Summary, Version 1.0
- Best Practices in Leveraging a Staging Area for SaaS-to-Enterprise Integration - David S. Linthicum
- Best Practices in Data Migration – Philip Russom