Data Cleansing, Consolidation and Migration IT Solution

Business Overview

A multinational utility company required a data cleansing, consolidation and migration IT solution. The company had embarked on a design base “back-fit” initiative in which technical and engineering documentation and drawings (legacy documents) were sourced to reconstitute the design base on its Target Engineering System.

The utility company approached a multidisciplinary engineering company with the requirement to consolidate, verify, extract metadata from, allocate and classify legacy documents at its 8 regional installations in order to meet the design base back-fit criteria.

The engineering company partnered with Digiata to combine engineering expertise with software design and development expertise, and to implement a staging solution that met the utility company’s requirements.

The Challenge

The utility company’s regional installations had legacy documents stored in various locations, from paper-based libraries to various file servers, PCs and document management systems.

Consolidating the documents and identifying duplicates saved in the various locations posed a real challenge. If an electronic solution could not be implemented, the client would have had to employ a substantial number of new staff to manually review the documentation and perform the necessary classifications and allocations.

The auditing of manual processes is a very difficult task because one has to ensure that business rules are applied consistently and uniformly.

Time constraints also ruled out such a manual process, as the bigger installations had in excess of 600,000 documents that needed to be processed.

The Solution

Digiata’s staging solution contained processes and procedures to collect, track and store legacy documents, and a document cleanup and integration system that was implemented in 3 phases.


Data Collection:

The 1st phase consisted of data collection at the 8 regional installations. A dedicated Digiata resource was based at each installation to help document handlers scan legacy documents and drawings into the document cleanup and integration system. A scanning and barcoding process was put in place to identify and track the location of these documents. During the scanning process, duplicate documents were identified and barcoded. A custom solution was developed using Digiata’s Linx integration tool and Stadium product (a toolset for the rapid development of web applications without having to resort to coding). It extracted information from document management sources such as Pigo (πGO)™ and obtained files from other sources such as file servers, external data storage devices and SAP.

Cleanup and Configuration:

The 2nd phase involved analysing the collected drawings and documents to identify patterns within documents that would be used to extract metadata. A rules-based engine was used to apply pattern and free text search to extract and enrich metadata. Configuration management principles were formulated and applied during the pre- and post-cleanup phases.
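The pattern-based extraction described above can be sketched as follows. This is a minimal illustration, not Digiata's rules engine: the `Rule` class, the rule names, the regular expressions and the sample title-block text are all hypothetical, and real title blocks would need patterns tuned to each installation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str            # hypothetical rule identifier, kept for the audit log
    field: str           # metadata field the rule populates
    pattern: re.Pattern  # regex with exactly one capture group


# Hypothetical patterns for a drawing title block (assumed layout).
RULES = [
    Rule("doc-number", "document_number",
         re.compile(r"DOC(?:UMENT)?\s*NO\.?\s*[:\-]?\s*([A-Z0-9\-]+)", re.I)),
    Rule("revision", "revision",
         re.compile(r"\bREV(?:ISION)?\b\.?\s*[:\-]?\s*([A-Z0-9]+)", re.I)),
    Rule("sheet", "sheet_number",
         re.compile(r"\bSHEET\s*[:\-]?\s*(\d+)", re.I)),
]


def extract_metadata(text, rules=RULES):
    """Apply each rule to OCR text; return the metadata and the rules that fired."""
    metadata = {}
    applied = []
    for rule in rules:
        match = rule.pattern.search(text)
        # The first rule to fill a field wins; later rules do not overwrite it.
        if match and rule.field not in metadata:
            metadata[rule.field] = match.group(1).upper()
            applied.append(rule.name)
    return metadata, applied


sample = "PUMP HOUSE LAYOUT  Doc No: PH-0042-A  Rev: 3  Sheet 2 of 5"
metadata, applied = extract_metadata(sample)
print(metadata)
print(applied)
```

Returning the list of rules that fired alongside the metadata is one simple way to keep extraction auditable, since every extracted value can be traced back to a named rule.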

Verification and Migration:

The 3rd phase involved auditing and verification of the cleansed data by subject matter experts from the client. The verified and signed-off data sets were migrated into the Target Engineering System using Digiata’s Linx Integration Tool.

The Document Cleanup and Integration System provided the following functions to assist with the cleansing and augmentation of metadata:

• OCR of all documentation to provide full text searching functions

• Excluding of duplicate documents using a hash code algorithm

• Excluding of non-engineering and technical documentation using a keyword search approach

• Implementing all cleansing and data extraction/augmentation via predefined, customisable rules that were audited and logged

• Ability to undo rules when errors are picked up

• Identifying documents containing the same information and packaging them into the various versions with a proper revision management scheme

• Harvesting metadata from the contents as well as other attributes within the files

• Identifying and allocating documentation to the various levels in the codification structure (KKS & AKZ), as well as system level tags

• Classifying documentation according to the IEC 32-9 document standard

• Identifying and allocating documents to the different contractors that are working at the installations

• Exporting the information using the agreed interface requirements to the Intergraph SPO application

• Completing the audit trail – all actions, system and user initiated, were recorded and logged, which provided a complete and transparent view of all data and actions in the system

• Multiple source locations – allowing sourcing of information from multiple locations e.g. e-mail, FTP sites, network drive, etc.

• Mongo: NoSQL database

• Meeting all the strict deadlines imposed on the team by the client
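As an illustration of the hash-based duplicate exclusion listed above, the sketch below fingerprints each file's bytes and keeps only the first copy of each fingerprint. The case study does not say which hash algorithm was used, so SHA-256 is an assumption here, and the function names and sample file paths are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Fingerprint a file's raw bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def split_duplicates(documents):
    """Split (name, bytes) pairs into unique names and (duplicate, original) pairs."""
    seen = {}  # digest -> name of the first document with that content
    unique, duplicates = [], []
    for name, data in documents:
        digest = content_hash(data)
        if digest in seen:
            duplicates.append((name, seen[digest]))
        else:
            seen[digest] = name
            unique.append(name)
    return unique, duplicates


# Hypothetical documents collected from two source locations.
docs = [
    ("server/a.pdf", b"pump layout rev 3"),
    ("pc/copy_of_a.pdf", b"pump layout rev 3"),
    ("server/b.pdf", b"valve schedule"),
]
unique, duplicates = split_duplicates(docs)
print(unique)
print(duplicates)
```

A content hash of this kind only catches byte-identical copies. Drawings re-saved in a different format need matching on document identifiers instead.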

The Results

The utility company gained the following benefits from the process:

• It consolidated information at plant level in a very short time: an average of 640,000 documents (over 640GB) per installation were imported, analysed and enhanced in the space of 5 months. This included developing the custom migration specification with the Target Engineering System and delivering data ready to be loaded.

• It excluded sensitive and other documentation not relevant to the engineering process (over 75,000 documents per installation).

• It offered a consolidated view of the documentation across various power plants.

• It converted documentation from an old classification scheme to the new IEC standard.

• It exposed gaps in the utility company’s codification structures, because the software flagged missing nodes.

• It identified duplicate drawings that had been re-drawn in different formats and tools (AutoCAD DGN/DWG, TIF, PDF) but shared the same unique document identifiers (Document Number, Revision and Sheet Number). Search and merge capabilities were used to merge these drawings into a single document.

• It identified all the records in the Pigo system that did not have any physical documents attached (over 41,000 that could not be matched to any file).

• It applied configuration management principles to the imported documents.

• Integration – quick and easy integration into virtually any system.

• Scalability – the system caters for varying volumes of legacy document extraction and cleanup.
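The merging of re-drawn duplicates by Document Number, Revision and Sheet Number, mentioned in the results above, can be pictured with a small grouping sketch. The record layout, field names and file paths are hypothetical, and the actual search and merge capabilities of the system are not described in enough detail to reproduce.

```python
from collections import defaultdict


def merge_renditions(files):
    """Group file records that share (document number, revision, sheet number)."""
    merged = defaultdict(list)
    for record in files:
        key = (record["document_number"], record["revision"], record["sheet_number"])
        merged[key].append(record["path"])
    return dict(merged)


# Hypothetical records: the same drawing saved in three formats, plus one other.
files = [
    {"document_number": "PH-0042-A", "revision": "3", "sheet_number": "2", "path": "cad/ph0042.dwg"},
    {"document_number": "PH-0042-A", "revision": "3", "sheet_number": "2", "path": "scans/ph0042.tif"},
    {"document_number": "PH-0042-A", "revision": "3", "sheet_number": "2", "path": "pdf/ph0042.pdf"},
    {"document_number": "VS-0007", "revision": "1", "sheet_number": "1", "path": "pdf/vs0007.pdf"},
]
merged = merge_renditions(files)
for key, paths in merged.items():
    print(key, paths)
```

Each key then represents one logical document, with every format attached to it as a rendition rather than counted as a separate document.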


Design Base:

“Any combination of the specifications, criteria, codes, standards, analyses, constraints, qualifications, and limitations which determines the functions, interfaces, and expectations of a facility, structure, system or component. The design bases identify and support ‘WHY’ design requirements are established. Calculations are typically considered part of the design bases. Calculations generally translate design bases into design requirements or confirm that a design requirement supports the design bases.”

Staging Solution:

The staging area is an interim data source into which data, documents and supporting metadata from single or multiple source systems are extracted, profiled, cleaned and loaded. The staged data can be analysed by subject matter experts to formulate business rules that can be applied to enrich and transform data so that it satisfies target system requirements. The finalised data is verified for integrity and accuracy by subject matter experts, who can approve or reject the migration to the target system.

To find out how we can solve your data cleansing, consolidation and migration-related business problem, please contact us.
