A data warehouses has a set of processes which is called ETL standing for extract transform load. The case of a data warehouse implementation involves several databases in different data stores scattered in different nodes with the information system network. These data stores can be of different systems.
For instance one data stored may be running on a Unix and other Unix-like operating system. Another may be running on Windows and its suite of Server Operating Systems. Others may be run on Linux and still others may be from legacy systems.
Since a data warehouse is an expensive investment and undertaking, the updating of the entire system may done gradually such as addition of new computer nodes every few months or years. This can greatly affect the integration of system data resulting in disparate data.
Another consideration is that different data sources may be powered by different kinds of database management system. Some may be relational and other non-relational. Even with today’s advance technology where most databases in used are of the new relational databases, there are still issues issue related to disparity.
There are hundreds of relational database management system vendors although all of these vendors make their software based on a the basic premise of relational databases, they still have their individual, distinct and even proprietary functions and formats that may hardly be compatible with other relational database management systems.
Still another problem with dealing with data on a large data warehouse is the existence of flat files. It may be hard to believe that even in today’s fast advancement of information technology, flat files still exist. Flat files may come from document format and other spreadsheets applications.
Now, there has to be a way to unified all these disparate data so that the data warehouse will be able to processes them, sort them out, and come up with relevant information or at least an ironed out format of data that the data warehouse can feed to the business intelligence system for use by top company personnel and other data consumers.
The extract, transform, load (ETL) set of processes overcomes the problem on data disparity in large data warehouse systems involving various data stores of varying platforms. What ETL does it that it initially extracts all the data from flat files, relational and non-relational database files. They need to be in some kind of operational data store where they wait to be cleansed and treated.
They need to be in this state so that all data redundancies are eliminated as well as the data structure of data are closely examined by the system for integrity. In this stage, the data will be sort of stripped from all the formats dressed on them from the platforms and applications they were coming from. What will be left of them is the very core, the very value of their existence.
After the data has been extracted with only their value left, they will be transformed (as what the ETL really means). There any many ways to transform and one of the most common ways of transforming them into one format the data warehouse takes is by using XML.
When they have been transformed they will then be loaded into the data warehouse where they await whoever queries them for relevant and useful information for reporting for where they business intelligence system automatically gets them for generating enterprise reports and other useful data. The process of extracting data is very labor intensive giving the bulk of data coming everyday.