Data warehouse implementations, although basically similar in that each is based on a particular data architecture and set of data models, also differ in several ways because of differences in the hardware and software platforms hosting the databases. In a data warehouse, especially a very large one, there are many data sources. Big companies typically consolidate data from several sources, each representing a particular department or group of departments, into one data warehouse.
Because each data source shares the data from its own database, the data must first be extracted and then transformed before finally being loaded into the data warehouse. This collective process is known in the data warehousing world as ETL, which stands for extract, transform and load.
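The three ETL stages described above can be sketched as plain functions. This is a minimal illustration, not a real implementation: the source rows, the transformation rule (uppercasing a department code and normalizing an amount), and the in-memory "warehouse" are all hypothetical.

```python
def extract(source):
    """Pull raw records from a (hypothetical) departmental data source."""
    return list(source)

def transform(rows):
    """Normalize each record into the warehouse's common format."""
    return [{"dept": r["dept"].upper(), "amount": round(float(r["amount"]), 2)}
            for r in rows]

def load(rows, warehouse):
    """Append the transformed records to the (in-memory) warehouse."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
raw = [{"dept": "sales", "amount": "12.50"}, {"dept": "hr", "amount": "7.5"}]
load(transform(extract(raw)), warehouse)
print(warehouse)
```

In a real system each stage would, of course, talk to actual databases and file systems, but the shape of the pipeline stays the same: extract first, transform second, load last.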
A typical large data warehouse implementation involves one or more operational data stores. These operational data stores are designed to function as an area for integrating data from many disparate sources so that the data can be operated on, analyzed and reported. Because the data comes from various sources with potentially very different platforms and formats, the integration often involves cleaning, redundancy resolution and business rule enforcement.
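The three integration activities just mentioned can be sketched in one small function. Everything here is a made-up example: the field names, the duplicate key, and the business rule (rejecting non-positive amounts) are hypothetical stand-ins for whatever a real operational data store would enforce.

```python
def integrate(records):
    seen = set()
    clean = []
    for r in records:
        # Cleaning: normalize whitespace and case in the customer id.
        cid = r["customer_id"].strip().upper()
        # Redundancy resolution: keep only the first record per customer/date.
        key = (cid, r["date"])
        if key in seen:
            continue
        seen.add(key)
        # Business rule enforcement: reject non-positive transaction amounts.
        if r["amount"] <= 0:
            continue
        clean.append({"customer_id": cid, "date": r["date"], "amount": r["amount"]})
    return clean

rows = [
    {"customer_id": " c001 ", "date": "2024-01-05", "amount": 10.0},
    {"customer_id": "C001",   "date": "2024-01-05", "amount": 10.0},  # duplicate
    {"customer_id": "C002",   "date": "2024-01-05", "amount": -3.0},  # violates rule
]
print(integrate(rows))
```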
As such, an operational data store is typically designed to contain low-level, atomic (indivisible) data such as prices and transactions; examples of non-atomic data include net contributions, which are aggregated or summarized from the underlying detail.
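The distinction is easy to see in code: the store holds the atomic transaction rows, and an aggregate such as a net contribution is derived from them rather than stored as a fact of its own. The accounts and figures below are invented for the example.

```python
transactions = [  # atomic, indivisible facts held in the operational data store
    {"account": "A", "amount": 100.0},
    {"account": "A", "amount": -25.0},
    {"account": "B", "amount": 40.0},
]

# Aggregated (non-atomic) data: net contribution per account,
# computed from the atomic rows rather than stored directly.
net = {}
for t in transactions:
    net[t["account"]] = net.get(t["account"], 0.0) + t["amount"]

print(net)  # {'A': 75.0, 'B': 40.0}
```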
The operational data store achieves this atomicity by extracting the data directly from the data sources. Since the various databases powering those sources all send data converging on the operational data store, there needs to be a way to carry out the extracts efficiently, and the extract specification handles this particular aspect.
An extract specification usually contains the general format of the extract files, the contents of each extract file and its accompanying information, and then each data element in detail, including its valid values and all other important extract characteristics.
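One way to picture those three parts of a specification is as a simple data structure with a validation routine driven by it. The field names, types and valid values below are hypothetical; a real specification would usually live in a document or metadata repository rather than in code.

```python
extract_spec = {
    # Part 1: general format of the extract files.
    "general_format": {
        "field_delimiter": ",",
        "text_qualifier": '"',
        "record_terminator": "\n",
    },
    # Part 2: each extract file and its accompanying information.
    "files": [
        {
            "name": "sales_extract",
            "description": "Daily sales transactions from the sales system",
            # Part 3: each data element in detail, with valid values.
            "elements": [
                {"name": "txn_id", "type": "text",    "required": True},
                {"name": "amount", "type": "decimal", "required": True},
                {"name": "region", "type": "text",    "valid_values": ["N", "S", "E", "W"]},
            ],
        }
    ],
}

def validate(record, spec_elements):
    """Check one record against the element definitions in the spec."""
    for el in spec_elements:
        value = record.get(el["name"])
        if el.get("required") and value is None:
            return False
        if "valid_values" in el and value not in el["valid_values"]:
            return False
    return True

rec = {"txn_id": "T1", "amount": "19.99", "region": "N"}
print(validate(rec, extract_spec["files"][0]["elements"]))  # True
```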
An extract specification format may be tailored to the needs of each department; where this is not possible, a generic format may be used for every extract. This requires each data store to convert the extract file into a format suitable for the specific target systems.
The general format of each extract file is similar in that all data elements may be represented as variable-length data (though in some cases data elements may be defined with fixed lengths). Textual data elements may be surrounded by quotation marks, and the records within a file are sorted according to various categorization codes.
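A file in that general format can be produced with the standard `csv` module, which handles variable-length fields and quoting of textual elements; sorting by a categorization code is a one-liner. The codes and rows below are invented for the sketch.

```python
import csv
import io

rows = [
    {"category": "B2", "description": "office supplies", "amount": "12.5"},
    {"category": "A1", "description": "travel, domestic", "amount": "300"},
]

buf = io.StringIO()
# QUOTE_NONNUMERIC surrounds textual fields with quotation marks
# while leaving numeric fields unquoted.
writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC)
for r in sorted(rows, key=lambda r: r["category"]):  # sort by categorization code
    writer.writerow([r["category"], r["description"], float(r["amount"])])

print(buf.getvalue())
```

Note how the quotation marks let a textual value safely contain the field delimiter ("travel, domestic"), which is exactly why quoted text elements appear in extract formats.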
The extract file descriptions may contain more detailed information, including a general description of the extracted data, a brief description of the data source, any special comments about the data, and an ordered list of the data elements.
The definitions of the data elements may take the longest and be the most detailed part. They must illustrate, in depth, what the data represents and how it will be used and manipulated during the extraction process.
It is extremely important to plan the design of the extract specification carefully, because this process, being the first step in ETL, plays a critical role: small glitches that go undetected here can ripple into serious problems downstream.