Data scrubbing is a technique for correcting errors in data. In its hardware sense, a background task periodically inspects memory for errors and corrects them, typically by means of ECC memory or a redundant copy of the data; employing scrubbing in this way greatly reduces the chance that single correctable errors accumulate. In data warehousing, the term describes decoding, merging, filtering, and even translating source data so that the data loaded into the warehouse remains valid.
To illustrate the need for data scrubbing, consider this example. If a person is asked, "Are Joseph Smith of 32 Mark St., Buenavista, CA and Josef Smith of 32 Clarke St., Buenvenida, Canada the same person?", that person would probably answer that the two are most likely the same. But to a computer without the aid of a specialized software application, they are two entirely different people.
Human eyes and minds would spot that the two records are really the same and conclude that a mistake or inconsistency occurred during data entry. But in the end it is the computer that handles all the data, so there must be a way to make the data usable for the computer. Data scrubbing weeds out, fixes, or discards incorrect, inconsistent, or incomplete data.
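The record-matching problem in the example above can be sketched with a simple string-similarity measure. The sketch below uses Python's standard-library difflib; the normalization rules and the 0.7 threshold are illustrative assumptions, not the method of any particular tool.

```python
from difflib import SequenceMatcher

def normalize(record: str) -> str:
    """Lowercase and strip punctuation so superficial differences vanish."""
    return "".join(ch for ch in record.lower() if ch.isalnum() or ch.isspace())

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two normalized records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

rec1 = "Joseph Smith, 32 Mark St., Buenavista, CA"
rec2 = "Josef Smith, 32 Clarke St., Buenvenida, Canada"

score = similarity(rec1, rec2)
# Records scoring above a chosen threshold are flagged as likely duplicates,
# usually for human review rather than automatic merging.
print(f"similarity: {score:.2f}", "-> possible duplicate" if score > 0.7 else "-> distinct")
```

A real tool would combine several such measures (phonetic codes, field-by-field comparison, address parsing) rather than a single ratio over the whole record.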
Computers cannot reason. They operate on the principle of "garbage in, garbage out": no matter how sophisticated a software application is, if the input data is of poor quality, the output will be of poor quality as well.
With the widespread popularity of data warehouses, enterprise resource planning (ERP), and customer relationship management (CRM) implementations nowadays, the issue of data hygiene has become increasingly important. Without data scrubbing, the staff of a company may face the sad prospect of merging corrupt or incomplete data from multiple databases. A single tiny piece of dirty data may seem trivial, but multiplied by thousands or millions of erroneous, duplicated, or inconsistent records it can turn into a huge disaster. And a tiny bit of dirty data is highly likely to multiply.
There are many sources of dirty data. The most common include:
- Poor data entry, which includes misspellings, typos and transpositions, and variations in spelling or naming;
- Lack of companywide or industry-wide data coding standards;
- Data missing from database fields;
- Multiple databases scattered throughout different departments or organizations, with the data in each structured according to the idiosyncratic rules of that particular database;
- Older systems that contain poorly documented or obsolete data.
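Several of the sources above, such as misspellings and missing coding standards, can be attacked with simple rule-based standardization. The sketch below assumes a hypothetical company-wide coding standard expressed as lookup tables; real scrubbing tools ship far larger dictionaries and grammar-based parsers.

```python
import re

# Hypothetical coding standard: canonical state codes and street-suffix
# spellings (illustrative, deliberately tiny).
STATE_CODES = {"calif": "CA", "california": "CA", "ca": "CA"}
SUFFIXES = {"street": "St", "str": "St", "st": "St", "avenue": "Ave", "ave": "Ave"}

def standardize_address(raw: str) -> str:
    """Apply the coding standard to one free-text address field."""
    tokens = re.split(r"[\s,.]+", raw.strip())
    out = []
    for tok in tokens:
        key = tok.lower()
        # Prefer a state-code rule, then a suffix rule, else title-case the token.
        out.append(STATE_CODES.get(key, SUFFIXES.get(key, tok.title())))
    return " ".join(out)

print(standardize_address("32 mark STREET, buenavista, calif"))
# → "32 Mark St Buenavista CA"
```

The same pattern extends to product codes, units, and date formats: normalize each field against an agreed dictionary before the data ever reaches the warehouse.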
Today there are hundreds of specialized software applications developed for data scrubbing. These tools use complex and sophisticated algorithms capable of parsing, standardizing, correcting, matching, and consolidating data. Most offer a wide range of functions, from simple data cleansing to the consolidation of high volumes of data in a database.
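The matching-and-consolidating step such tools perform can be sketched as a greedy deduplication pass: keep the first record of each near-duplicate group. The 0.7 similarity threshold and the toy records are illustrative assumptions; production tools use much more sophisticated blocking and matching strategies.

```python
from difflib import SequenceMatcher

def is_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Treat two records as duplicates when their similarity clears the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def consolidate(records: list[str]) -> list[str]:
    """Greedy consolidation: a record survives only if it matches nothing kept so far."""
    kept: list[str] = []
    for rec in records:
        if not any(is_match(rec, k) for k in kept):
            kept.append(rec)
    return kept

rows = ["ACME Corp", "Acme Corporation", "Globex Inc", "ACME corp."]
print(consolidate(rows))
# → ['ACME Corp', 'Globex Inc']
```

In practice the surviving record would be a merge of the group's fields (the most complete value wins), not simply the first one seen.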
Many of these data cleansing and data scrubbing tools can also reference comprehensive data sets to correct and enhance data. For instance, customer records in a CRM solution could be matched against a reference data set and enriched with additional customer information such as household income.
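This kind of reference matching amounts to a join against a lookup table. In the sketch below, the customer ids, field names, and income figures are all hypothetical, purely for illustration; a real system would source the reference table from an external data provider.

```python
# Hypothetical reference data set keyed by customer id.
REFERENCE = {
    "C001": {"household_income": 72000},
    "C002": {"household_income": 55000},
}

def enrich(customer: dict) -> dict:
    """Merge reference attributes into a CRM record when a match exists."""
    extra = REFERENCE.get(customer.get("id"), {})
    # Reference fields are appended; the original record is left untouched.
    return {**customer, **extra}

crm_row = {"id": "C001", "name": "Joseph Smith"}
print(enrich(crm_row))
# → {'id': 'C001', 'name': 'Joseph Smith', 'household_income': 72000}
```

Records with no match simply pass through unchanged, which keeps enrichment safe to run over the whole customer table.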
Although data hygiene is very important in getting useful results from any application, it should not be confused with data quality. Data quality is about whether data is good (valid) or bad (invalid); validity is the measure of the data's relevance to the analysis at hand. Data scrubbing is likewise often confused with data cleansing; the two are similar to a certain degree, but they are not the same thing.