Automatic data replication is the process in which data and metadata are replicated automatically in response to client requests at a specific data site. A data site may run many computers working together as a system to manage one or more data warehouses. These warehouses are repositories holding millions upon millions of data items, with more being gathered, aggregated, distributed and updated every second.
Data replication uses redundant resources, such as hardware or software, so that the data site as a whole gains reliability and performance and can tolerate unexpected problems arising from load-intensive processes.
Data replication can be performed manually or automatically, as programmed or configured by the database administrator. Replicas can be stored on the same storage device or spread across multiple storage devices, either within the same data site or at other data sites in different geographic locations.
Data replication is commonly used in distributed systems. A distributed system is composed of many computers, each processing a different part of a program. These computers constantly communicate with each other over a network so that their work can be synchronized and collated to produce the desired output.
Data replication in distributed systems commonly follows one of three models. The transactional replication model is used to automatically replicate data used in transactions. State machine replication is mainly used to achieve fault tolerance by executing copies of a deterministic task on multiple nodes. Virtual synchrony replication works by having a group of processes cooperate to replicate in-memory data.
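To make the state machine model more concrete, here is a minimal sketch in Python. It assumes a toy deterministic task (a counter that accepts "add" and "mul" commands); the names CounterStateMachine and replicate are hypothetical and chosen only for illustration. The point it shows is that replicas which apply the same commands in the same order end up in the same state.

```python
class CounterStateMachine:
    """A deterministic task: the same commands in the same order give the same state."""

    def __init__(self):
        self.value = 0

    def apply(self, command):
        op, amount = command
        if op == "add":
            self.value += amount
        elif op == "mul":
            self.value *= amount
        else:
            raise ValueError(f"unknown operation: {op}")


def replicate(commands, replica_count=3):
    """Feed an identical, ordered command sequence to every replica."""
    replicas = [CounterStateMachine() for _ in range(replica_count)]
    for command in commands:
        for replica in replicas:
            replica.apply(command)
    return replicas


replicas = replicate([("add", 5), ("mul", 3), ("add", 1)])
states = [replica.value for replica in replicas]
```

Because every replica holds the same state, any one of them can take over if another fails. A real system would also need a way to agree on the command order across the network, which this sketch leaves out.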
In many database management systems, replication follows a master-slave relationship between the original and the replicated copies. When the master's log is updated, the update is propagated to all the slaves. Once a slave receives an update, it sends a message to the master indicating that it is ready to receive subsequent updates.
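The master-slave flow described above can be sketched roughly as follows. This is a simplified, in-process illustration rather than how any particular database implements it: the Master and Slave classes, the list-based log and the integer acknowledgement are all assumptions made for the example.

```python
class Slave:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def receive(self, index, entry):
        """Apply one log entry and return an acknowledgement to the master."""
        key, value = entry
        self.data[key] = value
        self.applied = index + 1
        return self.applied


class Master:
    def __init__(self, slaves):
        self.data = {}
        self.log = []                    # ordered list of (key, value) updates
        self.slaves = slaves
        self.acked = [0] * len(slaves)   # log position each slave has acknowledged

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))
        self.sync()

    def sync(self):
        """Send every unacknowledged log entry to each slave, in order."""
        for i, slave in enumerate(self.slaves):
            for index in range(self.acked[i], len(self.log)):
                self.acked[i] = slave.receive(index, self.log[index])


master = Master([Slave(), Slave()])
master.write("customer:1", "Ada")
master.write("customer:2", "Grace")
```

The acknowledgement tells the master where each slave stands in the log, so a slave that falls behind can be sent only the entries it is missing.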
In multi-master replication in database management systems, updates can be submitted to any node and then spread to the other servers. This can result in faster updates, but it may be impractical in some situations because of its complexity and the potential for conflicting updates.
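The conflict problem can be illustrated with a small sketch. It assumes one common resolution strategy, last-writer-wins, with ties broken by node name; the text above does not prescribe any particular strategy, and the Node class and its methods are hypothetical.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}   # key -> (timestamp, node_name, value)
        self.peers = []

    def write(self, key, value, timestamp):
        """Accept a write locally, then spread it to the other nodes."""
        record = (timestamp, self.name, value)
        self.merge(key, record)
        for peer in self.peers:
            peer.merge(key, record)

    def merge(self, key, record):
        """Last-writer-wins: keep the record with the larger (timestamp, node_name)."""
        current = self.store.get(key)
        if current is None or record[:2] > current[:2]:
            self.store[key] = record

    def read(self, key):
        return self.store[key][2]


a, b = Node("a"), Node("b")
a.peers, b.peers = [b], [a]

a.write("x", "from-a", timestamp=1)
b.write("x", "from-b", timestamp=1)   # conflicting write with the same timestamp
a.write("x", "stale", timestamp=0)    # older write, ignored by both nodes
```

Both nodes settle on the same value because they apply the same deterministic rule. The cost is that one of the conflicting writes is silently discarded, which is part of why multi-master setups can be impractical in some cases.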
Active storage replication works by distributing updates to a block device across several separate physical hard disks. The file system can be replicated without any modification, and the process is implemented either in hardware, in a disk array controller, or in software, in the device driver.
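As a rough illustration of block-level replication, the sketch below mirrors every block write to several in-memory "disks". It is only a model: real implementations live in a disk array controller or a device driver, and the MirroredBlockDevice class, the 4-byte BLOCK_SIZE and the bytearray disks are assumptions made for the example.

```python
BLOCK_SIZE = 4  # bytes per block; unrealistically small, chosen for readability


class MirroredBlockDevice:
    def __init__(self, disk_count, block_count):
        # Each bytearray stands in for one physical disk.
        self.disks = [bytearray(BLOCK_SIZE * block_count) for _ in range(disk_count)]

    def write_block(self, index, data):
        """Write one block to the same offset on every disk."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block long")
        offset = index * BLOCK_SIZE
        for disk in self.disks:
            disk[offset:offset + BLOCK_SIZE] = data

    def read_block(self, index, disk=0):
        """Read one block from the chosen disk; any disk holds the same data."""
        offset = index * BLOCK_SIZE
        return bytes(self.disks[disk][offset:offset + BLOCK_SIZE])


device = MirroredBlockDevice(disk_count=3, block_count=8)
device.write_block(2, b"abcd")
```

Since replication happens below the file system, whatever is layered on top sees a single device and needs no changes.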
Data replication is also employed in distributed shared memory systems. In such a system, many nodes can share the same page of memory, which means the data is replicated across different nodes. This is used to improve speed in large data warehouses.
Search engines, whose data warehouses are among the biggest and index data and metadata every second, make the most intensive use of automatic data replication, since they serve public internet users around the world.
Load balancing is often associated with data replication, but the two are different: load balancing only distributes the load of different computations across many machines, whereas replication copies the data itself.
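The difference is easy to see in a small sketch. The example below assumes a simple round-robin policy, which is only one of many load balancing strategies; the round_robin function and the task and worker names are hypothetical.

```python
from itertools import cycle


def round_robin(tasks, workers):
    """Assign each task to exactly one worker, taking the workers in turn."""
    assignment = {worker: [] for worker in workers}
    for task, worker in zip(tasks, cycle(workers)):
        assignment[worker].append(task)
    return assignment


assignment = round_robin(["t1", "t2", "t3", "t4", "t5"], ["node-a", "node-b"])
```

Each task ends up on a single machine, so nothing is duplicated. Replication does the opposite: the same data is placed on every replica.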
Backup also involves making copies of data, but it differs from data replication in that the saved copy remains unchanged for a long period, whereas replicas are constantly updated.
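A minimal sketch of this distinction follows. It assumes a toy in-memory store in which the replica follows every write and a backup is a point-in-time copy; the Store class and its method names are hypothetical.

```python
import copy


class Store:
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.backups = []

    def write(self, key, value):
        self.primary[key] = value
        self.replica[key] = value  # the replica follows every update

    def take_backup(self):
        # A backup is a frozen snapshot; later writes do not touch it.
        self.backups.append(copy.deepcopy(self.primary))


store = Store()
store.write("orders", 10)
store.take_backup()
store.write("orders", 25)
```

After the second write the replica holds the new value while the backup still holds the old one, which is what makes a backup useful for recovering from mistakes that replication would faithfully copy.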
Both load balancing and backup are important processes in large data warehouses. Many companies invest in data warehouses with automatic data replication mainly to take advantage of the enhanced availability of specific and general data and to gain disaster recovery protection. Other benefits of a data warehouse with automatic data replication include tolerance to disaster, ease of use and management, and a more robust system.