Distributed Databases Overview
Suppose you created a database for a web application a few years ago. It started with a handful of users but steadily grew, and now its growth is far outpacing the server’s relatively limited resources. You could upgrade the server, but that would only stem the effects of the growth for a year or two. Also, now that you have thousands of users, you are worried not only about scalability but also about reliability.
If that one database server fails, a few sleepless nights and many hours of downtime might be required to get a brand new server configured to host the database again. No one-time solution is going to scale infinitely or be perfectly reliable, but there are a number of ways to distribute a database across multiple servers that will increase the scalability and reliability of the entire system.
Put simply, we want to have multiple servers hosting the same database. This will prevent a single failure from taking the entire database down, and it will spread the database across a large resource pool.
By definition, a distributed database is one that is run by a central management system but which has storage nodes distributed among many processors. These “slave” nodes could be in the same physical location as the “master”, or they could be connected via a LAN, WAN, or the Internet. Many times, database nodes in configurations like this have significant failover properties, like RAID storage and/or off-site backup, to improve chances of successful recovery after a database failure.
Distributed databases can exist in many configurations, each of which may be used alone or combined to achieve different goals.
Distributed Database Architecture
A distributed database is divided into sections called nodes. Each node typically runs on a different computer, or at least a different processor, but this is not true in all cases.
Horizontal Fragments
One of the usual reasons for distributing a database across multiple nodes is to more optimally manage the size of the database.
For example, if a database contains information about customers in the United States, it might be distributed across three servers, one each for the eastern United States, the mid-western United States, and the western United States.
Each server might be responsible for customers with certain ZIP codes. Since ZIP codes are generally arranged from lowest to highest as they progress westward across the country, the actual limits on the ZIP codes might be 00000 through 33332, 33333 through 66665, and 66666 through 99999, respectively.
In this case, each node would be responsible for approximately one third of the data for which a single, non-distributed node would be. If each of these three nodes approached its own storage limit, another node or two nodes might be added to the database, and the ZIP codes for which they are responsible could be altered appropriately. More “intelligent” configurations could be imagined, as well, wherein, for example, population density is considered, and larger metropolitan areas like New York City would be grouped with fewer other cities and towns.
In a distribution like this, either the database application or the database management system, itself, could be responsible for deciding which database node would process requests for certain ZIP codes. Regardless of which approach is taken, the distribution of the database must remain transparent to the user of the application. That is, the user should not realize that separate databases might handle different transactions.
Reducing node storage size in this manner is an example of using horizontal fragments to distribute a database. This means that each node contains a subset of the larger database’s rows.
{qbapagebreak title=Distributed Database Architecture Vertical Fragments}
Vertical Fragments
The vertical fragment approach to database distribution is similar in concept to the horizontal fragment approach, but it does not lend itself as easily to scalability. Vertical fragments occur when columns, instead of rows, are distributed across multiple nodes.
A situation that calls for vertical fragments might arise if a table contains information that is pertinent, separately, to multiple applications. Using the previous example of a database that stores customer information, we might imagine an airline’s frequent flyer program.
These programs typically track, among other things, customers’ personal information, like addresses and phone numbers, along with a list of all the trips they have flown and the miles they have accrued along the way.
These sets of data have different applications: the customer information is used when mailing tickets and other correspondence, and the mileage information is used when deciding how many complimentary flights a customer may purchase or whether the customer has flown enough miles to obtain “elite” status in the program. Since the two sets of data are generally not accessed at the same time, they can easily be separated and stored on different nodes.
Since airlines typically have a large number of customers, this distribution could be made even more efficient by incorporating both horizontal fragmentation and vertical fragmentation. This combined fragmentation is often called the mixed fragment approach.
Other Fragmentation Types
A database can be broken up into many smaller pieces, and a large number of methods for doing this have been developed. A simple web search for something like “distributed databases” would probably prove fruitful for further exploration into other, more complex, methods of implementing a distributed database. However, there are two more terms with which the reader should be familiar with respect to database fragmentation.
The first is homogeneous distribution, which simply means that each node in a distributed database is running the same software with the same extensions and so forth. In this case, the only logical differences among the nodes are the sets of data stored at each one. This is normally the condition under which distributed databases run.
However, one could imagine a case in which multiple database systems might be appropriate for managing different subsets of a database. This is called heterogeneous distribution and allows the incorporation of different database software programs into one big database. Systems like this are useful when multiple databases provide different feature sets, each of which could be used to improve the performance, reliability, and/or scalability of the database system.
Replication
In addition to the distribution situations above, full-database replication is also available for many database platforms. This is really what we mean when we say a database is hosted by multiple servers, but in general, the idea of distributing pieces of a database should be considered before putting much thought into wholesale replication of a database. This is for one simple reason: replication is expensive. It’s expensive in terms of finance, time, and data, but for many applications, it truly is the best solution.
Here, we briefly discuss “master-master” replication, which is perhaps the most complicated of all the replication solutions. This is also the most comprehensive replication solution, since each master always has a current copy of the database. Because of this, the entire database will still be available if one node fails.
{qbapagebreak title=The Three Expenses in Distributed Databases}
The Three Expenses in Distributed Databases
Essentially, replication entails creating exact copies of databases on many computers and updating every database simultaneously whenever an update is performed on one database. The pitfalls of this process are explained by the three expenses, below.
Replication has finance expense because every server, every hard drive, every battery-backed RAID card, every network switch, every fast network connection, every battery-backed power supply, and every other piece of associated hardware must be purchased. In addition to that are the costs of bandwidth, maintenance, backup servers, co-location, remote management, and many other things. For a decent-sized database, this could very easily run into the tens of thousands of dollars before even getting to the “every hard drive” part of the list.
Replication has time expense because each operation performed on one node’s database must be performed on each other node’s database simultaneously. Before the operation can be said to be committed, each other node must verify that the operation in its own database succeeded. This can take a lot of time and produce considerable lag in the interface to the database.
And, replication has data expense because every time the database is replicated, another hard drive or two or more fills up with data pertaining to the database. Then, every time one node gets a request to update that data, it must transmit just as many requests as there are other nodes. And, confirmations of those updates must be sent back to the node that requested the update. That means a lot of data is flying around among the database nodes, which, in turn, means ample bandwidth must be available to handle it.
How to Initiate Replication
Many of the more popular databases support some sort of native replication. MySQL, for example, provides the GRANT REPLICATION command, which initiates replication automatically. PostgreSQL, on the other hand, requires external software for replication. This usually happens in the form of Slony-1, a comprehensive replication suite. Each database platform has a different method for initiating replication services, so it is best to consult that platform’s manual before implementing a replication solution.
Considerations
When implementing a distributed database, one must take care to properly weigh the advantages and disadvantages of the distribution. Distributing a database is a complicated and sometimes expensive task, and it may not be the best solution for every project. On the other hand, with some spare equipment and a passionate database developer, distributing a database could be a relatively simple and straightforward task.
The most important thing to consider is how extensively your database system supports distribution. PostgreSQL, MySQL, and Oracle, for example, have a number of native and external methods for distributing their databases, but not all database systems are so robust or so focused on providing a distributed service. Research must be performed to determine whether the database system supports the sort of distribution required.
The field of distributed database management is relatively young, so the appropriate distribution model for a particular task may not be readily available. In a situation like this, designing one’s own distributed database system may be the best development option.
Regardless of the approach taken, distributing a database can be a very rewarding process when considering the improvement of the scalability and reliability of a system.