In computer science, clustering is the classification of data or objects into groups. It can also be described as the partitioning of a data set into subsets, where the items in each subset ideally share some common trait.
Data clusters are created to meet specific requirements that cannot be met using any of the existing categorical levels. One can combine data subjects into a temporary group to form a data cluster.
Data clusters are the product of an unsupervised classification of patterns, where a pattern may be a data item, an observation or a feature vector. Clustering has broad appeal and usefulness as one of the first steps in exploratory data analysis, as reflected by its use in many contexts by researchers from various disciplines. It is widely used even though the problem is difficult and combinatorial by its very nature.
Data clustering is used in many exploratory processes, including pattern analysis, decision making and grouping. It is also heavily used in machine learning settings such as data mining, image segmentation, document retrieval and pattern classification.
There are two general categories of algorithms used in data clustering: hierarchical and partitional. Hierarchical algorithms find successive clusters using previously established clusters, and can be further subcategorized as agglomerative ("bottom-up") or divisive ("top-down"). Partitional algorithms, on the other hand, determine all clusters at once, producing a single partition of the data.
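The bottom-up idea can be illustrated with a minimal sketch. It assumes Euclidean distance, single-link merging (the distance between two clusters is the distance between their closest members) and a target number of clusters as the stopping condition. None of these choices comes from the text above, and the names single_link and agglomerate are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_link(a, b):
    """Distance between two clusters: the closest pair of points (an assumed linkage)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain (target_clusters >= 1)."""
    # Start with every point in its own singleton cluster.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of existing clusters that are closest to each other.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster j into cluster i; j > i, so removing j leaves i in place.
        clusters[i].extend(clusters.pop(j))
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = agglomerate(points, 3)
print(result)
```

Each pass reuses the clusters built so far, which is what distinguishes a hierarchical method from a partitional one. This naive version recomputes every pairwise distance on each pass, so it is only suitable for small inputs.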
Within the data clustering taxonomy, the following issues exist:
Agglomerative vs. Divisive: This issue refers to the algorithmic structure and operation of the clustering method. The agglomerative approach starts with each pattern in a distinct cluster (a singleton) and then successively merges clusters until a stopping condition is met. The divisive approach starts with all patterns in a single cluster and then splits it repeatedly until a stopping condition is satisfied.
Monothetic vs. Polythetic: This issue refers to the sequential or simultaneous use of features in the clustering process. Most data clustering algorithms are polythetic, meaning that all features enter into the computation of distances between patterns. The monothetic approach is simpler: it considers features one at a time, dividing the given group of patterns on each feature in turn.
Hard vs. Fuzzy: Hard clustering allocates each pattern to a single cluster, both during the clustering operation and in its final output. Fuzzy clustering, on the other hand, assigns each input pattern degrees of membership in several clusters. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster in which it has the largest degree of membership.
Deterministic vs. Stochastic: This issue is most relevant to partitional approaches designed to optimize a squared error function. The optimization can be carried out with traditional (deterministic) techniques or through a random search of the space of all possible labelings.
Incremental vs. Non-incremental: This issue arises when the pattern set to be clustered is very large and constraints on memory space or execution time affect the architecture of the algorithm.
The advent of data mining, where relevant data needs to be extracted from billions of disparate records held in one or more repositories, has furthered the development of clustering algorithms designed to minimize the number of scans over the data and therefore reduce the load on servers. Incremental clustering is based on the assumption that patterns can be considered one at a time and assigned to existing clusters.
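One way to picture the one-pattern-at-a-time assumption is the single-pass sketch below. It is an illustration only: the distance threshold, the rule of opening a new cluster when no existing centroid is close enough, and the running-mean centroid update are all assumptions that do not come from the text, and incremental_cluster is a hypothetical name.

```python
from math import dist  # Euclidean distance, Python 3.8+


def incremental_cluster(patterns, threshold):
    """Assign each pattern, in arrival order, to the nearest cluster or a new one."""
    centroids = []  # running mean of each cluster seen so far
    counts = []     # number of patterns absorbed by each cluster
    labels = []     # cluster index given to each pattern, in input order
    for p in patterns:
        # Look for the nearest existing centroid.
        best, best_d = None, None
        for k, c in enumerate(centroids):
            d = dist(p, c)
            if best_d is None or d < best_d:
                best, best_d = k, d
        if best is None or best_d > threshold:
            # Nothing close enough (assumed rule): open a new cluster.
            centroids.append(tuple(p))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Absorb the pattern and update the centroid as a running mean.
            counts[best] += 1
            n = counts[best]
            centroids[best] = tuple(
                c + (x - c) / n for c, x in zip(centroids[best], p)
            )
            labels.append(best)
    return labels, centroids


stream = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0), (0.0, 1.0)]
labels, centroids = incremental_cluster(stream, threshold=3.0)
print(labels)
print(centroids)
```

Each pattern is read exactly once and only the centroids and counts are kept in memory, which is the property that makes this style attractive for very large pattern sets. The trade-off is that the result depends on the order in which patterns arrive.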
The process of data clustering is sometimes closely associated with such terms as cluster analysis, automatic classification and numerical taxonomy.