Clustering, both as a problem and as a practice, has proven remarkably durable across many domains. Testifying to this is the presence of “clustering” and “cluster analysis” as established terms in major subject classification systems.
Cluster analysis is a generic name for a large family of statistical methods that aim to detect groups of similar objects, called clusters, in a sample. Essential to cluster analysis is that, in contrast to discriminant analysis, the group structure need not be known a priori. This makes cluster analysis attractive as an exploratory tool, which is why it is the subject of this essay.
The term “cluster analysis” was first introduced in 1939 by the American psychologist Robert Tryon. He believed that, instead of factor analysis, it is better to single out “bunches” of indicators, and he proposed a corresponding method that amounted to searching for groups of closely correlated attributes. The name of the discipline comes from the English word cluster, meaning “bunch” (in Russian, «гроздь»); for this reason, cluster analysis was at first called “bunch analysis” in the USSR.
Most modern clustering methods were proposed in the 1960s, a period marked by a huge number of publications. Among the most important contributions are those of G. Ball and D. Hall and of J. MacQueen (k-means methods); of R. Sokal and P. Sneath, G. Lance and W. Williams, and N. Jardine and R. Sibson (hierarchical procedures); and others. Yet, despite all this work, no single classification of clustering procedures exists today. The notion of a “cluster” cannot be precisely defined, which is one reason why there are so many clustering algorithms. Different researchers use different cluster models, and understanding these models is the key to understanding the differences between algorithms. Typical cluster models include:
• connectivity models: for example, hierarchical clustering builds models based on distance connectivity;
• centroid models: for example, the k-means algorithm represents each cluster by a single mean vector;
• distribution models: clusters are modeled using statistical distributions, such as multivariate normal distributions used by the expectation-maximization algorithm;
• density models: for example, DBSCAN and OPTICS define clusters as connected dense regions in the data space;
• subspace models: in biclustering (also known as co-clustering or two-mode-clustering), clusters are modeled with both cluster members and relevant attributes;
• group models: some algorithms do not provide a refined model for their results and just provide the grouping information;
• graph-based models: a clique, that is, a subset of nodes in a graph in which every two nodes are connected by an edge, can be considered a prototypical form of cluster;
• signed graph models: every path in a signed graph carries a sign given by the product of the signs of its edges, which can be used to partition the nodes;
• neural models: the best-known unsupervised neural network is the self-organizing map. These models can usually be characterized as similar to one or more of the models above, including subspace models when the neural network implements a form of principal component analysis or independent component analysis.
A “clustering” is essentially a set of such clusters, usually containing all objects in the data set. Additionally, it may specify the relationship of the clusters to each other, for example, a hierarchy of clusters embedded in each other.
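As an illustration of the centroid model listed above, the k-means idea can be sketched in a few lines of pure Python. The data points here are made up, and the implementation (random initialization, fixed iteration count) is a deliberately minimal sketch, not a production algorithm:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: each cluster is represented by its mean vector (centroid)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters

# two well-separated groups of 2-D points (hypothetical data)
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.2)]
centroids, clusters = kmeans(data, k=2)
```

Each cluster is represented solely by its mean vector, which is exactly what distinguishes centroid models from, say, density models such as DBSCAN.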
It is important to understand what makes a clustering good. Experts commonly highlight three aspects: a good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity; the quality of a clustering result depends both on the similarity measure used by the method and on its implementation; and the quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
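The first criterion, high intra-class and low inter-class similarity, can be made concrete with distances: within-cluster distances should be small relative to between-cluster distances. The two partitions of the same six points below are hypothetical, a minimal sketch of the idea:

```python
from itertools import combinations
from math import dist

def mean_intra_inter(clusters):
    """Mean pairwise distance within clusters vs. between different clusters."""
    intra = [dist(p, q) for cl in clusters for p, q in combinations(cl, 2)]
    inter = [dist(p, q) for a, b in combinations(clusters, 2) for p in a for q in b]
    return sum(intra) / len(intra), sum(inter) / len(inter)

# the same six points, partitioned well and badly (made-up data)
good = [[(0, 0), (0, 1), (1, 0)], [(9, 9), (9, 10), (10, 9)]]
bad = [[(0, 0), (9, 9), (0, 1)], [(9, 10), (1, 0), (10, 9)]]
intra_g, inter_g = mean_intra_inter(good)
intra_b, inter_b = mean_intra_inter(bad)
```

For the good partition, the mean within-cluster distance is far smaller than the mean between-cluster distance; for the bad one, the two values are of comparable size, which is exactly the symptom of a poor grouping.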
Cluster analysis is used in many fields of science and human activity. It found its first applications in biology, where cluster methods make it convenient to classify living beings by various characteristics and to explore relationships and the history of the development of life on Earth. Cluster methods have also found application in disciplines related to biology. In bioinformatics they are actively used to study complex networks of gene interaction and to uncover their hidden properties. In medicine they help analyze the characteristics of diseases and the effectiveness of their treatment, and they are used to build knowledge bases and expert systems. In pharmaceutics they support the study of drug effectiveness across different groups of people, and much more.
Cluster analysis has long since gone beyond biology. Examples of its use include:
• marketing: helping marketers discover distinct groups in their customer bases and use this knowledge to develop targeted marketing programs;
• informatics: simplification of work with information, data visualization, image segmentation, intelligent search;
• economics: analysis of markets and financial flows, deriving patterns on stock exchanges;
• linguistics: restoration of the evolutionary tree of languages;
• astronomy: selection of groups of stars and galaxies, automatic processing of satellite images;
• land use: identification of areas of similar land use in an earth observation database;
• insurance: identifying groups of motor insurance policy holders with a high average claim cost;
• city-planning: identifying groups of houses according to their house type, value, and geographical location;
• earthquake studies: observed earthquake epicenters can be clustered along continental faults.
There are certain concepts and statistics associated with cluster analysis:
1. The agglomeration schedule gives information on the objects or cases being combined at each stage of the hierarchical clustering process.
2. The cluster centroid is the mean value of each variable over all the cases or objects in a particular cluster.
3. A dendrogram is a graphical device for displaying clustering results.
4. Distances between cluster centers indicate how well separated individual pairs of clusters are. Widely separated clusters are distinct and therefore desirable.
5. The similarity/distance coefficient matrix is a lower-triangular matrix containing pairwise distances or similarities between objects or cases.
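Items 2, 4, and 5 can be computed directly. The objects and the two-cluster assignment below are hypothetical, a minimal sketch of the definitions:

```python
from math import dist

objects = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (8.2, 7.6)]  # made-up cases
clusters = {"A": [0, 1], "B": [2, 3]}  # hypothetical cluster assignment

# item 2: cluster centroid = per-variable mean over the cluster's cases
centroids = {
    name: tuple(sum(objects[i][d] for i in idx) / len(idx) for d in range(2))
    for name, idx in clusters.items()
}

# item 4: distance between cluster centers (large = well-separated clusters)
separation = dist(centroids["A"], centroids["B"])

# item 5: lower-triangular matrix of pairwise distances between objects
lower = [[dist(objects[i], objects[j]) for j in range(i)] for i in range(len(objects))]
```

Only the strictly lower triangle is stored, since the distance matrix is symmetric and its diagonal is zero.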
The history of cluster analysis spans less than a hundred years, yet it has become an integral part of data processing in many sciences and other areas of human activity. This active development is largely associated with the growth of computing power and its widespread availability: without such technology, the very idea of cluster analysis, the automation of grouping objects, loses much of its meaning. Perhaps for this reason, little work in this direction was carried out before the last century.
Today, cluster analysis is one of the most effective tools for processing large amounts of data and is used wherever computer technology is applied.