Clustering algorithms are techniques for assigning items to automatically created groups based on a measurement of similarity between items and groups.  Many clustering algorithms exist, and it is important to recognize that no single algorithm will produce the “best” results in all situations. Some related techniques for unsupervised learning that are not typically thought of as clustering are described in the following section. Additional techniques that incorporate “labelled data” for supervised and active learning are addressed in the next chapter.

Information retrieval applications of clustering have included the organization of documents on the basis of the terms they contain and co-occurring citations, the automated construction of thesauri and lexicons, and the creation of part-of-speech and sense taggers.  These applications share much in common with exploratory data analysis applications.

There are two major styles of clustering:

  • Partitioning.  Each example is assigned to one of k partitions (or in the case of fuzzy partitioning, each example is assigned a degree of membership to each partition).
  • Hierarchical clustering.  Examples are assigned to groups that are organized as a tree structure.  Each example is assigned to a leaf and to the nodes that are in the path up to the root node.

The simplest partitioning algorithm is known as k-means.  It works as follows. We initialize k distinct values that are initial guesses of the mean values for k distinct partitions.  Then iteratively, we assign all examples to the nearest partition (based on some metric) and we recompute the value of each mean.  We terminate when the mean values do not change within some tolerance.
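The iteration just described can be sketched as follows.  This is a minimal NumPy illustration, assuming Euclidean distance as the metric and caller-supplied initial means; the function and parameter names are illustrative, not from any particular library.

```python
import numpy as np

def kmeans(X, init_means, tol=1e-6, max_iter=100):
    """k-means: alternately assign examples to the nearest mean and
    recompute each mean, until the means stop moving (within tol)."""
    means = np.array(init_means, dtype=float)   # k initial guesses for the means
    for _ in range(max_iter):
        # Assign every example to the nearest partition (Euclidean metric).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the mean of each partition (keep the old mean if empty).
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(len(means))])
        if np.linalg.norm(new_means - means) < tol:
            means = new_means
            break
        means = new_means
    return means, labels
```

For two well-separated groups of points, the means converge to the group centroids after a single assignment step.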

The k-means algorithm is sometimes used to seed other algorithms.  It has the advantage that it is fast and easy to implement.  It has the disadvantage that it may produce poor results if the guessed number of partitions does not match the actual structure of the data well. Variations include robust methods, such as using a median rather than a mean, and methods that compute fuzzy membership.

Hierarchical clustering uses a similarity measure to cluster using either a top-down (or divisive) strategy or a bottom-up (or agglomerative) strategy.  Divisive strategies start with all examples in a single cluster.  A criterion governs how the cluster is split and how the resulting clusters are in turn split. Agglomerative strategies start with each example in its own (singleton) cluster.  The most similar clusters are merged based on some criterion.  Agglomeration continues until either all clusters are merged or another stopping condition is met.

A dendrogram is a particularly useful visualization for hierarchically organized clusters. An example dendrogram appears below.


A dendrogram: The vertical axis shows a generalized measure of similarity among clusters. Here, at level 1 all eight points lie in singleton clusters. Points x6 and x7 are the most similar, and are merged at level 2, and so forth. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright  © 2001 by John Wiley & Sons, Inc.


Many different criteria exist for determining similarity between clusters.  These criteria combine two components: a measure of similarity between two examples and a strategy for evaluating the similarity between two groups of examples.  The measure of similarity between examples may generally be thought of as a dot product or, more rigorously, as a Mercer kernel.

Similarity between two groups of examples may be computed in a variety of ways.  The most common methods and how they apply to agglomerative clustering follow.

  • Single link: At each step the most similar pair of examples not in the same cluster is identified, and their containing clusters are merged. Easy to implement, but prone to chaining and unsuitable for poorly separated clusters.
  • Complete link: The most dissimilar pair of examples from two clusters determines the similarity of those clusters; the pair of clusters that is least dissimilar by this measure is merged. This method tends to create small, tightly bound clusters.
  • Group average link: The average of all pairwise similarities between examples in the two clusters is used to measure cluster similarity. This is a good compromise between the single-link and complete-link methods and generally performs well.
  • Ward’s method: The cluster pair whose merger produces the smallest increase in total within-group variance about the centroids is merged. This naturally produces a notion of a “center of gravity” for clusters and typically performs well, although it is sensitive to outliers and inhomogeneous clusters.

Many variations of these methods exist.
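The agglomerative procedure with the linkage strategies above can be sketched as follows.  This is a naive, deliberately simple illustration (quadratic in the number of clusters per merge, not an efficient implementation); the function name and signature are assumptions for the example.

```python
import numpy as np

def agglomerate(X, n_clusters, linkage="single"):
    """Naive agglomerative clustering: repeatedly merge the closest pair
    of clusters.  The linkage choice decides how pairwise example
    distances summarize the distance between two clusters."""
    reduce = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(X))]              # start with singletons
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    while len(clusters) > n_clusters:
        # Find the most similar (closest) pair of clusters under the linkage.
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = reduce(D[np.ix_(clusters[a], clusters[b])])
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]          # merge b into a
        del clusters[b]
    return clusters
```

Stopping at a fixed cluster count is only one possible stopping condition; recording each merge instead yields the tree that a dendrogram displays.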


Support Vector Clustering and similar techniques

Support Vector Clustering maps examples in data space to a high-dimensional feature space using a Gaussian kernel function

K(xᵢ, xⱼ) = exp(−q‖xᵢ − xⱼ‖²)

with width parameter q. In feature space the smallest sphere that encloses the image of the data is identified. When mapped back to data space, the sphere forms a set of contours which enclose the data points.  These contours form cluster boundaries; varying q yields a successive set of boundaries, with larger values of q producing a greater number of tighter contours.
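The kernel matrix underlying this method is straightforward to compute; a minimal sketch (the function name is illustrative, and the quadratic program that finds the enclosing sphere is not shown):

```python
import numpy as np

def gaussian_kernel(X, q):
    """Gaussian kernel matrix K[i, j] = exp(-q * ||x_i - x_j||^2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-q * sq_dists)
```

Note that K(x, x) = 1 for every example, so all images lie on the unit sphere in feature space, which is what makes the smallest-enclosing-sphere formulation natural.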

SVC is a variant of a similar method that uses Parzen window estimates of the density in feature space.  Other approaches exist.  The advantages of these techniques include the ability to cluster over a wide variety of shapes, the ability to control the influence of outliers, and (for some variants) the ability to gracefully deal with overlapping clusters.


Other unsupervised learning techniques

If a dataset has been generated by a process whose underlying distribution is a mixture of component densities, and if those component densities are described by a set of unknown parameters, then those parameters can be estimated from the data using Bayesian or maximum-likelihood techniques.
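As an illustration, the parameters of a two-component, one-dimensional Gaussian mixture can be estimated by maximum likelihood with the EM algorithm.  This is a minimal sketch; the component count, the crude initialization, and the small variance floor are assumptions of the example.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture: alternate between
    computing posterior responsibilities (E-step) and maximum-likelihood
    parameter updates (M-step)."""
    # Crude initialization: split the sorted data in half.
    x = np.sort(np.asarray(x, dtype=float))
    mu = np.array([x[: len(x) // 2].mean(), x[len(x) // 2 :].mean()])
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])           # mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        p = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
            / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates of the parameters.
        n = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
        pi = n / len(x)
    return pi, mu, var
```

On data drawn from two well-separated Gaussians, the estimated means converge close to the true component means.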