This is part two of two looking at the performance of methods for determining the number of clusters in k-means clustering. For part one, see Optimal “k” when there’s no cluster? Pham vs Gap Part I.
All of the examples, as well as plots can be found at this Github repo under the directory
In the previous post, we looked at how the gap statistic and Pham et. al. method behave in several conditions under the reference distribution.
In this post we’ll look at how both methods behave under simulations performed in the gap statistic paper. Unlike Part I where we looked at how those methods behaved when there is no clustering, here we will look at how methods behave when there is clustering.
Prediction strength algorithm
We’ll review the prediction strength algorithm briefly here. The paper is pretty good, and the technique is pretty simple and clever.
First, let us define the term co-membership. Co-membership is defined by Tibshirani et. al. as an indicator variable which is 1 if the two observations belong to the same cluster, as the clustering is defined, and 0 otherwise.
The prediction strength idea depends heavily on cross validation and this idea of co-membership. The overview of the algorithm is below:
- Randomly partition all of your data into a training set and a testing set
- Cluster the training set, then cluster the test set
- Compute the co-membership matrix of the test set using the clustering from the training data (this is the “prediction”).
- Compute the co-membership matrix of the test set using the testing data (this is the “truth”)
- Evaluate how well the training clustering predicts the co-membership of the testing clustering
This procedure is performed using cross-validation several times for each candidate k. Then, for each k, the mean cross-validation prediction score is computed. The largest k such that the mean prediction strength is greater than some value (in the paper 0.8 or 0.9) is chosen as the optimal k.
I recreated the simulations from the gap statistic paper. For the most part, they are the same, or similar, but vary the sample size. I’ll quote the paper, then give any differences below.
For each configuration, I did 100 realizations. I did not include simulation (a) as Part I does a more in depth analysis of it.
All simulations were implemented in
R. I used the following implementations:
clusGap()from the the
clusterpackage for the gap statistic (see my parallelized fork of the cluster package)
fpcpackage for the prediction strength method
kselectionpackage for the Pham method
three clusters in two dimensions – the clusters are standard normal variable with (25, 25, 30) observations, centered at (0, 0), (0, 5), and (5, -3).
The implementation, (
sim_b()), follows this description exactly. There are three additional sample sizes tested. In total, these are the ones simulated: (25, 50, 25), (50, 100, 50), (125, 250, 125), (2500, 5000, 2500).
four clusters in three dimensions – each cluster was randomly chosen to have 25 or 50 standard normal observations, with centres randomly chosen as (any simulation with clusters having a minimum distance less than 1.0 units between them was discarded;
The implementation (
sim_c()) follows the description pretty similarly, except I simulated in four dimensions instead of three due to a typo. The sample sizes tested were: (25, 50), (50, 100), (125, 250), (2500, 5000).
four clusters in 10 dimensions – each cluster was randomly chosen to have 25 or 50 standard normal observations, with centres randomly chosen as (any simulation with clusters having a minimum distance less than 1.0 units between them was discarded; in this and the previous scenario, the settings are such that about half of the random realizations were discarded);
The implementation (
sim_d()) is exactly as described. The sample sizes tested were: (25, 50), (50, 100), (125, 250), (2500, 5000).
two elongated clusters in three dimensions – each cluster is generated as follows Set with taking 100 equally spaced values from -0.5 to 0.5 and then Gaussian noise with standard deviation 0.1 added to each feature. Cluster 2 is generated the same way, except that the value 10 is added to each feature at the end. The result is two elongated clusters, stretching out along the main diagonal of a three dimensional cube.
sim_e() is implemented exactly as described. The sample sizes tested were: 100, 250, 500, 7500. The sample size 15,000 was also tested, but isn’t displayed.
Correct classification of k
Perhaps the most important factor of all of these methods is accuracy. The figure below shows how each method did in the different scenarios:
The header of each subfigure shows the sample size for each simulation.
The table of counts can be seen below in the Github repository by loading
part2/results.RData and looking at the variable
X is the simulation identifier.
We note that simulation (b) is quite simple, and all methods do well when the sample size is large. As in the null analysis (Part I), the Pham statistic does quite poorly in small sample sizes, but does better as the sample size grows. The gap statistic and prediction strength statistic do quite well in all simulations here.
This simulation is a slightly more difficult than the one in the original paper as it is in higher dimension (4 vs 3) but with the same amount of variance. Still, the gap statistic does very well, whereas the Pham method habitually thinks there are two clusters instead of four. The prediction strength algorithm does better as the sample size increases. This behavior is expected, as when you have a small sample size, the probability of sampling all clusters representatively is lower.
This simulation is very similar to simulation (c), but is in a much higher dimension (10 vs 4) and with smaller variance. The gap statistic and prediction strength statistic do quite well in all situations, where again the Pham method does very poorly.
In the gap statistic paper, this simulation was handled quite well by the gap statistic, but very poorly by almost all other methods. To my surprise, the Pham method does well in this situation.
Here, we note the prediction strength method does quite poorly. A similar simulation was performed in the prediction strength paper, and they also note that it does poorly. In that paper there is a note mentioning that a good way to model an elongated cluster is by modeling several spheres. We note that as the sample size grows, so does the number of clusters selected. This is further evidence for the sphere argument. Below is a simulation with sample size 7,500 (per cluster), where k is selected using the prediction strength method. One can see that while we expected there to be two clusters, the optimal k selected (10) seems like a reasonable fit depending on your assumptions.
The timing summary can be seen in the figure below. Note that the x axes are
As expected, Pham is significantly faster — it’s not really doing much other than computing a statistic based on the within sum of squares. Interestingly, the gap statistic is faster for smaller sample sizes, but for larger sample sizes, the prediction strength method is faster than the gap statistic. There is a point for which it becomes more costly to generate uniform samples than to perform sub-sampling and cross-validation.
The simulations performed recreated the simulations from the original gap statistic paper. They are quite simple, but test a variety of different cases with varying sample sizes:
- Simple Gaussian clusters in low dimension
- Gaussian clusters with more variance in low dimension
- Gaussian clusters with higher variance in higher dimensions
- Elongated clusters in low dimension
The gap statistic does quite well in all cases. This isn’t so surprising as the simulations were from the paper, but I think in general the gap statistic has a pretty concrete formulation and is a well thought out algorithm. The Pham method did quite poorly in the two higher variance cases. This is a bit worrisome as the simulations aren’t that complex, and it’s likely that real data is much messier than the simulations.
In general we see that when the sample size is smaller, the prediction strength algorithm doesn’t do as well as in large sample size situations. This is somewhat expected as with smaller sample sizes it becomes less probable that you would generate representative samples from each cluster.
Given these test cases, perhaps the best choice is still the gap statistic if you have a moderate sample size. While this is true, the prediction strength algorithm will likely scale better with large data sets, and depending on your assumptions of the data, might perform comparably to the gap statistic. The Pham method is appealing in runtime as you basically get the statistic for free after running clustering with the selected k. That being said, it’s performance on these simple examples is quite poor. I don’t think I would personally trust it too much on real data. If you’re looking for something that is computationally pretty cheap and has decent performance, the Krzanowski and Lai method cited in the gap statistic and prediction strength papers seems to perform reasonably well on these simulations and has low computational costs. I haven’t personally benchmarked it, but it seems to do OK in those papers. As usual, there is no substitute for randomization.
K-means clustering is a fairly common and useful algorithm. One of the difficulties in using k-means is choosing the number of clusters, . I use the gap statistic pretty regularly for choosing k in my analyses as I’ve found it to perform reasonably well in practice. Recently, I’ve been thinking a bit about clustering on Spark with large data sets and while the gap statistic works well on small data sets, it can be a major pain with lots of samples.
This post isn’t meant to be authoritative, I was just curious to see how the Pham et. al. method and gap statistic perform under various sample sizes and dimensions when there is no clustering under a uniform distribution.
- is the number of data points (sample size)
- is the number of dimensions of each data point
- is the number of clusters
- is the maximum number of clusters ()
Why is the gap statistic so slow?
For a review of the gap statistic, checkout this great post by The Data Science Lab. The basics in the original paper are pretty easy to follow as well. The main reason it is relatively slow is that you generate samples from a uniform distribution over the same range as the original data, times. In each iteration, you then run the same clustering algorithm with your candidate values. This means that on top of clustering your original data times you have to cluster data sets, each of sample size . When and aren’t too large, it’s pretty tractable — but it can grow out of control rather quickly as to get stable estimates of the optimal , has to be somewhat large (the authors of the R packages cluster recommend as much as !).
Pham et. al.
The Pham et. al. method was also recently reviewed at the Data Science Lab. Computationally, the method is preferable to the gap statistic as it does not require a bunch of sampling — you only have to run k-means times and compute a statistic that depends on the within cluster sum of squares. That being said, while the approach is somewhat justified in the paper, it smells a lot like a heuristic, at least when compared to something like the gap statistic.
Why do these simulations?
Mostly curiosity. I will say that the gap statistic is a much more principled approach and I was curious how the Pham method would hold up. Also, the Pham simulations left quite a bit to be desired. As far as I could tell, the simulations weren’t repeated (or at least reported), which is generally bad form for such simple simulations.
Performance under the reference distribution
Both methods assume that when there is no clustering, the reference distribution is uniform over the -dimensional space. For the simulations here, samples were generated uniformly at random from the unit hypercube. I performed 100 realizations of each simulation, where I vary and . Clustering was performed using the standard R implementation of k-means with 25 starts.
For the Pham method, I used the kselection package with default parameters. For the gap statistic, I used the cluster package. The gap statistic was run with . You can find the R code in this repo, along with a RData file with the results.
Below is a figure with the k selected by the each algorithm in several different conditions. The title for each subfigure is : the dimension and sample size. The histograms are in “dodge” mode.
Generally, the gap statistic outperforms the Pham method, but as the dimensionality and sample size grow, they tend to perform comparably (under these conditions). The situations where the Pham method does quite poorly is with sample size and few dimensions. In such cases it doesn’t really make sense to not use the gap statistic as it can still be computed quickly.
A closer look at Pham
We can look at only the simulations which did not choose correctly to see what might be going awry. Below I’ve plotted for the incorrectly classified simulations when :
Each line is a different simulation and the dotted red line marks , the recommended threshold at which to choose the minimal value of . The proposed rule in the paper is to choose the minimal such that . If there are no values of , the algorithm returns . It’s pretty clear that for this particular setup, 0.85 is not a good threshold.
If we look at those which correctly chose , we note that many of the simulations almost crossed this threshold of 0.85.
Each simulation was run on a separate single core. The gap statistic is embarrassingly parallel, though the R implementation does not exploit this (I think I’m smelling a C++11 threaded implementation in my future). I will also note that this is a large shared machine, so it’s possible at some point there were other tasks competing for CPU time (the machine isn’t running resource management).
The summary of results can be seen below (histograms are in “stack” mode):
It’s pretty clear that for small sample sizes there is little time benefit from using the Pham method. Though, as the dimension and sample size grow, the gap statistic grows out of control. At , which in the grand scheme of things is a pretty small dataset, the gap statistic is on average over 200 times slower than the Pham method.
The Pham method is a somewhat convenient heuristic that has been proposed for determining the optimal number of clusters when the number of clusters isn’t known a priori. We looked at very simple simulations where there is no clustering in the data to see how the Pham method compares to the gap statistic. For small sample sizes and/or few dimensions, the gap statistic does significantly better. The Pham method especially has trouble with small sample sizes and few dimensions. This problem seems to arise from the selection of the threshold at which to choose the optimal k. The original paper suggests 0.85 which clearly isn’t best for all datasets.
While the low dimensional situation might seem nonexistent in common datasets, it’s somewhat common to reduce a high dimensional data set to low dimensional space before clustering using a method such as principal component analysis.
These simulations don’t really prove anything beyond the very naive conditions we tested. In a possible future post, I will recreate the simulations in the gap statistic paper and look for other papers which have comparable simulations to see how each method performs on data with clusters. I will also likely expand the methods tested to a larger group, so please comment with (automated) methods you think should be compared.