This is part two of two looking at the performance of methods for determining the number of clusters in k-means clustering. For part one, see Optimal “k” when there’s no cluster? Pham vs Gap Part I.
All of the examples, as well as plots can be found at this Github repo under the directory
In the previous post, we looked at how the gap statistic and Pham et. al. method behave in several conditions under the reference distribution.
In this post we’ll look at how both methods behave under simulations performed in the gap statistic paper. Unlike Part I where we looked at how those methods behaved when there is no clustering, here we will look at how methods behave when there is clustering.
Prediction strength algorithm
We’ll review the prediction strength algorithm briefly here. The paper is pretty good, and the technique is pretty simple and clever.
First, let us define the term co-membership. Co-membership is defined by Tibshirani et. al. as an indicator variable which is 1 if the two observations belong to the same cluster, as the clustering is defined, and 0 otherwise.
The prediction strength idea depends heavily on cross validation and this idea of co-membership. The overview of the algorithm is below:
- Randomly partition all of your data into a training set and a testing set
- Cluster the training set, then cluster the test set
- Compute the co-membership matrix of the test set using the clustering from the training data (this is the “prediction”).
- Compute the co-membership matrix of the test set using the testing data (this is the “truth”)
- Evaluate how well the training clustering predicts the co-membership of the testing clustering
This procedure is performed using cross-validation several times for each candidate k. Then, for each k, the mean cross-validation prediction score is computed. The largest k such that the mean prediction strength is greater than some value (in the paper 0.8 or 0.9) is chosen as the optimal k.
I recreated the simulations from the gap statistic paper. For the most part, they are the same, or similar, but vary the sample size. I’ll quote the paper, then give any differences below.
For each configuration, I did 100 realizations. I did not include simulation (a) as Part I does a more in depth analysis of it.
All simulations were implemented in
R. I used the following implementations:
clusGap()from the the
clusterpackage for the gap statistic (see my parallelized fork of the cluster package)
fpcpackage for the prediction strength method
kselectionpackage for the Pham method
three clusters in two dimensions – the clusters are standard normal variable with (25, 25, 30) observations, centered at (0, 0), (0, 5), and (5, -3).
The implementation, (
sim_b()), follows this description exactly. There are three additional sample sizes tested. In total, these are the ones simulated: (25, 50, 25), (50, 100, 50), (125, 250, 125), (2500, 5000, 2500).
four clusters in three dimensions – each cluster was randomly chosen to have 25 or 50 standard normal observations, with centres randomly chosen as (any simulation with clusters having a minimum distance less than 1.0 units between them was discarded;
The implementation (
sim_c()) follows the description pretty similarly, except I simulated in four dimensions instead of three due to a typo. The sample sizes tested were: (25, 50), (50, 100), (125, 250), (2500, 5000).
four clusters in 10 dimensions – each cluster was randomly chosen to have 25 or 50 standard normal observations, with centres randomly chosen as (any simulation with clusters having a minimum distance less than 1.0 units between them was discarded; in this and the previous scenario, the settings are such that about half of the random realizations were discarded);
The implementation (
sim_d()) is exactly as described. The sample sizes tested were: (25, 50), (50, 100), (125, 250), (2500, 5000).
two elongated clusters in three dimensions – each cluster is generated as follows Set with taking 100 equally spaced values from -0.5 to 0.5 and then Gaussian noise with standard deviation 0.1 added to each feature. Cluster 2 is generated the same way, except that the value 10 is added to each feature at the end. The result is two elongated clusters, stretching out along the main diagonal of a three dimensional cube.
sim_e() is implemented exactly as described. The sample sizes tested were: 100, 250, 500, 7500. The sample size 15,000 was also tested, but isn’t displayed.
Correct classification of k
Perhaps the most important factor of all of these methods is accuracy. The figure below shows how each method did in the different scenarios:
The header of each subfigure shows the sample size for each simulation.
The table of counts can be seen below in the Github repository by loading
part2/results.RData and looking at the variable
X is the simulation identifier.
We note that simulation (b) is quite simple, and all methods do well when the sample size is large. As in the null analysis (Part I), the Pham statistic does quite poorly in small sample sizes, but does better as the sample size grows. The gap statistic and prediction strength statistic do quite well in all simulations here.
This simulation is a slightly more difficult than the one in the original paper as it is in higher dimension (4 vs 3) but with the same amount of variance. Still, the gap statistic does very well, whereas the Pham method habitually thinks there are two clusters instead of four. The prediction strength algorithm does better as the sample size increases. This behavior is expected, as when you have a small sample size, the probability of sampling all clusters representatively is lower.
This simulation is very similar to simulation (c), but is in a much higher dimension (10 vs 4) and with smaller variance. The gap statistic and prediction strength statistic do quite well in all situations, where again the Pham method does very poorly.
In the gap statistic paper, this simulation was handled quite well by the gap statistic, but very poorly by almost all other methods. To my surprise, the Pham method does well in this situation.
Here, we note the prediction strength method does quite poorly. A similar simulation was performed in the prediction strength paper, and they also note that it does poorly. In that paper there is a note mentioning that a good way to model an elongated cluster is by modeling several spheres. We note that as the sample size grows, so does the number of clusters selected. This is further evidence for the sphere argument. Below is a simulation with sample size 7,500 (per cluster), where k is selected using the prediction strength method. One can see that while we expected there to be two clusters, the optimal k selected (10) seems like a reasonable fit depending on your assumptions.
The timing summary can be seen in the figure below. Note that the x axes are
As expected, Pham is significantly faster — it’s not really doing much other than computing a statistic based on the within sum of squares. Interestingly, the gap statistic is faster for smaller sample sizes, but for larger sample sizes, the prediction strength method is faster than the gap statistic. There is a point for which it becomes more costly to generate uniform samples than to perform sub-sampling and cross-validation.
The simulations performed recreated the simulations from the original gap statistic paper. They are quite simple, but test a variety of different cases with varying sample sizes:
- Simple Gaussian clusters in low dimension
- Gaussian clusters with more variance in low dimension
- Gaussian clusters with higher variance in higher dimensions
- Elongated clusters in low dimension
The gap statistic does quite well in all cases. This isn’t so surprising as the simulations were from the paper, but I think in general the gap statistic has a pretty concrete formulation and is a well thought out algorithm. The Pham method did quite poorly in the two higher variance cases. This is a bit worrisome as the simulations aren’t that complex, and it’s likely that real data is much messier than the simulations.
In general we see that when the sample size is smaller, the prediction strength algorithm doesn’t do as well as in large sample size situations. This is somewhat expected as with smaller sample sizes it becomes less probable that you would generate representative samples from each cluster.
Given these test cases, perhaps the best choice is still the gap statistic if you have a moderate sample size. While this is true, the prediction strength algorithm will likely scale better with large data sets, and depending on your assumptions of the data, might perform comparably to the gap statistic. The Pham method is appealing in runtime as you basically get the statistic for free after running clustering with the selected k. That being said, it’s performance on these simple examples is quite poor. I don’t think I would personally trust it too much on real data. If you’re looking for something that is computationally pretty cheap and has decent performance, the Krzanowski and Lai method cited in the gap statistic and prediction strength papers seems to perform reasonably well on these simulations and has low computational costs. I haven’t personally benchmarked it, but it seems to do OK in those papers. As usual, there is no substitute for randomization.