# Optimal “k” when there’s no cluster? Pham vs Gap Part I

*K*-means clustering is a fairly common and useful algorithm. One of the difficulties in using *k*-means is choosing the number of clusters, . I use the gap statistic pretty regularly for choosing *k* in my analyses as I’ve found it to perform reasonably well in practice. Recently, I’ve been thinking a bit about clustering on Spark with large data sets and while the gap statistic works well on small data sets, it can be a major pain with lots of samples.

#### Disclaimer

This post isn’t meant to be authoritative, I was just curious to see how the Pham *et. al.* method and gap statistic perform under various sample sizes and dimensions when there is no clustering under a uniform distribution.

## Some notation

- is the number of data points (sample size)
- is the number of dimensions of each data point
- is the number of clusters
- is the maximum number of clusters ()

## Why is the gap statistic so slow?

For a review of the gap statistic, checkout this great post by The Data Science Lab. The basics in the original paper are pretty easy to follow as well. The main reason it is relatively slow is that you generate samples from a uniform distribution over the same range as the original data, times. In each iteration, you then run the same clustering algorithm with your candidate values. This means that on top of clustering your original data times you have to cluster data sets, each of sample size . When and aren’t *too* large, it’s pretty tractable — but it can grow out of control rather quickly as to get stable estimates of the optimal , has to be somewhat large (the authors of the R packages cluster recommend as much as !).

## Pham *et. al.*

The Pham *et. al.* method was also recently reviewed at the Data Science Lab. Computationally, the method is preferable to the gap statistic as it does not require a bunch of sampling — you only have to run *k*-means times and compute a statistic that depends on the within cluster sum of squares. That being said, while the approach is somewhat justified in the paper, it smells a lot like a heuristic, at least when compared to something like the gap statistic.

## Why do these simulations?

Mostly curiosity. I will say that the gap statistic is a much more principled approach and I was curious how the Pham method would hold up. Also, the Pham simulations left quite a bit to be desired. As far as I could tell, the simulations weren’t repeated (or at least reported), which is generally bad form for such simple simulations.

# Performance under the reference distribution

Both methods assume that when there is no clustering, the reference distribution is uniform over the -dimensional space. For the simulations here, samples were generated uniformly at random from the unit hypercube. I performed 100 realizations of each simulation, where I vary and . Clustering was performed using the standard R implementation of *k*-means with 25 starts.

For the Pham method, I used the kselection package with default parameters. For the gap statistic, I used the cluster package. The gap statistic was run with . You can find the R code in **this repo**, along with a RData file with the results.

## General results

Below is a figure with the *k* selected by the each algorithm in several different conditions. The title for each subfigure is : the dimension and sample size. The histograms are in “dodge” mode.

Generally, the gap statistic outperforms the Pham method, but as the dimensionality and sample size grow, they tend to perform comparably (under these conditions). The situations where the Pham method does quite poorly is with sample size and few dimensions. In such cases it doesn’t really make sense to not use the gap statistic as it can still be computed quickly.

## A closer look at Pham

We can look at only the simulations which did not choose correctly to see what might be going awry. Below I’ve plotted for the incorrectly classified simulations when :

Each line is a different simulation and the dotted red line marks , the recommended threshold at which to choose the minimal value of . The proposed rule in the paper is to choose the minimal such that . If there are no values of , the algorithm returns . It’s pretty clear that for this particular setup, 0.85 is not a good threshold.

If we look at those which correctly chose , we note that many of the simulations almost crossed this threshold of 0.85.

## Timing

Each simulation was run on a separate single core. The gap statistic is embarrassingly parallel, though the R implementation does not exploit this (I think I’m smelling a C++11 threaded implementation in my future). I will also note that this is a large shared machine, so it’s possible at some point there were other tasks competing for CPU time (the machine isn’t running resource management).

The summary of results can be seen below (histograms are in “stack” mode):

It’s pretty clear that for small sample sizes there is little time benefit from using the Pham method. Though, as the dimension and sample size grow, the gap statistic grows out of control. At , which in the grand scheme of things is a pretty small dataset, the gap statistic is on average over 200 times slower than the Pham method.

# Discussion

The Pham method is a somewhat convenient heuristic that has been proposed for determining the optimal number of clusters when the number of clusters isn’t known a priori. We looked at very simple simulations where there is no clustering in the data to see how the Pham method compares to the gap statistic. For small sample sizes and/or few dimensions, the gap statistic does significantly better. The Pham method especially has trouble with small sample sizes and few dimensions. This problem seems to arise from the selection of the threshold at which to choose the optimal *k*. The original paper suggests 0.85 which clearly isn’t best for all datasets.

While the low dimensional situation might seem nonexistent in common datasets, it’s somewhat common to reduce a high dimensional data set to low dimensional space before clustering using a method such as principal component analysis.

## Future work

These simulations don’t really prove anything beyond the very naive conditions we tested. In a possible future post, I will recreate the simulations in the gap statistic paper and look for other papers which have comparable simulations to see how each method performs on data with clusters. I will also likely expand the methods tested to a larger group, so please comment with (automated) methods you think should be compared.

Hi Harold,

Thank you for the clear, easy-to-understand posts. At the end of this one, you mention reducing dimensionality prior to clustering. I wonder if you could elaborate on this some in a future post? Two questions in particular come to mind: 1) what is an appropriate number of dimensions to reduce to prior to clustering, and 2) how do you identify those dimensions? For the first question, this is particularly difficult for highly-dimensional data, where you might have 10^4 dimensions compared to 10^1-10^3 samples. Should the dimensions always be less than the number of samples? There are several cluster packages in R that get angry if there are more dimensions than samples. For the second question, I left it general on purpose, even though you suggest PCA. Using PCA as an example, there are several tools out there for determining how correlated or significant a dimension is to a particular principal component (or less preferable, you could just select the dimensions with the highest absolute values on the relevant eigenvectors). However, when dealing with 10^4 dimensions, I have found that there are typically a few thousand dimensions that are highly correlated or significant with a set of principal components. This still seems like too many to me (and for my data in particular, I know that there are only a few hundred that are really relevant). Your thoughts would be greatly appreciated!

Hi Harold, check out our paper for a systematic comparison of clustering methods:

http://www.nature.com/articles/srep06207

Critical limitations of consensus clustering in class discovery

Regards,

Yasin Senbabaoglu