Can your lab read 250 papers over the summer? With leaderboard, they can

A demo is worth a million words. – Harold Pimentel

First, the demo [1]:

The Pritchard lab has been running a “paper reading club” every summer for the last few years. The general idea is to share the papers you read, say whether you recommend others read them, and note ideas you liked or are skeptical of in each paper.

This year I built a leaderboard application that keeps track of who read what and provides some summaries of the data. The goal of the leaderboard is to provide a place to share papers and to facilitate conversation about them amongst people with common interests.

From an end-user perspective, one simply logs the paper, a recommendation, and comments to a Slack channel (see the engineering section), and the reviews/summaries are automatically populated.

Given some reviews from users, leaderboard provides a few key summaries:

  • a searchable list of all the papers that have been read.
  • a count of how many people read each paper, along with their recommendations.
  • network analysis of the papers read.
  • and perhaps the most important metric: how many in the group have read more than Dear Leader.

In total, 10 of us logged papers on a regular basis and over the summer we logged a total of 249 papers (approximately 250)!

So what is the prize? Well, Jonathan got us all a cake (yum!). Despite my pleading, we allowed members of the lab who did not participate to also eat the cake. Of course, there is also the metaphorical cake: reading more papers.

Below is an artist’s interpretation of the top 3 readers with their “Congratulation!” cake:

Winners of Summer 2019 paper reading club

From left to right: Jeff Spence (39), Roshni Patel (39), and Alyssa Lyn Fortier (46).

An overview of the app follows.


This is the main page showing high-level summary statistics:

  • a histogram of the ranking.
  • a more detailed table showing the exact count for each person.
  • a time series plot showing how many papers have been read by each person.

DOI info

This is a quick way to get an idea of what people think of papers and the most read papers. Each paper is listed with the users that read it and their recommendations.
Additionally, basic information is listed (e.g. authors, link to paper, date of publication, etc.).

network analysis

Each node in the network is a person and node size is scaled by the number of papers read.
Each edge indicates papers read in common, and edge width is proportional to the number of shared papers.
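For the curious, the edge weights amount to set intersections. A minimal sketch in Python (the app itself is R/Shiny, and the names and reading data below are entirely made up):

```python
from itertools import combinations

# Hypothetical reader -> papers mapping; the real data lives in the sheet.
reads = {
    "alice": {"doi/a", "doi/b", "doi/c"},
    "bob":   {"doi/b", "doi/c"},
    "carol": {"doi/c"},
}

# Node size is scaled by the number of papers read.
node_size = {person: len(papers) for person, papers in reads.items()}

# Edge weight is the number of papers a pair read in common.
edge_weight = {
    (u, v): len(reads[u] & reads[v])
    for u, v in combinations(sorted(reads), 2)
    if reads[u] & reads[v]
}
```

With this toy data, the alice–bob edge (two shared papers) is drawn twice as thick as the bob–carol edge (one shared paper).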

word clouds

Word clouds of recommendations and comments.
Pretty silly, but fun nonetheless.

raw data

The full data set which is also searchable.
This is useful in case you are interested in detailed comments about a specific paper.

personal experiences

The format is pretty fun and honestly pretty useful. It creates a healthy, friendly competition in the lab to read consistently and write reviews in a meaningful way. I spent a lot of time reflecting on what I had to say about papers because I knew others in the lab would be reading it.

Perhaps unsurprisingly, recommendations tend to be pretty helpful. People tend to be fairly honest and put a fair amount of thought into their reviews. It is also nice to get an idea of what people are thinking about and what they found interesting or uninteresting about seminal papers.

I don’t think others did this, but I refereed a number of papers this summer and posted my full reviews. I figure this sort of post might be useful to young trainees who may not have much experience reviewing papers.

Finally, the suspense adds another dimension of fun. Each week I tried to provide a summary of the rankings (this should be automated in the future). There were numerous dark horses and the competition was quite fierce at times. You can see from the time series plot, Roshni (roshni_patel) was the leader for quite some time. Then Jeff (jspence) silently posted a bunch of papers directly on the spreadsheet without anyone noticing and became the leader. Shortly thereafter Alyssa (alyssa_lyn_fortier) was like “I’m having none of that!” [2] and logged tons of papers in just two weeks.

a paper worth posting?

A common question: is this paper and my review worth posting? There is no mandate but my general rule has been if I have read a paper closely enough to have an informed opinion about it, my review counts.


The general workflow looks like this:

The app itself is a Shiny R app hosted for free. Whenever there is a new post to the #paper_reading_hook Slack channel, a Zapier daemon is listening (set up by Nasa in the lab). A post to #paper_reading_hook looks like:


An example:

10.1038/nmeth.4324 YES one of the best papers I have ever read. you MUST read it!

Zapier then parses the post with the regex pattern ([^ ]+) ([^ ]+) (.*) and writes a row to a private Google Sheet in the format

doi handle  date    recommend   comments
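The parsing step is simple enough to sketch in Python (the real parsing happens inside Zapier; parse_post and the dict row layout here are just an illustration of the sheet columns above):

```python
import re
from datetime import date

# The same pattern the Zapier step uses: DOI, recommendation, comments.
PATTERN = re.compile(r"([^ ]+) ([^ ]+) (.*)")

def parse_post(text, handle):
    """Turn a #paper_reading_hook post into a row for the sheet.

    The row layout mirrors the sheet: doi, handle, date, recommend, comments.
    """
    m = PATTERN.match(text)
    if m is None:
        return None
    doi, recommend, comments = m.groups()
    return {
        "doi": doi,
        "handle": handle,
        "date": date.today().isoformat(),
        "recommend": recommend,
        "comments": comments,
    }
```

So for the example post, the first token becomes the DOI, the second the recommendation, and everything after is treated as free-text comments.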

The actual UI looks like this:

When the app is opened:

  • It checks for new entries.
    • If there is a new DOI, it uses rcrossref to pull all relevant information about the paper (e.g. authors, date, etc.).
    • It then aggregates the data, joins some tables and writes to a different private table that contains some summaries.
  • The app then displays the current state.
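The “check for new entries” step amounts to a set difference on DOIs. A minimal Python sketch (the real app does this in R and pulls metadata via rcrossref; new_dois is a hypothetical helper, not code from the app):

```python
def new_dois(raw_rows, summarized_dois):
    """Return DOIs in the raw sheet that are missing from the summary table.

    In the real app, metadata for each of these would then be pulled
    (via rcrossref) and the summary table rewritten before display.
    """
    seen = set(summarized_dois)
    out = []
    for row in raw_rows:
        if row["doi"] not in seen:
            seen.add(row["doi"])
            out.append(row["doi"])
    return out
```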

discussion and future work

Right now the application is only scaled for the lab. An immediate improvement would be putting a proper database in place of the Google Sheet. In the long-term, I think it could be quite interesting to use publicly with some modifications:

  • Allow anonymous posts but have some form of community regulation.
    • I like the arXiv model of having someone else vouch for you before you post.
    • a voting system could help regulate comments/weed out the noise.
  • Collaborative filtering of course for recommendations.
  • Modeling of the content of the papers themselves.

These ideas are of course nothing new, but I will say the format is quite fun.

Additionally, the UI could use a fair amount of work. It is very much a project born out of my “spare time” in between running kallisto. If kallisto were slower, then maybe leaderboard would be better. You can blame Páll Melsted for that.

If you are interested in contributing or using leaderboard in your own lab, the code is available free as in beer:


First, thanks to the Pritchard lab for participating, providing feedback, and making it fun! Thanks to Jonathan for giving us the freedom and for supporting the project. Nasa was instrumental in getting the hook working, and I think we would’ve gotten much less participation without it.

Thanks, all!

[1] The demo and screenshots, except the time series/front page, are obviously bogus. They were simulated from our paper and comment distributions, completely at random. This was done to protect the private comments of those in the lab.
[2] At least she said that in my head.


The current state of R support in Atom


This is a guide to the current state of R support in Atom. I’m not writing this as an expert on Atom, merely as a user.

This guide is geared towards Mac as I do not have experience on either Linux or Windows.

All of these packages can be installed with the Atom package manager, either via the command line (apm install) or in the GUI (Preferences → Install).

With all of these packages, don’t forget to take a look at the command palette (cmd-shift-p) which will give you quick access to commands both bound and unbound to keys.

DISCLAIMER: I am the maintainer of r-exec.

syntax highlighting

standard R files

The syntax highlighting for *.R files is quite good and support is built into the language-r package. It builds basic syntax highlighting and was originally ported from the r.tmbundle TextMate package. There hasn’t been much activity on the GitHub repository as of late, but I think the package is quite stable so there isn’t too much work to be done (in terms of basic functionality). I have been using it without any major issues for at least a year.

R Markdown

In early 2016 I added very crude support for R Markdown blocks in language-gfm. This gives you standard functionality out of the box (language-gfm is a base package); however, it is a bit clunky. Earlier this year language-markdown added very good support and, in my opinion, is a much better package all around for editing all sorts of Markdown files. The biggest benefit of language-markdown over language-gfm for editing Rmd is proper syntax parsing of decorators. For example, {r eval=TRUE} parses nicely, while a typo like {r eva=TRUE} is obvious (see image below).

Example of language-markdown parsing syntactically (in)correct Rmd decorators


As far as I can tell, there really isn’t any good syntax highlighting for LaTeX style Rnw knitr files in Atom. But with the ease of R Markdown, unless you are writing a lot of math, I would suggest against using Rnw anyway. I have looked at adding support for it and it doesn’t seem like too much work. However, I have little incentive since I don’t really write Rnw. If you are interested in picking up this project, let me know and I might be able to guide you a little bit.


autocompletion

The Atom autocompletion engine (autocomplete-plus, bundled with Atom) is quite good and has providers for many languages, including relatively recent R support. Even without specific R support, autocomplete-plus does a decent job of finding text within your buffers and autocompleting it. Installing autocomplete-R will give you more elaborate autocompletion of built-in R functions along with their arguments. The author has expressed interest in providing this functionality for other packages as well.

The image below showcases this functionality:

Example of autocomplete-R

running code

Personally, my workflow includes using iTerm2 for remote execution of code, along with RStudio Server, and sometimes executing code locally. As far as I know, the only Atom package capable of sending code to all of these applications is r-exec. It is simple to configure which application you are sending code to (either in the Packages menu, or type cmd-shift-p and begin typing r exec to see a list of options).

r-exec has the ability to send the current line, paragraph, function, and RMarkdown block. It also has additional features to jump to the next logical block (line, paragraph, etc.) after a command has been sent.
Recently, support for ‘smart’ insertion of operators (currently assignment and the pipe operator) was implemented. This means that if there is already whitespace to the left or right of the cursor, the shortcut will not insert additional whitespace around the operator.

In the near future I plan to add support for internal Atom terminals (e.g. Term3). I also plan to add support for building RMarkdown internally. If others can help, it would be great to get support for Linux and Windows.

There are also other packages that are available for executing code which I have no experience with but I am sure others use:

  • hydrogen which adds in-line support for Jupyter notebooks (Python, Julia).
  • build-rscript which will run Rscript on an R file.

navigating code

Atom has a pretty good ctags package, atom-ctags. Matching rules for tags have to be specified in a configuration file before you can begin using it with R. Some guidance on this can be found here.
After the tags have been built for a specific project, you can jump between symbols using ctrl-option-down when the cursor is above a specific symbol.

static code analysis (linting)

Linting can be done using the linter-lintr package. It uses the popular R linter lintr and has pretty standard warning/error reporting in the Atom ‘gutter’.

The figure below (pulled from the lintr website) displays the functionality nicely:

example of lintr

summary and links

Bluetooth keyboard switch with Arduino

tl;dr I made an Arduino device that can be hooked up via USB to a computer that you wish to control and connected via Bluetooth to a controlling computer. Go to the “in action” section for a demo.

I control my computer by voice these days via a combination of VoiceCode + Dragon Dictate and some other supporting hardware/software.
A few weeks ago I was working on some servers which didn’t yet have a network connection, forcing me to type at the keyboard.
In these sorts of situations it is nice to be able to continue using my setup but to be able to do things on other machines.

Enter the replay keyboard device I put together with an Arduino.
You simply connect it to the computer you would like to replay commands on (the controlled machine), connect your own computer via Bluetooth, then type away!

You can find all of the software here:

replay keys

Some of my use cases are:

  • Controlling servers when there is no network (yay for failures!).
  • Setting up new computers/devices that require keyboard input.
  • Pair programming/working with others on their computer.
  • Screwing with friends/enemies.

What follows is a high level summary of how the device works.


  • Arduino Micro
    • I used the Micro so I could use the Keyboard library, which is only available on ATmega32U4 chipsets (Micro, Leonardo) and boards like the Due. This will NOT work on Uno-style chipsets.
    • The Micro is also nice because you can simultaneously use Serial1 (Bluetooth) and Serial (terminal output for debugging).
    • The Arduino is nice because it is lightweight and most modern operating systems have appropriate drivers already. This means you can plug it into any machine without having to install and write another driver. I imagine you can do this with many other types of devices (like a Raspberry Pi) but the Arduino starts up within seconds and it is easy to develop software for.
  • KEDSUM Arduino Wireless Bluetooth Transceiver Module Slave 4Pin Serial + DuPont Cable (basic version)
    • In principle, I think any Bluetooth transceiver should work.
    • My main gripe with this one is that the documentation isn’t that great and some of the commands didn’t work that were advertised in the documentation.



There really isn’t too much to this. Connect the 5V directly to VCC, ground to ground, and TX (Arduino) to RX (Bluetooth) and RX (Arduino) to TX (Bluetooth).
The Arduino Micro is a good choice because it has TX/RX (Serial1) pins that are isolated from the USB TX/RX (Serial) communication.
In theory, I think this should work with SoftwareSerial, but it didn’t work for me.
Maybe someone knows why this didn’t work?
This Bluetooth device works over SoftwareSerial with the Uno, for what it is worth.


  • The Arduino and Bluetooth transceiver do not have error correction built into their communication, so in some sense they aren’t completely reliable. Unless what you are doing is mission-critical, this hasn’t really been an issue for me. In principle, this is something you could develop.
  • The stream that is sent over Serial is only a byte which poses some challenges.


The general procedure is as follows:

  1. A key is typed inside the window on the source device.
  2. The key is logged and encoded using a specification.
  3. The encoded sequence of characters is sent byte-by-byte over Bluetooth to the Arduino hooked up to the device you wish to replay keys on.
  4. The Arduino receives the sequence of characters and presses them on the worker device.

All of the communication is done using one byte ASCII since this is easy to do with an Arduino.

Logger (source device)

This program is essentially a key logger.
It is a python program running a Tkinter GUI that listens to all of the keystrokes while the window is active.
Originally, I had planned on using ncurses but quickly realized that many of the keystrokes that I would like to send would not be captured.
Additionally, because it is running in a terminal, sometimes weird things happen to modifier keys.

The encoding for a key combination is as follows:

[N][optional modifiers][key]


  • N is the number of keys pressed at one time.
  • optional modifiers is a set of modifier codes (see the logger source for the possibilities).
  • key is the key being pressed.

For example, the encoding for ctrl + shift + a would be 3<s>a.
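A toy encoder in Python (the logger itself is a Python program; note the modifier codes below are placeholders I made up for illustration, not the actual symbols replay keys uses):

```python
# Placeholder modifier codes -- NOT the actual symbols used by replay keys.
MODIFIER_CODES = {"ctrl": "^", "shift": "+", "cmd": "@", "alt": "!"}

def encode(modifiers, key):
    """Encode a combination as [N][optional modifiers][key].

    N counts every key held down: the modifiers plus the final key.
    """
    n = len(modifiers) + 1
    return str(n) + "".join(MODIFIER_CODES[m] for m in modifiers) + key
```

With these placeholder codes, ctrl + shift + a encodes to 3^+a, analogous to the real encoding 3<s>a above.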


Since this is a program in user space, certain keystrokes cannot be captured, such as operating system shortcuts (e.g. cmd + tab).
Of course, you could bind aliases to such things.
For example, I bind ctrl + 0 to cmd + space since I rarely ever need to type ctrl + 0.

Arduino (replay device)

This program is a typical Arduino program with setup and loop functions (see replay_arduino).
The main loop waits until it sees a byte coming into the Bluetooth serial stream.
It then simply reads the number of keystrokes to expect, then parses the keys for that specific command.
It then presses all of the keys one-by-one until the combination has been completed, then releases them.
This all happens very quickly, possibly under a millisecond.

The only complication that I encountered was bytes being read at incorrect times due to the buffer not being filled.
Basically, if there is nothing in the stream, the Arduino will simply execute the command without reading anything from the stream and move forward.
The way that I dealt with this is by implementing a spinlock until there is something available in the stream (see safe_read()).
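In Python terms, the idea looks like this (the real safe_read() is Arduino C++; FakeStream here is a made-up stand-in for the Bluetooth serial object):

```python
def safe_read(stream):
    """Spin until at least one byte is available, then read it.

    Mirrors the safe_read() idea on the Arduino: a plain read() can
    return before the buffer has filled, so poll availability first.
    """
    while stream.available() == 0:
        pass  # spinlock: busy-wait for the buffer to fill
    return stream.read(1)

class FakeStream:
    """Test double standing in for the Bluetooth serial stream."""
    def __init__(self, data):
        self.buf = list(data)
    def available(self):
        return len(self.buf)
    def read(self, n=1):
        out, self.buf = self.buf[:n], self.buf[n:]
        return "".join(out)
```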

In action

Here is a demo:

Future work

The Arduino Micro has the ability to send mouse coordinates.
It would be cool to replay all of the mouse movements/commands.

I haven’t used it enough to notice issues with dropping packets, however, I imagine that at some point I will experience this.
If it ever becomes a major issue, it would be nice to implement some sort of error correction and recovery scheme.

The power of randomness: Pham vs Gap Part II

This is part two of two looking at the performance of methods for determining the number of clusters in k-means clustering. For part one, see Optimal “k” when there’s no cluster? Pham vs Gap Part I.


All of the examples, as well as plots can be found at this Github repo under the directory part2.


In the previous post, we looked at how the gap statistic and the Pham et al. method behave in several conditions under the reference distribution.

In this post we’ll look at how both methods behave under simulations performed in the gap statistic paper. Unlike Part I where we looked at how those methods behaved when there is no clustering, here we will look at how methods behave when there is clustering.

For references on the gap statistic or the Pham method, please see Part I. Unlike Part I, in this post I’ve also included the “prediction strength” method, also by Tibshirani et al.

Prediction strength algorithm

We’ll review the prediction strength algorithm briefly here. The paper is pretty good, and the technique is pretty simple and clever.

First, let us define the term co-membership. Co-membership is defined by Tibshirani et al. as an indicator variable which is 1 if two observations belong to the same cluster, as the clustering is defined, and 0 otherwise.

The prediction strength idea depends heavily on cross validation and this idea of co-membership. The overview of the algorithm is below:

  1. Randomly partition all of your data into a training set and a testing set.
  2. Cluster the training set, then cluster the test set.
  3. Compute the co-membership matrix of the test set using the clustering from the training data (this is the “prediction”).
  4. Compute the co-membership matrix of the test set using the testing data (this is the “truth”).
  5. Evaluate how well the training clustering predicts the co-membership of the testing clustering.

This procedure is performed using cross-validation several times for each candidate k. Then, for each k, the mean cross-validation prediction score is computed. The largest k such that the mean prediction strength is greater than some value (in the paper 0.8 or 0.9) is chosen as the optimal k.
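To make the procedure concrete, here is a sketch in Python/numpy of a single train/test split (not the fpc implementation used for the results below; the toy kmeans() with restarts stands in for a real k-means):

```python
import numpy as np

def kmeans(X, k, iters=100, n_init=5, seed=0):
    """Toy Lloyd's algorithm with restarts (stand-in for a real k-means)."""
    rng = np.random.default_rng(seed)
    best_inertia, best = np.inf, None
    for _ in range(n_init):
        centers = X[rng.choice(len(X), k, replace=False)].copy()
        for _ in range(iters):
            labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(0)
        inertia = ((X - centers[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best = inertia, (centers, labels)
    return best

def prediction_strength(X, k, seed=0):
    """One train/test split of the prediction strength for a given k."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    train, test = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
    train_centers, _ = kmeans(train, k, seed=seed)  # step 2
    _, test_labels = kmeans(test, k, seed=seed)     # step 2
    # Step 3: "predicted" co-membership -- assign test points to the
    # nearest training centroid and compare those labels pairwise.
    pred = ((test[:, None] - train_centers) ** 2).sum(-1).argmin(1)
    D = pred[:, None] == pred[None, :]
    # Steps 4-5: within each test cluster, the fraction of pairs the
    # training clustering also puts together; report the worst cluster.
    strengths = []
    for j in range(k):
        members = np.where(test_labels == j)[0]
        n_j = len(members)
        if n_j < 2:
            continue
        pairs = D[np.ix_(members, members)].sum() - n_j  # drop diagonal
        strengths.append(pairs / (n_j * (n_j - 1)))
    return min(strengths)
```

On well-separated clusters with the correct k, this returns values near 1; the real procedure then averages over cross-validation folds.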


I recreated the simulations from the gap statistic paper. For the most part they are the same or similar, but I vary the sample size. I’ll quote the paper, then give any differences below.

For each configuration, I did 100 realizations. I did not include simulation (a) as Part I does a more in depth analysis of it.

All simulations were implemented in R. I used the following implementations:

  • clusGap() from the the cluster package for the gap statistic (see my parallelized fork of the cluster package)
  • prediction.strength() from the fpc package for the prediction strength method
  • kselection() from the kselection package for the Pham method

Simulation (b)

three clusters in two dimensions – the clusters are standard normal variable with (25, 25, 30) observations, centered at (0, 0), (0, 5), and (5, -3).

The implementation (sim_b()) follows this description exactly. There are three additional sample sizes tested. In total, these are the ones simulated: (25, 50, 25), (50, 100, 50), (125, 250, 125), (2500, 5000, 2500).

Simulation (c)

four clusters in three dimensions – each cluster was randomly chosen to have 25 or 50 standard normal observations, with centres randomly chosen as N(0, 5I) (any simulation with clusters having a minimum distance less than 1.0 units between them was discarded;

The implementation (sim_c()) follows the description pretty similarly, except I simulated in four dimensions instead of three due to a typo. The sample sizes tested were: (25, 50), (50, 100), (125, 250), (2500, 5000).

Simulation (d)

four clusters in 10 dimensions – each cluster was randomly chosen to have 25 or 50 standard normal observations, with centres randomly chosen as N(0, 1.9I) (any simulation with clusters having a minimum distance less than 1.0 units between them was discarded; in this and the previous scenario, the settings are such that about half of the random realizations were discarded);

The implementation (sim_d()) is exactly as described. The sample sizes tested were: (25, 50), (50, 100), (125, 250), (2500, 5000).

Simulation (e)

two elongated clusters in three dimensions – each cluster is generated as follows Set x_1 = x_2 = x_3 = t with t taking 100 equally spaced values from -0.5 to 0.5 and then Gaussian noise with standard deviation 0.1 added to each feature. Cluster 2 is generated the same way, except that the value 10 is added to each feature at the end. The result is two elongated clusters, stretching out along the main diagonal of a three dimensional cube.

sim_e() is implemented exactly as described. The sample sizes tested were: 100, 250, 500, 7500. The sample size 15,000 was also tested, but isn’t displayed.


Correct classification of k

Perhaps the most important factor of all of these methods is accuracy. The figure below shows how each method did in the different scenarios:


The header of each subfigure shows the sample size for each simulation.

The table of counts can be found in the Github repository by loading part2/results.RData and looking at the variable X_table, where X is the simulation identifier.

Simulation (b)

We note that simulation (b) is quite simple, and all methods do well when the sample size is large. As in the null analysis (Part I), the Pham statistic does quite poorly in small sample sizes, but does better as the sample size grows. The gap statistic and prediction strength statistic do quite well in all simulations here.

Simulation (c)

This simulation is slightly more difficult than the one in the original paper, as it is in a higher dimension (4 vs 3) with the same amount of variance. Still, the gap statistic does very well, whereas the Pham method habitually thinks there are two clusters instead of four. The prediction strength algorithm does better as the sample size increases. This behavior is expected: with a small sample size, the probability of sampling all clusters representatively is lower.

Simulation (d)

This simulation is very similar to simulation (c), but is in a much higher dimension (10 vs 4) and with smaller variance. The gap statistic and prediction strength statistic do quite well in all situations, where again the Pham method does very poorly.

Simulation (e)

In the gap statistic paper, this simulation was handled quite well by the gap statistic, but very poorly by almost all other methods. To my surprise, the Pham method does well in this situation.

Here, we note the prediction strength method does quite poorly. A similar simulation was performed in the prediction strength paper, and they also note that it does poorly. In that paper there is a note mentioning that a good way to model an elongated cluster is by modeling several spheres. We note that as the sample size grows, so does the number of clusters selected. This is further evidence for the sphere argument. Below is a simulation with sample size 7,500 (per cluster), where k is selected using the prediction strength method. One can see that while we expected there to be two clusters, the optimal k selected (10) seems like a reasonable fit depending on your assumptions.



The timing summary can be seen in the figure below. Note that the x axes are log2 scaled.


As expected, Pham is significantly faster; it’s not really doing much other than computing a statistic based on the within-cluster sum of squares. Interestingly, the gap statistic is faster for smaller sample sizes, but for larger sample sizes, the prediction strength method is faster than the gap statistic. There is a point at which it becomes more costly to generate uniform samples than to perform sub-sampling and cross-validation.


The simulations performed recreated the simulations from the original gap statistic paper. They are quite simple, but test a variety of different cases with varying sample sizes:

  • Simple Gaussian clusters in low dimension
  • Gaussian clusters with more variance in low dimension
  • Gaussian clusters with higher variance in higher dimensions
  • Elongated clusters in low dimension

The gap statistic does quite well in all cases. This isn’t so surprising as the simulations were from the paper, but I think in general the gap statistic has a pretty concrete formulation and is a well thought out algorithm. The Pham method did quite poorly in the two higher variance cases. This is a bit worrisome as the simulations aren’t that complex, and it’s likely that real data is much messier than the simulations.

In general we see that when the sample size is smaller, the prediction strength algorithm doesn’t do as well as in large sample size situations. This is somewhat expected as with smaller sample sizes it becomes less probable that you would generate representative samples from each cluster.

Given these test cases, perhaps the best choice is still the gap statistic if you have a moderate sample size. While this is true, the prediction strength algorithm will likely scale better with large data sets and, depending on your assumptions about the data, might perform comparably to the gap statistic. The Pham method is appealing in runtime, as you basically get the statistic for free after running clustering with the selected k. That being said, its performance on these simple examples is quite poor. I don’t think I would personally trust it too much on real data. If you’re looking for something that is computationally pretty cheap and has decent performance, the Krzanowski and Lai method cited in the gap statistic and prediction strength papers seems to perform reasonably well on these simulations and has low computational costs. I haven’t personally benchmarked it, but it seems to do OK in those papers. As usual, there is no substitute for randomization.

Optimal “k” when there’s no cluster? Pham vs Gap Part I

K-means clustering is a fairly common and useful algorithm. One of the difficulties in using k-means is choosing the number of clusters, k. I use the gap statistic pretty regularly for choosing k in my analyses as I’ve found it to perform reasonably well in practice. Recently, I’ve been thinking a bit about clustering on Spark with large data sets and while the gap statistic works well on small data sets, it can be a major pain with lots of samples.


This post isn’t meant to be authoritative; I was just curious to see how the Pham et al. method and the gap statistic perform under various sample sizes and dimensions when there is no clustering, i.e. under a uniform distribution.

Some notation

  • N is the number of data points (sample size)
  • p is the number of dimensions of each data point
  • k is the number of clusters
  • K is the maximum number of clusters (k = \{1, 2, \ldots, K\})

Why is the gap statistic so slow?

For a review of the gap statistic, check out this great post by The Data Science Lab. The basics in the original paper are pretty easy to follow as well. The main reason it is relatively slow is that you generate N samples from a uniform distribution over the same range as the original data, B times. In each iteration b = \{1, 2, \ldots, B\}, you then run the same clustering algorithm with your candidate k values. This means that on top of clustering your original data K times, you have to cluster KB data sets, each of sample size N. When p and N aren’t too large, it’s pretty tractable, but it can grow out of control rather quickly: to get stable estimates of the optimal k, B has to be somewhat large (the authors of the R package cluster recommend as much as B = 500!).
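To make the cost concrete, here is a sketch of Gap(k) in Python/numpy (the actual analyses below use clusGap() in R; this toy version uses the simpler bounding-box reference distribution and omits the standard-error term used when selecting k):

```python
import numpy as np

def within_ss(X, k, iters=100, seed=0):
    """Total within-cluster sum of squares from a toy Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return ((X - centers[labels]) ** 2).sum()

def gap(X, k, B=20, seed=0):
    """Gap(k) = mean_b log(W*_kb) - log(W_k).

    Each of the B reference sets is N draws from a uniform distribution
    over the bounding box of X, and each must be clustered -- hence the
    K*B extra clustering runs described above.
    """
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    log_wstar = [
        np.log(within_ss(rng.uniform(lo, hi, X.shape), k))
        for _ in range(B)
    ]
    return float(np.mean(log_wstar) - np.log(within_ss(X, k)))
```

For clearly clustered data, Gap(k) is much larger at the true k than at k = 1, which is exactly what makes the statistic useful, and exactly why it costs KB clustering runs.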

Pham et al.

The Pham et al. method was also recently reviewed at the Data Science Lab. Computationally, the method is preferable to the gap statistic as it does not require a bunch of sampling: you only have to run k-means K times and compute a statistic that depends on the within-cluster sum of squares. That being said, while the approach is somewhat justified in the paper, it smells a lot like a heuristic, at least when compared to something like the gap statistic.

Why do these simulations?

Mostly curiosity. I will say that the gap statistic is a much more principled approach, and I was curious how the Pham method would hold up. Also, the Pham simulations left quite a bit to be desired: as far as I could tell, the simulations weren’t repeated (or at least not reported), which is generally bad form for such simple simulations.

Performance under the reference distribution

Both methods assume that when there is no clustering, the reference distribution is uniform over the p-dimensional space. For the simulations here, samples were generated uniformly at random from the unit hypercube. I performed 100 realizations of each simulation, where I vary N and p. Clustering was performed using the standard R implementation of k-means with 25 starts.

For the Pham method, I used the kselection package with default parameters. For the gap statistic, I used the cluster package. The gap statistic was run with B = 100. You can find the R code in this repo, along with a RData file with the results.

General results

Below is a figure with the k selected by the each algorithm in several different conditions. The title for each subfigure is p, N: the dimension and sample size. The histograms are in “dodge” mode.

Generally, the gap statistic outperforms the Pham method, but as the dimensionality and sample size grow, they tend to perform comparably (under these conditions). The situation where the Pham method does quite poorly is with small sample size and few dimensions. In such cases it doesn’t really make sense not to use the gap statistic, as it can still be computed quickly.

A closer look at Pham

We can look at only the simulations which did not choose k = 1 correctly to see what might be going awry. Below I’ve plotted f(k) for the incorrectly classified simulations when N = 50, p = 2:


Each line is a different simulation and the dotted red line marks f(k) = 0.85, the recommended threshold at which to choose the minimal value of f(k). The proposed rule in the paper is to choose the minimal k such that f(k) < 0.85. If there are no values of f(k) < 0.85, the algorithm returns k = 1. It’s pretty clear that for this particular setup, 0.85 is not a good threshold.
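For reference, f(k) is cheap to compute once you have the within-cluster sums of squares. This is my reading of the formulas in Pham et al., sketched in Python (the post itself used the kselection package; pham_f and pham_alpha are my own names):

```python
def pham_alpha(k, d):
    """Weight alpha_k from the paper; d is the number of dimensions."""
    if k == 2:
        return 1.0 - 3.0 / (4.0 * d)
    return pham_alpha(k - 1, d) + (1.0 - pham_alpha(k - 1, d)) / 6.0

def pham_f(k, S, d):
    """f(k), given S mapping each k to its within-cluster sum of squares."""
    if k == 1 or S[k - 1] == 0:
        return 1.0
    return S[k] / (pham_alpha(k, d) * S[k - 1])
```

The decision rule is then to pick the smallest k with pham_f(k, S, d) < 0.85, and to return k = 1 when no k crosses the threshold.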

If we look at those which correctly chose k = 1, we note that many of the simulations almost crossed this threshold of 0.85.



Timing

Each simulation was run on a separate single core. The gap statistic is embarrassingly parallel, though the R implementation does not exploit this (I think I’m smelling a C++11 threaded implementation in my future). I will also note that this is a large shared machine, so it’s possible that at some point other tasks were competing for CPU time (the machine isn’t running resource management).

The summary of results can be seen below (histograms are in “stack” mode):


It’s pretty clear that for small sample sizes there is little time benefit to using the Pham method. As the dimension and sample size grow, though, the gap statistic’s runtime grows out of control. At p = 20, N = 1000, which in the grand scheme of things is a pretty small dataset, the gap statistic is on average over 200 times slower than the Pham method.


Conclusion

The Pham method is a somewhat convenient heuristic for determining the optimal number of clusters when the number of clusters isn’t known a priori. We looked at very simple simulations where there is no clustering in the data to see how the Pham method compares to the gap statistic. For small sample sizes and/or few dimensions, the gap statistic does significantly better. The Pham method especially has trouble with small sample sizes and few dimensions. This problem seems to arise from the choice of threshold at which to select the optimal k: the original paper suggests 0.85, which clearly isn’t best for all datasets.

While the low-dimensional situation might seem rare in common datasets, it’s actually quite common to reduce a high-dimensional dataset to a low-dimensional space before clustering, using a method such as principal component analysis.

Future work

These simulations don’t really prove anything beyond the very naive conditions we tested. In a possible future post, I will recreate the simulations in the gap statistic paper and look for other papers which have comparable simulations to see how each method performs on data with clusters. I will also likely expand the methods tested to a larger group, so please comment with (automated) methods you think should be compared.

In RNA-Seq, 2 != 2: Between-sample normalization

In this post I will discuss between-sample normalization (BSN), which is often called library size normalization (LSN), though I don’t particularly like that term. Here, I will refer to it as BSN. The post is geared towards scientists who don’t come from a mathematics background and people who are new to RNA-Seq.

Not so random aside:

After many chats/rants with Nick Bray about RNA-Seq, we decided to abolish the term LSN. The reason is that “library size normalization” is a pretty misleading phrase, and it is especially confusing to biologists (rightfully so), who think of the library as the cDNA in your sample after prep. The term also doesn’t address the fact that what you’re measuring is relative, and thus counts from different mixtures or different depths aren’t equivalent.

So, let’s get started. The general point of BSN is to be able to compare expression features (genes, isoforms) ACROSS experiments. For a look at within-sample normalization (comparing different features within a sample), see my previous post. Most BSN methods address two issues:

  1. Variable sequencing depth (the total number of reads sequenced) between experiments
  2. Finding a “control set” of expression features which should have relatively similar expression patterns across experiments (e.g. genes that are not differentially expressed) to serve as a baseline

First, let’s address issue 1.

Adjusting for different sequencing depths

Adjusting for total counts is probably the first normalization technique that comes to mind when comparing two sequencing experiments, and indeed, it is inherently used when FPKM is computed without adjusting the “total number of reads” variable in the denominator. The concept is rather simple: I sequenced 1M reads in the first experiment and 2M reads in the second, thus, I should scale my counts by the total number of reads and they should be equivalent. In fact, this is generally what we want to do, but there are a few important caveats.

Consider the (important) case where a feature is differentially expressed. Below we have two toy experiments sequenced at different depths[1], each sampling five genes (G1, G2, G3, G4, FG).


The control experiment has 10 reads sequenced and the treatment experiment has 100 reads sequenced, so one way to do the total count normalization (which is proportional to FPKM) would be to divide by the total counts. The results are shown in the following table:

Gene   Control Counts   Treatment Counts   Control Normalized   Treatment Normalized
G1     2.00             6.00               0.20                 0.06
G2     2.00             6.00               0.20                 0.06
G3     2.00             6.00               0.20                 0.06
G4     2.00             6.00               0.20                 0.06
FG     2.00             76.00              0.20                 0.76


The result is quite preposterous! If one compares the control and treatment proportions, it seems that every gene is differentially expressed. We are typically under the assumption that the majority of the genes do not change in expression. Under that assumption, it is much more plausible that genes one through four remain equally expressed, whereas “funky gene” (FG) is the only differentially expressed gene.

Noting that FG is the only one likely to have changed, our normalization strategy could be different. We might consider normalizing by the sum of the total counts while omitting FG since its funkiness is throwing off our normalization. Thus, for the control sample we would now divide by 8 and in the treatment we would divide by 24:

Gene   Control Counts   Treatment Counts   Control Normalized (-FG)   Treatment Normalized (-FG)
G1     2.00             6.00               0.25                       0.25
G2     2.00             6.00               0.25                       0.25
G3     2.00             6.00               0.25                       0.25
G4     2.00             6.00               0.25                       0.25
FG     2.00             76.00              0.25                       3.17


And now we are back to a more reasonable-looking result where only “funky gene” appears to have changed expression. The astute reader might also note that the relative proportions have changed in the RNA-Seq experiment, but the absolute amount of transcripts from G1–G4 probably did not. Indeed, as an exercise you can scale the “control” experiment to 100 reads and notice that you will still see the effect of all of the genes being differentially expressed if using total count normalization[2].
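The toy example above is easy to reproduce in R (the gene names and counts come from the tables above):

```r
# Counts from the toy experiments: 10 reads in control, 100 in treatment
ctrl <- c(G1 = 2, G2 = 2, G3 = 2, G4 = 2, FG = 2)
trt  <- c(G1 = 6, G2 = 6, G3 = 6, G4 = 6, FG = 76)

# Total-count normalization: every gene appears differentially expressed
ctrl_tc <- ctrl / sum(ctrl)   # all 0.20
trt_tc  <- trt / sum(trt)     # 0.06 for G1-G4, 0.76 for FG

# Normalize by the total excluding FG (divide by 8 and 24 respectively)
ctrl_ctl <- ctrl / sum(ctrl[-5])  # all 0.25
trt_ctl  <- trt / sum(trt[-5])    # 0.25 for G1-G4, ~3.17 for FG
```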

You might be thinking “So.. in this case it was pretty obvious because funky gene is pretty crazy. How do we do this in general?” That’s a great question! Thus, enter the whole slew of RNA-Seq between-sample normalization methods.

Between-sample normalization techniques

There have been several quite good papers published on between-sample normalization (BSN), some of which I’ve collected below. In almost all the methods, the techniques are simply different ways of finding a “control set” of expression features and using these to estimate your total depth, as opposed to using all of your data to estimate your total depth. While I offered a toy example of why it is important to have a control set of features, more complete examples can be found in Bullard et al. and Robinson and Oshlack. The basic idea is that highly variable expression features and differentially expressed features can eat a lot of your sequencing depth, thus throwing off the relative abundances of your expression features.

Mathematical framework

Let X_{ij} be the observed counts for expression feature i (maybe a gene or transcript) in sample j. We assume X_{ij} is a sample from some distribution with the following property: E[X_{ij}] = \mu_{ij} = \beta_i \cdot s_j, where \beta_i is a mean expression rate for feature i and s_j is the sample-specific scaling factor for condition j. This basically means that every sample from this experimental condition shares a common expression rate for each feature, but has its own depth-dependent scaling factor.

Under this model, if we assume s_j is fixed, we can solve for \beta_i using maximum-likelihood or method-of-moments pretty easily. Unfortunately, as you saw earlier we don’t actually know the value of s_j, so we must estimate it.
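For instance, with s_j fixed and known, a method-of-moments estimate is simply \hat{\beta}_i = \sum_j X_{ij} / \sum_j s_j. A quick sketch (the rates and scaling factors here are made up for illustration):

```r
set.seed(4)
s <- c(1, 2, 4)     # known sample scaling factors
beta <- c(5, 50)    # true per-feature expression rates

# Simulate Poisson counts with E[X_ij] = beta_i * s_j
X <- matrix(rpois(6, outer(beta, s)), nrow = 2)

# Method-of-moments estimate of beta_i given fixed s_j
beta_hat <- rowSums(X) / sum(s)
```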

PoissonSeq: an example

Instead of summarizing every method, I want to run through a method that I think is simple, intuitive, and under-appreciated: PoissonSeq by Li et al. We define the set S as the set of “control” features that we wish to find. Initially, we set S to contain every single feature.

  1. Set S = \{\text{all expression features}\}
  2. Determine which features are outside of the control set — those with variance relatively large or relatively small. To do this, we compute a goodness-of-fit statistic for every feature:

    \displaystyle \text{GOF}_i = \sum_{j = 1}^n \frac{(X_{ij} - \hat{s}_j X_{i\cdot})^2}{\hat{s}_j X_{i\cdot}}

  3. Pooling all \text{GOF}_i together, we have a distribution. We can then compute S as the “center” of the distribution — basically anything in the inter-quartile range of this distribution.
  4. Given S, estimate s_j by:

    \displaystyle \hat{s}_j = \frac{\sum_{i \in S} X_{ij}} {\sum_{i \in S} X_{i\cdot}}

  5. If \hat{s}_j has converged, you’re done. If not, repeat steps 2 – 4.
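The iteration above is short enough to sketch directly (an illustrative re-implementation, not the PoissonSeq package itself; the function and variable names are mine, and a fixed iteration count stands in for the convergence check in step 5):

```r
# X: features x samples count matrix
estimate_size_factors <- function(X, n_iter = 20) {
  S <- seq_len(nrow(X))   # step 1: start with every feature in the control set
  s <- rep(1, ncol(X))
  for (it in seq_len(n_iter)) {
    # step 4: estimate s_j from the current control set
    s <- colSums(X[S, , drop = FALSE]) / sum(X[S, ])
    # step 2: goodness of fit against the fitted means s_j * X_i.
    mu <- outer(rowSums(X), s)
    gof <- rowSums((X - mu)^2 / mu)
    # step 3: keep the inter-quartile "center" as the new control set
    q <- quantile(gof, c(0.25, 0.75))
    S <- which(gof >= q[1] & gof <= q[2])
  }
  s
}

set.seed(5)
# 500 features, 4 samples with relative depths 10, 20, 10, 40
X <- matrix(rpois(500 * 4, rep(c(10, 20, 10, 40), each = 500)), ncol = 4)
s_hat <- estimate_size_factors(X)  # roughly proportional to 1, 2, 1, 4
```

Note that the estimated s_j sum to one by construction, so they act as relative depths rather than absolute ones.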

The key to this method is the goodness-of-fit statistic. If your goodness-of-fit statistic is large, then you have more variance than you would have expected and thus this particular expression feature shouldn’t be considered as part of your control set.

While this method assumes the Poisson distribution, Li et al. show that it is still robust under the negative binomial (the assumed distribution in most differential expression tools).

Performance of methods

This is an area that definitely needs work, but one in which performance is rather difficult to measure (other than with simulations, which I have my own gripes about). The size factor is basically a nuisance parameter for differential expression inference, and while there has been lots of great work in the area, I’m not familiar with any orthogonally validated “bake-off” papers comparing all existing methods.

That being said, I think it is absolutely negligent to still use total count normalization (or no normalization at all!), as it has been well understood to perform poorly in almost all circumstances.

Among the most popular and well-accepted BSN methods are TMM and DESeq normalization.

Combining within-sample normalization and between-sample normalization

Every normalization technique that I have seen assumes you are modeling counts, so the assumptions might be violated if you are using them directly on TPM or FPKM. While this is true, I think most techniques will give reasonable results in practice. Another possibility is to apply a BSN technique to the counts, then perform your within-sample normalization. This area has not been studied well, though we are actively working on it.

List of BSN papers

Here are the most relevant publications that I know of. I probably missed a few. If I missed anything relevant, please post it in the comments. If I agree, I’ll add it :).


  1. I normally hate pie charts, but I think in this case they serve an illustrative purpose. Please use pie charts with caution :).
  2. This is where 2 != 2 comes from. You can sequence the same amount of reads, but each read means something different in each experiment.

What the FPKM? A review of RNA-Seq expression units

This post covers the units used in RNA-Seq that are, unfortunately, often misused and misunderstood. I’ll try to clear up a bit of the confusion here.

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.


Throughout this post, “read” refers to both single-end and paired-end reads. The concept of counting is the same with either type of read, as each read represents a fragment that was sequenced.

When saying “feature”, I’m referring to an expression feature, by which I mean a genomic region containing a sequence that can normally appear in an RNA-Seq experiment (e.g. gene, isoform, exon).

Finally, I use the random variable X_i to denote the counts you observe from a feature of interest i. Unfortunately, with alternative splicing you do not directly observe X_i, so often \mathbb E[X_i] is used, which is estimated using the EM algorithm by a method like eXpress, RSEM, Sailfish, Cufflinks, or one of many other tools.


Counts

“Counts” usually refers to the number of reads that align to a particular feature. I’ll refer to counts by the random variable X_i. These numbers are heavily dependent on two things: (1) the number of fragments you sequenced (this is related to relative abundances) and (2) the length of the feature, or more appropriately, the effective length. The effective length is the number of possible start positions from which a feature could have generated a fragment of that particular length. In practice, the effective length is usually computed as:

\widetilde{l}_i = l_i - \mu_{FLD} + 1,

where \mu_{FLD} is the mean of the fragment length distribution, which is learned from the aligned reads. If the abundance estimation method you’re using incorporates sequence bias modeling (such as eXpress or Cufflinks), the bias is often incorporated into the effective length by making the feature shorter or longer depending on the effect of the bias.

Since counts are NOT scaled by the length of the feature, all units in this category are not comparable within a sample without adjusting for the feature length. This means you can’t sum the counts over a set of features to get the expression of that set (e.g. you can’t sum isoform counts to get gene counts).

Counts are often used by differential expression methods since they are naturally represented by a counting model, such as a negative binomial (NB2).

Effective counts

When eXpress came out, they began reporting “effective counts.” This is basically the same thing as standard counts, with the difference being that they are adjusted for the amount of bias in the experiment. To compute effective counts:

\text{effCounts}_i = X_i \cdot \dfrac{l_i}{\widetilde{l}_i}.

The intuition here is that if the effective length is much shorter than the actual length, then in an experiment with no bias you would expect to see more counts. Thus, the effective counts are scaling the observed counts up.

Counts per million

Counts per million (CPM) mapped reads are counts scaled by the number of fragments you sequenced (N), times one million. This unit is related to FPKM: it is FPKM without the length normalization, up to a factor of 10^3:

\text{CPM}_i = \dfrac{X_i}{\dfrac{N}{10^6}} = \dfrac{X_i}{N}\cdot 10^6

I’m not sure where this unit first appeared, but I’ve seen it used with edgeR and talked about briefly in the limma voom paper.
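In code, CPM is a one-liner (a sketch; here N is taken to be the column total of counts, and the example counts match those used at the end of this post):

```r
# CPM: counts scaled by the total fragments sequenced, times one million
countToCpm <- function(counts) counts / sum(counts) * 1e6

cnts <- c(4250, 3300, 200, 1750, 50, 0)
cpm <- countToCpm(cnts)  # sums to one million by construction
```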

Within sample normalization

As noted in the counts section, the number of fragments you see from a feature depends on its length. Therefore, in order to compare features of different lengths, you should normalize counts by the length of the feature. Doing so allows the summation of expression across features to get the expression of a group of features (think of a set of transcripts which make up a gene).

Again, the methods in this section allow for comparison of features with different length WITHIN a sample but not BETWEEN samples.


Transcripts per million

Transcripts per million (TPM) is a measurement of the proportion of transcripts in your pool of RNA.

Since we are interested in taking the length into consideration, a natural measurement is the rate, counts per base (X_i / \widetilde{l}_i). As you might immediately notice, this number is also dependent on the total number of fragments sequenced. To adjust for this, simply divide by the sum of all rates; this gives the proportion of transcript i in your sample. After you compute that, you simply scale by one million, because the proportion is often very small and a pain to deal with. In math:

\text{TPM}_i = \dfrac{X_i}{\widetilde{l}_i} \cdot \left( \dfrac{1}{\sum_j \dfrac{X_j}{\widetilde{l}_j}} \right) \cdot 10^6.

TPM has a very nice interpretation when you’re looking at transcript abundances. As the name suggests, the interpretation is that if you were to sequence one million full length transcripts, TPM is the number of transcripts you would have seen of type i, given the abundances of the other transcripts in your sample. The last “given” part is important. The denominator is going to be different between experiments, and thus is also sample dependent which is why you cannot directly compare TPM between samples. While this is true, TPM is probably the most stable unit across experiments, though you still shouldn’t compare it across experiments.

I’m fairly certain TPM is attributable to Bo Li et al. in the original RSEM paper.


FPKM/RPKM

Reads per kilobase of exon per million reads mapped (RPKM), or the more generic FPKM (substitute reads with fragments), are essentially the same thing. Contrary to some misconceptions, FPKM is not 2 * RPKM if you have paired-end reads. FPKM == RPKM if you have single-end reads, and saying RPKM when you have paired-end reads is just weird, so don’t do it :).

A few years ago, when the Mortazavi et al. paper came out and introduced RPKM, I remember many people referring to the method they used to compute expression (termed the “rescue method”) as RPKM. The same happened with the Cufflinks method: people would say things like, “We used the RPKM method to compute expression” when they meant the rescue method or the Cufflinks method. I’m happy to report that I haven’t heard this as much recently, but I still hear it every now and then. So let’s clear one thing up: FPKM is NOT a method, it is simply a unit of expression.

FPKM takes the same rate we discussed in the TPM section and instead of dividing it by the sum of rates, divides it by the total number of reads sequenced (N) and multiplies by a big number (10^9). In math:

\text{FPKM}_i = \dfrac{X_i}{ \left(\dfrac{\widetilde{l}_i}{10^3}\right) \left( \dfrac{N}{10^6} \right)} = \dfrac{X_i}{\widetilde{l}_i N} \cdot 10^9  .

The interpretation of FPKM is as follows: if you were to sequence this pool of RNA again, you expect to see \text{FPKM}_i fragments for each thousand bases in the feature for every N/10^6 fragments you’ve sequenced. It’s basically just the rate of fragments per base multiplied by a big number (proportional to the number of fragments you sequenced) to make it more convenient.

Relationship between TPM and FPKM

The relationship between TPM and FPKM is derived by Lior Pachter in a review of transcript quantification methods (\text{TPM}_i = \hat{\rho}_i \cdot 10^6) in equations 10 – 13. I’ll recite it here:

\begin{aligned}  \text{TPM}_i &= \dfrac{X_i}{\widetilde{l}_i} \cdot \left( \dfrac{1}{\sum_j \dfrac{X_j}{\widetilde{l}_j}} \right) \cdot 10^6 \\  &\propto \dfrac{X_i}{\widetilde{l}_i \cdot N} \cdot \left( \dfrac{1}{\sum_j \dfrac{X_j}{\widetilde{l}_j \cdot N} } \right) \\  &\propto \dfrac{X_i}{\widetilde{l}_i \cdot N} \cdot 10^9   \end{aligned}  .

If you have FPKM, you can easily compute TPM:

\begin{aligned}  \text{TPM}_i &= \left( \dfrac{\text{FPKM}_i}{\sum_j \text{FPKM}_j } \right) \cdot 10^6  \end{aligned}  .

Wagner et al. discuss some of the benefits of TPM over FPKM here and advocate the use of TPM.

I hope this clears up some confusion or helps you see the relationship between these units. In the near future I plan to write about how to use sequencing depth normalization with these different units so you can compare several samples to each other.

R code

I’ve included some R code below for computing effective counts, TPM, and FPKM. I’m sure a few of those logs aren’t necessary, but I don’t think they’ll hurt :).

countToTpm <- function(counts, effLen) {
    # work in log space for numerical stability
    rate <- log(counts) - log(effLen)
    denom <- log(sum(exp(rate)))
    exp(rate - denom + log(1e6))
}

countToFpkm <- function(counts, effLen) {
    N <- sum(counts)
    exp(log(counts) + log(1e9) - log(effLen) - log(N))
}

fpkmToTpm <- function(fpkm) {
    exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}

countToEffCounts <- function(counts, len, effLen) {
    counts * (len / effLen)
}

# An example
cnts <- c(4250, 3300, 200, 1750, 50, 0)
lens <- c(900, 1020, 2000, 770, 3000, 1777)
countDf <- data.frame(count = cnts, length = lens)

# assume a mean(FLD) = 203.7
countDf$effLength <- countDf$length - 203.7 + 1
countDf$tpm <- with(countDf, countToTpm(count, effLength))
countDf$fpkm <- with(countDf, countToFpkm(count, effLength))
with(countDf, all.equal(tpm, fpkmToTpm(fpkm)))
countDf$effCounts <- with(countDf, countToEffCounts(count, length, effLength))