# In RNA-Seq, 2 != 2: Between-sample normalization

In this post I will discuss between-sample normalization (BSN), which is often called library size normalization (LSN), though, I don’t particularly like that term. Here, I will refer to it as BSN. The post is guided towards scientists that don’t come from a mathematics background and people who are new to RNA-Seq.

Not so random aside:

After many chats/rants with Nick Bray about RNA-Seq, we decided to abolish the term LSN. The reason being that library size normalization is a pretty misleading phrase, and it is especially confusing to biologists (rightfully so) who think of the library as the cDNA in your sample after prep. LSN term doesn’t address the fact that what you’re measuring is relative and thus counts from different mixtures OR different depths aren’t equivalent.

So, let’s get started. The general point of BSN is to be able to compare expression features (genes, isoforms) ACROSS experiments. For a look at within-sample normalization (comparing different features within a sample), see my previous post. Most BSN methods address two issues:

1. Variable sequencing depth (the total number of reads sequenced) between experiments
2. Finding a “control set” of expression features which should have relatively similar expression patterns across experiments (e.g. genes that are not differentially expressed) to serve as a baseline

# Adjusting for different sequencing depths

Adjusting for total counts is probably the first normalization technique that comes to mind when comparing two sequencing experiments, and indeed, it is inherently used when FPKM is computed without adjusting the “total number of reads” variable in the denominator. The concept is rather simple: I sequenced 1M reads in the first experiment and 2M reads in the second, thus, I should scale my counts by the total number of reads and they should be equivalent. In fact, this is generally what we want to do, but there are a few important caveats.

Consider the (important) case where a feature is differentially expressed. Below we have two toy experiments sequenced at different depths1, each sampling five genes (G1, G2, G3, G4, FG).

The control experiment has 10 reads sequenced and the treatment experiment has 100 reads sequenced, so one way to do the total count normalization (which is proportional to FPKM) would be to divide by the total counts. The results are shown in the following table:

Gene Control Counts Treatment Counts Control Normalized Treatment Normalized
G1 2.00 6.00 0.20 0.06
G2 2.00 6.00 0.20 0.06
G3 2.00 6.00 0.20 0.06
G4 2.00 6.00 0.20 0.06
FG 2.00 76.00 0.20 0.76

The result is quite preposterous! If one compares the control and treatment proportions, it seems that every gene is differentially expressed. We are typically under the assumption that the majority of the genes do not change in expression. Under that assumption, it is much more plausible that genes one through four remain equally expressed, whereas “funky gene” (FG) is the only differentially expressed gene.

Noting that FG is the only one likely to have changed, our normalization strategy could be different. We might consider normalizing by the sum of the total counts while omitting FG since its funkiness is throwing off our normalization. Thus, for the control sample we would now divide by 8 and in the treatment we would divide by 24:

Gene Control Counts Treatment Counts Control Normalized (-FG) Treatment Normalized (-FG)
G1 2.00 6.00 0.25 0.25
G2 2.00 6.00 0.25 0.25
G3 2.00 6.00 0.25 0.25
G4 2.00 6.00 0.25 0.25
FG 2.00 76.00 0.25 3.17

And now we are back to a more reasonable looking result where only “funky gene” appears to have changed expression. The astute reader might also note that the relative proportions have changed in the RNA-Seq experiment, but the absolute amount of transcripts from G1-G4 probably did not. Indeed, as an exercise you can scale the “control” experiment to 100 reads and notice that you will still see the effect of all of the genes being differentially expressed if using total count normalization2.

You might be thinking “So.. in this case it was pretty obvious because funky gene is pretty crazy. How do we do this in general?” That’s a great question! Thus, enter the whole slew of RNA-Seq between-sample normalization methods.

# Between-sample normalization techniques

There have been several quite good papers published on between-sample normalization (BSN), some of which I’ve collected below. In almost all the methods, the techniques are simply different ways of finding a “control set” of expression features and using these to estimate your total depth, as opposed to using all of your data to estimate your total depth. While I offered a toy example on why it is important to have a control set of features, more complete examples can be found in Bullard et. al. and Robinson and Oshlack. The basic idea is that highly-variable expression features and differentially expressed features can eat a lot of your sequencing depth, thus throwing off the relative abundance of your expression features.

## Mathematical framework

Let $X_{ij}$ be the observed counts for expression feature $i$ (maybe a gene or transcript) in sample $j$. We assume $X_{ij}$ is a sample from some distribution with the following property: $E[X_{ij}] = \mu_{ij} = \beta_i \cdot s_j$ where $\beta_i$ is a mean expression rate for feature $i$ and $s_j$ is the sample specific scaling factor for condition $j$. This basically means that every single sample from this experimental condition share some common expression rate, but have a different sequencing depth that scales log-linearly.

Under this model, if we assume $s_j$ is fixed, we can solve for $\beta_i$ using maximum-likelihood or method-of-moments pretty easily. Unfortunately, as you saw earlier we don’t actually know the value of $s_j$, so we must estimate it.

### PoissonSeq: an example

Instead of summarizing every method, I want to run through a method that I think is simple, intuitive, and under-appreciated: PoissonSeq by Li et. al. We define the set $S$ as the set of “control” features that we wish to find. Initially, we set $S$ to contain every single feature.

1. Set $S = \{\text{all expression features}\}$
2. Determine which features are outside of the control set — those with variance relatively large or relatively small. To do this, we compute a goodness-of-fit statistic for every feature:

$\displaystyle \text{GOF}_i = \sum_{j = 1}^n \frac{(X_{ij} - \hat{s}_j X_{i\cdot})^2}{\hat{s}_j X_{i\cdot}}$

3. Pooling all $\text{GOF}_i$ together, we have a distribution. We can then compute $S$ as the “center” of the distribution — basically anything in the inter-quartile range of this distribution.
4. Given $S$, estimate $s_j$ by:

$\displaystyle \hat{s}_j = \frac{\sum_{i \in S} X_{ij}} {\sum_{i \in S} X_{i\cdot}}$

5. If $\hat{s}_j$ has converged, you’re done. If not, repeat steps 2 – 4.

The key to this method is the goodness-of-fit statistic. If your goodness-of-fit statistic is large, then you have more variance than you would have expected and thus this particular expression feature shouldn’t be considered as part of your control set.

While this method assumes the Poisson distribution, LiĀ et. al. show that it is still robust under the assumption of the negative binomial (the assumed distribution under most differential expression tools).

## Performance of methods

This is an area that definitely needs work, but is rather difficult to measure performance in (other than simulations, which I have my own gripes with). The size factor is basically a nuisance parameter for differential expression inference and while there has been lots of great work in the area, I’m not familiar with any orthogonally validated “bake-off” papers comparing all existing methods.

That being said, I think it is absolutely negligent to still use total count normalization (or no normalization at all!), as it has been well understood to perform poorly in almost all circumstances.

Among the most popular and well-accepted BSN methods are TMM and DESeq normalization.

## Combining within-sample normalization and between-sample normalization

Every normalization technique that I have seen assumes you are modeling counts, so the assumptions might be violated if you are using them directly on TPM or FPKM. While this is true, I think most techniques will give reasonable results in practice. Another possibility is to apply a BSN technique to the counts, then perform your within-sample normalization. This area has not been studied well, though we are actively working on it.

# List of BSN papers

Here are the most relevant publications that I know of. I probably missed a few. If I missed anything relevant, please post it in the comments. If I agree, I’ll add it :).

## Footnotes

1. I normally hate pie charts, but I think in this case they serve an illustrative purpose. Please use pie charts with caution :).
2. This is where 2 != 2 comes from. You can sequence the same amount of reads, but each read means something different in each experiment.

# What the FPKM? A review of RNA-Seq expression units

This post covers the units used in RNA-Seq that are, unfortunately, often misused and misunderstood. I’ll try to clear up a bit of the confusion here.

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

## Preliminaries

Throughout this post “read” refers to both single-end or paired-end reads. The concept of counting is the same with either type of read, as each read represents a fragment that was sequenced.

When saying “feature”, I’m referring to an expression feature, by which I mean a genomic region containing a sequence that can normally appear in an RNA-Seq experiment (e.g. gene, isoform, exon).

Finally, I use the random variable $X_i$ to denote the counts you observe from a feature of interest $i$. Unfortunately, with alternative splicing you do not directly observe $X_i$, so often $\mathbb E[X_i]$ is used, which is estimated using the EM algorithm by a method like eXpress, RSEM, Sailfish, Cufflinks, or one of many other tools.

## Counts

“Counts” usually refers to the number of reads that align to a particular feature. I’ll refer to counts by the random variable $X_i$. These numbers are heavily dependent on two things: (1) the amount of fragments you sequenced (this is related to relative abundances) and (2) the length of the feature, or more appropriately, the effective length. Effective length refers to the number of possible start sites a feature could have generated a fragment of that particular length. In practice, the effective length is usually computed as:

$\widetilde{l}_i = l_i - \mu_{FLD} + 1$,

where $\mu_{FLD}$ is the mean of the fragment length distribution which was learned from the aligned read. If the abundance estimation method you’re using incorporates sequence bias modeling (such as eXpress or Cufflinks), the bias is often incorporated into the effective length by making the feature shorter or longer depending on the effect of the bias.

Since counts are NOT scaled by the length of the feature, all units in this category are not comparable within a sample without adjusting for the feature length. This means you can’t sum the counts over a set of features to get the expression of that set (e.g. you can’t sum isoform counts to get gene counts).

Counts are often used by differential expression methods since they are naturally represented by a counting model, such as a negative binomial (NB2).

### Effective counts

When eXpress came out, they began reporting “effective counts.” This is basically the same thing as standard counts, with the difference being that they are adjusted for the amount of bias in the experiment. To compute effective counts:

$\text{effCounts}_i = X_i \cdot \dfrac{l_i}{\widetilde{l}_i}$.

The intuition here is that if the effective length is much shorter than the actual length, then in an experiment with no bias you would expect to see more counts. Thus, the effective counts are scaling the observed counts up.

### Counts per million

Counts per million (CPM) mapped reads are counts scaled by the number of fragments you sequenced ($N$) times one million. This unit is related to the FPKM without length normalization and a factor of $10^3$:

$\text{CPM}_i = \dfrac{X_i}{\dfrac{N}{10^6}} = \dfrac{X_i}{N}\cdot 10^6$

I’m not sure where this unit first appeared, but I’ve seen it used with edgeR and talked about briefly in the limma voom paper.

## Within sample normalization

As noted in the counts section, the number of fragments you see from a feature depends on its length. Therefore, in order to compare features of different length you should normalize counts by the length of the feature. Doing so allows the summation of expression across features to get the expression of a group of features (think a set of transcripts which make up a gene).

Again, the methods in this section allow for comparison of features with different length WITHIN a sample but not BETWEEN samples.

### TPM

Transcripts per million (TPM) is a measurement of the proportion of transcripts in your pool of RNA.

Since we are interested in taking the length into consideration, a natural measurement is the rate, counts per base ($X_i / \widetilde{l}_i$). As you might immediately notice, this number is also dependent on the total number of fragments sequenced. To adjust for this, simply divide by the sum of all rates and this gives the proportion of transcripts $i$ in your sample. After you compute that, you simply scale by one million because the proportion is often very small and a pain to deal with. In math:

$\text{TPM}_i = \dfrac{X_i}{\widetilde{l}_i} \cdot \left( \dfrac{1}{\sum_j \dfrac{X_j}{\widetilde{l}_j}} \right) \cdot 10^6$.

TPM has a very nice interpretation when you’re looking at transcript abundances. As the name suggests, the interpretation is that if you were to sequence one million full length transcripts, TPM is the number of transcripts you would have seen of type $i$, given the abundances of the other transcripts in your sample. The last “given” part is important. The denominator is going to be different between experiments, and thus is also sample dependent which is why you cannot directly compare TPM between samples. While this is true, TPM is probably the most stable unit across experiments, though you still shouldn’t compare it across experiments.

I’m fairly certain TPM is attributed to Bo Li et. al. in the original RSEM paper.

### RPKM/FPKM

Reads per kilobase of exon per million reads mapped (RPKM), or the more generic FPKM (substitute reads with fragments) are essentially the same thing. Contrary to some misconceptions, FPKM is not 2 * RPKM if you have paired-end reads. FPKM == RPKM if you have single-end reads, and saying RPKM when you have paired-end reads is just weird, so don’t do it :).

A few years ago when the Mortazavi et. al. paper came out and introduced RPKM, I remember many people referring to the method which they used to compute expression (termed the “rescue method”) as RPKM. This also happened with the Cufflinks method. People would say things like, “We used the RPKM method to compute expression” when they meant to say they used the rescue method or Cufflinks method. I’m happy to report that I haven’t heard this as much recently, but I still hear it every now and then. Therefore, let’s clear one thing up: FPKM is NOT a method, it is simply a unit of expression.

FPKM takes the same rate we discussed in the TPM section and instead of dividing it by the sum of rates, divides it by the total number of reads sequenced ($N$) and multiplies by a big number ($10^9$). In math:

$\text{FPKM}_i = \dfrac{X_i}{ \left(\dfrac{\widetilde{l}_i}{10^3}\right) \left( \dfrac{N}{10^6} \right)} = \dfrac{X_i}{\widetilde{l}_i N} \cdot 10^9$.

The interpretation of FPKM is as follows: if you were to sequence this pool of RNA again, you expect to see $\text{FPKM}_i$ fragments for each thousand bases in the feature for every $N/10^6$ fragments you’ve sequenced. It’s basically just the rate of fragments per base multiplied by a big number (proportional to the number of fragments you sequenced) to make it more convenient.

### Relationship between TPM and FPKM

The relationship between TPM and FPKM is derived by Lior Pachter in a review of transcript quantification methods ($\text{TPM}_i = \hat{\rho}_i \cdot 10^6$) in equations 10 – 13. I’ll recite it here:

\begin{aligned} \text{TPM}_i &= \dfrac{X_i}{\widetilde{l}_i} \cdot \left( \dfrac{1}{\sum_j \dfrac{X_j}{\widetilde{l}_j}} \right) \cdot 10^6 \\ &\propto \dfrac{X_i}{\widetilde{l}_i \cdot N} \cdot \left( \dfrac{1}{\sum_j \dfrac{X_j}{\widetilde{l}_j \cdot N} } \right) \\ &\propto \dfrac{X_i}{\widetilde{l}_i \cdot N} \cdot 10^9 \end{aligned}.

If you have FPKM, you can easily compute TPM:

\begin{aligned} \text{TPM}_i &= \left( \dfrac{\text{FPKM}_i}{\sum_j \text{FPKM}_j } \right) \cdot 10^6 \end{aligned}.

Wagner et. al. discuss some of the benefits of TPM over FPKM here and advocate the use of TPM.

I hope this clears up some confusion or helps you see the relationship between these units. In the near future I plan to write about how to use sequencing depth normalization with these different units so you can compare several samples to each other.

## R code

I’ve included some R code below for computing effective counts, TPM, and FPKM. I’m sure a few of those logs aren’t necessary, but I don’t think they’ll hurt :).

countToTpm <- function(counts, effLen)
{
rate <- log(counts) - log(effLen)
denom <- log(sum(exp(rate)))
exp(rate - denom + log(1e6))
}

countToFpkm <- function(counts, effLen)
{
N <- sum(counts)
exp( log(counts) + log(1e9) - log(effLen) - log(N) )
}

fpkmToTpm <- function(fpkm)
{
exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}

countToEffCounts <- function(counts, len, effLen)
{
counts * (len / effLen)
}

################################################################################
# An example
################################################################################
cnts <- c(4250, 3300, 200, 1750, 50, 0)
lens <- c(900, 1020, 2000, 770, 3000, 1777)
countDf <- data.frame(count = cnts, length = lens)

# assume a mean(FLD) = 203.7
countDf$effLength <- countDf$length - 203.7 + 1
countDf$tpm <- with(countDf, countToTpm(count, effLength)) countDf$fpkm <- with(countDf, countToFpkm(count, effLength))
with(countDf, all.equal(tpm, fpkmToTpm(fpkm)))
countDf\$effCounts <- with(countDf, countToEffCounts(count, length, effLength))