# In RNA-Seq, 2 != 2: Between-sample normalization

In this post I will discuss between-sample normalization (BSN), which is often called library size normalization (LSN), though, I don’t particularly like that term. Here, I will refer to it as BSN. The post is guided towards scientists that don’t come from a mathematics background and people who are new to RNA-Seq.

**Not so random aside:**

After many chats/rants with Nick Bray about RNA-Seq, we decided to abolish the term LSN. The reason being that library size normalization is a pretty misleading phrase, and it is especially confusing to biologists (rightfully so) who think of the library as the cDNA in your sample after prep. LSN term doesn’t address the fact that what you’re measuring is relative and thus counts from different mixtures OR different depths aren’t equivalent.

So, let’s get started. The general point of BSN is to be able to compare expression features (genes, isoforms) ACROSS experiments. For a look at within-sample normalization (comparing different features within a sample), see my previous post. Most BSN methods address two issues:

- Variable sequencing depth (the total number of reads sequenced) between experiments
- Finding a “control set” of expression features which should have relatively similar expression patterns across experiments (e.g. genes that are not differentially expressed) to serve as a baseline

First, let’s address issue 1.

# Adjusting for different sequencing depths

Adjusting for total counts is probably the first normalization technique that comes to mind when comparing two sequencing experiments, and indeed, it is inherently used when FPKM is computed without adjusting the “total number of reads” variable in the denominator. The concept is rather simple: I sequenced 1M reads in the first experiment and 2M reads in the second, thus, I should scale my counts by the total number of reads and they should be equivalent. In fact, this is generally what we want to do, but there are a few important caveats.

Consider the (important) case where a feature is differentially expressed. Below we have two toy experiments sequenced at different depths^{1}, each sampling five genes (G1, G2, G3, G4, FG).

The control experiment has 10 reads sequenced and the treatment experiment has 100 reads sequenced, so one way to do the total count normalization (which is proportional to FPKM) would be to divide by the total counts. The results are shown in the following table:

Gene | Control Counts | Treatment Counts | Control Normalized | Treatment Normalized |
---|---|---|---|---|

G1 | 2.00 | 6.00 | 0.20 | 0.06 |

G2 | 2.00 | 6.00 | 0.20 | 0.06 |

G3 | 2.00 | 6.00 | 0.20 | 0.06 |

G4 | 2.00 | 6.00 | 0.20 | 0.06 |

FG | 2.00 | 76.00 | 0.20 | 0.76 |

The result is quite preposterous! If one compares the control and treatment proportions, it seems that every gene is differentially expressed. We are typically under the assumption that the majority of the genes do not change in expression. Under that assumption, it is much more plausible that genes one through four remain equally expressed, whereas “funky gene” (FG) is the only differentially expressed gene.

Noting that FG is the only one likely to have changed, our normalization strategy could be different. We might consider normalizing by the sum of the total counts while omitting FG since its funkiness is throwing off our normalization. Thus, for the control sample we would now divide by 8 and in the treatment we would divide by 24:

Gene | Control Counts | Treatment Counts | Control Normalized (-FG) | Treatment Normalized (-FG) |
---|---|---|---|---|

G1 | 2.00 | 6.00 | 0.25 | 0.25 |

G2 | 2.00 | 6.00 | 0.25 | 0.25 |

G3 | 2.00 | 6.00 | 0.25 | 0.25 |

G4 | 2.00 | 6.00 | 0.25 | 0.25 |

FG | 2.00 | 76.00 | 0.25 | 3.17 |

And now we are back to a more reasonable looking result where only “funky gene” appears to have changed expression. The astute reader might also note that the relative proportions have changed in the RNA-Seq experiment, but the absolute amount of transcripts from G1-G4 probably did not. Indeed, as an exercise you can scale the “control” experiment to 100 reads and notice that you will still see the effect of all of the genes being differentially expressed if using total count normalization^{2}.

You might be thinking “So.. in this case it was pretty obvious because funky gene is pretty crazy. How do we do this in general?” That’s a great question! Thus, enter the whole slew of RNA-Seq between-sample normalization methods.

# Between-sample normalization techniques

There have been several quite good papers published on between-sample normalization (BSN), some of which I’ve collected below. In almost all the methods, the techniques are simply different ways of finding a “control set” of expression features and using these to estimate your total depth, as opposed to using all of your data to estimate your total depth. While I offered a toy example on why it is important to have a control set of features, more complete examples can be found in Bullard *et. al.* and Robinson and Oshlack. The basic idea is that highly-variable expression features and differentially expressed features can eat a lot of your sequencing depth, thus throwing off the relative abundance of your expression features.

## Mathematical framework

Let be the observed counts for expression feature (maybe a gene or transcript) in sample . We assume is a sample from some distribution with the following property: where is a mean expression rate for feature and is the sample specific scaling factor for condition . This basically means that every single sample from this experimental condition share some common expression rate, but have a different sequencing depth that scales log-linearly.

Under this model, if we assume is fixed, we can solve for using maximum-likelihood or method-of-moments pretty easily. Unfortunately, as you saw earlier we don’t actually know the value of , so we must estimate it.

### PoissonSeq: an example

Instead of summarizing every method, I want to run through a method that I think is simple, intuitive, and under-appreciated: PoissonSeq by Li *et. al.* We define the set as the set of “control” features that we wish to find. Initially, we set to contain every single feature.

- Set
- Determine which features are outside of the control set — those with variance relatively large or relatively small. To do this, we compute a goodness-of-fit statistic for every feature:
- Pooling all together, we have a distribution. We can then compute as the “center” of the distribution — basically anything in the inter-quartile range of this distribution.
- Given , estimate by:
- If has converged, you’re done. If not, repeat steps 2 – 4.

The key to this method is the goodness-of-fit statistic. If your goodness-of-fit statistic is large, then you have more variance than you would have expected and thus this particular expression feature shouldn’t be considered as part of your control set.

While this method assumes the Poisson distribution, Li *et. al.* show that it is still robust under the assumption of the negative binomial (the assumed distribution under most differential expression tools).

## Performance of methods

This is an area that definitely needs work, but is rather difficult to measure performance in (other than simulations, which I have my own gripes with). The size factor is basically a nuisance parameter for differential expression inference and while there has been lots of great work in the area, I’m not familiar with any orthogonally validated “bake-off” papers comparing all existing methods.

That being said, I think it is absolutely negligent to still use total count normalization (or no normalization at all!), as it has been well understood to perform poorly in almost all circumstances.

Among the most popular and well-accepted BSN methods are TMM and DESeq normalization.

## Combining within-sample normalization and between-sample normalization

Every normalization technique that I have seen assumes you are modeling counts, so the assumptions might be violated if you are using them directly on TPM or FPKM. While this is true, I think most techniques will give reasonable results in practice. Another possibility is to apply a BSN technique to the counts, then perform your within-sample normalization. This area has not been studied well, though we are actively working on it.

# List of BSN papers

Here are the most relevant publications that I know of. I probably missed a few. If I missed anything relevant, please post it in the comments. If I agree, I’ll add it :).

- Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments
- A scaling normalization method for differential expression analysis of RNA-seq data
- Normalization, testing, and false discovery rate estimation for RNA-sequencing data
- Universal Count Correction for High-Throughput Sequencing
- Differential expression analysis for sequence count data
- A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
- A normalization strategy for comparing tag count data

## Footnotes

- I normally hate pie charts, but I think in this case they serve an illustrative purpose. Please use pie charts with caution :).
- This is where 2 != 2 comes from. You can sequence the same amount of reads, but each read means something different in each experiment.

Well done! I have been thinking about the normalization problem for a while yet I have never seen a paper explaining the essence of the problem as clear as this article. You might be interested in my recent paper (http://arxiv.org/pdf/1405.4538v1.pdf) which treats normalization and differential expression as a joint problem and solve both of them simultaneously.

Thanks for the great article. I wonder if you have any thoughts or know of any work being done to evaluate the application of such BSN to metatranscriptomic datasets, in which transcripts in each sample are coming from numerous different microbes in a community. Here in addition to the variance due to expression you also have highly variable relative abundance of the organisms themselves. In this case, it would make sense to me to use the ratio of cDNA:DNA counts to do the BSN — thus accounting for the difference in relative abundance of organisms across samples?

I’ve seen several papers apply TMM (Marchetti et al. 2012 PNAS 7: E317) or DEseq ((Rivers et al., 2013 ISME J 7: 2315; Tsementzi et al., 2014 Environ Microbiol Rep 6: 640)) to metatranscriptomic data, but I haven’t seen these issues addressed explicitly yet. Thanks.

Thank you, this article has been long awaited.

I have raw counts and FPKM for a group of samples from RSEM, I am thinking of applying third quartile normalization across the samples before differential gene analysis, Are they tradeoffs of applying it to counts vs FPKM?

Very nice article. I liked specially the thought:

“Every normalization technique that I have seen assumes you are modeling counts, so the assumptions might be violated if you are using them directly on TPM or FPKM. While this is true, I think most techniques will give reasonable results in practice. ”

Trinity developers seems to recommend TPM or FPKM normalization for some analysis.

https://groups.google.com/forum/#!searchin/trinityrnaseq-users/TMM$20$2B$20TPM/trinityrnaseq-users/zKdXhlv72aM/P-nsV4j5EQAJ

Cheers,

Lucas

Thanks for the great post. I have actually the same question than Greg. I was wondering whether you know of a work that used the method you describe to normalize metT data or any method that besides dividing the reads of a given gene by the total reads it also considers a bench of house keeping genes that can be used for normalization to take into account different species abundance.

Thanks for any thoughts on this.

This paper seems to address normalization quite thoroughly, shame that I can’t find their R scripts in the supplementary data. http://bib.oxfordjournals.org/content/14/6/671.full.pdf

Cheers,

Duy

Hi Harold,

Thank you for taking the time to write this explanation. However, I got confused about the meaning of j in the Mathematical framework:

Let X_{ij} be the observed counts for expression feature i (maybe a gene or transcript) in sample j. We assume X_{ij} is a sample from some distribution with the following property: E[X_{ij}] = \mu_{ij} = \beta_i \cdot s_j where \beta_i is a mean expression rate for feature i and s_j is the sample specific scaling factor for condition j. T

First you say that index j is the sample. Then you “sample” from a distribution and you say that s_j is the “sample” specific scaling factor for condition j.

So is index j a condition or a sample? And is the meaning of “sample” in the first sentence the same as the meaning of “sample” in the second sentence? Or does “sample” in the first sentence mean “condition” in the second sentence?

I ask this because I am trying to normalize data where I have 3 conditions, and 3 replicates for each condition (9 samples total). Does j correspond to a single sample or does it correspond to the whole condition? The number of reads in each replicate can vary by as much as an order of magnitude, so normalization between samples and also across conditions will be critical.

I will also look at the Poisson-seq paper to see if that is a bit more clear.