How to...choose a reference genome?

Hi everyone! As you know, our team has recently been working hard to put together an Ultimate Guide to the Genestack Platform. The guide consists of a general introduction to bioinformatics and the basics of sequencing analysis, as well as a comprehensive set of descriptions of the Genestack platform architecture, use cases for various apps, and tips and tricks on how to get the most out of the platform. Before we publish the full guide, we are publishing weekly extracts from it as blog posts. You can view the previously published extracts here:

  1. On building pipelines and results reproducibility
  2. On choosing an appropriate mapper

In this week's post, we'll walk you through the process of choosing an appropriate reference genome. One way or another, most bioinformatics analysis pipelines, regardless of the data type analysed, require the use of a reference genome. For instance, we use reference genomes in DNA methylation analysis, in differential gene expression analysis, and in studies of transcriptomic heterogeneity within populations of cells. The choice of a reference genome can improve the quality and accuracy of the downstream analysis, or it can harm it. For example, it has been shown that the choice of gene annotation has a big impact not only on RNA-seq data analysis but also on variant effect prediction [1, 2].

On Genestack, you can find several reference genomes for some of the most common model organisms, and we regularly add more to this list. For some organisms we provide several genomes, e.g. there are 3 reference genomes for H. sapiens. What are the differences between these reference genomes? And how do you choose the correct one? The answer is not so straightforward and depends on several factors. Let's discuss each of them:

1) Versions of the reference genome

For instance: Homo sapiens GRCh37.75 (unmasked) vs GRCh38.80 (unmasked). The numbers correspond to versions (or "builds") of the reference genome: the higher the number, the more recent the version. We generally recommend you use the latest version possible. One thing to remember is that for the newest genome builds, resources such as genome annotations and functional information are likely to be limited, as it takes time for Ensembl/UCSC to integrate additional genomic data with the new build. You can read more about this in a blog post from the Genome Spot blog and in this article from Bio-IT.
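If you keep a list of genome labels in a script, the "higher number means more recent" rule can be sketched in a few lines of Python. This is only an illustration: it assumes labels follow the `GRCh37.75` pattern shown above (letters, a build number, a dot, a release number), and the names `GenomeVersion`, `parse_version` and `latest` are hypothetical helpers, not part of the Genestack platform.

```python
import re
from typing import Iterable, NamedTuple


class GenomeVersion(NamedTuple):
    prefix: str   # e.g. "GRCh"
    build: int    # e.g. 37 or 38
    release: int  # e.g. 75 or 80


# Assumed label shape: letters, build number, dot, release number.
_LABEL = re.compile(r"^([A-Za-z]+)(\d+)\.(\d+)$")


def parse_version(label: str) -> GenomeVersion:
    """Split a label such as 'GRCh37.75' into its numeric parts."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"Unrecognised genome label: {label!r}")
    prefix, build, release = match.groups()
    return GenomeVersion(prefix, int(build), int(release))


def latest(labels: Iterable[str]) -> str:
    """Return the label with the highest (build, release) pair."""
    def sort_key(label: str):
        version = parse_version(label)
        return (version.build, version.release)

    return max(labels, key=sort_key)


if __name__ == "__main__":
    print(latest(["GRCh37.75", "GRCh38.80"]))
```

Comparing the build first and the release second means a newer build always wins, regardless of the release number attached to an older one.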

2) One organism - many strains

K12 and O103:H2 are two different strains of E. coli. K12 is an innocuous strain commonly used in labs around the world. O103:H2 is a pathogenic strain, commonly isolated from human cases in Europe. Depending on your experiment, you should choose a matching reference genome.

3) Masked, soft-masked and unmasked genomes

There are three types of Ensembl reference genomes: unmasked, soft-masked and masked. Generally speaking, it's recommended to use unmasked reference genome builds for alignment. Masking is used to detect and conceal interspersed repeats and low-complexity DNA regions so that alignment tools can process them properly. There are two types of masked reference genomes:

Masked genomes

Masked reference genomes are also known as hard-masked DNA sequences. Repetitive and low-complexity DNA regions are detected and replaced with 'N's. Using a masked genome may adversely affect the analysis results, leading to wrong read mapping and incorrect variant calls.

When should you use a masked genome? We generally don't recommend using a masked genome, as it entails a loss of information (after mapping, some "unique" sequences may not be truly unique) and does not guarantee 100% accuracy and sensitivity (e.g. masking cannot be absolutely perfect). Moreover, it can increase the number of falsely mapped reads.

Soft-masked genomes

In soft-masked reference genomes, repeats and low-complexity regions are also detected, but in this case they are masked by converting the bases to lowercase (e.g. acgt).

When should you use a soft-masked genome? The soft-masked sequence still contains the repeats, indicated by lowercase letters, so using a soft-masked reference could improve the quality of the mapping without detriment to sensitivity. But it should be noted that most alignment tools do not take soft-masked regions into account: for example, BWA, TopHat and Bowtie2 always use all bases in alignment, whether they are lowercase or not. That is why there is no actual benefit from using a soft-masked genome in comparison with an unmasked one.

Sometimes you'll also see repeat-masked genomes. What are they? Masking can be performed by special tools, like RepeatMasker. The tool goes through the DNA sequence looking for repeats and low-complexity regions. By default, the tool replaces the found bases with "N"s.

Unmasked genomes

When should you use an unmasked genome? We recommend you use an unmasked genome when you don't want to lose any information. If you want to perform filtering, it's better to do it after the mapping step.
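To make the difference between the three flavours concrete, here is a minimal Python sketch that works on a plain sequence string. It assumes the convention described above (lowercase bases mark soft-masked regions, "N" marks hard-masked ones); the function names `hard_mask`, `unmask` and `masking_summary` are made up for this illustration and are not part of any tool mentioned in this post.

```python
def hard_mask(sequence: str) -> str:
    """Turn a soft-masked sequence into a hard-masked one.

    Every lowercase base is replaced with 'N'; uppercase bases are kept.
    """
    return "".join("N" if base.islower() else base for base in sequence)


def unmask(sequence: str) -> str:
    """Turn a soft-masked sequence into an unmasked one by uppercasing it.

    Note: bases already replaced by 'N' cannot be recovered this way.
    """
    return sequence.upper()


def masking_summary(sequence: str) -> dict:
    """Count soft-masked (lowercase a/c/g/t) and N bases in a sequence.

    N bases may be hard-masked repeats or genuine assembly gaps; the
    sequence alone does not tell the two apart.
    """
    soft_masked = sum(1 for base in sequence if base in "acgt")
    n_bases = sum(1 for base in sequence if base in "Nn")
    return {
        "length": len(sequence),
        "soft_masked": soft_masked,
        "n_bases": n_bases,
    }


if __name__ == "__main__":
    soft = "ACGTacgtacgtTTGA"  # toy soft-masked fragment
    print(hard_mask(soft))
    print(unmask(soft))
    print(masking_summary(soft))
```

Note the asymmetry: going from soft-masked to unmasked or hard-masked is trivial, but a hard-masked sequence cannot be turned back into the original, which is the loss of information discussed above.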

Example: To perform WES analysis, we recommend you use an unmasked reference genome of the latest release and assembly (e.g. Homo sapiens / GRCh37.75 (unmasked) for human samples).

Conclusions: The bioinformatics community is divided on the topic of reference genome choice. The opinion we presented here, to always use an unmasked genome and perform filtering after the mapping step, is our point of view. If you would like to read more on the topic, we suggest you take a look at the papers we included as references to this blog post. We hope you enjoyed this post; please let us know your thoughts by commenting below. If you want to get in touch with our team, you can do so using the chat window on the platform or by emailing us at contact@genestack.com.

1. McCarthy DJ, Humburg P, Kanapin A, Rivas MA, Gaulton K, Cazier JB, Donnelly P. Choice of transcripts and software has a large effect on variant annotation. Genome Med. 2014;6(3):26.
2. Frankish A, Uszczynska B, Ritchie GR, Gonzalez JM, Pervouchine D, Petryszak R, et al. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics. 2015;16(Suppl 8):S2.