As one of the widely used targeted sequencing method, whole exome sequencing (WES) has become more and more popular in clinical and basic research. Albeit, the exome (protein-coding regions of the genome) makes up 1% of the genome, it contains about 85% of known disease-related variants (van Dijk E.L. et al, 2014), making whole-exome sequencing a fast and cost-effective alternative to whole genome sequencing (WGS).
In this tutorial we'll provide a comprehensive description of the various steps required for WES analysis, explain how to build your own data flow, and finally, discuss the results obtained in such analysis.
Setting up an exome sequencing experiment
First off, let's choose exome sequencing data. You can upload your own data using 'Import' button or search through all public experiments we have on the platform. Our analysis will be based on data coming from Clark et al. 2011. Let's find this experiment in the platform and open it Experiment Viewer: The authors compared the performance of three major commercial exome sequencing platforms: Agilent's SureSelect Human All Exon 50Mb, Roche/Nimblegen's SeqCap EZ Exome Library v2.0 and Illumina's TruSeq Exome Enrichment; all applied to the same human blood sample. They found that, the Nimblegen platform provides increased enrichment efficiency for detecting variants but covers fewer genomic regions than the other platforms. However, Agilent and Illumina are able to detect a greater total number of variants than Nimblegen platform. Exome sequencing and whole genome sequencing were also compared, demonstrating that WES allows for the detection of additional variants missed by WGS. And vice versa, there is a number of WGS-specific variants not identified by exome sequencing. This study serves to assist the community in selecting the optimal exome-seq platform for their experiments, as well as proving that whole genome experiments benefit from being supplemented with WES experiments.
Whole Exome Sequencing Data Analysis pipeline
A typical data flow of WES analysis consists of the following steps:
- Quality control of raw reads
- Preprocessing of raw reads
- Mapping reads onto a reference genome
- Targeted Sequencing quality control
- Quality control of mapped reads
- Post-alignment processing
- Variant calling
- Effect annotation of the found variants
- Variant prioritisation in Variant Explorer
Let's look at each step separately to get a better idea of what it really means.
1. Quality control of raw reads
Low-quality reads, PCR primers, adaptors, duplicates and other contaminants, that can be found in raw sequencing data, may compromise downstream analysis. Therefore, quality control (QC) is essential step in your analysis to understand some relevant properties of raw data, such as quality scores, GC content and base distribution, etc. In order to assess the quality of the data we'll run the Raw Reads QC data flow:
Genestack FastQC application generates basic statistics and many useful data diagnosis plots. Here is some of them for sample enriched by Aligned SureSelect 50M: Basic statistics tells you about basis data metrics such as reads type, number of reads, GC content and total sequence length. Sequence length distribution module reports if all sequences have the same length or not. Per sequence GC content graph shows GC distribution over all sequences. A roughly normal distribution indicates a normal random library. However, as in our case, if the data is contaminated or there are some systematic bias, you'll see an unusually shaped or shifted GC distribution: Per base sequence quality plots show the quality scores across all bases at each position in the reads. By default you see low quality zone and mean quality line. If the median is less than 25 or the lower quartile is less than 10, you'll get warnings. Per sequence quality scores report allows you to see frequencies of quality values in a sample. The reads are of good quality if the peak on the plot is shifted to the right, to the max quality score. In our case, almost all of the reads are of good quality (>30): Per base sequence content plots show nucleotide frequencies for each base position in the reads. In a random library, there could be only a little difference between A, T, C, G nucleotides, and the lines representing them should be parallel with each other. The black N line indicates the content of unknown N bases which shouldn't be presented in the library. In our case, you can notice the absence of unknown nucleotides and a slight difference in A-T and G-C frequencies: Sequence duplication levels plots represent the percentage of the library made up of sequences with different duplication levels. In simple words, 44% of reads are unique, 26% of reads are repeated twice, 13% - three times, 4% - more than 10 times, etc. All these duplicates are grouped to give the overall duplication level. You can use "Filter Duplicated Reads" application to remove duplicates in raw reads data, however we'll get rid of them after mapping step. Application also detects overrepresented sequences that may be an indication of primer or adaptor contamination. We have run QC on all the data in the experiment and put the reports in Raw reads QC reports for Clark et al (2011) folder, so that you can open all of them in Multiple QC Report application to analyse results: You see that total number of exome sequencing reads is 124,112,466 for Agilent SureSelect, 184,983,780 for Nimblegen SeqCap and 112,885,944 for Illumina TruSeq platform. The whole genome library yielded more than one billion total raw reads.
2. Preprocessing of raw reads
With the comprehensive raw reads QC reports generated by FastQC app, you're able to determine whether any preprocessing steps such as trimming, filtering, or adaptor clipping are necessary prior to alignment. Here is the list of all preprocess apps that Genestack suggests you to improve the quality of your raw reads: Our preprocessing procedure will include "Trim Adaptors and Contaminants" step. Once the quality of raw data has been checked, let's start planning and building our Whole Exome Sequencing Analysis data flow:
To build any data flow in Genestack, choose one of the samples and start to preprocess or analyse it. Each app suggests you to add next analytical step or use relevant viewers: Note that you can create as many files as you want and run the computation process later. Now let's create a data flow from the pipeline we built. For the last created file choose "Create new Data Flow" in Manage section: This takes us to the "Data Flow Editor" app page where you can rename, describe your pipeline and change sources. Let's 'clear files' and click "Run dataflow". Now we're on the "Data Flow Runner" application page. Just choose sources - experiment assays and human reference genome - and click "Run Data Flow". Note that if you choose several raw reads files, the multi-sample variant calling will be performed. However, in order to compare our results, we need to run this data flow for each sample separately. After that, the app suggests you to choose the explore app where you can start initialization now for whole your analysis or delay it till later: Let's delay it. After that the app suggests you to choose the app where you can also start computation: In order to start computation for each data flow step separately, click on file name and choose 'start initialization'. Let's run "Trim Adaptors and Contaminants" step: All the data are preprocessed and stored in Trimmed raw reads for Clark et al (2011) folder.
3.Mapping reads onto a reference genome
After raw data QC and preprocessing, the next step is to map exome sequencing data to the reference genome with high efficiency and accuracy. Genestack supports two Unspliced mappers, one is based on Bowtie2, another - on BWA. We'll use the last one since it is fast and allows gapped alignments which are essential for accurate SNP and indels (insertion/deletions) identification. The following video illustrates how to start computation on this step and analyse mapping results in Genome Browser:
When mappings will be complete, open all 4 files in Genome browser to compare their read coverage. Let's look for specific gene or region, for example, HBA1 and HBA2 genes encoding alpha-globin chains of hemoglobin. With WGS technology, you can see coverage in both protein-coding and non-coding sequences: As for WES technology, you are interested only in exome. That's why, you see coverage for HBA1 and HBA2 coding regions and do not see it in non-coding ones. To compare read coverage between different enrichment platforms, you can build a coverage track: In most cases you'll see a significant coverage for sample enriched by Nimblegen. Moreover, each platform targets particular exomic segments based on combinations of the RefSeq, UCSC, Ensembl and other databases. That's why, you may expect difference in coverage for specific gene-coding regions. To further use mapped reads, go to the Mapped reads for Clark et al (2011) folder.
4. Targeted Sequencing Quality Control
Besides quality control of the raw sequencing reads, it is also crucial to assess whether the target capture has been successful, i.e. if most of the reads actually fell on the target, if the targeted bases reached sufficient coverage, etc. However, by default the application allows you to compute enrichment statistics for reads mapped only on exome. If you go to the app page, change the value to "Both exome and target file" and select the appropriate target annotation file, you get both exome and/or target enrichment statistics. To do this step, you can "generate reports" for each mapping separately or run our Targeted Sequencing Quality Control public data flow (with default values) for several samples at once and analyse the output reports in Multiple QC Report app:
In this tutorial, we are looking at three exome enrichment platforms from Agilent, Nimblegen and Illumina and assessing their overall targeting efficiency by measuring base coverage over all targeted bases and on-target coverage for each platform: A typical target-enrichment WES experiment results in ¶90% of target-bases covered at coverage ≥ 1x. This value tends to decrease as the coverage threshold increases. How fast this percentage decreases with the coverage increment depends on the specific experimental design. Not surprisingly, all the technologies give high coverage of their respective target regions, with the Nimblegen platform giving the highest coverage: about 94% of the targeted bases were covered at least twice, 93% at ≥ 10x and 87% at ≥ 50x. With Agilent, 91% of bases were covered at ≥ 2x, 86% at ≥ 10x and 66% at ≥ 50x. With Illumina TruSeq enrichment, 91% of bases were covered at ≥ 2x, 86% at ≥ 10x and only 50% at ≥ 50x. These results are very similar to the paper results: Regarding the overall percentage of reads mapped on the target, in a typical experiment one may expect ¶70%. Looking at the plot, you see the highest 77% and 74% values for samples enriched by Nimblegen and Agilent platforms respectively and only 48% - for Illumina TruSeq. Also, it's not surprising that we notice the biggest mean coverage on target with ≥ 2x coverage for Nimblegen samples, since this platform contains overlapping oligonucleotide probes that cover the bases it targets multiple times, making it the highest density platform of the three. Agilent baits reside immediately adjacent to one another across the target exon intervals. Illumina relies on paired-end reads to extend outside the bait sequences and fill in the gaps (Clark M.J. et al, 2011): Target annotations used in this tutorial can be found in Public Data, Genome annotations folder or in Target Annotations for Clark et al (2011) tutorial folder, where we put them for your convenience. Besides the target enrichment statistics, you can assess the percentage of exome bases with coverage started from ≥ 2x and the overall proportion of reads mapped on exome: All targeted sequencing QC reports are collected in Mapped reads enrichment reports for Clark et al (2011) folder.
5. Quality control of mapped reads
The reads may look OK on the Raw Reads quality control step but some biases only show up during mapping process: low coverage, experimental artefacts, etc. The detection of such aberrations is an important step because it allows you to drive an appropriate downstream analysis. You can "generate reports" for each mapping separately or just run Mapped Reads Quality Control data flow for multiple samples and analyse the output reports in Multiple QC Report app:
Output report includes mapping statistics such as:
- Mapped reads: total reads which mapped to the reference genome;
- Unmapped reads: total reads which failed to map to the reference genome;
- Mapped reads with mapped mate: total paired reads where both mates were mapped;
- Mapped reads with partially mapped mate: total paired reads where only one mate was mapped;
- Mapped reads with "properly" mapped mate: total paired reads where both mates were mapped with the expected orientation;
- Mapped reads with "improperly" mapped mate: total paired reads where one of the mates was mapped with unexpected orientation.
The Coverage by chromosome plot shows a read coverage at each base on each chromosome and patch (if it's presented) defined by lines in different colours: If your reads are paired, the application additionally calculates insert size statistics, such as median and mean insert sizes, median absolute deviation and standard deviation of insert size. The Insert size distribution plot shows the insert size length frequencies: All complete QC reports for mapped reads are stored in Mapped reads QC reports for Clark et al (2011) folder. You can open all of them at once in Multiple QC Report app to interactively analyse and compare mapping statistics between samples: Speaking of mapping results, for each sample, almost all of the reads is mapped properly and there is a small percentage of partially or improperly mapped reads.
6. Post-alignment processing
After mapping reads to the reference genome, it's recommended to remove duplicates before variant calling, with the purpose of eliminating PCR-introduced bias due to uneven amplification of DNA fragments. That's why we run "Remove Duplicated Mapped Reads" app: Preprocessed mapped reads are stored in Filtered mapped reads for Clark et al (2011) folder.
7. Variant calling
8. Effect annotation
9. Variant prioritisation in Variant Explorer
The variants can be also interactively analysed in Genestack Variant Explorer application:
Let's select Illumina sample and open it in Variant Explorer to look at the detected variants: There are 1,350,608 mutations were identified. Imagine that we are interested only in high-quality nonsense variants: click 'QUALITY' header to apply sorting and set 'NONSENSE' in 'FUNCTIONAL CLASS'. You see that the number of mutations is decreased significantly. We have only 104 nonsense variants: You can use other filters and sorting criteria and look through the 'Filters history' to check how many variants were detected after applying specific filter in comparison to the number of mutations we had on the previous filtering step: When the variants are sorted and filtered, you can share them with your colleagues, export them as tsv file clicking on 'Download table' and attach it to your papers and other reports. So, what can we conclude from our findings“ Are the results for WES samples really comparable to a WGS one“ If there are any key differences in performance between the three enrichment platforms“ And what target capture technology is better to select when planning the exome experiment“ Answering these questions we found that neither of whole exome and whole genome technologies managed to cover all sequencing variants. First, WGS can not and will not replace exome sequencing as due to genome characteristics there will always be regions that are not covered sufficiently for variant calling. Regarding WES, it shows high coverage but only towards the target regions. Second, WGS has its value in identifying variants in regions that are not covered by exome enrichment technologies. These can be regions where enrichment fails, non-coding regions as well as regions that are not present on the current exome designs. That's why, for covering really all variants, it might be worth to think about doing both WGS and WES experiments in parallel. Both technologies complement each other.
- Clark M. J., et al. Performance comparison of exome DNA sequencing technologies. Nature biotechnology 2011; 29(10):908-914.
- Ebersberger I., et al. Genomewide comparison of DNA sequences between humans and chimpanzees. The American Journal of Human Genetics 2002, 70:1490-1497.
- Mills R.E., et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Research 2006; 16:1182-1190.
- van Dijk E.L., et al. Ten years of next-generation sequencing technology. Trends in Genetics 2014; 30:418-426.