New raw sequencing data quality control tool available on Genestack

We're excited to announce that we launched a new app: FastQC Report.fastqcIt performs quality control checks on raw sequencing data and generates a bunch of QC reports telling you in which areas there may be problems. These reports cover the following:

  1. basic statistics
  2. sequence length distribution
  3. per sequence GC content
  4. per sequence quality scores
  5. per base sequence content
  6. sequence duplication level
  7. overrepresented sequences

We'll show you examples of the above using FastQC reports prepared for data from 2013 publication by Hibaoui et al. on early developmental transcriptional signature of Down syndrome. This experiment (along with hundreds of thousands of others) is available on the Genestack in our Public Experiments Index. The reports can be found here. Let's go through all the reports generted by the FastQC app: Basic Statistics report tells you about the basic stats of your data: reads type, number of reads, GC content and total sequence length.basic statisticsSequence length distribution report tells you about whether or not your sequences have the same length.sequence length distributionPer sequence GC content graph will help you access whether your data is contaminated. In a random library you should expect a roughly normal GC content distribution. Looking at the type of peaks on the graph (sharp/broad) will give you hints on the types of contaminants in your data set.per sequence GC contentPer base sequence quality plots show a range of quality scores for each position in the reads. A good sample will have qualities all above 28. This report will help you decide if your data needs trimming - e.g. in a situation where the quality of the reads degrades over time.per base sequence qualityPer sequence quality score plot shows the frequencies of quality scores in a sample. If the reads are of good quality, the peak on the plot should be shifted to the right as far as possible. Errors seen in this report warn you about a general loss of quality of your raw reads within a run.per sequence quality scorePer base sequence content plot shows the four nucleotides' proportions for each position. In a random library you expect no nucleotide bias and the lines should be almost parallel with each other. A common problem that can be discovered looking at this graph is a presence of overrepresented sequences in the sample.On the picture belowe we can see a bias at the beginning of the reads, which is common for RNA-Seq data.per base sequence contentSequence duplication level graph shows the proportion of the library made up of sequences with different duplication levels. Sequences with 1, 2, 3, 4, etc duplicates are grouped to give the overall duplication level. In the example below we can see that in this sample, 15% of sequences are duplicated 10 times, 7% are duplicated 100 times and so on. Generally,  there are two potential types of duplicates in a library-technical or biological with both being shown on this graph.sequence duplication levelsLast, but not least, the Overrepresented Sequences report will warn you if any of the sequences makes up more than 0.1% of the entire sample. overrepresented sequences The app is based on FastQC, developed by Simon Andrews and his colleagues at the Babraham Institute (Cambridge, UK). All of these reports will later help you decide if your data needs preprocessing, and if so, what type of preprocessing. After you run several FastQC reports (which you could do using our QC dataflow that was discussed in the Differential Gene Expression Analysis tutorial), you could compare them using our Multiple QC Report app. Learn more about this app here. multiple qc report We hope you'll enjoy our new app. Let us know what you think about it, Genestack team