One of the most widespread applications of RNA-seq technology is differential gene expression (DGE) analysis. By understanding how gene expression levels change across different experimental conditions, we can gain clues about gene function and learn how genes work together to carry out biological processes.
In this tutorial we will use Genestack applications to identify differentially expressed (DE) genes and further annotate them according to biological process, molecular function and cellular component. The whole analysis includes the following steps:
- Setting up an RNA-seq experiment
- Quality control of raw reads
- Preprocessing of raw reads
- Mapping RNA-seq reads onto a reference genome
- Quality control of mapped reads
- Calculate read coverage for genes
- Differential gene expression analysis
- GO-based enrichment analysis
Let's deal with these steps one by one.
1. Setting up an RNA-seq experiment
In the left-hand folder tree, you can see various public data that is freely available to you: public experiments, data flows, reference genomes, etc. After searching through all public experiments available on the platform you can select one and use it further in the analysis.
Let's go to the tutorials and choose Testing Differential Gene Expression on Genestack Platform. The RNA-Seq experiment we will use comes from Hibaoui et al. 2013 and is publicly available on Genestack. Open it in Experiment Viewer to see more details:
The authors investigated the transcriptional signature of Down syndrome (trisomy 21) during development, and analysed mRNA of induced pluripotent stem cells (iPSCs) derived from fetal fibroblasts of monozygotic twins discordant for trisomy 21: three replicates from iPSCs carrying the trisomy, and four replicates from normal iPSCs. They identified down-regulated genes expressed in trisomic samples and involved in multiple developmental processes, specifically in nervous system development. Genes up-regulated in Twin-DS-iPSCs are mostly related to the regulation of transcription and different metabolic processes.
To reproduce these results, we'll use Differential Gene Expression Analysis data flow. But before let's check the quality of raw reads to decide whether we should improve it or not.
2. Quality control of raw reads
Raw sequencing reads can include PCR primers, adaptors, low quality bases, duplicates and other contaminants coming from the experimental protocols. That's why we recommend you to check the quality of your raw data looking at such aspects as GC percentage, per base sequence quality scores, and other quality statistics. The easiest way to do this is to run the Raw Reads QC data flow:
After generating reports, you'll be able to review various statistics and plots for each sample. Here is some of them for
Per sequence GC content. In a random library you expect a roughly normal GC content distribution. An unusually shaped or shifted distribution could indicate a contamination.
Per base sequence quality plot depicts the range of quality scores for each position in the reads. A good sample will have qualities all above 28:
Per sequence quality scores plot shows the frequencies of quality scores in a sample. If the reads are of good quality, the peak on the plot should be shifted to the right as far as possible. In our case, the majority of Illumina reads have a good quality - in the range from 35 to 40 (the best score is 41):
Per base sequence content shows the four nucleotides' proportions for each position. In a random library you expect no nucleotide bias and the lines should be almost parallel with each other:
There is a bias at the beginning of the reads, which is common for RNA-Seq data. This occurs during RNA-seq library preparation, when "random" primers are annealed to the start of sequences. These primers are not truly random, and it leads to a variation at the beginning of the reads.
Sequence duplication levels plots represent the proportion of the library made up of sequences with different duplication levels. Sequences with 1, 2, 3, 4, etc duplicates are grouped to give the overall duplication level.
Looking at these plots, you may notice 15% of sequences duplicated more than 10 times, 6% of sequences are repeated more than 100 times, etc. The overall rate of duplication is about 40 %. Nevertheless, while analysing transcriptome sequencing data, we should not remove these duplicates because we don't know whether they represent PCR duplicates or high gene expression of our samples.
We have run QC on all the data in the experiment and collected reports in Raw reads QC reports for Hibaoui et al (2013) folder.
3. Preprocessing of raw reads
Once the quality of raw data has been checked, we can go back to the main Differential Gene Expression Analysis data flow and choose sources:
You can upload your samples directly into the data flow or select them from the available datasets. Let's choose sources - 7 sequencing assays from the experiment and a human reference genome, and create resulting files in a specific folder.
QC reports can not only provide you with the information on the data quality but can also help you to decide how to preprocess the data in order to improve its quality and get more reliable results in further analysis. There are various Genestack applications that allow you to do preprocessing:
We'll run Trim Adaptors and Contaminants app:
All resulting files are collected in Trimmed raw reads for Hibaoui et al (2013) folder.
4. Mapping RNA-seq reads onto a reference genome
When all files were created, you can run the whole analysis here, choosing Expression Navigator for genes. But first, let's align RNA-seq reads to the reference genome across splicing junctions and then compare mappings in Genome Browser.
Find Spliced Mapping step, click on "7 files". In "Explore" section choose "Genome Browser" and start initialization there.
We run Spliced Mapping app with default parameters. To change them go to the app page and choose "Edit parameters" button. If you want to learn more about the app and its options, click on the app name and then on "About application".
Completed Mapped Reads files can be found in Mapped reads files for Hibaoui et al (2013) folder. Let's open some of them in Genome Browser to analyse reads coverage on chromosome 21 on the region chr21:30007376-40007694 (10 Mb):
Here is a combined track for all trisomic and control samples:
As you see, the majority of chr21 genes are indeed more expressed in the trisomic samples than in the euploid ones, which is consistent with the overall up-regulation of chr21 genes in individuals with Down syndrome.
5. Quality control of mapped reads
The optional step is to check how mapping went using Mapped Reads QC Report app. You can "generate reports" for each mapping separately or just run Mapped Reads Quality Control data flow for multiple samples:
Output report includes mapping statistics such as:
- Mapped reads: total reads which mapped to the reference genome;
- Unmapped reads: total reads which failed to map to the reference genome;
- Mapped reads with mapped mate: total paired reads where both mates were mapped;
- Mapped reads with partially mapped mate: total paired reads where only one mate was mapped;
- Mapped reads with "properly" mapped mate: total paired reads where both mates were mapped with the expected orientation;
- Mapped reads with "improperly" mapped mate: total paired reads where one of the mates was mapped with unexpected orientation.
For paired reads, you can look at insert size statistics, such as median and mean insert sizes, median absolute deviation and standard deviation of insert size. The Insert size distribution plot is generated:
We already prepared all QC reports for mapped reads and put them in Mapped reads QC reports for Hibaoui et al (2013) folder. You can open all of them in Multiple QC Report app to view mapping statistics interactively:
Overall, more than 80 % of reads are mapped. It includes properly and partially mate pairs. Less than 11 % of reads are unmapped among the samples.
6. Calculate read coverage for genes
After mapping, we can count reads mapped to annotated features (genes, exons, etc) running Quantify Raw Coverage in Genes app:
To run the app, click on "7 files" and then "Start initialization". For our analysis we counted reads mapping within exons, grouping them by gene_id and assigning reads to all exons they overlap with. We calculated read coverage in all samples and collected resulting files in Raw gene counts for Hibaoui et al (2013) folder.
7. Differential gene expression analysis
The final step is to identify genes whose patterns of expression differ according to different experimental conditions. In this tutorial, we are looking for variation in gene expression for trisomic samples compared to the control ones.
Open Expression Navigator file, re-group samples and start the analysis:
We prepared two Differential Expression Statistics files (considering the DE genes reported by both packages) and stored them in Differential gene expression analysis for Hibaoui et al (2013) folder. As an example, let's analyse DE genes reported by DESeq2 package.
You can see the table with top genes that are differentially expressed in one particular group compared to the average of the other groups. The table shows the corresponding Log FC (log fold change), Log CPM (log counts per million), p value, and FDR (false discovery rate) for each gene. Genes with positive Log FC are considered to be up-regulated in the selected group, ones with negative Log FC are down-regulated.
In the "Trisomy 21" group we identified 4426 low expressed genes (NR2F1, XIST, NEFM, etc.) and 4368 highly over-expressed genes (ZNF518A, MYH14, etc.). By selecting the checkbox next to a gene, more detailed information about that gene is displayed: its Ensembl identifier, description and location:
There are several options to filter/sort the genes displayed on the Top DEG Table. You can filter them by p-value, minimum Log FC, minimum CPM and regulation type. By default, the genes are ranked by their FDR.
Let's find genes that are most over-expressed in the "Trisomy 21" group, by lowering the Max P-Value threshold and increasing the Min LogFC and Min LogCPM thresholds. Change P-Value to 0.001, Regulation to "Up", set both Min LogFC and Min LogCPM equal to 2 and apply sorting by LogFC.
As consistent with paper results, there is a number of zinc finger protein genes that are up-regulated in Twin-DS-iPSCs:
Interactive counts graph shows gene normalised counts across samples. This allows you to observe how a gene's expression level varies within and across groups. Select several genes to compare expression level distributions between them:
If you move cursor to the top right corner of the graph, 3 icons will appear:
- Filter icon lets you filter the graph data by samples, groups, features, etc.
- Data icon will display all the data contained in the graph, and allow you to save the table data locally.
- Camera icon lets you save the displayed graph locally. Add labels to the graph and change its appearance by modifying the parameters on display when you right-click the graph area.
8. GO-based enrichment analysis
To further characterise the biological processes that might be affected in trisomic samples, we performed the downstream gene ontology (GO) analysis of the DGE genes. For this, we'll use GO Enrichment Analysis app, which performs the classic Fisher's exact test based on gene counts, against GO annotations.
Open the app on one of the completed Differential Expression Statistics files:
Changing the group and thresholds criteria in the Filter Options, you can set what DE genes can be further used for enrichment analysis. Let's run GO app twice, analysing down- and up-regulated genes separately.
Analysing down-regulated genes we observed significant enrichment for genes involved in multiple developmental processes specifically in organ development and morphogenesis, embryonic development and morphogenesis and system development. Also, we found genes associated with nervous system-related terms and specifically with nervous system development, brain development, neurogenesis, generation of neurons, neuron differentiation and axonogenesis. Moreover, there are some terms linked to cellular adhesion (i.e. biological adhesion, cell adhesion, cell-cell adhesion) and to the cadherin signalling pathway. Here is the first 8 of 1017 GO terms related to the biological processes:
Additionally, you can also review molecular functions or cellular components that could be affected in Twin-DS-iPSCs. Just change the "Top GO Terms" from Biological Process category to the corresponding one.
By comparison, this is the table from the paper which listed the first 20 biological processes that might be affected due to trisomy 21:
Our GO analysis of the genes up-regulated in trisomic samples revealed enrichment for functions related to different metabolic and biological processes, regulation of transcription and DNA-dependent transcription. Here is the list of the first 8 of 339 biological process GO terms:
Look at the GO terms associated with up-regulated genes and reported in the paper:
All these biological processes can be found in our results. The difference is in GO counts. But we expected it, because the ontologies are not complete, they are being expanded constantly during the association of gene products from the collaborating databases. If you'd like to check it out, open differential expression statistics files stored in folder GO enrichment analysis for Hibaoui et al (2013).
This is all for the tutorial. Why don't you try repeating these steps with your own data or using our public experiments You can try it right now! Just open the tutorial data flow or create your own one by adding new steps, changing sources and default options.
If you have any questions and comments, please submit them below or email us at email@example.com.
Follow us on Twitter: @genestack.