Whole-genome bisulfite sequencing data analysis on Genestack platform

Watch all videos here Bisulfite sequencing approaches are currently considered a "gold standard" for detecting DNA methylation. One of them, whole-genome bisulfite sequencing (WGBS), provides single-base resolution of methylated cytosines in genomic DNA.


Investigating the methylation profile of DNA is extremely valuable because this type of epigenetic modification controls gene expression, and is involved in such processes as embryonic development, genomic imprinting, X-chromosome inactivation and cell differentiation. Since methylation takes part in so many cellular processes, it can be expected that aberrant methylation can be associated with various diseases for example different types of cancer.

In this tutorial we will show you how to bioinformatically analyse the whole genome DNA methylation with Genestack applications. 

The entire pipeline includes the following steps:

  1. Setting up a WGBS experiment
  2. Quality control of bisulfite sequencing reads
  3. Preprocessing of the raw reads: trimming adaptors, contaminants and low quality bases
  4. Bisulfite sequencing mapping of the preprocessed reads onto a reference genome
  5. Merging the mapped reads
  6. Quality control of the mapped reads
  7. Methylation ratio analysis
  8. Exploring the genome methylation levels in Genome Browser

The data flow used in this tutorial, has been previously prepared by our team and put into tutorial folder for your convenience.

1. Setting up a WGBS experiment

To go through all these steps, we need to choose one of the WGBS experiments from Genestack Public Experiments which are the part of our Public data. Feel free to reproduce the workflow on your own data uploaded with the Data Importer.  What's more, you could open the folder "Tutorials"  in the Public data and select the "Whole-genome Bisulfite Sequencing Data Analysis on Genestack Platform". In the folder you will find the WGBS experiment used in this tutorial, processed data and all the other needed files. 

Experiment by Rodriguez et al., 2014 can serve as a clear example of the applications of WGBS to genome-wide DNA 5-methylcytosine landscape profiling.

public experiments

To learn more just open the experiment in Experiment Viewer:


Briefly, the authors performed WGBS on DNA obtained from mouse hematopoietic stem cells (HSCs) to investigate the mechanisms that could promote changes in DNA methylation and contribute to malignant transformation. They discovered extended DNA regions of low methylation -- "Canyons"-- that are distinct from CpG islands and shores and cover conserved domains frequently containing transcription factors. Then, as DNA methyltransferase 3a (Dnmt3a) encoding gene is often mutated in human leukemias, the authors also compared DNA methylation patterns in purified wild type and Dnmt3a conditional knockout mouse HSCs. And it was revealed that the loss of DNA Dnmt3a can influence the Canyon size.

Now let's start reproducing these results with data flows pre-prepared by Genestack.

2. Quality control of bisulfite sequencing reads

Quality control of raw reads is an essential step in our pipeline to ensure that the data is of high quality and is suitable for further analysis. The raw data may be contaminated with PCR primers and adapter dimers or it can contain bases of poor quality. Moreover, the bisulfite sequencing protocol utilizes sodium bisulfite conversion of unmethylated cytosines to thymidines and that reduces the sequence complexity resulting in errors during alignment.

With our platform you can quickly and easily obtain an interactive FastQC Report that covers different quality aspects of raw data such as general sequence quality, total number of reads, GC content distribution, read quality distribution and much more. To start the analysis we can use public Raw Reads Quality Control data flow. There is no need to repeat the same steps for each sample -- the pipelines for all 16 assays from our experiment can easily be built simultaneously using this public data flow. In order to run the raw reads QC data flow, select all of the files from the experiment, right click on them, select "Run data flow on selection" and choose "Raw Reads Quality Control".

choose DFIn the Data Flow Runner page you can see 16 raw reads samples from the experiment.DF_FastQCTo start computation click on the button "Run Data Flow" and create resulting files. You will be suggested to initialize the created files at once or delay it till later.

Start initialization

The computation will take a while. You can track the progress of the report generation using our Task Manager that can be found at the top of the page. All created QC reports are located in the folder "Created files". You can explore each report in FastQC Report app or compare QC-statistics for several samples using Multiple QC Report app.

View Resuts

If you decide to delay initialization, you can generate statistical reports for all the chosen samples directly on Multiple QC Report app page or from the Data Flow Runner page by clicking on "16 files" and selecting "Start initialization". Thereby, with just one click, the process will begin for all the samples we have selected.


We will demonstrate the examples of QC-reports using files previously prepared by our team.

Go to the folder with processed files for our tutorial and click on the complete Multiple QC Report for Raw Reads we have created for all 16 raw read files in order to compare the quality of our raw reads.Multiple QC plot for RawReads

Looking at the plot we can see the number of nucleotides counted for each individual sample obtained from  Dnmt3a KO (blue) or WT HSCs (red) samples. Additionally, on the app page you could specify the statistics and metainfo which will be displayed on the plot and sort the samples by specific QC statistics or metainfo keys of choice.

Now let's look at the FastQC report for one of the assays, for example, "ko3a_b2l4 Bisulfite-Seq".  All prepared FastQC reports for all the samples are stored in the FastQC reports for Rodriguez et al., 2014 folder.

Per sequence GC content graph shows the GC content across the whole length of each read. Ideally, we will see the normal distribution of GC content. Our results reflect some deviation from from normal distribution: unusual sharp shape of the central peak may indicate the presence of contaminants in our library, for example adaptor dimers.Per sequence GC content

On the Per base sequence quality plots we can see that all bases in our sequence have the quality score equal or more than 30, which corresponds to 99.9% base calling accuracy. The quality is degraded in the last bases, but it is an expected behaviour corresponding to the sequencing chemistry.

per base sequence quality Per sequence quality score graph shows an average quality distribution over the set of sequences. It will help us see if there are any problems with sequencing run, for example a significant proportion of low quality sequences can be a signal of a systematic problem. In our case the overwhelming majority of reads are of a high quality (more than 30).

fastqc per sequence quality scoresLet's move on to the Per base sequence content graphs. The fact that our data failed this metric indicates that the base distribution is not uniform, namely the difference between A and T, or G and C is greater than 20%. Indeed, we can see fluctuations in base compositions over the entire read length. This should not alarm us, because bisulfite treatment converts the most of the cytosines to thymines and that obviously affects the base composition. Looking at the plot we can see that the number of thymines is approximately 50%, while cytosines are almost absent.fastqc per base seq contentSequence duplication levels metric allows us to assess the duplication level as well as the number of sequences that are not unique in the raw data. According to the plot, we have more than 30% of non-unique sequences of the total in the assay. Such a high duplication level can be linked to PCR artefacts, contaminants or sequencing of the same area several times.fastqc sequence duplication levelsThe application also detects Overrepresented sequences that may correspond to primer or adapter contamination. Indeed, in our case two over-represented sequences were found in our assay.

Here they are:

FastQC (overrepresented sequences)

These contaminants can strongly influence the results of analysis and should be trimmed.

3. Preprocessing of raw reads: trimming adaptors, contaminants and low quality bases.

After checking the quality of our data,  we can proceed with appropriate steps for improving the original raw data in order to get reliable results in the downstream analysis. 

The authors analysed two biological replicates for two murine phenotypes: wild type (WT) HSCs and conditionally Dnmt3a knocked out (KO) HSCs. Moreover, each biological replicate of WT or Dnmt3a KO HSCs condition has several technical replicates. Let's select the raw reads "m12_b4l1 Bisulfite-Seq", "m12_b4l2 Bisulfite-Seq" and "m12_b3 Bisulfite-Seq" that are three technical replicates for the second biological replicate of WT HSCs from our experiment and right click on them. Click on "Run data flow on selection" and choose from the list of suggested variants "Data Flow for WGBS data analysis (for Rodriguez et al., 2014)". After that you will find yourself on the Data flow Runner page, where all the steps of our pipeline are schematically represented. 


In the first block you will see the source files we have just selected. Also you need to specify reference genome onto which our reads will be mapped. So "Choose sources", find appropriate murine reference genome and "Select".

File chooser (ref-genome)


Let's run data flow by click on the corresponding button and take a closer look at all the steps of our pipeline. As we will describe below, we will run this data flow several times to obtain methylation ratios for biological replicates of the two tested phenotypes separately.

The first part of our pipeline is preprocessing of raw sequencing data. Based on the QC statistics we highly recommend you to remove adapters and contaminants, trim low quality bases and remove duplicates. And we also remove duplicates during Methylation Ratio Analysis, but you can also use a separate preprocess application Remove Duplicated Reads.

Firstly, we can easily remove the found overrepresented sequences from WGBS data using Trim adapters and contaminants app:

DF trim adaptors and contaminants

Later, to avoid mismatches in read mapping, we should remove low quality bases from the sequencing reads. Trim low quality bases application allows you to get rid of nucleotide bases with a low phred33 quality which corresponds to an error threshold equal to 1%.

df trim low quality bases

All preprocessed files are freely accessible in the folders Trim adaptors for Rodriguez et al., 2014 and Trim low quality bases for Rodriguez et al., 2014.

4. Bisulfite Sequencing Mapping of the preprocessed reads onto a reference genome

Following the preprocessing, our data is of improved quality and we can move on to the next step -- alignment of trimmed reads onto the reference genome.

df bisulfte seq mapping

We run Bisulfite Sequencing Mapping with default parameters. Click on the app name to move to its page where you can change the parameters of alignment and learn more about the app clicking on the "about".


In the Mapped reads for Rodriguez et al., 2014 folder you can find all the Mapped Reads.

5. Merging of the mapped reads obtained from technical replicates

It is essential for any experiment to create a good and adequate control. Using replicates can improve quality and precision of your experiment especially when comparing several experimental conditions. There are two types of replicates: technical and biological. When the same biological sample is prepared and sequenced separately, this gives us technical replicates. But if different biological samples are treated the same way but prepared separately, we are talking about biological replicates. If there are no changes in sample preparation, technical replicates could be considered as controls for the samples under the same experimental conditions.

As we remember, our experiment contains both biological and technical replicates for 12-month-old wild-type HSCs and Dnmt3a-KO HSCs. As authors do not mention using different experimental conditions for technical replicates, we can merge them before the calculation of methylation ratios. We will not merge biological replicates, because significant biological variability may be present between two sample growing separately.

Here is the data flow to merge three technical replicates for the second biological replicate of WT HSCs:

df merge mapped reads

Use the same data flow to merge mapped reads for technical replicates of remaining samples. As a result you will get 4 merged mapped reads for both analysed murine phenotypes. You can also use prepared Merged Mapped Reads files by opening the Merged mapped reads for Rodriguez et al., 2014 folder.

6. Quality control of mapped reads

Before proceeding to the next analysis step we would recommend you to check the quality of mapped reads because the reads might look alright in raw reads quality control check but some issues,  such as low coverage, homopolymer biases, experimental artifacts, only appears after alignment.

Sequencing coverage is an important quality metric, as biases in sample preparation, sequencing and read mapping can result in genome regions that lack coverage or in regions with much higher coverage than theoretically expected. You could explore the coverage for merged mapped reads in our Genome Browser. Go to the folder created, find merged mapped reads file we have created and click on the name of the merged mapped reads files, go to "explore" option and select the Genome Browser. Also go to our tutorial folder, find there  "Coverage for merged mapped reads for Rodriguez et al., 2014" file and open it in Genome browser app to explore results for initialized mapped reads.

In Genome Browser you have the option of viewing the region according to its coordinates or alternatively, you can search for a given gene or transcript name. Let's check the coverage in the location of the large methylation-depleted region associated with Pax6 gene.

GB_search by gene

High-quality reads should have uniform and high enough coverage. Our last three samples are very good for the further processing. At the same time, the coverage for the biological replicate 1 of 12 months WT HSCs is uneven -- with gaps and regions with coverage up to x133. That is why, such results may generate biases during methylation ratios calculation.

GB coverage (v2) Sequencing coverage

You can explore mapping quality for each individual file: right click merged reads file name, go to "explore" and select "Mapped reads QC report". On the opened app page you should click "generate reports" to start calculation of quality control statistics. To check the quality of all mapped reads or merged mapped reads simultaneously use the "Mapped Reads Quality Control" public data flow as we done previously for Raw Reads Quality Control.

Mapped reads QC report describes results of alignment of preprocessed reads onto the reference genome, and for example for paired-end reads it contains some mapping and insert size statistics, coverage  by chromosome plot and insert size distribution.

When the computation is finished, you can open Mapped reads QC report for each individual file or explore the mapping statistics for all of them using Multiple QC Plotter.In order to do that find all of the merged find the generated mapped reads QC-reports, select all the four, and with context menu open all of them in Multiple QC plotter. Below you can see QC-report for merged mapped reads generated by our team: Merged mapped reads QC

As we can see,  at most 6% of the reads are improperly mapped out of around 65% of mapped reads among all the samples.

Remember that you can explore QC reports not only for merged mapped reads but also for mapped reads.

All the prepared QC reports are located in the folders "Mapped reads QC for Rodriguez et al., 2014" and "Merged Mapped Reads QC for Rodriguez et al., 2014".

7. Methylation ratio analysis

When sequencing reads for both murine phenotypes are mapped onto reference genome with high enough quality, we can move on to the very last step in our pipeline - determining DNA methylation status of every single cytosine position on the both strands. In order to do that, go back to the Data Flow Runner page. Click on the "Methylation Ratio Analysis" to go to the app page where you can see source files and command line options that could be easily changed.

DF methylation analysis

Then return to the data flow click on "action", "add files", chose the remaining merged mapped reads, and start initialization.

During this step we apply several options to remove technical biases in WGBS data:

  1. Trim N end-repairing fill-in bases set to "3". This option allows to trim 3 bases from the read end to remove the DNA overhangs created during read end-repair in library preparation. It is important because this end repair procedure may introduce artefacts if the repaired bases contain methylated cytosines.
  2. Report only unique mappings
  3. Discard duplicated reads option to remove duplicated reads which have identical sequences and could be the result of library preparation. These reads could be mapped to the same position and distort results of downstream analysis.

The folder Methylation Ratios for Rodriguez et al., 2014 contains all the resulting files of methylation ratios estimation.

8. Exploring the genome methylation levels in Genome Browser

We can explore the distribution of genome methylation levels counted for both murine phenotypes in Genome Browser.

As it was mentioned before, "Canyons" are the large unmethylated DNA regions inside of a highly methylated locus that often harbour genes playing a role in hematopoiesis and being deregulated in human leukemia. Some Canyons can be exceptionally large, for example one associated with the Pax6 homeobox gene encoding a homeobox-containing protein regulating  transcription is extended over 25 kb:

Genome Browser (Pax6; only WTs)

Let's compare our methylation ratios distribution in these region with author's results:


As DNMT3a is often mutated in human leukemias, authors also examined the impact of loss of Dnmt3a on the Canyon size. For this they compared low-methylated regions in HSCs conditionally inactivated for Dnmt3a to WT HSCs:

Genome Browser (WT vs KO)This investigation revealed the fact that methylation loss in Dnmt3a KO HSCs leads to the formation of new Canyons. Lack of Dnmt3a does not affect regions inside Canyons but it results in changes of Canyon boundaries: Canyon size can be decreased due to hypermethylation or increased due to of hypomethylation. Moreover, at DNA regions containing cluster of Canyons in WT HSCs,  larger Canyons ("Grand Canyons")  can be formed. We can see it on the example of  HoxB regions in which Canyons are interrupted by short stretches of higher methylation.

All these findings suggest that Dnmt3a can be crucial for maintaining methylation in the Canyon boundaries.

Now, let's take a look at the original track for the same Canyon cluster to compare the results:


This experiment is a part of the large research of changes in DNA methylation profile including different methodologies such as, for example, whole genome bisulfite sequencing and CMS-seq to reveal genome-wide distribution of mCs and hmCs, RNA-Seq to analyse expression of Canyon-associated genes. This incredible work was turned into a research paper, and the data sets can be found in our Public Experiments!

That's it for the tutorial, we hope you will enjoy working on your data with Genestack! Later you can return back to the tutorial if necessary. If you have any questions, suggestions etc, please leave them in comments below or email us.