The Challenge
Differentiating between technical noise and true biological variability is a key issue in single-cell RNA-sequencing analysis. Normalisation methods and bioinformatic tools commonly used for bulk RNA-seq cope poorly with the high cell-to-cell variability. One solution is to add, or ‘spike in’, a known amount of specific RNA transcripts to quantify technical noise, and then analyse the data with dedicated computational techniques (e.g. Brennecke et al., 2013). However, no good solution existed for spike-in-free datasets. The client tasked Genestack with developing a solution for processing and analysing such data in an intuitive and visual manner.
Our Solution
Our expert team tackled this challenge by developing two custom Genestack applications: Single-Cell Analyser and Single-Cell Visualiser.
Single-Cell Analyser uses two noise models: the Brennecke et al. (2013) model and a novel, custom-developed approach for the analysis of spike-in-free datasets. With these models, the app identifies a set of genes showing significant biological variability across cells, and then groups cells into clusters with similar gene expression patterns.
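To give a feel for what "significant biological variability" means in practice, here is a minimal Python sketch that flags genes whose squared coefficient of variation (CV²) clearly exceeds a simple technical-noise baseline. It is an illustration only: it assumes pure Poisson counting noise (CV² = 1/mean) as the baseline, which is neither the fitted Brennecke et al. model nor Genestack's custom spike-in-free approach. The function name, gene names, and the `fold` and `min_mean` thresholds are all hypothetical.

```python
from statistics import mean, variance


def variable_genes(expr, fold=2.0, min_mean=1.0):
    """Return genes whose CV^2 exceeds `fold` times a Poisson baseline.

    expr maps a gene name to its list of counts, one value per cell.
    Assumption: technical noise is approximated by CV^2 = 1 / mean.
    """
    selected = []
    for gene, counts in expr.items():
        mu = mean(counts)
        if mu < min_mean:
            # Very low counts are dominated by sampling noise; skip them.
            continue
        cv2 = variance(counts) / mu ** 2   # observed squared CV
        baseline = 1.0 / mu                # assumed technical noise
        if cv2 > fold * baseline:
            selected.append(gene)
    return selected


# Toy expression table (hypothetical genes, four cells each).
expr = {
    "gene_a": [10, 11, 9, 10],   # stable across cells
    "gene_b": [0, 40, 1, 39],    # on in some cells, off in others
    "gene_c": [0, 0, 1, 0],      # too lowly expressed to judge
}
print(variable_genes(expr))
```

In this toy table only gene_b passes the filter: gene_a varies less than the baseline allows, and gene_c falls below the minimum mean expression.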
The Single-Cell Visualiser provides interactive visualisations to explore these clusters. The app uses both industry-standard methods, such as PCA, and newer tools, such as the t-SNE algorithm, which is better suited to segregating clusters of samples with similar gene expression patterns in the presence of technical noise. Genestack's Single-Cell Analyser and Visualiser have a key feature: an improved, effective way of automating cluster identification. Previously, automatic cluster identification was done in the visualiser by cutting the heatmap dendrogram at a fixed number of nodes.
Using the new method, it is possible to divide cells into clusters automatically using the well-known k-means algorithm. This way, each cell is assigned to a cluster based on its gene expression profile. Moreover, the app lets you determine the optimal number of clusters using the "elbow method", which looks for the point beyond which adding more clusters no longer substantially reduces the variation within clusters.
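The idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the Genestack apps: the cells are hypothetical 2-D points (for example, coordinates from a t-SNE plot), the starting centroids are chosen with a deterministic farthest-first rule rather than at random (an assumption made so the example is reproducible), and the elbow is located with one simple heuristic, the largest change in slope of the within-cluster sum of squares.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Farthest-first start: begin with the first point, then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's k-means. Returns (labels, within-cluster sum of squares)."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each cell joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, inertia


def elbow_k(inertias):
    """Pick k at the sharpest bend; inertias[i] belongs to k = i + 1."""
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


# Hypothetical data: three well-separated groups of five cells each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
cells = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

inertias = [kmeans(cells, k)[1] for k in range(1, 7)]
best_k = elbow_k(inertias)
labels, _ = kmeans(cells, best_k)
print(best_k, labels)
```

On this toy dataset the within-cluster sum of squares falls steeply up to three clusters and barely changes afterwards, so the heuristic settles on three, and each group of five cells receives its own label. Real single-cell data rarely shows such a clean bend, which is why the elbow is usually confirmed by eye on a plot.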
Impact
By combining the t-SNE algorithm with k-means clustering, users can easily perform both sample visualisation and automated classification of cells into subpopulations, much more effectively than with standard PCA or dendrogram cutting. The applications have been tested and thoroughly validated using both publicly available datasets and the client’s in-house data. This included a published Nature dataset with data from 800 cells, which was easily handled by harnessing the power of the cloud. The client’s team of biologists was able to comfortably use the applications to run their own analyses with minimal training.