The Human Cell Atlas (HCA) is an exciting new global research initiative, funded by the Chan Zuckerberg Initiative, which aims to build a comprehensive map of the thousands of cell types in the human body.
The diversity of cells in the human body has been explored extensively since the invention of microscopy, and the number of recorded human cell types is currently estimated at around 200. However, this classification is based primarily on cell morphology (i.e. how a cell looks under a microscope). With the advent of new methods like single-cell RNA-seq, which give us insight into a cell's behaviour at the molecular level, the current definition of cell type is being challenged: high transcriptomic heterogeneity can be observed within traditionally defined cell types.
The aim of the HCA is therefore to provide a more systematic classification of cell types, using genetic and molecular features. This would further our understanding of human health and disease, and start building a bridge between our knowledge of the different scales of the human body: from the molecular, single-cell scale to tissues, organs and organ systems.
A few weeks ago, Genestack attended the HCA meeting in Stockholm, which focused on the computational challenges and methods required to build such a cell atlas. Here are some of the topics that were discussed.
In terms of technologies, it seems that single-cell RNA-seq will be dominant in the project, mainly because the method has had time to mature over the past few years and has proven to be scalable - as illustrated by the recent release by 10X Genomics of a dataset comprising expression profiles of a whopping 1.3 million cells.
However, many other technologies are also likely to play an important role in the atlas: single-cell ATAC-seq for chromatin accessibility, as well as lower-throughput methods like single-cell proteomics, protein interaction assays and imaging, which have also started to proliferate over the past few years.
Sources of variability
This section draws, among others, on Nir Yosef's fascinating talk at the Stockholm HCA meeting.
When it comes to single-cell data analysis, and specifically single-cell RNA-seq, a key challenge to address is the many sources of variability which can affect the outcome of the experiment. To name a few:
- human samples can be collected in different ways and stored in different conditions
- in droplet-based systems, when isolating cells into droplets, some droplets can be empty, contain more than one cell, or contain dead cells
- mRNA capture and the reverse transcription process can lead to coverage biases
- repeated rounds of PCR can induce amplification biases; these can be addressed by tagging each original molecule with a unique molecular identifier (UMI), so that PCR duplicates can be identified and collapsed
The use of standardised sample collection protocols, together with thorough QC and read filtering steps, can partially address these issues.
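The UMI idea mentioned above can be sketched in a few lines. This is a deliberately simplified illustration, not any real pipeline's logic: reads are reduced to hypothetical (cell barcode, UMI, gene) triples, and reads sharing a triple are treated as PCR copies of one original molecule.

```python
from collections import defaultdict

def collapse_umis(reads):
    """Collapse PCR duplicates: reads sharing the same
    (cell barcode, UMI, gene) triple came from one original
    molecule and are counted only once.
    `reads` is a list of (cell_barcode, umi, gene) tuples --
    a toy stand-in for aligned, barcode-tagged reads."""
    molecules = set(reads)          # duplicates vanish in the set
    counts = defaultdict(int)       # (cell, gene) -> molecule count
    for cell, umi, gene in molecules:
        counts[(cell, gene)] += 1
    return dict(counts)

# Three reads carrying the same UMI count once; a different UMI
# for the same gene counts as a second molecule.
reads = [
    ("cellA", "AACG", "GeneX"),
    ("cellA", "AACG", "GeneX"),   # PCR duplicate
    ("cellA", "AACG", "GeneX"),   # PCR duplicate
    ("cellA", "TTGC", "GeneX"),
    ("cellB", "AACG", "GeneY"),
]
counts = collapse_umis(reads)
print(counts[("cellA", "GeneX")], counts[("cellB", "GeneY")])  # 2 1
```

Real tools additionally have to handle sequencing errors in the UMI itself (e.g. by merging UMIs within a small edit distance), which this sketch ignores.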
Another critical way of addressing biases and ironing out technical differences between samples is normalisation: a numerical transformation of the RNA-seq expression matrix that aims to reduce noise in the data. Such methods have been thoroughly investigated in the context of bulk RNA-seq data analysis; typical approaches include DESeq normalisation, FPKMs and TPMs. However, one problem specific to single-cell RNA-seq is the abundance of zeros in the expression matrix (due to lack of coverage, but also to the fact that in a given cell at a given time, many genes are simply not expressed). This difference in the distribution of expression values leads to incorrect size factor estimates when bulk RNA-seq normalisation methods are applied to single-cell data. Alternative normalisation strategies have therefore been suggested, for instance pooling counts across cells.
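The zero-abundance problem can be made concrete with a toy example. The sketch below (random counts, not a real dataset) applies a bulk-style, DESeq-like median-of-ratios estimate: because the per-gene geometric mean is zero whenever any cell has a zero count, most genes have to be dropped in sparse data, leaving the estimate poorly supported. Simple library-size scaling is shown as a zero-tolerant fallback; published methods that pool counts across cells refine this idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy counts: 6 cells x 200 genes with many zeros -- a caricature
# of single-cell sparsity (illustrative numbers only).
counts = rng.poisson(2.0, size=(6, 200)) * rng.binomial(1, 0.3, size=(6, 200))

def median_of_ratios_factors(counts):
    """Bulk-style size factors: each cell's factor is the median
    ratio of its counts to a per-gene geometric-mean reference.
    A gene with a zero in any cell has a zero geometric mean and
    must be dropped -- which is most genes in sparse data."""
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts.astype(float))   # -inf where count == 0
    log_ref = log_counts.mean(axis=0)               # per-gene log geo-mean
    usable = np.isfinite(log_ref)                   # genes with no zeros
    if not usable.any():
        return None, 0                              # estimate undefined
    factors = np.exp(np.median(log_counts[:, usable] - log_ref[usable], axis=1))
    return factors, int(usable.sum())

factors, n_usable = median_of_ratios_factors(counts)
print(f"genes usable for the bulk-style estimate: {n_usable} / 200")

# A simple fallback that tolerates zeros: library-size scaling.
lib_factors = counts.sum(axis=1) / counts.sum(axis=1).mean()
print("library-size factors per cell:", np.round(lib_factors, 2))
```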
Another strategy is to rely on the use of spike-in RNAs (such as the ERCC set). These are synthetic transcripts introduced in known quantities into each cell prior to sequencing. By measuring the spike-in abundances estimated by sequencing, and comparing them to the known true abundances, we can produce an estimate of technical noise.
This strategy has limitations, however: spike-ins add cost, and the synthetic transcripts do not exhibit the same biochemical properties as real human transcripts, so some biases will not be reflected in their measured abundances.
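The observed-versus-known comparison behind spike-in noise estimation can be sketched as a simple log-log regression. The input amounts and simulated counts below are made up for illustration (loosely shaped like an ERCC-style dilution series); the residual scatter around the fit serves as a crude per-experiment estimate of technical noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical spike-in mix: known input amounts spanning several
# orders of magnitude (illustrative numbers, not the real ERCC mix).
known = np.array([10.0, 50.0, 250.0, 1250.0, 6250.0, 31250.0])

# Simulated observed counts: proportional to input (imperfect
# capture efficiency) plus Poisson counting noise.
observed = rng.poisson(known * 0.8)

# Fit observed vs known on a log-log scale; the residual spread
# around the fit is a crude estimate of technical noise.
log_known = np.log10(known)
log_obs = np.log10(observed + 1)          # +1 guards against zero counts
slope, intercept = np.polyfit(log_known, log_obs, 1)
residual_sd = np.std(log_obs - (slope * log_known + intercept))

print(f"slope of observed vs known (ideally ~1): {slope:.2f}")
print(f"technical noise estimate (residual sd, log10): {residual_sd:.3f}")
```

In practice, methods built on this idea model how the technical variance depends on expression level, rather than reducing it to a single number as this sketch does.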
Noise and replicates
Some deeper problems also arise when trying to eliminate sources of noise in single-cell RNA-seq data. First of all, because no two cells are the same, and we can't sequence the same cell twice, there are no true technical replicates in single-cell sequencing, unlike in bulk RNA-seq.
Moreover, some sources of variability are more subtle and can be confounded with biological signals of interest, depending on what you're looking for. For instance, a parameter which has a strong influence on expression at the single-cell level is the cell cycle: at the time of capture and sequencing, different cells are in different phases of their cell cycle, and this information may or may not be relevant to your experiment. You may therefore choose to use a numerical transformation to deconvolute cell cycle information from the expression data.
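One simple form such a transformation can take is linear regression: given a per-cell cell-cycle score (a hypothetical covariate, e.g. derived from known cycle marker genes), regress each gene's expression on the score and keep the residuals. The data below are simulated purely to show the mechanics; real methods are more sophisticated, but the idea is the same.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy log-expression matrix: 100 cells x 5 genes, where part of the
# variation is driven by a per-cell "cell-cycle score".
cycle_score = rng.normal(size=100)
expression = (rng.normal(size=(100, 5))
              + np.outer(cycle_score, [2.0, 1.5, 0.0, 0.5, 3.0]))

# Regress each gene on the cycle score (with an intercept) and
# keep the residuals: a simple linear deconvolution of the
# cell-cycle effect from the expression data.
X = np.column_stack([np.ones(100), cycle_score])
beta, *_ = np.linalg.lstsq(X, expression, rcond=None)
corrected = expression - X @ beta

# The corrected values should no longer track the cycle score.
for g in range(5):
    r = np.corrcoef(cycle_score, corrected[:, g])[0, 1]
    print(f"gene {g}: correlation with cycle score = {r:+.3f}")
```

The catch, as the next paragraph notes, is that this removes cell-cycle variation whether or not it was biologically meaningful for your question, which is exactly why the choice of transformation is experiment-specific.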
This goes to show that there is no silver bullet for single-cell normalisation, and the processing steps should be chosen depending on what is defined as "signal" and "noise" in the context of each experiment.
This meeting was a great opportunity to discuss the data analysis pitfalls of single-cell RNA-seq and other single-cell technologies, and to get an update on the progress of the HCA project. The high-level compute architecture for data ingestion, data analysis and downstream visualisation of the HCA data is already well-defined. This bodes well for the value the project will bring to the research and bioinformatics community as a whole; as Dana Pe'er noted in one of her talks, the atlas will only be as good as our ability to navigate it.