Bioinformatics reproducibility and why it matters

Guest post: Dr. Jelena Aleksic is a computational biologist at the Stem Cell Institute at Cambridge University. She works on the role of RNA methylation in neurodevelopmental disease.

When it comes to my profession, there are certain things that keep me up at night. With DNA sequencing getting ever cheaper, and datasets getting ever bigger and more diverse, genomics as a field is undergoing explosive growth. And this is, on the one hand, tremendously exciting. We have already made such huge strides in the last decade, and I have no doubt that genomics will go on to transform healthcare as we know it. Yet the nagging doubt that keeps me up at night is: how much of the research is really solid? How much of it will really hold? You don't have to go far afield to find horror stories of bioinformatics reproducibility. That bit where someone shuffled an Excel column down by one row and changed all the results, or perhaps flipped the ones and zeros and inverted the outcome of the experiment. Other times, maybe the mistakes are not quite as blatant, but that doesn't mean that they're not there. And it's something that's difficult to avoid.

Challenges in the field today

The truth is - most people working in bioinformatics acknowledge that there are major challenges in this area. Our field is fast evolving, so the data formats constantly change. New experiments mean new data analysis techniques are required. The genome assemblies and gene annotations also change frequently. Did you know that the outcome of a GO enrichment analysis can change depending on which database you use the gene identifiers from? Comparing results and synchronising identifiers is similarly fraught with difficulty, and keeping extensive metadata is essential. And yet this requires a lot of effort, and pretty much every lab will have their own custom system of dealing with it.

Another part of the problem is how bioinformatics is practiced in academic science. Since it's a new field, there are relatively few structured training programs in place. A lot of us jumped in from other fields, and do our best to pick it up as we go along. Furthermore, a lot of bioinformaticians work in isolation, and there are substantial quantities of quick code hacks that are never shared, meaning that only one person has ever seen them. Between this and the rapidly changing requirements, even with high skills and the best of intentions, mistakes are inevitable.

Data analysis with Genestack

For all these reasons, I find Genestack really exciting. Lovingly created by bioinformatics geeks who clearly know their stuff, I believe this is an initiative that systematically solves a number of major problems in bioinformatics today. Let's start, for example, with the issue of file formats. Confession time: when I first started learning bioinformatics, it took me a whole year to figure out the difference between 0-based and 1-based formats, because, come on, why would there be a difference in the coordinate system between a .bed and a .gtf? So, that's a whole year of off-by-one errors in about 50% of my scripts. Now it's the first thing I tell my students. However, using the Genestack platform, it's just not something you have to worry about - it's "formatless", which means all the format conversions happen behind the scenes, and you just have to worry about the biology.
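For anyone still doing these conversions by hand, here is a minimal Python sketch of the off-by-one trap. It assumes the usual conventions: BED intervals are 0-based and half-open (the end position is excluded), while GTF intervals are 1-based and closed (the end position is included). The function names are made up for illustration and are not part of any library or of Genestack.

```python
from typing import Tuple


def bed_to_gtf(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open BED interval to a 1-based, closed GTF interval."""
    if start < 0 or end < start:
        raise ValueError(f"invalid BED interval: {start}-{end}")
    # Only the start shifts: BED's excluded end equals GTF's included end.
    return start + 1, end


def gtf_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, closed GTF interval to a 0-based, half-open BED interval."""
    if start < 1 or end < start:
        raise ValueError(f"invalid GTF interval: {start}-{end}")
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Number of bases covered by a BED interval."""
    return end - start


def gtf_length(start: int, end: int) -> int:
    """Number of bases covered by a GTF interval (note the + 1)."""
    return end - start + 1
```

The first 100 bases of a chromosome are 0-100 in BED and 1-100 in GTF. Both describe the same 100 bases, and forgetting which convention a file uses is exactly how the off-by-one errors creep in.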

Data sharing

Another major advantage is the easy data sharing and access to publicly available data. I love how much genomics data is publicly available today, but it would be a lie to say that it is easy to get at. First you need to find it, then download it, and at times use an obscure package to unarchive it. And then you need to reanalyse it, as well as keep track of the metadata for potentially hundreds or thousands of studies, depending on how ambitious your data integration fantasies are. All of this takes a lot of time, and it's fiddly and can easily lead to errors. On the other hand, a cloud-based solution like Genestack can actually solve this problem in a very powerful way. In the current version, they already have data loaded from about 2000 or so studies deposited in GEO, and they're gradually working on making the collection comprehensive. What this means is that the data is instantly accessible, you don't have to worry about formats, and someone else has gone to the trouble of collating the metadata (which in some cases is both manually curated and merged from multiple databases).

Similarly, sharing my own data with colleagues is pretty straightforward, and implemented Google Docs style. This is nice because I'm pretty sure that more instant access to analysis results will please my boss. And actually, sharing project files with my colleagues would mean people other than me could easily rerun the analysis, and would also save me the time spent e.g. processing GEO files for biologist colleagues who want to have a look at them.

As part of the Genestack platform, anyone can access GEO data and process it, and the results can be visualised in a genome browser that's included in the platform. At the moment, the best replacement I have is the 15 fiddly steps required to prepare aligned data for UCSC upload (I'm not joking. bam -> sam -> bedgraph -> add "chr" to chromosome names -> chrMT is called chrM -> bigwig -> upload to public server -> copy link -> use perl script to autogenerate upload code -> google colours I want -> tweak upload code to include desired colours for each condition -> upload -> check that it worked -> success! And don't even ask if you want the data normalized). And I would gladly trade the sense of pride of having gotten that to work, for a system where the multiple format conversions, transfers and uploads are unnecessary.
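To give a flavour of just two of those steps, adding "chr" to chromosome names and turning the mitochondrial name into chrM, here is a rough Python sketch. It assumes a tab-separated bedGraph file with the chromosome name in the first column, and it only covers the simple cases mentioned above; unplaced scaffolds and other unusual names would need their own mapping. The function names are invented for this example.

```python
def to_ucsc_chrom(name: str) -> str:
    """Map a chromosome name such as '1', 'X' or 'MT' to its UCSC-style name."""
    # The mitochondrial chromosome is the special case: MT becomes chrM, not chrMT.
    if name in ("MT", "chrMT"):
        return "chrM"
    # Leave names alone if they already carry the prefix.
    if name.startswith("chr"):
        return name
    return "chr" + name


def convert_bedgraph_line(line: str) -> str:
    """Rename the chromosome in one bedGraph line, leaving header lines untouched."""
    if not line.strip() or line.startswith(("track", "browser", "#")):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    fields[0] = to_ucsc_chrom(fields[0])
    return "\t".join(fields) + newline
```

Applied line by line to a bedGraph file, this covers the renaming before the bigwig conversion. Every other arrow in the chain still needs its own tool or script.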


Keeping track of data provenance is another huge issue. Most bioinformatics research today is not replicable, because to replicate it you would need access to all the scripts, the specific versions of data used, and also the specific versions of software used, since the results can change a lot when particular packages are updated. With the exception of a minority of people who really care about reproducibility and go out of their way to make everything accessible (Titus Brown, I'm looking at you), this isn't common practice. Because, let's face it, it's a lot of work. However, with the Genestack platform, basically all of this is automatically done for you. The platform keeps track of data provenance, saves all versions of tools and packages ever used, so that a particular piece of analysis can be perfectly reproduced years down the line, and also saves the complete workflow online.
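For comparison, this is roughly the bare minimum you would have to record by hand for each analysis. It is a sketch under my own assumptions, not a description of how Genestack does it: it stores a SHA-256 checksum for each input file, the Python version, and the installed versions of whichever packages you name. The function names and the layout of the record are hypothetical.

```python
import hashlib
import platform
from importlib import metadata


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large sequencing files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def provenance_record(input_paths, packages):
    """Collect the data versions and software versions behind one analysis run."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "inputs": {str(path): sha256_of(path) for path in input_paths},
    }
```

Saving the returned dictionary next to the results (as JSON, say) at least tells a future reader which data and which package versions produced them. It does not capture the scripts themselves or any non-Python tools, which is part of why doing this properly is so much work.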

The future

As with all new systems, there are various things still to tweak and develop. However, I feel that this is an incredibly promising platform, because it finally solves all those bioinformatics problems and inconsistencies in a consistent and comprehensive way. And let's face it - as a field, we really need that. Furthermore, it offers a lot of potential for growth. The analysis tools on the platform at the moment are relatively limited, but it is easy for users to add their own scripts and methods, and in the long term even release these as apps for other users. This means that the more the community grows, the more of a useful resource this becomes for everyone. Genestack describes itself as a "universal collaborative ecosystem". I very much hope it becomes a busy and thriving one.