This article was originally published in Splice on 9th of February 2016. You can find the original version here.
The challenges of bioinformatics
Let’s look back at how the cost of genome sequencing has changed over the last 15 years. In 2001, for the price of sequencing one human genome, a person could buy 1,000 Porsche 911 cars. If you were deciding between spending your money on cars or on genome sequencing in mid-2008, for the price of knowing your genetic code you could have bought 10 Porsches. Within a year this would drop to just one Porsche. How does the situation look now? Since Illumina launched the HiSeq X Ten sequencer, able to deliver the first $1,000 genome, your sequencing money would cover only 1% of the cost of your dream car. Money is no longer standing in the way of personalised medicine.
If that’s the case – where is the medical revolution scientists have been talking about for the last 15 years?
New rate-determining steps
No one predicted such fast-paced evolution of sequencing technologies, and the world of computational biology now faces major challenges in data storage, management, security and analysis. These are the new rate-determining steps in introducing genomics into our everyday lives.
Slow evolution of computers
The lower cost, and hence greater scale, of genomic sequencing is producing enormous amounts of data, and since computers are not evolving as fast as sequencing technologies, we find ourselves facing major CPU and storage problems. Bioinformatics was one of the first fields to adopt the cloud. Cloud infrastructures are flexible and dynamic, letting users scale their allocated resources up and down according to their needs. Unlike increasing CPU and storage capacity with a computer cluster, bioinformatics in the cloud can be performed by an individual user, for example a PhD student working in a lab without a strong bioinformatics base.
The formatting pain
Anyone doing bioinformatics at any scale will come across a universal problem: a seemingly endless number of file formats. Researchers estimate that they spend about 80% of their time on data grooming and only 20% on actual data analysis. A lack of standardised file types and inconsistent data formatting mean that every new program brings a new data format. This creates a challenge for anyone who wants to use publicly available datasets. After finding and downloading the data, it is essential to analyse it and check its quality and suitability for the planned study, all without losing the metadata in the process. These are time-consuming tasks that more often than not lead to errors. The solution to this problem is to automate these tedious data-grooming tasks, giving researchers more time to focus on data analysis. For instance, the Genestack platform is “format-free”, meaning that when data is uploaded onto the platform in any of the possible formats, it ‘loses’ the format and becomes a meaningful biological object, with all objects of the same kind acting identically regardless of underlying formatting differences.
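To make the “format-free” idea more concrete, here is a minimal Python sketch of the general principle. It is not Genestack’s implementation: the names SequenceRecord, parse_fasta, parse_fastq, load_records and gc_content are all hypothetical, and the FASTQ parser assumes simple four-line records. The point is only that once two files are parsed into the same kind of object, downstream code no longer needs to care which format they came from.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass(frozen=True)
class SequenceRecord:
    """Format-independent view of one sequence (hypothetical name)."""
    identifier: str
    sequence: str
    qualities: Optional[str] = None  # only FASTQ carries per-base qualities


def _first_word(header: str) -> str:
    """Return the identifier part of a header line, without its marker."""
    words = header[1:].split()
    return words[0] if words else ""


def parse_fasta(lines: Iterable[str]) -> Iterator[SequenceRecord]:
    """Parse FASTA text; sequences may be wrapped over several lines."""
    identifier, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield SequenceRecord(identifier, "".join(chunks))
            identifier, chunks = _first_word(line), []
        else:
            chunks.append(line)
    if identifier is not None:
        yield SequenceRecord(identifier, "".join(chunks))


def parse_fastq(lines: Iterable[str]) -> Iterator[SequenceRecord]:
    """Parse FASTQ text, assuming unwrapped four-line records."""
    kept = [line.strip() for line in lines if line.strip()]
    for start in range(0, len(kept) - 3, 4):
        header, sequence, _separator, qualities = kept[start:start + 4]
        yield SequenceRecord(_first_word(header), sequence, qualities)


def load_records(text: str) -> List[SequenceRecord]:
    """Guess the format from the first non-blank line and parse the text."""
    lines = text.splitlines()
    first = next((line for line in lines if line.strip()), "")
    if first.startswith(">"):
        return list(parse_fasta(lines))
    if first.startswith("@"):
        return list(parse_fastq(lines))
    raise ValueError("unrecognised sequence format")


def gc_content(record: SequenceRecord) -> float:
    """Fraction of G and C bases; works the same for any source format."""
    if not record.sequence:
        return 0.0
    bases = record.sequence.upper()
    return (bases.count("G") + bases.count("C")) / len(bases)
```

With this in place, gc_content can be called on records loaded from either a FASTA or a FASTQ file without any format-specific branching in the analysis code. A real platform would of course have to cover far more formats and edge cases than this sketch does.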
The reproducibility struggle
Other common complaints involve problems with reproducibility and metadata organisation, such as incorrectly annotated genes or no data annotation whatsoever. Keeping track of data provenance is essential, and details such as scripts and the specific versions of the tools used must be carefully recorded so that someone can reproduce the analysis in the future. This is crucial, since reproducibility is an absolute necessity for cumulative science. Noting down all scripts and parameters is incredibly time-consuming, so automating this process is a great advantage that saves researchers significant time and effort.
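As a rough illustration of what automated provenance tracking can look like, the sketch below writes a small JSON record alongside an analysis. It uses only the Python standard library; the function names (sha256_of, provenance_record, write_provenance) and the record layout are assumptions made for this example, not part of any particular platform.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Checksum a file in 1 MiB blocks so large inputs are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def provenance_record(inputs, tool_versions, parameters):
    """Collect the details someone would need to repeat the analysis."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "command": list(sys.argv),  # how the script was invoked
        "inputs": {str(path): sha256_of(path) for path in inputs},
        "tool_versions": dict(tool_versions),  # supplied by the caller
        "parameters": dict(parameters),
    }


def write_provenance(record, destination):
    """Store the record as human-readable JSON next to the results."""
    text = json.dumps(record, indent=2, sort_keys=True)
    Path(destination).write_text(text, encoding="utf-8")
```

A script might call provenance_record(["reads.fastq"], {"aligner": "1.2.3"}, {"min_quality": 20}) and pass the result to write_provenance; the file name, tool name, version and parameter here are made up for illustration. Checksumming the inputs means a later reader can confirm they are working from exactly the same data.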
During a genomics conference I asked a bioinformatics trainee what the worst part of his job was. His answer was: “unrealistic expectations of scientists”. Wouldn’t it be easier for everyone if the person who designed and performed the experiment were also the person analysing the data? But most scientists have little or no coding experience. There is a growing need for tools and platforms that enable scientists to analyse their sequencing data without expert knowledge of programming languages such as Python or R. This problem is becoming more and more important, as there is a huge discrepancy between the numbers of lab researchers and bioinformaticians in any research facility.
Future of genomics in the clinic
The world of genomics is rapidly changing the landscape of healthcare as we know it. With the price of genome sequencing dropping below $1,000, personalised medicine and treatment plans based on your genetic make-up will become our everyday reality. What are the challenges of using NGS tools in the clinic? The most important ones include data security, storage, analysis and interpretation.
Raw sequencing runs generate hundreds of gigabytes of data from a single measurement, which means current clinical data management infrastructure cannot handle such enormous amounts of data. With the development of cloud computing, it seems realistic that this way of storing and managing data will soon become more and more common in the clinical setting. However, many remain uncertain whether cloud computing will meet the standards of data security and archiving, and how it will comply with regulatory requirements. As a result, new and better integrated systems and methods are required so that we can unleash the full potential of genomics. In my next article I’ll describe the project that our team at Genestack, together with our partners, has been working on to bring the benefits of a cutting-edge bioinformatics platform to the clinic.