Originally published by International Biopharmaceutical Industry
Analytics is the discovery, interpretation, and communication of meaningful patterns and knowledge in recorded data. In the life sciences, analytics is crucial for understanding the omics revolution and for staying at the cutting edge of precision medicine. Using data analytics for fact-based decision-making is key to improving patient outcomes: it enables faster drug discovery, supports target validation through to clinical trials, and widens access to personalised medicine. With the cost of sequencing being driven down, the amount of Life Science Data produced is increasing at an unfathomable rate. However, more data does not necessarily mean better data or more knowledge. As with data in any industry, you get out what you put in. So how do we improve the analytics we get out?
The Importance of Metadata
Metadata is – put simply – information about information, which provides valuable context to your recorded data. The primary reason for collecting associated metadata is to enable you to understand and use your data better.
- Metadata can aid with reproducing your data correctly. For instance, if you want to decode a picture with colours as they were intended, you need to use the right colour profile (e.g. sRGB).
- It can help you find a larger piece of information from a small fragment of it. For example, you may want to find a particular picture, say from a holiday trip last year. Most likely you copied it over from your smartphone along with a hundred others. If you can remember the date, you can use it as a search criterion; if you can't, you probably have it in your calendar somewhere.
- Integrating your data into your repositories becomes easier with associated metadata.
- In addition, you may be able to draw conclusions about the contents of data without needing to access and analyse it directly. If you took a lot of pictures around the time of sunset, you can make a reasonable guess that the weather was clear that particular evening.
- Finally, there is almost never a good reason not to save metadata – it has minimal storage requirements compared to the data it is describing.
Standardising templates and the terminology used when managing data and metadata enables you to understand what data you have, prevents duplication and supports collaboration with your peers. Possibly the most important area for data analytics is making data discoverable to others in your own organisation, and beyond, to maximise the use of data and enhance analysis. If every Life Science Dataset produced had complete and standardised metadata associated with it, sourcing the most appropriate data to input into your analysis pipelines would be much easier. Data quality is also a key concern, but associated metadata can instantly help you identify if the dataset is useful for your analysis or not, before you spend time retrieving the file.
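As a minimal sketch of what this can look like in practice, the Python snippet below validates a hypothetical standardised metadata record against required fields and a small controlled vocabulary before a dataset is registered. The field names and vocabularies here are illustrative assumptions, not a prescribed standard.

```python
# Minimal sketch: checking a hypothetical standardised metadata record
# before a dataset is registered. Field names and vocabularies are
# illustrative only, not a real standard.

REQUIRED_FIELDS = {"dataset_id", "organism", "assay_type", "tissue", "created"}
CONTROLLED_VOCAB = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "assay_type": {"RNA-seq", "WGS", "ChIP-seq"},
}

def validate_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - record.keys()]
    for field, allowed in CONTROLLED_VOCAB.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field}={value!r} is not in the controlled vocabulary")
    return problems

record = {
    "dataset_id": "DS-0001",
    "organism": "Homo sapiens",
    "assay_type": "RNA-seq",
    "tissue": "liver",
    "created": "2018-03-02",
}
print(validate_metadata(record))  # [] -> complete, consistent, discoverable
```

With records like this attached to every dataset, sourcing inputs for an analysis pipeline becomes a query over metadata rather than a trawl through files.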
I have met with people of varying levels of experience, across both large and small pharmaceutical and biotech organisations and in many different departments, in order to understand their complex omics workflows and how to create a collaborative omics ecosystem. The most common and pressing problem encountered by all is how their organisation should manage the volumes of data generated and its associated metadata. With more data being produced each year, across different departments and countries, the problem for large organisations is not a lack of data, but how to aggregate it. If you are not managing the data you produce efficiently, then you are not making full use of it, regardless of the analytics you perform.
Absorbing Knowledge: Understanding What You See
Aggregation of, and access to, the masses of both private and publicly available data is only part of the analytics workflow challenge. The analysis of omics and biomedical data requires numerous, often complex skill sets, and is driving collaboration between roles and departments that might previously have been siloed. IT specialists, data architects, bioinformaticians and scientists now need to work closely together in order to collect, analyse and make sense of the volumes of data generated.
For this to work successfully, organisational culture change is often required to close the gaps between these roles and enable interaction and the sharing of data. Bioinformaticians should be empowered to use their expertise in developing new, cutting-edge algorithms and tools to interrogate the data, while scientists should have access to simple, standard analysis tools so they can focus on finding the context in the data that answers the scientific question. Before beginning any data analysis, you first need to know what question is being asked.
With a shortage of skilled bioinformaticians and a growing volume of Life Science Data sets available, the industry needs to focus on enabling scientists to analyse their data without requiring them to learn how to write code or use the Linux command line. This is already the reality for many organisations and prompts ongoing discussion about how to address it: whether to divert budget and resources into developing infrastructure internally, or to look at commercial solutions that can be integrated easily into an organisation's existing infrastructure.
Data visualisation is a key component of any solution. Organisations need visual tools to help identify and interpret meaningful patterns in complex information and large volumes of data. No single solution will address every area an organisation wishes to view. Creating a flexible, agile ecosystem that integrates with existing workflows and analytical tools helps organisations assemble their own best-of-breed environment, one that can grow and evolve as the technology does. Many organisations have already paid the price of migrating to a new software tool that promised a complete solution, only to discover they have swapped one problem for another, often because the analytical tools of choice have continued to evolve. Equally, investing time and energy in creating solutions completely in-house can take many years, proving costly in time, resources and budget. Focusing on data integration, and working with what pharma and biotech organisations already have, is the key to success. Creating an agile ecosystem allows an entire organisation to access data and extract contextual knowledge, and really puts the power of the data into everyone's hands.
AI and Machine Learning – But Are You Really Ready?
Artificial intelligence (AI) and machine learning (ML) are popular discussion topics, with many industry conferences and professionals considering how both might be leveraged in the pharmaceutical sector. Indeed, many executives are beginning to look at AI as a possible solution to their data analysis challenges, and many large pharmaceutical companies are already partnering with smaller companies whose core capability is AI technology, such as AstraZeneca (UK) with Berg (Boston, US), or the collaboration between GSK (UK) and Exscientia (UK). This suggests that many organisations currently lack the infrastructure and company culture to build production machine learning systems in-house.
Implementing the company-wide infrastructure needed to support AI and machine learning is difficult. Many executives are pitching for massive budgets to invest in AI when, in fact, they are perhaps not yet ready for it. Implementing processes and standards across the organisation to address accessibility, data and analytics infrastructure, and engineering and business culture is critical. Focusing on this first will prepare organisations for the AI revolution that awaits. As stated at the beginning of the article, the quality of the data and the infrastructure will determine the quality of the knowledge extracted from its analysis.
Successfully implementing AI is dependent on improving data infrastructure to ensure you get the best output:
- Everything should start with a goal, not with a solution
If you are not dealing with big data, then you don't need to incur the operational costs of creating a Hadoop architecture. If you are doing static analysis, then you don't need real-time predictions or Spark. Start with the problem. What are the goals, and what are the metrics of success? Put in place a way to test various hypotheses so that you understand each candidate solution better. Understanding where a solution falls short is just as important as knowing where it succeeds. Don't swap one problem for another.
- More data is not better data
The common view is that AI can make the analysis of vast amounts of data work; however, what really holds back AI technologies is data quality. Internet giants like eBay and LinkedIn employ large teams of data analysts who ensure that the data fed into learning algorithms is of high quality – a process that takes time and patience, as well as expertise. In fact, Deepak Agarwal, VP of Engineering and Head of Relevance and AI at LinkedIn, corrects a common misconception: “We see a lot of media misreporting about AI taking over human roles. Most AI actually requires more human time, either in engineering or in reviewing, to ensure that the information the algorithms come up with [is] unbiased and accurate.”
- Data standards aligned from top to bottom
Many executives think the data they have is good, when in fact it is incomplete, biased and fragmented. Enforcing company-wide standards for data access and for data and metadata is essential to streamlining analytics – creating a single point of truth, applying clear labelling and metadata, avoiding confusion through detailed documentation and enabling company-wide access. Lack of data access causes delays, and delays cost money. The organisations capitalising on AI have well-documented internal APIs and visualisation tools that help non-technical leaders and employees gain valuable insights from data.
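To make this concrete, here is a small, purely illustrative sketch of the payoff: once datasets carry consistent labels, discovering data becomes a simple query over a metadata catalogue, which in a real organisation would typically sit behind a documented internal API. The catalogue, field names and values below are hypothetical.

```python
# Illustrative only: discovering datasets by querying a metadata catalogue.
# In practice this catalogue would sit behind a documented internal API;
# here it is simply a list of dictionaries.

catalogue = [
    {"dataset_id": "DS-0001", "organism": "Homo sapiens", "assay_type": "RNA-seq",
     "tissue": "liver", "samples": 48},
    {"dataset_id": "DS-0002", "organism": "Mus musculus", "assay_type": "RNA-seq",
     "tissue": "liver", "samples": 12},
    {"dataset_id": "DS-0003", "organism": "Homo sapiens", "assay_type": "WGS",
     "tissue": "blood", "samples": 96},
]

def find_datasets(catalogue, **criteria):
    """Return every record whose metadata matches all the given criteria."""
    return [r for r in catalogue if all(r.get(k) == v for k, v in criteria.items())]

# An analyst looking for human RNA-seq data never has to open the raw files:
for hit in find_datasets(catalogue, organism="Homo sapiens", assay_type="RNA-seq"):
    print(hit["dataset_id"], hit["tissue"], hit["samples"])
```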
- Fact-based decision-making. Be data-driven
An executive culture built on opinion rather than data can thwart rational thinking. If an organisation is not open to new thinking and challenging assumptions or opinions, it may find itself cherry-picking data which aligns with those opinions. Refusing to acknowledge insights from data will take you further away from your goals.
You would be surprised at just how far discovery scientists have gone because they wanted their hypothesis to be true, and, subsequently, just how far they drifted from their original goal. One memorable case was back in 2016, when Neuron published a study reporting that scientists had found “the inherited gene for MS” [http://www.cell.com/neuron/fulltext/S0896-6273(16)30126-X]. I am sure the scientists behind the study were ready and eager for the media to broadcast the discovery and to bask in their success. What they did not anticipate was the scientific community questioning their findings.
So what was the big deal? The study reported seven cases of MS across two families, all carrying the same genetic variant, a mutation in NR1H3, and made the case for this variant being the causal dominant variant for the disease. Firstly, it is not unusual for the same variant to appear in related individuals. Secondly, the paper noted that in those same two families there were four healthy carriers. If this variant is indeed causal, it has incomplete penetrance, which weakens the evidence that the gene is causal.
Concerned by the findings, researchers began looking for data to support the theory or discount it. Surely the study authors had tested their hypothesis against other available data? It turns out they hadn't, because the same genetic variant was later found in no fewer than 21 healthy carriers. These data were not sitting in some siloed repository; they were found in ExAC, the publicly available Exome Aggregation Consortium database. Suddenly, the evidence for a causal dominant variant looked very weak.
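A back-of-the-envelope calculation shows why those healthy carriers matter. As a rough illustration only (treating the ExAC carriers as unaffected and ignoring ascertainment bias), the sketch below estimates the penetrance implied by the counts quoted above:

```python
# Rough, illustrative estimate only: naive penetrance of the NR1H3 variant
# using the carrier counts quoted in the text, treating ExAC carriers as
# unaffected and ignoring ascertainment bias.

affected_carriers   = 7    # MS cases in the two families
healthy_in_families = 4    # healthy carriers reported in the same families
healthy_in_exac     = 21   # healthy carriers later found in ExAC

total_carriers = affected_carriers + healthy_in_families + healthy_in_exac
penetrance = affected_carriers / total_carriers
print(f"naive penetrance estimate: {penetrance:.0%}")  # roughly 22%
```

That is a long way from the high penetrance expected of the causal dominant variant the paper proposed, which is why the claim unravelled so quickly once the public data were taken into account.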
The scientific community learned from this. To avoid repeating such mistakes, organisations can integrate not only their own data, but also public data and commercial LIMS, providing access to as much relevant data as possible in order to make fact-based decisions, rather than seeing only what they want to see.
Many executives have started their analytics journey like alchemists looking for gold, driven by belief in a single solution that can solve all their problems. However, the organisations that have really excelled in improving data analytics, and in using it to drive their business, have been the ones that realised that, in searching for a gold-standard analytics solution, they had found something much greater: the process, infrastructure and culture to improve the quality, accessibility and usability of data across their business.
Indeed, you could say I was once like an alchemist myself, looking for a solution to integrate and manage my Life Science Data. This ultimately led to me leaving the EBI, starting my entrepreneurial journey and founding Genestack in 2011. As I came to understand more of the complexities of the Life Science Data workflow across different organisations, I began to see that the solution was not an all-encompassing one, but a modular one: one that could handle different formats of data; one that could integrate with existing analysis pipelines; one that was flexible and could meet needs not yet anticipated, such as the rise of single-cell technologies. Thus began my personal mission, which is now Genestack's mission: to help our customers improve patient outcomes by enabling better utilisation of Life Science Data.
- Written by Dr Misha Kapushesky
Dr. Kapushesky is the Founder and CEO of Genestack. Misha has over a decade of experience dealing with big data in genomics and has led international consortia applying bioinformatics to medical research. Before founding Genestack in 2011, Misha was a functional genomics team leader at the EBI, where his team developed bioinformatics data systems for academia and industry, such as the Gene Expression Atlas.