Rapid advances in biobanking and genomics have produced unprecedented data resources with the power to tease apart the biology of complex disease.
The leading example of this genomic-biobank data revolution is the UK Biobank (UKBB) dataset, arguably the most valuable biological dataset generated to date. It boasts a massive collection of longitudinal phenotypic and genomic data from 500,000 individuals, including their health records, cognitive, physical and lifestyle measures, imaging and biomarker data.
Extracting scientific value - a typical workflow
Extracting scientific value out of such genomic-biobank datasets, however, has been challenging. For a researcher investigating the genetic factors of a particular disease, a common (simplified!) workflow looks like this:
- Make the data analysis/query-ready. Omics, biomarker, questionnaire data, etc. - you need to keep track of the myriad data types collected for each individual, where they live, and how to extract and (pre-)process them.
- Browse and select cohorts of individuals that match your study criteria, ensuring balanced population characteristics.
- Perform association analysis (e.g. GWAS/pheWAS), correlating millions of variants with disease phenotypes while accounting for demographic and environmental factors, ultimately to shed light on the disease's biology, candidate drug targets, and drug portfolio repositioning (see the sketch after this list).
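To make the last two steps concrete, here is a minimal sketch in Python. The file names, column names and variant ID (ukbb_phenotypes.csv, t2d_diagnosis, rs0000000, etc.) are hypothetical placeholders, and a real GWAS would of course use dedicated tooling rather than a per-variant regression:

```python
import pandas as pd
import statsmodels.api as sm

# Load a (hypothetical) extract of the phenotype table, keyed by participant ID.
pheno = pd.read_csv("ukbb_phenotypes.csv", index_col="participant_id")

# Step 2: select a cohort matching the study criteria - here, type 2 diabetes
# cases plus an equally sized random control group.
cases = pheno[pheno["t2d_diagnosis"] == 1]
controls = pheno[pheno["t2d_diagnosis"] == 0].sample(n=len(cases), random_state=0)
cohort = pd.concat([cases, controls])

# Step 3: test a single variant for association with the phenotype, adjusting
# for age and sex. A full GWAS repeats this across millions of variants with
# dedicated, optimized tools.
genotypes = pd.read_csv("variant_dosages.csv", index_col="participant_id")
df = cohort.join(genotypes["rs0000000"], how="inner").dropna()

X = sm.add_constant(df[["rs0000000", "age", "sex"]])
result = sm.Logit(df["t2d_diagnosis"], X).fit(disp=0)
print(result.summary())
```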
Biobank data - unique opportunities & challenges
Each of the steps above poses a unique, non-trivial hurdle for genomic-biobank datasets. In the case of the UKBB dataset:
Data management challenge
There is overlapping/conflicting information between fields (e.g. between self-reported and hospital-provided diagnostic codes) that needs to be identified and curated. And you need to keep up with the data - new data types (e.g. metabolomics data for 127K UKBB individuals will be added this year), patient withdrawals, new data processing methods, etc. To do all this, you don't want to work with thousands of disconnected files - it would be far too time-consuming and error-prone!
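As an illustration of the curation involved, the sketch below merges a hypothetical self-reported diabetes flag with hospital ICD-10 codes and flags the participants where the two sources disagree; the column names are placeholders for whichever UKBB fields your study actually uses:

```python
import pandas as pd

# Hypothetical per-participant columns: "self_reported_t2d" (questionnaire)
# and "icd10_codes" (semicolon-separated hospital diagnoses).
pheno = pd.read_csv("ukbb_phenotypes.csv", index_col="participant_id")

self_report = pheno["self_reported_t2d"] == 1
hospital = pheno["icd10_codes"].fillna("").str.contains("E11")  # ICD-10 E11: type 2 diabetes

# Record where the two sources disagree so they can be curated,
# rather than silently trusting one of them.
pheno["t2d_case"] = self_report | hospital
pheno["t2d_source_conflict"] = self_report ^ hospital
print(pheno["t2d_source_conflict"].sum(), "participants need curation")
```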
Navigational challenge
There are 3,500+ phenotypic fields from 500K individuals from which to construct a cohort. This phenotypic data is big, longitudinal, and hierarchical. So, unfortunately, you can't simply use Excel to explore it.
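One pragmatic way to stay afloat is to never load the full table at all. The sketch below reads only the columns for a handful of fields of interest from a hypothetical tab-delimited UKBB export; the file name and field IDs are placeholders, and the "<field>-<instance>.<array>" column naming is an assumption about how your export is laid out:

```python
import pandas as pd

# Field IDs to keep, e.g. sex, age at assessment, systolic blood pressure.
# These, and the file name, are placeholders for whatever your study needs.
fields_of_interest = {"31", "21003", "4080"}

# Read the header row only, then work out which columns belong to those fields;
# UKBB-style exports name columns "<field>-<instance>.<array>", e.g. "4080-0.0".
header = pd.read_csv("ukbb_main_dataset.tsv", sep="\t", nrows=0)
wanted = ["eid"] + [c for c in header.columns if c.split("-")[0] in fields_of_interest]

# Load only those columns - pulling all 3,500+ fields into memory at once is
# exactly what makes spreadsheet tools fall over.
pheno = pd.read_csv("ukbb_main_dataset.tsv", sep="\t", usecols=wanted)
print(pheno.shape)
```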
Scalability challenge
We are dealing with 10M+ genomic variants per individual. That's 5 trillion data points across 500K individuals - and this number will grow by an order of magnitude with the looming WGS results. Not only will you need expensive, high-memory hardware to process this, but traditional genomics software would simply take far too many days to churn through it.
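A quick back-of-the-envelope calculation shows why a single machine doesn't cut it (1 byte per genotype call is an optimistic assumption; dosages, annotations and QC metadata add considerably more):

```python
# Back-of-the-envelope size of the genotype matrix alone.
individuals = 500_000
variants = 10_000_000
data_points = individuals * variants              # 5e12, i.e. 5 trillion

# Even at an optimistic 1 byte per genotype call that is ~5 TB,
# before dosages, annotations or QC metadata are added.
bytes_needed = data_points * 1
print(f"{data_points:.1e} data points, ~{bytes_needed / 1e12:.0f} TB at 1 byte each")
```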
Sort out the data first - the insights will come much more easily
Navigational & scalability challenges have been mainly resolved; data management hasn’t. To get the maximum value from biobank data, you need to sort out the data first - the insights will come much more easily.
In recent years, much attention has been paid to the navigational and scalability challenges summarized above - and they have now largely been solved. Various vendors already offer similar solutions with similar approaches: a combination of a computational engine (e.g. based on Spark or a specialized database), pre-baked cohort/variant/association browsers, and Jupyter notebook integration.
However, the data management challenge remains unsolved. Too much time and effort are spent on routine tasks like these (a minimal bookkeeping sketch follows the list):
- Keeping track of the relationships between patients, samples, and the various data types (omics, biomarker, health records, etc.), their metadata (processing parameters, genome version, etc.), and their downstream analyses (e.g. GWAS/pheWAS results).
- Identifying and curating overlapping/conflicting information among thousands of phenotypic fields and individuals; then saving/sharing these custom annotations.
- Integratively searching across data types, e.g. stratifying individuals by biomarkers, physical/cognitive measures, and genomic/metabolomics signatures. Handling patient dropout / consent withdrawal, data/metadata versioning, etc.
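To make the first of those tasks concrete, here is a deliberately minimal, hypothetical data model in Python. Real systems keep these relationships in a database rather than in dataclasses, but the sketch shows the linkage - participant to sample to dataset to derived analysis - that otherwise lives only in file names and people's heads:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Dataset:
    name: str                                   # e.g. "T2D GWAS summary stats, v2"
    data_type: str                              # "omics", "biomarker", "health_records", ...
    version: str                                # processing parameters, genome build, etc.
    derived_from: List[str] = field(default_factory=list)

@dataclass
class Sample:
    sample_id: str
    participant_id: str
    datasets: List[Dataset] = field(default_factory=list)

@dataclass
class Participant:
    participant_id: str
    consent_withdrawn: bool = False             # must propagate to every linked dataset
    samples: List[Sample] = field(default_factory=list)

def withdraw(participant: Participant) -> List[str]:
    """Mark a withdrawal and return every dataset that must be masked or re-generated."""
    participant.consent_withdrawn = True
    return [ds.name for s in participant.samples for ds in s.datasets]
```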
And this problem will only become more critical as we go beyond a single biobank dataset to jointly explore multiple biobank resources, e.g. Biobank Japan, the Icelandic biobank, FinnGen, and All of Us. To reliably combine and compare cohorts across biobanks, we need cross-study data harmonization and data access.
Without overcoming this data management challenge:
- You will spend more time on the repetitive tasks of organizing, cleaning, and turning data into query/analysis-ready form than on doing data science.
- The problem will become more painful as biobank data resources grow in volume, complexity, and availability.
- Ultimately, it will leave you with suboptimal data for analysis and interpretation.
Where are you on this journey?
So the question is - where are you on this journey? Are you still wrestling with the navigational/scalability challenges, or have you moved past them, so that data management is now what's holding you back? If our story resonates with you, you're welcome to join our webinar on 23rd March 2021. We'll be showcasing our solution, powered by our flagship product, Genestack ODM, to help you unlock the maximum value from biobank data resources!
Webinar registration >
RELATED CONTENT:
> Webinar — How to get the maximum value from biobank data resources
> Genestack signs multi-year agreement with AstraZeneca to implement Genestack ODM