Opinion

Genomic-biobank data resources: democratizing access to accelerate personalized medicine

15.02.21

Rapid advances in biobanking and genomics have resulted in unprecedented data resources, capable of teasing apart the biology of complex disease.

The leading example of this genomic-biobank data revolution is the UK Biobank dataset, arguably the most valuable biological dataset generated to date. It boasts a massive collection of longitudinal, phenotypic and genomic data from 500,000 individuals, including their health records, cognitive / physical / lifestyle measures, imaging and biomarker data.

Extracting scientific value - a typical workflow

Extracting scientific value out of such genomic-biobank datasets, however, has been challenging. For example, as a researcher, a common (simplified!) workflow when investigating the genetic factors of a particular disease is the following:

  1. Make the data analysis/query-ready. Omics, biomarker, questionnaire data and more: you need to keep track of the myriad data types recorded for each individual, where they are stored, and how to extract and (pre-)process them.
  2. Browse and select cohorts of individuals that match your study criteria, ensuring balanced population characteristics.
  3. Perform association analysis (e.g. GWAS/pheWAS), correlating millions of variants with disease phenotypes, taking into account various demographic/environmental factors, ultimately to shed light on the disease’s biology, candidate drug targets, and drug portfolio repositioning.
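The workflow above can be sketched in miniature. The snippet below is a hypothetical illustration on synthetic data: the field names ("age", "case", "genotype") and thresholds are invented for the example, and a real GWAS would use dedicated tooling over millions of variants rather than a hand-rolled test for one.

```python
# Toy sketch of steps 2-3: cohort selection, then a simple allelic
# chi-square association test for a single variant (synthetic data).
import random

random.seed(42)

# Step 1/2: a tiny, query-ready table of individuals (in practice,
# 500K rows across thousands of phenotypic fields).
individuals = [
    {"id": i,
     "age": random.randint(40, 70),
     "case": random.random() < 0.5,          # disease status
     "genotype": random.choice([0, 1, 2])}   # minor-allele count at one variant
    for i in range(1000)
]

# Step 2: select a cohort matching (made-up) study criteria.
cohort = [p for p in individuals if 45 <= p["age"] <= 65]

# Step 3: a toy allelic chi-square test for one variant.
def allelic_chi_square(cohort):
    # 2x2 table of allele counts; rows: case/control, cols: minor/major
    table = [[0, 0], [0, 0]]
    for p in cohort:
        row = 0 if p["case"] else 1
        table[row][0] += p["genotype"]       # minor alleles
        table[row][1] += 2 - p["genotype"]   # major alleles
    total = sum(sum(r) for r in table)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

stat = allelic_chi_square(cohort)
print(f"cohort size: {len(cohort)}, chi-square: {stat:.2f}")
```

In a real study, step 3 would also adjust for demographic and environmental covariates (e.g. via logistic regression), which this sketch omits.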

Biobank data - unique opportunities & challenges

Each of the steps above poses a unique, non-trivial hurdle for genomic-biobank datasets. In the case of the UK Biobank (UKBB) dataset:

Data management challenge

There is overlapping or conflicting information between fields (e.g. self-reported vs hospital-provided diagnostic codes) that must be identified and curated. And you need to keep up with the data: new data types (e.g. metabolomics data for 127K UKBB individuals will be added this year), participant withdrawals, new data processing methods, and so on. To do all this, you don't want to work with thousands of disconnected files - it'd be far too time-consuming and error-prone!

Navigational challenge

There are 3.5K+ phenotypic fields from 500K individuals to construct the cohort from. This phenotypic data is big, longitudinal, and hierarchical. So, unfortunately, you can’t simply use Excel to explore this.

Scalability challenge

We are dealing with 10M+ genomic variants per individual. That's 5 trillion data points across 500K individuals, and this number will grow by an order of magnitude with the looming WGS results. Not only will you need high-memory, expensive hardware to process this, but traditional genomics software would simply take too long - days or weeks - to work through this dataset.
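As a rough back-of-envelope check of the numbers above (the one-byte-per-genotype storage figure is a deliberate simplification for scale, not a claim about any particular file format):

```python
# Back-of-envelope scale of the UKBB genotype matrix.
variants = 10_000_000        # 10M+ variants per individual
individuals = 500_000        # UKBB cohort size
data_points = variants * individuals

print(f"{data_points:,} data points")
print(f"~{data_points / 1e12:.0f} TB at 1 byte per genotype, uncompressed")
```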

Extracting value from biobank data resources

Sort out the data first - the insights will come much more easily

Navigational & scalability challenges have been mainly resolved; data management hasn’t. To get the maximum value from biobank data, you need to sort out the data first - the insights will come much more easily.

In recent years, much attention has been paid to the navigational and scalability challenges summarized above - hence, these aspects have largely been solved. Various vendors already offer similar solutions with similar approaches: a combination of a computational engine (e.g. based on Spark or a specialized database), pre-baked cohort/variant/association browsers, and Jupyter notebook integration.

However, the data management challenge remains unsolved, and too much time and effort are still spent on routine curation and tracking tasks.

And this problem will only become more critical as we go beyond a single biobank dataset to jointly explore multiple biobank resources, e.g. Biobank Japan, the Icelandic Biobank, FinnGen, and "All of Us". To reliably combine and compare cohorts across biobanks, we need cross-study data harmonization and data access.

Without overcoming this data management challenge, the insights will remain slow and costly to extract.

Where are you on this journey?

So the question is: where are you on this journey? Are you still wrestling with the navigational and scalability challenges, or have you moved past them, so that data management is now what's holding you back? If our story resonates with you, you're welcome to join our webinar on 23rd March 2021. We'll be showcasing our solution, powered by our flagship product, Genestack ODM, to help you unlock the maximum value from biobank data resources!

Webinar registration >
