Want to democratize data science? Focus on the data and search services first, not the visual analytics

11.06.21

In a typical R&D organization, 9 out of 10 scientists are unable to explore data effectively. Most of them lack the computational skills needed to work with the data directly and independently. Instead, they are constrained either by the limited availability of data scientists (who, in turn, spend most of their time on data wrangling rather than data science) or by rigid self-service visualization tools, which can’t keep up with rapidly evolving data and research needs.

Deal with the root cause: the data and the search services, not the visual analytics.

The key issue is that data visualization applications are only as good as the underlying data quality and the ability to query it. For example:

To know in which cell type / tissue / cell line a gene is expressed, the challenge is harmonizing the metadata and expression data across your private and public experiments.
To stratify individuals by their multi-omics profiles, the challenge is tracking the diverse subject-sample-data relationships.
To correlate large molecular and phenotypic datasets, the challenge is integrating and indexing them in a scalable manner.
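
As a minimal sketch of the first challenge, assume two experiments label the same tissue differently. The synonym table, field names, and sample records below are hypothetical and made up for illustration; a real project would more likely map labels to a curated ontology than to a hand-written dictionary.

```python
# Hypothetical synonym table: raw label (lower-cased) -> harmonized term.
TISSUE_SYNONYMS = {
    "liver": "liver",
    "hepatic tissue": "liver",
    "whole blood": "blood",
    "blood": "blood",
}


def harmonize_tissue(raw_label):
    """Return the harmonized tissue term, or None if the label is unknown."""
    key = raw_label.strip().lower()
    return TISSUE_SYNONYMS.get(key)


def harmonize_samples(samples):
    """Split samples into harmonized records and records needing curation."""
    clean, needs_review = [], []
    for sample in samples:
        term = harmonize_tissue(sample["tissue"])
        if term is None:
            needs_review.append(sample)
        else:
            clean.append({**sample, "tissue": term})
    return clean, needs_review


# Made-up records from a private and a public experiment.
samples = [
    {"sample_id": "S1", "source": "private", "tissue": "Liver "},
    {"sample_id": "S2", "source": "public", "tissue": "hepatic tissue"},
    {"sample_id": "S3", "source": "public", "tissue": "WB"},
]
clean, needs_review = harmonize_samples(samples)
```

Unknown labels such as "WB" are set aside for curation instead of being guessed, so a wrong mapping cannot silently enter later queries.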

Data management and search services enable you to build diverse visual analytic apps much more rapidly

Your best bet is to focus on the underlying source of limitations first: data management and search services. You’ll then have the lever to build applications much faster and much more cheaply, to the point that they become disposable.

What do good data management and search services look like?

Good data management should help you break down data silos, clean messy metadata, and track complex relationships. It should also enable you to do this as early as possible, not retrospectively, because collecting and cleaning data only at the point when you need it is painstakingly difficult.
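
One way to picture relationship tracking is to record the subject-sample-data links when the data is registered, so that later questions become simple lookups. The sketch below is illustrative only: the class names, identifiers, and file names are hypothetical and do not describe any particular product's data model.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    sample_id: str
    # Maps a data type (e.g. "transcriptomics") to a file name.
    data_files: dict = field(default_factory=dict)


@dataclass
class Subject:
    subject_id: str
    samples: list = field(default_factory=list)

    def data_types(self):
        """All data types available for this subject across its samples."""
        return {dtype for s in self.samples for dtype in s.data_files}


def subjects_with(subjects, required_types):
    """IDs of subjects that have every required data type."""
    required = set(required_types)
    return [s.subject_id for s in subjects if required <= s.data_types()]


# Made-up subjects: P1 has two omics layers, P2 has only one.
subjects = [
    Subject("P1", [Sample("S1", {"transcriptomics": "s1_rna.tsv"}),
                   Sample("S2", {"proteomics": "s2_prot.tsv"})]),
    Subject("P2", [Sample("S3", {"transcriptomics": "s3_rna.tsv"})]),
]
multi_omics = subjects_with(subjects, ["transcriptomics", "proteomics"])
```

Because the links are captured up front, selecting subjects with a complete multi-omics profile is a one-line filter and not a retrospective clean-up exercise.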

A good search service should let you run integrated queries across thousands of samples, millions of variants, hundreds of thousands of expression values, and so on. It should also be flexible, providing reusable building blocks for creating tailored visual analytic apps.

Once you have good data management and search services in place, consider building the visual analytic apps with a lighter implementation in R/Python rather than a more complex JavaScript frontend stack. This makes life a lot simpler for data scientists: they are more familiar with R/Python, and they can easily use existing bioinformatics packages.
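
A reusable building block of this kind might look like the following Python sketch. It assumes the metadata has already been harmonized and uses a small in-memory index in place of a real search service; the gene names, sample IDs, and values are made up, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up harmonized metadata: sample_id -> tissue.
SAMPLE_TISSUE = {"S1": "liver", "S2": "liver", "S3": "blood"}

# Made-up expression values as (gene, sample_id, value) rows.
EXPRESSION_ROWS = [
    ("GENE_A", "S1", 10.0),
    ("GENE_A", "S2", 14.0),
    ("GENE_A", "S3", 2.0),
    ("GENE_B", "S1", 0.5),
]


def build_gene_index(rows):
    """Index expression rows by gene so per-gene lookups avoid a full scan."""
    index = defaultdict(list)
    for gene, sample_id, value in rows:
        index[gene].append((sample_id, value))
    return index


def mean_expression_by_tissue(index, sample_tissue, gene):
    """Reusable building block: mean expression of one gene per tissue."""
    by_tissue = defaultdict(list)
    for sample_id, value in index.get(gene, []):
        by_tissue[sample_tissue[sample_id]].append(value)
    return {tissue: mean(values) for tissue, values in by_tissue.items()}


index = build_gene_index(EXPRESSION_ROWS)
result = mean_expression_by_tissue(index, SAMPLE_TISSUE, "GENE_A")
```

A plotting or dashboard layer can then call `mean_expression_by_tissue` directly, which keeps the app itself small and easy to replace.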

What’s the impact of implementing this strategy?

Consider a typical scenario in a medium-sized R&D department, where you have 20 data scientists and 200 biologists:

Data scientist: 10 hrs/week building visual analytic apps + 10 hrs/week supporting biologists
Biologist: 10 hrs/week exploring data

If we can make self-service applications faster and cheaper to build, a conservative estimate is that the time spent on each of these activities falls by 50%, which translates to annual cost savings of at least $5M. This is before counting the long-term impact of better science.
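
The arithmetic behind this estimate can be sketched as follows. The headcounts, weekly hours, and 50% saving come from the scenario above. The number of working weeks per year is an assumption, and the article does not state the hourly cost it used, so the sketch derives the hourly cost implied by the $5M figure instead of asserting one.

```python
# Figures taken from the scenario above.
DATA_SCIENTISTS = 20
BIOLOGISTS = 200
DS_HOURS_PER_WEEK = 10 + 10       # building apps + supporting biologists
BIO_HOURS_PER_WEEK = 10           # exploring data
SAVING_FRACTION = 0.5             # the 50% estimate
STATED_ANNUAL_SAVING = 5_000_000  # dollars, the "at least $5M" figure

# Assumption (not from the article): working weeks per year.
WORKING_WEEKS_PER_YEAR = 48

weekly_hours_saved = SAVING_FRACTION * (
    DATA_SCIENTISTS * DS_HOURS_PER_WEEK + BIOLOGISTS * BIO_HOURS_PER_WEEK
)
annual_hours_saved = weekly_hours_saved * WORKING_WEEKS_PER_YEAR

# Average hourly cost at which the hours saved equal the stated figure.
implied_hourly_cost = STATED_ANNUAL_SAVING / annual_hours_saved

print(f"Hours saved per week: {weekly_hours_saved:.0f}")
print(f"Hours saved per year: {annual_hours_saved:.0f}")
print(f"Implied hourly cost:  ${implied_hourly_cost:.2f}")
```

Under the 48-week assumption, the scenario saves 1,200 hours per week, and the $5M figure corresponds to an average cost of roughly $87 per hour of staff time.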

Case study: Expression atlas

Data diversity and volume have grown rapidly over the years: it’s not uncommon now to want to query a gene/protein of interest across thousands of private/public transcriptomics/proteomics samples. Traditionally, it would take months to build such an application.

Using our flagship product, Genestack ODM, a single data scientist can build a powerful proteomics/transcriptomics expression atlas in just a few days. Moreover, it’s written purely in R, with only a few hundred lines of code, so it’s easily customisable and extendable to answer additional research questions.

But there’s no magic: this is only possible because all the hard work of integrating, harmonizing, and indexing the data has been taken care of by Genestack ODM, so the application needs just a few API calls to retrieve the right data and metadata, from the right sources, for the right questions.

By Davide Mantiero, Kevin Dialdestoro
