FAIR data management system

Challenge

Big Pharma companies are facing major data paradigm shifts due to increased Life Science Data in clinical trials, massive data volumes, and data externalisation. With the advent of machine learning and AI, data has proven to be a major asset for decision making and exploring scientific hypotheses. However, it is also becoming apparent that leveraging data is still not an easy task, especially across department and unit boundaries.

The reasons for this are numerous: Relationships between data are often missing or hard to discover. Metadata is inconsistent between different sources, and is sometimes missing or incorrect. Data provenance is not always established.

These problems make it hard for scientists and AI/ML applications to find, re-use, aggregate, analyse and classify data from experimental studies. This in turn slows analyses, forces experiments to be re-run, lowers statistical confidence in results, and causes important experimental data and potentially important cross-study factors to be missed.

No complete solution exists for managing Life Science Data in life sciences R&D, so these problems can currently be mitigated only partially: by hiring additional data managers at significant cost, diverting time from specialists in other fields, duplicating data, and re-running experiments to regenerate lost data. Existing software solutions that aim to address these problems are often not interoperable, which adds to the burden of IT management, user training and adoption, and ultimately increases cost.

The Solution

Guided by our client's R&D needs, Genestack have accelerated the development and application of technologies to address these problems by building a five-layer architecture spanning physical hardware, file system & workflow, Single Points of Truth (SPoTs), data integration, and user interface. Single Points of Truth are information sources for studies, samples, genomics, transcriptomics, and other data types that are stored only once, independently of each other, with linkage handled via references.
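
The idea of storing each data type once and linking records only by reference can be sketched in a few lines of Python. This is a minimal illustration, not the Genestack implementation: the class names, field names and identifiers (SinglePointOfTruth, Study, Sample, ExpressionRecord, "ST1", "SA1", "GENE_A") are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    study_id: str
    title: str


@dataclass(frozen=True)
class Sample:
    sample_id: str
    study_id: str  # reference to a Study, not an embedded copy
    tissue: str


@dataclass(frozen=True)
class ExpressionRecord:
    sample_id: str  # reference to a Sample
    gene: str
    value: float


class SinglePointOfTruth:
    """Hypothetical store that holds each record exactly once, keyed by identifier."""

    def __init__(self, name):
        self.name = name
        self._records = {}

    def add(self, key, record):
        # Refuse duplicates so there is only ever one copy of a record.
        if key in self._records:
            raise ValueError(f"{key} already stored in {self.name}")
        self._records[key] = record

    def get(self, key):
        return self._records[key]

    def values(self):
        return list(self._records.values())


# One independent store per data type.
studies = SinglePointOfTruth("studies")
samples = SinglePointOfTruth("samples")
expression = SinglePointOfTruth("transcriptomics")

studies.add("ST1", Study("ST1", "Example liver study"))
samples.add("SA1", Sample("SA1", "ST1", "liver"))
samples.add("SA2", Sample("SA2", "ST1", "liver"))
expression.add(("SA1", "GENE_A"), ExpressionRecord("SA1", "GENE_A", 12.5))
expression.add(("SA2", "GENE_A"), ExpressionRecord("SA2", "GENE_A", 9.0))


def expression_for_study(study_id):
    """Resolve references across stores: study -> samples -> expression records."""
    sample_ids = {s.sample_id for s in samples.values() if s.study_id == study_id}
    return [r for r in expression.values() if r.sample_id in sample_ids]
```

Because a sample carries only the identifier of its study, correcting the study title in one place is immediately visible to every query that follows the reference.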

In this context Arvados, from Veritas Genetics, provides the underlying distributed and versioned virtual file system. Genestack then provides the key interoperable components for centralising data access across Arvados data locations and public repositories, harmonising metadata, and enabling integrative data mining through APIs and a graphical user interface. Together, we have built a FAIR-compliant and single-point-of-truth system that not only maximises data discovery but also provides the foundation and building blocks for advanced visual analytics and AI/ML applications, for pharma R&D and beyond.

Technology features

Specific technology features of the solution are listed below:

Technology / feature		Benefit
Genestack	Full-text/faceted metadata search powered by SOLR	Easy to find data via biological and technical attributes
	Data-type specific indexers: ClickHouse for Transcriptomics, Genestack proprietary indexer for Genomics	Fast search/retrieval even if data needs to be held at remote locations
	Relationship modelling via MySQL	Easy to model and traverse through studies, samples, Life Science Data and analysis. Ensures data provenance and reproducibility.
	Integrative and distributed metadata and data querying, using a combination of MySQL and in situ indexers	Integrated and federated Life Science Data queries.
	Public data repositories fully integrated and indexed, such as GEO, ArrayExpress	Centralised access to a wealth of data
	Ontologies/controlled vocabularies support and easy curation (ChEBI, Uberon, etc.)	Harmonised metadata, making it consistent, unambiguous, and valid, so that searching becomes easier and data can be re-used correctly
	Programmatic access via REST endpoints, Swagger documentation	Enables bioinformaticians to run standard pipelines and query data for custom analysis. Enables data managers to upload/update/link data programmatically. Enables system integration.
	Client libraries (Python/R), auto-generated using Mustache	Help bioinformaticians to more easily query data for custom analysis
	Third-party tools integration for standard workflows (like QC)	Allows workflows to be expanded with existing or new analyses
	Ability to integrate future modules and integrate new data types as new Single Points of Truth (SPoTs)	Allows future-proofing
	Single sign on	Enables convenient yet secure access
	Modular, service-oriented architecture	Enables system integration and addition of new data and tools
Genestack-Client interface		Seamless integration of client terminologies service for metadata curation
Genestack-Arvados interface		Allows permissions and versioning to be synchronised from Arvados through to the Genestack platform.
Arvados		Provides scalability, non-duplication, performance, provenance and access through its underlying distributed and versioned virtual file system, together with its user access control system
Docker containers		Allow deployment and modularity
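
Two of the features above, metadata harmonisation against a controlled vocabulary and faceted search on the harmonised values, can be illustrated with a small Python sketch. It is an assumption-laden toy: the vocabulary, synonyms, field names and sample identifiers are invented for illustration, and it does not use Solr, ChEBI, Uberon or any Genestack API.

```python
from collections import Counter

# Hypothetical controlled vocabulary: preferred label -> accepted synonyms.
VOCABULARY = {
    "liver": {"liver", "hepatic tissue", "liver tissue"},
    "heart": {"heart", "cardiac tissue"},
}

# Reverse index: normalised synonym -> preferred label.
SYNONYM_INDEX = {
    synonym: label
    for label, synonyms in VOCABULARY.items()
    for synonym in synonyms
}


def harmonise(value):
    """Return the preferred label for a raw value, or None if it is unknown."""
    return SYNONYM_INDEX.get(value.strip().lower())


def harmonise_records(records, field):
    """Split records into harmonised copies and records that need manual curation."""
    clean, needs_curation = [], []
    for record in records:
        label = harmonise(record.get(field, ""))
        if label is None:
            needs_curation.append(record)
        else:
            clean.append({**record, field: label})
    return clean, needs_curation


def facet_counts(records, field):
    """Count records per distinct value of a field, as a search facet would."""
    return Counter(record[field] for record in records)


raw = [
    {"sample": "SA1", "tissue": "Liver"},
    {"sample": "SA2", "tissue": "hepatic tissue "},
    {"sample": "SA3", "tissue": "Cardiac Tissue"},
    {"sample": "SA4", "tissue": "lvr"},  # typo: cannot be mapped automatically
]

clean, needs_curation = harmonise_records(raw, "tissue")
counts = facet_counts(clean, "tissue")
print(counts)
print(needs_curation)
```

Without harmonisation the four raw values would produce four separate facets; after it, the two liver samples fall under one label and the unmappable value is set aside for a curator instead of silently polluting the index.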

Outcome

From both a user’s and an AI/ML perspective, the pRED Data Commons strategy unlocks the full potential of all collected data by making it easy to find, access, interoperate with and reuse, in line with the FAIR principles.

Some of the specific advantages of this are summarised in the table below.

User		New capability
Organisation		Streamline collaboration between teams, maximise and speed up data utilisation, saving person-hours and cost
Scientists		Find and reuse data, plan new experiments
Bioinformaticians		Query data easily for custom analysis
Data managers		Easily harmonise metadata, source new data
IT		Work with a flexible architecture, add new Single Points of Truth (SPoT) in the future, have distributed storage, compute, and federated queries
AI/ML		Access to all relevant data, improved data quality, cross-study comparability

The improved FAIRness of data in the client's data architecture not only saves person-hours and storage (and therefore cost) but also allows new correlations to be discovered and reduces the problem of false positives and statistical noise. Study aggregation enables cross-factor patterns to be more easily discovered, and improved metadata accuracy ensures that all the data in an analysis contributes positively to hypothesis testing and accurate classification.

Summary

Our client

Our client is a multinational healthcare company that operates in the pharmaceuticals and diagnostics sectors. It is a strong player in the field of pharmaceuticals for cancer, viral and metabolic diseases, and one of the largest pharmaceutical companies worldwide.

Project

Finding, aggregating, and re-using Life Science Data are critical needs for life science R&D, yet no complete solution exists. Data remains locked in silos with unorganised metadata and relationships, making it extremely hard to utilise.

We have successfully addressed these problems through building the solution for the client. In this context Arvados provides the underlying distributed and versioned virtual file system, and Genestack provides the key interoperable components for centralising data access across Arvados data locations and public repositories, harmonising metadata, and enabling integrative data mining through APIs and a graphical user interface.

In summary we built a FAIR-compliant and single-point-of-truth system that not only maximises data discovery but also provides the foundation and building blocks for advanced visual analytics and AI/ML applications, for pharma R&D and beyond.

Background

The FAIR principles

The FAIR principles define a set of characteristics that data, tools, vocabularies and infrastructures should have in order to be findable, accessible, interoperable and reusable.

Findable

Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for the automatic discovery of datasets and services.

Accessible

Once the user finds the required data, she/he needs to know how they can be accessed, possibly including authentication and authorisation.

Interoperable

The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.

Reusable

Metadata and data should be well-described so that they can be replicated and/or combined in different settings.
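
As a rough illustration of how the four principles translate into machine-readable metadata, the sketch below checks a dataset record for the kind of fields each principle relies on. The field names, the grouping of fields by principle and the example record are assumptions made for this example; they are not a formal FAIR assessment and not part of the system described here.

```python
# Hypothetical mapping from each FAIR principle to the metadata fields
# a record would need in order to support it.
REQUIRED_FIELDS = {
    "findable": ("identifier", "title"),
    "accessible": ("access_url", "access_policy"),
    "interoperable": ("format", "vocabulary_terms"),
    "reusable": ("licence", "provenance"),
}


def fair_gaps(record):
    """Map each FAIR principle to the fields that are missing or empty."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            gaps[principle] = missing
    return gaps


# Illustrative record; every value is made up.
record = {
    "identifier": "example:dataset-1",
    "title": "Example expression dataset",
    "access_url": "https://data.example.org/dataset-1",
    "access_policy": "authenticated users",
    "format": "text/tab-separated-values",
    "vocabulary_terms": ["liver"],
    "licence": "",  # left empty on purpose
    "provenance": "derived from example study ST1",
}

print(fair_gaps(record))
```

A check of this kind only shows that the fields are present; whether the values are correct and useful still depends on curation.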
