Challenge
Big Pharma companies are facing major data paradigm shifts due to increased Life Science Data in clinical trials, massive data volumes, and data externalisation. With the advent of machine learning and AI, data has proven to be a major asset for decision making and exploring scientific hypotheses. However, it is also becoming apparent that leveraging data is still not an easy task, especially across department and unit boundaries.
The reasons for this are numerous: relationships between data are often missing or hard to discover; metadata is inconsistent between different sources, and is sometimes missing or incorrect; and data provenance is not always established.
These problems make it hard for scientists and AI/ML applications to find, re-use, aggregate, analyse and classify data from experimental studies. This in turn slows analyses, forces experiments to be re-run, causes important experimental data and potentially important cross-study factors to be missed, and lowers statistical confidence in results.
No complete solutions exist for the management of Life Science Data in life sciences R&D, so currently these problems can only be partially mitigated: through costly hiring of additional data managers, diverting time from specialists in other fields, duplicating data, and re-running experiments to regenerate lost data. Previous software solutions that aim to address these problems are often not interoperable, adding burdens to IT management, user training and adoption and, ultimately, increasing cost.
The Solution
Guided by our client's R&D needs, Genestack have successfully accelerated the development and application of technologies to address these problems by building a five-layer architecture spanning physical hardware, file system and workflow, Single Points of Truth (SPoTs), data integration, and user interface. SPoTs are information sources for studies, samples, genomics, transcriptomics, and other data types that are each stored only once, independently of each other, with linkage handled via references.
In this context Arvados, from Veritas Genetics, provides the underlying distributed and versioned virtual file system. Genestack then provides the key interoperable components for centralising data access across Arvados data locations and public repositories, harmonising metadata, and enabling integrative data mining through APIs and a graphical user interface. Together, we have built a FAIR-compliant and single-point-of-truth system that not only maximises data discovery but also provides the foundation and building blocks for advanced visual analytics and AI/ML applications, for pharma R&D and beyond.
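The Single Point of Truth idea described above, where each data type is stored once and linked to the others only by reference, can be illustrated with a minimal Python sketch. This is not Genestack's implementation: the class names (`Study`, `Sample`, `ExpressionValue`, `Catalogue`), the in-memory stores, and the example identifiers are all hypothetical and stand in for separate backing services.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Study:
    study_id: str
    title: str


@dataclass(frozen=True)
class Sample:
    sample_id: str
    study_id: str  # reference to a Study; study details are not copied here
    tissue: str


@dataclass(frozen=True)
class ExpressionValue:
    sample_id: str  # reference to a Sample; sample metadata is not copied here
    gene: str
    value: float


class Catalogue:
    """Hypothetical in-memory stand-in for three independent stores."""

    def __init__(self) -> None:
        self.studies: Dict[str, Study] = {}
        self.samples: Dict[str, Sample] = {}
        self.expression: List[ExpressionValue] = []

    def add_study(self, study: Study) -> None:
        self.studies[study.study_id] = study

    def add_sample(self, sample: Sample) -> None:
        # Reject dangling references so linkage stays traversable.
        if sample.study_id not in self.studies:
            raise KeyError(f"unknown study: {sample.study_id}")
        self.samples[sample.sample_id] = sample

    def add_expression(self, record: ExpressionValue) -> None:
        if record.sample_id not in self.samples:
            raise KeyError(f"unknown sample: {record.sample_id}")
        self.expression.append(record)

    def expression_for_study(self, study_id: str, gene: str) -> List[ExpressionValue]:
        # Traverse study -> samples -> expression values via references only.
        sample_ids = {
            s.sample_id for s in self.samples.values() if s.study_id == study_id
        }
        return [
            r for r in self.expression
            if r.sample_id in sample_ids and r.gene == gene
        ]


# Made-up example data, for illustration only.
catalogue = Catalogue()
catalogue.add_study(Study("ST1", "Example study"))
catalogue.add_sample(Sample("SA1", "ST1", "liver"))
catalogue.add_sample(Sample("SA2", "ST1", "brain"))
catalogue.add_expression(ExpressionValue("SA1", "GENE_A", 4.2))
catalogue.add_expression(ExpressionValue("SA2", "GENE_A", 1.3))
print(catalogue.expression_for_study("ST1", "GENE_A"))
```

Because a sample holds only a study identifier, correcting a study's title or a sample's tissue happens in exactly one place, and every linked record sees the change without being duplicated or rewritten.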
Technology features
Specific technology features of the solution are listed below:
Component | Technology / feature | Benefit
Genestack | Full-text/faceted metadata search powered by Solr | Easy to find data via biological and technical attributes
Genestack | Data-type specific indexers: ClickHouse for transcriptomics, Genestack proprietary indexer for genomics | Fast search/retrieval even if data needs to be held at remote locations
Genestack | Relationship modelling via MySQL | Easy to model and traverse through studies, samples, Life Science Data and analysis. Ensures data provenance and reproducibility.
Genestack | Integrative and distributed metadata and data querying, using a combination of MySQL and in situ indexers | Integrated and federated Life Science Data queries
Genestack | Public data repositories fully integrated and indexed, such as GEO, ArrayExpress | Centralised access to a wealth of data
Genestack | Ontologies/controlled vocabularies support and easy curation (ChEBI, Uberon, etc.) | Harmonised metadata, making it consistent, unambiguous, and valid, so that searching becomes easier and data can be re-used correctly
Genestack | Programmatic access via REST endpoints, Swagger documentation | Enables bioinformaticians to run standard pipelines and query data for custom analysis. Enables data managers to upload/update/link data programmatically. Enables system integration.
Genestack | Client libraries (Python/R), auto-generated using Mustache | Help bioinformaticians to more easily query data for custom analysis
Genestack | Third-party tools integration for standard workflows (like QC) | Allows workflows to be expanded with existing or new analyses
Genestack | Ability to integrate future modules and integrate new data types as new Single Points of Truth (SPoTs) | Allows future-proofing
Genestack | Single sign-on | Enables convenient yet secure access
Genestack | Modular, service-oriented architecture | Enables system integration and addition of new data and tools
Genestack | Genestack-Client interface | Seamless integration of client terminologies service for metadata curation
Genestack | Genestack-Arvados interface | Allows permissions and versioning to be synchronised from Arvados through to the Genestack platform
Arvados | Distributed and versioned virtual file system, together with user access control system | Provides scalability, non-duplication, performance, provenance and access
Docker | Containers | Allow deployment and modularity
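To make the metadata harmonisation row above more concrete, the following Python sketch maps free-text tissue annotations onto a small controlled vocabulary and reports values it cannot resolve. It is a sketch under stated assumptions, not the platform's curation logic: the vocabulary, the synonym list, the `EX:` term identifiers, and the function names are all made up for illustration and are not real Uberon or ChEBI entries.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical controlled vocabulary: canonical label -> placeholder term ID.
# The "EX:" identifiers are invented and do not correspond to a real ontology.
VOCABULARY: Dict[str, str] = {
    "liver": "EX:0001",
    "brain": "EX:0002",
}

# Hypothetical synonyms that curators have mapped to canonical labels.
SYNONYMS: Dict[str, str] = {
    "liver tissue": "liver",
    "whole brain": "brain",
}


def harmonise_term(raw: str) -> Optional[Tuple[str, str]]:
    """Return (canonical label, term ID), or None if the value is unknown."""
    # Normalise case and collapse runs of whitespace before lookup.
    key = " ".join(raw.strip().lower().split())
    label = SYNONYMS.get(key, key)
    term_id = VOCABULARY.get(label)
    return (label, term_id) if term_id is not None else None


def harmonise_records(
    records: List[dict], field: str = "tissue"
) -> Tuple[List[dict], List[dict]]:
    """Split records into harmonised copies and unresolved originals."""
    harmonised: List[dict] = []
    unresolved: List[dict] = []
    for record in records:
        match = harmonise_term(str(record.get(field, "")))
        if match is None:
            # Leave unknown values untouched so a curator can review them.
            unresolved.append(record)
        else:
            label, term_id = match
            harmonised.append({**record, field: label, f"{field}_id": term_id})
    return harmonised, unresolved
```

Returning unresolved records separately, rather than guessing a closest match, keeps incorrect terms out of the harmonised set and leaves the decision to a data manager.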
Outcome
From both a user's and an AI/ML perspective, the pRED Data Commons strategy unleashes the full potential of all collected data by making it easy to find, access, interoperate with and reuse, in line with the FAIR principles.
Some of the specific advantages of this are summarised in the table below.
User | New capability
Organisation | Streamline collaboration between teams, maximise and increase speed of data utilisation, resulting in man-hours cost savings
Scientists | Find and reuse data, plan new experiments
Bioinformaticians | Query data easily for custom analysis
Data managers | Easily harmonise metadata, source new data
IT | Work with a flexible architecture, add new Single Points of Truth (SPoT) in the future, have distributed storage, compute, and federated queries
AI/ML | Access to all relevant data, improved data quality, cross-study comparability
The improved FAIRness of data in the client's data architecture not only brings savings in person-hours and storage (and therefore costs) but also allows the discovery of new correlations and reduces the problem of false positives and statistical noise. Study aggregation enables cross-factor patterns to be more easily discovered, and improved metadata accuracy ensures that all the data in an analysis contributes positively to hypothesis testing and accurate classification.